
Kernel issues with 1.2.0 & 1.2.0-rolling+201903060337 causing lockup
Closed, Resolved, Public, BUG

Description

We have been getting random freezes for the past week or so, since upgrading to 1.2.0. Initially we thought it was a hardware issue on our primary firewall, but after we brought the secondary firewall online, it has been giving the same errors.

Once this happens, the firewall stops giving out DHCP leases, and /etc/hosts ends up with garbage written to it. Rebooting brings everything back online, but the DNS recursor won't come back until the garbage is removed from /etc/hosts.

Mar  5 10:06:01 firewall-2 kernel: [ 2176.183169] INFO: task kworker/u16:4:165 blocked for more than 120 seconds.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.266599]       Not tainted 4.19.12-amd64-vyos #1
Mar  5 10:06:01 firewall-2 kernel: [ 2176.325029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418735] Call Trace:
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418738]  ? 0xffffffff8a5cd1a0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418739]  ? 0xffffffff8a6001a4
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418740]  0xffffffff8a5cd5fd
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418742]  0xffffffffc0161b24
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418743]  ? 0xffffffff8a0a5130
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418744]  0xffffffffc0167409
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418745]  0xffffffffc0167bde
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418747]  0xffffffffc0178227
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418748]  0xffffffffc0178ebc
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418749]  0xffffffff8a2831cd
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418750]  0xffffffff8a2837a1
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418751]  0xffffffff8a1f1319
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418753]  ? 0xffffffff8a1f1962
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418754]  0xffffffff8a1f1962
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418755]  0xffffffff8a07e8ea
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418756]  0xffffffff8a07fd42
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418757]  ? 0xffffffff8a07fce0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418758]  0xffffffff8a084495
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418759]  ? 0xffffffff8a084390
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418760]  0xffffffff8a600215
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418763] INFO: task md127_raid1:198 blocked for more than 120 seconds.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.500057]       Not tainted 4.19.12-amd64-vyos #1
Mar  5 10:06:01 firewall-2 kernel: [ 2176.558399] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652105] Call Trace:
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652106]  ? 0xffffffff8a5cd1a0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652107]  ? 0xffffffff8a6001b0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652108]  0xffffffff8a5cd5fd
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652109]  0xffffffffc0161b24
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652110]  ? 0xffffffff8a0a5130
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652111]  0xffffffffc0167409
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652112]  ? 0xffffffffc016829b
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652113]  0xffffffffc0169f14
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652114]  0xffffffffc0164c4d
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652115]  0xffffffffc017cc4b
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652117]  ? 0xffffffff8a0cd0a8
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652117]  ? 0xffffffff8a0cd100
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652118]  ? 0xffffffff8a5d177f
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652119]  ? 0xffffffffc015a0d7
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652120]  0xffffffffc015a0d7
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652121]  ? 0xffffffff8a0a5130
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652122]  ? 0xffffffffc0159fb0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652122]  0xffffffff8a084495
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652123]  ? 0xffffffff8a084390
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652124]  0xffffffff8a600215
Mar  5 10:06:01 firewall-2 kernel: [ 2176.652126] INFO: task jbd2/md127-8:253 blocked for more than 120 seconds.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.734474]       Not tainted 4.19.12-amd64-vyos #1
Mar  5 10:06:01 firewall-2 kernel: [ 2176.792814] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886520] Call Trace:
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886521]  ? 0xffffffff8a5cd1a0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886522]  ? 0xffffffff8a5cdde0
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886523]  0xffffffff8a5cd5fd
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886524]  0xffffffff8a08ed8d
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886525]  0xffffffff8a5cdde8
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886526]  0xffffffff8a5cd999
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886527]  0xffffffff8a5cda56
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886528]  ? 0xffffffff8a0a5990
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886529]  0xffffffffc0186cd2
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886531]  ? 0xffffffffc0189df5
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886532]  0xffffffffc0189df5
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886533]  ? 0xffffffff8a0a5130
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886534]  ? 0xffffffffc0189d20
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886535]  0xffffffff8a084495
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886536]  ? 0xffffffff8a084390
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886536]  0xffffffff8a600215
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886543] INFO: task dhcpd:7207 blocked for more than 120 seconds.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.962648]       Not tainted 4.19.12-amd64-vyos #1
Mar  5 10:06:01 firewall-2 kernel: [ 2177.020988] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114695] Call Trace:
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114697]  ? 0xffffffff8a5cd1a0
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114698]  ? 0xffffffff8a0a5832
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114698]  0xffffffff8a5cd5fd
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114700]  0xffffffffc01899f9
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114701]  ? 0xffffffff8a0a5130
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114702]  0xffffffffc030cb4a
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114703]  0xffffffff8a1f5303
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114704]  0xffffffff8a1f55db
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114705]  0xffffffff8a003319
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114706]  0xffffffff8a600088
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114708] RIP: 0033:0x00007fea5a4a9a20
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114712] Code: Bad RIP value.
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114713] RSP: 002b:00007fff0ff3a888 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114715] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fea5a4a9a20
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114716] RDX: 0000000000000000 RSI: 000055f737327220 RDI: 0000000000000006
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114716] RBP: 000055f735ff6ae0 R08: 00007fea5b115700 R09: 000055f735d94160
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114717] R10: 706368642d6e6f2f R11: 0000000000000246 R12: 0000000000000001
Mar  5 10:06:01 firewall-2 kernel: [ 2177.114718] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
Mar  5 10:08:03 firewall-2 kernel: [ 2299.062410] INFO: task kworker/u16:4:165 blocked for more than 120 seconds.
Mar  5 10:08:03 firewall-2 kernel: [ 2299.146122]       Not tainted 4.19.12-amd64-vyos #1
Mar  5 10:08:03 firewall-2 kernel: [ 2299.204547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298356] Call Trace:
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298359]  ? 0xffffffff8a5cd1a0
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298359]  ? 0xffffffff8a6001a4
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298360]  0xffffffff8a5cd5fd
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298362]  0xffffffffc0161b24
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298364]  ? 0xffffffff8a0a5130
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298365]  0xffffffffc0167409
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298366]  0xffffffffc0167bde
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298367]  0xffffffffc0178227
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298369]  0xffffffffc0178ebc
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298370]  0xffffffff8a2831cd
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298371]  0xffffffff8a2837a1
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298372]  0xffffffff8a1f1319
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298374]  ? 0xffffffff8a1f1962
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298374]  0xffffffff8a1f1962
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298381]  0xffffffff8a07e8ea
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298382]  0xffffffff8a07fd42
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298389]  ? 0xffffffff8a07fce0
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298389]  0xffffffff8a084495
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298395]  ? 0xffffffff8a084390
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298400]  0xffffffff8a600215
Mar  5 10:08:03 firewall-2 kernel: [ 2299.298408] INFO: task md127_raid1:198 blocked for more than 120 seconds.
Mar  5 10:08:04 firewall-2 kernel: [ 2299.379676]       Not tainted 4.19.12-amd64-vyos #1
Mar  5 10:08:04 firewall-2 kernel: [ 2299.438018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531830] Call Trace:
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531833]  ? 0xffffffff8a5cd1a0
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531833]  ? 0xffffffff8a6001b0
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531834]  0xffffffff8a5cd5fd
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531836]  0xffffffffc0161b24
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531842]  ? 0xffffffff8a0a5130
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531853]  0xffffffffc0167409
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531861]  ? 0xffffffffc016829b
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531868]  0xffffffffc0169f14
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531871]  0xffffffffc0164c4d
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531873]  0xffffffffc017cc4b
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531874]  ? 0xffffffff8a0cd0a8
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531875]  ? 0xffffffff8a0cd100
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531876]  ? 0xffffffff8a5d177f
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531877]  ? 0xffffffffc015a0d7
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531878]  0xffffffffc015a0d7
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531879]  ? 0xffffffff8a0a5130
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531880]  ? 0xffffffffc0159fb0
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531880]  0xffffffff8a084495
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531881]  ? 0xffffffff8a084390
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531882]  0xffffffff8a600215
Mar  5 10:08:04 firewall-2 kernel: [ 2299.531885] INFO: task jbd2/md127-8:253 blocked for more than 120 seconds.

I managed to pull these messages from a running firewall after it had stopped working. Unfortunately, I've not been able to save them to disk, as any attempt to write to the disk completely freezes whichever process attempts it.
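For anyone triaging a similar log, a minimal sketch of how the hung-task reports above could be summarised per task. The function name `blocked_tasks` and the sample string are illustrative only; the sample lines are copied from the log in this report.

```python
import re
from collections import Counter

# Matches the kernel hung-task detector line, e.g.
# "INFO: task dhcpd:7207 blocked for more than 120 seconds."
# The task name may itself contain colons (kworker/u16:4), so the
# PID is taken as the last ":<digits>" before " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def blocked_tasks(log_text):
    """Count hung-task reports per (task name, pid) in a kernel log."""
    counts = Counter()
    for line in log_text.splitlines():
        m = HUNG_TASK.search(line)
        if m:
            counts[(m.group("name"), int(m.group("pid")))] += 1
    return counts

sample = """\
Mar  5 10:06:01 firewall-2 kernel: [ 2176.183169] INFO: task kworker/u16:4:165 blocked for more than 120 seconds.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.418763] INFO: task md127_raid1:198 blocked for more than 120 seconds.
Mar  5 10:06:01 firewall-2 kernel: [ 2176.886543] INFO: task dhcpd:7207 blocked for more than 120 seconds.
Mar  5 10:08:03 firewall-2 kernel: [ 2299.062410] INFO: task kworker/u16:4:165 blocked for more than 120 seconds.
"""

for (name, pid), n in sorted(blocked_tasks(sample).items()):
    print(f"{name} (pid {pid}): {n} report(s)")
```

Since writes to disk hang on an affected box, the log text would have to be captured some other way (for example, copied from the terminal) before being fed to a script like this.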

Details

Difficulty level
Unknown (require assessment)
Version
1.2.0
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Unspecified (please specify)

Event Timeline

It's a pair of Supermicro servers, each containing:

  • X10SDV-4C+-TP4F mainboard
  • Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz (4 core)
  • 2x SAMSUNG MZ7LM240 204Q SSDs in software RAID
  • 2x Micron 18ASF1G72PZ-2G3B1 8GB DIMM

It seems somehow related to Supermicro servers.
Can you also provide the storage controller model, please?

It's the Intel Broadwell one:

00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)

The same hardware worked perfectly up until 1.2.0-rc11.

Full lspci -vv output:

00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05) (prog-if 01 [AHCI 1.0])
	Subsystem: Super Micro Computer Inc Device 0921
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 27
	Region 0: I/O ports at f070 [size=8]
	Region 1: I/O ports at f060 [size=4]
	Region 2: I/O ports at f050 [size=8]
	Region 3: I/O ports at f040 [size=4]
	Region 4: I/O ports at f020 [size=32]
	Region 5: Memory at fb512000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee002d8  Data: 0000
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Kernel driver in use: ahci

This Debian bug shows the same issue:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913138

According to the following post, a fix for the bug has been included in 4.19.24 and 4.20.11:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=913138#91
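As a rough sketch, the version thresholds quoted above can be turned into a quick check against a kernel release string. The helper name `has_fix` is hypothetical, and the function deliberately returns `None` for any series other than 4.19.x and 4.20.x, since the linked post only speaks about those two.

```python
import re

# Minimum fixed point release per stable series, as quoted from the
# linked Debian bug comment. Other series are not covered by that statement.
FIXED_IN = {(4, 19): 24, (4, 20): 11}

def has_fix(release):
    """Return True/False for 4.19.x and 4.20.x, None for anything else."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if not m:
        return None
    major, minor, patch = map(int, m.groups())
    minimum = FIXED_IN.get((major, minor))
    if minimum is None:
        return None
    return patch >= minimum

# "4.19.12-amd64-vyos" is the kernel from the log in this report;
# the second string is an assumed format for a newer build.
for rel in ("4.19.12-amd64-vyos", "4.19.28-amd64-vyos"):
    print(rel, has_fix(rel))
```

On a live system the release string could be taken from `platform.release()` in the Python standard library.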

Any chance of a kernel update for the rolling releases? :)

I've put in the suggested kernel parameters for my install to disable the broken functionality. Hopefully this will keep it stable until a version with an upgraded kernel is available :)

dm_mod.use_blk_mq=0 scsi_mod.use_blk_mq=0
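
To confirm that the workaround actually made it onto the boot command line, something like the following sketch could be used. The function name `blk_mq_disabled` and the sample command line are made up for illustration; only the two parameter names come from this thread. It checks for the literal value `0` only.

```python
def blk_mq_disabled(cmdline):
    """Return the use_blk_mq parameters that are set to 0 on the command line."""
    wanted = {"dm_mod.use_blk_mq", "scsi_mod.use_blk_mq"}
    disabled = set()
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in wanted and value == "0":
            disabled.add(key)
    return disabled

# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/md127 dm_mod.use_blk_mq=0 scsi_mod.use_blk_mq=0"
print(sorted(blk_mq_disabled(sample)))
```

On a running system the input would be the contents of /proc/cmdline, for example `blk_mq_disabled(open("/proc/cmdline").read())`.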

Please test the latest rolling release.

c-po changed the task status from Open to Needs testing. Mar 12 2019, 6:27 PM
c-po claimed this task.
syncer triaged this task as Normal priority. Mar 16 2019, 6:15 PM

VyOS has been running on 4.19.28 for a week now. The update can be used for both crux and current.

No further reply received, assuming this is fixed.

c-po moved this task from Backlog to Finished on the VyOS 1.2 Crux (VyOS 1.2.1) board.
c-po moved this task from In Progress to Finished on the VyOS 1.3 Equuleus board.
dmbaturin set Is it a breaking change? to Unspecified (possibly destroys the router).
dmbaturin set Issue type to Unspecified (please specify).