
6.9.2 Getting freezes/unresponsive system during parity swap


Vcent
Solved by JorgeB


Right. I recently pretty much filled my initial Unraid array, and had a disk that was throwing some errors, although it has been stable for a fair while. (I'm not entirely convinced the disk is the problem; it may have been a temporary issue where I bumped the cable while the array was up, reseated it, and the disk racked up a ton of errors in the meantime.) But I digress - I figured I'd swap that disk for a bigger one, and since I got a deal on a 14TB drive, with my current 12TB parity, I had to upgrade parity first.

 

I had however learnt about the parity swap procedure before that, so I figured it would make sense to do that instead, and use the old disk with the errors as a scratch drive for something. Anyhow, I ran a pre-clear on the new parity drive, which came up fine, and so I followed the parity swap process, which went fine ... until it didn't.

 

The progress will at some point just stop and stay stuck wherever it got to, never advancing. Specifically, once this


Dec 6 15:43:50 Tower kernel: general protection fault, probably for non-canonical address 0xf7ff8883d1ed11a0: 0000 [#1] SMP NOPTI

Dec 6 15:43:50 Tower kernel: CPU: 3 PID: 5729 Comm: kworker/u8:4 Not tainted 5.10.28-Unraid #1

appears in the syslog, there's pretty much a 100% chance that parts of the server are locked up, the parity swap is stuck, the relevant disks spin down at their designated spin-down time, and there's zero chance of a clean shutdown or anything like that. Most functionality is still retained, at least insofar as a server with no mounted array can be said to have functionality - usually the webGUI works, although it can crash as well. Logs are accessible, but shutdown commands just get logged, without shutting anything down.
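As a side note on that trace: a "general protection fault ... for non-canonical address" means something handed the CPU a pointer whose top bits aren't a valid sign-extension - often the signature of a corrupted or bit-flipped pointer rather than an ordinary bad dereference. A quick sketch (plain bash arithmetic, just illustrating the canonical-form rule against the address from the log):

```shell
#!/bin/bash
# x86-64 with 48-bit virtual addresses: bits 63..47 must be all zeros
# or all ones for the address to be canonical. Check the faulting address:
addr=0xf7ff8883d1ed11a0
top=$(( (addr >> 47) & 0x1FFFF ))   # isolate bits 63..47 (17 bits)
if [ "$top" -eq 0 ] || [ "$top" -eq "$((0x1FFFF))" ]; then
    echo "canonical"
else
    printf 'non-canonical (bits 63..47 = 0x%x)\n' "$top"
fi
# prints: non-canonical (bits 63..47 = 0x1efff)
```

So the kernel wasn't chasing a merely unmapped address; it was handed a value no valid pointer can have, which fits the RAM corruption eventually found below.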

 

So far I've tried several times to get the process completed; the most successful attempt got to 100%, then ... got stuck, of course. The syslog did actually capture the successful completion of the old->new parity copy, but since whatever needed to run afterwards never ran, it was not a success overall, and the array didn't recognize the new parity drive as valid.

Things I've done to try to keep it from happening:

Run memtest. Several times, at varying lengths (including just running it all night); all passes came up clear.

Most attempts were made while running in safe mode, to ensure that the problem didn't come from a plugin. 

 

So far I'm at my wit's end, as I can't find any clear indication of what is trying to access memory it shouldn't, or why, or how to prevent it - the PID listed in the error messages is long dead by the time I see it.

 

Interestingly, the drives all show as "Device encrypted and unlocked" once the server fails in this manner, regardless of whether I've actually mounted/unlocked the array before initiating the parity swap.

tower-diagnostics-20211206-1716.zip

  • 3 weeks later...

Update, I guess. Not going to be helpful for anyone with a similar problem, I suspect.

 

Got parity upgraded by rebuilding it onto the new drive, then upgraded the data drive by rebuilding onto the old parity disk. So a parity swap, manual style.

 

Quite annoying, but worked first time I tried it.

 

Currently the server is busy crashing/freezing itself about every 1-2 days, killing some flavour of php (php-7, I think?) for using too much memory, due to a pathetically low limit set ..somewhere that I can't find. Apparently it can only use ~270-something MB of RAM, despite the system having 16GB available; that is still above the limit, so it gets reaped by the OOM killer. I'm also getting errors for a dm-3 device, which is curious, as I don't have a cache drive installed (never have), and I can only find mentions of that designation in threads about cache SSDs.
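For what it's worth, the ~270 MB figure usually comes straight out of the OOM-killer line in the syslog; the total-vm/anon-rss fields show what the process held when it was reaped. A small sketch pulling the number out of a sample line (the line below is an approximation of the kernel's format, not copied from my log):

```shell
# Sample OOM-killer line (format approximate); extract total-vm in MB
line='Out of memory: Killed process 12345 (php-fpm7) total-vm:276480kB, anon-rss:262144kB'
echo "$line" | awk -F'total-vm:' '{ split($2, a, "kB"); printf "%.0f MB\n", a[1]/1024 }'
# prints: 270 MB
# On the live system, the effective PHP limit itself can be checked with:
#   php -r 'echo ini_get("memory_limit"), PHP_EOL;'
```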

 

Dockers have been nerfed: all running with memory limits, none of them approaching those limits; running only a handful of dockers changes ..nothing, and at this point essentially all plugins have been uninstalled, to no avail. The system still kills itself randomly, with regularity, displaying the same symptoms each time it happens:

Fans are running at a decent clip, pumping hot air out of the case (despite Unraid being frozen and unresponsive to even a basic ping); interestingly, the network interface lights still light up, but apart from functioning as a space heater, the server is not functional.

  • 2 weeks later...

And more bumping, I guess. The system is by now killing itself multiple times a day, although I did manage to trace the php killing to a docker container, which has been removed. At this point even thinking about a parity check is a joke, as it just slows the server down until the next crash restarts the process.

 

There's nothing particularly consistent about it - sometimes I'm doing something resource-heavy and it works fine; other times the server dies. Often it's just left on its own, then dies; one time I even left it at the Unraid login prompt with nothing mounted or done (not even logged in), and it still managed to kill itself in the ~10 hours it spent standing at an idle login prompt, with no workload at all.

The logs are of little help, and I can't claim to be able to decipher the kernel panic that remains on screen whenever the server crashes. It's not even exactly the same every time, although it does consistently appear to be of the "Not syncing: Fatal exception in interrupt" type; beyond meaning a fatal error inside an interrupt handler, I haven't the foggiest.


Been having the syslog server (or rather mirror) up for a while now; the problem is that it rarely captures anything particularly interesting - a docker will drop a network connection, then make a new one, and so on, until eventually ... the server just stops responding to anything, usually starts producing heat, and spins up the fans. Oftentimes the last message is either about the dropped (local) IPv6 address or about the disks spinning down; then nothing else gets logged.

 


<---cut more of the same--->

Jan  9 11:51:14 Tower kernel: docker0: port 9(vethb2a9002) entered disabled state
Jan  9 11:51:24 Tower avahi-daemon[7024]: Interface vethb2a9002.IPv6 no longer relevant for mDNS.
Jan  9 11:51:24 Tower kernel: docker0: port 9(vethb2a9002) entered disabled state
Jan  9 11:51:24 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethb2a9002.IPv6 with address fe80::3003:f3ff:fe5d:320e.
Jan  9 11:51:24 Tower kernel: device vethb2a9002 left promiscuous mode
Jan  9 11:51:24 Tower kernel: docker0: port 9(vethb2a9002) entered disabled state
Jan  9 11:51:24 Tower avahi-daemon[7024]: Withdrawing address record for fe80::3003:f3ff:fe5d:320e on vethb2a9002.
Jan  9 11:51:36 Tower kernel: veth2b6032c: renamed from eth0
Jan  9 11:51:36 Tower kernel: docker0: port 6(veth74d3a5b) entered disabled state
Jan  9 11:51:39 Tower kernel: veth0a68062: renamed from eth0
Jan  9 11:51:39 Tower kernel: docker0: port 5(vethdbaf6a3) entered disabled state
Jan  9 11:51:44 Tower avahi-daemon[7024]: Interface vethdbaf6a3.IPv6 no longer relevant for mDNS.
Jan  9 11:51:44 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethdbaf6a3.IPv6 with address fe80::d851:52ff:feff:b08.
Jan  9 11:51:44 Tower kernel: docker0: port 5(vethdbaf6a3) entered disabled state
Jan  9 11:51:44 Tower kernel: device vethdbaf6a3 left promiscuous mode
Jan  9 11:51:44 Tower kernel: docker0: port 5(vethdbaf6a3) entered disabled state
Jan  9 11:51:44 Tower avahi-daemon[7024]: Withdrawing address record for fe80::d851:52ff:feff:b08 on vethdbaf6a3.
Jan  9 11:51:50 Tower avahi-daemon[7024]: Interface veth74d3a5b.IPv6 no longer relevant for mDNS.
Jan  9 11:51:50 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth74d3a5b.IPv6 with address fe80::6c06:e8ff:fee2:1e24.
Jan  9 11:51:50 Tower kernel: docker0: port 6(veth74d3a5b) entered disabled state
Jan  9 11:51:50 Tower kernel: device veth74d3a5b left promiscuous mode
Jan  9 11:51:50 Tower kernel: docker0: port 6(veth74d3a5b) entered disabled state
Jan  9 11:51:50 Tower avahi-daemon[7024]: Withdrawing address record for fe80::6c06:e8ff:fee2:1e24 on veth74d3a5b.
Jan  9 11:51:54 Tower kernel: veth032db53: renamed from eth0
Jan  9 11:51:54 Tower kernel: docker0: port 7(veth59e8068) entered disabled state
Jan  9 11:51:55 Tower kernel: veth38d78d4: renamed from eth0
Jan  9 11:51:55 Tower kernel: docker0: port 4(vethc70f7d4) entered disabled state
Jan  9 11:52:03 Tower kernel: veth115e8f9: renamed from eth0
Jan  9 11:52:03 Tower kernel: docker0: port 3(veth55bc601) entered disabled state
Jan  9 11:52:11 Tower avahi-daemon[7024]: Interface veth59e8068.IPv6 no longer relevant for mDNS.
Jan  9 11:52:11 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth59e8068.IPv6 with address fe80::809e:85ff:fe4c:823e.
Jan  9 11:52:11 Tower kernel: docker0: port 7(veth59e8068) entered disabled state
Jan  9 11:52:11 Tower kernel: device veth59e8068 left promiscuous mode
Jan  9 11:52:11 Tower kernel: docker0: port 7(veth59e8068) entered disabled state
Jan  9 11:52:11 Tower avahi-daemon[7024]: Withdrawing address record for fe80::809e:85ff:fe4c:823e on veth59e8068.
Jan  9 11:52:14 Tower avahi-daemon[7024]: Interface vethc70f7d4.IPv6 no longer relevant for mDNS.
Jan  9 11:52:14 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethc70f7d4.IPv6 with address fe80::f8a1:beff:fe37:281b.
Jan  9 11:52:14 Tower kernel: docker0: port 4(vethc70f7d4) entered disabled state
Jan  9 11:52:14 Tower kernel: device vethc70f7d4 left promiscuous mode
Jan  9 11:52:14 Tower kernel: docker0: port 4(vethc70f7d4) entered disabled state
Jan  9 11:52:14 Tower avahi-daemon[7024]: Withdrawing address record for fe80::f8a1:beff:fe37:281b on vethc70f7d4.
Jan  9 11:52:15 Tower avahi-daemon[7024]: Interface veth55bc601.IPv6 no longer relevant for mDNS.
Jan  9 11:52:15 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth55bc601.IPv6 with address fe80::3cf9:2bff:fe92:60d.
Jan  9 11:52:15 Tower kernel: docker0: port 3(veth55bc601) entered disabled state
Jan  9 11:52:15 Tower kernel: device veth55bc601 left promiscuous mode
Jan  9 11:52:15 Tower kernel: docker0: port 3(veth55bc601) entered disabled state
Jan  9 11:52:15 Tower avahi-daemon[7024]: Withdrawing address record for fe80::3cf9:2bff:fe92:60d on veth55bc601.
Jan  9 12:00:02 Tower dhcpcd[1565]: br0: failed to renew DHCP, rebinding
Jan  9 12:44:12 Tower kernel: BTRFS warning (device dm-3): csum failed root 5 ino 4668494 off 1588256768 csum 0xc582cc78 expected csum 0xf93b897d mirror 1
Jan  9 12:44:12 Tower kernel: BTRFS error (device dm-3): bdev /dev/mapper/md4 errs: wr 0, rd 0, flush 0, corrupt 157, gen 0
Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered blocking state
Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
Jan  9 13:16:37 Tower kernel: device vethba51b15 entered promiscuous mode
Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered blocking state
Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered forwarding state
Jan  9 13:16:37 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
Jan  9 13:17:02 Tower kernel: eth0: renamed from veth153f703
Jan  9 13:17:02 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethba51b15: link becomes ready
Jan  9 13:17:02 Tower kernel: docker0: port 3(vethba51b15) entered blocking state
Jan  9 13:17:02 Tower kernel: docker0: port 3(vethba51b15) entered forwarding state
Jan  9 13:17:04 Tower avahi-daemon[7024]: Joining mDNS multicast group on interface vethba51b15.IPv6 with address fe80::3410:79ff:fe93:be73.
Jan  9 13:17:04 Tower avahi-daemon[7024]: New relevant interface vethba51b15.IPv6 for mDNS.
Jan  9 13:17:04 Tower avahi-daemon[7024]: Registering new address record for fe80::3410:79ff:fe93:be73 on vethba51b15.*.
Jan  9 13:17:17 Tower kernel: veth389a0a7: renamed from eth0
Jan  9 13:17:17 Tower kernel: docker0: port 8(veth91f37a1) entered disabled state
Jan  9 13:17:18 Tower avahi-daemon[7024]: Interface veth91f37a1.IPv6 no longer relevant for mDNS.
Jan  9 13:17:18 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface veth91f37a1.IPv6 with address fe80::58c6:28ff:fea3:2840.
Jan  9 13:17:18 Tower kernel: docker0: port 8(veth91f37a1) entered disabled state
Jan  9 13:17:18 Tower kernel: device veth91f37a1 left promiscuous mode
Jan  9 13:17:18 Tower kernel: docker0: port 8(veth91f37a1) entered disabled state
Jan  9 13:17:18 Tower avahi-daemon[7024]: Withdrawing address record for fe80::58c6:28ff:fea3:2840 on veth91f37a1.
Jan  9 13:25:58 Tower kernel: veth153f703: renamed from eth0
Jan  9 13:25:58 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
Jan  9 13:25:59 Tower avahi-daemon[7024]: Interface vethba51b15.IPv6 no longer relevant for mDNS.
Jan  9 13:25:59 Tower avahi-daemon[7024]: Leaving mDNS multicast group on interface vethba51b15.IPv6 with address fe80::3410:79ff:fe93:be73.
Jan  9 13:25:59 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
Jan  9 13:25:59 Tower kernel: device vethba51b15 left promiscuous mode
Jan  9 13:25:59 Tower kernel: docker0: port 3(vethba51b15) entered disabled state
Jan  9 13:25:59 Tower avahi-daemon[7024]: Withdrawing address record for fe80::3410:79ff:fe93:be73 on vethba51b15.
Jan  9 13:30:14 Tower kernel: BUG: Bad page state in process kswapd0  pfn:36fecb
Jan  9 13:30:14 Tower kernel: page:0000000091ed811b refcount:0 mapcount:-64 mapping:0000000000000000 index:0x1 pfn:0x36fecb
Jan  9 13:30:14 Tower kernel: flags: 0x2ffff0000000000()
Jan  9 13:30:14 Tower kernel: raw: 02ffff0000000000 dead000000000100 dead000000000122 0000000000000000
Jan  9 13:30:14 Tower kernel: raw: 0000000000000001 0000000000000000 00000000ffffffbf 0000000000000000
Jan  9 13:30:14 Tower kernel: page dumped because: nonzero mapcount
Jan  9 13:30:14 Tower kernel: Modules linked in: tun veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter dm_crypt dm_mod dax md_mod amdgpu gpu_sched i2c_algo_bit drm_kms_helper ttm drm agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops it87 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding edac_mce_amd kvm_amd ccp kvm mpt3sas crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd sr_mod raid_class cdrom scsi_transport_sas r8169 cryptd i2c_piix4 realtek i2c_core ahci video glue_helper k10temp backlight acpi_cpufreq libahci button
Jan  9 13:30:14 Tower kernel: CPU: 1 PID: 652 Comm: kswapd0 Not tainted 5.10.28-Unraid #1
Jan  9 13:30:14 Tower kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88X-D3H, BIOS F7 12/25/2015
Jan  9 13:30:14 Tower kernel: Call Trace:
Jan  9 13:30:14 Tower kernel: dump_stack+0x6b/0x83
Jan  9 13:30:14 Tower kernel: bad_page+0xcb/0xe3
Jan  9 13:30:14 Tower kernel: check_free_page+0x70/0x76
Jan  9 13:30:14 Tower kernel: free_pcppages_bulk+0xd0/0x205
Jan  9 13:30:14 Tower kernel: free_unref_page_list+0xbe/0xf4
Jan  9 13:30:14 Tower kernel: shrink_page_list+0x8e3/0x924
Jan  9 13:30:14 Tower kernel: shrink_inactive_list+0x1d6/0x2e2
Jan  9 13:30:14 Tower kernel: shrink_lruvec+0x369/0x4e7
Jan  9 13:30:14 Tower kernel: ? __default_send_IPI_shortcut+0x1b/0x26
Jan  9 13:30:14 Tower kernel: ? setup_local_APIC+0x20f/0x248
Jan  9 13:30:14 Tower kernel: ? update_load_avg+0x2aa/0x2c4
Jan  9 13:30:14 Tower kernel: mem_cgroup_shrink_node+0xa1/0xc9
Jan  9 13:30:14 Tower kernel: mem_cgroup_soft_limit_reclaim+0x13c/0x237
Jan  9 13:30:14 Tower kernel: balance_pgdat+0x1fc/0x3dc
Jan  9 13:30:14 Tower kernel: kswapd+0x240/0x28c
Jan  9 13:30:14 Tower kernel: ? init_wait_entry+0x24/0x24
Jan  9 13:30:14 Tower kernel: ? balance_pgdat+0x3dc/0x3dc
Jan  9 13:30:14 Tower kernel: kthread+0xe5/0xea
Jan  9 13:30:14 Tower kernel: ? __kthread_bind_mask+0x57/0x57
Jan  9 13:30:14 Tower kernel: ret_from_fork+0x22/0x30
Jan  9 13:30:14 Tower kernel: Disabling lock debugging due to kernel taint

And that's it; nothing more was logged. For once it actually managed to log ..something, rather than stopping around 13:25:59 (withdrawing address, making a new one, yadda yadda, server unresponsive). dm-3 shows up in the log, which is as interesting as it is annoying, since I still don't have a cache device, although it might come from the system having a misconception somewhere about having a swap file (which it doesn't anymore; swap is shown as 0kB)?
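Side note on the dm-3 mystery: the BTRFS error line actually names the backing device right next to the dm-N alias, so the pairing can be read straight off the log. A small sketch (just string extraction over the quoted line; the sysfs path in the comment is standard Linux):

```shell
# The BTRFS error pairs the dm alias with its device-mapper name:
line='Jan  9 12:44:12 Tower kernel: BTRFS error (device dm-3): bdev /dev/mapper/md4 errs: wr 0, rd 0, flush 0, corrupt 157, gen 0'
echo "$line" | sed -n 's|.*bdev /dev/mapper/\([a-z0-9]*\).*|\1|p'
# prints: md4
# i.e. dm-3 is presumably the encrypted (LUKS) mapping of array disk 4,
# not a cache device. On the live system the same mapping is visible via:
#   cat /sys/block/dm-3/dm/name
```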

I actually thought for a while that I had figured out the problem, as a misbehaving docker got corrupted and started spewing files into its config directory; since I was running minimal plugins, nothing ever reported that it was doing so, or filling up. Fairly sure I've fixed said docker, or at least pointed it at a suitable target, but still - the server goes space heater less often now, yet it's still happening far too often to be usable.

 

I'm guessing that the almost constant "old network address died, making a new one" churn is due to a docker VPN, which keeps detecting a loopback and therefore kills off the packet:


2022-01-09 19:35:07,885 DEBG 'start-script' stdout output:
2022-01-09 19:35:07 us=885547 Recursive routing detected, drop tun packet to [AF_INET]<--VPN Provider IP-->

..this happens not infrequently, but interestingly enough everything works OK (or at least as expected), and I can't quite figure out how to stop it from happening - the issue is that a client ends up trying to send packets to itself, through the tunnel, which OpenVPN obviously doesn't like, so it drops the packets.

 

Currently I'm guessing the shutdowns are due to a thermal issue, but I can't say so conclusively - there's fairly decent cooling overall, although the area around the USB slots/northbridge does seem to get fairly hot for some reason.

I do however have some pictures of the on-screen/console output once the server goes down; sadly they're pretty much all the tail end of a trace, with a bunch of register addresses that mean ..nothing to me.

The most understandable part was:

> Kernel panic - not syncing: fatal exception in interrupt

> Kernel Offset: disabled

> ---[end kernel panic - not syncing: Fatal exception in interrupt ]---


It's currently on the last 25% of the final (fourth) pass of MemTest86 Free Version 9.3, which, much like the memtest86+ included with Unraid, has found .. diddly squat, except that there are no issues with the RAM.

I would be highly surprised if that changes during the last hour of the test, seeing as the included memtest was run several times previously and also found nothing in any of its passes. Unless Unraid is somehow significantly harder on the RAM, since it manages to kill the machine in far less time than any of the test runs.

 

It's done with its run, finding nothing. The advice it gives to run again in multi-CPU mode is nice, but not possible on my motherboard/CPU combo due to some UEFI limitation. I don't have any other AMD boards lying around to test with either. I'm letting it run once more through the night, but I'd be gobsmacked if that changed the result.
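One thing the reboot-into-memtest runs can't reproduce is the exact load pattern that kills the box - marginal sticks can pass pattern tests cold and only flip bits under real concurrent load. Userspace tools like memtester (which locks a chunk of RAM and hammers it with patterns while the normal system keeps running) get closer to that. The toy below only illustrates the write-then-verify idea with a file, not real RAM testing:

```shell
# Real thing (needs the memtester package, run on the live box):
#   memtester 2048M 10    # lock and test 2 GiB of RAM for 10 loops
# Toy illustration of the same write-then-verify principle:
tmp=$(mktemp)
head -c 1048576 /dev/urandom > "$tmp"   # 1 MiB of random pattern
cp "$tmp" "$tmp.copy"                   # write the pattern back out
cmp -s "$tmp" "$tmp.copy" && echo "pattern readback OK"
rm -f "$tmp" "$tmp.copy"
# prints: pattern readback OK
```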

  • 8 months later...

Update, in case someone else hits this lovely snag in the future: the problem was indeed a RAM stick, which worked perfectly fine until stressed juuust right, at which point it started outputting garbage and crashed the system. Unfortunately, none of the memtests ever identified the RAM as faulty, passing with no errors every time (unless I started faffing about in the options while a test was running, which would sometimes make a ton of errors appear - I'm guessing that's a bug in memtest, though, rather than indicative of this particular flaw).

 

I ended up finding it by removing one stick, running the system for a day while stressing it, and when it didn't crash, swapping the sticks to verify the problem - and indeed, the other stick promptly crashed the system once everything was loaded hard enough.

 

Curiously, all memtests, regardless of runtime, still insist that everything is peachy and that there's nothing wrong with either the working or the faulty stick. The system has been both stable and dependable since removing the problematic stick.

