hive_minded Posted September 30, 2021 (edited)

I have been running into some unfortunate issues lately. I built my server back in April or so, and up until around a month ago it ran flawlessly with no issues at all. Then I got my first 'freeze': the server becomes totally unreachable (no WebGUI, no Docker services, can't even ping it) but is still powered on. It requires a forced power-off and restart to 'fix'. Around two weeks later it happened again, then again a week after that, and it has now happened twice in the last five days, so it is definitely becoming more frequent. After the last freeze I set up syslog, so I will attach that. I'm really at a loss here (the most recent freeze also happened mid-preclear on two new HDDs, which is an added bummer). Any advice would be much appreciated.

syslog-192.168.0.000.log

Edited September 30, 2021 by hive_minded
JorgeB Posted September 30, 2021

See if this applies to you:
https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

If it does, upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).

There are also issues with the pool; see below for better pool monitoring:
https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
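To tell whether the linked macvlan bug applies, the syslog can be scanned for its signature call traces. A minimal sketch; the helper name is hypothetical, and the pattern list is an assumption based on the symptoms the bug report describes:

```shell
# Hypothetical helper: scan a syslog file for the macvlan/netfilter
# signatures from the linked bug report. The patterns below are an
# assumption, not an exhaustive list.
scan_for_macvlan_trace() {
    grep -iE 'macvlan|nf_nat_setup_info|kernel panic' "$1"
}

# Intended use on the server (path depends on your syslog setup):
#   scan_for_macvlan_trace /var/log/syslog
```

If this prints nothing, the macvlan panic is less likely to be the culprit, though a hard freeze can still lose the final log lines.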
hive_minded Posted September 30, 2021 (edited)

I don't know that the first thread 100% applies to me; searching through my syslog I don't find any 'kernel panic' errors. Looking through the last few hours, the lines that caught my (very untrained) eye are:

Sep 30 04:08:21 alexandria kernel: WARNING: CPU: 7 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
Sep 30 04:08:21 alexandria kernel: cpuidle_enter_state+0x101/0x1c4
Sep 30 04:08:21 alexandria kernel: cpuidle_enter+0x25/0x31
Sep 30 04:08:21 alexandria kernel: do_idle+0x1a6/0x214
Sep 30 04:08:21 alexandria kernel: cpu_startup_entry+0x18/0x1a
Sep 30 04:33:53 alexandria kernel: BTRFS error (device nvme0n1p1): bad tree block start, want 1823784960 have 0

I also see a lot of 'netfilter', which the OP in the first thread you linked mentioned. Two of my dockers, Netdata and Tailscale, are set to use 'host' networking (the rest use 'bridge'). The link you posted also mentioned that host access could be causing the issues, so I will switch both of those to bridge.

-------------------------------

Running the command from the second link to check the btrfs pool for errors spits out this:

[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 161864479
[/dev/nvme1n1p1].read_io_errs 129699470
[/dev/nvme1n1p1].flush_io_errs 2518939
[/dev/nvme1n1p1].corruption_errs 164095
[/dev/nvme1n1p1].generation_errs 0

So it definitely looks like there are a lot of errors on the second device; the link says all values should be 0. I'll run a scrub and an extended SMART test on my cache SSDs.

Edited September 30, 2021 by hive_minded
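Eyeballing that output works, but the "all values should be 0" check is easy to automate. A small sketch; the function name is hypothetical, and it simply assumes the two-column label/count format shown above:

```shell
# Hypothetical helper: flag any nonzero counter in `btrfs dev stats`
# output. Reads the stats text on stdin, prints the offending lines,
# and exits 1 if any are found (0 if the pool is clean).
check_stats() {
    awk '$2 != 0 { bad = 1; print } END { exit bad ? 1 : 0 }'
}

# Intended use on the server:
#   btrfs dev stats /mnt/cache | check_stats
```

The nonzero exit code makes it easy to wire into a notification script later.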
hive_minded Posted September 30, 2021 (edited)

Running a scrub on my cache drives spits out this:

Error summary: verify=54 csum=76277
Corrected: 0
Uncorrectable: 0
Unverified: 0

**edit** Running it again with the 'Repair corrupted blocks' box checked gives a slightly different result, and it looks like the errors were corrected:

Error summary: verify=53 csum=76256
Corrected: 76309
Uncorrectable: 0
Unverified: 0

I tried to run an extended SMART self-test on both drives, but when I click 'START' nothing appears to happen. Both of them do say 'PASSED' in the 'SMART overall-health' field at the bottom of the drive information, though.

**edit #2** After forcing a reset of the stats by running "btrfs dev stats -z /mnt/cache" and scrubbing again, this is my new output:

[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 152533
[/dev/nvme1n1p1].generation_errs 107

It appears that the scrub added 107 generation errors. The link you posted mentioned that this can often be related to cables, but I'm not actually running any cables: both SSDs sit in a QNAP 2x M.2 PCIe card. So it looks like either the drive itself may be going bad (which is odd, because it's a WD Blue, which I think are usually pretty reliable, and it's only 6 months old), or maybe the QNAP PCIe card has something wrong with it?

Edited September 30, 2021 by hive_minded
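For reference, the repair-and-reset sequence described above can also be done from the command line rather than the WebGUI. A sketch only, assuming the pool is mounted at /mnt/cache; these commands act on real hardware, so adjust the path before running:

```shell
# 1. Correcting scrub: verify checksums and repair bad blocks from
#    the good mirror copy. -B keeps it in the foreground and prints
#    the error summary when done.
btrfs scrub start -B /mnt/cache

# 2. Reset the per-device error counters so only *new* errors show
#    up from here on.
btrfs dev stats -z /mnt/cache

# 3. Re-check later; any nonzero value now indicates fresh errors.
btrfs dev stats /mnt/cache
```

Note that the counters are cumulative until reset, which is why a repaired pool can still show large numbers until step 2 is run.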
JorgeB Posted September 30, 2021

Those errors are normal after a correcting scrub; reset them again and it should be fine now. The errors were caused by one of the NVMe devices dropping offline at some point in the past.
hive_minded Posted September 30, 2021

OK, gotcha, thanks for the responses. I'll keep a close eye on it now to make sure new errors don't pop up.
hive_minded Posted October 2, 2021

36 hours later, running "btrfs dev stats /mnt/cache" still shows 0 errors. I am going to tentatively consider this one solved, though I will keep a close eye on the counters in the future, and am willing to try a single-drive XFS cache if btrfs continues to have issues. Right now everything appears stable. Thanks again for your help @JorgeB
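Keeping "a close eye" on the counters can be automated with a small cron job or User Scripts entry. A hedged sketch; the function name and log path are hypothetical, and it assumes the same stats format shown earlier in the thread:

```shell
# Hypothetical cron-friendly filter: keep only nonzero counters from
# `btrfs dev stats` output and prefix them for a log file.
log_btrfs_errors() {
    awk '$2 != 0 { print "btrfs error:", $0 }'
}

# Intended use on the server (log path is an assumption):
#   btrfs dev stats /mnt/cache | log_btrfs_errors >> /var/log/btrfs-stats.log
```

An empty log then means a clean pool, so any new line is worth investigating immediately rather than waiting for the next freeze.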
JonathanM Posted October 2, 2021

16 minutes ago, hive_minded said:
"am willing to try 1 drive xfs cache if btrfs continues to have issues."

That may not be the best idea. BTRFS does seem to be more fragile, and XFS doesn't react as poorly to hardware issues, but that doesn't mean the hardware issues are gone. Properly behaved hardware has no problems with either filesystem, so it's best to solve the hardware issue instead of masking it with a more forgiving filesystem.
hive_minded Posted October 4, 2021 (edited)

Fair enough. Unfortunately, the scrubbing did not work: I've had two freezes since trying that. I guess I'll try reseating the M.2 drives and the PCIe card, as well as running a memory test, and then try upgrading to 6.10.0-rc1. It's just odd that it worked flawlessly for the first 4-5 months and the issues started without any hardware change or disruption. All of the hardware was purchased new, so everything is pretty fresh.

Edited October 4, 2021 by hive_minded