Server randomly started freezing, after months of running flawlessly


So I have been running into some unfortunate issues lately. I built my server back in April or so - and up until around a month ago it ran flawlessly, no issues at all. 

 

Around a month ago I got my first 'freeze', where I'm unable to access the server at all - no WebGUI, no Docker services, can't even ping it. Totally unreachable, but the server is still powered on. It requires a forced shut off and power on to 'fix'. Around 2 weeks later it happened again. A week after that it happened again. And it's happened twice in the last 5 days now, so it's definitely becoming more frequent.

 

After the last time it did this I set up syslog, so I will attach that. 

 

I'm really at a loss here (and the most recent freeze happened mid-preclear on 2 new HDDs, so that is also a bummer).

 

Any advice would be much appreciated. 

syslog-192.168.0.000.log


See if this applies to you:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

 

If yes, upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).
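To double-check from the command line which containers are actually on a macvlan custom network, something like this should work (a rough sketch, assuming the standard docker CLI on the host):

# List any user-defined networks using the macvlan driver
docker network ls --filter driver=macvlan

# Show the network mode each running container was created with
docker ps --format '{{.Names}}' | while read name; do
  echo "$name: $(docker inspect -f '{{.HostConfig.NetworkMode}}' "$name")"
done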

 

There are also issues with the pool; see the link below for better pool monitoring:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
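In short, the check it describes boils down to reading the btrfs device error counters, roughly like this (assuming the pool is mounted at /mnt/cache):

# Per-device error counters for the cache pool - all values should stay at 0
btrfs device stats /mnt/cache

# Newer btrfs-progs can also return a non-zero exit code if any counter is non-zero
btrfs device stats --check /mnt/cache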

 


I don't know that the first thread 100% applies to me - searching through my syslog, I don't find any 'kernel panic' errors.

 

Looking through the last few hours, the few lines that caught my (very untrained) eye are:

 

Sep 30 04:08:21 alexandria kernel: WARNING: CPU: 7 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]

Sep 30 04:08:21 alexandria kernel: cpuidle_enter_state+0x101/0x1c4
Sep 30 04:08:21 alexandria kernel: cpuidle_enter+0x25/0x31
Sep 30 04:08:21 alexandria kernel: do_idle+0x1a6/0x214
Sep 30 04:08:21 alexandria kernel: cpu_startup_entry+0x18/0x1a

 

Sep 30 04:33:53 alexandria kernel: BTRFS error (device nvme0n1p1): bad tree block start, want 1823784960 have 0

 

I also see a lot of 'netfilter' entries, which the OP in the first thread you linked also mentioned.

 

Two of my dockers, Netdata and Tailscale, are set to use the 'host' network (the rest of my dockers are using 'bridge'). The link you posted also mentioned that host access could be causing the issues, so I will switch both of those to bridge.
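For reference, a quick way to confirm which network mode each container ends up on after the change (container names here are just assumed to match what's shown in the Docker tab):

# Prints 'host', 'bridge', or the custom network name the container was created with
docker inspect -f '{{.HostConfig.NetworkMode}}' Netdata
docker inspect -f '{{.HostConfig.NetworkMode}}' Tailscale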

 

-------------------------------

 

Running the command from the second link to check the btrfs pool for errors spits out this:

 

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    161864479
[/dev/nvme1n1p1].read_io_errs     129699470
[/dev/nvme1n1p1].flush_io_errs    2518939
[/dev/nvme1n1p1].corruption_errs  164095
[/dev/nvme1n1p1].generation_errs  0

 

So it definitely looks like there are a lot of errors. The link says that all values should be 0. I'll run a scrub and an extended SMART test on my cache SSDs. 
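For the record, the scrub can also be started and watched from the command line (a minimal sketch, assuming the pool is mounted at /mnt/cache):

# Start a scrub on the cache pool (runs in the background)
btrfs scrub start /mnt/cache

# Check progress and the running error summary
btrfs scrub status /mnt/cache

# Re-check the per-device error counters afterwards
btrfs device stats /mnt/cache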


Running a scrub on my cache drives spits out this:

 

Error summary:    verify=54 csum=76277
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

 

**edit**

 

Running it again with the 'Repair corrupted blocks' box checked gives a slightly different result, and it looks like the errors were corrected:

 

Error summary:    verify=53 csum=76256
  Corrected:      76309
  Uncorrectable:  0
  Unverified:     0

 

I tried to run an extended SMART self-test on both of them, but when I click 'START' nothing appears to happen. However, both of them say 'PASSED' in the 'SMART overall-health' field at the bottom of the drive information.
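Since the GUI button doesn't seem to do anything, the same thing can be attempted from a terminal with smartctl (a sketch, assuming the second cache device is /dev/nvme1; NVMe self-test support depends on the drive and the smartmontools version):

# Full SMART/health attributes and error log for the drive
smartctl -a /dev/nvme1

# Kick off an extended (long) self-test, if the drive supports it
smartctl -t long /dev/nvme1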

 

**edit #2**

 

After forcing a reset of the stats by running "btrfs dev stats -z /mnt/cache" and scrubbing again, this is my new output:

 

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  152533
[/dev/nvme1n1p1].generation_errs  107

 

It appears that the scrub added 107 generation errors. The link you posted mentioned that these errors can often be related to cables. I'm actually not running any cables though - both SSDs are in a QNAP 2x M.2 PCIe card.

 

So it looks like either the drive itself may be going bad (which is weird, because it's a WD Blue, which I think are usually pretty reliable, and it's only 6 months old) - or maybe the QNAP PCIe card has something wrong with it?


36 hours later and running "btrfs dev stats /mnt/cache" shows 0 errors.

 

I am going to tentatively consider this one solved, though I will be keeping a close eye on errors in the future, and I'm willing to try a single-drive XFS cache if btrfs continues to have issues.
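Something like this is the kind of periodic check I have in mind, e.g. run hourly via the User Scripts plugin (just a sketch, assuming the pool is mounted at /mnt/cache):

#!/bin/bash
# Flag any non-zero btrfs device error counters on the cache pool.
# 'btrfs device stats --check' exits non-zero when any counter is above 0.
if ! btrfs device stats --check /mnt/cache > /tmp/cache_dev_stats.txt; then
    echo "WARNING: non-zero btrfs device stats on /mnt/cache"
    cat /tmp/cache_dev_stats.txt
fi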

 

Right now, everything appears stable though. Thanks again for your help @JorgeB

16 minutes ago, hive_minded said:

I'm willing to try a single-drive XFS cache if btrfs continues to have issues.

That may not be the best idea. It seems that BTRFS is indeed more fragile and XFS doesn't react as badly to hardware issues, but that doesn't mean the hardware issues are gone. Properly behaved hardware doesn't have issues with either filesystem, so it's best to solve the hardware issues instead of masking them with a more forgiving filesystem.

 

 


Fair enough - and the scrubbing unfortunately did not work. I've had 2 freezes since trying that. 

 

I guess I'll reseat the M.2 drives and PCIe card, as well as run a memory test - and then try upgrading to 6.10.0-rc1?

 

It's just odd that it worked flawlessly for the first 4-5 months, and the issues started without any hardware change/disruption. All of the hardware was purchased new, so everything is pretty fresh.

