Server randomly started freezing, after months of running flawlessly


So I have been running into some unfortunate issues lately. I built my server back in April or so - and up until around a month ago it ran flawlessly, no issues at all. 

 

Around a month ago I got my first 'freeze', where I'm unable to access the server at all - no WebGUI, no Docker services, can't even ping it. Totally unreachable, but the server is still powered on. It requires a forced shut off and power on to 'fix'. Around 2 weeks later it happened again. A week after that it happened again. And it's happened twice in the last 5 days now, so it's definitely becoming more frequent.

 

After the last time it did this I set up syslog, so I will attach that. 

 

I'm really at a loss here (and the most recent freeze happened mid-preclear on 2 new HDDs, so that is also a bummer).

 

Any advice would be much appreciated. 

syslog-192.168.0.000.log


See if this applies to you:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

 

If yes, upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right).
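To double-check from the command line which containers are actually on a macvlan custom network, something like this should work (a rough sketch, assuming the standard docker CLI on the host):

# List any user-defined networks using the macvlan driver
docker network ls --filter driver=macvlan

# Show the network mode each running container was created with
docker ps --format '{{.Names}}' | while read name; do
  echo "$name: $(docker inspect -f '{{.HostConfig.NetworkMode}}' "$name")"
done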

 

There are also issues with the pool; see the link below for better pool monitoring:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
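In short, the check it describes boils down to reading the btrfs device error counters, roughly like this (assuming the pool is mounted at /mnt/cache):

# Per-device error counters for the cache pool - all values should stay at 0
btrfs device stats /mnt/cache

# Newer btrfs-progs can also return a non-zero exit code if any counter is non-zero
btrfs device stats --check /mnt/cache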

 


I don't know that the first thread 100% applies to me - searching through my syslog, I don't find any 'kernel panic' errors.

 

Looking through the last few hours, the few lines that caught my (very untrained) eye are:

 

Sep 30 04:08:21 alexandria kernel: WARNING: CPU: 7 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]

Sep 30 04:08:21 alexandria kernel: cpuidle_enter_state+0x101/0x1c4
Sep 30 04:08:21 alexandria kernel: cpuidle_enter+0x25/0x31
Sep 30 04:08:21 alexandria kernel: do_idle+0x1a6/0x214
Sep 30 04:08:21 alexandria kernel: cpu_startup_entry+0x18/0x1a

 

Sep 30 04:33:53 alexandria kernel: BTRFS error (device nvme0n1p1): bad tree block start, want 1823784960 have 0

 

I also see a lot of 'netfilter' entries, which the OP in the first thread you linked also mentioned.

 

Two of my dockers, Netdata and Tailscale, are set to use the 'host' network (the rest of my dockers are using 'bridge'). The link you posted also mentioned that host access could be causing the issues, so I will switch both of those to bridge.
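For reference, a quick way to confirm which network mode each container ends up on after the change (container names here are just assumed to match what's shown in the Docker tab):

# Prints 'host', 'bridge', or the custom network name the container was created with
docker inspect -f '{{.HostConfig.NetworkMode}}' Netdata
docker inspect -f '{{.HostConfig.NetworkMode}}' Tailscale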

 

-------------------------------

 

Running the command from the second link to check the btrfs pool for errors spits out this:

 

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    161864479
[/dev/nvme1n1p1].read_io_errs     129699470
[/dev/nvme1n1p1].flush_io_errs    2518939
[/dev/nvme1n1p1].corruption_errs  164095
[/dev/nvme1n1p1].generation_errs  0

 

So it definitely looks like there are a lot of errors. The link says that all values should be 0. I'll run a scrub and an extended SMART test on my cache SSDs. 
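For the record, the scrub can also be started and watched from the command line (a minimal sketch, assuming the pool is mounted at /mnt/cache):

# Start a scrub on the cache pool (runs in the background)
btrfs scrub start /mnt/cache

# Check progress and the running error summary
btrfs scrub status /mnt/cache

# Re-check the per-device error counters afterwards
btrfs device stats /mnt/cache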


Running a scrub on my cache drives spits out this:

 

Error summary:    verify=54 csum=76277
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

 

**edit**

 

Running it again with the 'Repair corrupted blocks' box checked gives a slightly different result, and it looks like the errors were corrected:

 

Error summary:    verify=53 csum=76256
  Corrected:      76309
  Uncorrectable:  0
  Unverified:     0

 

I tried to run an extended SMART self-test on both of them, but when I click 'START' nothing appears to happen. However, both of them say 'PASSED' in the 'SMART overall-health' field at the bottom of the drive information.
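Since the GUI button doesn't seem to do anything, the same thing can be attempted from a terminal with smartctl (a sketch, assuming the second cache device is /dev/nvme1; NVMe self-test support depends on the drive and the smartmontools version):

# Full SMART/health attributes and error log for the drive
smartctl -a /dev/nvme1

# Kick off an extended (long) self-test, if the drive supports it
smartctl -t long /dev/nvme1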

 

**edit #2**

 

After forcing a reset of the stats by running "btrfs dev stats -z /mnt/cache" and scrubbing again, this is my new output:

 

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  152533
[/dev/nvme1n1p1].generation_errs  107

 

It appears that the scrub added 107 generation errors. The link you posted mentioned that these errors can often be related to cables. I'm actually not running any cables though - both SSDs are in a QNAP 2x M.2 PCIe card.

 

So it looks like either the drive itself may be going bad (which is weird, because it's a WD Blue, which I think are usually pretty reliable, and it's only 6 months old) - or maybe the QNAP PCIe card has something wrong with it?


36 hours later and running "btrfs dev stats /mnt/cache" shows 0 errors.

 

I am going to tentatively consider this one solved, though I will be keeping a close eye on errors in the future, and I'm willing to try a single-drive XFS cache if btrfs continues to have issues.
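Something like this is the kind of periodic check I have in mind, e.g. run hourly via the User Scripts plugin (just a sketch, assuming the pool is mounted at /mnt/cache):

#!/bin/bash
# Flag any non-zero btrfs device error counters on the cache pool.
# 'btrfs device stats --check' exits non-zero when any counter is above 0.
if ! btrfs device stats --check /mnt/cache > /tmp/cache_dev_stats.txt; then
    echo "WARNING: non-zero btrfs device stats on /mnt/cache"
    cat /tmp/cache_dev_stats.txt
fi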

 

Right now, everything appears stable though. Thanks again for your help @JorgeB

16 minutes ago, hive_minded said:

I'm willing to try a single-drive XFS cache if btrfs continues to have issues.

That may not be the best idea. It seems that BTRFS is indeed more fragile and XFS doesn't react as badly to hardware issues, but that doesn't mean the hardware issues are gone. Properly behaved hardware doesn't have issues with either filesystem, so it's best to solve the hardware issues instead of masking them with a more forgiving filesystem.

 

 


Fair enough - and the scrubbing unfortunately did not work. I've had 2 freezes since trying that. 

 

I guess I'll reseat the M.2 drives and PCIe card, as well as run a memory test - and then try upgrading to 6.10.0-rc1?

 

It's just odd that it worked flawlessly for the first 4-5 months, and the issues started without any hardware change/disruption. All of the hardware was purchased new, so everything is pretty fresh.

