Networking/NAS goes down randomly

atlatonin · May 28

Hello, my newly built NAS had been running perfectly fine for about a month. However, for the last 3 days I've been getting random "crashes" about once per day. It's in quotation marks because it's not really a crash, but I usually have to hard shutdown and reboot the server. Basically it seems like the networking goes down, but only part of it. For example, the Unraid web GUI is inaccessible and the Docker containers can't communicate with each other anymore, but I can still access some of the Docker containers' web UIs individually.

Today for some reason the web GUI was working (but only under the xxx.local domain, not the actual IP, weirdly enough) while none of the Docker containers were accessible, so I finally managed to get diagnostics. They are attached. The crash happened roughly 2-3 hours before the reboot, around 20:00 if my notifications are working correctly. I'm running Unraid 6.12.10.

Also today after I rebooted the server the SSD cache drive was missing. After another shutdown + hard start it showed up again, still I'm kinda worried now.

mausflix-diagnostics-20240528-2244.zip

Edited May 28 by atlatonin

bmartino1 · May 28

Reviewing the diagnostics now...

weird go file:
# Give perms to Intel QuickSync
chmod -R 777 /dev/dri

^ Not sure that's correct...

per system log:

May 28 20:38:52 Mausflix kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773416, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773416 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773440, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773440 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 18392152, 24 blocks, I/O Error (sct 0x3 / sc 0x71)
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 18392152 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
May 28 20:39:24 Mausflix kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
May 28 20:39:24 Mausflix kernel: nvme nvme0: Removing after probe failure status: -19
May 28 20:39:24 Mausflix kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0

...
...
May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1619, flush 0, corrupt 0, gen 0
May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1620, flush 0, corrupt 0, gen 0
May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 22:44:54 Mausflix kernel: Buffer I/O error on dev loop2, logical block 0, async page read

^ Looks like the NVMe drive is failing or dropping offline. You will need to run a disk check on it...
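To see how quickly those counters climb between 20:39 and 22:44, the `errs:` values can be pulled out of the syslog with a few lines of Python. This is only a rough sketch: the function name `btrfs_error_counts` is made up for illustration, and the regex assumes the exact line format shown in the excerpt above.

```python
import re

# Matches the counter part of lines such as:
#   BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
# The pattern is an assumption based on the log excerpt in this post.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def btrfs_error_counts(lines):
    """Return a list of (device, counters) for every BTRFS counter line."""
    results = []
    for line in lines:
        match = ERRS_RE.search(line)
        if match is None:
            continue  # not a BTRFS counter line
        counters = {
            name: int(value)
            for name, value in match.groupdict().items()
            if name != "dev"
        }
        results.append((match.group("dev"), counters))
    return results


# Two lines copied from the excerpt above.
sample = [
    "May 28 20:39:24 Mausflix kernel: BTRFS error (device nvme0n1p1): "
    "bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0",
    "May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): "
    "bdev /dev/nvme0n1p1 errs: wr 54, rd 1620, flush 0, corrupt 0, gen 0",
]
for dev, counters in btrfs_error_counts(sample):
    print(dev, counters)
```

Feeding it the lines of a full syslog file instead of `sample` would list every counter update in order.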

atlatonin · May 29

@bmartino1 thank you for the quick answer. It's strange since the SSD is brand new (along with everything else).

The go file line is from when I was trying to get Intel QuickSync to work, I removed it.

I've run Check Filesystem Status - no errors:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p1
UUID: 73696a74-c953-4034-a51e-ec08d71305c7
found 100921585664 bytes used, no error found
total csum bytes: 72548668
total tree bytes: 129875968
total fs tree bytes: 25231360
total extent tree bytes: 9207808
btree space waste bytes: 28074520
file data blocks allocated: 1955216793600
 referenced 96725143552

I've run scrubbing - no errors:

    UUID:             73696a74-c953-4034-a51e-ec08d71305c7
    Scrub started:    Wed May 29 09:11:51 2024
    Status:           finished
    Duration:         0:00:30
    Total to scrub:   94.10GiB
    Rate:             3.14GiB/s
    Error summary:    no errors found

And I've run an extended SMART test, results are attached (again, no errors)

Anything else I should check or run?

CT1000P3PSSD8_234544DB96F6-20240529-0909.txt

Edited May 29 by atlatonin

JorgeB · May 29

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right) and add this to your default boot option, after "append initrd=/bzroot"

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.
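After the reboot, one way to confirm the options actually reached the kernel is to look at `/proc/cmdline`. The Python sketch below is an illustration only: the helper names (`parse_cmdline`, `power_saving_disabled`, `read_running_cmdline`) are made up, and the simple whitespace split assumes no quoted values containing spaces.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to an empty string. Quoted values that
    contain spaces are not handled (simplifying assumption).
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def power_saving_disabled(cmdline):
    """True if both options from this post are present with those values."""
    params = parse_cmdline(cmdline)
    return (
        params.get("nvme_core.default_ps_max_latency_us") == "0"
        and params.get("pcie_aspm") == "off"
    )


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text()


# The append line from this post, without the 'append' keyword itself,
# which is syslinux syntax rather than a kernel parameter.
sample = "initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
print(power_saving_disabled(sample))  # prints True
```

On the server itself, `power_saving_disabled(read_running_cmdline())` would check the live system instead of the sample string.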

bmartino1 · May 29

8 hours ago, JorgeB said:
Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right) and add this to your default boot option, after "append initrd=/bzroot"
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
e.g.:
append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
Reboot and see if it makes a difference.

My Samsung 980 NVME is affected by the firmware bug.

In my case, I run this latency:

Quote

kernel /bzimage
append initrd=/bzroot nvme_core.default_ps_max_latency_us=5500

I don't trust running firmware upgrades on HDD for other reasons...
https://github.com/lolyinseo/samsung-nvme-firmware

You may have a similar drive affected.

Edited May 29 by bmartino1

atlatonin · May 31

On 5/29/2024 at 10:24 AM, JorgeB said:
Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right) and add this to your default boot option, after "append initrd=/bzroot"
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
e.g.:
append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
Reboot and see if it makes a difference.

Ok so far 48h without the issue appearing, thank you! I'm gonna monitor for a little while longer before I mark it solved though.

If you don't mind me asking, what does adding this do, and what's the underlying problem here?

JorgeB · May 31

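Some NVMe devices have issues with lower power states on Linux; those two options disable them. nvme_core.default_ps_max_latency_us=0 turns off the drive's autonomous power state transitions, and pcie_aspm=off turns off PCIe link power management.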

atlatonin · June 15

Unfortunately I have to re-open this thread, since the problem occurred again today. I tried to run diagnostics with a keyboard and monitor connected to the NAS, but I just got "diagnostics: command not found". I then tried to at least copy the syslog manually with cp /var/log/syslog /boot/syslog.txt, which failed with "cannot create regular file /boot/syslog.txt: Input/output error" (errors like this were popping up a lot in general, even when running the shutdown command).
Any advice on how to get any kind of diagnostics in this situation?

I have the syslog server running, but unfortunately there is a big gap in the log from when I last used the server until I restarted it. There is just one single line at 3:40am (when the mover ran, I assume), which has shfs copy_file failing to copy a file from cache to disk because the filename is too long. Apparently this happens every time the mover runs though, so I don't think it's the culprit (I'm gonna fix it anyway of course).
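For pinning down when the log went quiet, a small Python sketch can scan a saved syslog for long silences between consecutive entries. The function names `parse_stamp` and `find_gaps`, the one-hour threshold, and the sample lines are all made up for illustration. The parser assumes the classic "Mon DD HH:MM:SS" prefix seen in the excerpts earlier in the thread, which carries no year, so the year has to be supplied.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Returns None if the line does not start with such a timestamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (previous, current) timestamp pairs further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_stamp(line, year)
        if current is None:
            continue  # skip lines without a parsable timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Hypothetical sample lines, not taken from the real log.
sample = [
    "Jun 14 23:10:02 Mausflix kernel: example line",
    "Jun 15 03:40:01 Mausflix shfs: example line",
    "Jun 15 14:05:30 Mausflix kernel: example line",
]
for start, end in find_gaps(sample, year=2024):
    print(f"no log entries between {start} and {end}")
```

The start of the last gap before the reboot is a reasonable first guess for when the system stopped being able to log.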

itimpi · June 15

38 minutes ago, atlatonin said:

"diagnostics: command not found".

Difficult to see how this can happen unless you have a RAM issue, so that the OS that was loaded into RAM at boot time has since become corrupt.

atlatonin · June 15

7 minutes ago, itimpi said:

Difficult to see how this can happen unless you have a RAM issue, so that the OS that was loaded into RAM at boot time has since become corrupt.

Hmm, I ran a standard memtest when building the NAS without problems; I can run another one overnight. Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old.

itimpi · June 15

13 minutes ago, atlatonin said:

Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old

Not really. I could see the diagnostics command failing to run to completion if there is an issue with the flash drive, but not the command not being found.

JorgeB · June 15

6 hours ago, itimpi said:

Difficult to see how this can happen

Same, never saw that before, are you sure no typos?

atlatonin · June 16

18 hours ago, JorgeB said:

Same, never saw that before, are you sure no typos?

I took a picture of the screen: https://i.imgur.com/nSJxZ92.jpeg (not very helpful I know, but just so you guys know it's not a typo haha). The crond messages in there aren't occurring because of any input, they would just pop up every couple of seconds.

JorgeB · June 16

I would suggest recreating the flash drive.

atlatonin · June 20

On 6/16/2024 at 1:23 PM, JorgeB said:

I would suggest recreating the flash drive.

Just did that with a newly bought USB stick. I'm gonna run a memtest when I get to it, otherwise I'll mark this as solved for now and update when the issue occurs again (hopefully this time with a working diagnostics command...)
