
Networking/NAS goes down randomly


Solved by JorgeB


Posted (edited)

Hello, my newly built NAS ran perfectly fine for about a month. However, for the past 3 days I've been getting random "crashes" about once per day. I put that in quotation marks because it's not really a crash, but I usually have to hard shut down and reboot the server. Basically it seems like the networking goes down, but only part of it. For example, the Unraid web GUI is inaccessible and the Docker containers can't communicate with each other anymore, but I can still access some of the containers' web UIs individually.

 

Today for some reason the web GUI was working (but, weirdly enough, only under the xxx.local domain, not the actual IP), while none of the Docker containers were accessible, so I finally managed to get diagnostics, which are attached. The crash happened roughly 2-3 hours before the reboot, around 20:00 if my notifications are working correctly. I'm running Unraid 6.12.10.

 

Also, after I rebooted the server today the SSD cache drive was missing. After another shutdown and hard start it showed up again, but I'm still kind of worried now.

mausflix-diagnostics-20240528-2244.zip

Edited by atlatonin
Link to comment

Reviewing the diagnostics now...

Weird go file:
# Give perms to Intel QuickSync
chmod -R 777 /dev/dri

^ Not sure that's correct...

Per the system log:

May 28 20:38:52 Mausflix kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773416, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773416 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773440, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773440 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 18392152, 24 blocks, I/O Error (sct 0x3 / sc 0x71) 
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 18392152 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
May 28 20:39:24 Mausflix kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
May 28 20:39:24 Mausflix kernel: nvme nvme0: Removing after probe failure status: -19
May 28 20:39:24 Mausflix kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0

...
...
May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1619, flush 0, corrupt 0, gen 0
May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1620, flush 0, corrupt 0, gen 0
May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 22:44:54 Mausflix kernel: Buffer I/O error on dev loop2, logical block 0, async page read

^ Looks like the NVMe drive is failing or dropping out. You will need to run a disk check...
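For anyone who wants to pull the same picture out of a long syslog, here is a minimal Python sketch. It assumes the messages look like the ones quoted above (other kernel versions may word them differently), and the `summarize` function name is just a placeholder. It collects the NVMe controller events and keeps the latest BTRFS error counters per device.

```python
import re

# Patterns are based only on the message shapes quoted in this thread;
# other kernel versions may word these messages differently.
NVME_EVENT = re.compile(
    r"kernel: nvme (?P<ctrl>nvme\d+): "
    r"(?P<msg>controller is down|Device not ready|Removing after probe failure)"
)
BTRFS_STATS = re.compile(
    r"BTRFS error \(device (?P<dev>[^):]+)[^)]*\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def summarize(lines):
    """Return (nvme_events, btrfs_counters) for an iterable of syslog lines.

    nvme_events is a list of (timestamp, controller, message) tuples.
    btrfs_counters maps a device name to its most recently logged counters.
    """
    events = []
    counters = {}
    for line in lines:
        match = NVME_EVENT.search(line)
        if match:
            # The first 15 characters of a classic syslog line are the timestamp.
            events.append((line[:15], match["ctrl"], match["msg"]))
            continue
        match = BTRFS_STATS.search(line)
        if match:
            counters[match["dev"]] = {name: int(match[name]) for name in COUNTER_NAMES}
    return events, counters
```

Calling `summarize(open("syslog.txt"))` on a saved copy of the log (the file name is only an example) would show when the controller dropped and how far the read/write error counters climbed afterwards.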

Link to comment
Posted (edited)

@bmartino1 thank you for the quick answer. It's strange since the SSD is brand new (along with everything else).

The go file line is from when I was trying to get Intel QuickSync to work; I've removed it.

 

I've run Check Filesystem Status - no errors:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p1
UUID: 73696a74-c953-4034-a51e-ec08d71305c7
found 100921585664 bytes used, no error found
total csum bytes: 72548668
total tree bytes: 129875968
total fs tree bytes: 25231360
total extent tree bytes: 9207808
btree space waste bytes: 28074520
file data blocks allocated: 1955216793600
 referenced 96725143552


I've run a scrub - no errors:

    UUID:             73696a74-c953-4034-a51e-ec08d71305c7
    Scrub started:    Wed May 29 09:11:51 2024
    Status:           finished
    Duration:         0:00:30
    Total to scrub:   94.10GiB
    Rate:             3.14GiB/s
    Error summary:    no errors found


And I've run an extended SMART test; the results are attached (again, no errors).

Anything else I should check or run?

CT1000P3PSSD8_234544DB96F6-20240529-0909.txt

Edited by atlatonin
Link to comment

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it makes a difference.
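After the reboot it can be worth confirming that the options actually reached the kernel. Below is a minimal Python sketch that assumes a Linux system where the running kernel's boot options can be read from /proc/cmdline. The helper names (`parse_cmdline`, `missing_params`, `check_running_kernel`) are made up for illustration and are not part of Unraid.

```python
from pathlib import Path

# The two options suggested above, as they should appear on the kernel command line.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict (bare flags map to '')."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted options that are absent or set to a different value."""
    params = parse_cmdline(cmdline)
    return {key: value for key, value in wanted.items() if params.get(key) != value}


def check_running_kernel():
    """Check the currently booted kernel (Linux only, reads /proc/cmdline)."""
    return missing_params(Path("/proc/cmdline").read_text())
```

An empty dict from `check_running_kernel()` means both options are active; anything else lists what is still missing, which usually means the edit went into a different boot entry than the default one.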

Link to comment
Posted (edited)
8 hours ago, JorgeB said:

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it makes a difference.

My Samsung 980 NVMe is affected by the firmware bug.

In my case, I run this latency:

 

Quote

kernel /bzimage
append initrd=/bzroot nvme_core.default_ps_max_latency_us=5500


I don't trust running firmware upgrades on HDD for other reasons...
https://github.com/lolyinseo/samsung-nvme-firmware
 


Your drive may be similarly affected.

Edited by bmartino1
Link to comment
On 5/29/2024 at 10:24 AM, JorgeB said:

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it makes a difference.


OK, so far 48h without the issue appearing, thank you! I'm going to monitor for a little while longer before I mark it solved though.

If you don't mind me asking, what does adding this do, and what's the underlying problem here?

Link to comment
  • 3 weeks later...

Unfortunately I have to re-open this thread, since the problem occurred again today. I tried to run diagnostics with a keyboard and monitor connected to the NAS, but I just got "diagnostics: command not found". I then tried to at least copy the syslog manually with cp /var/log/syslog /boot/syslog.txt, which failed with "cannot create regular file /boot/syslog.txt: Input/output error" (errors like this were popping up a lot in general, even when running the shutdown command).
Any advice on how to get any kind of diagnostics in this situation?

 

I have the syslog server running, but unfortunately there is a big gap in the log between when I last used the server and when I restarted it. There is just one single line at 3:40am (when the mover ran, I assume), in which shfs copy_file fails copying a file from cache to disk because the filename is too long. This apparently happens every time the mover runs though, so I don't think it's the culprit (I'm going to fix it anyway, of course).
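To pin down exactly where such a gap starts and ends in a saved syslog, here is a minimal Python sketch. It assumes classic syslog lines that begin with a "May 28 20:38:52" style timestamp (English month names, no year, so the year has to be passed in). `find_gaps` is a hypothetical helper name and the one-hour threshold is arbitrary.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (last_seen, next_seen) datetime pairs where the log went silent.

    Assumes each line starts with a 15-character 'Mon DD HH:MM:SS' timestamp,
    which carries no year, so the caller supplies one. Lines that do not start
    with a parseable timestamp are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            current = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

The last timestamp before a gap is roughly when logging stopped, which helps narrow down which earlier messages are worth reading. A log that spans New Year would need extra handling, since the year is fixed per call.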

Link to comment
7 minutes ago, itimpi said:

Difficult to see how this can happen unless you have a RAM issue, so that the OS that was loaded into RAM at boot time has since become corrupt.

Hmm, I ran a standard memtest when building the NAS without problems; I can run another one overnight later. Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old.

Link to comment
13 minutes ago, atlatonin said:

Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old

Not really.   I could see the diagnostics command failing to run to completion if there is an issue with the flash drive, but not the command not being found.

Link to comment
On 6/16/2024 at 1:23 PM, JorgeB said:

I would suggest recreating the flash drive.

 

Just did that with a newly bought USB stick. I'm going to run a memtest when I get to it; otherwise I'll mark this as solved for now and update when the issue occurs again (hopefully this time with a working diagnostics command...)

Link to comment
