May 28, 2024

Hello, my newly built NAS has been running perfectly fine for about a month. However, for the past 3 days I've been getting random "crashes" about once per day. It's in quotation marks because it's not really a crash, but I usually have to do a hard shutdown and reboot the server.

Basically it seems like the networking goes down, but only part of it. For example, the Unraid web GUI is inaccessible and the Docker containers can't communicate with each other anymore, but I can still access some of the containers' web UIs individually.

Today for some reason the web GUI was working (but only under the xxx.local domain, not the actual IP, weirdly enough) while none of the Docker containers were accessible, so I finally managed to get diagnostics. They are attached (the crash is roughly 2-3 hours before the reboot, around 20:00 if my notifications are working correctly). I'm running Unraid 6.12.10.

Also, today after I rebooted the server the SSD cache drive was missing. After another shutdown and hard start it showed up again, but I'm still kind of worried now.

mausflix-diagnostics-20240528-2244.zip

Edited May 28, 2024 by atlatonin
May 28, 2024 (Community Expert)

Reviewing the diagnostics now...

Weird go file:

# Give perms to Intel QuickSync
chmod -R 777 /dev/dri

^ Not sure that's correct...

Per the system log:

May 28 20:38:52 Mausflix kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773416, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773416 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773440, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773440 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 18392152, 24 blocks, I/O Error (sct 0x3 / sc 0x71)
May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 18392152 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
May 28 20:39:24 Mausflix kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
May 28 20:39:24 Mausflix kernel: nvme nvme0: Removing after probe failure status: -19
May 28 20:39:24 Mausflix kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
...
...
May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1619, flush 0, corrupt 0, gen 0
May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1620, flush 0, corrupt 0, gen 0
May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 22:44:54 Mausflix kernel: Buffer I/O error on dev loop2, logical block 0, async page read

^ Looks like the NVMe drive is failing or dropping offline. You will need to run a disk check...
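For anyone triaging a similar syslog, the BTRFS lines above carry running per-device error counters (`wr`, `rd`, `flush`, `corrupt`, `gen`), and comparing the first and last values shows how quickly they climb. The following is only a minimal sketch, assuming the kernel message format matches the lines quoted in this thread; the function name `latest_btrfs_counters` is made up for illustration and is not part of any Unraid or btrfs tool.

```python
import re

# Matches the per-device counters btrfs prints in kernel log lines, e.g.
# "bdev /dev/nvme0n1p1 errs: wr 54, rd 1619, flush 0, corrupt 0, gen 0".
# Assumption: the format is the one quoted in this thread.
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_btrfs_counters(lines):
    """Return {device: counters}, keeping the last matching line per device."""
    latest = {}
    for line in lines:
        match = BTRFS_ERRS.search(line)
        if match:
            fields = match.groupdict()
            device = fields.pop("dev")
            latest[device] = {name: int(value) for name, value in fields.items()}
    return latest
```

It could be fed an open syslog file, for example `latest_btrfs_counters(open("/var/log/syslog", errors="replace"))`. Non-zero `wr`/`rd` with `corrupt 0` is consistent with the device dropping off the bus rather than returning bad data, but that is an interpretation, not something the counters prove on their own.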
May 29, 2024 (Author)

@bmartino1 thank you for the quick answer. It's strange, since the SSD is brand new (along with everything else). The go file line is from when I was trying to get Intel QuickSync to work; I've removed it.

I've run Check Filesystem Status, no errors:

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p1
UUID: 73696a74-c953-4034-a51e-ec08d71305c7
found 100921585664 bytes used, no error found
total csum bytes: 72548668
total tree bytes: 129875968
total fs tree bytes: 25231360
total extent tree bytes: 9207808
btree space waste bytes: 28074520
file data blocks allocated: 1955216793600
 referenced 96725143552

I've run a scrub, no errors:

UUID: 73696a74-c953-4034-a51e-ec08d71305c7
Scrub started: Wed May 29 09:11:51 2024
Status: finished
Duration: 0:00:30
Total to scrub: 94.10GiB
Rate: 3.14GiB/s
Error summary: no errors found

And I've run an extended SMART test; the results are attached (again, no errors).

Anything else I should check or run?

CT1000P3PSSD8_234544DB96F6-20240529-0909.txt

Edited May 29, 2024 by atlatonin
May 29, 2024 (Community Expert)

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.
May 29, 2024 (Community Expert)

8 hours ago, JorgeB said:

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.

My Samsung 980 NVMe is affected by the firmware bug. In my case, I run this latency:

kernel /bzimage
append initrd=/bzroot nvme_core.default_ps_max_latency_us=5500

I don't trust running firmware upgrades on HDD, for other reasons...

https://github.com/lolyinseo/samsung-nvme-firmware

You may have a similar drive that is affected.

Edited May 29, 2024 by bmartino1
May 31, 2024 (Author)

On 5/29/2024 at 10:24 AM, JorgeB said:

Try this: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.

OK, so far 48 hours without the issue appearing, thank you! I'm going to monitor for a little while longer before I mark it solved, though. If you don't mind me asking, what does adding this do, and what's the underlying problem here?
May 31, 2024 (Community Expert)

Some NVMe devices have issues with the lower power states on Linux; those two options disable them.
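After the reboot it can be worth confirming that the options actually reached the kernel, since a typo in the Syslinux configuration fails silently. A minimal sketch, assuming a Linux system where the booted command line is readable at `/proc/cmdline`; the helper names `missing_boot_params` and `read_cmdline` are hypothetical and only for illustration.

```python
from pathlib import Path

# The two options suggested earlier in this thread.
WANTED = ("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off")


def missing_boot_params(cmdline, wanted=WANTED):
    """Return the wanted options that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [option for option in wanted if option not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text()
```

On the server itself, `missing_boot_params(read_cmdline())` returning an empty list would suggest both options are active. If a different latency value is used, as in the `5500` example above, the first entry of `WANTED` would need to be adjusted to match.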
June 15, 2024 (Author)

Unfortunately I have to re-open this thread, since the problem occurred again today. I tried to run diagnostics with a keyboard and monitor connected to the NAS, but I just got "diagnostics: command not found". I then tried to at least copy the syslog manually with

cp /var/log/syslog /boot/syslog.txt

which failed with "cannot create regular file /boot/syslog.txt: Input/output error" (errors like this were popping up a lot in general, even when running the shutdown command).

Any advice on how to get any kind of diagnostics in this situation? I have the syslog server running, but unfortunately there is a big gap in the log from when I last used the server until I restarted it. There is just one single line at 3:40am (when the mover ran, I assume), in which shfs copy_file fails to copy a file from cache to disk because the filename is too long. This apparently happens every time the mover runs, though, so I don't think it's the culprit (I'm going to fix it anyway, of course).
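When the only evidence left is a remote syslog with a hole in it, pinning down exactly when logging stopped and resumed at least narrows the crash window. A rough sketch, assuming classic syslog lines that begin with a 15-character stamp such as `May 28 20:38:52`, English month names, and a single known year (these stamps carry no year, so it has to be supplied); `find_gaps` is a made-up name, not an existing tool.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, min_gap=timedelta(hours=1)):
    """Yield (previous, next) timestamps of consecutive entries at least min_gap apart."""
    previous = None
    for line in lines:
        try:
            # Prepend the assumed year so the stamp parses to a full date.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # Skip lines that do not start with a syslog timestamp.
        if previous is not None and stamp - previous >= min_gap:
            yield previous, stamp
        previous = stamp
```

Calling `list(find_gaps(open("syslog.txt", errors="replace"), 2024))` on a copy of the remote log would list each silent stretch of an hour or more. It does not handle logs that span a year boundary, and a quiet server can produce harmless gaps, so the result is a starting point rather than proof of a hang.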
June 15, 2024 (Community Expert)

38 minutes ago, atlatonin said:

"diagnostics: command not found".

Difficult to see how this can happen unless you have a RAM issue, such that the OS that was loaded into RAM at boot time has since become corrupted.
June 15, 2024 (Author)

7 minutes ago, itimpi said:

Difficult to see how this can happen unless you have a RAM issue, such that the OS that was loaded into RAM at boot time has since become corrupted.

Hmm, I ran a standard memtest when building the NAS without problems; I can run another one overnight later. Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old.
June 15, 2024 (Community Expert)

13 minutes ago, atlatonin said:

Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old

Not really. I could see the diagnostics command failing to run to completion if there is an issue with the flash drive, but not the command not being found.
June 15, 2024 (Community Expert)

6 hours ago, itimpi said:

Difficult to see how this can happen

Same, I've never seen that before. Are you sure there were no typos?
June 16, 2024 (Author)

18 hours ago, JorgeB said:

Same, I've never seen that before. Are you sure there were no typos?

I took a picture of the screen: https://i.imgur.com/nSJxZ92.jpeg (not very helpful, I know, but just so you guys know it's not a typo, haha). The crond messages in there aren't caused by any input; they would just pop up every couple of seconds.
June 20, 2024 (Author)

On 6/16/2024 at 1:23 PM, JorgeB said:

I would suggest recreating the flash drive.

Just did that with a newly bought USB stick. I'm going to run a memtest when I get to it; otherwise I'll mark this as solved for now and update if the issue occurs again (hopefully this time with a working diagnostics command...)