atlatonin, posted May 28 (edited)

Hello, my newly built NAS ran perfectly fine for about a month. However, for the past three days I've been getting random "crashes" about once per day. I put that in quotation marks because it's not really a crash, but I usually have to hard shut down and reboot the server. Basically it seems like the networking goes down, but only part of it. For example, the Unraid web GUI is inaccessible and the Docker containers can't communicate with each other anymore, but I can still access some of the containers' web UIs individually.

Today, for some reason, the web GUI was working (but only under the xxx.local domain, not the actual IP, weirdly enough) while none of the Docker containers were accessible, so I finally managed to get diagnostics. They are attached (the crash is roughly 2-3 hours before the reboot, around 20:00 if my notifications are working correctly). I'm running Unraid 6.12.10.

Also, after I rebooted the server today, the SSD cache drive was missing. After another shutdown and hard start it showed up again, but I'm still kinda worried now.

Attachment: mausflix-diagnostics-20240528-2244.zip
bmartino1, posted May 28

Reviewing the diagnostics now...

Weird go file:

    # Give perms to Intel QuickSync
    chmod -R 777 /dev/dri

^ Not sure that's correct...

Per the system log:

    May 28 20:38:52 Mausflix kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773416, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
    May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773416 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 48773440, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
    May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 48773440 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    May 28 20:38:52 Mausflix kernel: nvme0n1: I/O Cmd(0x2) @ LBA 18392152, 24 blocks, I/O Error (sct 0x3 / sc 0x71)
    May 28 20:38:52 Mausflix kernel: I/O error, dev nvme0n1, sector 18392152 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 2
    May 28 20:39:24 Mausflix kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
    May 28 20:39:24 Mausflix kernel: nvme nvme0: Removing after probe failure status: -19
    May 28 20:39:24 Mausflix kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
    ...
    ...
    May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1619, flush 0, corrupt 0, gen 0
    May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1620, flush 0, corrupt 0, gen 0
    May 28 22:44:54 Mausflix kernel: I/O error, dev loop2, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
    May 28 22:44:54 Mausflix kernel: Buffer I/O error on dev loop2, logical block 0, async page read

^ Looks like the NVMe drive is failing or dropping out. You will need to run a disk check...
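Editor's note: when a syslog is long, it can help to pull out just the two signals discussed in the post above, namely the "controller is down" messages and the cumulative BTRFS error counters. The following is a minimal Python sketch, assuming the message formats look exactly like the excerpt above; the function name `summarize` and the regular expressions are illustrative and not part of any Unraid tooling.

```python
import re

# Cumulative per-device counters as printed in the excerpt above, e.g.
# "bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0"
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
# "nvme nvme0: controller is down; will reset: ..."
NVME_DOWN = re.compile(r"nvme (?P<ctrl>nvme\d+): controller is down")


def summarize(lines):
    """Return (controllers reported down, last seen BTRFS counters per device)."""
    down = []
    counters = {}
    for line in lines:
        m = NVME_DOWN.search(line)
        if m:
            down.append(m.group("ctrl"))
        m = BTRFS_ERRS.search(line)
        if m:
            fields = m.groupdict()
            dev = fields.pop("dev")
            # Later lines overwrite earlier ones, so the final value is kept.
            counters[dev] = {name: int(value) for name, value in fields.items()}
    return down, counters


# Three lines copied from the log excerpt in the post above.
sample = [
    "May 28 20:38:52 Mausflix kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10",
    "May 28 20:39:24 Mausflix kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0",
    "May 28 22:44:54 Mausflix kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 54, rd 1620, flush 0, corrupt 0, gen 0",
]
down, counters = summarize(sample)
print(down)
print(counters)
```

Rising `wr` and `rd` counters right after a controller reset, with `corrupt` staying at 0, fit the picture of a device that dropped off the bus rather than one that returned bad data.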
atlatonin (author), posted May 29 (edited)

@bmartino1 thank you for the quick answer. It's strange, since the SSD is brand new (along with everything else). The go file line is from when I was trying to get Intel QuickSync to work; I've removed it.

I've run Check Filesystem Status, no errors:

    [1/7] checking root items
    [2/7] checking extents
    [3/7] checking free space tree
    [4/7] checking fs roots
    [5/7] checking only csums items (without verifying data)
    [6/7] checking root refs
    [7/7] checking quota groups skipped (not enabled on this FS)
    Opening filesystem to check...
    Checking filesystem on /dev/nvme0n1p1
    UUID: 73696a74-c953-4034-a51e-ec08d71305c7
    found 100921585664 bytes used, no error found
    total csum bytes: 72548668
    total tree bytes: 129875968
    total fs tree bytes: 25231360
    total extent tree bytes: 9207808
    btree space waste bytes: 28074520
    file data blocks allocated: 1955216793600
     referenced 96725143552

I've run a scrub, no errors:

    UUID: 73696a74-c953-4034-a51e-ec08d71305c7
    Scrub started: Wed May 29 09:11:51 2024
    Status: finished
    Duration: 0:00:30
    Total to scrub: 94.10GiB
    Rate: 3.14GiB/s
    Error summary: no errors found

And I've run an extended SMART test; the results are attached (again, no errors). Anything else I should check or run?

Attachment: CT1000P3PSSD8_234544DB96F6-20240529-0909.txt
JorgeB, posted May 29

Try this: on the Main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

    nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

    append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.
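Editor's note: the edit described above is a plain text change to the "append" line of the boot entry. As an illustration only, here is a small Python sketch of that string edit that skips parameters already present, so applying it twice does not duplicate them. The helper name `add_kernel_params` is hypothetical; it works on a single line of text and does not touch any file on the flash drive.

```python
def add_kernel_params(append_line, params):
    """Return the 'append' line with each parameter added unless its key is already present.

    Note: whitespace is normalized to single spaces, so any leading
    indentation of the original line is not preserved.
    """
    tokens = append_line.split()
    present = {token.split("=", 1)[0] for token in tokens}
    for param in params:
        key = param.split("=", 1)[0]
        if key not in present:
            tokens.append(param)
            present.add(key)
    return " ".join(tokens)


wanted = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]
once = add_kernel_params("append initrd=/bzroot", wanted)
twice = add_kernel_params(once, wanted)
print(once)
print(once == twice)
```

If a key is already set to a different value, this sketch leaves the existing value alone rather than overwriting it, which keeps the change easy to review by hand.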
bmartino1, posted May 29 (edited)

8 hours ago, JorgeB said:
> Try this: on the Main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":
>     nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> e.g.:
>     append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> Reboot and see if it makes a difference.

My Samsung 980 NVMe is affected by the firmware bug. In my case, I run this latency:

    kernel /bzimage
    append initrd=/bzroot nvme_core.default_ps_max_latency_us=5500

I don't trust running firmware upgrades on drives, for other reasons...
https://github.com/lolyinseo/samsung-nvme-firmware

You may have a similar drive that is affected.
atlatonin (author), posted May 31

On 5/29/2024 at 10:24 AM, JorgeB said:
> Try this: on the Main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":
>     nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> e.g.:
>     append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> Reboot and see if it makes a difference.

OK, so far 48h without the issue appearing, thank you! I'm gonna monitor for a little while longer before I mark it solved, though. If you don't mind me asking: what does adding this do, and what's the underlying problem here?
JorgeB, posted May 31

Some NVMe devices have issues with the lower power states on Linux; those options disable them.
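Editor's note: for context, `nvme_core.default_ps_max_latency_us=0` effectively turns off the NVMe driver's autonomous power state transitions, and `pcie_aspm=off` disables PCIe Active State Power Management. After the reboot, one way to confirm that both options actually reached the kernel is to look at the running command line, which Linux exposes in /proc/cmdline. Below is a minimal Python sketch of that check; the function names are illustrative, and the sample string stands in for the real file contents.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict (flags map to '')."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def low_power_states_disabled(cmdline):
    """True if both options suggested in this thread are on the command line."""
    params = parse_cmdline(cmdline)
    return (
        params.get("nvme_core.default_ps_max_latency_us") == "0"
        and params.get("pcie_aspm") == "off"
    )


# On the server itself, the string would come from reading /proc/cmdline.
# This sample is an assumption modelled on the boot entry quoted in the thread.
sample = "BOOT_IMAGE=/bzimage initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
print(low_power_states_disabled(sample))
print(low_power_states_disabled("BOOT_IMAGE=/bzimage initrd=/bzroot"))
```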
atlatonin (author), posted June 15

Unfortunately I have to re-open this thread, since the problem occurred again today. I tried to run diagnostics with a keyboard and monitor connected to the NAS, but I just got "diagnostics: command not found". I then tried to at least copy the syslog manually with

    cp /var/log/syslog /boot/syslog.txt

which failed with "cannot create regular file '/boot/syslog.txt': Input/output error" (errors like this were popping up a lot in general, even when running the shutdown command). Any advice on how to get any kind of diagnostics in this situation?

I have the syslog server running, but unfortunately there is a big gap in the log from when I last used the server until I restarted it. There is just one single line at 3:40am (when the mover ran, I assume), in which shfs copy_file fails to copy a file from cache to disk because the filename is too long. Apparently this happens every time the mover runs, though, so I don't think it's the culprit (I'm gonna fix it anyway, of course).
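Editor's note: to pin down when the remote syslog went quiet, the gaps between consecutive timestamps can be listed. The following Python sketch assumes the classic "Mon DD HH:MM:SS" prefix seen in the log excerpts earlier in this thread, an English locale for month names, and a year supplied by the caller (the prefix does not contain one). The function name `find_gaps`, the one-hour threshold, and the three sample lines are all made up for illustration and are not taken from the poster's log.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (previous, next) timestamp pairs that are further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        stamp = " ".join(line.split()[:3])
        try:
            current = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a recognizable timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Hypothetical lines with placeholder messages, shaped like the situation described above.
sample = [
    "Jun 14 22:10:05 tower placeholder: last activity in the evening",
    "Jun 15 03:40:01 tower placeholder: single line during the night",
    "Jun 15 11:02:30 tower placeholder: first line after the restart",
]
gaps = find_gaps(sample, 2024)
for start, end in gaps:
    print(f"no log lines between {start} and {end} ({end - start})")
```

A long silence is not proof of a hang by itself, since an idle server may simply log very little, but the start of the last gap before the restart narrows down where to look.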
itimpi, posted June 15

38 minutes ago, atlatonin said:
> "diagnostics: command not found".

It's difficult to see how this can happen unless you have a RAM issue, so that the OS that was loaded into RAM at boot time has since become corrupted.
atlatonin (author), posted June 15

7 minutes ago, itimpi said:
> Difficult to see how this can happen unless you have a RAM issue so that OS that was loaded into RAM at boot time has since become corrupt.

Hmm, I ran a standard memtest when building the NAS without problems; I can run another one overnight later. Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old.
itimpi, posted June 15

13 minutes ago, atlatonin said:
> Could this maybe also happen because of the boot USB stick somehow? That one is already pretty old

Not really. I could see the diagnostics command failing to run to completion if there is an issue with the flash drive, but not the command not being found.
JorgeB, posted June 15

6 hours ago, itimpi said:
> Difficult to see how this can happen

Same, I never saw that before. Are you sure there were no typos?
atlatonin (author), posted June 16

18 hours ago, JorgeB said:
> Same, never saw that before, are you sure no typos?

I took a picture of the screen: https://i.imgur.com/nSJxZ92.jpeg (not very helpful, I know, but just so you guys know it's not a typo, haha). The crond messages in there aren't caused by any input; they just pop up every couple of seconds.
JorgeB, posted June 16 (marked as solution)

I would suggest recreating the flash drive.
atlatonin (author), posted June 20

On 6/16/2024 at 1:23 PM, JorgeB said:
> I would suggest recreating the flash drive.

Just did that with a newly bought USB stick. I'm gonna run a memtest when I get to it; otherwise I'll mark this as solved for now and update if the issue occurs again (hopefully this time with a working diagnostics command...).