November 7, 2024

Hello, I've been running Unraid for a few months and have had no issues. This morning, I received this error from Unraid:

/var/log is getting full (currently 82 % used)

When I took a look at the log files, I saw that syslog was very large. Viewing the syslog shows lots of BTRFS errors such as these:

Nov 7 11:36:06 NAS2 kernel: BTRFS warning (device nvme0n1p1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
Nov 7 11:36:06 NAS2 kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 27450484, rd 49269, flush 322751, corrupt 0, gen 0
Nov 7 11:36:06 NAS2 kernel: BTRFS error (device nvme0n1p1): error writing primary super block to device 1

After rebooting the server, the array didn't start because it said nvme0n1p1 was missing:

Cache pool BTRFS missing device(s)

I unplugged the server, re-seated the SSD, rebooted, and the drive appeared again. I ran short and extended SMART self-tests, which completed without errors. But Docker will no longer start because the cache is in read-only mode?

Unable to write to cache
Unable to write to Docker Image

Is my nvme0n1p1 drive failing or corrupted? I'm a bit lost on where to go, and any help would be greatly appreciated. Thank you!

nas2-diagnostics-20241107-1135.zip
Samsung_SSD_990_PRO_with_Heatsink_2TB-20241107-1205-SMART.txt

Edited November 7, 2024 by projectsunset
November 7, 2024, Author

So I've got the array and Docker back online by removing the nvme0n1 drive from the raid1 cache pool. I still don't have any idea how to diagnose what's wrong with the drive, or whether it's corrupted or failing. Should I format it and re-add it to the cache? I'm nervous about re-adding it without knowing what the problem is.
November 8, 2024, Community Expert, Solution

The syslog rotated, so the NVMe device was already offline in the diags. Power cycle the server and it should come back. It would be better to see when it dropped, but in some cases this helps:

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

Reboot and see if it makes a difference.
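After the reboot, one way to confirm the options actually reached the kernel is to look at the running kernel's command line. Here is a minimal Python sketch of that check, assuming Python is available on the machine where you run it. The helper names `missing_boot_params` and `read_cmdline` are made up for illustration; `/proc/cmdline` is the standard Linux location for the boot command line.

```python
# Parameters suggested in the post above.
REQUIRED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)


def missing_boot_params(cmdline, required=REQUIRED):
    """Return the required parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read().strip()


if __name__ == "__main__":
    try:
        missing = missing_boot_params(read_cmdline())
    except OSError as exc:
        # Not on Linux, or /proc is unavailable.
        print(f"could not read kernel command line: {exc}")
    else:
        if missing:
            print("missing: " + ", ".join(missing))
        else:
            print("all parameters active")
```

If anything is reported as missing, the edit most likely went into a boot entry other than the default one.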
November 8, 2024, Author

Thank you Jorge. I've added that to the "Unraid OS" Syslinux Config and rebooted my server. I've also added your "monitor a btrfs or zfs pool for errors" script to my User Scripts to run hourly.

Since the cache has been running from a single drive for the past 17+ hours, should I wipe the nvme0n1p1 drive before re-adding it to the cache pool? Or is it safe to add back as is?

Thank you so much for the help!
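The monitoring script mentioned above is not reproduced in this thread. As a separate illustration only, here is a rough Python sketch that pulls the per-device error counters out of a syslog line in the format quoted in the first post. The regular expression is written against that one sample line, so treat the pattern as an assumption; the names `ERRS_RE`, `parse_btrfs_errs`, and `has_errors` are made up for this example.

```python
import re

# Pattern modelled on the "bdev ... errs:" line quoted in the first post.
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, counters) for a matching syslog line, else None."""
    match = ERRS_RE.search(line)
    if match is None:
        return None
    counters = {
        name: int(value)
        for name, value in match.groupdict().items()
        if name != "device"
    }
    return match.group("device"), counters


def has_errors(counters):
    """True if any counter is non-zero."""
    return any(value > 0 for value in counters.values())


sample = (
    "Nov 7 11:36:06 NAS2 kernel: BTRFS error (device nvme0n1p1): "
    "bdev /dev/nvme0n1p1 errs: wr 27450484, rd 49269, flush 322751, "
    "corrupt 0, gen 0"
)
device, counters = parse_btrfs_errs(sample)
print(device, counters, has_errors(counters))
```

Non-zero write, read, or flush counters like the ones in the sample are consistent with the device having dropped offline, as described in the reply above.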
November 8, 2024, Author

Thanks Jorge. I've re-added the drive to the cache pool and everything is looking good so far. I'll mark this as solved and keep a closer eye on the logs to see if the problem resurfaces.

Thank you so much for your help, I really appreciate it!