September 25, 2024
I just noticed that my docker service failed to start. After trying the universal "have you tried restarting it" solution, I now see that my cache drive is missing. Could someone please help me debug this? I have no idea where to start, and I don't have physical access to my server until tomorrow evening. I've attached the output from Tools -> Diagnostics, taken after the reboot. Thanks!
atlas-diagnostics-20240925-0825.zip
Edited September 25, 2024 by Ra5mu5: clarified attachment
September 25, 2024
Looking at the previous log, it looks like your cache drive reset:

Sep 25 03:04:51 Atlas kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Sep 25 03:04:51 Atlas kernel: nvme0n1: I/O Cmd(0x2) @ LBA 15566688, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
Sep 25 03:04:51 Atlas kernel: I/O error, dev nvme0n1, sector 15566688 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Sep 25 03:05:23 Atlas kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
Sep 25 03:05:23 Atlas kernel: nvme nvme0: Removing after probe failure status: -19

The drive is not being detected after the reboot, so there is no information about it in the current syslog. That means it is not clear whether it just dropped offline for some reason or has really failed. You probably need to power cycle the server (rather than do a simple reboot) to see if it comes back online. If it does, fresh diagnostics might give a clue as to its health.
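For anyone who wants to check their own syslog for the same symptom, here is a minimal Python sketch that scans log text for the NVMe messages quoted above. The function name `find_nvme_drops` and the pattern list are illustrative assumptions, not part of any Unraid tooling, and other kernel versions may word these messages differently.

```python
import re

# Patterns copied from the kernel messages quoted in this thread.
# Assumption: other kernels may phrase these differently.
NVME_DROP_PATTERNS = [
    re.compile(r"nvme (nvme\d+): controller is down; will reset"),
    re.compile(r"nvme (nvme\d+): Device not ready; aborting reset"),
    re.compile(r"nvme (nvme\d+): Removing after probe failure"),
]


def find_nvme_drops(log_text):
    """Return (controller, line) pairs for lines suggesting a dropped NVMe controller."""
    hits = []
    for line in log_text.splitlines():
        for pattern in NVME_DROP_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((match.group(1), line.strip()))
                break  # one hit per line is enough
    return hits


sample = """\
Sep 25 03:04:51 Atlas kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
Sep 25 03:04:51 Atlas kernel: I/O error, dev nvme0n1, sector 15566688 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Sep 25 03:05:23 Atlas kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
Sep 25 03:05:23 Atlas kernel: nvme nvme0: Removing after probe failure status: -19
"""

for controller, line in find_nvme_drops(sample):
    print(controller, "|", line)
```

On a real system the same function could be fed the contents of a syslog file from the diagnostics zip instead of the `sample` string.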
September 25, 2024 Author
I was able to get my brother to power cycle it, and the drive now shows up and everything works again. I've attached diagnostics in case you want to have a look and see if there's any information about what happened and why. Thanks!
atlas-diagnostics-20240925-1543.zip
Edited September 25, 2024 by Ra5mu5: added diagnostics
September 25, 2024 Solution
Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add the following to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it makes a difference.
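The edit described above can also be sketched in Python for anyone who prefers to script it or to check the result. This is only an illustration: the helper name `add_kernel_params` and the sample config fragment are assumptions, and unlike the advice above it touches every "append" line that loads /bzroot, not just the default boot option.

```python
# Kernel parameters suggested in this thread.
NVME_PARAMS = ["nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off"]


def add_kernel_params(cfg_text, params=NVME_PARAMS):
    """Add any missing params to each 'append' line that contains initrd=/bzroot.

    Simplification: all matching boot entries are edited, not only the default one.
    """
    out = []
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "append" and "initrd=/bzroot" in tokens:
            missing = [p for p in params if p not in tokens]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one.
    if cfg_text.endswith("\n"):
        result += "\n"
    return result


# Hypothetical config fragment, for illustration only.
sample_cfg = """\
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
"""

print(add_kernel_params(sample_cfg))
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to apply repeatedly.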
September 25, 2024 Author
24 minutes ago, JorgeB said: nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Do you think I should add this even though this was the first time it happened and it was fixed with a power cycle? I've been running fine for 9 months without encountering this issue. My server had been running without a restart for almost 2 months when this happened.
September 25, 2024
27 minutes ago, Ra5mu5 said: I've been running fine for 9 months without encountering this issue.

In that case I would probably not add it for now, only if it happens again.
October 2, 2024 Author
@JorgeB @itimpi So it happened again. I noticed my services were down and saw the "docker service failed to start" message. Weirdly, the UI was still showing the cache drive as working. I downloaded diagnostics at this point. I then added the change to the Syslinux Configuration and rebooted (from the GUI, not a full power cycle), and the drive now shows as missing, just like last time. I can have the server power cycled in a few hours, but I imagine that if it works the logs will be the same as last time.
October 2, 2024 Author
1 hour ago, JorgeB said: Post those diags.

Can't believe I forgot to include them...
atlas-diagnostics-20241002-1129.zip
October 2, 2024
Oct 2 11:24:41 Atlas kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
Oct 2 11:24:41 Atlas kernel: nvme nvme0: Removing after probe failure status: -19
Oct 2 11:24:41 Atlas kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Oct 2 11:24:41 Atlas kernel: nvme0n1: detected capacity change from 1953525168 to 0

The device dropped again. Try adding the parameters I mentioned above, and if it keeps dropping I recommend trying a different device.
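The BTRFS line in that excerpt carries per-device error counters (write, read, flush, corruption, generation). As a rough sketch, they can be pulled out of a log line like this; the function name `parse_btrfs_errs` is illustrative, and the regular expression is written only against the message format quoted above.

```python
import re

# Written against the BTRFS message format quoted in this thread (an assumption
# for other kernel versions).
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (bdev, counters) for a BTRFS 'errs:' line, or None if it does not match."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    counters = {key: int(fields[key]) for key in ("wr", "rd", "flush", "corrupt", "gen")}
    return fields["bdev"], counters


line = (
    "Oct 2 11:24:41 Atlas kernel: BTRFS error (device nvme0n1p1): "
    "bdev /dev/nvme0n1p1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0"
)
print(parse_btrfs_errs(line))
```

In this case only the write counter is nonzero, which fits a device that disappeared while the filesystem was still mounted.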