Cache drives keep dropping offline

scytherswings · September 28, 2022

I've been running Unraid for a handful of months now, and I can't for the life of me figure out why it is that my cache drives randomly go offline/unresponsive after a period of time. Here's what I've tried:

Use 1 500GB Samsung cache drive.

Use 2 500GB samsung/crucial cache drives.

Upgrade to 2 1TB brand new crucial cache drives with same old cables.

Change ancient power supply to brand new one.

Swap sata cables to different ones.

Change ports on motherboard.

I am about to crack open the server once again to try and replace these cables, but I'm at my wits end for getting this server stable. I never had issues with this hardware running ubuntu, but I guess it is getting old. What strikes me as odd is that it's only the cache drives I'm having trouble with. I have had absolutely 0 problems with my spinning disks, they just work - which is great.

I've attached a diagnostic file after rebooting this evening once I found my docker service unable to start (this is a new one!). After rebooting, both cache drives were offline even though nothing has physically changed with the machine.

Thanks much!

hathor-diagnostics-20220927-2048.zip

JorgeB · September 28, 2022

Were the drives online before the reboot? Not much to see in the diags when the devices are already not detected, check the board BIOS, if they are not detected there they also won't be by Unraid.

scytherswings · September 28, 2022

First, thank you for your quick response!

Yes they were online before rebooting and had been running for about 12 days - in my frustration I rebooted and got this diag. I then shut the machine down, changed the sata cables and the motherboard ports they were plugged into, and rebooted. The drives came online and my containers started up successfully, then I got a warning from the fix common problems plugin for this:

Event: Fix Common Problems - Hathor
Subject: Errors have been found with your server (Hathor).
Description: Investigate at Settings / User Utilities / Fix Common Problems
Importance: alert

* **Unable to write to safe_cache**
* **Unable to write to Docker Image**

And I noticed that my docker services were unresponsive or crashing. I rebooted again and the drives presented as offline and the array did not start. I went to bed.

This morning, after shutting down and then booting to the bios, I can see that the bios does see the drives. I exited the bios and let unraid boot and everything mounted without any issues. I'll be keeping an eye on how long it stays up and get diags before rebooting if the drives drop off again. I find it surprising that both drives went offline at the same time, but I know that without better diagnostic information it'll be difficult to establish a root cause. I will update here once I'm able to produce diag information.

Edited September 28, 2022 by scytherswings

scytherswings · October 9, 2022

Ok it's happened again, unfortunately it seems to have happened a few days ago and I wasn't able to get logs right away and /var/log seems to be full, regardless I've captured these diagnostic logs and plan to install the update to 6.11.0 and reboot after.

I've included a screenshot that highlights that one of my two cache drives have gone offline - the easiest way to see that this has happened is that the temperature readout shows * instead of an actual reading.

Thanks much!

hathor-diagnostics-20221009-1859.zip

scytherswings · October 10, 2022

After about an hour or so it looks like my docker service has had an issue now too - guess i'm forced to reboot. hathor-diagnostics-20221009-2009.zip

JorgeB · October 10, 2022

Cannot see the beginning of the problem but the device dropped offline, this is usually a power/connection problem, also see here for better pool monitoring so you are alerted as soon as there is a problem.

scytherswings · October 14, 2022

Huzzah! Your monitoring suggestion has paid off! I just got a notification and was able to capture these diagnostics! I haven't read any logs files yet myself so I'll be doing that now. Also I'm lucky that only one of the drives has knocked offline, it's a real pain when its both.

hathor-diagnostics-20221013-2048.zip

scytherswings · October 14, 2022

If I had to guess, this is where it happens?

Spoiler

Oct 13 20:33:07 Hathor kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 13 20:33:07 Hathor kernel: ata5.00: failed command: DATA SET MANAGEMENT
Oct 13 20:33:07 Hathor kernel: ata5.00: cmd 06/01:01:00:00:00/00:00:00:00:00/a0 tag 7 dma 512 out
Oct 13 20:33:07 Hathor kernel:         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 13 20:33:07 Hathor kernel: ata5.00: status: { DRDY }
Oct 13 20:33:07 Hathor kernel: ata5: hard resetting link
Oct 13 20:33:12 Hathor kernel: ata5: link is slow to respond, please be patient (ready=0)
Oct 13 20:33:17 Hathor kernel: ata5: COMRESET failed (errno=-16)
Oct 13 20:33:17 Hathor kernel: ata5: hard resetting link
Oct 13 20:33:22 Hathor kernel: ata5: link is slow to respond, please be patient (ready=0)
Oct 13 20:33:27 Hathor kernel: ata5: COMRESET failed (errno=-16)
Oct 13 20:33:27 Hathor kernel: ata5: hard resetting link
Oct 13 20:33:32 Hathor kernel: ata5: link is slow to respond, please be patient (ready=0)
Oct 13 20:34:02 Hathor kernel: ata5: COMRESET failed (errno=-16)
Oct 13 20:34:02 Hathor kernel: ata5: limiting SATA link speed to 3.0 Gbps
Oct 13 20:34:02 Hathor kernel: ata5: hard resetting link
Oct 13 20:34:07 Hathor kernel: ata5: COMRESET failed (errno=-16)
Oct 13 20:34:07 Hathor kernel: ata5: reset failed, giving up
Oct 13 20:34:07 Hathor kernel: ata5.00: disabled
Oct 13 20:34:07 Hathor kernel: ata5: EH complete
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=89s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#6 CDB: opcode=0x2a 2a 00 03 a7 32 d0 00 00 10 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 61289168 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Oct 13 20:34:07 Hathor kernel: btrfs_dev_stat_print_on_error: 1848 callbacks suppressed
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#7 CDB: opcode=0x2a 2a 00 01 89 1e a8 00 00 38 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 25763496 op 0x1:(WRITE) flags 0x800 phys_seg 7 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#11 CDB: opcode=0x2a 2a 00 05 1f 5a 78 00 00 40 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 85940856 op 0x1:(WRITE) flags 0x8800 phys_seg 8 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#12 CDB: opcode=0x2a 2a 00 05 2a 73 80 00 00 08 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 86668160 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#13 CDB: opcode=0x2a 2a 00 05 2a 75 00 00 00 08 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 86668544 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#14 CDB: opcode=0x2a 2a 00 05 2a 76 a8 00 00 10 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 86668968 op 0x1:(WRITE) flags 0x8800 phys_seg 2 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#15 CDB: opcode=0x2a 2a 00 05 39 67 20 00 00 20 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 87648032 op 0x1:(WRITE) flags 0x8800 phys_seg 4 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#16 CDB: opcode=0x2a 2a 00 05 39 68 20 00 00 20 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 87648288 op 0x1:(WRITE) flags 0x8800 phys_seg 4 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#17 CDB: opcode=0x2a 2a 00 05 39 68 c0 00 00 20 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 87648448 op 0x1:(WRITE) flags 0x8800 phys_seg 4 prio class 0
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Oct 13 20:34:07 Hathor kernel: sd 6:0:0:0: [sdd] tag#18 CDB: opcode=0x2a 2a 00 05 39 6b 40 00 00 20 00
Oct 13 20:34:07 Hathor kernel: blk_update_request: I/O error, dev sdd, sector 87649088 op 0x1:(WRITE) flags 0x8800 phys_seg 4 prio class 0
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x421a0f8 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x52734c8 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: traps: lsof[12696] general protection fault ip:1539a4f706ae sp:5071544aa1e8b4a5 error:0 in libc-2.33.so[1539a4f57000+15e000]
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x3b7f2c8 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x3b7f2d0 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x3b7f2d8 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x3b7f2e0 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x41f3c00 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x41f3c08 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x41f3c10 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): direct IO failed ino 261 rw 0,0 sector 0x41f3c18 len 0 err no 10
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): lost page write due to IO error on /dev/sdd1 (-5)
Oct 13 20:34:07 Hathor kernel: BTRFS error (device sdd1): error writing primary super block to device 1
Oct 13 20:34:07 Hathor kernel: BTRFS warning (device sdd1): lost page write due to IO error on /dev/sdd1 (-5)
### [PREVIOUS LINE REPEATED 2 TIMES] ###

JorgeB · October 14, 2022

7 hours ago, scytherswings said:

If I had to guess, this is where it happens?

Yes, replace cable, both power and SATA.

scytherswings · October 25, 2022

After the last incident I had decided not to replace my cables (I've already done this a few times throughout this process) and instead removed the Dynamix SSD TRIM plugin which I had thought was causing issues by running the SET MANAGEMENT command. Unfortunately that looks like it was a red herring (I believe Unraid natively supports TRIM stuff nowadays, so this was not going to change anything). To nobody's surprise I've had both my drives drop off again today after updating my docker images.

What's frustrating is that Unraid actually doesn't mark these drives as offline, even though they have been disabled per the logs:

Oct 24 08:41:39 Hathor kernel: ata5: COMRESET failed (errno=-16)
Oct 24 08:41:39 Hathor kernel: ata5: hard resetting link
Oct 24 08:41:44 Hathor kernel: ata5: link is slow to respond, please be patient (ready=0)
Oct 24 08:42:08 Hathor kernel: ata3: COMRESET failed (errno=-16)
Oct 24 08:42:08 Hathor kernel: ata3: limiting SATA link speed to 3.0 Gbps
Oct 24 08:42:08 Hathor kernel: ata3: hard resetting link
Oct 24 08:42:13 Hathor kernel: ata3: COMRESET failed (errno=-16)
Oct 24 08:42:13 Hathor kernel: ata3: reset failed, giving up
Oct 24 08:42:13 Hathor kernel: ata3.00: disabled
Oct 24 08:42:13 Hathor kernel: ata3: EH complete
Oct 24 08:42:13 Hathor kernel: sd 3:0:0:0: [sdb] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=90s
Oct 24 08:42:13 Hathor kernel: sd 3:0:0:0: [sdb] tag#16 CDB: opcode=0x2a 2a 00 00 1a 80 80 00 00 40 00

See the attached image - the temperature being an * is the only indication that something is wrong.

I'm going to actually swap the cables as requested again and put the SSDs on my PCIe HBA instead of the motherboard controller. I'm running out of ideas and I'm reasonably convinced there's some sort of latent bug in the MX500 series firmware or the drivers unraid uses to control these drives is faulty.

..1 day later..

After swapping my SATA cables, controllers, and even power cables I have yet again had one of my cache drives drop offline. I connected my safe_cache drives, one to my motherboard controller and the other to my SAS2008 HBA and the mothboard connected drive has dropped. I'm going to swap the sata cable on that one again and see what happens. I have a hard time believing this issue is cable related as I've already used about 6 different SATA cables at this point for these drives, but I have more I can try.

JorgeB · October 26, 2022

11 hours ago, scytherswings said:

What's frustrating is that Unraid actually doesn't mark these drives as offline

See here for better pool monitoring.

itimpi · October 26, 2022

2 hours ago, JorgeB said:

here

Would it not be better to replace the ‘if’ statement in the script with one of the form

for p in /mnt/pool1 /mnt/pool2; do

….

done

as this makes it more obvious how to handle multiple pools?

thinking about it one could look at the entries in config/pools folder on the flash drive to make it completely generic? I might check that out on my own system as if it was completely generic it would be easier to add to a future Unraid release?

JorgeB · October 26, 2022

20 minutes ago, itimpi said:

thinking about it one could look at the entries in config/pools folder on the flash drive to make it completely generic? I might check that out on my own system as if it was completely generic it would be easier to add to a future Unraid release?

That would be great, I think LT is going to implement better pool monitoring in the future, possibly when zfs support is added, but not yet confirmed.

itimpi · October 26, 2022

1 hour ago, JorgeB said:

That would be great, I think LT is going to implement better pool monitoring in the future, possibly when zfs support is added, but not yet confirmed.

I will see if I can knock something together then and if I succeed to either make it a plugin or maybe issue a pull request to get it in the next Unraid release.

JorgeB · October 26, 2022

1 hour ago, itimpi said:

get it in the next Unraid release.

The idea for the when it gets implemented in Unraid is not to use just a script, though that would still be better than nothing, but basically monitor the output of "btrfs device stats /mountpoint" (or "zpool status poolname" for zfs) and use the existing GUI errors column for pools, which currently is just for show since it's always 0, to display the sum of all the possible device errors (there are 5 for btrfs and 3 for zfs), and if it's non zero for any pool device generate a system notification, optionally hovering the mouse over the errors would show the type and number of errors like we have now when hovering the mouse over the SMART thumbs down icon in the dash.

Additionally you'd also need a button or check mark to reset the errors, it would run "btrfs dev stats -z /mountpoint" or "zpool clear poolname".

scytherswings · November 16, 2022

To sum up, the issue happened about three more times and I've decided to give up on btrfs for cache drives. I decided to run memtest86+ for 40 hours or so and found no issues, so I am relatively confident that the drive issues were not caused by memory corruption. I have since reformatted each ssd with xfs and made each one its own cache drive, losing my redundancy. So far no issues but it has only been about a week.

JorgeB · November 16, 2022

Devices dropping offline has nothing to do with btrfs or RAM.

scytherswings · December 6, 2022

26 days and no issues with any drive dropping offline yet, I'm hopeful still that the system will be stable with this configuration but only time will tell.

Edit: 55 days and no issues!

Edited January 4, 2023 by scytherswings
Update

Cache drives keep dropping offline

Recommended Posts

scytherswings

Link to comment

JorgeB

Link to comment

scytherswings

Link to comment

scytherswings

Link to comment

scytherswings

Link to comment

JorgeB

Link to comment

scytherswings

Link to comment

scytherswings

Link to comment

JorgeB

Link to comment

scytherswings

Link to comment

JorgeB

Link to comment

itimpi

Link to comment

JorgeB

Link to comment

itimpi

Link to comment

JorgeB

Link to comment

scytherswings

Link to comment

JorgeB

Link to comment

scytherswings

Link to comment

Join the conversation