[SOLVED] cache drive errors - drive, cable or docker image? - General Support

August 19, 2020

Hello all. Every few days my Docker containers drop out and I get cache drive errors (I have to reboot to recover). After reading on this forum, I see that it could be a corrupted docker image (I had a few dirty shutdowns in the past), a bad cable, or a bad cache drive. Could I get some help diagnosing this? Logs are attached.

tower-diagnostics-20200819-0807.zip

August 19, 2020

Community Expert

Cache device dropped offline:

Aug 19 03:47:16 Tower kernel: ata5: hard resetting link
Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)
Aug 19 03:47:21 Tower kernel: ata5: reset failed, giving up
Aug 19 03:47:21 Tower kernel: ata5.00: disabled
Aug 19 03:47:21 Tower kernel: ata5: EH complete

With SSDs this is usually a connection/cable problem.
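For anyone who wants to check their own syslog for the same pattern, here is a minimal sketch that groups link-failure messages by ATA port. The regular expressions are taken only from the excerpt above, so other kernels or drivers may word these messages differently; the function name `find_ata_drops` and the `SAMPLE` variable are made up for illustration.

```python
import re
from collections import defaultdict

# Messages that suggest the link to a device could not be re-established.
# Patterns are based only on the log excerpt in this thread (an assumption).
FAILURE_PATTERNS = (
    re.compile(r"kernel: (ata\d+): COMRESET failed"),
    re.compile(r"kernel: (ata\d+): reset failed, giving up"),
    re.compile(r"kernel: (ata\d+)\.\d+: disabled"),
)


def find_ata_drops(lines):
    """Return {port: [matching log lines]} for link-failure messages."""
    drops = defaultdict(list)
    for line in lines:
        for pattern in FAILURE_PATTERNS:
            match = pattern.search(line)
            if match:
                drops[match.group(1)].append(line.strip())
                break  # one match per line is enough
    return dict(drops)


# The five lines quoted in the post above.
SAMPLE = """\
Aug 19 03:47:16 Tower kernel: ata5: hard resetting link
Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)
Aug 19 03:47:21 Tower kernel: ata5: reset failed, giving up
Aug 19 03:47:21 Tower kernel: ata5.00: disabled
Aug 19 03:47:21 Tower kernel: ata5: EH complete
""".splitlines()

for port, events in find_ata_drops(SAMPLE).items():
    print(port, len(events))
```

To run it against a live system, pass the lines of the syslog file instead of `SAMPLE`. The port name (`ata5` here) still has to be mapped to a physical drive by hand, which this sketch does not attempt.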

August 19, 2020

Author

Thank you. I will give this a shot first and reply back here. Should I try rebuilding the docker image also, as it seems susceptible to corruption?

August 19, 2020

Community Expert

2 minutes ago, akshunj said:

Should I try rebuilding the docker image also, as it seems susceptible to corruption?

It might not need rebuilding, but it won't hurt.

August 19, 2020

Author

Changed out SATA and power cables. Will rebuild the docker image next if issues continue. Both SSDs in the cache pool are new-ish, so I am hoping it's not a drive issue.

August 20, 2020

Author

So, same error just happened. This time affecting both cache drives in the pool. I did extended SMART tests on both drives and they passed. Moving on to the docker rebuild. New diagnostics attached, in case anyone wants to take a stab...

tower-diagnostics-20200820-1649.zip

August 20, 2020

Author

I am going to run a memtest to check the RAM first, then rebuild the docker image. There are so many posts about this type of issue that I am not sure how to proceed.

August 20, 2020

Community Expert

1 hour ago, akshunj said:

rebuild the docker image

Here is how that goes in case you need it.

In Settings - Docker, disable Docker and delete the image. Enable it again to recreate docker.img.

If you had any custom docker network such as proxynet you will have to recreate it.

Then be sure to use the Previous Apps feature on the Apps page, which will reinstall your dockers just as they were.

August 20, 2020

Author

Thanks so much.

The more I investigate, the more I think this is a hardware issue or kernel bug. That ATA error is from the kernel. I know my way around linux, but I'm still an unraid noob. After some googling, it appears this particular kernel error is related to SSDs specifically. It's driving me nuts. Memtest was fine. Cables have been switched out. I had the drives on a pci-e sata card before this, and I removed the card and attached them to the mainboard sata ports, thinking that the pci-e card was the issue. The hard drives seem fine, so I am left with two new-ish SSDs that throw this kernel error at random times after boot. And both drives pass SMART tests. I am flummoxed.

I am going to move everything back to the array, pull the SSDs from the sata connection, and try to use them as cache from a usb sata dock. I know throughput will be choked, but I am not sure how much that will translate to a real-world performance hit.

August 20, 2020

Community Expert

6 minutes ago, akshunj said:

try to use them as cache from a usb sata dock.

USB connections are not reliable enough for the permanent connections required in the array or pool.

August 21, 2020

Author

I hear that. I am going to back up appdata regularly and give it a spin. I'm not wild about buying new SSDs.

August 22, 2020

Author

So, after changing power cables, sata cables, sata ports, power supply, UPS, and testing ram, I could only conclude that this linux kernel + these ssd drives (Crucial sata) do not play well together. I have switched the ssd drives to a usb3 sata dock and no issues so far. At some point, I will upgrade the ssd drives and try again via sata, but that time is not now.

Thinking out loud, I know unraid uses the older LTS linux kernel. But I have no idea how bug fixes and security updates to that kernel are handled. I hooked the ssd's up to my home theater plex box via sata, and not a single error (different mobo, so obviously not apples to apples). It runs the latest version of linux mint with kernel 5.4. I wonder if some of the ssd issues that many experience on this forum are related to the older kernel version used by unraid.

August 22, 2020

Community Expert

25 minutes ago, akshunj said:

I know unraid uses the older LTS linux kernel

Latest beta is kernel 5.7

August 23, 2020

Author

4 hours ago, trurl said:

Latest beta is kernel 5.7

Hmmm, I think I will check it out. How stable are the unraid beta releases?

August 23, 2020

Community Expert

Many are already using latest beta. In addition to new kernel there are some very interesting new features. I seldom hesitate to run betas because I have a good (enough) backup plan.

August 31, 2020

Author

update: getting the ssd drives off sata and onto usb3 has fixed the issue. Usb3 is NOT a great alternative at all, but everything is stable and cache performance is... fine. I just bought a used xeon server and intend to migrate my drives over to this and upgrade to unraid beta. I am curious to see if my SSD's will fare better on the new mobo and upgraded kernel.

Sidenote: Digging into unraid logs, I noticed a few Slackware references. I learned linux on Slackware 20 years ago. I think it's cool as hell that I'm using it today via unraid.

October 1, 2020

Author

So, after migrating my unraid install to the Xeon server, and plugging the ssd cache drives into the sata bays, the same error has returned with a vengeance. I have attached my logs in case anyone wants to take a peek. The fun happens on 9/30 around 1530 or so, if memory serves. I realize this is not particularly an unraid issue, as it really has nothing to do with the array, the software or anything other than how the kernel interacts with the controller and the device.

I took the array offline, and ran an extended smart test on the drive that got kicked off (it passed). I pulled the drive last night to see if the remaining ssd would generate similar kernel errors on its own. Shortly after I restarted the array, I got this:

Sep 30 21:31:36 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: cmd error handler
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata9: end_device-1:2: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata10: end_device-1:3: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

Drive remained alive and connected, and it's fine right now. Here's how I am thinking about this issue:

1. I am going to continue using the single ssd for now; maybe the drive that got kicked really was bad. However, if it was, I wonder why it worked fine in the sata dock.

2. These were twin drives... same make and model. If there's a kernel issue with how this drive talks to the controller, maybe a new drive of a different make & model will fix the issue. Already ordered one.

3. If steps 1 & 2 don't work, I will upgrade to the latest unraid beta and try the new kernel.

That's every option I can think of, at this point. I have replaced cables (power & sata), moved them from one sata controller to another, and the only way I got them to behave was to use a sata to usb3 dock. I am open to any other diagnostic suggestions. I am going to hit the slackware subreddit, and see if there may be some insights from those guys. Thanks all.

tchalla-diagnostics-20200930-1535.zip tchalla-smart-20200930-1644.zip

October 1, 2020

2 hours ago, akshunj said:

I am open to any other diagnostic suggestions

Have you tried the beta to see if the newer kernel fixes things?

October 1, 2020

Author

Going to try that now. Server just crashed in spectacular fashion. First the ssd dropped, then two more data drives. It was pretty unbelievable. Attached are diagnostics from the latest fiasco.

tchalla-diagnostics-20201001-1323.zip

October 1, 2020

Community Expert

The Intel SAS controller sometimes has issues with Linux, unlike the Intel SATA ports, which are rock solid. If you can, try using an LSI HBA instead, for example.

October 1, 2020

Author

16 minutes ago, JorgeB said:

The Intel SAS controller sometimes has issues with Linux, unlike the Intel SATA ports, which are rock solid. If you can, try using an LSI HBA instead, for example.

This system was my backup server for a month with zero issues. When I swapped in the drives from my main server, this began.

October 1, 2020

Author

Upgraded to beta and was immediately greeted by this:

Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata8: end_device-7:1: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata9: end_device-7:2: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

I am going to remove the cache drive, and restore my appdata and dockers onto the array. This issue has never taken out my data drives before. I just need to reassure myself this is still an issue with the cache drive.
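To see at a glance which ports show up in a recovery pass like the one above, the lines can be grouped by handler type. This is only a sketch based on the message format quoted in this thread: the function name `summarize_sas_recovery` is made up, and what each handler type implies about the drive or controller is not established here, so the code only sorts the lines and draws no conclusion.

```python
import re

# Format taken from the log lines quoted in this thread (an assumption for
# any other driver or kernel version).
HANDLER_RE = re.compile(
    r"sas: (ata\d+): (end_device-\d+:\d+): (cmd|dev) error handler"
)
EXIT_RE = re.compile(
    r"Exit sas_scsi_recover_host: busy: (\d+) failed: (\d+) tries: (\d+)"
)


def summarize_sas_recovery(lines):
    """Group (port, device) pairs by handler type and keep the exit counters."""
    summary = {"cmd": [], "dev": [], "exit": None}
    for line in lines:
        handler = HANDLER_RE.search(line)
        if handler:
            port, device, kind = handler.groups()
            summary[kind].append((port, device))
            continue
        exit_match = EXIT_RE.search(line)
        if exit_match:
            busy, failed, tries = map(int, exit_match.groups())
            summary["exit"] = {"busy": busy, "failed": failed, "tries": tries}
    return summary


# The six lines logged right after the upgrade to the beta.
SAS_SAMPLE = """\
Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata8: end_device-7:1: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata9: end_device-7:2: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
""".splitlines()

print(summarize_sas_recovery(SAS_SAMPLE))
```

In this excerpt only `ata7` appears on a `cmd` line, while `ata7`, `ata8` and `ata9` all appear on `dev` lines.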

October 1, 2020

Community Expert

6 minutes ago, akshunj said:

When I swapped in the drives from my main server, this began.

It's possible that some drive models have compatibility issues with the controller.

October 1, 2020

Author

Removed both ssd cache drives. Rebooted and restarted array. Logs are clean. No mention of the ata kernel error. Yet. Going to recreate my docker image and bring the dockers back up this eve. New ssd arrives next week. Will install as cache when it arrives and give it another go.

It looks like the controller passed the kernel incorrect state info about multiple drives, although it was instigated by the ssd drive. I am going to assume @JorgeB is correct that certain drive models are no good for certain controllers (at least the ones in my signature). I find this odd as one controller is for a server, and the other is for a desktop. It's also weird that the drives worked in my dock, but not attached via internal sata.

I looked up my purchase on ebay. These were *refurbished* drives. This was my mistake; I could have sworn they were new. I am hoping this is the root of my ACTUAL problem. Regardless, I will be using the warranty. The new drive I ordered is new, so maybe that plus the new make/model will work out.

Still going to monitor the array for ata errors. The crash a few hours ago was *rough*.

October 4, 2020

Author

Issue is definitely related to SSD and controller interaction. 72 hours running without the ssd and no ata errors. All dockers running as before. Will update when my new ssd arrives in a couple of days.
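A clean window like this can also be checked from the syslog rather than by eye. The sketch below reports how many hours have passed since the last matching error line. The error patterns come only from the messages quoted in this thread, the function name `hours_since_last_error` is made up, and because the timestamps carry no year it has to be passed in by the caller.

```python
import re
from datetime import datetime

# Leading syslog timestamp such as "Oct 1 13:59:50" (no year in the log).
TIMESTAMP_RE = re.compile(r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")

# Error messages seen in this thread; other systems may log different text.
ERROR_RE = re.compile(
    r"kernel: (?:sas: Enter sas_scsi_recover_host"
    r"|ata\d+(?:\.\d+)?: (?:COMRESET failed|reset failed|disabled))"
)


def hours_since_last_error(lines, now, year):
    """Return hours between `now` and the newest error line, or None if clean."""
    last_error = None
    for line in lines:
        if not ERROR_RE.search(line):
            continue
        stamp = TIMESTAMP_RE.match(line)
        if not stamp:
            continue
        # %b expects English month abbreviations under the default C locale.
        when = datetime.strptime(
            f"{year} {stamp.group(1)}", "%Y %b %d %H:%M:%S"
        )
        if last_error is None or when > last_error:
            last_error = when
    if last_error is None:
        return None
    return (now - last_error).total_seconds() / 3600


log = [
    "Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1",
]
print(hours_since_last_error(log, datetime(2020, 10, 4, 13, 59, 50), 2020))
```

A return value of `None` means no matching line was found at all, which is different from a long time having passed since the last one.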
