akshunj Posted August 19, 2020

Hello all. Every few days my dockers drop out and I get cache drive errors (I have been rebooting to recover). After reading this forum, I see it could be docker image corruption (I had a few dirty shutdowns in the past), a bad cable, or a bad cache drive. Can I get an assist with diagnosing? Logs are attached: tower-diagnostics-20200819-0807.zip
JorgeB Posted August 19, 2020

Cache device dropped offline:

Aug 19 03:47:16 Tower kernel: ata5: hard resetting link
Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)
Aug 19 03:47:21 Tower kernel: ata5: reset failed, giving up
Aug 19 03:47:21 Tower kernel: ata5.00: disabled
Aug 19 03:47:21 Tower kernel: ata5: EH complete

With SSDs this is usually a connection/cable problem.
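A quick way to summarize events like these from a syslog capture (a minimal sketch; the grep pattern simply lists the failure strings quoted above, so it may miss other error forms):

```shell
# Minimal sketch: count ATA link-failure events per port in a syslog.
# Reads log text on stdin; the pattern list is just the failure strings
# quoted in this post, so extend it if your log shows other errors.
ata_failures() {
    grep -oE 'ata[0-9]+(\.[0-9]+)?: (hard resetting link|COMRESET failed|reset failed|disabled)' \
        | sort | uniq -c | sort -rn
}
```

For example, `ata_failures < syslog` run against the syslog file inside the diagnostics zip will show at a glance which ATA port is dropping.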
akshunj Posted August 19, 2020

Thank you. I will give this a shot first and report back here. Should I try rebuilding the docker image as well, since it seems susceptible to corruption?
JorgeB Posted August 19, 2020

2 minutes ago, akshunj said: Should I try rebuilding the docker image as well, since it seems susceptible to corruption?

It might not need rebuilding, but it won't hurt.
akshunj Posted August 19, 2020

Changed out SATA and power cables. Will rebuild the docker image next if issues continue. Both SSDs in the cache pool are new-ish, so I am hoping it's not a drive issue.
akshunj Posted August 20, 2020

So, the same error just happened, this time affecting both cache drives in the pool. I ran extended SMART tests on both drives and they passed. Moving on to the docker rebuild. New diagnostics attached in case anyone wants to take a stab: tower-diagnostics-20200820-1649.zip
akshunj Posted August 20, 2020

I am going to run a memtest to check the RAM first, then rebuild the docker image. There are so many posts about this type of issue that I am not sure how to proceed.
trurl Posted August 20, 2020

1 hour ago, akshunj said: rebuild the docker image

Here is how that goes, in case you need it: Settings → Docker, disable, delete, then enable again to recreate docker.img. If you had any custom docker network such as proxynet, you will have to recreate it. Then be sure to use the Previous Apps feature on the Apps page, which will reinstall your dockers just as they were.
akshunj Posted August 20, 2020

Thanks so much. The more I investigate, the more I think this is a hardware issue or kernel bug; that ATA error comes from the kernel. I know my way around Linux, but I'm still an Unraid noob. After some googling, this particular kernel error seems related to SSDs specifically. It's driving me nuts.

Memtest was fine. Cables have been switched out. The drives were on a PCIe SATA card before this; I removed the card and attached them to the mainboard SATA ports, thinking the PCIe card was the issue. The hard drives seem fine, so I am left with two new-ish SSDs that throw this kernel error at random times after boot, and both drives pass SMART tests. I am flummoxed.

I am going to move everything back to the array, pull the SSDs from the SATA connection, and try to use them as cache from a USB SATA dock. I know throughput will be choked, but I am not sure how much that will translate into a real-world performance hit.
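For reference, this is roughly how I run the SMART checks from the command line (a sketch; /dev/sdX is a placeholder, and the pass string is smartctl's usual "Completed without error" wording):

```shell
# Sketch: run an extended SMART self-test and verify the result.
# /dev/sdX is a placeholder; identify the real device first.
#   smartctl -t long /dev/sdX        # start the extended (long) self-test
#   smartctl -l selftest /dev/sdX    # print the self-test log when done
#
# Helper: given "smartctl -l selftest" output on stdin, succeed only if
# the most recent test (row "# 1") completed without error.
selftest_passed() {
    grep -m1 '^# 1' | grep -q 'Completed without error'
}
```

So `smartctl -l selftest /dev/sdX | selftest_passed && echo PASS` gives a scriptable pass/fail, which is handy when checking several drives in a row.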
trurl Posted August 20, 2020

6 minutes ago, akshunj said: try to use them as cache from a usb sata dock.

USB connections are not reliable enough for the permanent connections required in the array or pool.
akshunj Posted August 21, 2020

I hear that. I am going to back up appdata regularly and give it a spin. I'm not wild about buying new SSDs.
akshunj Posted August 22, 2020

So, after changing power cables, SATA cables, SATA ports, the power supply, and the UPS, and testing the RAM, I can only conclude that this Linux kernel and these SSDs (Crucial SATA) do not play well together. I have switched the SSDs to a USB 3 SATA dock, and no issues so far. At some point I will upgrade the SSDs and try again via SATA, but that time is not now.

Thinking out loud: I know Unraid uses the older LTS Linux kernel, but I have no idea how bug fixes and security updates to that kernel are handled. I hooked the SSDs up to my home theater Plex box via SATA, and not a single error (different mobo, so obviously not apples to apples). It runs the latest version of Linux Mint with kernel 5.4. I wonder if some of the SSD issues that many experience on this forum are related to the older kernel version used by Unraid.
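For what it's worth, a quick sketch of how one can compare the running kernel against a reference version (the 5.4 reference is just Mint's kernel mentioned above; `sort -V` does the version ordering):

```shell
# Sketch: succeed if kernel version $1 sorts strictly before version $2.
# Uses sort -V (GNU coreutils) for natural version ordering, so suffixes
# like "-Unraid" are handled sensibly.
kernel_older() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}
```

For example, `kernel_older "$(uname -r)" 5.4 && echo "older than the Mint box"` makes the comparison explicit instead of eyeballing version strings.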
trurl Posted August 22, 2020

25 minutes ago, akshunj said: I know unraid uses the older LTS linux kernel

The latest beta is on kernel 5.7.
akshunj Posted August 23, 2020

4 hours ago, trurl said: Latest beta is kernel 5.7

Hmmm, I think I will check it out. How stable are the Unraid beta releases?
trurl Posted August 23, 2020

Many are already using the latest beta. In addition to the new kernel, there are some very interesting new features. I seldom hesitate to run betas because I have a good (enough) backup plan.
akshunj Posted August 31, 2020

Update: getting the SSDs off SATA and onto USB 3 has fixed the issue. USB 3 is NOT a great alternative at all, but everything is stable and cache performance is... fine. I just bought a used Xeon server and intend to migrate my drives over and upgrade to the Unraid beta. I am curious to see if my SSDs will fare better on the new mobo and upgraded kernel.

Sidenote: digging into the Unraid logs, I noticed a few Slackware references. I learned Linux on Slackware 20 years ago. I think it's cool as hell that I'm using it today via Unraid.
akshunj Posted October 1, 2020

So, after migrating my Unraid install to the Xeon server and plugging the SSD cache drives into the SATA bays, the same error has returned with a vengeance. I have attached my logs in case anyone wants to take a peek. The fun happens on 9/30 around 15:30 or so, if memory serves. I realize this is not particularly an Unraid issue, as it has nothing to do with the array or the software, only with how the kernel interacts with the controller and the device.

I took the array offline and ran an extended SMART test on the drive that got kicked off (it passed). I pulled that drive last night to see if the remaining SSD would generate similar kernel errors on its own. Shortly after I restarted the array, I got this:

Sep 30 21:31:36 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: cmd error handler
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata9: end_device-1:2: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata10: end_device-1:3: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

The drive remained alive and connected, and it's fine right now. Here's how I am thinking about this issue:

1. I am going to continue using the single SSD for now; maybe the drive that got kicked really was bad. If it was, though, I wonder why it worked fine in the SATA dock.
2. These were twin drives, same make and model. If there's a kernel issue with how this drive talks to the controller, maybe a new drive of a different make and model will fix the issue. Already ordered one.
3. If steps 1 and 2 don't work, I will upgrade to the latest Unraid beta and try the new kernel.

That's every option I can think of at this point.
I have replaced cables (power and SATA) and moved the drives from one SATA controller to another, and the only way I got them to behave was a SATA-to-USB 3 dock. I am open to any other diagnostic suggestions. I am going to hit the Slackware subreddit and see if those guys have some insights. Thanks all. tchalla-diagnostics-20200930-1535.zip tchalla-smart-20200930-1644.zip
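To quantify how often SAS error recovery is actually failing rather than eyeballing the log, a small sketch that tallies the "failed: N" counts from the recovery-exit lines quoted above (pattern is based only on those lines, so treat it as an assumption about the log format):

```shell
# Sketch: sum the failed-command counts from SAS error-recovery exits.
# Reads syslog text on stdin; keys off the "Exit sas_scsi_recover_host"
# lines shown in this thread, ignoring the matching "Enter" lines.
sas_recovery_failed() {
    grep 'Exit sas_scsi_recover_host' \
        | grep -oE 'failed: [0-9]+' \
        | awk '{ sum += $2 } END { print sum + 0 }'
}
```

Running `sas_recovery_failed < syslog` before and after a hardware change gives a single number to compare instead of scrolling through repeated handler dumps.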
civic95man Posted October 1, 2020

2 hours ago, akshunj said: I am open to any other diagnostic suggestions

Have you tried the beta to see if the newer kernel fixes things?
akshunj Posted October 1, 2020

Going to try that now. The server just crashed in spectacular fashion: first the SSD, then two more data drives. It was pretty unbelievable. Attached diagnostics from the latest fiasco: tchalla-diagnostics-20201001-1323.zip
JorgeB Posted October 1, 2020

The Intel SAS controller sometimes has issues with Linux, unlike the Intel SATA ports, which are rock solid. If you can, try using an LSI HBA instead, for example.
akshunj Posted October 1, 2020

16 minutes ago, JorgeB said: The Intel SAS controller sometimes has some issues with Linux, unlike the Intel SATA ports which are rock solid, if you can try for example using an LSI HBA instead.

This system was my backup server for a month with zero issues. When I swapped in the drives from my main server, this began.
akshunj Posted October 1, 2020

Upgraded to the beta and was immediately greeted by this:

Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata8: end_device-7:1: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata9: end_device-7:2: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

I am going to remove the cache drive and restore my appdata and dockers onto the array. This issue has never taken out my data drives before; I just need to reassure myself that this is still an issue with the cache drive.
JorgeB Posted October 1, 2020

6 minutes ago, akshunj said: When I swapped in the drives from my main server, this began.

It's possible that some drive models have compatibility issues with the controller.
akshunj Posted October 1, 2020

Removed both SSD cache drives, rebooted, and restarted the array. Logs are clean, with no mention of the ATA kernel error. Yet. Going to recreate my docker image and bring the dockers back up this eve. The new SSD arrives next week; I will install it as cache when it arrives and give it another go.

It looks like the controller passed the kernel incorrect state info about multiple drives, although it was instigated by the SSD. I am going to assume @JorgeB is correct that certain drive models are no good for certain controllers (at least the ones in my signature). I find this odd, as one controller is for a server and the other is for a desktop. It's also weird that the drives worked in my dock but not attached via internal SATA.

I looked up my purchase on eBay. These were *refurbished* drives. This was my mistake; I could have sworn they were new. I am hoping this is the root of my ACTUAL problem. Regardless, I will be using the warranty. The drive I just ordered is new, so maybe that plus the new make/model will work out. Still going to monitor the array for ATA errors. The crash a few hours ago was *rough*.
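Since I'll be monitoring the logs anyway, a minimal watch sketch (assumptions: the live log is /var/log/syslog on Unraid, and the pattern only covers the error signatures seen in this thread):

```shell
# Sketch: filter a log stream for the ATA/SAS error signatures from this
# thread. --line-buffered keeps output flowing when used with tail -f.
ata_watch() {
    grep --line-buffered -iE 'ata[0-9]+.*(resetting link|COMRESET|disabled)|sas_scsi_recover_host'
}
# Usage: tail -f /var/log/syslog | ata_watch
```

That way the errors show up the moment a drive starts acting out, instead of being discovered after a crash.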
akshunj Posted October 4, 2020

Update: the issue is definitely related to SSD and controller interaction. 72 hours running without the SSDs and no ATA errors. All dockers running as before. Will update when my new SSD arrives in a couple of days.