[SOLVED] cache drive errors - drive, cable or docker image?



Cache device dropped offline:

 

Aug 19 03:47:16 Tower kernel: ata5: hard resetting link
Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)
Aug 19 03:47:21 Tower kernel: ata5: reset failed, giving up
Aug 19 03:47:21 Tower kernel: ata5.00: disabled
Aug 19 03:47:21 Tower kernel: ata5: EH complete

 

With SSDs this is usually a connection/cable problem.
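If you want to check your own syslog for the same pattern, here is a minimal Python sketch. It groups libata kernel messages by port and flags any port that ended up `disabled`, as `ata5` did above. The regex is only an assumption based on the five lines quoted in this post, and the function names (`ata_events`, `dropped_ports`) are made up for illustration.

```python
import re
from collections import defaultdict

# Matches kernel lines such as
# "Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)"
# and "... kernel: ata5.00: disabled" (the ".00" device suffix is optional).
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def ata_events(lines):
    """Group ata messages by port name, keeping them in log order."""
    events = defaultdict(list)
    for line in lines:
        match = ATA_LINE.search(line)
        if match:
            events[match.group(1)].append(match.group(2).strip())
    return dict(events)


def dropped_ports(events):
    """Return ports whose device was marked 'disabled' by the kernel."""
    return sorted(
        port for port, msgs in events.items()
        if any(msg == "disabled" for msg in msgs)
    )


# The excerpt from this thread, used as sample input.
SAMPLE = [
    "Aug 19 03:47:16 Tower kernel: ata5: hard resetting link",
    "Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)",
    "Aug 19 03:47:21 Tower kernel: ata5: reset failed, giving up",
    "Aug 19 03:47:21 Tower kernel: ata5.00: disabled",
    "Aug 19 03:47:21 Tower kernel: ata5: EH complete",
]

if __name__ == "__main__":
    events = ata_events(SAMPLE)
    for port in dropped_ports(events):
        print(port, "was dropped after:", events[port])
```

To run it against a real log, pass the lines of the syslog file instead of `SAMPLE`. A port that shows resets but never reaches `disabled` is not reported by `dropped_ports`.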

Link to comment
1 hour ago, akshunj said:

rebuild the docker image

Here is how that goes in case you need it.

 

Settings - Docker, disable, delete. Enable again to recreate docker.img.

 

If you had any custom docker network such as proxynet you will have to recreate it.

 

Then be sure to use the Previous Apps feature on the Apps page, which will reinstall your dockers just as they were.
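If you have more than one custom network to put back after recreating docker.img, the step can be scripted. This is only a sketch, assuming the `docker` CLI is on the PATH and that a plain `docker network create <name>` with default options is enough for your setup. The `proxynet` name comes from the post above; the helper names are hypothetical.

```python
import subprocess


def networks_to_recreate(existing, wanted):
    """Return the wanted network names that are not present yet."""
    return [name for name in wanted if name not in existing]


def create_command(name):
    """Build the docker CLI call for a user-defined network (default driver)."""
    return ["docker", "network", "create", name]


def list_networks():
    """Ask docker for the names of the current networks."""
    result = subprocess.run(
        ["docker", "network", "ls", "--format", "{{.Name}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def recreate(wanted, dry_run=True):
    """Print (dry_run=True) or run the create commands for missing networks."""
    for name in networks_to_recreate(list_networks(), wanted):
        cmd = create_command(name)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumption: proxynet is the only custom network you had.
    recreate(["proxynet"], dry_run=True)
```

With `dry_run=True` it only prints the commands, so you can look them over before anything is created.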

Link to comment

Thanks so much.

 

The more I investigate, the more I think this is a hardware issue or kernel bug. That ATA error is from the kernel. I know my way around linux, but I'm still an unraid noob. After some googling, this particular kernel error seems to be related to SSDs specifically. It's driving me nuts. Memtest was fine. Cables have been switched out. I had the drives on a pci-e sata card before this; I removed the card and attached them to the mainboard sata ports, thinking the pci-e card was the issue. The hard drives seem fine, so I am left with two new-ish SSDs that throw this kernel error at random times after boot. And both drives pass SMART tests. I am flummoxed.

 

I am going to move everything back to the array, pull the SSDs from the sata connection, and try to use them as cache from a usb sata dock. I know throughput will be choked, but I am not sure how much of a real-world performance hit that will be.

Link to comment

So, after changing power cables, sata cables, sata ports, power supply, UPS, and testing ram, I could only conclude that this linux kernel + these ssd drives (Crucial sata) do not play well together.  I have switched the ssd drives to a usb3 sata dock and no issues so far.  At some point, I will upgrade the ssd drives and try again via sata, but that time is not now.

 

Thinking out loud, I know unraid uses the older LTS linux kernel. But I have no idea how bug fixes and security updates to that kernel are handled. I hooked the SSDs up to my home theater plex box via sata, and not a single error (different mobo, so obviously not apples to apples). It runs the latest version of linux mint with kernel 5.4. I wonder if some of the ssd issues that many experience on this forum are related to the older kernel version used by unraid.

Link to comment

update: getting the ssd drives off sata and onto usb3 has fixed the issue. Usb3 is NOT a great alternative at all, but everything is stable and cache performance is... fine. I just bought a used xeon server and intend to migrate my drives over to it and upgrade to unraid beta. I am curious to see if my SSDs will fare better on the new mobo and upgraded kernel.

 

Sidenote: Digging into unraid logs, I noticed a few Slackware references. I learned linux on Slackware 20 years ago. I think it's cool as hell that I'm using it today via unraid.

Link to comment
  • 1 month later...

So, after migrating my unraid install to the Xeon server, and plugging the ssd cache drives into the sata bays, the same error has returned with a vengeance.  I have attached my logs in case anyone wants to take a peek.  The fun happens on 9/30 around 1530 or so, if memory serves.  I realize this is not particularly an unraid issue, as it really has nothing to do with the array, the software or anything other than how the kernel interacts with the controller and the device.

 

I took the array offline, and ran an extended smart test on the drive that got kicked off (it passed).  I pulled the drive last night to see if the remaining ssd would generate similar kernel errors on its own.  Shortly after I restarted the array, I got this:

 

Sep 30 21:31:36 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: cmd error handler
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata9: end_device-1:2: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata10: end_device-1:3: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

 

Drive remained alive and connected, and it's fine right now.  Here's how I am thinking about this issue:

 

1. I am going to continue using the single ssd for now; maybe the drive that got kicked really was bad.  However, if it was, I wonder why it worked fine in the sata dock.

2. These were twin drives... same make and model.  If there's a kernel issue with how this drive talks to the controller, maybe a new drive of a different make & model will fix the issue. Already ordered one.

3. If steps 1 & 2 don't work, I will upgrade to the latest unraid beta and try the new kernel.

 

That's every option I can think of, at this point.  I have replaced cables (power & sata), moved them from one sata controller to another, and the only way I got them to behave was to use a sata to usb3 dock.  I am open to any other diagnostic suggestions.  I am going to hit the slackware subreddit, and see if there may be some insights from those guys.  Thanks all.

tchalla-diagnostics-20200930-1535.zip tchalla-smart-20200930-1644.zip

Link to comment
16 minutes ago, JorgeB said:

The Intel SAS controller sometimes has some issues with Linux, unlike the Intel SATA ports, which are rock solid. If you can, try using an LSI HBA instead, for example.

This system was my backup server for a month with zero issues.  When I swapped in the drives from my main server, this began.

Link to comment

upgraded to beta and was immediately greeted by this:

Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata8: end_device-7:1: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata9: end_device-7:2: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
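For anyone comparing their own logs, a rough Python sketch of how these `sas:` lines can be tallied per port. It assumes the port that gets a `cmd error handler` line is the one whose command actually failed, and that ports showing only `dev error handler` were merely included in the same recovery pass. That reading is my assumption from these excerpts, not something confirmed in this thread. The regex and function names are illustrative only.

```python
import re
from collections import defaultdict

# Matches lines such as
# "Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler"
SAS_LINE = re.compile(
    r"kernel: sas: (ata\d+): (end_device-[\d:]+): (cmd|dev) error handler"
)


def handlers_by_port(lines):
    """Map each ata port to the set of handler kinds ('cmd', 'dev') seen."""
    seen = defaultdict(set)
    for line in lines:
        match = SAS_LINE.search(line)
        if match:
            port, _device, kind = match.groups()
            seen[port].add(kind)
    return dict(seen)


def ports_with_failed_command(seen):
    """Ports that logged a 'cmd error handler' line (assumed instigator)."""
    return sorted(port for port, kinds in seen.items() if "cmd" in kinds)


# The Oct 1 excerpt from this post, used as sample input.
SAMPLE = [
    "Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1",
    "Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler",
    "Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: dev error handler",
    "Oct 1 13:59:50 TChalla kernel: sas: ata8: end_device-7:1: dev error handler",
    "Oct 1 13:59:50 TChalla kernel: sas: ata9: end_device-7:2: dev error handler",
    "Oct 1 13:59:50 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1",
]

if __name__ == "__main__":
    seen = handlers_by_port(SAMPLE)
    print("all ports in recovery:", sorted(seen))
    print("failed command on:", ports_with_failed_command(seen))
```

On this excerpt only `ata7` has a `cmd` line, while `ata8` and `ata9` appear with `dev` lines only.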

 

I am going to remove the cache drive, and restore my appdata and dockers onto the array.  This issue has never taken out my data drives before.  I just need to reassure myself that this is still an issue with the cache drive.

Link to comment

Removed both ssd cache drives.  Rebooted and restarted array.  Logs are clean.  No mention of the ata kernel error.  Yet.  Going to recreate my docker image and bring the dockers back up this eve.  New ssd arrives next week.  Will install as cache when it arrives and give it another go.

 

It looks like the controller passed the kernel incorrect state info about multiple drives, although it was instigated by the ssd drive.  I am going to assume @JorgeB is correct that certain drive models are no good for certain controllers (at least the ones in my signature).  I find this odd as one controller is for a server, and the other is for a desktop.  It's also weird that the drives worked in my dock, but not attached via internal sata.

 

I looked up my purchase on ebay.  These were *refurbished* drives.  This was my mistake; I could have sworn they were new.  I am hoping this is the root of my ACTUAL problem. Regardless, I will be using the warranty.  The new drive I ordered is new, so maybe that plus the new make/model will work out.

 

Still going to monitor the array for ata errors.  The crash a few hours ago was *rough*.  

Link to comment
  • JorgeB changed the title to [SOLVED] cache drive errors - drive, cable or docker image?
