akshunj Posted August 19, 2020

Hello all. Every few days my dockers drop out and I get cache drive errors (I have been rebooting to recover). After reading this forum, I see it could be docker image corruption (I had a few dirty shutdowns in the past), a bad cable, or a bad cache drive. Can I get an assist with diagnosing? Logs are attached: tower-diagnostics-20200819-0807.zip
JorgeB Posted August 19, 2020

Cache device dropped offline:

Aug 19 03:47:16 Tower kernel: ata5: hard resetting link
Aug 19 03:47:21 Tower kernel: ata5: COMRESET failed (errno=-16)
Aug 19 03:47:21 Tower kernel: ata5: reset failed, giving up
Aug 19 03:47:21 Tower kernel: ata5.00: disabled
Aug 19 03:47:21 Tower kernel: ata5: EH complete

With SSDs this is usually a connection/cable problem.
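A quick way to summarize events like these from a syslog capture (a minimal sketch; the grep pattern simply lists the failure strings quoted above, so it may miss other error forms):

```shell
# Minimal sketch: count ATA link-failure events per port in a syslog.
# Reads log text on stdin; the pattern list is just the failure strings
# quoted in this post, so extend it if your log shows other errors.
ata_failures() {
    grep -oE 'ata[0-9]+(\.[0-9]+)?: (hard resetting link|COMRESET failed|reset failed|disabled)' \
        | sort | uniq -c | sort -rn
}
```

For example, `ata_failures < syslog` run against the syslog file inside the diagnostics zip will show at a glance which ATA port is dropping.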
akshunj Posted August 19, 2020

Thank you. I will give this a shot first and report back here. Should I try rebuilding the docker image as well, since it seems susceptible to corruption?
JorgeB Posted August 19, 2020

2 minutes ago, akshunj said: Should I try rebuilding the docker image as well, since it seems susceptible to corruption?

It might not need rebuilding, but it won't hurt.
akshunj Posted August 19, 2020

Changed out SATA and power cables. Will rebuild the docker image next if issues continue. Both SSDs in the cache pool are new-ish, so I am hoping it's not a drive issue.
akshunj Posted August 20, 2020

So, the same error just happened, this time affecting both cache drives in the pool. I ran extended SMART tests on both drives and they passed. Moving on to the docker rebuild. New diagnostics attached in case anyone wants to take a stab: tower-diagnostics-20200820-1649.zip
akshunj Posted August 20, 2020

I am going to run a memtest to check the RAM first, then rebuild the docker image. There are so many posts about this type of issue that I am not sure how to proceed.
trurl Posted August 20, 2020

1 hour ago, akshunj said: rebuild the docker image

Here is how that goes, in case you need it: Settings → Docker, disable, delete, then enable again to recreate docker.img. If you had any custom docker network such as proxynet, you will have to recreate it. Then be sure to use the Previous Apps feature on the Apps page, which will reinstall your dockers just as they were.
akshunj Posted August 20, 2020

Thanks so much. The more I investigate, the more I think this is a hardware issue or kernel bug; that ATA error comes from the kernel. I know my way around Linux, but I'm still an Unraid noob. After some googling, this particular kernel error seems related to SSDs specifically. It's driving me nuts.

Memtest was fine. Cables have been switched out. The drives were on a PCIe SATA card before this; I removed the card and attached them to the mainboard SATA ports, thinking the PCIe card was the issue. The hard drives seem fine, so I am left with two new-ish SSDs that throw this kernel error at random times after boot, and both drives pass SMART tests. I am flummoxed.

I am going to move everything back to the array, pull the SSDs from the SATA connection, and try to use them as cache from a USB SATA dock. I know throughput will be choked, but I am not sure how much that will translate into a real-world performance hit.
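For reference, this is roughly how I run the SMART checks from the command line (a sketch; /dev/sdX is a placeholder, and the pass string is smartctl's usual "Completed without error" wording):

```shell
# Sketch: run an extended SMART self-test and verify the result.
# /dev/sdX is a placeholder; identify the real device first.
#   smartctl -t long /dev/sdX        # start the extended (long) self-test
#   smartctl -l selftest /dev/sdX    # print the self-test log when done
#
# Helper: given "smartctl -l selftest" output on stdin, succeed only if
# the most recent test (row "# 1") completed without error.
selftest_passed() {
    grep -m1 '^# 1' | grep -q 'Completed without error'
}
```

So `smartctl -l selftest /dev/sdX | selftest_passed && echo PASS` gives a scriptable pass/fail, which is handy when checking several drives in a row.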
trurl Posted August 20, 2020

6 minutes ago, akshunj said: try to use them as cache from a usb sata dock.

USB connections are not reliable enough for the permanent connections required in the array or pool.
akshunj Posted August 21, 2020

I hear that. I am going to back up appdata regularly and give it a spin. I'm not wild about buying new SSDs.
akshunj Posted August 22, 2020

So, after changing power cables, SATA cables, SATA ports, the power supply, and the UPS, and testing the RAM, I can only conclude that this Linux kernel and these SSDs (Crucial SATA) do not play well together. I have switched the SSDs to a USB 3 SATA dock, and no issues so far. At some point I will upgrade the SSDs and try again via SATA, but that time is not now.

Thinking out loud: I know Unraid uses the older LTS Linux kernel, but I have no idea how bug fixes and security updates to that kernel are handled. I hooked the SSDs up to my home theater Plex box via SATA, and not a single error (different mobo, so obviously not apples to apples). It runs the latest version of Linux Mint with kernel 5.4. I wonder if some of the SSD issues that many experience on this forum are related to the older kernel version used by Unraid.
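For what it's worth, a quick sketch of how one can compare the running kernel against a reference version (the 5.4 reference is just Mint's kernel mentioned above; `sort -V` does the version ordering):

```shell
# Sketch: succeed if kernel version $1 sorts strictly before version $2.
# Uses sort -V (GNU coreutils) for natural version ordering, so suffixes
# like "-Unraid" are handled sensibly.
kernel_older() {
    [ "$1" != "$2" ] && \
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}
```

For example, `kernel_older "$(uname -r)" 5.4 && echo "older than the Mint box"` makes the comparison explicit instead of eyeballing version strings.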
trurl Posted August 22, 2020

25 minutes ago, akshunj said: I know unraid uses the older LTS linux kernel

The latest beta is on kernel 5.7.
akshunj Posted August 23, 2020

4 hours ago, trurl said: Latest beta is kernel 5.7

Hmmm, I think I will check it out. How stable are the Unraid beta releases?
trurl Posted August 23, 2020

Many are already using the latest beta. In addition to the new kernel, there are some very interesting new features. I seldom hesitate to run betas because I have a good (enough) backup plan.
akshunj Posted August 31, 2020

Update: getting the SSDs off SATA and onto USB 3 has fixed the issue. USB 3 is NOT a great alternative at all, but everything is stable and cache performance is... fine. I just bought a used Xeon server and intend to migrate my drives over and upgrade to the Unraid beta. I am curious to see if my SSDs will fare better on the new mobo and upgraded kernel.

Sidenote: digging into the Unraid logs, I noticed a few Slackware references. I learned Linux on Slackware 20 years ago. I think it's cool as hell that I'm using it today via Unraid.
akshunj Posted October 1, 2020

So, after migrating my Unraid install to the Xeon server and plugging the SSD cache drives into the SATA bays, the same error has returned with a vengeance. I have attached my logs in case anyone wants to take a peek. The fun happens on 9/30 around 15:30 or so, if memory serves. I realize this is not particularly an Unraid issue, as it has nothing to do with the array or the software, only with how the kernel interacts with the controller and the device.

I took the array offline and ran an extended SMART test on the drive that got kicked off (it passed). I pulled that drive last night to see if the remaining SSD would generate similar kernel errors on its own. Shortly after I restarted the array, I got this:

Sep 30 21:31:36 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: cmd error handler
Sep 30 21:31:36 TChalla kernel: sas: ata16: end_device-1:0: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata9: end_device-1:2: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: ata10: end_device-1:3: dev error handler
Sep 30 21:31:36 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

The drive remained alive and connected, and it's fine right now. Here's how I am thinking about this issue:

1. I am going to continue using the single SSD for now; maybe the drive that got kicked really was bad. If it was, though, I wonder why it worked fine in the SATA dock.
2. These were twin drives, same make and model. If there's a kernel issue with how this drive talks to the controller, maybe a new drive of a different make and model will fix the issue. Already ordered one.
3. If steps 1 and 2 don't work, I will upgrade to the latest Unraid beta and try the new kernel.

That's every option I can think of at this point.
I have replaced cables (power and SATA) and moved the drives from one SATA controller to another, and the only way I got them to behave was a SATA-to-USB 3 dock. I am open to any other diagnostic suggestions. I am going to hit the Slackware subreddit and see if those guys have some insights. Thanks all. tchalla-diagnostics-20200930-1535.zip tchalla-smart-20200930-1644.zip
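To quantify how often SAS error recovery is actually failing rather than eyeballing the log, a small sketch that tallies the "failed: N" counts from the recovery-exit lines quoted above (pattern is based only on those lines, so treat it as an assumption about the log format):

```shell
# Sketch: sum the failed-command counts from SAS error-recovery exits.
# Reads syslog text on stdin; keys off the "Exit sas_scsi_recover_host"
# lines shown in this thread, ignoring the matching "Enter" lines.
sas_recovery_failed() {
    grep 'Exit sas_scsi_recover_host' \
        | grep -oE 'failed: [0-9]+' \
        | awk '{ sum += $2 } END { print sum + 0 }'
}
```

Running `sas_recovery_failed < syslog` before and after a hardware change gives a single number to compare instead of scrolling through repeated handler dumps.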
civic95man Posted October 1, 2020

2 hours ago, akshunj said: I am open to any other diagnostic suggestions

Have you tried the beta to see if the newer kernel fixes things?
akshunj Posted October 1, 2020

Going to try that now. The server just crashed in spectacular fashion: first the SSD, then two more data drives. It was pretty unbelievable. Attached diagnostics from the latest fiasco: tchalla-diagnostics-20201001-1323.zip
JorgeB Posted October 1, 2020

The Intel SAS controller sometimes has issues with Linux, unlike the Intel SATA ports, which are rock solid. If you can, try using an LSI HBA instead, for example.
akshunj Posted October 1, 2020

16 minutes ago, JorgeB said: The Intel SAS controller sometimes has some issues with Linux, unlike the Intel SATA ports which are rock solid, if you can try for example using an LSI HBA instead.

This system was my backup server for a month with zero issues. When I swapped in the drives from my main server, this began.
akshunj Posted October 1, 2020

Upgraded to the beta and was immediately greeted by this:

Oct 1 13:59:50 TChalla kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: cmd error handler
Oct 1 13:59:50 TChalla kernel: sas: ata7: end_device-7:0: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata8: end_device-7:1: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: ata9: end_device-7:2: dev error handler
Oct 1 13:59:50 TChalla kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

I am going to remove the cache drive and restore my appdata and dockers onto the array. This issue has never taken out my data drives before; I just need to reassure myself that this is still an issue with the cache drive.
JorgeB Posted October 1, 2020

6 minutes ago, akshunj said: When I swapped in the drives from my main server, this began.

It's possible that some drive models have compatibility issues with the controller.
akshunj Posted October 1, 2020

Removed both SSD cache drives, rebooted, and restarted the array. Logs are clean, with no mention of the ATA kernel error. Yet. Going to recreate my docker image and bring the dockers back up this eve. The new SSD arrives next week; I will install it as cache when it arrives and give it another go.

It looks like the controller passed the kernel incorrect state info about multiple drives, although it was instigated by the SSD. I am going to assume @JorgeB is correct that certain drive models are no good for certain controllers (at least the ones in my signature). I find this odd, as one controller is for a server and the other is for a desktop. It's also weird that the drives worked in my dock but not attached via internal SATA.

I looked up my purchase on eBay. These were *refurbished* drives. This was my mistake; I could have sworn they were new. I am hoping this is the root of my ACTUAL problem. Regardless, I will be using the warranty. The drive I just ordered is new, so maybe that plus the new make/model will work out. Still going to monitor the array for ATA errors. The crash a few hours ago was *rough*.
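Since I'll be monitoring the logs anyway, a minimal watch sketch (assumptions: the live log is /var/log/syslog on Unraid, and the pattern only covers the error signatures seen in this thread):

```shell
# Sketch: filter a log stream for the ATA/SAS error signatures from this
# thread. --line-buffered keeps output flowing when used with tail -f.
ata_watch() {
    grep --line-buffered -iE 'ata[0-9]+.*(resetting link|COMRESET|disabled)|sas_scsi_recover_host'
}
# Usage: tail -f /var/log/syslog | ata_watch
```

That way the errors show up the moment a drive starts acting out, instead of being discovered after a crash.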
akshunj Posted October 4, 2020

Update: the issue is definitely related to SSD and controller interaction. 72 hours running without the SSDs and no ATA errors. All dockers running as before. Will update when my new SSD arrives in a couple of days.