Kewjoe Posted October 12, 2013

Found a drive red-balled on my server. The weird thing is, it wasn't a drive with write errors. My parity drive and disk 4 had many write errors, but disk 6 has the red ball. I changed the cable on disk 6 and retrieved the SMART report; no issues or indicators showing anything wrong. Ran the short self-test, no issues. I'm currently running the long test. I also pulled the SMART reports on the parity drive and disk 4; they came back clean too. I haven't run short and long tests on them yet. Sorry for the novel, just wanted to check whether I'm on the right track; any advice is appreciated.

Edit: Added syslogs from yesterday and today, plus 3 SMART reports.

Version: 5.0-rc16c

Information about the drives:

sdb - data drive with many write errors, but green ball
sdc - data drive with no write errors, but red ball (I've changed SATA cables on this drive so far)
sdd - parity drive with many write errors, but green ball

Don't mind the temps in the SMART reports; they are always in the low 30s. I had my case open and all fans disconnected when I ran the tests (with a big room fan pointed at the server in the meantime).

logs.zip smartsdb.txt smartsdc.txt smartsdd.txt
dgaschk Posted October 12, 2013

unRAID takes a disk out of service (red-balls it) when a write to the disk fails. Why do you think the other drives have write errors? Post SMART reports and a zipped syslog for advice.
Kewjoe Posted October 12, 2013 (Author)

Done, thanks!
Kewjoe Posted October 12, 2013 (Author)

I've found a common theme after tearing down the server: all 3 drives are connected to a Rosewill RC-218 PCI-E SATA controller, and they are the only 3 drives connected to it. Seeing as the SMART reports came back good for all 3 drives, it's looking very likely that the culprit is the card. I don't have a spare right now, but I can re-seat the card and re-seat the cables. Should the red-balled drive go back to green if that fixes it, or would I have to force it somehow to go green? I've been reading the wiki, but I'm not clear on that point.
garycase Posted October 12, 2013

Once a disk is red-balled, you need to replace it ... but you can force unRAID to rebuild the disk "on itself" (i.e. using the same drive):

1) Note which slot the red-balled disk is in.
2) Stop the array and unassign that disk.
3) Start the array -- it will now show a "missing" disk.
4) Stop the array and assign the disk to that slot again.
5) Start the array and it will rebuild the disk.

Note that this only works if you have good parity. If that's not the case (i.e. you had parity errors but didn't do a correcting check to fix them), then the integrity of the data in the rebuild is compromised, so you should check all of the files against your backups.

If you don't have good parity, but think the disk is okay, you can simply do a "New Config" and then let a new parity sync rebuild your parity drive.
Kewjoe Posted October 12, 2013 (Author)

Thanks for the reply. What if I wanted to move these 3 drives to a new controller? I also have a Supermicro AOC-SASLP-MV8 with 4 ports open, so I can migrate them there. What would be the steps to do it without any chance of losing my (hopefully) valid parity?
garycase Posted October 12, 2013

Just move them -- v5 keeps track of the drives by serial number, so it doesn't matter which controller they're on.
Kewjoe Posted October 12, 2013 (Author)

Awesome! Just to be clear, the steps are:

1) Move the drives to the other controller.
2) Follow your steps to rebuild the red-balled drive.
3) Verify the data against my backup in case any of the parity was damaged by the write errors.

Sounds about right?
garycase Posted October 12, 2013

Yes, that's correct. If you're certain your parity is good, you can skip #3, although it's always a good idea to validate the data if there have been write errors at any point.
Kewjoe Posted October 13, 2013 (Author)

Last question before I take the plunge. The evidence I have is:

1) The controller appears to have been the culprit; I've since moved the 3 drives to my other controller.
2) Parity and disk 4 had errors at some point this week. I run monthly parity checks and the errors could have been from then; I'm not sure at this point.
3) Disk 6 was red-balled with no errors.
4) All 3 drives passed the short and long SMART tests.

Based on this, I feel more inclined to start a new config and rebuild parity. Is that the right move here? I'm not sure there is enough information to make a 100% correct decision, but based on what I have, which would you lean towards: rebuilding parity, or trying to rebuild disk 6? Thanks!!
garycase Posted October 13, 2013

From what you've outlined, I'd also lean towards just doing a new config. HOWEVER ... BEFORE doing that, I'd shut down; physically remove the red-balled disk (disk 6 ... be sure you know the serial # and take out the correct disk); connect it to a Windows PC (either internally or via a USB bridge); and then use the free LinuxReader to read all of the files. If you have space to copy them all, do that. If not, just copy a group at a time to a folder and periodically delete the contents of the folder ... the idea is simply to confirm that ALL of the files are readable (not that they're error-free). [ http://www.diskinternals.com/linux-reader/ ]
garycase Posted October 13, 2013

You could also confirm that all the files are readable by logging into the server and using the Linux cp command to copy all of the files to /dev/null (nowhere).
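A minimal sketch of that check, assuming the disk is mounted (the mount point and the helper's name here are illustrative, not unRAID commands):

```shell
verify_readable() {
    # Force a full read of every file under the given mount point.
    # cp discards the data into /dev/null; any unreadable file will
    # make cp print an error message to stderr.
    find "$1" -type f -exec cp {} /dev/null \;
}

# On the server this might be: verify_readable /mnt/disk6
```

If it finishes with nothing printed to stderr, every file on the disk was read end to end.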
Kewjoe Posted October 13, 2013 Author Share Posted October 13, 2013 If i do the copy to /dev/null the desired outcome is that it completes successfully. A bad drive would fail at some point, right? Thanks again for all the help! Link to comment
garycase Posted October 13, 2013

Yes, if it couldn't read a file it would give you an error. But as I noted earlier, this just confirms that all the files are "good" from a readability standpoint -- not that they haven't at some point been corrupted. Regardless of what you do -- rebuild or new config -- it's a good idea to compare the files with your backups to confirm all is okay.
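A sketch of that backup comparison, assuming both the disk and the backup are mounted somewhere (the paths and function name are made up for illustration). Checksums catch silent corruption that a plain readability check would miss:

```shell
compare_trees() {
    # Checksum every file in two directory trees and diff the sorted
    # results; any mismatch or missing file shows up in the diff output.
    (cd "$1" && find . -type f -exec md5sum {} \; | sort) > /tmp/live.md5
    (cd "$2" && find . -type f -exec md5sum {} \; | sort) > /tmp/backup.md5
    diff /tmp/live.md5 /tmp/backup.md5
}

# e.g. compare_trees /mnt/disk6 /mnt/backup/disk6
```

Empty output means every file matched its backup copy byte for byte.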
Kewjoe Posted October 13, 2013 (Author)

Nothing is currently mounted; should I mount disk 6 to try the copy to /dev/null? I assume yes.
garycase Posted October 13, 2013

As I noted, the purpose of doing something that forces every file to be read is simply to confirm the disk is completely readable BEFORE you destroy the ability to rebuild it (by doing a New Config). So yes, that's one way to check the readability.
dgaschk Posted October 13, 2013

The syslogs indicate that all drives are working. Any drive that has suffered a write error will show a red ball. Why do you believe that other drives have write errors? Based on the syslogs, the server is working correctly. I do not have a clear understanding of the state of your system and thus can offer no advice.
Kewjoe Posted October 13, 2013 (Author)

Hi dgaschk, it looks like one of my SATA controllers went bad. The 3 drives in question were connected to it, and nothing else was. I've since moved the 3 drives to my 8-port controller and followed garycase's advice. I ended up verifying the red-balled drive in my Windows machine with LinuxReader: I copied a huge portion of the data over with no issues, and did some spot checks with no problems found. I've created a new config and I'm at 60% of the parity sync.

I'll have to try to verify whether the controller is actually bad, but since I don't have an immediate need for it -- my 10 drives are spread over the 4 onboard ports and 8 on the other controller -- I'll likely leave it out. When I need to expand, I'll buy another AOC-SASLP-MV8. Thanks for all the help, garycase and dgaschk! I'll change the subject to "Solved" when the parity sync is finished and I've verified that I'm back up and running.
Kewjoe Posted October 13, 2013 (Author)

I just realized I was calling them "write errors" when I should have just said "errors". I was going by what I saw on the dashboard: my parity drive and disk 4 showed a lot of errors, but disk 6 was red-balled with 0 errors. Unfortunately, I did not get the syslog from that event, because when I tried troubleshooting everything started locking up. The GUI went down, and even commands I was running on the console were freezing; I had 4 sessions open trying to manually take the array offline for a clean shutdown. Ultimately it did not shut down gracefully. I should have manually extracted the syslog before attempting anything. My mistake. The syslogs I attached were from after the fact. I'll monitor things closely once I'm back up and running; I've been slacking on paying attention lately. It's been working so well the last 6+ months, I got lazy.
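For anyone hitting the same situation, a sketch of capturing the syslog to persistent storage before a risky shutdown. The default paths assume unRAID's usual layout (syslog at /var/log/syslog, flash drive at /boot), and the helper name is made up:

```shell
save_syslog() {
    # Copy the current syslog to persistent storage with a timestamped
    # name, so the evidence survives an ungraceful shutdown. unRAID's
    # /var/log lives in RAM and is lost on reboot.
    src=${1:-/var/log/syslog}
    dest=${2:-/boot}
    cp "$src" "$dest/syslog-$(date +%Y%m%d-%H%M%S).txt"
}

# e.g. save_syslog    # uses the defaults above
```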
garycase Posted October 13, 2013

I meant to comment on that very early in the thread but forgot to say anything -- I knew what you meant. The errors column indicates an error reported by the drive on a read; the data is actually corrected by unRAID in those cases. If a write error occurs, the drive will be red-balled.
Kewjoe Posted October 13, 2013 (Author)

Parity is rebuilt. So far everything is working on both drives; looks like I'm back up and running. I'll keep a close eye on things. Thanks again for all the help, garycase and dgaschk!!!
This topic is now archived and is closed to further replies.