[Solved] Drive is red-balled, not the drive with errors


Kewjoe

I found a drive red-balled on my server. The weird thing is, it wasn't the drive with write errors.

 

My parity drive and disk 4 had many write errors, but disk 6 has the red ball. I changed the cable on disk 6 and pulled the SMART report; there are no issues or indicators showing anything wrong. I ran the short self-test with no issues, and I'm currently running the long test.

 

I also pulled the SMART reports on the parity drive and disk 4; they came back clean as well. I haven't run the short and long tests on them yet.
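For anyone wanting to repeat the checks, the usual way from the unRAID console is something along these lines (/dev/sdc is just an example device name; substitute the drive in question):

  smartctl -a /dev/sdc           # full SMART report
  smartctl -t short /dev/sdc     # start the short self-test
  smartctl -t long /dev/sdc      # start the extended (long) self-test
  smartctl -l selftest /dev/sdc  # view the self-test results afterwards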

 

Sorry for the novel; I just wanted to check that I'm on the right track, and any advice is appreciated.

 

Edit: Added the syslogs from yesterday and today, plus the 3 SMART reports.

 

Version: 5.0 RC16c

 

Information about Drives:

 

sdb - data drive with many write errors, but green ball

sdc - data drive with no write errors, but red ball (I've changed the SATA cables on this drive so far)

sdd - parity drive with many write errors, but green ball

 

Don't mind the temps in the SMART reports; they're normally in the low 30s. I had my case open and all the fans disconnected when I ran the tests (with a big room fan pointed at the server in the meantime).

logs.zip

smartsdb.txt

smartsdc.txt

smartsdd.txt


I've found a common theme after tearing down the server: all 3 drives are connected to a Rosewill RC-218 PCI-E SATA controller, and they are the only drives on it. Seeing as the SMART reports came back good for all 3 drives, it's looking very likely that the culprit is the card.

 

I don't have a spare right now, but I can re-seat the card and re-seat the cables. Should the red-balled drive go back to green if that fixes it, or would I have to force it to go green somehow? I've been reading the wiki, but I'm not clear on that point.


Once the disk is red-balled, you need to replace it ... but you can force UnRAID to rebuild the disk "on itself" (i.e. using the same drive).

 

1) Note which slot the red-balled disk is in.

2) Stop the array and unassign that disk.

3) Start the array -- it will now show a "missing" disk.

4) Stop the array and assign the disk to that slot again.

5) Start the array and it will rebuild the disk.

 

Note that this only works if you have good parity.  If that's not the case (i.e. you had parity errors but didn't do a correcting check to fix them), then the integrity of the data in the rebuild is compromised, so you should check all of the files against your backups.

 

If you don't have good parity, but think the disk is okay, you can simply do a "New Config" and then let a new parity sync build your parity drive.

 



Thanks for the reply.

 

What if I wanted to move these 3 drives to a new controller?

 

I also have a Supermicro AOC-SASLP-MV8 with 4 ports open, so I can migrate them there. What would the steps be to do that without any chance of losing my (hopefully) valid parity?


Just move them - v5 keeps track of the drives by serial number, so it doesn't matter which controller they're on.
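If you want to double-check which device name ends up mapping to which physical drive after the move, the serial numbers can be read from the console with something like this (device names are only examples):

  smartctl -i /dev/sdb | grep -i serial
  smartctl -i /dev/sdc | grep -i serial
  smartctl -i /dev/sdd | grep -i serial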

 

Awesome! Just to be clear, the steps are:

 

1) Move the drives to the other controller

2) Follow your steps to rebuild the red-balled drive

3) Verify the data against my backup in case the parity was damaged by the write errors

 

Sounds about right?


Yes, that's correct.  If you're certain your parity is good, you can skip #3, although it's always a good idea to validate the data if there have been write errors at any point.

 

Last question before I take the plunge.

 

The evidence I have points to the controller as the culprit, and I've since moved the 3 drives to my other controller. Parity and disk 4 had errors at some point this week; I run monthly parity checks, and the errors could have been from then, but I'm not sure at this point. Disk 6 was red-balled with no errors. All 3 drives passed the short and long SMART tests.

 

Based on this, I feel more inclined to start a New Config and rebuild parity. Is that the right move here? I'm not sure there's enough information to make a 100% correct decision, but based on what I have, what would you lean towards: rebuilding parity, or trying to rebuild disk 6?

 

Thanks!!


From what you've outlined, I'd also lean towards just doing a new config.

 

HOWEVER ... BEFORE doing that, I'd shut down; physically remove the red-balled disk (disk 6 ... be sure you know the serial # and take out the correct disk); connect it to a Windows PC (either internally or via a USB bridge); and then use the free LinuxReader to read all of the files. If you have space to copy them all, do that. If not, just copy a group at a time to a folder and periodically delete the contents of the folder ... the idea is simply to confirm that ALL of the files are readable (not that they're error-free). [ http://www.diskinternals.com/linux-reader/ ]

 

 


Yes, if it couldn't read a file it would give you an error.

 

But as I noted earlier, this just confirms that all the files are "good" from a readability standpoint -- not that they haven't at some point been corrupted.  Regardless of what you do -- rebuild or new config -- it's a good idea to compare the files with your backups to confirm all is okay.
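A quick way to do that comparison from the console, assuming the backup copy is reachable from the server at a path like /mnt/backup/disk6 (a hypothetical mount point -- adjust both paths for your setup):

  diff -rq /mnt/disk6 /mnt/backup/disk6   # lists any files that differ or are missing on either side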

 



Nothing is currently mounted; should I mount disk6 to try the copy to /dev/null? I assume yes.
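Something like this is what I had in mind, assuming disk 6 still shows up as /dev/sdc -- mount the data partition read-only somewhere temporary and read every file out to /dev/null so any unreadable file throws an error:

  mkdir -p /mnt/check
  mount -o ro /dev/sdc1 /mnt/check                     # mount the data partition read-only
  find /mnt/check -type f -exec cat {} + > /dev/null   # read every file and discard the output
  umount /mnt/check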


The syslogs indicate that all drives are working. Any drive that has suffered a write error will show a red ball. Why do you believe that other drives have write errors? Based on the syslogs, the server is working correctly. I do not have a clear understanding of the state of your system and thus can offer no advice.



Hi dgaschk, it looks like one of my SATA controllers went bad. The 3 drives in question were connected to it, and nothing else was. I've since moved the 3 drives to my 8-port controller and followed garycase's advice. I verified the red-balled drive on my Windows machine with Linux Reader: I copied a huge portion of the data over with no issues and did some spot checks with no problems found. I ended up creating a New Config, and I'm at 60% of the parity sync.

 

I'll have to try to verify whether the controller is bad, but since I don't have an immediate need for it, with my 10 drives spread over the 4 onboard ports and 8 on the other controller, I'll likely leave it out. When I need to expand, I'll buy another AOC-SASLP-MV8.

 

Thanks for all the help, garycase and dgaschk! I'll change the subject to "Solved" when my parity sync is finished and I verify that I'm back up and running.



I just realized I was calling them "write errors" when I should have just said "errors". I was basing that on what I saw on the dashboard: my parity drive and disk 4 showed a lot of errors, but disk 6 was red-balled with 0 errors. Unfortunately, I did not get the syslog from that event, because when I tried troubleshooting, everything started locking up. The GUI went down, and even commands I was running on the console were freezing; I had 4 sessions open trying to manually take the array offline and do a clean shutdown. Ultimately it did not shut down gracefully. I should have manually extracted the syslog before attempting anything. My mistake.
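For next time, saving the log first is a one-liner from the console -- assuming the stock layout where the flash drive is mounted at /boot, a copy there survives the reboot:

  cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M).txt   # stash the current syslog on the flash drive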

 

The syslogs I attached were after the fact. I'll continue to monitor things closely once I'm back up and running. I've been slacking on paying attention lately; it's been working so well for the last 6+ months that I got lazy. :)



I meant to comment on that very early in the thread but forgot to say anything -- I knew what you meant, though. The errors column indicates an error reported by the drive on a read -- the data is actually corrected by UnRAID in those cases; if a write error occurs, the drive will be red-balled.

 


Archived

This topic is now archived and is closed to further replies.
