Parity Errors, then Red Ball.


This is the first time anything like this has happened to me, so please bear with me. I have searched around, but I'm still not sure exactly what to do.

 

So starting yesterday I noticed that the monthly parity check showed a bunch of errors, something in excess of 6000.  Thinking that was odd, I searched the forums.  I found some information saying I should run a check-only (no-correct) parity check twice to see whether the errors were the same or not, then run a memtest, and then other procedures.
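For anyone wanting to do the "same errors or not" comparison by script, here is a minimal sketch. It assumes you have already pulled the reported error addresses for each check run into two plain lists of integers; how you extract them from the syslog is not covered here, and the function name and sample numbers are made up for illustration.

```python
def compare_check_runs(first_run, second_run):
    """Compare error addresses reported by two no-correct parity checks.

    Both arguments are iterables of integer addresses (hypothetical input;
    extracting them from the syslog is left to the reader).
    """
    first, second = set(first_run), set(second_run)
    return {
        "repeated": sorted(first & second),      # reported by both runs
        "only_first": sorted(first - second),    # reported by the first run only
        "only_second": sorted(second - first),   # reported by the second run only
    }


if __name__ == "__main__":
    # Made-up addresses, purely to show the shape of the result.
    run_a = [1024, 2048, 4096]
    run_b = [2048, 4096, 8192]
    print(compare_check_runs(run_a, run_b))
```

Addresses that show up in both runs are the repeatable ones; addresses that appear in only one run are the ones that moved between checks.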

 

So I started to do just that.  I was about 70% through my first no-correct parity check, while watching media streaming from the tower, when my media stopped playing and there was nothing in my TV share.

 

I logged onto the webgui and noticed that my disk6 had been red-balled.  This drive is a recent addition (2 weeks ago, Seagate 3TB) that I precleared, and it passed with no problems.  I've checked all the SATA connections and all the power connections.  I have also swapped out the SATA cable for a brand new one.  The drive is still red.

 

I ran a short SMART test, and it failed halfway through, reporting a host interruption.  So I ran it again, and it completed fine with no errors.  I'm guessing my next step is to run the longer SMART test.  But after that I don't know what to do.

 

System Specs:

unRAID Version: unRAID Server Pro, Version 5.0-rc8a

Motherboard: ASUSTeK - P5N-D

Processor: Q6600 (I think; I'm not sure, as it isn't listed here) - 2.666 GHz

Cache:

Memory: 2GB

Network: 1000Mb/s - Full Duplex

 

Syslog and smart report added.

 

Link to syslog. EDIT: I realized that in that syslog I forgot to start the array; here's the updated log.

 

I'm also assuming all data on the 6th drive is lost right? Seeing as there were a load of parity issues... I'm actually assuming all my data is not safe.

 

EDIT: I just noticed that my cache drive is now showing "unformatted"... It's directly connected to the motherboard, and not on the same controller as drive 6.  Is there a way to fix that too, or is everything now gone? (I had a lot of program data set up there.)

EDIT2 (RobJ): remove typo in 2nd link

smart.txt

Link to comment

Sorry about that. I've attached the log.

 

A little update: I took the cache drive out and was able to recover the data on it.  I replaced it with a new WD Black 640GB drive and put my data back on it.  That part is running smoothly; however, my "Downloads" cache-only share no longer works.  I can access /mnt/cache/, but this causes problems for some of my programs (SB, SAB, etc.).  Any ideas why the "Downloads" cache-only share doesn't work?

 

Other than that, I swapped out the power cables and the SATA cables again.  Now the drive shows up as orange, so I'm in the process of rebuilding the data.  Honestly, I'm not that concerned if the data is lost, as long as the drive is still working.

syslog2.zip

Link to comment

There are a number of problems.

 

1. I fixed a typo in your 2nd link, but it's always best to attach the entire syslog, zipped.

 

2. Your attached SMART report is for Disk 5, and looks fine, but is probably not the one you meant to attach.

 

3. Syslog ends in "kernel: [Hardware Error]: Machine check events logged", which is serious.  Primary cause I think is memory, which is also the easiest to test, and almost the easiest to fix.  Please run a long memory test.  The UnRAID boot menu has a good one, let it run for several hours at least, or until errors occur.  If memory checks out fine, then there may be another issue with your motherboard, or power issues.  I'm not sufficiently experienced in Linux to know where those 'check events' are logged.  Hopefully, a real Linux expert will advise you where to look, *before* you reboot next.

 

4. Your Cache drive has bad sectors at the very beginning, which stops it from mounting.  This has to be fixed.  I see you have now replaced it.

 

5. Your Disk 6 (3TB drive) is attached to one of the slowest SATA ports you have, on a Silicon Image card, which limits it to UDMA/100 speed at best.  Worse yet, it is acting a little jiggly, raising an exception 3 or 4 times, which resulted in it being slowed by the exception handler to UDMA/66, which will limit its speed to less than half of what it is capable of.  Your 3TB drives are your fastest, so reconnect both of them to your fastest onboard ports.  And make sure the cables to it are tight at both ends.  A loose cable *might* be causing the exceptions, or something is wrong with the first port on that card, or this is another indication of power issues.  At some point, you should also consider retiring both Silicon Image cards, as they limit the performance of any attached drives.

 

6. There is no evidence of issues with Disk 6, but for that we would have needed to see the syslog when that happened.  It would have been much better if you had captured it after those errors occurred, before rebooting.  At the moment, Disk 6 looks fine.  But since there are indications of either memory or power issues, I don't doubt other problems could occur too.  It's absolutely imperative you rule out or fix any memory issues first.  Then you may need to try a better power supply.
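To make the exceptions and slowdowns from item 5 easier to spot, here is a rough sketch that counts them per ATA port in a syslog. The two message shapes it looks for ("exception Emask" and "limiting speed to") are the usual libata wording as far as I know, but treat the patterns as assumptions and adjust them to what your own syslog shows. The sample lines are made up for illustration and are not taken from the attached logs.

```python
import re
from collections import defaultdict

# Assumed libata message shapes; adjust if your kernel words them differently.
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")
SPEED_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: limiting speed to (\S+)")


def summarize_ata_trouble(log_text):
    """Return {port: {"exceptions": count, "limited_to": last speed limit or None}}."""
    summary = defaultdict(lambda: {"exceptions": 0, "limited_to": None})
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            summary[match.group(1)]["exceptions"] += 1
        match = SPEED_RE.search(line)
        if match:
            summary[match.group(1)]["limited_to"] = match.group(2)
    return dict(summary)


# Illustrative lines only (hypothetical port numbers and timestamps).
SAMPLE = """\
Feb 12 10:00:01 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 12 10:00:09 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 12 10:00:17 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 12 10:00:17 Tower kernel: ata7.00: limiting speed to UDMA/66:PIO4
Feb 12 10:00:20 Tower kernel: ata3.00: configured for UDMA/133
"""

if __name__ == "__main__":
    for port, info in summarize_ata_trouble(SAMPLE).items():
        print(port, info)
```

A port that collects several exceptions and then a speed limit is the one whose cable, port, or power deserves a closer look first.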

 

Edit: I mixed up the 2 3TB drives, Parity and Disk 6.  Inexcusable.

Link to comment


Not sure how you were able to recover data from the old Cache drive.  It would not surprise me that even if you got it to mount, the Reiser file system would have been corrupted.  That could cause an incomplete recovery, resulting in issues with the contents of the new Cache disk.

Link to comment


I unplugged it and installed it in a spare computer.  Then, after some searching, I found a tool with a GUI that would allow me to read the drive and copy the data off.  I thought it would be a long shot, but it turned out to work.  EDIT: I should note that the only data I recovered was the config files for SAB, CP, and SB.  I tested them on a VM install of the programs, and they seemed to work fine.  I haven't copied them to the server yet.

 

About the parity drive I just checked the SATA cable and it is indeed connected to one of the onboard SATA ports.

 

I have two Seagate 3TB drives:

Parity: ST3000DM001-1CH166_W1F16W9X (sdi) 2930266532

Disk 6: ST3000DM001-1CH166_Z1F1QJHT (sda) 2930266532

 

Disk 6 is plugged into a PCI card, and the parity drive into an onboard port.  I know it's not ideal, but I'm saving my money for an upgrade.

 

I can try connecting both 3TB drives to onboard ports.  As for the power supply, it is a relatively new Corsair 850W PSU, which I upgraded to a few weeks back from a 400W PSU.

 

I will run a memtest, but should I wait until unRAID has finished its "Data-Rebuild", or just cancel it now?

 

And yes, the SMART log was the wrong one. I apologize; it was late and I was tired.  I have attached the proper one this time (I hope.. hah).

 

What is the best expansion card on a budget?

smart.txt

Link to comment

The drive Z1F1QJHT is experiencing power problems. The Power-Off_Retract_Count is about 4 times the Power_Cycle_Count. This could be a bad or loose power connection.
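As a quick way to repeat that comparison on any drive, here is a minimal sketch that reads the attribute table from a saved `smartctl -A` style report and computes the ratio of Power-Off_Retract_Count to Power_Cycle_Count. The function names are made up, the sample values are invented to mirror the "about 4 times" case (they are not the real numbers from this drive), and no threshold is implied beyond the comparison described above.

```python
def parse_smart_attributes(report):
    """Map attribute name -> raw value for rows of a `smartctl -A` style table."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def retract_ratio(attrs):
    """Return Power-Off_Retract_Count / Power_Cycle_Count, or None if unavailable."""
    cycles = attrs.get("Power_Cycle_Count")
    retracts = attrs.get("Power-Off_Retract_Count")
    if not cycles or retracts is None:
        return None
    return retracts / cycles


# Invented sample values, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       40
"""

if __name__ == "__main__":
    print(retract_ratio(parse_smart_attributes(SAMPLE)))
```

A ratio well above 1 means the drive lost power unexpectedly more often than it was deliberately power cycled, which is what points at the power connection.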

 

That SMART report was taken after changing the power cable, but once this memtest is done I will switch to yet another power cable.

 

 

EDIT:

I have swapped out the power cable for a new one that runs straight from the PSU and isn't powering anything else.  I ran the memtest for 4 hours, with 5 passes and no errors.  I will run an overnight memtest tonight.

 

I have attached a zip containing another syslog, and SMART reports for my cache, parity, and "red-balled" disk.

Link to comment

 

 

Apparently you can't attach a file when you click modify, I'm sorry for the double post please correct me if I'm wrong... I couldn't see an attach button.

 

EDIT:

I just happened to be checking the syslog again (after uploading this one) and noticed this was logged (newest entry first):

2502: Feb 12 20:44:22 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.

2501: Feb 12 20:44:22 Tower kernel: md: recovery thread rebuilding disk6 ...

2500: Feb 12 20:44:22 Tower kernel: md: recovery thread woken up ...

2499: Feb 12 20:44:22 Tower kernel: mdcmd (43): check NOCORRECT

2498: Feb 12 20:38:29 Tower kernel: [Hardware Error]: Machine check events logged

 

That hardware error is back.
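To catch these without reading the whole syslog by eye, here is a small sketch that pulls out the timestamp of every "Machine check events logged" line. It assumes the line layout shown in the excerpt above (an optional viewer line number, then date, host, and `kernel:`); the function name is made up.

```python
import re

# Matches lines like:
# "2498: Feb 12 20:38:29 Tower kernel: [Hardware Error]: Machine check events logged"
# The leading "2498: " viewer line number is optional.
MCE_RE = re.compile(
    r"^(?:\d+:\s+)?(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"kernel: \[Hardware Error\]: Machine check events logged"
)


def find_machine_check_events(log_text):
    """Return the timestamps of all machine check notices in the log text."""
    hits = []
    for line in log_text.splitlines():
        match = MCE_RE.match(line.strip())
        if match:
            hits.append(match.group(1))
    return hits


# The lines quoted in this post.
EXCERPT = """\
2502: Feb 12 20:44:22 Tower kernel: md: using 1536k window, over a total of 2930266532 blocks.
2501: Feb 12 20:44:22 Tower kernel: md: recovery thread rebuilding disk6 ...
2500: Feb 12 20:44:22 Tower kernel: md: recovery thread woken up ...
2499: Feb 12 20:44:22 Tower kernel: mdcmd (43): check NOCORRECT
2498: Feb 12 20:38:29 Tower kernel: [Hardware Error]: Machine check events logged
"""

if __name__ == "__main__":
    print(find_machine_check_events(EXCERPT))
```

This only tells you when the kernel noticed a machine check, not what the event was; the details are recorded elsewhere.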

logs.zip

Link to comment


No need to apologize, in fact, I'm the one that needs to apologize, mixing up your 2 3TB drives!  I'm sorry.  I somehow switched the Parity drive with Disk 6.  So there's nothing at issue with the Parity drive, and the exceptions and slowdowns were actually on Disk 6.  And in view of dgaschk's comment (very good catch by the way!), Disk 6 has been having other issues too.  Its SMART report looks fine though, apart from the power issue, so Disk 6 is probably a good drive, once it's connected to a good port with good power.  I see you have done that.

 

I noticed that the first syslog showed your server had 2GB of RAM, and your second syslog showed it had 4GB. You did not mention anything about increasing the memory, so I have to verify: did you add memory?  If not, this is obviously a red flag.

 

I would not interrupt the rebuild.  You are rebuilding Disk 6?

 

I'm curious what the tool with gui was, that you used to copy off the data?

 

As to the "best expansion card on a budget", that's something I would like to learn too!

Link to comment

Well, if anything, you made me realize I should have my fastest drives on the onboard SATA ports. So no harm done.

 

About the RAM: yes, my mistake. I recently received 4GB of higher quality RAM, so I swapped out the 2GB of RAM for the better sticks.

 

Yes, unRAID is rebuilding disk 6. Once it's done I hope everything will be sorted out.

 

I'm still unsure about the hardware error... I've Googled around, and the most common answer is that it's usually a problem with the RAM. So after the rebuild I will run an overnight memtest.

 

I will edit in a link to the tool later, as I'm on my phone right now.

 

EDIT:

This was the post about reading the drive on Windows, which led me to this site: http://yareg.akucom.de/

Link to comment