
Failure to rebuild parity


elco1965


About a week ago my parity drive failed.  I purchased two new 4TB Seagate ST4000DM000 drives, one to replace the failed parity drive and one as a spare.  I followed the same procedure I've followed before when replacing a failed drive: stop the array, shut down, replace the drive, reboot, stop the array, assign the new drive to the parity slot, start the array to begin syncing parity, go to bed.  I wake up the next morning with no email from the server reporting failures of any kind.  Feeling good about the server, I log into the WebGUI, see a red ball next to the new parity drive, and instantly receive an email from the server informing me of the failed parity drive.  I think, OK, bad drive.

 

I put the other new drive in, follow the same procedure, and go about my day with no email from the server.  I get home that night and, again feeling good about it, I log onto the WebGUI.  To my shock I find a red ball next to new drive number 2.  I reboot the machine, just because, and now the new drive isn't even listed in the array devices tab: "No device" next to the second new parity drive.

 

The last drive I tried I borrowed from a friend.  Ran all the SeaTools tests: long and short generic, long and short SMART.  The drive tested without issue.  I follow the same procedure and install the borrowed, extensively tested drive last night.  No email of failure last night or today.  I think the drive has been rebuilding parity for about 20 hours; surely it is complete and the array is operating properly.  Log into the GUI: red ball next to the parity drive, and an email sent indicating the parity drive failure.

 

I have rebooted a couple times.  Tried to install the drives again.  Always the same thing.  It appears to be working until I log into the WebGUI.  I then get the failed drive email, reboot, and the drive is missing from the array devices tab; swap drive, repeat.

 

I did some searching here on the forums and don't see that anyone has had a problem with multiple parity drive failures.  I am currently at a loss.

 

The array typically runs virtualized under ESXi, but I have it running natively after the last couple failures.  It all resides in a Supermicro CSE-846BE16-R920B expander backplane case, purchased in March of this year, running a Supermicro MBD-X8SI6-F-0 ATX mobo, an Intel Xeon X3440 SLBLF CPU, and 16GB Micron PC3-8500R DDR3-1066 REG ECC RAM.  I have the backplane connected to the SFF-8087 connectors on the mobo.  The SAS controller on the mobo has been flashed to IT mode.  Syslog attached.  This is Version 6.0 Beta 12.

 

Thank you in advance for your help.

syslog.zip

Link to comment

You didn't mention preclearing. While it's not necessary for parity or any rebuild to be clear, it can be a good test. I see you mentioned testing with another method. Did you do that with all the drives?

 

Most likely you have a problem with something other than the drives, but common to them all; i.e., SATA or power cable, SATA port, SATA controller. Do you have another port you can try?

 

You have posted this in the wrong place regardless of what version you are on. You don't specifically mention it so I assume you are on v6 since that is the area you have posted this in. See the v6 help link in my sig, or the v5 help link if that is your version.

 

I will report this to the mods so this thread can be moved to a more appropriate subforum.

 

Mods - The poster has provided no evidence this is a defect. Please move this to General Support.

Link to comment

Sorry for the initial oversight.  I posted, reread, and saw the version missing.  I modified it just prior to you responding to include this information.

 

All drives are plugged into the expander backplane and connected to the mobo via 2 SFF-8087 cables.  I did try different SATA ports on the backplane with the same result.  It was my thinking that if it were the hardware that connects all the drives to the mobo, I would have more random drive failures.  But I don't know for sure.

 

I did try to plug the parity drive into 2 different SATA ports on the mobo also with the same result. 

Link to comment

From Main page, click on the parity drive. That will take you to the page for that drive. In the Health section, click on Disk attributes. In the latest beta, you can click on the Attributes button at the bottom of the page to download them, but that button may not be available in the version you have so you may have to copy and paste. Post the results.
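If copy-and-paste from the GUI is awkward, the same table can be pulled from a console with smartctl (a sketch, assuming the smartmontools package is available; /dev/sdX is a placeholder for whatever device name the Main page shows for the parity drive):

```shell
# Dump the full SMART attribute table for the drive.
# /dev/sdX is an assumption -- substitute your actual device.
smartctl -A /dev/sdX

# Narrow the output to specific attribute IDs (here 184 and 199),
# printing the ID, the attribute name, and the raw value.
smartctl -A /dev/sdX | awk '$1 == 184 || $1 == 199 { print $1, $2, $NF }'
```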

Link to comment

Known ATA S.M.A.R.T. attributes

Attribute 199 is usually a problem with data getting to the drive; i.e., cable or controller. 184 might be drive or might be interface.

 

Have you tried a different cable?

 

Are your SATA cables bundled together? While it might make things neater and perhaps even improve airflow if you bundle your SATA cables, it has also been known to cause crosstalk between cables.

Link to comment

Understood.

 

But I did take this drive out of the backplane and connected it directly to the mobo at 2 different SATA ports and with 2 different cables, and I got the same failure.  I did my best to eliminate all other shared hardware and isolate the drive, attempting to rebuild parity twice.

 

Also the backplane is connected to the motherboard with SFF 8087 to SFF 8087 cables.

 

I've also had this same failure not being able to rebuild parity with 4 different drives now. 

Link to comment

The case has redundant 920-watt power supplies.

 

http://www.supermicro.com/products/chassis/4U/846/SC846BE16-R920.cfm

 

The drives are set to spin down but I thought unRAID would keep the drives it needed spinning. Such as in rebuilding parity.

That certainly seems like it should be sufficient. Yes, unRAID does spin up drives as needed. I just meant the rebuild process would spin all drives so would use more power.
Link to comment

I will set all drives to not spin down and try again this afternoon.  But I'm still at a loss as to what is causing this with just the parity rebuild. I've tried different ports in the back plane, SATA ports on the mobo, different cables, reboots, and I even tried an older version thinking it might be the beta.

Link to comment

I will set all drives to not spin down and try again this afternoon.  But I'm still at a loss as to what is causing this with just the parity rebuild. I've tried different ports in the back plane, SATA ports on the mobo, different cables, reboots, and I even tried an older version thinking it might be the beta.

Should not be necessary to set drives to not spin down and I would be surprised if this made any difference.

 

It seems like the only thing in common to all of the scenarios you have tried is the power supply. Maybe it is beginning to fail in some way. I don't have any experience with redundant power supplies. Maybe there is something about the way redundancy is implemented that might be involved.

Link to comment

I'm not saying this is impossible, but the power supplies were new in March, and I made the swap in April.  It's only been running in its current configuration for a month.

 

The power supplies don't report any issues via IPMI, and I get an audible alarm should one of them be unplugged or fail.  The redundancy is handled by the power distributor.

Link to comment

Your original syslog appears to have some SAS errors, followed by disk0 write failures which triggered the redball. But you say the same thing happens with the drive plugged in to the mobo.
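For reference, a quick way to pull lines like that out of a syslog from the console is a grep along these lines (a sketch; the exact message strings vary by driver and kernel version, so adjust the pattern to what your log actually contains):

```shell
# Show recent lines that look like SAS transport or disk I/O errors.
# The pattern list and log path are assumptions -- tune them for your system.
grep -iE 'sas|I/O error|DRDY|medium error' /var/log/syslog | tail -n 50
```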

 

I really don't know what else to suggest at this time. You are running more hardware than I have experience with. I've never even used SAS. Maybe someone else will get involved.

Link to comment

In an attempt to eliminate one more possibility, and being at work and needing to remote in, I have set all drives to not spin down and started parity sync one more time.  I hope this helps as I will be leaving town tomorrow for about a week and won't be able to get to it physically. I will report back soon.

 

Thanks for the suggestions thus far.

Link to comment

I had to reboot and reassign to obtain access to the drive.

 

This drive has failed. Note the word "Now" in the Failed column.

Can you get a SMART report for the other 2 drives you tried? Seems unlikely all 3 would be bad, but maybe so.
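Collecting reports for several drives at once can be scripted, which may be easier than clicking through each drive in the GUI (a sketch; the device names are assumptions, so substitute the ones shown on your Main page):

```shell
# Append each drive's SMART attribute table to one file for posting.
# The device list below is a placeholder.
for dev in /dev/sdb /dev/sdc /dev/sdd; do
    echo "=== $dev ==="
    smartctl -A "$dev"
done > smart_reports.txt
```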
Link to comment
  • 2 weeks later...

I had to reboot and reassign to obtain access to the drive.

 

This drive has failed. Note the word "Now" in the Failed column.

 

I have 10 Seagate disks, model number ST4000DM000, and all but 2 of them show this attribute 184 end-to-end "fail now" error.

Link to comment

