
Disk failed during parity rebuild -- HELP! SAS2LP and V6 :(


dogbowl


I replaced my parity drive tonight (upgrading from 3TB to 4TB) and at about the 4% rebuild mark, I had a drive red ball due to SMART errors.

What's odd is that the array is accessible and the failed disk shows "contents emulated" -- however, without a proper parity disk, I wouldn't expect this to be possible.

 

Capture3.png

 

 

My syslog (http://vinelodge.com/wp/wp-content/uploads/2015/10/tower-syslog-20151030-0008.zip) shows multiple "disk3 read error" entries, and the SMART report on that drive reads:

 

=== START OF INFORMATION SECTION ===

Vendor:              /4:0:2:0

Product:             

User Capacity:        600,332,565,813,390,450 bytes [600 PB]

Logical block size:  774843950 bytes

Physical block size:  3099375800 bytes

Lowest aligned LBA:  14896

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46

>> Terminate command early due to bad response to IEC mode page

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
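
(If it's useful, the permissive retry that message suggests would be something along these lines -- just a sketch, with /dev/sdX standing in for whatever device node disk 3 gets, and the -d type possibly needing to be forced behind the SAS2LP:)

smartctl -a -T permissive /dev/sdX          # re-run the full SMART query, ignoring the failed mandatory command
smartctl -a -d sat -T permissive /dev/sdX   # same, forcing SCSI-to-ATA translation if the controller needs it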

 

What's the best approach here?  I do still have my old parity drive available.

This is an unRAID 6.1.2 system on a SUPERMICRO MBD-X9SCM-F motherboard with AOC-SAS2LP-MV8 controller cards.

 

 

 

Link to comment

I'm going to wait for the experts to advise you on how to proceed, but based upon the errors in the log prior to all the read errors, it would appear that you jarred the connection (power or SATA) to disk 3 when changing out the parity drive, and it dropped off once it started getting a workload.

 

This is why I'm such a huge fan of hot swap cases (even cheap ones).  Drives are going to be upgraded / die at some point, and without hot swap, with lots of drives the odds are very good that you will inadvertently disturb a connection to another drive and cause problems.

Link to comment

That's a possibility, but I do have these drives in hot swap cases.

This one in particular: http://www.supermicro.com/products/accessories/mobilerack/CSE-M35T-1.cfm

 

I went forward with replacing the original parity drive and that resulted in a "too many wrong and/or missing disks" error, and I don't have the option to start the array.

Nothing has changed across the drives (that I'm aware of), so the old parity drive should still be 'valid'.

 

Not sure what the best next step would be here...  (other than purchasing a good 2TB drive)

Link to comment

I think at this point, my best option would be to force unRAID to trust the (old) parity drive, following the instructions here:

http://lime-technology.com/wiki/index.php/Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily

 

However, I'm not sure if those are instructions for v5 or v6.

 

The hope would be to bring the array back online, but still in an unprotected state as that 2TB drive is bad.  With the original parity drive active again, I could then rebuild the 2TB drive to get everything back.

 

 

EDIT:  Here's a link to the SMART report from my drive 3.  Could this drive be okay?

http://vinelodge.com/wp/wp-content/uploads/2015/10/WDC_WD20EFRX-68AX9N0_WD-WMC1T0987467-20151030-1548.txt

Link to comment

Current Status:

 

I have the new parity disk (the 4TB I was originally upgrading to) installed in the server.  I have removed my bad disk 3 from the array, started it, stopped it, and added it back.

 

I now have an invalid parity disk as well as a "Device Contents Emulated" for disk 3.  So I believe my only options are either a Tools -> New Config or a Rebuild to start a data rebuild.  I'm sure those two options do effectively opposite things, but I'm not familiar enough with the details to know which would be the best choice...

Link to comment

The red ball was caused by a failed read from your disk 3. The SMART report for your hard disk looks fine. This implies a problem with communication. I would change the SATA cable to the hard disk that failed. Move it to the motherboard if you have to. Two options from here:

 

1. Put your old parity drive back in the array. Do a new config, assign all drives, and trust parity. Start a non-correcting parity check (there's a command-line sketch for that below). I bet it shows zero errors.

 

2. Keep the new parity drive in the array. Do a new config, assign all drives, and then do a parity sync. I think it will work fine.

 

Number one is the more conservative option. If it works you can move on to number two.
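
If the web UI is unresponsive right after a reboot, I believe the non-correcting check can also be started from the console with something like this (going from memory, so double-check the path and syntax for your unRAID version):

/usr/local/sbin/mdcmd check NOCORRECT   # read-only parity check; reports sync errors without writing corrections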

Link to comment

I've had some interesting results.  The SATA cables are all breakout cables from my SAS2LP card and I don't have a spare of those handy, so I've just relied on the New Config and hoped for the best with the parity rebuilds.

 

With the new (4TB) parity drive in, the parity check results in numerous errors on that same disk (disk 3), to the point that it all fails and I'm left with both a bad parity disk and an 'emulated' disk 3.

Syslog for that is here: http://vinelodge.com/wp/wp-content/uploads/2015/10/tower-diagnostics-20151031-0049.zip

 

So I then switched back to the original 3TB parity drive and did a new config (I was unable to select a non-correcting parity check, as my UI goes unresponsive for about 120 seconds after a reboot).

That parity check completed successfully -- however, *another* drive threw errors and is now redballed:

Capture6.png

 

Here's the diagnostics from that event: http://vinelodge.com/wp/wp-content/uploads/2015/10/tower-diagnostics-20151031-0935.zip

 

The SMART report on that disk 2 looks good (I think?), so I'm not sure what to think now.

Throughout all of this, I have not intentionally written to any disks; however, even though they are set not to autostart, my Docker apps start after every reboot and I suspect they are writing new files.  I don't know if this would cause these errors, though.

Link to comment

Hardware problems are difficult to pin down. What you are seeing could be caused by cables (my suspicion), bad hard drive(s), bad memory, power supply, controller, hot swap bays and maybe even motherboard. The way I approach these kinds of problems is to swap with known good parts.

 

I looked at all of your SMART reports. They all look good except disk 2's, which did not complete. I suspect this is due to a cable problem. Do you have any free motherboard ports?
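
If it helps narrow down cable vs. drive, a quick pass like this pulls out the attributes I'd compare first (just a sketch -- it assumes the drives show up as /dev/sd? and report the standard ATA attribute names); a climbing UDMA_CRC_Error_Count usually points at a cable or backplane connection rather than the platters:

for d in /dev/sd?; do
  echo "== $d"
  # reallocated/pending sectors hint at a failing drive; CRC errors hint at a bad connection
  smartctl -A "$d" | grep -Ei 'Reallocated_Sector|Current_Pending|UDMA_CRC'
done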

 

If you wrote to the array before replacing your original 3TB parity drive, then your parity is not in sync with your data drives and you have lost parity protection.

Link to comment

For future reference, when you ask for help with something as complex as this, you should wait until you've been in contact with someone who has lots of experience with it, or with lime-tech directly. By being impatient about it, you might find it difficult to find someone willing to step in and provide you with guidance.

Link to comment

I really appreciate any advice on this -- things are looking better, and I can only assume this has been due to a hardware issue somewhere.  I plan to swap out the SATA breakout cable as soon as I can get a replacement.  All other hardware has been up and running for some time, although I did recently change the PCI-E slot that this particular SAS2LP is plugged into.  (Both drives that I've had these issues with are on the same SAS2LP controller card, and both are on the same breakout cable.)  For what it's worth, the drives are in different hot swap caddies.

 

So, I ran reiserfsck --fix-fixable /dev/md2 and that resulted in some recommendations:

Capture7_s.png

 

So I then executed reiserfsck --rebuild-tree and that completed:

Capture8_s.png
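
(For anyone finding this later, the sequence boils down to roughly the following, run against the md device with the array started in maintenance mode -- /dev/md2 maps to my disk 2, so adjust for whichever disk is being repaired, and --rebuild-tree only because the earlier output recommended it:)

reiserfsck --check /dev/md2          # read-only pass; reports what, if anything, needs fixing
reiserfsck --fix-fixable /dev/md2    # repairs the minor issues it can
reiserfsck --rebuild-tree /dev/md2   # only when the previous passes recommend it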

 

I then stopped the array, removed drive2, started, stopped, added drive2 back, started again and it began to rebuild.  That completed successfully and my array seems to be back to normal.

 

I'm currently running a parity check

Link to comment
I then stopped the array, removed drive2, started, stopped, added drive2 back, started again and it began to rebuild.  That completed successfully and my array seems to be back to normal.

 

I'm currently running a parity check

 

I've got my fingers crossed for you. It is good to have a spare SFF-8087 to SATA cable for troubleshooting. The way to minimize risk when changing a parity disk is to prevent writing to the array until the parity sync and a parity check are complete. We all get impatient / make mistakes though. It is also good practice to do a parity check right before making a change (adding a disk, replacing a disk, etc.).

Link to comment

I have had my system running and stable for so long that I had gotten relaxed about running parity checks.  So, yes, when I went to upgrade my parity drive, I didn't run a parity check (and it had been about 30 days since one had been run).

 

When I first restarted after installing the 4TB drive, my dockers all came back to life and other automated backups also ran before the parity drive upgrade was complete.  If I don't have a hardware issue, something with all of that activity could have caused this.

 

Regardless, I'm going to take this opportunity to replace a number of drives and then possibly switch over to the new filesystem.  I'll also be setting up a scheduled parity check.  :)
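
(For the scheduled check I'm thinking of a monthly cron entry along these lines -- assuming mdcmd lives at /usr/local/sbin/mdcmd on 6.1.x; I'll double-check the path and syntax before relying on it:)

# non-correcting parity check at 03:00 on the 1st of each month
0 3 1 * * /usr/local/sbin/mdcmd check NOCORRECT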

 

 

Link to comment

Following up on this: I have recovered everything and my array is back, parity verified, with no data loss (that I'm aware of).

 

It does seem as if this was due to a bad SATA cable, as I was only able to fully rebuild and verify parity after replacing that particular breakout cable from the SAS2LP.  Not something I would have initially expected, since all my drives are in hot swap caddies, but I have no other explanation...

 

Thanks for the help above.

 

Link to comment

