Multiple drive failures

September 8, 2014

Lately I have been having drive failures. Usually it is a simple read failure. The simple fix was to stop the array, reboot, remove the disk, start in maintenance mode, stop the array, re-enable the disk, allow it to rebuild, and then carry on. Today another drive had a read failure during the rebuild process. What do I do to not lose any data that is on the disk that was in the process of being rebuilt? I know the data is good on the other disabled drive and parity is still good.

September 8, 2014

Have you just been rebuilding without ever really getting to the root cause of your problems? How do you know anything is good? Better post a syslog.

September 9, 2014

Author

https://www.dropbox.com/s/2m0kxmvib2ws4tp/syslog-2014-09-08.txt?dl=0

September 9, 2014

Author

Sorry - just saw that I posted in the wrong section. Running 5.0.5.

September 9, 2014

Your syslog wasn't any help, as it involves a boot but the array is never started, so no failed disks and no disk errors. We need to see a syslog where a drive fails. He's right that you need to find out the root of the problem, because often it has nothing to do with the drive itself. It could be as simple as several faulty SATA cables. We'll know better when we see the errors in the syslog.

BUT FIRST! The syslog is over 16.5MB of UnRAID refresh, repeating the drive inventory and array setup every single second that the array is offline! This is something that is caused by your old version of Dynamix, v2.2.3. Please upgrade it as soon as possible to the latest. Then reboot again and start the array. Once you have a drive fail, then capture the syslog, zip it, and attach it here.

September 9, 2014

Author

Thanks for the suggestions. I have done a few things. Since the issues started I've replaced the motherboard, CPU and RAM. Last night I also installed a second PSU to help balance the power load.

I know the data is good as I have done several parity checks and reiserfsck runs and all have come back clean. Starting to think it may be a cable issue. I am using 2 Supermicro SAS controller cards with SATA breakout cables. Link to controller below.

I think what I need to know right now is how I can tell the system that disk 6 is OK so I can finish the data rebuild on disk 1. I will attempt to get the breakout cables replaced over my lunch hour and try to get Dynamix updated as well.

http://www.newegg.com/Product/Product.aspx?Item=N82E16816101792

September 9, 2014

Author

Dynamix updated to 2.2.8

September 9, 2014

You can use badblocks in readonly mode to read through the whole disk and verify that all sectors are good.

Do a SMART log first.

badblocks.

SMART log again.

Read your syslog to see if errors were reported but corrected by the kernel.

You may want to do each suspect disk on a different machine with as few drives as possible to eliminate the hardware.

The last case is to use ddrescue to copy the disk to another 'spare' disk.

This lets you work on the spare disk without overwriting any data on the source disk.

You can search the forum for what I was able to document. (it's been a very long time).

I had to do this when I had a double drive failure.

I had to copy the drive with ddrescue in forward mode, logging the suspect sectors.

Then copy the drive with ddrescue in reverse mode multiple times with the log of suspect sectors as input.

I was able to recover every sector except 1.

However if you do badblocks, smart and syslog review with no errors, it's not a physical drive format issue.
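
A minimal sketch of the two ddrescue phases described above, assuming GNU ddrescue is installed. The device names /dev/sdX (failing source) and /dev/sdY (spare), the map file path, the RUN variable, and the retry count of 3 are placeholders chosen for illustration, not values from my original notes. By default the function only prints the commands. Double-check source and destination before running anything for real, because the destination is overwritten.

```shell
# Print (or, with RUN set to an empty string, execute) a two-phase ddrescue copy.
# Arguments: source device, destination device, map (log) file.
rescue_copy() {
    src=$1; dst=$2; map=$3
    run=${RUN-echo}   # default is a dry run that only prints the commands

    # Phase 1: forward copy. -f allows writing to a device, -n skips the
    # slowest recovery phase so the easy data is copied first. Problem areas
    # are recorded in the map file.
    $run ddrescue -f -n "$src" "$dst" "$map"

    # Phase 2: revisit the areas recorded in the map file, reading in
    # reverse (-R) with direct disk access (-d) and three retry passes (-r3).
    $run ddrescue -f -d -R -r3 "$src" "$dst" "$map"
}

# Dry run with placeholder names; nothing is read or written.
rescue_copy /dev/sdX /dev/sdY /boot/logs/sdX.map
```

To execute instead of print, call it as `RUN= rescue_copy /dev/sdX /dev/sdY /boot/logs/sdX.map` with the real device names. Keeping the same map file between runs is what lets the second phase work only on the sectors that failed the first time.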

September 9, 2014

Author

Can you elaborate more on the steps for smartlog and badblocks?

Also replaced all the sata cables.

September 9, 2014

In all commands below, ? is the drive letter of the drive in question:

/dev/sd?

To grab the SMART log from the drive, you can issue

smartctl -a /dev/sd?

To run a short smart test

smartctl -t short /dev/sd?

To run a long smart test (entire drive, can take up to 5 hours).

smartctl -t long /dev/sd?

I would not recommend running the short or long test on a drive that is active on the array.

Chances are emhttp will spin down the drive and disrupt the test.

So do this while the array is stopped.

To capture the smart log here are some example commands.

In this case I am replacing /dev/sd? with /dev/sdc

root@unRAID:/tmp# mkdir -p /boot/logs

root@unRAID:/tmp# smartctl -a /dev/sdc | todos > /boot/logs/sdc.txt

root@unRAID:/tmp# ls -l /boot/logs/sdc.txt

-rwxrwxrwx 1 root root 5607 2014-09-09 15:49 /boot/logs/sdc.txt*

You can possibly access them from a Windows host with the following example.

\\(yourhost)\flash\logs

For badblocks, this is a readonly test of every sector.

Again, I'm replacing ? with c

root@unRAID:/tmp# badblocks -o /tmp/sdc.badblocks -sv /dev/sdc

Checking blocks 0 to 2930266583

Checking for bad blocks (read-only test): 0.04% done, 0:07 elapsed

At the end (and it will take a long time) you can examine /tmp/sdc.badblocks for entries.

Also look at the syslog for retries on the suspect drives.

Man page is here.

http://linux.die.net/man/8/badblocks

You only want to do read-write tests on an empty drive before you preclear it.

So make sure you are doing the read-only test.

You can do this on a live array. It will interfere with and slow down access to the drive being tested, but it shouldn't cause any other issues.
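
To summarize the output file once the run finishes, here is a short sketch. It assumes badblocks wrote one block number per line to the file given with -o; the function name badblocks_summary is just something made up for this example.

```shell
# Count the entries in a badblocks output file and show the first and last
# block numbers listed in it.
badblocks_summary() {
    awk 'NF { n++; if (n == 1) first = $1; last = $1 }
         END {
             if (n) { printf "%d bad blocks, first %s, last %s\n", n, first, last }
             else   { print "no bad blocks listed" }
         }' "$1"
}

# Only report if the file from the example above exists.
if [ -f /tmp/sdc.badblocks ]; then
    badblocks_summary /tmp/sdc.badblocks
fi
```

An empty file is a good sign but not proof of a healthy drive, so still compare the SMART logs and check the syslog.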

September 9, 2014

Author

I think I may have a drive issue... SMART doesn't look good for this drive.

smart.txt

sdc.txt

September 9, 2014

Author

root@10:~# badblocks -o /tmp/sdk.badblocks -sv /dev/sdk

Checking blocks 0 to 2930266583

Checking for bad blocks (read-only test): 2.31% done, 7:19 elapsed

Will post when that's finished.

September 9, 2014

The short test is almost useless in this type of situation.

It will weed out a pretty gross LBA error, but it's not enough.

You need to do the long test.

Make sure the array is not online. You don't want the drive spinning down.

The long test will take 3-5 hours depending on drive size.

Here's where the trouble lies.

In any case, if you do not have a backup you can try to use ddrescue to copy the drive forward and back to see what it can save.

Let's see what SMART reports after the badblocks read test. Keep in mind the firmware and the kernel will do retries.

So badblocks alone cannot detect a bad drive; you have to review the syslog and look at the SMART reports.

Device Model: WDC WD30EZRX-00DC0B0

Serial Number: WD-WCC1T1339998

196 Reallocated_Event_Count 0x0032 197 197 000 Old_age Always - 3

197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 674

198 Offline_Uncorrectable 0x0030 199 199 000 Old_age Offline - 604
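
For picking these attributes out of a saved report, a small sketch follows. It assumes the report was written by smartctl -a as shown earlier in the thread and that the raw value is the tenth column of the attribute table. The function name smart_watch and the choice of IDs (5, 196, 197, 198) are illustrative; the sample input is the three lines quoted above.

```shell
# Print the name and raw value of the sector-health attributes in a smartctl
# report. Reads the files given as arguments, or standard input if none.
smart_watch() {
    awk '{ sub(/\r$/, "") }   # logs saved through todos have CRLF line endings
         $1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $2, $10 }' "$@"
}

# Sample: the attribute lines from the report above.
smart_watch <<'EOF'
196 Reallocated_Event_Count 0x0032 197 197 000 Old_age Always - 3
197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 674
198 Offline_Uncorrectable 0x0030 199 199 000 Old_age Offline - 604
EOF
```

On a saved log the call would be `smart_watch /boot/logs/sdc.txt`.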

September 10, 2014

Author

root@10:~# badblocks -o /tmp/sdk.badblocks -sv /dev/sdk

Checking blocks 0 to 2930266583

Checking for bad blocks (read-only test): 14.80% done, 47:38 elapsed

done

Pass completed, 43 bad blocks found.

Sent the SMART data off to WD - they are going to RMA the drive.

September 10, 2014

Author

Smart report after badblocks

smart2.txt

September 10, 2014

You'll see that the Current Pending Sector count went up.

So more sectors are bad.

196 Reallocated_Event_Count 0x0032 197 197 000 Old_age Always - 3

197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 715

198 Offline_Uncorrectable 0x0030 199 199 000 Old_age Offline - 604
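
The change can be read straight from the two saved reports. This is a sketch, assuming both were saved with smartctl -a and that the raw value is a plain integer in the tenth column; smart_raw and smart_delta are hypothetical helper names, and the demo files hold only the Current_Pending_Sector lines from the two reports in this thread.

```shell
# Raw value (tenth column) of one named attribute in a saved smartctl report.
smart_raw() {
    awk -v name="$2" '{ sub(/\r$/, "") } $2 == name { print $10; exit }' "$1"
}

# Show how one attribute changed between two saved reports.
# Arguments: earlier report, later report, attribute name.
smart_delta() {
    before=$(smart_raw "$1" "$3")
    after=$(smart_raw "$2" "$3")
    echo "$3: $before -> $after (change: $((after - before)))"
}

# Demo using temporary files that stand in for the two saved reports.
tmp=$(mktemp -d)
echo "197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 674" > "$tmp/before.txt"
echo "197 Current_Pending_Sector 0x0032 199 199 000 Old_age Always - 715" > "$tmp/after.txt"
smart_delta "$tmp/before.txt" "$tmp/after.txt" Current_Pending_Sector
rm -r "$tmp"
```

For the numbers above this reports a change of 41 pending sectors across a single read pass.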

I see that you ran the short test again, you need to run the long test.

In any case, this drive is not a good candidate for unRAID. (at this point).

Get your data off of it (unless you have a backup).

Once you have your data off of it, you can RMA it or try to resurrect it with badblocks or preclear.

You can run badblocks in multiple pass write-read mode which will write and verify every single sector.

This will erase your drive, so you must save your data first.

After that you can look at the smart report and see how the Pending and reallocated sectors are.

Or you could just RMA the drive and let the manufacturer deal with it.

September 10, 2014

Author

Doing an advanced RMA and will be purchasing an additional drive. Will back up the data and replace the drive, BUT I still need to correct the overall issue of restoring data to the other drive that was in the middle of a rebuild when this error occurred.

September 10, 2014

Is the 'other' drive still functional? or is it totally gone now?

September 10, 2014

Author

The data is gone as it was in the middle of a rebuild when drive 6 threw an error. If I can get drive 6 to be happy again, I know that drive 1 could finish its rebuild.

September 10, 2014

I'm not as confident that his parity and drive 6 are OK as he is.

And I don't know if this would work or not so someone else chime in.

Could he set a new config and check the valid parity box, then start the array?

Stop the array and unassign drive 1. Start the array so it sees drive 1 missing.

Stop the array and assign drive 1. Would it then be able to rebuild?

September 10, 2014

Author

That is exactly what I was thinking, but I felt it was all a bit above my head so I had to come here first.

September 10, 2014

Did the drive totally die and become inoperable? Does it spin? Can you grab a SMART log?

If the drive spins and a SMART log is accessible, maybe you can just ddrescue the data and try to do reiserfs repairs on it.

Help me understand.

Drive 6 is the one with the pending sectors (read failures)

Drive 1 was having other trouble? Is it still alive or is it totally dead?

If this was being rebuilt were any writes performed to it while it was in simulation mode?

September 10, 2014

Author

From the beginning:

Drive 1 threw an error, believed to be a false error as it has happened before on different drives (it moves around; stop array, reboot, remove drive, start array in maintenance mode, stop array, re-add drive, start array, rebuild, life is happy). I believe I was having either a power issue or a cable issue, both of which have been addressed very recently.

Drive 6 threw an error during the drive 1 rebuild. Data on drive 1 is probably crap as it was in the middle of a total rewrite.

Went to the forums and posted this thread.

Etc.

Now:

An RMA 3 TB drive is on the way and an additional new 3 TB drive is ordered from Newegg. I would like to see the drive 1 data rebuilt, but if I lost data, I lost data. I would rather lose 2 TB than 5 TB.

September 10, 2014

I think it is quite likely that the data on drive 1 is actually intact. If it was doing a rebuild back onto the same drive then it should have simply been over-writing each sector with the same data as it had before. The only question is whether the other drive going bad has caused the rebuild process to write bad data to any sectors. With any luck the only data that would be lost would be any that was written to the 'emulated' disk1 while disk1 was in a red-balled state.

It might be worth running a reiserfsck check against the physical drive to see what is reported.

September 10, 2014

This is why I asked the redundant question about Drive 1.

If it still operates, I would certainly do the SMART and badblocks in read mode testing.

If the drive shows it's healthy, then I would ddrescue it to another drive, then work on the other drive to recover the data.

If I had no other choice I might try the reiserfsck functions on it directly.

As long as the emulated drive 1 was not written to, I would assume the data is still there.

Otherwise any updates that were written to the emulated drive would be lost. If those are not as important, so be it.
