Failed drive while parity building a different failed drive

axeman · February 18, 2019

Okay Gents... a doozy for you:

I had disk 10 show errors for a while, and it eventually got disabled. I got a replacement and pre-cleared it. As it was clearing, I got lots of UDMA CRC errors. I feel like this happens whenever I pre-clear a disk: I get random UDMA CRC errors.

Of course it was probably a power or controller issue (since reseating the cables, etc., I have not had a single UDMA CRC error). As the pre-clear was about to finish, disk 4 in the array got red-balled. No biggie: the pre-clear finished, I reseated all cables, shut down the array, unassigned disk 4, and reassigned it.

The plan was to assign the newly cleared disk to Disk 10.

As Disk 4 was getting rebuilt (18%), I get a notice that Disk 10 has read errors and is disabled.

Thankfully I have a dual parity setup. I think I'll be OK, because of this (unless another disk gets disabled).

Question is - what do I do? Shut down, stop disk 4's rebuild and have unraid rebuild disk 4 AND disk 10 (with the newly cleared drive)? Or wait for disk 4's rebuild to finish?

Do I allow that parity check to finish after the rebuild, or immediately stop, unassign disk 10, and reassign it with the newly cleared drive?

I'm at a precarious position, obviously - want the safest course of action to mitigate the situation...

FWIW, I'm on 6.6.6

trurl · February 18, 2019

Diagnostics?

axeman · February 18, 2019

29 minutes ago, trurl said:

Diagnostics?

will try again ...it never completed.

axeman · February 18, 2019

1 hour ago, trurl said:

Diagnostics?

it's been a while.. is there anything in particular I should post? the diagnostics download doesn't seem to complete.

smartctl on the two failed drives? syslog?

Thanks again for your help

axeman · February 18, 2019

On 2/17/2019 at 9:09 PM, trurl said:

Diagnostics?

Third time is a charm!

Edited February 19, 2019 by axeman
Removed old diag file... new post with updated diags

JorgeB · February 18, 2019

Since you're almost half way done I would let the rebuild finish, disk10 dropped offline, possibly from a bad cable, disk4 has a very high number of CRC errors, also likely got disabled because of a bad cable/connection.

axeman · February 18, 2019

3 hours ago, johnnie.black said:

Since you're almost half way done I would let the rebuild finish, disk10 dropped offline, possibly from a bad cable, disk4 has a very high number of CRC errors, also likely got disabled because of a bad cable/connection.

Thanks.. should I expect some sort of corruption, since we're emulating two disks, or did dual parity prevent this from going south.

I agree that it's probably related to controller/cabling/power. Disks seem OK otherwise.

Should I run a parity check after the rebuild, or start rebuild of disk10 and then run a parity check?

JorgeB · February 18, 2019

20 minutes ago, axeman said:

should I expect some sort of corruption, since we're emulating two disks, or did dual parity prevent this from going south.

There shouldn't be any, though your syslog is hard to analyze because it's spammed with USB errors, but as long as there is nothing like this in the syslog during the rebuild you're fine:

md: recovery thread: multiple disk errors
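A rough sketch of how that check could be scripted against a saved copy of the syslog. The helper name is hypothetical, and the only thing taken from this thread is the message text itself; where the syslog lives on a given system is an assumption you would need to verify.

```python
# Message text quoted in the post above.
MULTI_ERR = "md: recovery thread: multiple disk errors"


def find_multiple_disk_errors(syslog_text):
    """Return every syslog line that contains the multiple-disk-errors message.

    An empty list means the message was not found in the text given.
    """
    return [line for line in syslog_text.splitlines() if MULTI_ERR in line]


# Example usage (the path is an assumption, adjust for your system):
# with open("/var/log/syslog", errors="replace") as fh:
#     hits = find_multiple_disk_errors(fh.read())
# print(len(hits), "matching lines")
```

An empty result only says the message is absent from the text that was scanned, so it is worth making sure the file covers the whole rebuild.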

24 minutes ago, axeman said:

Should I run a parity check after the rebuild, or start rebuild of disk10 and then run a parity check?

I would rebuild disk10 first.

axeman · February 19, 2019

12 hours ago, johnnie.black said:
There shouldn't be any, though your syslog is hard to analyze because it's spammed with USB errors, but as long as there is nothing like this in the syslog during the rebuild you're fine:
md: recovery thread: multiple disk errors
I would rebuild disk10 first.

Disk 4 had some VM HDs on it - I booted up the VM and scanned, and everything seemed OK. I'm currently rebuilding Disk10 on the newly cleared drive. I'll probably put the original Disk10 into the cache pool.

Thanks for talking me through it.

axeman · February 19, 2019

And another disk shows read errors ... but is not disabled, so that's good. Attached is new diagnostics.

So clearly, I have either a cabling, power or controller issue. It'll be fun trying to narrow down where the failure is. Incidentally, I haven't had any errors until going to 6.6.6 - maybe it's more diligent in finding these?

rigel-diagnostics-20190219-1120.zip

JorgeB · February 19, 2019

Disk5 errors look like an actual disk problem.

You're also having issues with the cache pool:

Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sds1 errs: wr 1472028, rd 1516462, flush 8, corrupt 0, gen 0
Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sdv1 errs: wr 4998, rd 2, flush 0, corrupt 0, gen 0

These are hardware errors, you should run a scrub and check all errors are corrected, see here for more info:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
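For anyone wanting to pull those per-device counters out of a syslog automatically, here is a minimal sketch. It assumes the kernel line format is exactly as in the two lines quoted above; the function name is hypothetical.

```python
import re

# Matches the "bdev <device> errs: wr N, rd N, flush N, corrupt N, gen N"
# part of the BTRFS kernel lines quoted above.
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_dev_errors(log_text):
    """Map each device path to its error counters.

    If a device appears more than once, the last occurrence wins.
    """
    result = {}
    for match in BTRFS_ERRS.finditer(log_text):
        result[match.group("dev")] = {
            name: int(match.group(name)) for name in COUNTERS
        }
    return result


def devices_with_errors(stats):
    """Return the device paths that have at least one non-zero counter."""
    return sorted(dev for dev, c in stats.items() if any(c.values()))
```

Fed the two lines above, this reports both `/dev/mapper/sds1` and `/dev/mapper/sdv1` as having non-zero counters.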

axeman · February 19, 2019

23 minutes ago, johnnie.black said:
Disk5 errors look like an actual disk problem.

You're also having issues with the cache pool:
Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sds1 errs: wr 1472028, rd 1516462, flush 8, corrupt 0, gen 0
Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sdv1 errs: wr 4998, rd 2, flush 0, corrupt 0, gen 0
These are hardware errors, you should run a scrub and check all errors are corrected, see here for more info:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

Thanks for taking the time to go through it.

Strange - the cache pool had a device that got the click of death about 6 months ago. I removed it. I will run a scrub on the pool. I know one of the destination disks has been full (so my mover isn't moving). Could the errors here be because of that?

I'll read through your link and see what it turns up.

Unraid has not kicked Disk 5 out of the array yet - should I run a parity check? or test the recently removed Disk10 and use that as a replacement for disk5?

JorgeB · February 19, 2019

23 minutes ago, axeman said:

I know one of the destination disks has been full (so my mover isn't moving). Could the errors here be because of that?

No, those are hardware read/write errors, though they could be old, link has the command to reset them.

24 minutes ago, axeman said:

Unraid has not kicked Disk 5 out of the array yet - should I run a parity check?

You can run a non correcting check or run an extended SMART test on disk5.

axeman · February 19, 2019

51 minutes ago, johnnie.black said:

No, those are hardware read/write errors, though they could be old, link has the command to reset them.

You can run a non correcting check or run an extended SMART test on disk5.

Thank you! I ran the scrub, and results look awful.


scrub status for 8844a357-4626-49a0-a8f2-797ad43999be
	scrub started at Tue Feb 19 13:39:12 2019 and was aborted after 00:25:00
	total bytes scrubbed: 67.70GiB with 8854316 errors
	error details: read=8854312 super=3 verify=1
	corrected errors: 78, uncorrectable errors: 8854235, unverified errors: 0

I didn't abort it; I'm guessing it self-aborted?

I'm going to run the extended test on D5. I appreciate your input!
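The counters in the scrub output above can be pulled out and compared with a few lines of Python. This is only a sketch that assumes the status text looks like the output pasted in this post (other btrfs-progs versions may print it differently); the function names are hypothetical.

```python
import re


def parse_scrub_status(status_text):
    """Extract the error counters from scrub status text shaped like the post above."""
    total = re.search(r"with (\d+) errors", status_text)
    corrected = re.search(r"corrected errors: (\d+)", status_text)
    uncorrectable = re.search(r"uncorrectable errors: (\d+)", status_text)
    if not (total and corrected and uncorrectable):
        raise ValueError("unrecognized scrub status format")
    return {
        "total": int(total.group(1)),
        "corrected": int(corrected.group(1)),
        "uncorrectable": int(uncorrectable.group(1)),
    }


def uncorrectable_fraction(stats):
    """Share of reported errors that could not be corrected (0.0 if no errors)."""
    if stats["total"] == 0:
        return 0.0
    return stats["uncorrectable"] / stats["total"]
```

For the output above that gives 78 corrected against 8854235 uncorrectable, i.e. nearly every reported error could not be repaired.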

axeman · February 19, 2019

AND we have smart errors on Disk 5 as you predicted.

rigel-smart-20190219-1500 DISK 5.zip

JorgeB · February 19, 2019

Disk5 needs to be replaced and cache pool recreated, backup everything you can (or need) from the pool, re-format and restore data.

axeman · February 20, 2019

50 minutes ago, johnnie.black said:

Disk5 needs to be replaced and cache pool recreated, backup everything you can (or need) from the pool, re-format and restore data.

Well that's certainly not what I expected when I created this thread!

Okay. So I'm going to run an extended SMART test on Disk 10... that's the only spare I have ATM. It's got a bunch of reallocated sectors though, so it might not be the best candidate.

I guess the question would be what's in a worse state, Disk 10 or Disk 5 (while I get a replacement ordered, delivered and installed).

The cache pool is the ONLY part of the server that I don't back up, since I always assumed it would only hold data in flight.

JorgeB · February 20, 2019

6 hours ago, axeman said:

I guess the question would be what's in a worse state, Disk 10

Looks like the diags on this thread don't have the old disk10, post the SMART report.

axeman · February 20, 2019

7 hours ago, johnnie.black said:

Looks like the diags on this thread don't have the old disk10, post the SMART report.

oh weird... well, best i can get is the short test. the long test doesn't get past 10%...

I'm unbalancing some data around to get the data off Cache ... and boy are you right with that one too.. the HD light is solid on, so definitely HW failure.

just glad this is an Unraid array, otherwise, I would not be so zen about it all.

TOSHIBA_DT01ACA300_X3G7E59KS-20190220-0941.txt

JorgeB · February 20, 2019

The Toshiba has a failing now attribute (the number of reallocated sectors) and also many CRC errors, likely resulting from a bad SATA cable, but without the extended SMART test result it's difficult to say if it's better or worse than the other.
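As a rough illustration of how a saved SMART report could be checked for flagged attributes, here is a sketch that assumes the usual smartctl attribute table layout (a header row starting with `ID# ATTRIBUTE_NAME`, ten whitespace-separated columns, and `-` in the WHEN_FAILED column for attributes that are not flagged). The function names are hypothetical, and nothing here is taken from the attached report.

```python
def parse_smart_attributes(report_text):
    """Parse the vendor attribute table of a smartctl report into a dict by name.

    Assumes the table starts at a header line beginning with
    'ID# ATTRIBUTE_NAME' and ends at the first blank line after it.
    """
    attrs = {}
    in_table = False
    for line in report_text.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True
            continue
        if in_table and not fields:
            break  # blank line ends the attribute table
        if not in_table or len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": fields[3],
            "worst": fields[4],
            "thresh": fields[5],
            "when_failed": fields[8],
            "raw": " ".join(fields[9:]),  # raw value may contain spaces
        }
    return attrs


def flagged_attributes(attrs):
    """Names of attributes whose WHEN_FAILED column is anything other than '-'."""
    return sorted(n for n, a in attrs.items() if a["when_failed"] != "-")
```

The raw value of `UDMA_CRC_Error_Count` can then be compared between reports to see whether CRC errors are still climbing after a cable swap.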

axeman · February 23, 2019

On 2/20/2019 at 10:18 AM, johnnie.black said:

The Toshiba has a failing now attribute (the number of reallocated sectors) and also many CRC errors, likely resulting from a bad SATA cable, but without the extended SMART test result it's difficult to say if it's better or worse than the other.

Thanks again - my original reply to you didn't get posted.

I have two drives en route. A parity check (non-correcting) is running. I'm going to pull these dead/dying drives out and put them in my Unraid bench setup to keep diagnosing. I don't want to risk data loss on the main array.

I appreciate your assistance throughout this whole ordeal.

axeman · February 28, 2019

A bit of an update....

I put the cache drive and the former Disk 10 into my other array and ran some pre-clears. Believe it or not, the cache drive (Samsung 1TB) ran through and passed. Disk 10 faltered... there's definitely something wrong with it. It's going, but VERY slowly: reads are at about 18-20MB/s on the same slots where the Samsung was cruising at 100+MB/s.

Meanwhile, back on the main array, Disk 5 (WDEARS) has tons of read errors. I have the two new drives pre-clearing, but they are the largest in the array and will end up as parity, so they won't help here. For the moment, I'm reading data off the disk and moving it to another array disk. Errors are at 39K and rising. I'm just hoping that I can move most of the data off before the drive dies.

Perhaps I'll remove it from the array (shrink), wait for the disks to be pre-cleared, and re-expand.

I know parity doesn't need a pre-clear; the reason I'm running them is that the drives are shucked WDs, and I want to make sure they are good in case I need warranty support, etc.
