Failed drive while parity-rebuilding a different failed drive


axeman

Recommended Posts

Okay Gents... a doozy for you:

 

I had disk 10 show errors for a while, and it eventually got disabled. I got a replacement and pre-cleared it. As it was clearing, I got lots of UDMA CRC errors. I feel like this happens whenever I pre-clear a disk - I get random UDMA CRC errors.

 

Of course it was probably a power or controller issue (since the cable reseating, etc., I have not had a single UDMA CRC error). As the pre-clear was about to finish, disk 4 in the array got red-balled. No biggie - the pre-clear finished, I reseated all cables, shut down the array, unassigned disk 4, and reassigned it.
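
For anyone wanting to keep an eye on that counter, something like this from the console shows SMART attribute 199 (the device name is just a placeholder - use the one shown in the Unraid GUI):

# attribute 199 (UDMA_CRC_Error_Count) is cumulative and never resets, so note the value and watch whether it keeps climbing
smartctl -A /dev/sdX | grep -i crc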

 

The plan was to assign the newly cleared disk to Disk 10. 

 

As Disk 4 was getting rebuilt (18%), I got a notice that Disk 10 had read errors and was disabled.

 

Thankfully I have a dual-parity setup, so I think I'll be OK (unless another disk gets disabled).

 

Question is - what do I do? Shut down, stop Disk 4's rebuild, and have Unraid rebuild Disk 4 AND Disk 10 (with the newly cleared drive)? Or wait for Disk 4's rebuild to finish?

Do I allow the parity check to finish after the rebuild, or immediately stop, unassign, and reassign Disk 10 with the newly cleared drive?

 

I'm in a precarious position, obviously - I want the safest course of action to mitigate the situation...

 

FWIW, I'm on 6.6.6.

Link to comment
3 hours ago, johnnie.black said:

Since you're almost halfway done I would let the rebuild finish. Disk10 dropped offline, possibly from a bad cable; disk4 has a very high number of CRC errors and also likely got disabled because of a bad cable/connection.

Thanks... should I expect some sort of corruption, since we're emulating two disks, or did dual parity prevent this from going south?

 

I agree that it's probably related to controller/cabling/power. Disks seem OK otherwise. 

 

Should I run a parity check after the rebuild, or start rebuild of disk10 and then run a parity check?

Link to comment
20 minutes ago, axeman said:

should I expect some sort of corruption, since we're emulating two disks, or did dual parity prevent this from going south?

There shouldn't be any. Your syslog is hard to analyze because it's spammed with USB errors, but as long as there is nothing like this in the syslog during the rebuild you're fine:

md: recovery thread: multiple disk errors
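
A quick way to check for that (assuming the standard Unraid syslog location) is something like:

grep -i "multiple disk errors" /var/log/syslog    # no output means none were logged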

 

24 minutes ago, axeman said:

Should I run a parity check after the rebuild, or start rebuild of disk10 and then run a parity check?

I would rebuild disk10 first.

Link to comment
12 hours ago, johnnie.black said:

There shouldn't be any. Your syslog is hard to analyze because it's spammed with USB errors, but as long as there is nothing like this in the syslog during the rebuild you're fine:


md: recovery thread: multiple disk errors

 

I would rebuild disk10 first.

 

Disk 4 had some VM HDs on it - I booted up the VM and scanned, and everything seemed OK. I'm currently rebuilding Disk 10 onto the newly cleared drive. I'll probably put the original Disk 10 into the cache pool.

 

Thanks for talking me through it. 

Link to comment

Disk5 errors look like an actual disk problem.

 

You're also having issues with the cache pool:

Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sds1 errs: wr 1472028, rd 1516462, flush 8, corrupt 0, gen 0
Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sdv1 errs: wr 4998, rd 2, flush 0, corrupt 0, gen 0

These are hardware errors; you should run a scrub and check that all errors are corrected. See here for more info:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
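
The scrub can be started from the cache pool's page in the GUI, or from the console with something like this (assuming the pool is mounted at the usual /mnt/cache):

btrfs scrub start /mnt/cache     # runs in the background
btrfs scrub status /mnt/cache    # check progress and error counts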

 

Link to comment
23 minutes ago, johnnie.black said:

Disk5 errors look like an actual disk problem.

 

You're also having issues with the cache pool:


Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sds1 errs: wr 1472028, rd 1516462, flush 8, corrupt 0, gen 0
Feb 18 20:42:26 Rigel kernel: BTRFS info (device dm-15): bdev /dev/mapper/sdv1 errs: wr 4998, rd 2, flush 0, corrupt 0, gen 0

These are hardware errors; you should run a scrub and check that all errors are corrected. See here for more info:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

 

Thanks for taking the time to go through it.

 

Strange - the cache pool had a device that got the click of death about 6 months ago. I removed it. I will run a scrub on that. I know one of the destination disks has been full (so my mover isn't moving). Could the errors here be because of that?

I'll read through your link and see what it turns up.

 

Unraid has not kicked Disk 5 out of the array yet - should I run a parity check? Or test the recently removed Disk 10 and use that as a replacement for Disk 5?

Link to comment
23 minutes ago, axeman said:

I know one of the destination disks has been full (so my mover isn't moving). Could the errors here be because of that?

No, those are hardware read/write errors, though they could be old; the link has the command to reset them.
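
For reference, those counters are the btrfs device stats, and the reset command in the link is along the lines of (again assuming the pool is at /mnt/cache):

btrfs dev stats /mnt/cache       # show the per-device error counters
btrfs dev stats -z /mnt/cache    # print them and reset to zero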

 

24 minutes ago, axeman said:

Unraid has not kicked Disk 5 out of the array yet - should I run a parity check?

You can run a non-correcting check or run an extended SMART test on disk5.
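
The extended test can be started from the disk's page in the GUI, or from the console with something like this (the device name is a placeholder - use disk5's actual device):

smartctl -t long /dev/sdX        # start the extended (long) self-test
smartctl -l selftest /dev/sdX    # view the self-test log once it finishes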

 

Link to comment
51 minutes ago, johnnie.black said:

No, those are hardware read/write errors, though they could be old; the link has the command to reset them.

 

You can run a non-correcting check or run an extended SMART test on disk5.

 

Thank you! I ran the scrub, and the results look awful.

 


scrub status for 8844a357-4626-49a0-a8f2-797ad43999be
	scrub started at Tue Feb 19 13:39:12 2019 and was aborted after 00:25:00
	total bytes scrubbed: 67.70GiB with 8854316 errors
	error details: read=8854312 super=3 verify=1
	corrected errors: 78, uncorrectable errors: 8854235, unverified errors: 0

I didn't abort it; I'm guessing it self-aborted?

 

I'm going to run the extended test on Disk 5. I appreciate your input!

Link to comment
50 minutes ago, johnnie.black said:

Disk5 needs to be replaced and the cache pool recreated. Back up everything you can (or need) from the pool, re-format, and restore the data.

 

 

Well that's certainly not what I expected when I created this thread!

 

Okay. So I'm going to run an extended SMART test on Disk 10... that's the only spare I have ATM. It's got a bunch of reallocated sectors though - so it might not be the best candidate.

 

I guess the question would be which is in a worse state, Disk 10 or Disk 5 (while I get a replacement ordered, delivered, and installed).

 

The cache pool is the ONLY part of the server that I don't back up - since I always assumed it would just be data in flight.

 

 

Link to comment
7 hours ago, johnnie.black said:

Looks like the diags on this thread don't have the old disk10; post the SMART report.

Oh weird... well, the best I can get is the short test. The long test doesn't get past 10%...

 

I'm unbalancing some data around to get the data off the cache... and boy, are you right with that one too. The HD light is solid on, so it's definitely a HW failure.

 

Just glad this is an Unraid array; otherwise I would not be so zen about it all.

TOSHIBA_DT01ACA300_X3G7E59KS-20190220-0941.txt

Link to comment
On 2/20/2019 at 10:18 AM, johnnie.black said:

The Toshiba has a failing-now attribute for the number of reallocated sectors, and also many CRC errors, likely resulting from a bad SATA cable, but without the extended SMART test result it's difficult to say whether it's better or worse than the other.

Thanks again - my original reply to you didn't get posted.

 

I have two drives en route. A parity check (non-correcting) is running. I'm going to pull these dead/dying drives out and put them in my Unraid bench setup to keep diagnosing. I don't want to risk data loss on the main array.

 

I appreciate your assistance throughout this whole ordeal. 

Link to comment

A bit of an update....

 

I put the cache drive and the former Disk 10 into my other array and ran some pre-clears. Believe it or not, the cache drive (Samsung 1TB) ran through and passed. Disk 10 faltered... there's definitely something wrong with it. It's going, but VERY slowly - reads are at about 18-20 MB/s in the same slots where the Samsung was cruising at 100+ MB/s.

 

Meanwhile, back on the main array, Disk 5 (WD EARS) has tons of read errors. I have the two new drives pre-clearing (they are the largest in the array, so they will end up as parity), so that won't help right now. For the moment, I'm reading data off the disk and moving it to another array disk. Errors are at 39K and rising. I'm just hoping I can move most of the data off before the drive dies.

 

Perhaps I should remove it from the array (shrink), wait for the disks to be pre-cleared, and re-expand.

 

I know parity doesn't need a pre-clear; the reason I'm running these here is that they are shucked WD drives, and I want to make sure they are good in case I need warranty support, etc.

 

 

Link to comment
