Drive red balled and Parity is showing Data is Invalid

camprman · August 19, 2016

I've been very lucky over the years, but my luck has finally run out. I knew I had a drive failing, but being on a fixed income I had to wait to buy a new drive. The drive has now failed, and without looking or thinking I bought two new 4TB drives. Right now my parity drive is 3TB and the drive that has gone down is also 3TB.

From all that I've read, it appears that since I didn't think before I ordered, my only option is to do a swap-disable. Right now the emulated drive is functioning as expected, but I just don't have the room to copy everything off before doing the swap. I'm more than a little nervous, and I wonder if there is anything I can do, short of copying, before doing the swap-disable?

I suppose you could say I'm a little overwhelmed at the moment.

JorgeB · August 19, 2016

And why do you want to copy before the swap? The swap-disable procedure doesn't delete any data.

camprman · August 19, 2016

Just my fear of losing files. I'm not afraid that the swap-disable feature will delete data so much as I'm worried about Mr. Murphy.

JorgeB · August 19, 2016

Well, in that case your only option is to copy your data (assuming you don't have backups). You can use the extra disk you've got and assign it temporarily to the cache slot (or use the Unassigned Devices plugin).

trurl · August 19, 2016

Are you absolutely sure the drive has failed? Go to Tools - Diagnostics and post the complete diagnostics zip.

camprman · August 19, 2016

Here you go. The device is sdl.

tower-diagnostics-20160819-1805.zip

John_M · August 20, 2016

Aug 13 10:57:16 Tower emhttp: ST3000DM001-1CH166_Z1F303M7 (sdl) 2930266584

I don't see a SMART report for this disk but there are lots of errors in the syslog such as this:

Aug 17 12:43:24 Tower kernel: blk_update_request: critical medium error, dev sdl, sector 2446590216

so it looks pretty sick.
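
If you want to see how many of these a disk has racked up, here is a minimal Python sketch that counts them per device. The regular expression is modeled only on the single syslog line quoted above, and the function name `count_medium_errors` is made up for illustration, so treat it as a starting point rather than anything official.

```python
import re
from collections import Counter

# Pattern modeled on the one kernel line quoted in this thread;
# other kernel versions may word the message differently.
MEDIUM_ERROR = re.compile(
    r"blk_update_request: critical medium error, "
    r"dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def count_medium_errors(lines):
    """Return a Counter mapping device name to its number of medium errors."""
    counts = Counter()
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Aug 17 12:43:24 Tower kernel: blk_update_request: "
        "critical medium error, dev sdl, sector 2446590216",
        "Aug 18 02:19:18 Tower kernel: CPU5: Core temperature above "
        "threshold, cpu clock throttled (total events = 1559278)",
    ]
    for dev, total in count_medium_errors(sample).items():
        print(dev, total)
```

To run it against a real log, pass an open file object for the syslog extracted from the diagnostics zip instead of the sample list.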

You also have an overheating CPU. Your syslog has many instances like this:

Aug 18 02:19:18 Tower kernel: CPU5: Core temperature above threshold, cpu clock throttled (total events = 1559278)

Aug 18 02:19:18 Tower kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 1559275)

Aug 18 02:19:18 Tower kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 3700436)

You need to check the fan and blow the dust out of the heatsink.
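
To check whether the throttling stops after cleaning, something like the following Python sketch could pull the highest "total events" counter per CPU out of the syslog. The pattern is based only on the three lines quoted above, and `max_throttle_events` is a hypothetical helper name, not part of any tool.

```python
import re

# Pattern modeled on the throttle messages quoted in this thread.
THROTTLE = re.compile(
    r"(?P<cpu>CPU\d+): (?P<kind>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def max_throttle_events(lines):
    """Return {(cpu, kind): highest 'total events' value seen in lines}."""
    highest = {}
    for line in lines:
        match = THROTTLE.search(line)
        if match:
            key = (match.group("cpu"), match.group("kind"))
            events = int(match.group("events"))
            highest[key] = max(highest.get(key, 0), events)
    return highest
```

If the counters keep climbing in a fresh syslog after the heatsink has been cleaned, the cooling problem is probably still there.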

JorgeB · August 20, 2016

Like John pointed out, disk10 dropped offline and there's no SMART report. If you reboot it should come back online and you could grab a SMART report, but judging by the type of errors it really looks like a bad disk.

A few more observations:

When disk10 failed there was an error writing to super.dat. You should stop the array so the file gets recreated; if there's a power cut or unexpected server reboot before then, you'll lose all disk assignments, which is especially bad when there's a disabled disk.

There was one read error on each of disks 8 and 9. Since it happened just after a controller reset, I'd venture a guess that the reset was the cause and the disks are fine; if RobJ sees this maybe he'll know for sure. SMART for disk8 looks fine. Disk9 had some issues in the past, and given the known high failure rate of these disks I would replace it at the first sign of trouble.

Because of those read errors, both disks 8 and 9 were remounted read-only, so you'll want to run reiserfsck on both.
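
For reference, a small Python sketch that builds the check commands. It assumes the array is started in maintenance mode and that array disks appear as /dev/mdN (so disk8 is /dev/md8); neither detail is stated in this thread, so verify both on your own system first. The helper name `reiserfsck_check_command` is made up, and the sketch only prints the commands without running them.

```python
def reiserfsck_check_command(disk_number):
    """Build a read-only check command for one array disk.

    Assumption: array disk N is exposed as /dev/mdN.
    """
    if disk_number < 1:
        raise ValueError("array disk numbers start at 1")
    # --check only reports problems; it does not modify the filesystem.
    return ["reiserfsck", "--check", f"/dev/md{disk_number}"]


if __name__ == "__main__":
    for number in (8, 9):
        print(" ".join(reiserfsck_check_command(number)))
```

Only move on to a repair option if the check output explicitly tells you to.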

camprman · August 22, 2016

Thank you johnnie.black. That would have been the absolute worst. I ran reiserfsck on both disk8 and disk9; neither showed any corruption or needed a rebuild. New drives should be here tomorrow.
