Drive red-balled and parity is showing "Data is Invalid"



I've been very lucky over the years, but my luck has finally run out. I knew I had a drive failing, but being on a fixed income, I had to wait to buy a new one. The drive has now failed, and without looking or thinking I bought two new 4 TB drives. Right now my parity drive is 3 TB and the drive that has gone down is also 3 TB.

 

From all that I've read, it appears that since I didn't think before I ordered, and the new drives are larger than my parity drive, my only option is to do a swap-disable. Right now the emulated drive is functioning as expected, but I just don't have the room to copy everything off before doing the swap. I'm more than a little nervous, and wonder if there is anything I can do, short of copying, before doing the swap-disable?

 

I suppose you could say I'm a little overwhelmed at the moment.


Aug 13 10:57:16 Tower emhttp: ST3000DM001-1CH166_Z1F303M7 (sdl) 2930266584

 

I don't see a SMART report for this disk, but there are lots of errors in the syslog, such as this:

 

Aug 17 12:43:24 Tower kernel: blk_update_request: critical medium error, dev sdl, sector 2446590216

 

so it looks pretty sick.
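
If you want a rough idea of how many of these errors there are per device, here is a minimal Python sketch. It assumes the syslog lines look exactly like the one quoted above; the function name `count_medium_errors` and the `sample_disk_lines` list are made up for illustration, and in practice you would feed it the lines of your own syslog file.

```python
import re
from collections import Counter

# Matches the kernel message format quoted in this thread (assumed to be stable).
MEDIUM_ERROR = re.compile(
    r"blk_update_request: critical medium error, "
    r"dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

def count_medium_errors(lines):
    """Return a Counter of critical medium errors keyed by device name."""
    counts = Counter()
    for line in lines:
        match = MEDIUM_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts

# Two lines taken from the posts above; only the second is a medium error.
sample_disk_lines = [
    "Aug 13 10:57:16 Tower emhttp: ST3000DM001-1CH166_Z1F303M7 (sdl) 2930266584",
    "Aug 17 12:43:24 Tower kernel: blk_update_request: critical medium error, dev sdl, sector 2446590216",
]

for dev, n in count_medium_errors(sample_disk_lines).items():
    print(f"{dev}: {n} critical medium error(s)")
```

A large count concentrated on one device points at that disk rather than at a cable or controller, though a SMART report is still the better evidence.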

 

You also have an overheating CPU. Your syslog has many instances like this:

 

Aug 18 02:19:18 Tower kernel: CPU5: Core temperature above threshold, cpu clock throttled (total events = 1559278)

Aug 18 02:19:18 Tower kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 1559275)

Aug 18 02:19:18 Tower kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 3700436)

 

You need to check the fan and blow the dust out of the heatsink.
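
To check whether the throttling stops after cleaning, you could compare the "total events" counters before and after. The following is only a sketch, assuming the kernel messages keep the format shown above; `latest_throttle_totals` and `sample_cpu_lines` are hypothetical names used for illustration.

```python
import re

# Matches the thermal throttle messages quoted in this thread.
THROTTLE = re.compile(
    r"kernel: (?P<cpu>CPU\d+): (?P<kind>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)

def latest_throttle_totals(lines):
    """Return the highest 'total events' value seen per (cpu, kind) pair."""
    totals = {}
    for line in lines:
        match = THROTTLE.search(line)
        if match:
            key = (match.group("cpu"), match.group("kind"))
            totals[key] = max(totals.get(key, 0), int(match.group("total")))
    return totals

# The three lines quoted above.
sample_cpu_lines = [
    "Aug 18 02:19:18 Tower kernel: CPU5: Core temperature above threshold, cpu clock throttled (total events = 1559278)",
    "Aug 18 02:19:18 Tower kernel: CPU1: Core temperature above threshold, cpu clock throttled (total events = 1559275)",
    "Aug 18 02:19:18 Tower kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 3700436)",
]

for (cpu, kind), total in sorted(latest_throttle_totals(sample_cpu_lines).items()):
    print(f"{cpu} ({kind}): {total} throttle events so far")
```

If the totals stop climbing after the heatsink is cleaned, the cooling problem is likely fixed.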

 


As John pointed out, disk10 dropped offline and there's no SMART report. If you reboot, it should come back online and you can grab a SMART report, but judging by the type of errors it really looks like a bad disk.

 

A few more observations:

 

When disk10 failed there was an error writing to super.dat. You should stop the array so the file is recreated; otherwise, if there's a power cut or an unexpected server reboot, you'll lose all disk assignments, which is especially bad when there's a disabled disk.

 

There was one read error on each of disks 8 and 9. Since it happened just after a controller reset, I'd venture a guess that the reset was the cause and the disks are fine; if RobJ sees this, maybe he'll know for sure. SMART for disk8 looks fine. disk9 had some issues in the past, and given the known high failure rate of these disks I would replace it at the first sign of trouble.

 

Because of those read errors, both disks 8 and 9 were remounted read-only, so you'll want to run reiserfsck on both.
