[SOLVED] Data-rebuilt disk becomes red ball once parity check starts



Hi, my disk4 (sdg) was red-balled during a mover operation.

From my research, I should trust my parity drive, since the mover completed after the disk was disabled.

 

So I took it out of the array and checked the cable connections.

Started the tower up; the disk was still unassigned in the array config.

Proceeded with a preclear operation to see if there was any problem with the disk. The report looks normal to me.

Assigned the disk back to the array and started a data rebuild.

The data rebuild completed successfully.

Then I started a parity check, and the disk became red-balled almost immediately.

 

I am using 5.0-rc6-r8168-test, and disk 4 is connected to an SM AOC-SAS2LP-MV8.

The disk is a Samsung F4 2TB.

 

I'd appreciate it if you could advise what could be wrong. Thanks in advance.

 

Attaching the preclear reports and syslog

preclear_reports_S2H7J90B203119_2012-09-13.zip

syslog.20120914.txt

Link to comment

Your syslog doesn't seem to say a whole lot. It looks like a report taken just after your most recent reboot. What we'd need to see is the syslog from when the failure occurred.

 

Whenever a drive is questionable, you should probably run a SMART check. Run that and post the results please.

 

On your preclear run, there are a couple of worrying items. "Program_Fail_Cnt_Total" has a crazy high number: 16937351. After some internet snooping, it looks like a common complaint for this drive. I'm not sure how much of a worry it actually is. The second is "G-Sense Error Rate", which should probably not be showing up at all in a desktop machine, except in an earthquake. However there are no reallocated sectors, or pending sectors, or UDMA CRC errors.

 

 

Link to comment

I did not disable spin down; however, I took the disk out of the array before the long test. Does this exclude the spin down factor?

Probably.

So does the failed test mean I have a spoilt disk?

thanks

No, because there are no sectors pending re-allocation, nor any re-allocated.  It meant that one sector's contents did not match the checksum at the end of that sector, and it was apparently successfully re-written in place.

 

Joe L.

 

 

Link to comment

Hi

Test repeated, and it's the same result.

 

SMART Self-test log structure revision number 1
Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       2348             51264
# 2  Extended offline    Completed: read failure       90%       2300             51264
# 3  Short offline       Completed without error       00%       2258             -
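
A rough sketch of how the self-test log above could be checked for failures that repeat at the same LBA. The rows are copied from this post. The regular expression assumes the column layout shown here (fields separated by two or more spaces), and the helper names `parse_selftest_log` and `repeated_failure_lbas` are made up for illustration; they are not part of smartctl or unRAID.

```python
import re
from collections import Counter

# Rows copied from the self-test log posted above.
SELFTEST_LOG = """\
# 1  Extended offline    Completed: read failure      90%      2348        51264
# 2  Extended offline    Completed: read failure      90%      2300        51264
# 3  Short offline      Completed without error      00%      2258        -
"""

# Assumed layout: "# N", test name, status, remaining %, lifetime hours, LBA.
# Test name and status may contain single spaces, so they are delimited by 2+ spaces.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        lba = match["lba"]
        entries.append({
            "num": int(match["num"]),
            "test": match["test"],
            "status": match["status"],
            "hours": int(match["hours"]),
            # "-" means the test reported no failing LBA.
            "lba": None if lba == "-" else int(lba),
        })
    return entries


def repeated_failure_lbas(entries):
    """Map each LBA that appears in more than one failed test to its count."""
    counts = Counter(e["lba"] for e in entries if e["lba"] is not None)
    return {lba: n for lba, n in counts.items() if n > 1}


entries = parse_selftest_log(SELFTEST_LOG)
print(repeated_failure_lbas(entries))  # {51264: 2}
```

Both extended tests stopping at LBA 51264 suggests the read failure is tied to one location on the disk rather than being random, though that alone does not say whether the cause is the disk, the cable, or the controller.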

 

Any

smart-sdg-rpt1.txt

Link to comment

Hi, here's an update. It gave me false hope, but I now have the syslog for the error.

 

Checked that my F4 does not need firmware patching

 

Ran preclear on the drive

Ran the long SMART test <-- completed without error

 

Performed data rebuild

Ran the long SMART test <-- completed without error

 

Performed parity check NOCORRECT <-- completed without error

Rebooted the tower to check for any problems <-- array started well, no errors

 

Proceeded to copy some files over to the cache disk

Manually activated the mover script

Disk4 became red-balled and disabled

 

Any idea why it becomes red-balled? There are lots of write errors in the syslog.

smart-sdg-aft-preclear.txt

smart-sdg-long-aft-rebuilt.txt

syslog.20120920.txt.zip

Link to comment

What does a current SMART report show?

Link to comment

The SMART report cannot be run immediately after the error. I will reboot the tower to test.

 

root@Tower:~# smartctl -a /dev/sdg

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Short INQUIRY response, skip product id

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

root@Tower:~# smartctl -a -T permissive /dev/sdg

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Short INQUIRY response, skip product id

Log Sense failed, IE page [scsi response fails sanity test]

defect list format 6 unknown

Grown defect list length=12078 bytes [unknown number of elements]

 

Error Counter logging not supported

Device does not support Self Test logging

 

Link to comment

Here's the long SMART report. As before, the test stopped without completing.

 

There is a difference this time. Does it mean the HDD is failing?

 

196 Reallocated_Event_Count 0x0032  252  252  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  100  100  000    Old_age  Always      -      2
199 UDMA_CRC_Error_Count    0x0036  200  200  000    Old_age  Always      -      0
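
A minimal sketch of how the three attribute rows above could be screened for the counters discussed earlier in the thread (reallocated sectors, pending sectors, UDMA CRC errors). The rows are copied from this post. The ten-column layout (ID, name, flag, value, worst, thresh, type, updated, when failed, raw value) is assumed from those rows. The `WATCHED` set and the "any non-zero raw value is worth a look" rule are illustrative assumptions, not an official pass/fail threshold.

```python
# Rows copied from the SMART attribute excerpt posted above.
SMART_ATTRIBUTES = """\
196 Reallocated_Event_Count 0x0032  252  252  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  100  100  000    Old_age  Always      -      2
199 UDMA_CRC_Error_Count    0x0036  200  200  000    Old_age  Always      -      0
"""

# Hypothetical watch list based on the counters mentioned in this thread.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(text):
    """Return {attribute_name: raw_value} for rows with a plain integer raw value."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            rows[fields[1]] = int(raw)
    return rows


def nonzero_watched(rows):
    """Return the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in rows.items() if name in WATCHED and raw > 0}


rows = parse_attributes(SMART_ATTRIBUTES)
print(nonzero_watched(rows))  # {'Current_Pending_Sector': 2}
```

On this excerpt, only Current_Pending_Sector is flagged, which matches the difference noticed in the new report.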

smart-sdg-0921.txt

Link to comment
