[SOLVED] Data-rebuilt disk becomes red ball once parity check starts

funhur · September 14, 2012

Hi, my disk4 (sdg) was red-balled during a mover operation.

From my research, I should trust my parity drive, since the mover completed after the disk was disabled.

So I took it out of the array and checked the cable connections.

Started the tower up; the disk was still unassigned in the array config.

Proceeded with a preclear operation to see if there was any problem with the disk. The report looks normal to me.

Proceeded to assign it back to the array and start a data rebuild.

The data rebuild completed successfully.

Then I proceeded with a parity check, and the disk became red-balled almost immediately.

I am using 5.0-rc6-r8168-test, and disk 4 is connected to a SM AOC-SAS2LP-MV8.

The disk is a Samsung F4 2TB.

I'd appreciate it if you could advise what could be wrong. Thanks in advance.

Attaching the preclear reports and syslog.

preclear_reports_S2H7J90B203119_2012-09-13.zip

syslog.20120914.txt

kenoka · September 14, 2012

Your syslog doesn't seem to say a whole lot. It looks like a report taken just after your most recent reboot. What we'd need to see is the syslog from when the failure occurred.

Whenever a drive is questionable, you should probably run a SMART check. Run that and post the results please.

On your preclear run, there are a couple of worrying items. "Program_Fail_Cnt_Total" has a crazy high number: 16937351. After some internet snooping, it looks like a common complaint for this drive. I'm not sure how much of a worry it actually is. The second is "G-Sense Error Rate", which should probably not be showing up at all in a desktop machine, except in an earthquake. However there are no reallocated sectors, or pending sectors, or UDMA CRC errors.

funhur · September 14, 2012

@kenoka

The failure you referred to is when it red-balled during the mover operation?

Let me check for the older syslogs. I had shut down the tower to check the cable connections.

Noted the "G-Sense Error Rate".

Will do a SMART check when I get home later.

thanks

dgaschk · September 14, 2012

Both G-Sense Error Rate and Program_Fail_Cnt_Total are unchanged from their initial VALUEs of 100. The raw numbers have meaning only to Samsung and should not be used to determine drive health.

Run a long SMART test.
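
The point about normalized VALUEs can be checked mechanically rather than by eye. The sketch below is only an illustration: the function name attrs_at_or_below_threshold is made up here, and it assumes the usual ten-column attribute table printed by smartctl -A (ID, name, flag, VALUE, WORST, THRESH, type, updated, when failed, raw value). It lists attributes whose normalized VALUE has fallen to or below a non-zero THRESH and ignores the raw numbers entirely.

```shell
# Hypothetical helper: reads `smartctl -A` style text on stdin and prints
# every attribute whose normalized VALUE is at or below its THRESH.
# Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
attrs_at_or_below_threshold() {
    # Attributes with a THRESH of 000 are skipped: they are informational
    # and can never trip a threshold failure.
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        printf "%s VALUE=%d THRESH=%d\n", $2, $4, $6
    }'
}

# Possible usage on the server (device name taken from this thread):
#   smartctl -A /dev/sdg | attrs_at_or_below_threshold
```

No output means no attribute has crossed its threshold. That says nothing about pending or reallocated sector counts, which still need to be read from the raw column.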

funhur · September 14, 2012

I can't find any previous syslog in /var/log; the log file looks to be overwritten every time the tower reboots. Are they kept in other directories?

Just started the long SMART test.

kenoka · September 14, 2012

No, you need to capture the current syslog prior to shutdown. It restarts with every boot.
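
A minimal sketch of capturing the log before a shutdown follows. The function name save_syslog is invented for this example, and the defaults are assumptions: /var/log/syslog as the live log (the directory mentioned above) and /boot as the flash drive that survives a reboot. Adjust both if your setup differs.

```shell
# Hypothetical helper: copy the live syslog to persistent storage under a
# timestamped name, then print the path of the copy.
# Assumed defaults: /var/log/syslog is the live log, /boot is the flash drive.
save_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest_dir="${2:-/boot}"
    local dest="$dest_dir/syslog-$(date +%Y%m%d-%H%M%S).txt"
    cp "$src" "$dest" && echo "$dest"
}

# Possible usage, run from the console right after the disk is disabled
# and before powering down:
#   save_syslog
```

Running it as soon as the error appears matters more than the exact destination, since the in-memory log is gone after the next boot.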

funhur · September 14, 2012

Attaching the long SMART results. Is this normal? I see some read errors towards the end, but the overall health check passed.

thanks

smart-sdg-rpt.txt

dgaschk · September 15, 2012

The long test failed:

# 1  Extended offline    Completed: read failure       90%      2300         51264

Joe L. · September 15, 2012

Quoting dgaschk: "The long test failed:"

# 1  Extended offline    Completed: read failure       90%      2300         51264

It would fail if the disk spun down during the test. Did you disable the spin-down feature of unRAID while the test was in progress?

funhur · September 15, 2012

I did not disable spin-down; however, I took the disk out of the array before the long test. Does this exclude the spin-down factor?

So does the failed test mean I have a spoilt disk?

thanks

Joe L. · September 15, 2012

Quoting funhur: "I did not disable spin-down; however, I took the disk out of the array before the long test. Does this exclude the spin-down factor?"

Probably.

Quoting funhur: "So does the failed test mean I have a spoilt disk?"

No, because there are no sectors pending re-allocation, nor any re-allocated. It means that one sector's contents did not match the checksum at the end of that sector, and it was apparently successfully re-written in place.

Joe L.

dgaschk · September 15, 2012

Repeat the test.
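
When repeating the test, it helps to compare the LBA of the first error between runs: the same LBA on every run suggests a repeatable bad spot rather than an interrupted test. The sketch below is one way to do that, not the only one. The function names are made up for this example; smartctl -t long and smartctl -l selftest are standard smartctl options, and the parser assumes the self-test log layout shown in this thread.

```shell
# Start an extended (long) self-test on the given device. The drive runs
# the test in the background; the command itself returns right away.
start_long_test() {
    smartctl -t long "$1"
}

# Hypothetical parser: reads `smartctl -l selftest` style text on stdin and
# prints each failed entry as "<lifetime hours> <LBA of first error>".
failed_selftests() {
    # Log rows start with "#" and a test number; the last two columns are
    # the power-on hours at test time and the LBA of the first error.
    awk '/^# *[0-9]+/ && /failure/ { print $(NF-1), $NF }'
}

# Possible usage (device name taken from this thread):
#   start_long_test /dev/sdg
#   ...wait for the test to finish, keeping the disk spun up...
#   smartctl -l selftest /dev/sdg | failed_selftests
```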

funhur · September 17, 2012

Hi

I repeated the test and it's the same result.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      2348         51264
# 2  Extended offline    Completed: read failure       90%      2300         51264
# 3  Short offline       Completed without error       00%      2258         -

Any

smart-sdg-rpt1.txt

funhur · September 17, 2012

I found the following, but my drive was manufactured in Feb 2011.

The article states that drives manufactured in December 2010 or later include the firmware patch.

So I guess I do not need this patch, right?

http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks

dgaschk · September 17, 2012

Check on the Seagate site (Seagate took over Samsung's hard drive business, so support for the F4 is handled there).

funhur · September 20, 2012

Hi, here's an update. It gave me false hope, but I now have the syslog for the error.

Checked that my F4 does not need firmware patching

Ran preclear on the drive

Ran the long SMART test <-- completed without error

Performed data rebuild

Ran the long SMART test <-- completed without error

Performed parity check NOCORRECT <-- completed without error

Rebooted the tower to check for any problems <-- array started well, no errors

Proceeded to copy some files over to the cache disk

Manually activated the mover script

Disk4 became red-balled and disabled

Any idea why it becomes red-balled? There are lots of write errors in the syslog.

smart-sdg-aft-preclear.txt

smart-sdg-long-aft-rebuilt.txt

syslog.20120920.txt.zip

dgaschk · September 20, 2012

What does a current SMART report show?

funhur · September 20, 2012

The SMART report cannot be run immediately after the error. Will reboot the tower and test again.

root@Tower:~# smartctl -a /dev/sdg
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

root@Tower:~# smartctl -a -T permissive /dev/sdg
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Short INQUIRY response, skip product id
Log Sense failed, IE page [scsi response fails sanity test]
defect list format 6 unknown
Grown defect list length=12078 bytes [unknown number of elements]
Error Counter logging not supported
Device does not support Self Test logging

funhur · September 20, 2012

Here's the long SMART report; as before, it stops without completing.

There is a difference this time: Current_Pending_Sector now shows 2. Does it mean the HDD is failing?

196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       2
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0

smart-sdg-0921.txt

dgaschk · September 21, 2012

This drive is problematic. Run a preclear cycle, or several, on it; if the pending sector count goes to zero and stays there, the disk should be OK. RMA is also an option.
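
To track whether the pending sector count really drops to zero and stays there, the raw value of attribute 197 can be pulled out after each preclear cycle. This is a rough sketch under the same assumption as the attribute rows posted above (ten columns, raw value last); the function name pending_sectors is made up for illustration.

```shell
# Hypothetical helper: reads `smartctl -A` style text on stdin and prints
# the raw Current_Pending_Sector count (attribute 197).
# Exits non-zero if the attribute is not present in the input.
pending_sectors() {
    awk '$1 == 197 && $2 == "Current_Pending_Sector" { print $10; found = 1 }
         END { exit !found }'
}

# Possible usage after each preclear cycle (device name from this thread):
#   smartctl -A /dev/sdg | pending_sectors
```

A count that returns to 0 and stays there across several cycles is the outcome described above; a count that keeps reappearing is a reasonable argument for the RMA.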

funhur · September 21, 2012

Thank you all for the help.

I will RMA the drive; far too much time has been spent on this.
