[SOLVED] When to replace a drive?

February 13, 2012

I've been looking through the FAQs, so forgive me if I have missed this one.

I am currently building a new unRAID server, moving from a Windows 7 setup that mirrored my media in software using SyncBackSE. I moved to unRAID because pure mirroring for media is too expensive in the long run.

I have been slowly taking drives out of the old system, pre-clearing them, adding them to the array, copying the files over, and moving on. I have moved 7 of my 11 drives over without an issue so far.

However, today I got weird rsync errors that pointed to the new drive (drive 5, since sda and sdb are the parity/cache drives). I have taken Samba offline and run the commands from the wiki page "Check Disk Filesystems" for this issue. A tree recovery was required on 3 trees.

Checking SMART I do not see any obvious SMART related issues.

So how do I know when to replace a drive? Will the FS properly mark these areas bad now, or will data resettle?

I'm used to other Linux filesystems, but not the one unRAID uses.

Drives are so expensive right now.

February 13, 2012

Author

Here is the SMART report. It shows 2 bad blocks.

Will unRAID avoid writing to those bad blocks? I would assume the MBR would prevent this, but I'm not 100% sure.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       208819465
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       85
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always       -       5543713
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       10168
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       79
183 Runtime_Bad_Block       0x0032   098   098   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       65537
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   059   045    Old_age   Always       -       29 (Lifetime Min/Max 28/30)
194 Temperature_Celsius     0x0022   029   041   000    Old_age   Always       -       29 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   050   025   000    Old_age   Always       -       208819465
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       214619515788947
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       861873087
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       996653285

February 13, 2012

I'm new to this. Where on that report do you see two bad blocks?

EDIT: Never mind, I see it.

February 13, 2012

These are some of the key ones to look for.

SMART Attributes Data Structure revision number: 10
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

ANY current pending sectors mean there are sectors that could not be read safely and will be reassigned on the next write.

If this field is non-zero, there is a chance the drive could go offline during a drive reconstruction elsewhere in the array.

Reallocated sectors are counted after the fact, so you can get away with a few here. If the count keeps growing, the drive is probably in a marginal state and needs to be tested or RMA'ed. The same goes for 187 and 198.
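As a rough illustration of the rule of thumb above, here is a minimal Python sketch that scans SMART attribute rows and flags the four attributes mentioned. It assumes the rows look like the smartctl attribute table posted earlier in the thread (ID first, raw value in the tenth column). The names WATCHED, parse_attributes and review are made up for this example, and the warning texts are only a paraphrase of the advice here, not an official threshold.

```python
# Attributes singled out in this post, keyed by SMART ID.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_attributes(report):
    """Map SMART attribute ID -> raw value for each attribute row in the report."""
    raw_values = {}
    for line in report.splitlines():
        fields = line.split()
        # An attribute row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            raw_values[int(fields[0])] = int(fields[9])
    return raw_values


def review(report):
    """Return one warning per watched attribute whose raw value is non-zero."""
    raw_values = parse_attributes(report)
    warnings = []
    for attr_id, name in WATCHED.items():
        raw = raw_values.get(attr_id, 0)
        if raw == 0:
            continue
        if attr_id == 197:
            warnings.append(f"{name}={raw}: unreadable sectors pending, test the drive now")
        else:
            warnings.append(f"{name}={raw}: tolerable if stable, watch for growth")
    return warnings


# The four rows quoted in this post, all with a raw value of 0.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 100 100 036 Pre-fail Always  - 0
187 Reported_Uncorrect      0x0032 100 100 000 Old_age  Always  - 0
197 Current_Pending_Sector  0x0012 100 100 000 Old_age  Always  - 0
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age  Offline - 0
"""

print(review(SAMPLE) or "no watched attribute is non-zero")
```

Comparing the raw values from two reports taken some days apart is what actually tells you whether a count is growing; a single snapshot only shows where it stands.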

As for 183 Runtime_Bad_Block, I do not know this attribute.

Also, the SMART report will say PASSED or FAILED for overall health, and an attribute's WHEN_FAILED column will show FAILING_NOW if the drive's internal thresholds warn of a failure.

Sometimes you catch it early; other times there is a mechanical failure that even SMART cannot detect.

Also, if you want to be sure of the drive's health, stop the array.

Run smartctl -t long on the drive in question. Wait the appropriate amount of time, then check the SMART report.

If you do this while the array is active, emhttp may spin down the drive and interfere with the test.

The other choice is to change the spin-down time to never.
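For anyone who wants to script the steps above, here is a small Python sketch of the smartctl calls involved. The device path /dev/sdX is a placeholder, and the helper names long_test_commands and start_long_test are made up for this example. It only wraps the standard smartctl options -t long, -l selftest and -a; it does not stop the array or change spin-down settings for you.

```python
import subprocess


def long_test_commands(device):
    """Return the smartctl invocations for an extended self-test, in order."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended self-test
        ["smartctl", "-l", "selftest", device],  # later: read the self-test log
        ["smartctl", "-a", device],              # later: read the full SMART report
    ]


def start_long_test(device):
    """Start the test and return smartctl's output, which includes a time estimate."""
    start_cmd = long_test_commands(device)[0]
    result = subprocess.run(start_cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute the real device before running anything.
    for cmd in long_test_commands("/dev/sdX"):
        print(" ".join(cmd))
```

Run as-is it only prints the commands. Calling start_long_test on a real device starts the test, and the drive then runs it in the background until it finishes or is interrupted.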

February 13, 2012

Also, if these are old drives that have been in service a long time, I would suggest a badblocks test in destructive write mode first (i.e. before the preclear).

It is very thorough in testing the drive out. It does 4 passes with patterns that preclear does not use. It keeps track of the bad blocks and reports them if there are issues.
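To show what those 4 passes amount to, here is a toy Python sketch of a write-then-verify test run against an in-memory "device". It is not badblocks itself. The four byte patterns are the ones the badblocks man page lists for its -w mode, while BLOCK_SIZE, write_mode_test and StuckBlockDevice are assumptions made up for this example. On a real drive the equivalent would be something like badblocks -wsv /dev/sdX, which destroys all data on that drive.

```python
import io

# The four patterns badblocks writes in -w mode, per its man page.
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)
BLOCK_SIZE = 1024  # assumption: illustrative block size for this toy example


def write_mode_test(device, num_blocks):
    """Write each pattern to every block, read it back, and collect mismatching blocks."""
    bad_blocks = set()
    for pattern in PATTERNS:
        expected = bytes([pattern]) * BLOCK_SIZE
        device.seek(0)
        for _ in range(num_blocks):
            device.write(expected)
        device.seek(0)
        for block in range(num_blocks):
            if device.read(BLOCK_SIZE) != expected:
                bad_blocks.add(block)
    return sorted(bad_blocks)


class StuckBlockDevice(io.BytesIO):
    """Hypothetical fake device whose block 3 always reads back as zeros."""

    def read(self, size=-1):
        block = self.tell() // BLOCK_SIZE
        data = super().read(size)
        return bytes(len(data)) if block == 3 else data


healthy = io.BytesIO(bytes(8 * BLOCK_SIZE))
faulty = StuckBlockDevice(bytes(8 * BLOCK_SIZE))
print(write_mode_test(healthy, 8), write_mode_test(faulty, 8))
```

The point of using several patterns is visible in the fake device: a block stuck at zeros passes the 0x00 pass but fails the other three, so a single-pattern test could miss it.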

February 13, 2012

Author

Webo,

In other filesystems you can pass the bad-block info to the system and it will automatically stay clear of those bad blocks.

Is this possible with unRAID?

February 13, 2012

Yes, the file system will stay clear of them, but NO, it will not help. You would have to avoid those same blocks on EVERY drive, because the same block positions are read across the entire array of disks to calculate parity and to reconstruct drives that are being replaced or simulated.
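A small Python sketch may make the parity point concrete. It assumes a single parity drive computed as the XOR of the blocks at the same offset on every data drive. The three two-byte "drives" and the helper names xor_blocks and rebuild are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(parity_block, survivors):
    """Rebuild a missing drive's block from parity plus every surviving drive's block."""
    if any(block is None for block in survivors):
        raise IOError("a surviving drive returned a read error at this offset")
    return xor_blocks([parity_block, *survivors])


# One block at the same offset on each of three hypothetical data drives.
data = [b"\x11\x22", b"\x0f\xf0", b"\xa5\x5a"]
parity = xor_blocks(data)

# Drive 0 is being replaced: its block comes back from parity and drives 1 and 2.
print(rebuild(parity, [data[1], data[2]]) == data[0])

# If drive 2 cannot read that offset (None here), drive 0 cannot be rebuilt there,
# even if no file on drive 2 ever used that block.
try:
    rebuild(parity, [data[1], None])
except IOError as err:
    print("rebuild failed:", err)
```

The filesystem on drive 2 skipping the bad block changes nothing here, because the rebuild reads raw block positions, not files.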

February 13, 2012

Author

So I need to yank the drive in this case.

Okay.
