How do I use cache drive to replace a failed data drive remotely?

June 15, 2016

One problem solved and immediately I am faced with the next one.

I just got a notification that my unRAID array, which is all the way up in Berlin (500 miles away from me), is in a degraded state: disk 9 was marked as disabled after it produced 9 errors during a New Permissions operation.

Even though I do have a cache drive in my server, I am not really using it as such. It is not configured for its intended purpose; it is only an additional drive to copy data to and from during my usual file maintenance.

Since the cache drive is also a 3TB drive, just like my failed data disk, what is the correct procedure for making my cache drive the new data disk 9 without risking data loss?


June 15, 2016

Community Expert

- Stop the array
- Unassign the cache disk
- Assign the cache disk to the failed disk's slot
- Start the array to begin the rebuild
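The steps above work because the cache drive qualifies as a replacement: in unRAID a rebuild target must be at least as large as the disk it replaces, and no data disk may be larger than parity. Anything still stored on the cache drive is overwritten by the rebuild. A minimal sketch of that size check, assuming a bash shell on the server; `can_replace` is a hypothetical helper, not an unRAID command, and the device names in the comment are placeholders:

```shell
# Hypothetical helper (not part of unRAID): apply the size rule for a
# disk replacement. Arguments are sizes in bytes, for example as printed
# by `blockdev --getsize64 /dev/sdX` (device names are placeholders).
can_replace() {
  local old="$1" new="$2" parity="$3"
  if [ "$new" -lt "$old" ]; then
    echo "no: replacement is smaller than the failed disk"
  elif [ "$new" -gt "$parity" ]; then
    echo "no: replacement is larger than parity"
  else
    echo "ok"
  fi
}
```

With two identical 3TB drives, as in this thread, both checks pass trivially; the check matters when the spare is a different model.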


June 15, 2016

Author

Well, when you say it like that, Johnnie, you make it sound so easy... and it was! It's rebuilding now. I sometimes forget how brilliant the unRAID system is.

Thanks for the concise instructions!

Is there some way you know of to try to "fix" the failed drive remotely?


June 15, 2016

Community Expert

Post the diagnostics: Tools > Diagnostics


June 15, 2016

Author

Here they are.

I am amazed at how regularly these WD Red drives fail on me. I must have had about 10 of them go bad in the past 3-4 years, which is very frustrating. I thought they were meant to be particularly long-lasting in this type of server application.

unraid-diagnostics-20160615-1050.zip


June 15, 2016

I am amazed about how regularly these WD-Red drives fail on me...I must have had about 10 of them go bad on me in the past 3-4 years...very frustrating...I thought they are meant to be particularly long-lasting in these type of server applications.

Do they come from the same order? You might have a bad batch (or WD Red drives are crap, or you have bad luck, or some combination of those) :-\

I made it a point to never order the same drive from the same seller within a short period or in the same order. I read somewhere that drives from the same batch tend to have similar issues.


June 15, 2016

Author

I wish it were that obvious a reason, testdasi, but I've been ordering WD Green and Red drives for the past 15 years from all over the place (Amazon, B&H, CyberPort, EggeHead) in both the USA and Germany, with no discernible commonalities other than that they are all WD drives. So my guess is that it's a combination of bad quality and bad luck. What's worse, WD's warranty isn't international: they won't swap German drives in the USA or US drives in Germany, which makes it very difficult for someone like me who lives and works on two continents.

Which drives do you use or recommend? I might have to start looking elsewhere and slowly swap out drives, one by one, for another manufacturer's.


June 15, 2016

Community Expert

The failed disk dropped offline, so there's no SMART report. When the rebuild finishes, reboot the server and post new diagnostics, or just the SMART report for that disk.
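If the disk comes back after the reboot, its SMART report can also be saved from the console instead of through Tools > Diagnostics. A minimal sketch, assuming `smartctl` from smartmontools is on the PATH; `save_smart_report`, the device name and the output directory are placeholders, not values from the thread:

```shell
# Minimal sketch: save a full SMART report for one disk to a text file
# that can be attached to a forum post. Assumes smartctl is installed;
# the function name, device and output directory are hypothetical.
save_smart_report() {
  local dev="$1" outdir="$2"
  local out="$outdir/smart_$(basename "$dev").txt"
  # smartctl's exit code is a bitmask that can be non-zero even when a
  # report was printed (e.g. logged errors), so it is ignored here.
  smartctl -a "$dev" > "$out" || true
  echo "$out"
}
# Example (device name hypothetical):
#   save_smart_report /dev/sdd /boot
```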


June 15, 2016

Author

Will do, Johnnie... about 12 hrs to go for the rebuild.


June 15, 2016

which drives do you use/recommend? I might have to start looking elsewhere and slowly start swapping out drives, one by one, to another manufacturer.

I really don't want to go into brand recommendations, because the proverbial "YMMV" applies. My sample is also quite small, only about 10+ drives, of which the only one that failed was a WD Black, and that was because Amazon stupidly shipped it in a thin cardboard envelope with zero padding.

Personally, I have always put Hitachi as my first choice (which appears to be confirmed by Backblaze in 2015: <1% failure rate vs. 2.5% for WD and 3% for Seagate). But it is now owned by WD (rebranded as "HGST"), so who knows where things are heading.


June 15, 2016

Author

Points well taken, testdasi. I'm just getting tired of looking at a big stack of failed 1, 2 and 3TB drives from the past 15 years that all bear the WD label, many of which I couldn't return under warranty because sending them across the Atlantic is just not economical or timely. Time to find a new drive manufacturer with a better track record and an international warranty.


June 16, 2016

Author

OK, the rebuild has completed, parity is synced and the array has started again. But how do I now get a SMART report from the failed disk? I can't see it anywhere. Could I enable the cache feature and select the failed disk, so that I can at least access it via telnet and maybe run some check-disk operations on it? Or is there a better way to access it for potential repairs? (I am pretty sure there is.)


June 16, 2016

Community Expert

Did you reboot? If the disk is still offline, try to get someone to power cycle the server. If it's still offline after that, it's probably dead.


June 16, 2016

Author

No, I haven't rebooted yet, but will do so shortly and report back.


June 16, 2016

Author

OK, after a reboot the disk is showing again, and I added it as a cache drive for the moment (without restarting the array).

I'm trying to get the SMART test results, but the test has been running for 10 minutes already. I will post when/if it completes.

The Attributes tab produces the following results (excuse the formatting):

#    Attribute name            Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw value
1    Raw read error rate       0x002f  200    200    051        Pre-fail  Always   Never   4
3    Spin up time              0x0027  179    174    021        Pre-fail  Always   Never   6008
4    Start stop count          0x0032  100    100    000        Old age   Always   Never   486
5    Reallocated sector count  0x0033  200    200    140        Pre-fail  Always   Never   0
7    Seek error rate           0x002e  200    200    000        Old age   Always   Never   0
9    Power on hours            0x0032  084    084    000        Old age   Always   Never   12201 (1y, 4m, 22d, 9h)
10   Spin retry count          0x0032  100    100    000        Old age   Always   Never   0
11   Calibration retry count   0x0032  100    253    000        Old age   Always   Never   0
12   Power cycle count         0x0032  100    100    000        Old age   Always   Never   70
192  Power-off retract count   0x0032  200    200    000        Old age   Always   Never   15
193  Load cycle count          0x0032  200    200    000        Old age   Always   Never   470
194  Temperature celsius       0x0022  120    085    000        Old age   Always   Never   30
196  Reallocated event count   0x0032  200    200    000        Old age   Always   Never   0
197  Current pending sector    0x0032  200    200    000        Old age   Always   Never   1
198  Offline uncorrectable     0x0030  100    253    000        Old age   Offline  Never   0
199  UDMA CRC error count      0x0032  200    200    000        Old age   Always   Never   0
200  Multi zone error rate     0x0008  100    253    000        Old age   Offline  Never   0
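The SMART test that was still running in this post can also be started and checked from the console. A minimal sketch, assuming `smartctl` is available and the disk shows up as a block device; `start_selftest`, `selftest_log` and `/dev/sdd` are placeholders. A short test typically takes a few minutes, while an extended (`long`) test reads the whole surface and can take several hours on a 3TB drive:

```shell
# Minimal sketch, assuming smartctl (smartmontools) is on the PATH.
# Function names are hypothetical. Starts a drive self-test; the drive
# runs it in the background, so the command returns immediately.
start_selftest() {
  local dev="$1" kind="${2:-short}"
  case "$kind" in
    short|long) ;;
    *) echo "unsupported test type: $kind" >&2; return 1 ;;
  esac
  smartctl -t "$kind" "$dev"
}

# Print the self-test log, which lists finished tests and their results.
selftest_log() {
  smartctl -l selftest "$1"
}
# Example (device name hypothetical):
#   start_selftest /dev/sdd short; sleep 180; selftest_log /dev/sdd
```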


June 16, 2016

Community Expert

The disk has pending sector(s); that's why it was disabled. You can try preclearing it a few times and see whether the pending sectors go to 0 and stay there, but there's still a high probability of getting more in the future.
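The attribute that matters here is 197 (Current_Pending_Sector), together with 5 (Reallocated_Sector_Ct) and 198 (Offline_Uncorrectable). A minimal sketch that pulls their raw values out of `smartctl -A` output, assuming smartctl's usual ATA attribute table where RAW_VALUE is the 10th column; `watch_attrs` is a hypothetical name. Running it before and after each preclear cycle shows whether the pending count drops to 0 and stays there:

```shell
# Minimal sketch: read `smartctl -A` output on stdin and print the raw
# values of attributes 5, 197 and 198. Assumes the standard ATA table
# layout (RAW_VALUE is the 10th whitespace-separated column).
watch_attrs() {
  awk '
    $1 == 5 || $1 == 197 || $1 == 198 {
      print $1, $2, $10
      if ($10 + 0 > 0) bad = 1
    }
    END {
      if (bad) print "WARNING: non-zero reallocated, pending or uncorrectable count"
      else print "OK"
    }'
}
# Example (device name hypothetical):
#   smartctl -A /dev/sdd | watch_attrs
```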


June 16, 2016

Author

Yeah, I am prepared to wave that drive goodbye like all the other ones, but since I don't have physical access to the server, which may be the case for many weeks to come, I might as well try to get some non-critical use out of it.

What's the correct procedure to preclear this drive? Is there a telnet command that does this, and if so, how long does one preclear pass take?


June 16, 2016

Author

Just found the preclear plugin in Community Apps, installed it and the preclear_disk.sh script, and will see how far I can get with this. It looks like this process will take many days if I do the 3 cycles you recommend. Here's to trying!


June 16, 2016

Author

When opening the monitor pop-up window for this operation, I get this:

/boot/config/plugins/preclear.disk/preclear_disk.sh -c 1 /dev/sdd 2>/tmp/preclear.log

root@unRAID:/usr/local/emhttp# /boot/config/plugins/preclear.disk/preclear_disk.sh -c 1 /dev/sdd 2>/tmp/preclear.log

Sorry: Device /dev/sdd is busy.: 1

root@unRAID:/usr/local/emhttp#

Why is /dev/sdd busy, and how do I resolve this?


June 16, 2016

Community Expert

The old preclear script and plugin don't work on v6.2-beta; use the new beta:

http://lime-technology.com/forum/index.php?topic=39985.msg453938#msg453938
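Apart from the version mismatch, a quick way to rule out the most obvious reason for a "busy" device is to check whether the disk or one of its partitions is mounted. A minimal sketch; `is_mounted` is a hypothetical helper meant for whole-disk names like `/dev/sdd`, and a "not mounted" result does not mean nothing else is using the disk:

```shell
# Quick check, not a diagnosis: is the disk (or any numbered partition
# of it, e.g. /dev/sdd1) listed in the mount table? The mounts file is a
# parameter only so the function can be tried against a sample file.
is_mounted() {
  local dev="$1" mounts="${2:-/proc/mounts}"
  grep -Eq "^${dev}[0-9]* " "$mounts"
}
# Example (device name hypothetical):
#   is_mounted /dev/sdd && echo "mounted" || echo "not mounted"
```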


June 16, 2016

Author

Thanks so much for all your great pointers, Johnnie. It is now doing the preclear:

sdd WDC_WD30EFRX-68AX9N0_WD-WMC1T0098667 31 C 3 TB Pre-Read: 0% @ 141 MB/s (0:01:18)

I am starting with 1 cycle to see where I end up, and will do the other 2 cycles once I find out how long the first one takes... could be a loooong time.
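To get a feel for how long this could take: assuming each preclear cycle makes three full passes over the disk (pre-read, zeroing, post-read) at a constant average speed, the total follows directly from disk size and throughput. A rough sketch in bash/awk; `preclear_estimate` is a hypothetical name, and since drives slow down towards the inner tracks, the 141 MB/s seen at the start of the pre-read is an optimistic figure:

```shell
# Rough estimate, not a measurement. Assumes every preclear cycle makes
# three full passes over the disk (pre-read, zeroing, post-read) at one
# constant average speed; real throughput drops towards the inner tracks.
preclear_estimate() {
  local size_tb="$1" mb_per_s="$2" cycles="$3"
  awk -v tb="$size_tb" -v r="$mb_per_s" -v c="$cycles" 'BEGIN {
    pass_h = tb * 1e12 / (r * 1e6) / 3600   # hours for one full pass
    printf "per pass: %.1f h, per cycle: %.1f h, total: %.1f h\n", pass_h, 3 * pass_h, 3 * pass_h * c
  }'
}
```

For this disk, `preclear_estimate 3 141 3` prints `per pass: 5.9 h, per cycle: 17.7 h, total: 53.2 h`, i.e. a bit over two days for three cycles, and likely more in practice.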


June 16, 2016

Author

Hmm, the last thing I saw was something about a write failing because of a broken pipe, and this is what the status window now shows:

############################################################################################################################
# #
# unRAID Server Pre-Clear of disk /dev/sdd #
# Cycle 1 of 1, partition start on sector 64. #
# #
# Step 1 of 5 - Pre-read verification: [2:15:17 @ 369 MB/s] SUCCESS #
# Step 2 of 5 - Zeroing the disk: [0:00:03 @ 0 MB/s] SUCCESS #
# Step 3 of 5 - Writing unRAID's Preclear signature: SUCCESS #
# Step 4 of 5 - Verifying unRAID's Preclear signature: FAIL #
# #
############################################################################################################################
# Cycle elapsed time: 2:16:09 | Total elapsed time: 2:16:10 #
############################################################################################################################
# #
# S.M.A.R.T. Status #
# #
# ATTRIBUTE INITIAL STATUS #
# 5-Reallocated_Sector_Ct 0 - #
# 9-Power_On_Hours 12203 - #
# 194-Temperature_Celsius 31 - #
# 196-Reallocated_Event_Count 0 - #
# 197-Current_Pending_Sector 1 - #
# 198-Offline_Uncorrectable 0 - #
# 199-UDMA_CRC_Error_Count 0 - #
# #
############################################################################################################################
# SMART overall-health self-assessment test result: PASSED #
############################################################################################################################

--> FAIL: unRAID's Preclear signature not valid.

Should I try another one? The disk has now disappeared from the unassigned disk choices in the preclear settings, though.


June 16, 2016

Community Expert

The disk probably dropped offline again; if so, you need to reboot for it to show up.


June 16, 2016

Author

Since this first cycle "only" took a bit over 2 hrs (maybe because it stopped short of completing?), is there an advantage to choosing 3 cycles right off the bat? If so, I'll run it later, overnight. (I am in Germany, where it's 1:42 pm now.)
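The short run time is itself a clue. The status window reports the pre-read at 2:15:17 (369 MB/s), far above the 141 MB/s the drive managed when the pass started. Assuming that starting rate is close to the drive's best sustained speed, a full 3TB read cannot finish that quickly, which fits the disk dropping out mid-pass. A small sketch of the check; `step_plausible` is a hypothetical helper:

```shell
# Sanity check on a reported step time: could the whole disk have been
# read that fast? Assumes max_mb_per_s is an upper bound on the drive's
# sustained rate for the entire pass. Elapsed time is given as H:MM:SS.
step_plausible() {
  local size_tb="$1" elapsed="$2" max_mb_per_s="$3"
  awk -v tb="$size_tb" -v t="$elapsed" -v r="$max_mb_per_s" 'BEGIN {
    split(t, p, ":")                        # H:MM:SS -> p[1], p[2], p[3]
    secs = p[1] * 3600 + p[2] * 60 + p[3]
    min_secs = tb * 1e12 / (r * 1e6)        # fastest possible full pass
    if (secs < min_secs) print "too fast: disk probably dropped out mid-pass"
    else print "plausible"
  }'
}
```

`step_plausible 3 2:15:17 141` prints `too fast: disk probably dropped out mid-pass`, since a full read at 141 MB/s needs about 21,300 seconds (roughly 5 h 55 min) against the 8,117 seconds reported.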


June 16, 2016

Author

Looking at the preclear status window, I noticed that the pending sector count has increased from 1 to 2.

After searching for a reiserfs pending sector fix, I came across someone's post saying the following:

Recently I had this problem with a hard drive and a reiserfs partition. There were 6 pending sectors. After reading BadBlockHowTo.txt I was almost hopeless.

Anyway I did

# dd if=/dev/hdc1 of=/dev/null bs=512

on a faulty partition. Of course, it failed at the same block as smartctl showed as bad.

Then I ran the same command again

# dd if=/dev/hdc1 of=/dev/null bs=512

and it stopped on another block.

After the third run of dd there were no bad blocks - they were all relocated automagically... smartctl confirmed this. No clue why, but it worked.

Is this something that might work in my situation as well? If so, is the dd command included in unRAID, or, if not, could someone as inexperienced with the command prompt as I am install it?
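dd is a standard GNU coreutils tool, so it should already be available at the unRAID console. The quoted trick only helps when a pending sector can eventually be read successfully; a sector that stays unreadable is normally remapped only when it is written, which is what the zeroing step of a preclear does. A minimal sketch of the same idea applied to the whole disk, assuming `smartctl` is available and the disk is neither assigned nor mounted; the function names and `/dev/sdd` are placeholders, and `bs=1M` (instead of the post's 512-byte blocks) is chosen for speed:

```shell
# Print the RAW_VALUE (10th column) of attribute 197 from `smartctl -A`.
pending_sectors() {
  awk '$1 == 197 { print $10 }'
}

# Read every sector once and compare the pending count before and after.
# Reading is non-destructive: the disk is only ever given as if=.
# Pointing of= at a disk would overwrite it. conv=noerror lets dd carry
# on past unreadable blocks instead of stopping at the first one.
read_pass_and_recheck() {
  local dev="$1" before after
  before=$(smartctl -A "$dev" | pending_sectors)
  dd if="$dev" of=/dev/null bs=1M conv=noerror
  after=$(smartctl -A "$dev" | pending_sectors)
  echo "pending sectors: before=$before after=$after"
}
# Example (device name hypothetical; the disk must not be in use):
#   read_pass_and_recheck /dev/sdd
```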
