How do I use a cache drive to replace a failed data drive remotely?



One problem solved and immediately I am struck with the next one.

 

Just got a notification that my unRAID array, which is all the way up in Berlin (500 miles away from me), is in a compromised state: disk 9 was marked as disabled after it produced 9 errors during a New Permissions operation.

 

Even though I do have a cache drive in my server, I am not really using it as such...it is not configured for its intended purpose, but serves only as an additional drive to copy data to and from during my usual file maintenance.

 

Since the cache drive is also a 3TB drive, just like my failed data disk, what would the correct procedure be to make my cache drive the new data disk 9, without risking the loss of data?

 

 

Link to comment

I am amazed at how regularly these WD-Red drives fail on me...I must have had about 10 of them go bad in the past 3-4 years...very frustrating...I thought they were meant to be particularly long-lasting in this type of server application.

Do they come from the same order? You might have a bad batch (or WD Red drives are crap, or you have bad luck, or any of those combinations  :-\).

 

I made it a point to never order the same drive from the same seller within a short period / same order. I read somewhere that drives on the same batch tend to have similar issues.

Link to comment

I wish it were that obvious a reason, testdasi, but I've been ordering WD-Green and Red drives for the past 15 years, from all over the place (Amazon, B&H, CyberPort, EggeHead) in both the USA and Germany, with no discernible commonalities (other than that they are all WD drives), so my guess is that it's a combination of bad quality and bad luck. What's worse is that WD's warranty isn't international: they won't swap German drives in the USA, or US drives in Germany, which makes it very difficult for someone like me who lives and works on two continents.

 

which drives do you use/recommend? I might have to start looking elsewhere and slowly start swapping out drives, one by one, to another manufacturer.

Link to comment

which drives do you use/recommend? I might have to start looking elsewhere and slowly start swapping out drives, one by one, to another manufacturer.

I really don't want to go into brand recommendations because the proverbial "YMMV" applies. My sample is also quite small, only about 10+ drives, of which the only one that failed was a WD Black - and that was because Amazon stupidly posted it in a thin cardboard envelope with zero padding.

 

Personally, I have always put Hitachi as my first choice (it appears to be confirmed by Backblaze in 2015: <1% failure rate vs WD 2.5% and Seagate 3%). But it is now owned by WD (rebranded as "HGST") so who knows where things are heading.  ::)

Link to comment

points well taken, testdasi...i'm just getting tired of looking at a big stack of failed 1, 2 and 3TB drives from the past 15 years, that all bear the WD label, and for many of which i wasn't able to take advantage of the warranty as sending them across the Atlantic is just not economical or timely...time to find a new drive manufacturer with a better track-record and international warranty.

Link to comment

ok, the rebuild has completed, parity is synced and the array has started again, but how can i now get a SMART report from the failed disk? i am not able to see it anywhere. could i enable the cache feature and select the failed disk as the cache drive, so that i can at least access it via telnet and maybe run some check-disk operations on it? or is there a better way to access it for potential repairs? (i am pretty sure there is)

Link to comment

ok, after a reboot the disk is showing again, and i added it as a cache drive for the moment (without re-starting the array).

 

trying to get the SMART test results, but it's been taking 10 minutes already...will post when/if completed.

 

the Attributes tab produces the following results (excuse the formatting):

 

#    Attribute Name            Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
1    Raw read error rate       0x002f  200    200    051        Pre-fail  Always   Never   4
3    Spin up time              0x0027  179    174    021        Pre-fail  Always   Never   6008
4    Start stop count          0x0032  100    100    000        Old age   Always   Never   486
5    Reallocated sector count  0x0033  200    200    140        Pre-fail  Always   Never   0
7    Seek error rate           0x002e  200    200    000        Old age   Always   Never   0
9    Power on hours            0x0032  084    084    000        Old age   Always   Never   12201 (1y, 4m, 22d, 9h)
10   Spin retry count          0x0032  100    100    000        Old age   Always   Never   0
11   Calibration retry count   0x0032  100    253    000        Old age   Always   Never   0
12   Power cycle count         0x0032  100    100    000        Old age   Always   Never   70
192  Power-off retract count   0x0032  200    200    000        Old age   Always   Never   15
193  Load cycle count          0x0032  200    200    000        Old age   Always   Never   470
194  Temperature celsius       0x0022  120    085    000        Old age   Always   Never   30
196  Reallocated event count   0x0032  200    200    000        Old age   Always   Never   0
197  Current pending sector    0x0032  200    200    000        Old age   Always   Never   1
198  Offline uncorrectable     0x0030  100    253    000        Old age   Offline  Never   0
199  UDMA CRC error count      0x0032  200    200    000        Old age   Always   Never   0
200  Multi zone error rate     0x0008  100    253    000        Old age   Offline  Never   0
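
for anyone who prefers the command line over the Attributes tab, here is a rough sketch that picks out the attributes usually watched on an ageing drive (5, 196, 197 and 198) and prints only those with a non-zero raw value. It assumes the row layout printed by `smartctl -A` (underscore-joined names, raw value in the tenth column), not the web table above, and the function name `flag_smart_attrs` is made up for illustration.

```shell
# Print SMART attributes 5, 196, 197 and 198 when their raw value is non-zero.
# Assumed input: attribute rows as printed by `smartctl -A`, i.e.
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw_value.
flag_smart_attrs() {
    awk '
        # Only look at rows that start with a numeric attribute ID.
        $1 ~ /^[0-9]+$/ && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
            if ($10 + 0 > 0) printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}

# Usage (the device name is only an example):
#   smartctl -A /dev/sdd | flag_smart_attrs
```

with the values posted above, only attribute 197 (current pending sector, raw value 1) would be printed.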

Link to comment

yeah, I am prepared to wave that drive goodbye, like all the other ones, but while i don't have physical access to the server, which may be the case for many weeks to come, i might as well try to get some non-critical use out of it.

 

what's the correct procedure to pre-clear this drive? is there a telnet command that executes this, and if so, how long does one pre-clearing pass take?

Link to comment

when opening the monitor pop-up window for this operation i get this:

 

/boot/config/plugins/preclear.disk/preclear_disk.sh  -c 1 /dev/sdd 2>/tmp/preclear.log

root@unRAID:/usr/local/emhttp# /boot/config/plugins/preclear.disk/preclear_disk.sh  -c 1 /dev/sdd 2>/tmp/preclear.log

Sorry: Device /dev/sdd is busy.: 1

root@unRAID:/usr/local/emhttp#

 

why is /dev/sdd busy and how do i resolve this?
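
one common reason for a "busy" device is that one of its partitions is still mounted, which could apply here since the disk was just assigned as the cache drive. That is a guess, not a confirmed diagnosis. A small sketch to check it: the hypothetical helper `mounts_for_device` lists every entry in the kernel's mount table whose source is the disk or one of its partitions.

```shell
# List mount entries whose source is the given disk or one of its partitions.
# $1: device, e.g. /dev/sdd
# $2: mount table to read (optional, defaults to /proc/mounts)
mounts_for_device() {
    dev=$1
    table=${2:-/proc/mounts}
    awk -v dev="$dev" '
        # Match /dev/sdd itself or /dev/sdd1, /dev/sdd2, ... but not /dev/sdda.
        $1 == dev || ($1 ~ ("^" dev "[0-9]+$")) { print $1, $2 }
    ' "$table"
}

# Usage:
#   mounts_for_device /dev/sdd
```

if this prints a line, the disk is in use at that mount point and would need to be unassigned or unmounted before a preclear can start; no output means the cause lies elsewhere.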

Link to comment

thanks so much for all your great pointers, johnnie...it is now doing the preclear:

 

sdd WDC_WD30EFRX-68AX9N0_WD-WMC1T0098667 31 C 3 TB Pre-Read: 0% @ 141 MB/s (0:01:18)

 

i am starting with 1 cycle and will see where i end up...will do the other 2 cycles once i find out how long the first takes....could be a loooong time.
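
a back-of-the-envelope estimate for one full pass over the disk, as a sketch: it assumes a nominal 3 TB (3,000,000,000,000 bytes) and a constant rate, with 1 MB taken as 1,000,000 bytes. The real rate changes across the platter, and a preclear cycle contains more than one full pass, so this is only a rough lower bound per pass. The function name `estimate_pass_time` is made up for illustration.

```shell
# Estimate the duration of one full pass over a disk at a steady transfer rate.
# $1: disk size in bytes
# $2: transfer rate in MB/s (1 MB = 1,000,000 bytes)
estimate_pass_time() {
    secs=$(( $1 / ($2 * 1000000) ))
    # Print as h:mm:ss, the same shape the preclear status line uses.
    printf '%d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

# Usage, with the 141 MB/s shown in the status line above:
#   estimate_pass_time 3000000000000 141    # prints 5:54:36
```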

Link to comment

hmm, the last thing i saw was something about failing a write because of a broken pipe, and this is what the status window now shows:

 

############################################################################################################################

#                                                                                                                          #

#                                        unRAID Server Pre-Clear of disk /dev/sdd                                          #

#                                      Cycle 1 of 1, partition start on sector 64.                                        #

#                                                                                                                          #

#                                                                                                                          #

#  Step 1 of 5 - Pre-read verification:                                                  [2:15:17 @ 369 MB/s] SUCCESS    #

#  Step 2 of 5 - Zeroing the disk:                                                          [0:00:03 @ 0 MB/s] SUCCESS    #

#  Step 3 of 5 - Writing unRAID's Preclear signature:                                                          SUCCESS    #

#  Step 4 of 5 - Verifying unRAID's Preclear signature:                                                          FAIL    #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

############################################################################################################################

#                                Cycle elapsed time: 2:16:09 | Total elapsed time: 2:16:10                                #

############################################################################################################################

 

 

############################################################################################################################

#                                                                                                                          #

#                                                  S.M.A.R.T. Status                                                      #

#                                                                                                                          #

#                                                                                                                          #

#  ATTRIBUTE                    INITIAL  STATUS                                                                          #

#  5-Reallocated_Sector_Ct      0        -                                                                                #

#  9-Power_On_Hours            12203    -                                                                                #

#  194-Temperature_Celsius      31      -                                                                                #

#  196-Reallocated_Event_Count  0        -                                                                                #

#  197-Current_Pending_Sector  1        -                                                                                #

#  198-Offline_Uncorrectable    0        -                                                                                #

#  199-UDMA_CRC_Error_Count    0        -                                                                                #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

#                                                                                                                          #

############################################################################################################################

#  SMART overall-health self-assessment test result: PASSED                                                              #

############################################################################################################################

 

--> FAIL: unRAID's Preclear signature not valid.

 

 

should i try another cycle? the disk has now disappeared from the unassigned disk choices in the preclear settings, though.

Link to comment

looking at the preclear status window i noticed that the pending sector count has increased from 1 to 2.

 

after doing a search on reiserfs pending sector fix i came across someone's post saying the following:

 

Recently I had this problem with a hard drive and a reiserfs

partition.  There were 6 pending sectors. After reading

BadBlockHowTo.txt I was almost hopeless.

 

Anyway I did

 

# dd if=/dev/hdc1 of=/dev/null bs=512

 

on a faulty partition. Of course, it failed at the same block as

smartctl showed as bad.

 

Then I ran again same command

 

# dd if=/dev/hdc1 of=/dev/null bs=512

 

it stopped on another block.

 

After the third run of dd there were no bad blocks - they were all

relocated automagically... smartctl confirmed this. No clue why, but

it worked.

 

is this something that might work in my situation as well, and if so, is the dd command included in unRAID, or if not, maybe installable by someone as inexperienced with the command prompt as i am?
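
for reference, dd is part of GNU coreutils and so is normally present on Linux systems; the sketch below checks for it first rather than assuming. It wraps the read-only pass from the quoted post: every block is read and discarded, nothing is written to the source. Whether this clears pending sectors on this particular drive is not something the sketch can promise. The function name `read_scan` and the device name in the usage line are only examples.

```shell
# Read every block of a device (or file) and discard the data.
# Read-only: nothing is written to the source. Returns dd's exit status;
# dd reports the failing position on stderr if a read error occurs.
read_scan() {
    src=$1
    command -v dd >/dev/null 2>&1 || { echo "dd not found" >&2; return 127; }
    dd if="$src" of=/dev/null bs=512
}

# Usage (example device name, run as root):
#   read_scan /dev/sdd1
```

a non-zero exit status means dd stopped early, as it did in the quoted post; re-running it and comparing the pending sector count in the SMART report before and after shows whether anything changed.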

Link to comment
