What do I do now???

October 11, 2010

Hi guys,

I have built a 16-drive unRAID server (14 x 2TB storage, 1 x 2TB parity, and 1 x 1TB cache) and moved all of my data onto it, filling it to about 45%. I wrote directly to the drives over my LAN. Then I enabled the parity drive and unRAID began the ~24-hour process of building parity. I have been monitoring its progress via unMENU and everything seemed to be going well. Now, about 1 to 2 hours before its scheduled completion, I tried to access unMENU and I can no longer get in. The same goes for the regular web UI: I just get a message from Firefox telling me that it is unable to connect. I don't want to reboot the machine when I am so close to finishing the parity build, so I was wondering if there is any way to monitor its progress through the console. Or is it probable that the entire system has already crashed? I logged in as "root" and typed "ifconfig" and everything looks like it is running, but I am at a loss as to what I should do now.

Any help would be greatly appreciated.

Thanks!

October 11, 2010

If you can't connect to the server it might be the router/switch in between.

In any case, you can log in at the console and type

killall awk

and it should kill the unMENU process. Then if it does not start itself back you can type

/boot/unmenu/uu

to re-start it.

To get a crude idea of the progress of the parity build you can type:

/root/mdcmd status | strings | grep sync

You can check if the web-management-server is running with

ps -ef | grep emhttp | grep -v grep

If it is not running (you do not see it listed when you run the prior command) you can re-start it with:

nohup /usr/local/sbin/emhttp &

Then, you might be able to see it with your browser.
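
The process check above can be wrapped in a small helper. This is only a sketch: the function name is_running is made up for illustration, and it reads "ps -ef" output on stdin so it can be tried without touching the server.

```shell
# is_running NAME
# Reads `ps -ef` output on stdin and succeeds (exit 0) if NAME appears
# in any line other than the grep commands themselves.
# The helper name is hypothetical; it is not part of unRAID or unMENU.
is_running() {
    grep -v grep | grep -q -- "$1"
}
```

For example, "ps -ef | is_running emhttp || echo 'emhttp is not running'" prints the message only when the web-management server is missing from the process list.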

October 11, 2010

Author

Thanks, Joe L.!

Here are the responses I am getting:

killall awk

response - awk: no process killed

/boot/unmenu/uu

response - Welcome to Linux 2.6.32.9-unRAID (tty1)

Tower1 login:

/root/mdcmd status | strings | grep sync

response (I will just give you what I think are the important ones)

mdResyncPrcnt=99.5

mdResyncFinish=5.7

It looks to me as if I am 99.5% done, right? What does the 5.7 mean for Finish?

ps -ef | grep emhttp | grep -v grep

response - no response; it just gives me another prompt: "root@Tower1:~#"

Welcome to Linux 2.6.32.9-unRAID (tty1)

Tower1 login:

nohup /usr/local/sbin/emhttp &

response - [1]27912

root@Tower1:~# nohup: ignoring input and appending output to 'nohup.out'

Edit: Then I hit "return" and I got this response:

[1]+ Killed nohup /usr/local/sbin/emhttp

I still can't access anything via the web UI or unMENU...

October 11, 2010

Author

Now I am getting a response of

mdResync=0

when I type in

/root/mdcmd status | strings | grep sync

Does this mean that the parity is finished or that it has been killed?

Should I just type "reboot" now?
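
The sync fields reported in this thread (mdResyncPrcnt, mdResyncFinish, mdResync) can be condensed into one line. The following is a rough sketch, not an official tool: the function name sync_summary is invented here, it assumes that mdResync=0 means no sync is running, and it assumes that mdResyncFinish is an estimate in minutes.

```shell
# sync_summary
# Reads `mdcmd status` output (name=value lines) on stdin and prints a
# one-line summary of the parity sync.
# Assumptions: mdResync=0 means "no sync running" and mdResyncFinish
# is an estimate in minutes. The function name is hypothetical.
sync_summary() {
    awk -F= '
        $1 == "mdResyncPrcnt"  { pct = $2 }
        $1 == "mdResyncFinish" { fin = $2 }
        $1 == "mdResync"       { run = $2 }
        END {
            if (run == "0")
                print "no sync in progress"
            else if (pct != "")
                printf "sync %s%% done, about %s minutes left\n", pct, fin
            else
                print "no sync fields found"
        }'
}
```

On the console it would be fed the same way as the grep: "/root/mdcmd status | strings | sync_summary". It cannot tell a finished sync from an aborted one; the system log is the place to check for that.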

October 11, 2010

Since it appears as if the web interfaces are working, you should be able to connect, unless they are being killed because you are out of memory because your system log grew to be huge (filled with error messages).

Type:

free

to see how much memory you have free. (most will be in the buffer cache, so do not worry if there is not much free)

Type

ifconfig eth0

to get the IP address of the server.

Type

ethtool eth0

to get the current network link status.

It looks like you are only a few minutes (5.7) from the end of the parity calculation process.

Do not just type "reboot" since that will not stop the server cleanly and parity will be marked as invalid.

Instead, since you cannot seem to get to the web-management console, you'll need to manually perform the steps it would take to shut down cleanly before rebooting.

Type

cd

/root/samba stop

umount /mnt/disk1

umount /mnt/disk2

umount /mnt/disk3

(Do this for each of your disks. Note that it is umount, not unmount: the first "n" is missing in the command.)

Then, once all the disks are un-mounted you can issue the command to stop the array

/root/mdcmd stop

If it says OK, then type

reboot
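
With 14 data disks, typing the sequence above by hand is error-prone. Here is a minimal sketch that only prints the commands for a given number of disks so they can be reviewed first. The function name shutdown_steps is made up, and it assumes the disks are mounted as /mnt/disk1 through /mnt/diskN as in the steps above.

```shell
# shutdown_steps N
# Prints (does not run) the manual clean-shutdown commands for an array
# with N data disks mounted at /mnt/disk1 .. /mnt/diskN.
# The function name is hypothetical; the commands are the ones listed above.
shutdown_steps() {
    n=$1
    echo "/root/samba stop"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "umount /mnt/disk$i"
        i=$((i + 1))
    done
    echo "/root/mdcmd stop"
}
```

Running "shutdown_steps 14" lists the commands in order. Type "reboot" only after the final mdcmd stop has reported OK.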

October 11, 2010

Author

It never says "ok" when I unmount any of my disks. I purposely ran the command up to disk15 (I only have 14 disks) and it reported "umount: /mnt/disk15: not found",

so I assume that my other 14 disks were unmounted correctly, as each time I ran the command I just went back to a prompt with no error message (but like I said, I did not get an "ok" response either).

Should I be safe to stop and reboot now?

October 11, 2010

Author

Wow....this sucks...

I stopped the array and rebooted...now my disk2 is coming up with a red ball next to it. The good news is that the parity drive has a green ball...

Should I run the "trust my array" procedure in this case?

BTW, the disk with the red ball is about a week old 2TB WD "EARS" drive. I added it to the array, dumped a bunch of data on it, and then found the thread telling me that I should jump pins 7 and 8. I tried jumping pins 7 and 8 (last week) and the array would not boot up. I removed the jumper and everything seemed to be ok...until now...

October 11, 2010

No... you have a disk where a write to it failed. Your parity is good (hopefully). The only way to recover all the data you wrote to the drive is to reconstruct it from parity and the other drives.

The likelihood of the disk being defective is very real. So is the likelihood of it being a loose connector to the drive.

You can try to run a SMART report on it. If it responds, it might be a loose connection. If it does not respond, stop the array, power down, check the connections, power up, and see if it responds. If it does not, try a different cable/connector. If there is still no response, RMA it. It died.

Do NOT use the trust procedure unless ALL THE DRIVES IN YOUR ARRAY ARE PRESENT AND WORKING PROPERLY. Even then, recognize it will throw away any data you wrote to the failed drive since its failure. For 99% of those out there, it is NOT the right course of action. It is ONLY useful (without data loss) if it was a loose connector. Every other reason for a disk to be off-line is because a failure in writing to it occurred.

October 11, 2010

Author

You can try to run a SMART report on it. If it responds, it might be a loose connection. If it does not respond, stop the array, power down, check the connections, power up, and see if it responds. If it does not, try a different cable/connector. If there is still no response, RMA it. It died.

I have started the short smart test from the disk management section of unMENU. I already checked the connections when I powered down and everything is fine. Here is the report:

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (36600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   190   190   051    Pre-fail  Always       -       449
  3 Spin_Up_Time            0x0027   169   166   021    Pre-fail  Always       -       6533
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       224
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       28
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       958
194 Temperature_Celsius     0x0022   121   117   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed without error 00% 224 -

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

Should I now run the long smart test?

October 11, 2010

Are you trying to prove you can read all the sectors on the disk? That is all the "long" test does that is different from the "short" test, which reads a sampling of the sectors.

It appears as if the disk is responding.

So...

Stop the array

Un-Assign the failed disk from the array (to get unRAID to forget its model/serial number and accept it as its own replacement)

Start the array with it un-assigned

Stop the array once more

Re-assign the failed disk on the devices page back to its original slot in the array.

Start the array one more time. It will re-construct the failed disk onto itself, including any content that was not written when it failed.

When the re-construction is complete you'll be back as before with your data and parity protection against a subsequent failure.

Joe L.

October 11, 2010

Author

Yup, the disk is being rebuilt... I guess I got my parity drive up and running just in the nick of time, heh?

If I have any other problems with that disk I will RMA it. As a matter of fact, I am not too keen on having 2 of those "EARS" drives in my array at all. If the restore goes well, I will copy all of the data off of the 2 EARS drives onto 2 of the empty Samsung F4 drives (which I like a LOT better) using Midnight Commander and then replace the EARS drives with Samsungs as soon as I can.

Thank you, thank you, thank you!!! I am honored that "the man" himself took the time to help out a rank novice like myself...

You rock, Joe!

October 11, 2010

Not to get too off topic, but would it have been better to get parity up and running before copying all the data to the drives? I honestly don't know, which is why I am asking.

October 11, 2010

The initial copy will go much faster with no parity disk assigned. The theory is that you still have a copy of your data on the PC you were copying from.

You would want to put the parity disk into place and make sure the array is working properly before deleting the original copies or migrating those disks to the unRAID array.

October 11, 2010

Author

In my case, I had a HUGE amount of data to copy over to the server. I tried setting up the parity disk when I had transferred about 6TB of material (out of about 12TB total), but then I found that the transfer speed dropped to about 1/3 of what I was originally getting. At that point I disabled the parity drive, took a chance and copied the rest of my data over with the plan of creating a parity disk as soon as I was done....and that was exactly what I did. I was just lucky in that I managed to squeak out the parity disk right when one of my array disks failed, so I was protected (or at least I think so). Right now the rebuilding of the failed disk looks to be about a 2 day job (about 36 hours left), so I am keeping my fingers crossed that the "bad" disk does not actually have a problem and that it will be properly rebuilt. I also have to keep my fingers crossed that I don't have another disk go bad during this time.

I think at least part of this problem is due to the fact that I installed 2 of the WD "EARS" series of drives...drives that I bought new a couple of weeks ago before I read about the warnings in this forum. I then bought 7 of the Samsung Spinpoint F4 drives (not the exact model as mentioned in this forum), and they run a LOT cooler and seem to work very reliably. The EARS drives never seemed quite right to me even when they were working 100%, though I really can't put my finger on what it is that makes me feel that way. Since I am very concerned about long term reliability, my plan is to get the EARS drives out of my system as soon as it is feasible, money permitting.

BTW, is it safe to run Midnight Commander to transfer files from Disk1 to Disk9 within the array while the parity drive is doing its work on Disk2? Or is there a better way to get the job done?

October 11, 2010

It is safe. It will slow down the parity process a bit, since the disk heads will have to seek back and forth between their two tasks.

October 11, 2010

Author

Thanks again, Joe! I think it would be best to let the parity rebuild finish ASAP, so I won't choose to do anything that will slow it down...

October 12, 2010

Yup, the disk is being rebuilt... I guess I got my parity drive up and running just in the nick of time, heh?

I'm not wishing to interfere with the valuable support which Joe is providing, but I'm just wanting to learn!

What proof do we have that the drive didn't fail before the parity build completed?

I would have thought that the values reported at ID # 1 and 197 were rather critical. I would also be suspicious of #200. These all have a value of zero on my EARS drive.

October 12, 2010

Yup, the disk is being rebuilt... I guess I got my parity drive up and running just in the nick of time, heh?

I'm not wishing to interfere with the valuable support which Joe is providing, but I'm just wanting to learn!

No problem. I'm learning too.

What proof do we have that the drive didn't fail before the parity build completed?

None... but it is responding to SMART commands. That is a good start.

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x002f 190 190 051 Pre-fail Always - 449

All drives have raw read errors. Some report them, some do not. The current value of 190 is nowhere near 051 (the failure threshold), so it is not an issue.

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 31

There are 31 sectors pending re-allocation when next written. But again, the current value of 200 is nowhere near the failure threshold of 000.

I would have thought that the values reported at ID # 1 and 197 were rather critical. I would also be suspicious of #200. These all have a value of zero on my EARS drive.

I would not worry about value ID 200. Again, the raw values only have meaning to the manufacturer in most cases. The normalized value is unchanged at its starting value of 200 and nowhere near the failure threshold for that parameter.
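
The comparison described above (normalized VALUE against THRESH, with the raw value shown only for reference) can be applied to a whole attribute table at once. This is a sketch under assumptions: the function name smart_margin is invented, the input is assumed to use the 10-column layout shown in this thread, and an attribute is treated as failing when its threshold is non-zero and VALUE is at or below THRESH.

```shell
# smart_margin
# Reads a SMART attribute table on stdin (10 columns: ID, name, flag,
# VALUE, WORST, THRESH, type, updated, when_failed, raw) and prints
# each attribute's normalized value next to its threshold.
# Assumption: "FAILING" means THRESH is non-zero and VALUE <= THRESH.
# The function name is hypothetical.
smart_margin() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0
        thresh = $6 + 0
        status = (thresh > 0 && value <= thresh) ? "FAILING" : "ok"
        printf "%s value=%d thresh=%d raw=%s %s\n", $2, value, thresh, $10, status
    }'
}
```

It could be fed from something like "smartctl -A /dev/sdX | smart_margin", where /dev/sdX stands in for the actual device. A non-zero raw Current_Pending_Sector count still shows as "ok" here, so it is worth watching across later parity checks.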

October 12, 2010

Experience of these forums has been that once you start having issues with pending / reallocated sectors, the problems get worse with each parity check. If that happens to you, the drive will need to be replaced.

October 12, 2010

Author

Thanks for the lesson, Joe! I didn't know how to read that report other than the part that said that the drive passed, so I am glad to know the areas of interest and the values I should look for.

In retrospect, I think it would have been a better idea to unassign one of the unused Samsung drives (disk9, 10, and 11) and assign it to disk 2 (the failed drive) and then allowed the parity drive to rebuild the failed disk. BUT, I guess I would not really know if the WD drive is actually going bad or not, and at least this way I think I will know sooner by rebuilding it right away. If it fails again during the rebuild or soon after, I will pull it out of the array, RMA it and then assign one of the Samsungs to replace it and let the parity drive do its thing all over again. The bad part is that it just takes sooooo long to do a rebuild...I still have another 24 hours to go. Yikes!

October 12, 2010

In retrospect, I think it would have been a better idea to unassign one of the unused Samsung drives (disk9, 10, and 11) and assign it to disk 2 (the failed drive) and then allowed the parity drive to rebuild the failed disk.

I agree - that might have been the better option.

BUT, I guess I would not really know if the WD drive is actually going bad or not, and at least this way I think I will know sooner by rebuilding it right away.

I would have thought that running the preclear script would have been just as good.

October 12, 2010

You cannot replace a failed disk with another disk already in your array. It doesn't matter if it is empty or not.

October 12, 2010

I was thinking the same thing.

The only thing you might have been able to do was copy the contents from the virtual disk to an actual disk in the array. I have done that before just to make sure I had the content in case a drive really did fail and I did not have one to replace it at hand.

October 12, 2010

Author

You cannot replace a failed disk with another disk already in your array. It doesn't matter if it is empty or not.

Why not? Can't I remove an empty disk (the Samsung disk9 in this case) safely from the array without harm, or does the parity drive need to rebuild itself even if I remove an empty disk? If the former, then I could assign it to disk2 and have the parity drive rebuild it as if it were a new replacement disk, but if the latter then you are correct, as the system would see 2 drive problems at the same time and would not be able to rebuild the bad disk. Gee, I still have a lot to learn...

October 12, 2010

Exactly, if you remove a disk (empty or not) from an array that is already missing disk2, unRAID will see it as two lost disks and it won't be able to recover your disk2 data.

The only way to remove a drive from the array without losing parity protection still involves running initconfig at one point, which would mean that you would lose all the data on disk2. To remove a drive from the array without losing parity is only possible if you have a healthy array.

Just do what prostuff1 said: copy all the contents from the simulated disk2 onto the real replacement disk (disk9). It will be slow, but it will work. Run the copy from the command line to help speed it up (that way it will be an internal transfer and not a transfer over the network). Once it completes, you can then run initconfig to make unRAID forget about the failed disk2. At that point you could also move your current disk9 into the disk2 slot if you wish, or you could leave it as it is. It wouldn't matter.

For the future: if you are interested in the ability to recover from a failed drive with a drive already installed in the server, check out the warm spare concept. You can read about it in great detail here. Briefly, a warm spare is a cache drive that is as large or larger than your parity drive. Since the cache drive is outside the array, you can unassign it at any time without disrupting the array. So if an array drive dies, you simply unassign the dead drive and the warm spare/cache drive, then you assign the warm spare into the dead drive's slot. The rebuild can then begin immediately, giving you ample time to troubleshoot or RMA the dead drive.
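
For the command-line copy from the simulated disk2 onto disk9 mentioned above, a minimal sketch might look like the following. The function name copy_disk is made up, and it assumes that plain "cp -a" is acceptable for the job; nothing in this thread prescribes a particular copy tool.

```shell
# copy_disk SRC DST
# Copies everything under SRC into DST, preserving ownership, modes and
# timestamps (cp -a). Refuses to run if either directory is missing.
# The function name is hypothetical; cp -a is an assumed choice of tool.
copy_disk() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "missing directory: $src or $dst" >&2
        return 1
    fi
    cp -a "$src"/. "$dst"/
}
```

On the server this would be run as "copy_disk /mnt/disk2 /mnt/disk9" from the console, so the transfer stays inside the machine instead of going over the network. Check that the copy completed without errors before running initconfig.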
