Unraid

What do I do now???

Hi guys,

 

I have built a 16 drive unRAID server (14X2TB storage, 1X2TB parity, and 1X1TB cache) and moved all of my data onto it, filling it up ~45% full. I wrote directly to the drives over my LAN. Then I enabled the parity drive and unRAID began the ~24 hour process of building parity. I have been monitoring its progress via unMenu and everything seemed to be going well. Now, about 1 to 2 hours before its scheduled completion, I just tried to access unMenu and I can no longer get in - same thing with the regular web UI - I just get a message from Firefox telling me that it is unable to connect. I don't want to reboot the machine when I am so close to being finished with the parity build, so I was wondering if there is any way to monitor its progress through the console. Or is it probable that the entire system has already crashed? I logged in as "root" and typed "ifconfig" and everything looks like it is running, but I am at a loss as to what I should do now.

 

Any help would be greatly appreciated.

 

Thanks!

If you can't connect to the server it might be the router/switch in between. 

 

In any case, you can log in at the console and type

killall awk

and it should kill the unMENU process.  Then, if it does not start itself back up, you can type

/boot/unmenu/uu

to re-start it.

 

To get a crude idea of the progress of the parity build you can type:

/root/mdcmd status | strings | grep sync

 

You can check if the web-management-server is running with

ps -ef | grep emhttp | grep -v grep

 

If it is not running (you do not see it listed when you run the prior command) you can re-start it with:

nohup /usr/local/sbin/emhttp &

 

Then, you might be able to see it with your browser.
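
The check-and-restart steps above could be folded into one small helper. This is only a sketch: `process_listed` is a hypothetical name, not something shipped with unRAID, and it simply applies the same grep filtering shown above to a `ps -ef` listing read from standard input.

```shell
# Hypothetical helper (not part of unRAID): succeed if any line of a
# `ps -ef` listing on stdin mentions NAME, ignoring grep's own entry.
process_listed() {
    grep -F -- "$1" | grep -v grep > /dev/null
}

# Usage at the console, mirroring the commands above:
#   if ! ps -ef | process_listed emhttp; then
#       nohup /usr/local/sbin/emhttp &
#   fi
```

Reading the listing from stdin keeps the helper easy to try out on a saved copy of the `ps -ef` output before relying on it.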

 

 

 

  • Author

Thanks, Joe L.!

 

Here are the responses I am getting:

 

killall awk

 

response - awk: no process killed

 

/boot/unmenu/uu

 

response - Welcome to Linux 2.6.32.9-unRAID (tty1)

 

Tower1 login:

 

/root/mdcmd status | strings | grep sync

 

response (I will just give you what I think are the important ones)

 

mdResyncPrcnt=99.5

mdResyncFinish=5.7

 

It looks to me as if I am 99.5% done, right? What does the 5.7 mean for Finish?

 

ps -ef | grep emhttp | grep -v grep

 

response - no response - it just gives me another prompt "root@Tower1:~#"

 

Welcome to Linux 2.6.32.9-unRAID (tty1)

 

Tower1 login:

 

nohup /usr/local/sbin/emhttp &

 

response - [1]27912

 

root@Tower1:~# nohup: ignoring input and appending output to 'nohup.out'

 

Edit: Then I hit "return" and I got this response:

 

[1]+ Killed                  nohup /usr/local/sbin/emhttp

 

 

 

 

 

I still can't access anything via the web UI or unMENU...:(

  • Author

Now I am getting a response of

 

mdResync=0

 

when I type in

 

/root/mdcmd status | strings | grep sync

 

Does this mean that the parity is finished or that it has been killed?

 

Should I just type "reboot" now?

Since it appears as if the web-interfaces are working, you should be able to connect, unless they are being killed because you are out of memory because your system log grew to be huge (filled with error messages).

 

Type:

free

to see how much memory you have free. (most will be in the buffer cache, so do not worry if there is not much free)
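
Since most memory normally sits in the buffer cache, it can help to add the free, buffer, and cache figures together. The sketch below does that for `/proc/meminfo`-style input. `reclaimable_kb` is a hypothetical name, and treating MemFree + Buffers + Cached as "usable" memory is a rough assumption rather than an exact measure.

```shell
# Hypothetical helper: sum MemFree, Buffers and Cached (in kB) from
# /proc/meminfo-formatted text on stdin. A rough estimate only.
reclaimable_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { total += $2 }
         END { printf "%d\n", total }'
}

# Usage at the console:
#   reclaimable_kb < /proc/meminfo
```

If the total is still tiny compared with installed RAM, the out-of-memory explanation becomes more plausible.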

 

Type

ifconfig eth0

to get the IP address of the server.

 

Type

ethtool eth0

to get the current network link status.

 

It looks like you are only a few minutes  (5.7) from the end of the parity calculation process.
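
For watching progress from the console, the two fields quoted earlier can be pulled out of the status output with a small filter. This is a sketch only: `resync_summary` is a hypothetical name, and it assumes the `name=value` line format shown in this thread and that `mdResyncFinish` is in minutes, as read above.

```shell
# Hypothetical helper: summarise mdResyncPrcnt / mdResyncFinish from
# `mdcmd status` text on stdin (name=value lines, as quoted in the thread).
resync_summary() {
    awk -F= '
        $1 == "mdResyncPrcnt"  { pct = $2 }
        $1 == "mdResyncFinish" { fin = $2 }
        END {
            if (pct == "") { print "no resync progress reported"; exit }
            printf "%s%% done, about %s minutes remaining\n", pct, fin
        }'
}

# Usage at the console:
#   /root/mdcmd status | strings | resync_summary
```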

 

Do not just type "reboot" since that will not stop the server cleanly and parity will be marked as invalid.

 

Instead, since you cannot seem to get to the web-management console, you'll need to manually perform the steps it would take to shut down cleanly before rebooting.

 

Type

cd

/root/samba stop

umount /mnt/disk1

umount /mnt/disk2

umount /mnt/disk3

(Do this for each of your disks.  Note it is umount, not unmount: the first "n" is missing in the command.)

 

Then, once all the disks are un-mounted you can issue the command to stop the array

/root/mdcmd stop

 

If it says OK, then type

reboot
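
With many disks, typing each umount by hand is error-prone, so the sequence above could be generated first and reviewed before running anything. The sketch below only prints the commands; `shutdown_plan` is a hypothetical name, and it assumes the disks are mounted at /mnt/disk1 through /mnt/diskN as in the steps above. It deliberately leaves out the final reboot, which should only follow once the stop command reports OK.

```shell
# Hypothetical helper: print (do not run) the manual shutdown sequence
# for data disks 1..N, using the command paths given above.
shutdown_plan() {
    n=$1
    echo "/root/samba stop"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "umount /mnt/disk$i"
        i=$((i + 1))
    done
    echo "/root/mdcmd stop"
}

# Usage: review the list for a 14-disk array before typing the commands
#   shutdown_plan 14
```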

 

 

 

  • Author

It never says "ok" when I unmount any of my disks. I purposely ran the command up to disk15 (I only have 14 disks) and it reported "umount: /mnt/disk15: not found"

 

so I assume that my other 14 disks were unmounted correctly, as each time I ran the command I just went back to a prompt with no error message (but like I said, I did not get an "ok" response either).

 

Should I be safe to stop and reboot now?
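
A note on the silence: umount prints nothing when it succeeds, so a bare prompt is the expected result. One way to double-check is to look for any remaining /mnt/diskN entries in the kernel's mount table. This is a sketch, with `mounted_disks` as a hypothetical name, assuming the standard `/proc/mounts` layout where the second field is the mount point.

```shell
# Hypothetical helper: list /mnt/diskN mount points still present in
# /proc/mounts-formatted text on stdin. No output means none are mounted.
mounted_disks() {
    awk '$2 ~ /^\/mnt\/disk[0-9]+$/ { print $2 }'
}

# Usage at the console:
#   mounted_disks < /proc/mounts
```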

 

  • Author

Wow....this sucks...:(

 

I stopped the array and rebooted...now my disk2 is coming up with a red ball next to it. The good news is that the parity drive has a green ball...:)

 

Should I run the "trust my array" procedure in this case?

 

BTW, the disk with the red ball is about a week old 2TB WD "EARS" drive. I added it to the array, dumped a bunch of data on it, and then found the thread telling me that I should jump pins 7 and 8. I tried jumping pins 7 and 8 (last week) and the array would not boot up. I removed the jumper and everything seemed to be ok...until now...:(

No... you have a disk where a write to it failed.  Your parity is good (hopefully).  The only way to recover all that data you wrote to the drive is to re-construct it from parity and the other drives.

 

The likelihood of the disk being defective is very real.  So is the likelihood of it being a loose connector to the drive.

 

You can try to run a smart report on it.  If it responds, it might be a loose connection.  If it does not respond, stop the array, power down, check the connections, power up and see if it responds.  If it does not, try a different cable/connector.  If still no response, RMA it.  It died.

 

Do NOT use the trust procedure unless ALL THE DRIVES IN YOUR ARRAY ARE PRESENT AND WORKING PROPERLY.  Even then, recognize it will throw away any data you wrote to the failed drive since its failure.  For 99% of those out there, it is NOT the right course of action.  It is ONLY useful (without data loss) if it was a loose connector.  Every other reason for a disk to be off-line is because a failure in writing to it occurred.

 

 

  • Author
I have started the short smart test from the disk management section of unMENU. I already checked the connections when I powered down and everything is fine. Here is the report:

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84)    Offline data collection activity

                    was suspended by an interrupting command from host.

                    Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0)    The previous self-test routine completed

                    without error or no self-test has ever

                    been run.

Total time to complete Offline

data collection:          (36600) seconds.

Offline data collection

capabilities:              (0x7b) SMART execute Offline immediate.

                    Auto Offline data collection on/off support.

                    Suspend Offline collection upon new

                    command.

                    Offline surface scan supported.

                    Self-test supported.

                    Conveyance Self-test supported.

                    Selective Self-test supported.

SMART capabilities:            (0x0003)    Saves SMART data before entering

                    power-saving mode.

                    Supports SMART auto save timer.

Error logging capability:        (0x01)    Error logging supported.

                    General Purpose Logging supported.

Short self-test routine

recommended polling time:      (  2) minutes.

Extended self-test routine

recommended polling time:      ( 255) minutes.

Conveyance self-test routine

recommended polling time:      (  5) minutes.

SCT capabilities:            (0x3035)    SCT Status supported.

                    SCT Feature Control supported.

                    SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   190   190   051    Pre-fail  Always       -       449
  3 Spin_Up_Time            0x0027   169   166   021    Pre-fail  Always       -       6533
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       224
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       28
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       958
194 Temperature_Celsius     0x0022   121   117   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      224        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Should I now run the long smart test?

 

 

Are you trying to prove you can read all the sectors on the disk?  That is all the "long" test does that is different from the "short" test, which reads a sampling of the sectors.

 

It appears as if the disk is responding.

 

So...

Stop the array

Un-Assign the failed disk from the array  (to get unRAID to forget its model/serial number and accept it as its own replacement)

Start the array with it un-assigned

Stop the array once more

Re-assign the failed disk on the devices page back to its original slot in the array.

Start the array one more time.  It will re-construct the failed disk onto itself, including any content that was not written when it failed.

 

When the re-construction is complete you'll be back as before with your data and parity protection against a subsequent failure.

 

Joe L.

  • Author

Yup, the disk is being rebuilt...:) I guess I got my parity drive up and running just in the nick of time, heh?

 

If I have any other problems with that disk I will RMA it. As a matter of fact, I am not too keen on having 2 of those "EARS" drives in my array at all. If the restore goes well, I will copy all of the data off of the 2 EARS drives onto 2 of the empty Samsung F4 drives (which I like a LOT better) using Midnight Commander and then replace the EARS drives with Samsungs as soon as I can.

 

Thank you, thank you, thank you!!! I am honored that "the man" himself took the time to help out a rank novice like myself...:)

 

You rock, Joe!

Not to get too off topic, but would it have been better to get parity up and running before copying all the data to the drives? I honestly don't know, which is why I am asking.

The initial copy will go much faster with no parity disk assigned.  The theory is that you still have a copy of your data on the PC you were copying from.

 

You would want to put the parity disk into place and make sure the array is working properly before deleting the original copies or migrating those disks to the unRAID array.

  • Author

In my case, I had a HUGE amount of data to copy over to the server. I tried setting up the parity disk when I had transferred about 6tb of material (out of about 12tb total), but then I found that the transfer speed dropped to about 1/3 of what I was originally getting. At that point I disabled the parity drive, took a chance and copied the rest of my data over with the plan of creating a parity disk as soon as I was done....and that was exactly what I did. I was just lucky in that I just managed to squeak out the parity disk when one of my array disks failed, so I was protected (or at least I think so). Right now the rebuilding of the failed disk looks to be about a 2 day job (about 36 hours left), so I am keeping my fingers crossed that the "bad" disk does not actually have a problem and that it will be properly rebuilt. I also have to keep my fingers crossed that I don't have another disk go bad during this time.

 

I think at least part of this problem is due to the fact that I installed 2 of the WD "EARS" series of drives...drives that I bought new a couple of weeks ago before I read about the warnings in this forum. I then bought 7 of the Samsung Spinpoint F4 drives (not the exact model as mentioned in this forum), and they run a LOT cooler and seem to work very reliably. The EARS drives never seemed quite right to me even when they were working 100%, though I really can't put my finger on what it is that makes me feel that way. Since I am very concerned about long term reliability, my plan is to get the EARS drives out of my system as soon as it is feasible, money permitting.

 

BTW, is it safe to run Midnight Commander to transfer files from Disk1 to Disk9 within the array while the parity drive is doing its work on Disk2? Or is there a better way to get the job done?

It is safe.  It will slow down the parity process a bit, since the disk heads will have to seek back and forth between their two tasks.
  • Author

Thanks again, Joe! I think it would be best to let the parity rebuild finish ASAP, so I won't choose to do anything that will slow it down...;)

I'm not wishing to interfere with the valuable support which Joe is providing, but I'm just wanting to learn!

 

What proof do we have that the drive didn't fail before the parity build completed?

 

I would have thought that the values reported at ID # 1 and 197 were rather critical.  I would also be suspicious of #200.  These all have a value of zero on my EARS drive.

I'm not wishing to interfere with the valuable support which Joe is providing, but I'm just wanting to learn!

No problem. I'm learning too.

What proof do we have that the drive didn't fail before the parity build completed?

None... but it is responding to smart commands. That is a good start.

 

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate     0x002f   190   190   051    Pre-fail  Always       -       449

All drives have raw read errors. Some report them, some do not.  The current value of 190 is nowhere near 051 (the failure threshold) so it is not an issue.

 

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31

There are 31 sectors pending re-allocation when next written.  But again, the current value of 200 is nowhere near the failure threshold of 000.

 

I would have thought that the values reported at ID # 1 and 197 were rather critical.  I would also be suspicious of #200.  These all have a value of zero on my EARS drive.

I would not worry about value ID 200.  Again, the raw values only have meaning to the manufacturer in most cases.  The normalized value is unchanged at its starting value of 200 and nowhere near the failure threshold for that parameter.

Experience of these forums has been that once you start having issues with pending / reallocated sectors, the problems get worse with each parity check. If that happens to you, the drive will need to be replaced.
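
The reading rule used above (compare the normalized VALUE column with THRESH, and keep an eye on the raw pending/reallocated counts) can be applied to the whole table at once. The following is a sketch under assumptions: `smart_check` is a hypothetical name, it expects the ten-column attribute layout shown in this thread, and it treats an attribute as a concern when VALUE is at or below THRESH.

```shell
# Hypothetical helper: for each attribute row of a SMART report on stdin,
# print the normalized value, threshold, raw value and a simple status.
smart_check() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        value = $4 + 0       # normalized VALUE column
        thresh = $6 + 0      # failure THRESH column
        status = (value <= thresh) ? "AT-OR-BELOW-THRESHOLD" : "ok"
        printf "%s value=%d thresh=%d raw=%s %s\n", $2, value, thresh, $10, status
    }'
}

# Usage (the device name is a placeholder):
#   smartctl -A /dev/sdX | smart_check
```

Saving the output after each parity check makes it easy to see whether the raw count for Current_Pending_Sector or Reallocated_Sector_Ct is climbing.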

  • Author

Thanks for the lesson, Joe! I didn't know how to read that report other than the part that said that the drive passed, so I am glad to know the areas of interest and the values I should look for.

 

In retrospect, I think it would have been a better idea to unassign one of the unused Samsung drives (disk9, 10, and 11) and assign it to disk 2 (the failed drive) and then allowed the parity drive to rebuild the failed disk. BUT, I guess I would not really know if the WD drive is actually going bad or not, and at least this way I think I will know sooner by rebuilding it right away. If it fails again during the rebuild or soon after, I will pull it out of the array, RMA it and then assign one of the Samsungs to replace it and let the parity drive do its thing all over again. The bad part is that it just takes sooooo long to do a rebuild...I still have another 24 hours to go. Yikes!

In retrospect, I think it would have been a better idea to unassign one of the unused Samsung drives (disk9, 10, and 11) and assign it to disk 2 (the failed drive) and then allowed the parity drive to rebuild the failed disk.

 

I agree - that might have been the better option.

 

BUT, I guess I would not really know if the WD drive is actually going bad or not, and at least this way I think I will know sooner by rebuilding it right away.

 

I would have thought that running the preclear script would have been just as good.

You cannot replace a failed disk with another disk already in your array - doesn't matter if it is empty or not.

 

I was thinking the same thing.

 

 

The only thing you might have been able to do was copy the contents from the virtual disk to an actual disk in the array.  I have done that before just to make sure I had the content in case a drive really did fail and I did not have one to replace it at hand.

  • Author
You cannot replace a failed disk with another disk already in your array - doesn't matter if it is empty or not.

Why not? Can't I remove an empty disk (the Samsung disk9 in this case) safely from the array without harm, or does the parity drive need to rebuild itself even if I remove an empty disk? If the former, then I could assign it to disk2 and have the parity drive rebuild it as if it were a new replacement disk, but if the latter then you are correct, as the system would see 2 drive problems at the same time and would not be able to rebuild the bad disk. Gee, I still have a lot to learn...:)

 

 

Exactly, if you remove a disk (empty or not) from an array that is already missing disk2, unRAID will see it as two lost disks and it won't be able to recover your disk2 data.
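
A toy calculation may make the "two lost disks" point concrete. It assumes single parity behaves like a plain XOR across the data disks, which is a simplification the thread itself does not spell out. `xor_all` is a hypothetical helper and the numbers are made-up one-byte stand-ins for disk contents.

```shell
# Hypothetical helper: XOR all integer arguments together and print the result.
xor_all() {
    result=0
    for v in "$@"; do
        result=$(( result ^ v ))
    done
    echo "$result"
}

# Toy example with three one-byte "disks":
#   parity=$(xor_all 165 60 15)          # parity over disk1, disk2, disk3
#   xor_all "$parity" 165 15             # rebuilds disk2 from parity + survivors
# With disk2 AND another disk missing, one parity value leaves two unknowns,
# so neither can be recovered.
```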

 

The only way to remove a drive from the array without losing parity protection still involves running initconfig at one point, which would mean that you would lose all the data on disk2.  To remove a drive from the array without losing parity is only possible if you have a healthy array.

 

Just do what prostuff1 said: copy all the contents from the simulated disk2 onto the real replacement disk (disk9).  It will be slow, but it will work.  Run the copy from the command line to help speed it up (that way it will be an internal transfer and not a transfer over the network).  Once it completes, you can then run initconfig to make unRAID forget about the failed disk2.  At that point you could also move your current disk9 into the disk2 slot if you wish, or you could leave it as it is.  It wouldn't matter.
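
As a rough illustration of the command-line copy described above, a plain `cp -a` between the two disk mount points keeps the transfer inside the server. This is a sketch only: `copy_disk_contents` is a hypothetical name, and the /mnt/disk2 and /mnt/disk9 paths in the usage line assume the mount layout used elsewhere in this thread.

```shell
# Hypothetical helper: copy everything (including hidden files) from one
# directory into another, preserving ownership, permissions and timestamps.
copy_disk_contents() {
    src=$1
    dst=$2
    [ -d "$src" ] && [ -d "$dst" ] || {
        echo "both arguments must be existing directories" >&2
        return 1
    }
    cp -a "$src"/. "$dst"/
}

# Usage at the console (paths assumed from this thread):
#   copy_disk_contents /mnt/disk2 /mnt/disk9
```

Checking that the destination has enough free space first (for example with `df`) avoids a half-finished copy.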

 

For the future: if you are interested in the ability to recover from a failed drive with a drive already installed in the server, check out the warm spare concept.  You can read about it in great detail here.  Briefly, a warm spare is a cache drive that is as large or larger than your parity drive.  Since the cache drive is outside the array, you can unassign it at any time without disrupting the array.  So if an array drive dies, you simply unassign the dead drive and the warm spare/cache drive, then you assign the warm spare into the dead drive's slot.  The rebuild can then begin immediately, giving you ample time to troubleshoot or RMA the dead drive.

Archived

This topic is now archived and is closed to further replies.
