Failed Drive? Please point me in the right direction


So after four blissfully uneventful years of service I'm having trouble with my Unraid tower running Server Plus 4.7. After just commenting to my SO that it's been so hands-free!!  ::)

 

I had a write error pop up in Mac OS while copying to a share, and after logging in with the web GUI, I see that the parity drive now has the ominous red circle indicating that it's gone offline. No worries, that's why I invested time and money into the Unraid system right?

 

There's so much dated material floating around, could someone please point me to the latest/greatest directions for replacing the drive, if in fact that's what needs to happen. If I remember correctly I only need take the array offline, power down, and replace the drive. I fumbled around with Terminal long enough to Telnet in and generate a syslog which is attached here.

 

Guessing I'm woefully behind on updates too... any help appreciated, thanks!

 

postscript: doubtless I've posted this in the wrong spot, which is also a pet peeve of mine... but the subs go only to v 4.5 and it appears they're closed...

 

syslog-2015-01-05.txt.zip

I haven't looked @ your syslog as I'm out and about but know that sometimes when you have a failure you just want to start the discussion. So here is my view:

 

If it is indeed your parity drive that has died, replace it. Once the new drive is installed and assigned as the parity drive, unRAID will recalculate parity. Done.

 

Remember, though, that until parity is rebuilt your array is unprotected, so any further drive failure would most likely mean losing the data on the failed drive. So back up what is important if you haven't already.

A couple quick thoughts.

 

4.7 doesn't support anything larger than 2TB (technically 2.2, but nobody makes a 2.2TB drive). That means you need a 2TB drive at most to do any recovery of a bad drive without changing versions, which probably isn't a good idea without more information.

 

A write error from the client is a bad sign. Typically unRAID will silently fail a drive, and the only way you know is to check the GUI. Hopefully you don't have any other bad drives. I'd try to get SMART reports on all the drives, just to see where you are.

 

This is as good a place to post as any. 4.7 is indeed pretty old; at this point 5.06 is current, and it has been stable for a long time as 5.05, with a small security update in .06.

 

I personally wouldn't do anything with the array until you get more information about the health of all your drives.


 

I obviously had assumed that all other drives were in good health. I did also miss the fact the OP was using 4.x.

 

A quick question though: why would you advise he pull SMART reports on the other drives? Based on what has been written, I don't think he has any cause to believe the drives are anything but healthy. Unraid doesn't seem to have indicated to him (other than the write error) that any of the drives are bad.

 

I for instance (granted I am using v5.04) rely on Unraid to tell me if a drive is going bad or not. There is the SMART indicator in the GUI telling me if all is well or not and also the status "balls" next to the drive. I would have thought the only time to pull SMART reports would be when Unraid indicates initially that there is an issue (so to get more detail) otherwise one can assume the drives are nice and healthy.

 

Have I some misplaced faith in Unraid telling me the health of drives here?

  • Author

EDITED to clarify...

 

Thanks for the input guys. I haven't had a chance to run Smart reports yet and am mucking my way through the info on that. I've run Smart reports on everything and see errors only on sdc, which is disk 2, please see report below.

 

Before doing this I ran a parity check and there were more than 60 million errors... but the GUI now says that Parity is valid... pretty sure that's not the case as many many video files are now absent.

 

Both the reallocated_sector_ct and current_pending_sector lines read "0" for four of the drives, which is as it should be according to the troubleshooting FAQ. sdc, however, shows a reallocated sector count of 62 and lots of other errors, as you can see in the Smart report below. Surely this is the big red flag I'm looking for?
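For anyone scanning several reports for the same attributes, here is a minimal sketch of a filter, assuming the ten-column attribute layout shown in the report below. The function name `smart_flags` is made up for illustration, and including attribute 198 alongside the two named above is my own choice.

```shell
#!/bin/sh
# Read `smartctl -a` output on stdin and print the name and raw value of
# each watched attribute whose raw value is not 0.
# Watched IDs: 5 Reallocated_Sector_Ct, 197 Current_Pending_Sector,
# 198 Offline_Uncorrectable.
# Assumes the raw value is the 10th whitespace-separated column.
smart_flags() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && NF >= 10 && $10 + 0 != 0 { print $2, $10 }'
}
```

On the server this could be fed from the same command used for the report, for example `smartctl -a -d ata /dev/sdc | smart_flags`, repeated for each device. No output means all watched raw values were 0.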

 

So... does this mean I can simply take the array offline, replace sdc, start up the array, and expect that it will rebuild itself? Please help! In some ways I feel like the more I learn, the less I know! Thanks in advance.

 

_______________________________________________________________________________

 

root@Tower:~# smartctl  -a  -d  ata  /dev/sdc

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    Hitachi HDS723020BLA642

Serial Number:    MN1220F305ETDD

Firmware Version: MN6OA180

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Thu Jan  8 19:40:33 2015 CST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

was suspended by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (18523) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

SCT capabilities:       (0x003d) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       9764904
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       83
  3 Spin_Up_Time            0x0007   150   150   024    Pre-fail  Always       -       350 (Average 412)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2683
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       62
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   133   133   020    Pre-fail  Offline      -       27
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       33233
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       3233
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       3233
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Lifetime Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       62
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

 

SMART Error Log Version: 1

ATA Error Count: 15 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 15 occurred at disk power-on lifetime: 30880 hours (1286 days + 16 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 44 e4 b5 4e 07  Error: UNC 68 sectors at LBA = 0x074eb5e4 = 122598884

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 48 e0 b5 4e e0 08  1d+09:47:04.388  READ DMA EXT

  25 00 08 b8 84 5c e0 08  1d+09:47:04.383  READ DMA EXT

  25 00 08 40 01 b8 e0 08  1d+09:47:04.376  READ DMA EXT

  25 00 08 08 23 88 e0 08  1d+09:47:04.360  READ DMA EXT

  25 00 08 98 48 dc e0 08  1d+09:47:04.354  READ DMA EXT

 

Error 14 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 10 98 0e e0 06  Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 20 88 0e e0 e0 08  2d+07:06:03.453  READ DMA EXT

  ef 10 02 00 00 00 a0 08  2d+07:06:03.453  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 08  2d+07:06:03.453  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08  2d+07:06:03.452  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 08  2d+07:06:03.452  SET FEATURES [set transfer mode]

 

Error 13 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 10 98 0e e0 06  Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 20 88 0e e0 e0 08  2d+07:06:00.671  READ DMA EXT

  ef 10 02 00 00 00 a0 08  2d+07:06:00.671  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 08  2d+07:06:00.670  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08  2d+07:06:00.670  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 08  2d+07:06:00.669  SET FEATURES [set transfer mode]

 

Error 12 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 10 98 0e e0 06  Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 20 88 0e e0 e0 08  2d+07:05:57.888  READ DMA EXT

  ef 10 02 00 00 00 a0 08  2d+07:05:57.888  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 08  2d+07:05:57.888  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08  2d+07:05:57.887  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 08  2d+07:05:57.887  SET FEATURES [set transfer mode]

 

Error 11 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 10 98 0e e0 06  Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 20 88 0e e0 e0 08  2d+07:05:55.106  READ DMA EXT

  ef 10 02 00 00 00 a0 08  2d+07:05:55.106  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 08  2d+07:05:55.106  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 08  2d+07:05:55.105  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 08  2d+07:05:55.105  SET FEATURES [set transfer mode]

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

 

Parity does not actually contain any of your data so there is no reason a parity check, correcting or not, should cause any of your data to be missing. Also, replacing parity will not allow any of your data to be recovered.

 

Parity can only be rebuilt if all data drives are good, and a single data drive can only be rebuilt if parity and all other data drives are good.
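To see why exactly one missing disk can be rebuilt, here is a toy sketch, assuming the usual single-parity scheme where the parity is the XOR of the data disks. The byte values and the `xor_all` helper are invented for illustration and stand in for one byte position across three data disks.

```shell
#!/bin/sh
# Toy model: one byte per "disk". Real parity covers every bit position
# of every disk, but the arithmetic per position is the same XOR.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

d1=165 d2=60 d3=15                 # hypothetical data bytes on three disks
parity=$(xor_all "$d1" "$d2" "$d3")

# Disk 2 fails: XOR the parity with the surviving data disks to get it back.
rebuilt_d2=$(xor_all "$parity" "$d1" "$d3")
echo "parity=$parity rebuilt_d2=$rebuilt_d2"
```

The rebuilt value matches the original only because every other input is still correct. With two values unknown, the single XOR equation has no unique solution, which is why a bad parity drive plus a bad data drive cannot be recovered this way.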

 

The syslog in your OP seems to suggest problems with multiple drives. I suspect some problem other than a single drive failure, and multiple drive failure is unlikely unless there was some smoke or sparks you haven't mentioned.

 

What is your hardware? Try reseating controller cards, data and power cables at both ends. After you have done that, post a new syslog and a screenshot.

  • Author

trurl, I was editing my previous post while you wrote in so there is some new info which may change your outlook?

 

Do the successful Smart reports from four of the five drives alleviate the concerns about multiple drive issues? What in the syslog would suggest a larger system (bus, cables, etc.) issue? Obviously I'd prefer not to disassemble/reassemble the system at this point but if that's the logical next step, then that's what will happen.

The log has been rotated. Additional logs, named syslog.1, syslog.2, etc., are located in the /var/log/ directory. Attach these logs.

  • Author

It's probably pretty obvious by now that I'm not at all Linux savvy, I used this command to pull a log via Telnet with Terminal from a networked Mac:

 

cat /var/log/syslog

 

substituting "syslog.1" or "syslog.2" in that command returns a "No such file or directory" error. Surely I have the syntax wrong here, can someone explain how to pull the rotated syslog.1, .2 files in simple English?
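A "No such file or directory" error from `cat /var/log/syslog.1` usually means the file simply is not there rather than that the syntax is wrong. As a rough sketch, the helper below copies whatever `syslog*` files do exist into another folder so they can be attached. The function name and the destination folder are hypothetical, and it assumes the destination is somewhere you can reach from the Mac.

```shell
#!/bin/sh
# Copy the live syslog plus any rotated copies (syslog.1, syslog.2, ...)
# from a log directory into a destination directory.
# Prints the name of each file copied; prints nothing if none exist.
collect_syslogs() {
    log_dir=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in "$log_dir"/syslog*; do
        [ -f "$f" ] || continue    # glob matched nothing, or not a regular file
        cp "$f" "$dest"/ && basename "$f"
    done
}
```

For example, `collect_syslogs /var/log /mnt/disk1/logs` (the second path is only an illustration). If the only name printed is `syslog`, no rotated copies exist on the system.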

 

Or other next steps?

  • Author

Realization: "missing" files are actually still available on disk 3, disk 4, etc. but do not appear in mounted shares such as "movies." So it would appear that my data is all still here... Where to go next?

 

I can live through losing rips of Bluray/DVD and tv show downloads but have family photos and home video on this system which is irreplaceable. Yes, I know that I should have backups of this stuff, but I do not.


 

 

What does "ls /var/log/" output?

  • Author


 

acpid  cron  dmesg  images@  samba/  setup/  syslog

  • Author

bump

  • Author

At this point it doesn't look like there's much hope for getting this system back online, unless I just get lucky with blindly proceeding based on my limited understanding of threads with somewhat similar issues. It's frustrating to say the least, since it appears I could have had superior performance over the last few years by sacrificing the illusion of data redundancy that I've labored under by using this software. Sorry if this sounds "poor me" but the protection the system offers is worthless if at the single time of crisis that I've experienced, I can't work through a fix. Clearly there are many who have benefited from the software but it's obvious that I don't have the requisite skill sets required to make a go of it.

 

If anyone is aware of another forum where this issue could perhaps get some more traffic that would be appreciated.

 

If not I'll call it a loss and look into a commercial system to which I can transfer my data. I don't know whether there are any as robust, but if you need support and can't get it, it makes little difference.

Not sure why you want to give up so easily. You don't think you have lost any data according to a recent post. I suspect you have some hardware problem, and it might not actually be any drive that is the actual cause. Have you tried what I suggested about checking your connections? Could you give us a better idea of what your hardware is?

 

Even if there is a problem with parity and another disk, all of the data on the other disks should be OK. Each file in unRAID is completely contained on a single disk and if a disk goes out it does not affect the files on other disks. Even if one of the data disks is corrupt there is a good chance of recovering its files.
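Because each file lives whole on one disk, you can check the individual disks directly when a file is missing from a share. A minimal sketch, assuming the data disks are mounted as `disk1`, `disk2`, and so on under one root (I believe `/mnt` on the server, but treat that as an assumption); `find_on_disks` and the example path are hypothetical:

```shell
#!/bin/sh
# Print every per-disk copy of a share-relative path.
# $1: mount root holding disk1, disk2, ... (assumed /mnt on the server)
# $2: path relative to a disk, e.g. "movies/Some Film.mkv" (hypothetical)
find_on_disks() {
    root=$1
    rel=$2
    for d in "$root"/disk[0-9]*; do
        [ -e "$d/$rel" ] && echo "$d/$rel"
    done
    return 0
}
```

Something like `find_on_disks /mnt "movies/Some Film.mkv"` would then print the full path on whichever disk holds the file, or nothing if no disk has it.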

 

Even if you ultimately decide that unRAID is no longer for you after all these years, wouldn't you like to get your data back so you could put it on another system?

 

You need to be patient and work with us.

  • Author

trurl, you're probably right, this is just my frustration speaking... what may not be obvious to you is the difficulty I've had trying to troubleshoot, implement suggestions, and/or get a response after having done so.

 

It was suggested to run smart reports on all drives, which I did and posted the negative results for one of them above, the other four did not return any errors. In addition a message above indicated that the syslog I posted was no longer current and that I needed to pull a previous iteration, which I've had no luck finding. Setting that aside for the moment...

 

I've updated my sig to show the hardware config. Just so I'm clear, the recommended action is to take the array offline, reseat everything in the case, bring it back online, and post a new syslog? No risk involved in doing so?

After reviewing your syslog again I don't really see anything about disk2. Mostly disk0 (parity) and then some about disk1 near the end.

 

Post a screenshot of your main page so I can get a better idea of what unRAID thinks the current state is.

After reviewing the thread again, I have another idea of how we got to this point.

 

Do you do regular parity checks?

 

If not, then perhaps the parity disk was taken offline some time ago due to a write error, and you have been running without parity for a while. Then when you got the parity drive back online, there really was a lot of parity to be corrected.

 

You mentioned you don't have backups. Maybe that should be the first thing to consider as we go forward here.

 

  • Author

Haven't had a chance to pull the tower, power down and reseat everything yet but here's a screenshot of current state...

 

I do not perform regular parity checks so your suggestion of a "parity backlog" owing to the parity drive being taken offline makes sense.

 

I still see all files on drive 1, drive 2, etc. but they are not available in the shares.

 

2v8n0h1.jpg

  • 2 weeks later...
  • Author

Two days back I powered down the array, restarted the server and brought the array back online, and was pleased to see all files. I ran a parity check which returned zero errors. Great, but I didn't imagine that I was out of the woods yet, and sure enough, while copying files this afternoon I received a write error and watched the parity drive go offline. I immediately captured a syslog, which I hope will provide better info to diagnose the issue since it actually captures the startup of the array. There are quite a few ACPI errors; I'm hoping someone will weigh in so I can finally get some closure on the problem. Thanks in advance.

syslog-2015-01-22.txt.zip

Shutdown and run checkdisk on the flash drive in a PC or Mac. Then try again and post a new syslog.

  • Author

Pulled the flash drive, popped it into a laptop, and ran the Windows 7 GUI of CHKDSK using the error checking function under the tools pop-up under drive properties. No errors found. Reinstalled, array back online, and here's the log.

 

EDIT: This log was pulled after having opened the case, clearing out dust bunnies, and reseating connectors per trurl's suggestion, in addition to the disk check on the flash drive noted above.

syslog-2015-01-22_V2.txt.zip

  • Author

I'm reading that the bread block errors are often associated with flash drive failure, so I moved the flash drive to a new slot on the board and pulled another syslog... still seeing quite a few.

 

How likely is it that a flash drive failing could have led to the problems noted above? Perhaps several issues here?

 

I should point out that I have no way of knowing whether I've had some of these errors all along, prior to the read/write failures which prompted the investigation.

 

I'm still inclined to replace the parity drive...

 

Thanks for the continued help

syslog-2015-01-22_V3.txt.zip


Most of what I have read about ACPI errors suggests updating the BIOS or changing BIOS settings. You can find several posts in this forum related to that. Just in case you haven't figured out how searching works on this board, here are some hints. I say this because it is not entirely obvious how searching works here.

 

One of your drives is being used in IDE mode:

Jan 22 21:30:47 Tower kernel: md: import disk0: [8,0] (sda) Hitachi HDS72302 MN1221F300BPYD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk1: [8,16] (sdb) Hitachi HDS72302 MN1221F300BUZD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk2: [8,32] (sdc) Hitachi HDS72302 MN1220F305ETDD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk3: [8,48] (sdd) Hitachi HDS72302 MN1220F3179K4D size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk4: [3,0] (hda) WDC WD20EARX-00PASB0 WD-WCAZAF548320 size: 1953514552

This is often fixed in BIOS by making sure everything is set to use AHCI.
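To spot this in a long syslog, a small filter over the `md: import` lines can help. This is only a sketch, assuming the line format quoted above and that an `hd` device name (as with disk4 here) is the sign of a drive running in IDE mode; `ide_members` is a made-up name.

```shell
#!/bin/sh
# Read syslog text on stdin and print "diskN device" for every array
# member whose kernel device name starts with "hd" rather than "sd".
# Assumes lines shaped like: md: import disk4: [3,0] (hda) ...
ide_members() {
    sed -n 's/.*md: import \(disk[0-9]*\): \[[0-9,]*] (\(hd[a-z]*\)).*/\1 \2/p'
}
```

Run it as `ide_members < /var/log/syslog`. After changing the BIOS setting and rebooting, an empty result would suggest every array disk is now showing up with an `sd` name.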

 

There does seem to be a real problem with your flash drive. In addition to the FAT: Directory bread fails, I also see

Jan 22 21:32:41 Tower kernel: md: could not write superblock from /boot/config/super.dat

super.dat is the file that contains your array configuration and if it is not writable then any changes you make to the array configuration cannot be saved. By array configuration, I mean which exact disks are in which exact slots. This definitely needs to be fixed. Having said that I am not sure the flash drive is your only problem.
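If you want a quick way to tell whether a fresh syslog still shows these two flash symptoms, something like the following sketch could be used. The search strings are taken from the messages mentioned above and are assumptions about the exact wording in your log; `flash_symptoms` is a hypothetical helper.

```shell
#!/bin/sh
# Read syslog text on stdin and report how many lines mention the two
# flash-related symptoms discussed in the thread.
# The match patterns are assumptions about the exact message text.
flash_symptoms() {
    awk '
        /Directory bread/             { bread++ }
        /could not write superblock/  { sb++ }
        END { printf "bread=%d superblock=%d\n", bread, sb }
    '
}
```

Running `flash_symptoms < /var/log/syslog` after each change (new USB port, new flash drive) gives a simple before and after comparison.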

 

You mentioned yesterday that you can still access your files, but you also mentioned on Jan 8 that you don't have a backup of some irreplaceable files. Have you made any backups yet? I would take care of that before trying to do anything else.


This.

 

Do NOT make any array changes until you have a backup of everything that means anything to you. If it were me, I wouldn't even power down the unraid box until all the backups are done.
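For the backup itself, any copy method that you then verify will do. As one possible sketch, assuming `cp -a` and `diff -r` are available wherever you run it, the hypothetical helper below copies a folder and only reports success if the copy matches the source:

```shell
#!/bin/sh
# Copy one directory tree to a backup location, then compare the two
# trees. Returns 0 only if the copy succeeded and no differences remain.
backup_and_verify() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -a "$src"/. "$dest"/ || return 1
    diff -r "$src" "$dest" >/dev/null
}
```

Both paths are yours to choose, and the destination should be a drive outside the array. A non-zero return means either the copy failed or the two trees differ, so treat that folder as not yet backed up.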

Archived

This topic is now archived and is closed to further replies.
