Failed Drive? Please point me in the right direction

January 6, 2015

So after four blissfully uneventful years of service, I'm having trouble with my Unraid tower running Server Plus 4.7. This comes right after commenting to my SO that it's been so hands-free!! ::)

I had a write error pop up in Mac OS while copying to a share, and after logging in with the web GUI, I see that the parity drive now has the ominous red circle indicating that it's gone offline. No worries, that's why I invested time and money into the Unraid system, right?

There's so much dated material floating around. Could someone please point me to the latest/greatest directions for replacing the drive, if in fact that's what needs to happen? If I remember correctly I only need to take the array offline, power down, and replace the drive. I fumbled around with Terminal long enough to Telnet in and generate a syslog, which is attached here.

Guessing I'm woefully behind on updates too... any help appreciated, thanks!

postscript: doubtless I've posted this in the wrong spot, which is also a pet peeve of mine... but the subs go only to v 4.5 and it appears they're closed...

syslog-2015-01-05.txt.zip

January 6, 2015

I haven't looked @ your syslog as I'm out and about but know that sometimes when you have a failure you just want to start the discussion. So here is my view:

If it is indeed your parity drive that has died, replace it. Once the new drive is installed and selected as the parity drive, unRAID will recalculate parity. Done.

Remember though that until then your array will be unprotected, so any further drive failure would most likely result in some data loss on the failed drive. So back up what is important if you haven't already.

January 6, 2015

A couple quick thoughts.

4.7 doesn't support anything larger than 2TB (technically 2.2, but nobody makes a 2.2TB drive). That means any replacement for a bad drive can be 2TB at most unless you change versions, and changing versions probably isn't a good idea without more information.
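For anyone wondering where the 2.2 figure comes from, here is a minimal sketch of the arithmetic. It assumes the limit is the usual one for 32-bit sector addressing with 512-byte sectors; that attribution is an assumption on my part, not something confirmed in this thread.

```python
# Assumption: 32-bit sector addresses and 512-byte logical sectors.
SECTOR_BYTES = 512
ADDRESSABLE_SECTORS = 2 ** 32

limit_bytes = SECTOR_BYTES * ADDRESSABLE_SECTORS
limit_tb = limit_bytes / 10 ** 12  # drive makers count in decimal terabytes

print(f"{limit_bytes} bytes = {limit_tb:.2f} TB")
# prints: 2199023255552 bytes = 2.20 TB
```

A drive sold as 2TB (about 2,000,000,000,000 bytes) sits comfortably under that ceiling; anything sold as 3TB does not.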

A write error from the client is a bad sign. Typically unRAID will silently fail a drive, and the only way you know is to check the GUI. Hopefully you don't have any other bad drives. I'd try to get SMART reports on all the drives, just to see where you are.

This is as good a place to post as any. 4.7 is indeed pretty old; at this point 5.0.6 is current, and it's been stable for a long time as 5.0.5, with a small security update taking it to 5.0.6.

I personally wouldn't do anything with the array until you get more information about the health of all your drives.

January 6, 2015

I personally wouldn't do anything with the array until you get more information about the health of all your drives.

I obviously had assumed that all other drives were in good health. I did also miss the fact the OP was using 4.x.

A quick question though: why would you advise he pull SMART reports on the other drives? Based on what has been written, I don't think he has any cause to believe the drives are anything but healthy. Unraid doesn't seem to have indicated to him (other than the write error) that any of the drives are bad.

I for instance (granted I am using v5.04) rely on Unraid to tell me if a drive is going bad or not. There is the SMART indicator in the GUI telling me if all is well or not and also the status "balls" next to the drive. I would have thought the only time to pull SMART reports would be when Unraid indicates initially that there is an issue (so to get more detail) otherwise one can assume the drives are nice and healthy.

Have I some misplaced faith in Unraid telling me the health of drives here?

January 9, 2015

Author

EDITED to clarify...

Thanks for the input guys. I haven't had a chance to run Smart reports yet and am mucking my way through the info on that. I've run Smart reports on everything and see errors only on sdc, which is disk 2, please see report below.

Before doing this I ran a parity check and there were more than 60 million errors... but the GUI now says that Parity is valid... pretty sure that's not the case as many many video files are now absent.

Both the Reallocated_Sector_Ct and Current_Pending_Sector lines read "0" for four of the drives, which is as it should be according to the troubleshooting FAQ. sdc, however, shows a raw Reallocated_Sector_Ct of 62 and lots of other errors, as you can see in the SMART report below. Surely this is the big red flag I'm looking for?
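A minimal sketch of that check, in case it helps anyone doing the same across several drives. The attribute rows are copied from the sdc report in this post; the function names are made up for illustration, and it assumes the raw value starts with a plain integer, which holds for this report but may not for every drive.

```python
# Attribute rows copied from the smartctl report for sdc in this post.
REPORT = """\
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 62
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 62
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
"""

# The two attributes the troubleshooting FAQ says should read 0.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def raw_values(report):
    """Map attribute name -> leading integer of RAW_VALUE for each table row."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # A table row has a numeric ID, a name, a 0x.... flag and 7 more columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            values[fields[1]] = int(fields[9])
    return values


def nonzero_watched(report):
    """Return the watched attributes whose raw value is not zero."""
    raw = raw_values(report)
    return {name: raw[name] for name in WATCHED if raw.get(name, 0) != 0}


print(nonzero_watched(REPORT))
# prints: {'Reallocated_Sector_Ct': 62}
```

Because the row filter looks for the 0x flag column, pasting in the whole smartctl output should also work: the error-log lines start with numbers too, but they are skipped.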

So... does this mean I can simply take the array offline, replace sdc, start up the array, and expect that it will rebuild itself? Please help! In some ways I feel like the more I learn, the less I know! Thanks in advance.

_______________________________________________________________________________

root@Tower:~# smartctl -a -d ata /dev/sdc

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

=== START OF INFORMATION SECTION ===

Device Model: Hitachi HDS723020BLA642

Serial Number: MN1220F305ETDD

Firmware Version: MN6OA180

User Capacity: 2,000,398,934,016 bytes

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: 8

ATA Standard is: ATA-8-ACS revision 4

Local Time is: Thu Jan 8 19:40:33 2015 CST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x84) Offline data collection activity

was suspended by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (18523) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

SCT capabilities: (0x003d) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   086   086   016    Pre-fail  Always       -       9764904
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       83
  3 Spin_Up_Time            0x0007   150   150   024    Pre-fail  Always       -       350 (Average 412)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       2683
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       62
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   133   133   020    Pre-fail  Offline      -       27
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       33233
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   098   098   000    Old_age   Always       -       3233
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       3233
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Lifetime Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       62
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1

ATA Error Count: 15 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 15 occurred at disk power-on lifetime: 30880 hours (1286 days + 16 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 44 e4 b5 4e 07 Error: UNC 68 sectors at LBA = 0x074eb5e4 = 122598884

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 48 e0 b5 4e e0 08 1d+09:47:04.388 READ DMA EXT

25 00 08 b8 84 5c e0 08 1d+09:47:04.383 READ DMA EXT

25 00 08 40 01 b8 e0 08 1d+09:47:04.376 READ DMA EXT

25 00 08 08 23 88 e0 08 1d+09:47:04.360 READ DMA EXT

25 00 08 98 48 dc e0 08 1d+09:47:04.354 READ DMA EXT

Error 14 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 10 98 0e e0 06 Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 20 88 0e e0 e0 08 2d+07:06:03.453 READ DMA EXT

ef 10 02 00 00 00 a0 08 2d+07:06:03.453 SET FEATURES [Reserved for Serial ATA]

27 00 00 00 00 00 e0 08 2d+07:06:03.453 READ NATIVE MAX ADDRESS EXT

ec 00 00 00 00 00 a0 08 2d+07:06:03.452 IDENTIFY DEVICE

ef 03 46 00 00 00 a0 08 2d+07:06:03.452 SET FEATURES [set transfer mode]

Error 13 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 10 98 0e e0 06 Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 20 88 0e e0 e0 08 2d+07:06:00.671 READ DMA EXT

ef 10 02 00 00 00 a0 08 2d+07:06:00.671 SET FEATURES [Reserved for Serial ATA]

27 00 00 00 00 00 e0 08 2d+07:06:00.670 READ NATIVE MAX ADDRESS EXT

ec 00 00 00 00 00 a0 08 2d+07:06:00.670 IDENTIFY DEVICE

ef 03 46 00 00 00 a0 08 2d+07:06:00.669 SET FEATURES [set transfer mode]

Error 12 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 10 98 0e e0 06 Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 20 88 0e e0 e0 08 2d+07:05:57.888 READ DMA EXT

ef 10 02 00 00 00 a0 08 2d+07:05:57.888 SET FEATURES [Reserved for Serial ATA]

27 00 00 00 00 00 e0 08 2d+07:05:57.888 READ NATIVE MAX ADDRESS EXT

ec 00 00 00 00 00 a0 08 2d+07:05:57.887 IDENTIFY DEVICE

ef 03 46 00 00 00 a0 08 2d+07:05:57.887 SET FEATURES [set transfer mode]

Error 11 occurred at disk power-on lifetime: 30754 hours (1281 days + 10 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

40 51 10 98 0e e0 06 Error: UNC 16 sectors at LBA = 0x06e00e98 = 115347096

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

25 00 20 88 0e e0 e0 08 2d+07:05:55.106 READ DMA EXT

ef 10 02 00 00 00 a0 08 2d+07:05:55.106 SET FEATURES [Reserved for Serial ATA]

27 00 00 00 00 00 e0 08 2d+07:05:55.106 READ NATIVE MAX ADDRESS EXT

ec 00 00 00 00 00 a0 08 2d+07:05:55.105 IDENTIFY DEVICE

ef 03 46 00 00 00 a0 08 2d+07:05:55.105 SET FEATURES [set transfer mode]

SMART Self-test log structure revision number 1

No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

January 9, 2015

Parity does not actually contain any of your data so there is no reason a parity check, correcting or not, should cause any of your data to be missing. Also, replacing parity will not allow any of your data to be recovered.

Parity can only be rebuilt if all data drives are good, and a single data drive can only be rebuilt if parity and all other data drives are good.
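To make that concrete, here is a toy sketch, assuming single parity is a plain byte-wise XOR across the data disks. The three tiny "disks" and the helper name xor_blocks are made up for illustration; this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, four bytes each.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0F, 0x0E, 0x0D, 0x0C])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Building parity needs every data disk to be readable.
parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk2 needs parity plus all of the *other* data disks.
rebuilt_disk2 = xor_blocks([parity, disk1, disk3])
print(rebuilt_disk2 == disk2)  # prints: True

# With stale parity (here: all zeros) the same rebuild returns garbage.
stale_parity = bytes(len(parity))
print(xor_blocks([stale_parity, disk1, disk3]) == disk2)  # prints: False
```

Note that parity on its own holds none of the files. It only becomes useful in combination with every other disk, which is why a second problem disk changes the picture so much.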

The syslog in your OP seems to suggest problems with multiple drives. I suspect some problem other than a single drive failure, and multiple drive failure is unlikely unless there was some smoke or sparks you haven't mentioned.

What is your hardware? Try reseating controller cards, data and power cables at both ends. After you have done that, post a new syslog and a screenshot.

January 9, 2015

Author

trurl, I was editing my previous post while you wrote in so there is some new info which may change your outlook?

Do the successful Smart reports from four of the five drives alleviate the concerns about multiple drive issues? What in the syslog would suggest a larger system (bus, cables, etc.) issue? Obviously I'd prefer not to disassemble/reassemble the system at this point but if that's the logical next step, then that's what will happen.

January 9, 2015

The log has been rotated. Additional logs, named syslog.1, syslog.2, etc., are located in the /var/log/ directory. Attach these logs.

January 9, 2015

Author

It's probably pretty obvious by now that I'm not at all Linux savvy, I used this command to pull a log via Telnet with Terminal from a networked Mac:

cat /var/log/syslog

substituting "syslog.1" or "syslog.2" in that command returns a "No such file or directory" error. Surely I have the syntax wrong here, can someone explain how to pull the rotated syslog.1, .2 files in simple English?

Or other next steps?

January 9, 2015

Author

Realization: "missing" files are actually still available on disk 3, disk 4, etc. but do not appear in mounted shares such as "movies." So it would appear that my data is all still here... Where to go next?

I can live through losing rips of Bluray/DVD and tv show downloads but have family photos and home video on this system which is irreplaceable. Yes, I know that I should have backups of this stuff, but I do not.

January 9, 2015

It's probably pretty obvious by now that I'm not at all Linux savvy, I used this command to pull a log via Telnet with Terminal from a networked Mac:

cat /var/log/syslog

substituting "syslog.1" or "syslog.2" in that command returns a "No such file or directory" error. Surely I have the syntax wrong here, can someone explain how to pull the rotated syslog.1, .2 files in simple English?

Or other next steps?

What does "ls /var/log/" output?

January 9, 2015

Author

What does "ls /var/log/" output?

acpid cron dmesg images@ samba/ setup/ syslog

January 10, 2015

Author

bump

January 11, 2015

Author

At this point it doesn't look like there's much hope for getting this system back online, unless I just get lucky with blindly proceeding based on my limited understanding of threads with somewhat similar issues. It's frustrating to say the least, since it appears I could have had superior performance over the last few years by sacrificing the illusion of data redundancy that I've labored under by using this software. Sorry if this sounds "poor me" but the protection the system offers is worthless if at the single time of crisis that I've experienced, I can't work through a fix. Clearly there are many who have benefited from the software but it's obvious that I don't have the requisite skill sets required to make a go of it.

If anyone is aware of another forum where this issue could perhaps get some more traffic that would be appreciated.

If not I'll call it a loss and look into a commercial system to which I can transfer my data. I don't know whether there are any as robust, but if you need support and can't get it, it makes little difference.

January 11, 2015

Not sure why you want to give up so easily. You don't think you have lost any data according to a recent post. I suspect you have some hardware problem, and it might not actually be any drive that is the actual cause. Have you tried what I suggested about checking your connections? Could you give us a better idea of what your hardware is?

Even if there is a problem with parity and another disk, all of the data on the other disks should be OK. Each file in unRAID is completely contained on a single disk and if a disk goes out it does not affect the files on other disks. Even if one of the data disks is corrupt there is a good chance of recovering its files.

Even if you ultimately decide that unRAID is no longer for you after all these years, wouldn't you like to get your data back so you could put it on another system?

You need to be patient and work with us.

January 11, 2015

Author

trurl, you're probably right, this is just my frustration speaking... what may not be obvious to you is the difficulty I've had trying to troubleshoot, implement suggestions, and/or get a response after having done so.

It was suggested to run smart reports on all drives, which I did and posted the negative results for one of them above, the other four did not return any errors. In addition a message above indicated that the syslog I posted was no longer current and that I needed to pull a previous iteration, which I've had no luck finding. Setting that aside for the moment...

I've updated my sig to show the hardware config. Just so I'm clear, the recommended action is to take the array offline, reseat everything in the case, bring it back online, and post a new syslog? No risk involved in doing so?

January 11, 2015

After reviewing your syslog again I don't really see anything about disk2. Mostly disk0 (parity) and then some about disk1 near the end.

Post a screenshot of your main page so I can get a better idea of what unRAID thinks the current state is.

January 11, 2015

After reviewing the thread again, I have another idea of how we got to this point.

Do you do regular parity checks?

If not, then perhaps the parity disk was taken offline some time ago due to a write error, and you have been running without parity for a while. Then when you got the parity disk back online there really was a lot of parity to be corrected.

You mentioned you don't have backups. Maybe that should be the first thing to consider as we go forward here.

January 14, 2015

Author

Haven't had a chance to pull the tower, power down and reseat everything yet but here's a screenshot of current state...

I do not perform regular parity checks so your suggestion of a "parity backlog" owing to the parity drive being taken offline makes sense.

I still see all files on drive 1, drive 2, etc. but they are not available in the shares.

January 23, 2015

Author

Two days back I powered down the array, restarted the server and brought the array back online, and was pleased to see all files. I ran a parity check which returned zero errors. Great, but I didn't imagine that I was out of the woods yet, and sure enough, while copying files this afternoon I received a write error and watched the parity drive go offline. I immediately captured a syslog, which I hope will provide better info to diagnose the issue since it actually captures the startup of the array. There are quite a few ACPI errors. I'm hoping someone will weigh in so I can finally get some closure on the problem. Thanks in advance.

syslog-2015-01-22.txt.zip

January 23, 2015

Shut down and run checkdisk on the flash drive in a PC or Mac. Then try again and post a new syslog.

January 23, 2015

Author

Pulled the flash drive, popped it into a laptop, and ran the Windows 7 GUI of CHKDSK using the error checking function under the tools pop-up under drive properties. No errors found. Reinstalled, array back online, and here's the log.

EDIT: This log was pulled after having opened the case, clearing out dust bunnies, and reseating connectors per trurl's suggestion, in addition to the disk check on the flash drive noted above.

syslog-2015-01-22_V2.txt.zip

January 23, 2015

Author

I'm reading that the bread block errors are often associated with flash drive failure so moved the flash drive to a new slot on the board and pulled another syslog... still seeing quite a few

How likely is it that a flash drive failing could have led to the problems noted above? Perhaps several issues here?

I should point out that I have no way of knowing whether I've had some of these errors all along, prior to the read/write failures which prompted the investigation.

I'm still inclined to replace the parity drive...

Thanks for the continued help

syslog-2015-01-22_V3.txt.zip

January 23, 2015

I'm reading that the bread block errors are often associated with flash drive failure so moved the flash drive to a new slot on the board and pulled another syslog... still seeing quite a few

How likely is it that a flash drive failing could have led to the problems noted above? Perhaps several issues here?

I should point out that I have no way of knowing whether I've had some of these errors all along, prior to the read/write failures which prompted the investigation.

I'm still inclined to replace the parity drive...

Thanks for the continued help

Most of what I have read about ACPI errors suggests updating BIOS or changing BIOS settings. You can find several posts in this forum related to that. Just in case you haven't figured out how searching works on this board, here are some hints. I say this because it is not entirely obvious how searching works here.

One of your drives is being used in IDE mode:

Jan 22 21:30:47 Tower kernel: md: import disk0: [8,0] (sda) Hitachi HDS72302 MN1221F300BPYD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk1: [8,16] (sdb) Hitachi HDS72302 MN1221F300BUZD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk2: [8,32] (sdc) Hitachi HDS72302 MN1220F305ETDD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk3: [8,48] (sdd) Hitachi HDS72302 MN1220F3179K4D size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk4: [3,0] (hda) WDC WD20EARX-00PASB0 WD-WCAZAF548320 size: 1953514552

This is often fixed in BIOS by making sure everything is set to use AHCI.
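A minimal sketch of how to spot that in a syslog without reading every line. It relies on the old Linux convention that drives on the IDE driver show up as hdX (block major 3) while SATA/AHCI drives show up as sdX (major 8). The regular expression and function name are made up for illustration.

```python
import re

# The five "md: import" lines quoted above.
SYSLOG = """\
Jan 22 21:30:47 Tower kernel: md: import disk0: [8,0] (sda) Hitachi HDS72302 MN1221F300BPYD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk1: [8,16] (sdb) Hitachi HDS72302 MN1221F300BUZD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk2: [8,32] (sdc) Hitachi HDS72302 MN1220F305ETDD size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk3: [8,48] (sdd) Hitachi HDS72302 MN1220F3179K4D size: 1953514552
Jan 22 21:30:47 Tower kernel: md: import disk4: [3,0] (hda) WDC WD20EARX-00PASB0 WD-WCAZAF548320 size: 1953514552
"""

# Captures the slot (disk4), the major/minor pair ([3,0]) and the device (hda).
IMPORT_LINE = re.compile(r"md: import (disk\d+): \[(\d+),(\d+)\] \((\w+)\)")


def ide_mode_disks(syslog_text):
    """Return (slot, device) pairs whose device name starts with 'hd'."""
    found = []
    for match in IMPORT_LINE.finditer(syslog_text):
        slot, _major, _minor, device = match.groups()
        if device.startswith("hd"):
            found.append((slot, device))
    return found


print(ide_mode_disks(SYSLOG))
# prints: [('disk4', 'hda')]
```

So disk4, the WD drive, is the one running in IDE mode.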

There does seem to be a real problem with your flash drive. In addition to the FAT: Directory bread fails, I also see:

Jan 22 21:32:41 Tower kernel: md: could not write superblock from /boot/config/super.dat

super.dat is the file that contains your array configuration and if it is not writable then any changes you make to the array configuration cannot be saved. By array configuration, I mean which exact disks are in which exact slots. This definitely needs to be fixed. Having said that I am not sure the flash drive is your only problem.

You mentioned yesterday that you can still access your files, but you also mentioned on Jan 8 that you don't have a backup of some irreplaceable files. Have you made any backups yet? I would take care of that before trying to do anything else.

January 23, 2015

I'm still inclined to replace the parity drive...

You mentioned yesterday that you can still access your files, but you also mentioned on Jan 8 that you don't have a backup of some irreplaceable files. Have you made any backups yet? I would take care of that before trying to do anything else.

This.

Do NOT make any array changes until you have a backup of everything that means anything to you. If it were me, I wouldn't even power down the unraid box until all the backups are done.
