
[SOLVED] something wrong! failing to start up array and unresponsive..


Kilack


I came out this morning and the web interface to unRAID was unresponsive; I couldn't get to the shares via Samba either.

I restarted it, and when I clicked Start to bring the array up, it never managed to start and the web interface froze again...

 

I have attached a log; hopefully someone can help me and let me know what to do.

I don't want to lose data, obviously (that is why I'm using unRAID) :)

 

Did a parity check 2 days ago and all was good.

 

 

syslog.zip

Link to comment

I did some SMART checks on all the drives, and one of them is turning up this:

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%      687        414069864

# 2  Short offline      Completed: read failure      90%      687        414069864

 

SMART overall-health self-assessment test result: PASSED

 

Is that a serious issue? Would that be covered under warranty?

 

I don't really understand SMART very well; it is reporting a read failure in the self-tests but also saying the overall health assessment passed.

So would this cause unRAID to fail? Or...

Link to comment

Post the entire SMART report, and the syslog as well. That will provide the information needed to interpret the issue.

 

...what "passes" for SMART doesn't necessarily mean the drive is healthy. Your SMART self-test aborted due to a failure to read the drive, and it reports the location of the first read error. That is likely a bad sector, but it could also be bad cabling.
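
For reference, the full report and a fresh self-test can be run with the standard smartctl commands (substitute your own device name; the -d ata flag may or may not be needed depending on your controller):

smartctl -a -d ata /dev/sdh    # full report: attributes, error log, self-test log
smartctl -t short /dev/sdh     # queue a ~2 minute self-test, then re-run the first command to read the result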

Link to comment

The syslog is in the first post. Here is the full SMART report; the array is offline, so nothing should be reading the drives.

I can't start the array; it fails...

 

smartctl -a -d ata /dev/sdh

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    WDC WD20EARX-00PASB0

Serial Number:    WD-WCAZA8446260

Firmware Version: 51.0AB51

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Sat Sep 10 11:26:02 2011 NZST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      ( 121) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: (38400) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  5) minutes.

SCT capabilities:       (0x3035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  190  170  021    Pre-fail  Always      -      5491

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      140

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      689

10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      38

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      19

193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      1622

194 Temperature_Celsius    0x0022  123  118  000    Old_age  Always      -      27

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      3

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      2

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed: read failure      90%      689        414069864

# 2  Extended offline    Completed: read failure      90%      689        414069864

# 3  Extended offline    Completed: read failure      90%      687        414069864

# 4  Short offline      Completed: read failure      90%      687        414069864

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

:-[ My bad - I started reading your syslog, switched to viewing another thread, then went back to yours.

Looks like parity is bad now...

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38224

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38232

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38240

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38248

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38256

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38264

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38272

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38280

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38288

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38296

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38304

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38312

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38320

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38328

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38336

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38344

 

The SMART report isn't bad: 3 pending sectors...

 

 

but...

Sep 10 08:16:27 Tower kernel: sas: command 0xedb869c0, task 0xee2a63c0, timed out: BLK_EH_NOT_HANDLED

http://lime-technology.com/forum/index.php?topic=14934.msg141476#msg141476
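
If you want to pull those lines out of the syslog yourself, something like this works (log path as on a stock unRAID box):

egrep -i 'parity incorrect|BLK_EH_NOT_HANDLED' /var/log/syslog | head -n 40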

Link to comment

Looks like parity is bad now...

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38224

[...]

The SMART report isn't bad: 3 pending sectors...

but...

Sep 10 08:16:27 Tower kernel: sas: command 0xedb869c0, task 0xee2a63c0, timed out: BLK_EH_NOT_HANDLED

http://lime-technology.com/forum/index.php?topic=14934.msg141476#msg141476

 

So it seems those BLK_EH_NOT_HANDLED errors are unresolved, and it doesn't seem they know what causes them yet...

 

As for these parity errors... what am I to do?

Doing a correcting parity check would rewrite the parity disk... but what if the parity disk is already correct and it's another disk that has the errors? I assume that is my case, since this disk is suddenly showing SMART errors. So wouldn't I want to rebuild the data on this disk instead?

 

 

Link to comment

I worried about the same thing, because I didn't understand how it really worked.

More precise information:

http://lime-technology.com/wiki/index.php?title=FAQ#How_does_parity_work.3F

 

 

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

1. First, do you have any errors reported in the GUI for /dev/sdh?

2. Review the SMART report for the parity drive. Check for pending sectors or reallocated sectors.

3. Run a parity check - read only (non-correcting).

4. Watch the progress and speed of the check, and note whether you get any read errors from /dev/sdh... that will indicate you have more pending sectors. You will start to see messages in your syslog similar to these: ;) <those are mine from tonight>

Sep  8 19:24:33 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)

Sep  8 19:24:33 Tower kernel: ata8.00: irq_stat 0x48000000 (Drive related)

Sep  8 19:24:33 Tower kernel: ata8.00: failed command: READ DMA EXT (Minor Issues)

Sep  8 19:24:33 Tower kernel: ata8.00: cmd 25/00:08:90:d5:65/00:00:64:00:00/e0 tag 0 dma 4096 in (Drive related)

 

5. If your disk reads OK <unlikely, based on the short test result>, the number of errors will remain the same.

6. If so, then run a correcting parity check to get the zero sync errors message. :)
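
For steps 2-4, an easy way to keep an eye on things from a telnet session while the check runs (device name and log path as on my box - adjust for yours):

smartctl -A /dev/sdh | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
tail -f /var/log/syslog | egrep -i 'ata[0-9]|parity incorrect'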

 

 

 

Link to comment

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

1. First, do you have any errors reported in the gui on /sdh?

2. Run another parity check.  

 

 

I am currently running a parity check with "fix any errors" unchecked...

It is up to 448 errors on /dev/sdh so far... I have noticed the SMART report now shows 6 sectors pending reallocation on that drive too...

I precleared all these drives a few weeks ago when I bought them and they had 0 issues then.

 

This drive is only 30 days old..

 

 

Link to comment

I have 535 hours on a new HD204UI. I ran a verify-only parity check because I thought the same way as you did about the parity calculations... plus I thought the drive would puke anyway. Every time I try to run a parity check it starts at ~60 MB/s... and eventually drops to under 100 KB/s. Just waiting for the UPS guy to show up with the new drive.

 

I may start running two preclears on each drive... and maybe more on the drive I take out, to make sure there is no chance of the RMA being rejected.

Link to comment

@mbryanr

 

OK, thanks for your help and info. I didn't think I'd be at this stage with unRAID so soon... bloody hard drives, so unreliable...

So even running the parity check without the "fix parity" option will restore the correct data to the disk? That isn't made very clear in the GUI.

 

Also, given that the drive is just 30 days old, should I RMA it? Or is that just the way it goes with drives?

(Then the next task is replacing the drive and working out how to rebuild the data onto the new drive!) Oh, the drama...

Link to comment

RMA it... maybe run a preclear on it to kill it.

 

I have got my RMA paperwork ready for when I swap it out on Monday. ;D

 

So even running the parity check without the "fix parity" option will restore the correct data to the disk? That isn't made very clear in the GUI.

It doesn't restore the correct data to the disk if that is unchecked. But in 4.7 at least, the GUI says it was corrected...

 

I'm a novice too, just happened to have a drive start to fail 2-3 days ago.

Link to comment

RMA it... maybe run a preclear on it to kill it.

 

It doesn't restore the correct data to the disk if that is unchecked. But in 4.7 at least, the GUI says it was corrected...

 

I'll swap it at the store tomorrow then.

 

I am a little confused about unRAID now... I mean, I understand how parity works, etc. What I don't understand is this:

If we basically know drive xx is the one with the issues, then surely there should be a way to run a parity check and, when it comes across parity errors, tell it to rewrite the correct data to drive xx based on the parity drive and the other data drives, which are good. It seems the only options are: rebuild the parity drive, which would be pointless because we know drive xx now holds faulty data; or just do a check without making any changes, which won't fix anything but gives you information?
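
To illustrate what I mean with a toy XOR sketch (made-up byte values, not real data): the block on drive xx should be recoverable from the parity block plus the corresponding blocks on the other data drives.

printf 'parity  = 0x%02X\n' $(( 0xA5 ^ 0x3C ^ 0xF0 ))   # parity of three data bytes -> 0x69
printf 'rebuilt = 0x%02X\n' $(( 0xA5 ^ 0xF0 ^ 0x69 ))   # lose the drive holding 0x3C; XOR of the rest gets it back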

 

Am I missing something?

 

It is kind of tempting to go to ZFS, where everything is self-healing, etc... but I love unRAID for its flexibility... and ease of use...

 

Link to comment
or just do a check without making any changes, which won't fix anything but gives you information?

            ^^^ I believe this is the safer option.

 

I believe the assumption is/was that if a data drive has bad sectors, it will throw an error and not send any information (bits) to the parity drive, nor flip the bits on the parity drive. That is why, in most cases, it is OK to run a correcting parity check with a failing data disk. The scenario described in the other thread, though, calls that assumption into question.

 

Same reasons I prefer unRAID over the other options: just load it up and go. Guess I had to have something happen, and it made me learn how to diagnose the problem.

 

Link to comment
It seems the only options are: rebuild the parity drive, which would be pointless because we know drive xx now holds faulty data; or just do a check without making any changes, which won't fix anything but gives you information?

 

Other options are to replace or rebuild the failing disk.

Link to comment

Other options are to replace or rebuild the failing disk.

 

How do you rebuild a disk that has some errors?

Link to comment

Almost the same as if you were replacing the drive with a new one:

1. Stop the array.

2. Go to the Devices page and un-assign the disk.

3. Go to the main page and start the array.

4. Stop the array again.

5. Go to the Devices page and re-assign the disk.

6. Go to the Main page - the system should indicate there is a "new" drive to replace the disabled one. Check the confirmation box and click Start to begin a parity reconstruct of the disk.

 

Following this procedure, you are assuming that no further errors will occur on the drive <pending sectors/reallocated sectors>, and that no write to the drive fails (red-ball example). If there are errors, you are back to replacing the drive with a new one.

 

In my case, since I didn't have another drive on hand... I didn't want to take the chance of degrading the array, and obviously I couldn't replace the drive immediately.

 

Edit:

I should note that a drive can reallocate a significant number of sectors and still function - but I prefer not to find the limit. I am probably incorrect, but a drive that fails a write operation is likely very different from one that fails a read. A drive that has a write error but can still be read is in a different failure mode from one that cannot be read completely. The drive that is failing in my array could not complete a SMART test or a parity check... therefore I decided to replace it.
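
If you do go the rebuild route, a quick sanity check is whether the drive can even complete a long self-test first - standard smartmontools commands, substitute your own device:

smartctl -t long /dev/sdX       # start an extended offline self-test
smartctl -l selftest /dev/sdX   # after the recommended polling time, review the self-test log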

 

 

 

Link to comment

I don't suppose there is an easy way to see which files sit on the bad parity blocks?

 

As one of the main aims of unRAID is to store media... it would be quite easy to re-download corrupted files if we could easily see which ones were affected.

 

When unRAID does a parity check and finds errors, it should automatically create a log of which files are affected - that would be an awesome feature.

Then at least people could double-check those files when it's just a case of a few errors on a disk.

 

 

Link to comment

I don't suppose there is an easy way to see which files sit on the bad parity blocks?

Not that I know of currently with unRAID.
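
The closest manual approach I know of is along the lines of the smartmontools bad-block procedure: take the LBA_of_first_error from the drive's self-test log, convert it to a filesystem block, then ask the filesystem which file owns that block. A rough sketch with assumed values (512-byte sectors; partition start and block size below are examples - check yours with fdisk -lu and your filesystem's tools). The last step is filesystem-specific: debugfs icheck/ncheck works on ext2/3, and I'm not sure of an equally simple tool for unRAID's ReiserFS.

LBA=414069864     # LBA_of_first_error from the SMART self-test log
PSTART=63         # start sector of the data partition (example value - check fdisk -lu /dev/sdh)
BSIZE=4096        # filesystem block size in bytes (example value)
echo $(( (LBA - PSTART) * 512 / BSIZE ))   # filesystem block number containing the bad sector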

From what I have gathered from the few posts mentioning it, bubbaQ is involved in investigating hard drives and their contents <high tech> in some capacity... he would most likely have a solution for unRAID.

 

That would be great, and almost a necessity - especially as drive sizes increase... sounds like a Lounge post topic!

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
