
[SOLVED] something wrong! failing to start up array and unresponsive..


Kilack


I came out this morning and the web interface to unRAID was unresponsive; I couldn't get to the shares via Samba either.

I restarted it, and when I clicked Start to bring the array up, it never managed to start and the web interface froze again...

 

I have attached a log; hopefully someone can help me and let me know what to do.

I don't want to lose data, obviously (that is why I'm using unRAID) :)

 

Did a parity check 2 days ago and all was good.

 

 

syslog.zip

Link to comment

I did some SMART checks on all the drives, and one of them is turning up this:

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%      687        414069864

# 2  Short offline      Completed: read failure      90%      687        414069864

 

SMART overall-health self-assessment test result: PASSED

 

Is that a serious issue? Would that be covered under warranty?

 

I don't really understand SMART very well; it is reporting a read failure in the self-tests but also saying the overall health assessment passed.

So would this cause unRAID to fail? Or...

Link to comment

Post the entire SMART report, and the syslog as well. That will provide the information needed to interpret the issue.

 

...what "passes" for SMART doesn't necessarily mean the drive is healthy. Your SMART self-test aborted due to a failure to read the drive, and it reports the location of the first read error. That is likely a bad sector, but it could also be bad cabling.
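
For reference, the full report and a fresh self-test can be run with the standard smartctl commands (substitute your own device name; the -d ata flag may or may not be needed depending on your controller):

smartctl -a -d ata /dev/sdh    # full report: attributes, error log, self-test log
smartctl -t short /dev/sdh     # queue a ~2 minute self-test, then re-run the first command to read the result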

Link to comment

The syslog is in the first post. Here is the full SMART report; the array is offline, so nothing should be reading the drives.

I can't start the array; it fails...

 

smartctl -a -d ata /dev/sdh

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    WDC WD20EARX-00PASB0

Serial Number:    WD-WCAZA8446260

Firmware Version: 51.0AB51

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Sat Sep 10 11:26:02 2011 NZST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      ( 121) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: (38400) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  5) minutes.

SCT capabilities:       (0x3035) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0

  3 Spin_Up_Time            0x0027  190  170  021    Pre-fail  Always      -      5491

  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      140

  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      689

10 Spin_Retry_Count        0x0032  100  100  000    Old_age  Always      -      0

11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      38

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      19

193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      1622

194 Temperature_Celsius    0x0022  123  118  000    Old_age  Always      -      27

196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      3

198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0

200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      2

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed: read failure      90%      689        414069864

# 2  Extended offline    Completed: read failure      90%      689        414069864

# 3  Extended offline    Completed: read failure      90%      687        414069864

# 4  Short offline      Completed: read failure      90%      687        414069864

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

:-[ My bad - I started reading your syslog, switched to viewing another thread, then went back to yours.

Looks like parity is bad now...

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38224

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38232

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38240

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38248

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38256

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38264

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38272

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38280

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38288

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38296

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38304

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38312

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38320

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38328

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38336

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38344

 

The SMART report isn't bad: 3 pending sectors...

 

 

but...

Sep 10 08:16:27 Tower kernel: sas: command 0xedb869c0, task 0xee2a63c0, timed out: BLK_EH_NOT_HANDLED

http://lime-technology.com/forum/index.php?topic=14934.msg141476#msg141476
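
If you want to pull those lines out of the syslog yourself, something like this works (log path as on a stock unRAID box):

egrep -i 'parity incorrect|BLK_EH_NOT_HANDLED' /var/log/syslog | head -n 40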

Link to comment

Looks like parity is bad now...

Sep 10 08:14:59 Tower kernel: md: parity incorrect: 38224

[...]

The SMART report isn't bad: 3 pending sectors...

but...

Sep 10 08:16:27 Tower kernel: sas: command 0xedb869c0, task 0xee2a63c0, timed out: BLK_EH_NOT_HANDLED

http://lime-technology.com/forum/index.php?topic=14934.msg141476#msg141476

 

So it seems those BLK_EH_NOT_HANDLED errors are unresolved, and it doesn't seem they know what causes them yet...

 

As for these parity errors... what am I to do?

Doing a correcting parity check would rewrite the parity disk... but what if the parity disk is already correct and it's another disk that has the errors? I assume that is my case, since this disk is suddenly showing SMART errors. So wouldn't I want to rebuild the data on this disk instead?

 

 

Link to comment

I worried about the same thing, because I didn't understand how it really worked.

More precise information:

http://lime-technology.com/wiki/index.php?title=FAQ#How_does_parity_work.3F

 

 

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

1. First, do you have any errors reported in the GUI for /dev/sdh?

2. Review the SMART report for the parity drive. Check for pending sectors or reallocated sectors.

3. Run a parity check - read only (non-correcting).

4. Watch the progress and speed of the check, and note whether you get any read errors from /dev/sdh... that will indicate you have more pending sectors. You will start to see messages in your syslog similar to these: ;) <those are mine from tonight>

Sep  8 19:24:33 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)

Sep  8 19:24:33 Tower kernel: ata8.00: irq_stat 0x48000000 (Drive related)

Sep  8 19:24:33 Tower kernel: ata8.00: failed command: READ DMA EXT (Minor Issues)

Sep  8 19:24:33 Tower kernel: ata8.00: cmd 25/00:08:90:d5:65/00:00:64:00:00/e0 tag 0 dma 4096 in (Drive related)

 

5. If your disk reads OK <unlikely, based on the short test result>, the number of errors will remain the same.

6. If so, then run a correcting parity check to get the zero sync errors message. :)
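
For steps 2-4, an easy way to keep an eye on things from a telnet session while the check runs (device name and log path as on my box - adjust for yours):

smartctl -A /dev/sdh | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
tail -f /var/log/syslog | egrep -i 'ata[0-9]|parity incorrect'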

 

 

 

Link to comment

http://lime-technology.com/wiki/index.php?title=FAQ#Why_am_I_getting_repeated_parity_errors.3F

1. First, do you have any errors reported in the gui on /sdh?

2. Run another parity check.  

 

 

I am currently running a parity check with "fix any errors" unchecked...

It is up to 448 errors on /dev/sdh so far... I have noticed the SMART report now shows 6 sectors pending reallocation on that drive too...

I precleared all these drives a few weeks ago when I bought them and they had 0 issues then.

 

This drive is only 30 days old..

 

 

Link to comment

I have 535 hours on a new HD204UI. I ran a verify-only parity check because I thought the same way as you did about the parity calculations... plus I thought the drive would puke anyway. Every time I try to run a parity check it starts at ~60 MB/s... and eventually drops to under 100 KB/s. Just waiting for the UPS guy to show up with the new drive.

 

I may start running two preclears on each drive... and maybe more on the drive I take out, to make sure there is no chance of the RMA being rejected.

Link to comment

@mbryanr

 

OK, thanks for your help and info. I didn't think I'd be at this stage with unRAID so soon... bloody hard drives, so unreliable...

So even running the parity check without the "fix parity" option will restore the correct data to the disk? That isn't made very clear in the GUI.

 

Also, given that the drive is just 30 days old, should I RMA it? Or is that just the way it goes with drives?

(Then the next task is replacing the drive and working out how to rebuild the data onto the new drive!) Oh, the drama...

Link to comment

RMA it... maybe run a preclear on it to kill it.

 

I have got my RMA paperwork ready for when I swap it out on Monday. ;D

 

So even running the parity check without the "fix parity" option will restore the correct data to the disk? That isn't made very clear in the GUI.

It doesn't restore the correct data to the disk if that is unchecked. But in 4.7 at least, the GUI says it was corrected...

 

I'm a novice too, just happened to have a drive start to fail 2-3 days ago.

Link to comment

RMA it... maybe run a preclear on it to kill it.

 

It doesn't restore the correct data to the disk if that is unchecked. But in 4.7 at least, the GUI says it was corrected...

 

I'll swap it at the store tomorrow then.

 

I am a little confused about unRAID now... I mean, I understand how parity works, etc. What I don't understand is this:

If we basically know drive xx is the one with the issues, then surely there should be a way to run a parity check and, when it comes across parity errors, tell it to rewrite the correct data to drive xx based on the parity drive and the other data drives, which are good. It seems the only options are: rebuild the parity drive, which would be pointless because we know drive xx now holds faulty data; or just do a check without making any changes, which won't fix anything but gives you information?
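
To illustrate what I mean with a toy XOR sketch (made-up byte values, not real data): the block on drive xx should be recoverable from the parity block plus the corresponding blocks on the other data drives.

printf 'parity  = 0x%02X\n' $(( 0xA5 ^ 0x3C ^ 0xF0 ))   # parity of three data bytes -> 0x69
printf 'rebuilt = 0x%02X\n' $(( 0xA5 ^ 0xF0 ^ 0x69 ))   # lose the drive holding 0x3C; XOR of the rest gets it back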

 

Am I missing something?

 

It is kind of tempting to go to ZFS, where everything is self-healing, etc... but I love unRAID for its flexibility... and ease of use...

 

Link to comment
or just do a check without making any changes, which won't fix anything but gives you information?

            ^^^ I believe this is the safer option.

 

I believe the assumption is/was that if a data drive has bad sectors, it will throw an error and not send any information (bits) to the parity drive, nor flip the bits on the parity drive. That is why, in most cases, it is OK to run a correcting parity check with a failing data disk. The scenario described in the other thread, though, calls that assumption into question.

 

Same reasons I prefer unRAID over the other options: just load it up and go. Guess I had to have something happen, and it made me learn how to diagnose the problem.

 

Link to comment
It seems the only options are: rebuild the parity drive, which would be pointless because we know drive xx now holds faulty data; or just do a check without making any changes, which won't fix anything but gives you information?

 

Other options are to replace or rebuild the failing disk.

Link to comment

Other options are to replace or rebuild the failing disk.

 

How do you rebuild a disk that has some errors?

Link to comment

Almost the same as if you were replacing the drive with a new one:

1. Stop the array.

2. Go to the Devices page and un-assign the disk.

3. Go to the main page and start the array.

4. Stop the array again.

5. Go to the Devices page and re-assign the disk.

6. Go to the Main page - the system should indicate there is a "new" drive to replace the disabled one. Check the confirmation box and click Start to begin a parity reconstruct of the disk.

 

Following this procedure, you are assuming that no further errors will occur on the drive <pending sectors/reallocated sectors>, and that no write to the drive fails (red-ball example). If there are errors, you are back to replacing the drive with a new one.

 

In my case, since I didn't have another drive on hand... I didn't want to take the chance of degrading the array, and obviously I couldn't replace the drive immediately.

 

Edit:

I should note that a drive can reallocate a significant number of sectors and still function - but I prefer not to find the limit. I am probably incorrect, but a drive that fails a write operation is likely very different from one that fails a read. A drive that has a write error but can still be read is in a different failure mode from one that cannot be read completely. The drive that is failing in my array could not complete a SMART test or a parity check... therefore I decided to replace it.
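
If you do go the rebuild route, a quick sanity check is whether the drive can even complete a long self-test first - standard smartmontools commands, substitute your own device:

smartctl -t long /dev/sdX       # start an extended offline self-test
smartctl -l selftest /dev/sdX   # after the recommended polling time, review the self-test log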

 

 

 

Link to comment

I don't suppose there is an easy way to see which files sit on the bad parity blocks?

 

As one of the main aims of unRAID is to store media... it would be quite easy to re-download corrupted files if we could easily see which ones were affected.

 

When unRAID does a parity check and finds errors, it should automatically create a log of which files are affected - that would be an awesome feature.

Then at least people could double-check those files when it's just a case of a few errors on a disk.

 

 

Link to comment

I don't suppose there is an easy way to see which files sit on the bad parity blocks?

Not that I know of currently with unRAID.
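
The closest manual approach I know of is along the lines of the smartmontools bad-block procedure: take the LBA_of_first_error from the drive's self-test log, convert it to a filesystem block, then ask the filesystem which file owns that block. A rough sketch with assumed values (512-byte sectors; partition start and block size below are examples - check yours with fdisk -lu and your filesystem's tools). The last step is filesystem-specific: debugfs icheck/ncheck works on ext2/3, and I'm not sure of an equally simple tool for unRAID's ReiserFS.

LBA=414069864     # LBA_of_first_error from the SMART self-test log
PSTART=63         # start sector of the data partition (example value - check fdisk -lu /dev/sdh)
BSIZE=4096        # filesystem block size in bytes (example value)
echo $(( (LBA - PSTART) * 512 / BSIZE ))   # filesystem block number containing the bad sector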

From what I have gathered from the few posts mentioning it, bubbaQ is involved in investigating hard drives and their contents <high tech> in some capacity... he would most likely have a solution for unRAID.

 

That would be great, and almost a necessity - especially as drive sizes increase... sounds like a Lounge post topic!

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
