Drive is now marked Disabled - how to proceed [SOLVED]?



System running on 5.0beta12 for months. Last weekend I replaced a 1TB drive with a 2TB drive and ran a parity check 3 days ago without error. Tonight I noticed the server's drive light was flashing even though the UI showed no drives spinning. One of my older 1.5TB drives (not involved in the upgrade) was marked as disabled.

 

I stopped the array, captured the attached log, and powered down. All cables seem to be connected properly.

 

Should I preclear a brand new 2TB drive and upgrade to it?

 

Thanks for any suggestions on how to proceed.

 

Syslog is attached.

 

Smart status report after reboot (drive hdb still showing as disabled):

 

Statistics for /dev/hdb ST31500341AS_9VS0G44C
smartctl -a -d ata /dev/hdb
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS0G44C
Firmware Version: CC1J
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Jan 30 22:11:54 2013 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)  The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1)  minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2)  minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       209637772
  3 Spin_Up_Time            0x0003   096   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1171
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10316718
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33405
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       166
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       262152
189 High_Fly_Writes         0x003a   023   023   000    Old_age   Always       -       77
190 Airflow_Temperature_Cel 0x0022   070   063   045    Old_age   Always       -       30 (Min/Max 24/30)
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   054   032   000    Old_age   Always       -       209637772
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       85993835204398
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       724182207
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       399207213

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     33405         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

syslog-2013-01-30-hdb_issues.zip

Link to comment

You have several issues, but the drive itself is probably fine; the SMART report certainly looks good. At (Jan 30) 18:10:32, something major happened to this server, probably a serious electrical spike. Perhaps there was lightning in the area at that time? The server stayed up, but 7 of the 9 drives were hit and reported trouble. 6 of the 7 drives were configured correctly as SATA drives, and the SATA exception handler was able to hard reset them and recover them successfully (although one of them needed a second hard reset). The seventh drive was hdb (Disk 1). Both it and your parity drive are configured in an IDE emulation mode, and the IDE handler did not know how to recover hdb, which is actually a SATA drive.

 

You have IDE emulation turned on for 2 of your SATA drives, in particular the Parity drive and Disk 1.  When you next boot, go into the BIOS settings and look for the SATA mode, and change it to a native SATA mode, preferably AHCI, anything but IDE emulation mode.

 

You are still running UnRAID v5.0-beta12, which is rather old and includes non-patched Realtek support, and has other issues.  I strongly recommend upgrading to v5.0-rc11, the latest release.

 

When you restart, Disk 1 will still be missing, but you should be able to stop the array if started, unassign Disk 1, start the array, stop the array again, re-assign Disk 1, and start the array again, rebuilding Disk 1.

 

You also have numerous 'bad method' messages being reported, and filling up much of the syslog, at an astonishing rate, faster than 5 per second!  Search the forums for bad method, and you will find info about it.  I'm dozing off here, so hope I haven't made any mistakes.

Link to comment

Thanks for the detailed suggestions!

 

I'm still on beta12 because it has been working for my system for a long time and I wanted to keep everything the same for my drive upgrade last weekend. Was planning to move to rc11 this weekend, but am glad I didn't change versions until I get this figured out.

 

The 'bad method' messages are from unMENU, which I'll disable until I get the major issues corrected. I don't think they've always been there; maybe they show up in combination with beta12 or something.

 

I did notice this value in the SMART report, which a wiki page said could indicate a SATA cabling issue:

  199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      3

I'll recheck my cabling again today.

Link to comment

UDMA CRC errors are typically caused by unshielded cables picking up noise from adjacent cables. If you have tie-wrapped your SATA cables to make them look neat, you've made them MORE susceptible to induced noise from the adjacent cables. If you are getting CRC errors, your best bet is to keep the cables at a distance from each other (or use shielded cables, although almost none of the internal SATA cables sold are shielded).
Link to comment

the bad method messages are from unmenu, which I'll disable until I get the major issues corrected. I don't think they've always been there, maybe that is in combination with beta12 or something.

Yes, it is unMENU that is reporting them, but the cause is another device on your LAN that is constantly probing the other devices on the network. Disabling unMENU will not stop the probing or the extra network traffic, which only slows down the traffic you actually want.

 

Joe L.

Link to comment

I noticed my signature had old hardware info. I've updated it to be current.

 

Rob - I have an AMI BIOS, the latest version available for my BioStar board, and I think the SATA settings you're referring to are on the Southbridge page. The settings that generated the IDE mode you saw were:

OnChip SATA Type = IDE -> AHCI

SATA IDE Combined = enabled

SATA-III Mode = Auto

 

Changing from 'IDE -> AHCI' to 'AHCI' does not allow the system to boot, and I end up at a hardware summary page. Is this the proper page and settings?

 

I checked the wiring and made minor changes, restarted and reran the SMART test - the UDMA value remains the same:

 

Statistics for /dev/hdb ST31500341AS_9VS0G44C
smartctl -a -d ata /dev/hdb
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS0G44C
Firmware Version: CC1J
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Jan 31 11:16:21 2013 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)  The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  1)  minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (  2)  minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       209659794
  3 Spin_Up_Time            0x0003   097   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1173
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10322993
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       33406
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       168
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       262152
189 High_Fly_Writes         0x003a   023   023   000    Old_age   Always       -       77
190 Airflow_Temperature_Cel 0x0022   069   063   045    Old_age   Always       -       31 (Min/Max 30/31)
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 13 0 0)
195 Hardware_ECC_Recovered  0x001a   053   032   000    Old_age   Always       -       209659794
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       49581102468911
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       724182207
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       399208342

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     33406         -
# 2  Short offline       Completed without error       00%     33405         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

Link to comment

I'm still on beta12 because it has been working for my system for a long time and I wanted to keep everything the same for my drive upgrade last weekend. Was planning to move to rc11 this weekend, but am glad I didn't change versions until I get this figured out.

The sooner the better, I think. I believe that RC11 would be more reliable than beta12, since many things have been fixed.

 

the bad method messages are from unmenu, which I'll disable until I get the major issues corrected. I don't think they've always been there, maybe that is in combination with beta12 or something.

As Joe said, it is creating a fair amount of network traffic, so when you have a chance, I would track that down.

 

I did notice this value in the SMART report, which a wiki page said could indicate a SATA cabling issue:

  199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      3

I'll recheck my cabling again today.

I did not mention it because that drive has 33406 usage hours on it, so 3 is nothing to worry about, unless they are recent, and there is no indication of that.  I imagine you have been using this drive for 3 to 4 years?  The SMART report does not give us any indication as to when those 3 CRC errors occurred, so they very well could have occurred years ago.  Never hurts to recheck the cabling though.
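To put those two numbers side by side, here is a minimal Python sketch. It assumes a 365-day year and treats power-on hours as continuous use; the helper name `raw_value` is made up for illustration, and the two rows carry the values from the SMART report posted in this thread.

```python
# Attribute rows with the values from the SMART report in this thread.
LINES = [
    "  9 Power_On_Hours          0x0032  062  062  000  Old_age  Always  -  33406",
    "199 UDMA_CRC_Error_Count    0x003e  200  200  000  Old_age  Always  -  3",
]

def raw_value(line):
    """Return (attribute name, raw value) from one smartctl attribute row."""
    fields = line.split()
    # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
    return fields[1], int(fields[9])

attrs = dict(raw_value(line) for line in LINES)

HOURS_PER_YEAR = 24 * 365  # assumption: leap days ignored
years_powered_on = attrs["Power_On_Hours"] / HOURS_PER_YEAR
crc_per_year = attrs["UDMA_CRC_Error_Count"] / years_powered_on

print(f"Powered on for about {years_powered_on:.1f} years")
print(f"Average of {crc_per_year:.2f} CRC errors per powered-on year")
```

The result lands inside the "3 to 4 years" estimate above. As noted, the raw count says nothing about when the errors happened, so the per-year average is only a sense of scale, not a trend.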

 

I noticed my signature had old hardware info. I've updated it to be current.

 

Rob - I have an AMI BIOS, the latest version available for my BioStar board, and I think the SATA settings you're referring to are on the Southbridge page. The settings that generated the IDE mode you saw were:

OnChip SATA Type = IDE -> AHCI

SATA IDE Combined = enabled

SATA-III Mode = Auto

 

Changing from 'IDE -> AHCI' to 'AHCI' does not allow the system to boot, and I end up at a hardware summary page. Is this the proper page and settings?

 

I checked the wiring and made minor changes, restarted and reran the SMART test - the UDMA value remains the same:

 

I did notice in the syslog that you had a BioStar, so ignored the sig, but that didn't really matter, since I don't know what people see in their BIOS settings anyway.  I just try to be general enough, when I advise.  What you saw and did was correct.  Once in a while, when a drive change is made, the BIOS will rearrange the starting boot drive order.  Check to see if your UnRAID flash drive is still the first boot drive.

 

That "SATA IDE Combined" option is new to me. I think it might be better to disable it; you don't want anything IDE related except for true IDE drives.

Link to comment

I'll get to rc11 as soon as I get my disk 1 back. Tom's notes say to only upgrade a stable system.

 

Thanks for noting the 'SATA IDE Combined' setting. This got me to work on disabling everything IDE...

First I disabled the PCI IDE Busmaster setting in the main setting page.

Then I disabled 'SATA IDE Combined' on the SB page, and the hda/hdb drives were not presented in the BIOS list.

Then I changed the SB 'OnChip SATA Type' setting to AHCI, and the BIOS list showed all drives as boot options (oddly as IDE-blah-blah).

 

Now, running unRAID shows all drives as sd devices! First time I've seen that with this motherboard!
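For anyone checking the same thing on their own box, the device name prefix is the quick tell: on Linux kernels of this era, the legacy IDE driver names disks hdX, while disks handled through the SCSI/libata layer (native SATA among them) appear as sdX. A small Python sketch of that rule of thumb follows; the function name `driver_family` is hypothetical and the device list is only an example.

```python
import re

def driver_family(device):
    """Classify a Linux block device name by its naming prefix."""
    name = device.rsplit("/", 1)[-1]  # accept "/dev/hdb" or plain "hdb"
    if re.fullmatch(r"hd[a-z]+\d*", name):
        return "legacy IDE driver"
    if re.fullmatch(r"sd[a-z]+\d*", name):
        return "SCSI/libata layer"
    return "other"

# Example names only; substitute the devices shown on your own system.
for dev in ["/dev/hda", "/dev/hdb", "/dev/sda", "/dev/sdc1"]:
    print(dev, "->", driver_family(dev))
```

Any array disk that still reports as hdX after the BIOS change is worth a second look at the SATA mode settings.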

 

Question: if I use the 'unassign disk1 ... reassign disk1' steps, do I need to run pre-clear? Or, more precisely, will the drive be cleared before the data is written, or is the data just written?

 

I really appreciate the detailed thought and info you've given!

Link to comment

I admit I like a little appreciation now and then!  Thank you.

 

The rebuild will completely overwrite the drive with the correct data, so no clearing is necessary. If the SMART report had shown anything suspicious, then it might have been a good idea to Preclear it, for the thorough testing it includes, but the report looks fine.
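As a rough illustration of what "suspicious" could mean here, the Python sketch below flags non-zero raw values on a handful of attributes. The choice of attributes in `WATCH` is an assumption (a common rule of thumb, not something stated in this thread), the function name is hypothetical, and the rows carry the values from the second SMART report above.

```python
# Rows with the values from the second SMART report in this thread.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033  100  100  036  Pre-fail  Always   -  0
187 Reported_Uncorrect      0x0032  100  100  000  Old_age   Always   -  0
197 Current_Pending_Sector  0x0012  100  100  000  Old_age   Always   -  0
198 Offline_Uncorrectable   0x0010  100  100  000  Old_age   Offline  -  0
199 UDMA_CRC_Error_Count    0x003e  200  200  000  Old_age   Always   -  3
"""

# Assumption: attributes often treated as warning signs when non-zero.
WATCH = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

def suspicious(report):
    """Return (name, raw value) for watched attributes with a non-zero raw value."""
    flagged = []
    for row in report.splitlines():
        fields = row.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        name, raw = fields[1], fields[9]
        if name in WATCH and raw.isdigit() and int(raw) > 0:
            flagged.append((name, int(raw)))
    return flagged

print(suspicious(REPORT) or "nothing flagged")
```

UDMA_CRC_Error_Count is left out of the watch list on purpose, since it points at the cable or link rather than the disk surface.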

Link to comment

Wow! That change made a huge difference - this rebuild is running about 10 times faster (90MB/sec) than the one last weekend.

 

[Edit] It ended up at about 60MB/sec, but that is still much faster than when parity and disk 1 were running as IDE drives.
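To see what those rates mean in wall-clock time, here is a back-of-the-envelope Python sketch. It assumes "MB/sec" means decimal megabytes and that the rate stays constant for the whole rebuild, which real rebuilds do not. The 9 MB/sec figure is only inferred from "about 10 times faster"; the capacity is the User Capacity from the SMART report for disk 1.

```python
CAPACITY_BYTES = 1_500_301_910_016  # "User Capacity" from the SMART report
MB = 1_000_000                      # assumption: decimal megabytes

def rebuild_hours(rate_mb_per_sec, capacity_bytes=CAPACITY_BYTES):
    """Hours needed to write the whole drive at a constant rate."""
    seconds = capacity_bytes / (rate_mb_per_sec * MB)
    return seconds / 3600

# 9 MB/sec is inferred (one tenth of 90); 60 and 90 are the reported rates.
for rate in (9, 60, 90):
    print(f"{rate:>3} MB/sec -> {rebuild_hours(rate):.1f} hours")
```

Under those assumptions, the difference is roughly a two-day rebuild versus one that finishes in an evening.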

 

For the 'bad method' messages, it seems the plan of attack is to monitor the unMENU log and remove devices from the LAN until they stop, correct? Interestingly, I'm not seeing the entries today. I will have to start turning devices back on.
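One way to tell whether the messages have really stopped is to count them per second while devices are switched on one at a time. The Python sketch below does that under two loud assumptions: each log line starts with the traditional 15-character syslog timestamp ("Mmm dd hh:mm:ss"), and the message contains the text "bad method". The sample lines are hypothetical; the real unMENU message text may differ.

```python
from collections import Counter

TIMESTAMP_WIDTH = 15  # assumption: lines start with "Mmm dd hh:mm:ss"

def bad_method_per_second(lines, needle="bad method"):
    """Count log lines containing `needle`, grouped by one-second timestamp."""
    counts = Counter()
    for line in lines:
        if needle in line.lower():
            counts[line[:TIMESTAMP_WIDTH]] += 1
    return counts

# Hypothetical sample lines, made up for illustration.
sample = [
    "Jan 31 09:00:01 Tower unmenu[1234]: bad method",
    "Jan 31 09:00:01 Tower unmenu[1234]: bad method",
    "Jan 31 09:00:01 Tower unmenu[1234]: bad method",
    "Jan 31 09:00:02 Tower kernel: some unrelated message",
    "Jan 31 09:00:02 Tower unmenu[1234]: bad method",
]

counts = bad_method_per_second(sample)
for second, n in sorted(counts.items()):
    print(second, n)
```

The function accepts any iterable of lines, so an open log file can be passed in place of the sample list. A burst that appears right after a particular device is powered on points at that device.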

Link to comment
