
unRAID v4.7 intermittent trouble & will not shut down/reboot smoothly



I have been having trouble with this server lately. I thought the trouble was due to a failing MB, which was replaced (new DDR3 & AMD chip as well). However, for the past several days it just keeps hanging. I look in the syslog & I see some kind of errors, but I don't really know how to proceed from here. Mostly I use this server for media storage for 7MC playback. As long as I am watching a Blu-ray or whatever, things are fine. The trouble starts when I stop & come back several hours later or the next day. Individual disks spin down, which is good, but they don't spin back up (8 out of 10 times) & I have to manually hit "Spin up all disks". Then I can normally proceed. In the past 2 or 3 days even that has started to fail & I have to shut down or reboot (from the console). Now even that just hangs & I have to hit the reset button.

 

I just laid out about 300 bucks for the aforementioned hardware & a new 2TB WD Green drive. I really don't want to spend any more money if I don't have to. Attached is the syslog after the last reset-button reboot.

 

valhalla-syslog-12-may-2012.txt

Link to comment

May 12 15:51:38 valhalla emhttp: get_config_idx: fopen /boot/config/shares/lost+found.cfg: No such file or directory - assigning defaults (Other emhttp)
May 12 15:51:38 valhalla emhttp: shcmd (35): killall -HUP smbd (Minor Issues)
May 12 15:51:38 valhalla emhttp: shcmd (36): /etc/rc.d/rc.nfsd restart | logger (Other emhttp)
May 12 15:51:40 valhalla kernel: ata5.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen (Errors)
May 12 15:51:40 valhalla kernel: ata5.00: irq_stat 0x01400000, PHY RDY changed (Drive related)
May 12 15:51:40 valhalla kernel: ata5: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B } (Errors)
May 12 15:51:40 valhalla kernel: ata5.00: failed command: READ DMA EXT (Minor Issues)
May 12 15:51:40 valhalla kernel: ata5.00: cmd 25/00:00:a7:1e:0d/00:04:00:00:00/e0 tag 0 dma 524288 in (Drive related)
May 12 15:51:40 valhalla kernel:          res 50/00:00:a6:1e:0d/00:00:00:00:00/e0 Emask 0x50 (ATA bus error) (Errors)
May 12 15:51:40 valhalla kernel: ata5.00: status: { DRDY } (Drive related)
May 12 15:51:40 valhalla kernel: ata5: hard resetting link (Minor Issues)

 

First, run chkdsk on your flash drive.  Check cabling on "ata5".
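The `SError: { ... }` line in the excerpt above is the kernel's decoding of the hex `SErr` value on the `exception` line before it. A small Python sketch that reproduces that decoding is below; the bit positions are taken from the `SERR_*` constants in the Linux libata header, and the function name `decode_serror` is hypothetical. `PHYRdyChg` (the PHY-ready state of the link changed) and `10B8B` (a 10b/8b decode error on the wire) typically point at the physical SATA link (cable, connector, port or power) rather than the disk surface, which fits the advice to check the cabling.

```python
from typing import List

# SError bit positions and the names libata prints for them
# (taken from the SERR_* constants in the Linux libata header).
SERROR_BITS = [
    (0, "RecovData"), (1, "RecovComm"), (8, "UnrecovData"), (9, "Persist"),
    (10, "Proto"), (11, "HostInt"), (16, "PHYRdyChg"), (17, "PHYInt"),
    (18, "CommWake"), (19, "10B8B"), (20, "Dispar"), (21, "BadCRC"),
    (22, "Handshk"), (23, "LinkSeq"), (24, "TrStaTrns"), (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serror(value: int) -> List[str]:
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in SERROR_BITS if value & (1 << bit)]


if __name__ == "__main__":
    # SErr values copied from the syslog lines in this thread.
    for serr in (0x90a02, 0x90202, 0x90a00):
        print(hex(serr), "{", " ".join(decode_serror(serr)), "}")
    # 0x90a02 { RecovComm Persist HostInt PHYRdyChg 10B8B }
    # 0x90202 { RecovComm Persist PHYRdyChg 10B8B }
    # 0x90a00 { Persist HostInt PHYRdyChg 10B8B }
```

The printed sets match the `SError:` lines the kernel logged for the same `SErr` values, which is a quick way to confirm the bit table.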

Link to comment

I reseated/checked the cables of all drives & ran chkdsk on the USB flash drive in my main Vista system. It reported no errors, but then my unRAID server stopped seeing the USB drive. I had to reset the BIOS just to get it to POST. The server is back up with the following boot errors (and still that trouble with ata5). I guess I just need to buy some more SATA cables.

 

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: No handler for Region [sACS] (f7494438) [PCI_Config] (20090903/evregion-319) (Errors)

May 12 22:07:00 valhalla kernel: ACPI Error: Region PCI_Config(2) has no handler (20090903/exfldio-295) (Errors)

May 12 22:07:00 valhalla kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen (Errors)
May 12 22:07:00 valhalla kernel: ata4: SError: { RecovComm Persist PHYRdyChg 10B8B } (Errors)
May 12 22:07:02 valhalla kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen (Errors)
May 12 22:07:02 valhalla kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B } (Errors)
May 12 22:07:02 valhalla kernel:          res 50/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error) (Errors)
May 12 22:07:10 valhalla kernel: generic-usb: probe of 0003:0557:2213.0003 failed with error -110 (Errors)
May 12 22:10:34 valhalla kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen (Errors)
May 12 22:10:34 valhalla kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B } (Errors)
May 12 22:10:34 valhalla kernel:          res 50/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error) (Errors)
May 12 22:10:59 valhalla kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen (Errors)
May 12 22:10:59 valhalla kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B } (Errors)
May 12 22:10:59 valhalla kernel:          res 50/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error) (Errors)
May 12 22:11:06 valhalla kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a00 action 0xe frozen (Errors)
May 12 22:11:06 valhalla kernel: ata4: SError: { Persist HostInt PHYRdyChg 10B8B } (Errors)
May 12 22:11:06 valhalla kernel:          res 50/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error) (Errors)
May 12 22:11:14 valhalla kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a00 action 0xe frozen (Errors)
May 12 22:11:14 valhalla kernel: ata4: SError: { Persist HostInt PHYRdyChg 10B8B } (Errors)
May 12 22:11:14 valhalla kernel:          res 50/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x50 (ATA bus error) (Errors)
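The first excerpt shows the link errors on `ata5`, while this boot shows them on `ata4`. A minimal Python sketch for tallying libata `exception Emask` and `hard resetting link` lines per port across a whole syslog is below; it only looks for those two message kinds, and the script and function names are hypothetical. Mapping an `ataN` port to a physical SATA connector is left to the reader, since that depends on the motherboard and controller.

```python
import re
import sys
from collections import Counter

# libata lines such as "ata4.00: exception Emask 0x50 ..." or "ata5: hard resetting link".
# Only these two message kinds are tallied; all other syslog lines are ignored.
EVENT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (exception Emask|hard resetting link)")


def count_link_events(lines):
    """Return a Counter keyed by (port, event) for libata exception/reset lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical script name): python ata_events.py valhalla-syslog-12-may-2012.txt
    with open(sys.argv[1], errors="replace") as log:
        for (port, event), n in sorted(count_link_events(log).items()):
            print(f"{port:6} {event:22} {n}")
```

If the errors follow a cable or port when drives are swapped around, that points at the link rather than the drive.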

 

Link to comment

Here is the SMART status report. I also believe that my USB flash drive may be corrupted. I have read here that I can just copy the security key off that drive, reformat it, & then put the key back on, but which file(s) is that exactly?

 

 

smartctl -a -d ata /dev/sdd
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P6B0
Serial Number:    WD-WCAV5S617251
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu May 17 07:00:33 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 118) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (21600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  2) minutes.
Extended self-test routine
recommended polling time:        ( 248) minutes.
Conveyance self-test routine
recommended polling time:        (  5) minutes.
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   189   021    Pre-fail  Always       -       1641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       430
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7872
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       262
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       223
193 Load_Cycle_Count        0x0032   170   170   000    Old_age   Always       -       92695
194 Temperature_Celsius     0x0022   117   104   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       60%      7860         53524

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

# 1  Short offline      Completed: read failure      60%      7860        53524

 

Likely failure of the drive.

 

The file is Pro.key for me...
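A minimal Python sketch of the backup step before reformatting, assuming the key file(s) sit in the flash drive's `config` folder and end in `.key` (as with the `Pro.key` mentioned above). The folder location is an assumption, and the paths and the function name `backup_key_files` are hypothetical; point it at wherever the flash drive is mounted on the machine doing the reformat.

```python
import shutil
import sys
from pathlib import Path


def backup_key_files(config_dir, backup_dir):
    """Copy every *.key file from config_dir into backup_dir; return the copied names.

    Assumes the unRAID key file(s) live directly in config_dir with a .key extension.
    """
    src = Path(config_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for key_file in sorted(src.glob("*.key")):
        shutil.copy2(key_file, dst / key_file.name)  # copy2 keeps timestamps
        copied.append(key_file.name)
    return copied


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example (paths are illustrative only): python backup_keys.py E:/config C:/unraid-key-backup
    for name in backup_key_files(sys.argv[1], sys.argv[2]):
        print("copied", name)
```

After the flash drive is reformatted, the same function can copy the files back by swapping the two arguments.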

It indicates a "short" SMART test was performed and it encountered an unreadable sector.

The statistics show no sectors pending re-allocation, so that sector must have been re-tried and the re-try was successful (no re-allocation occurred).

 

It does not indicate a likely failure of the drive.

 

Joe L.
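The reasoning above can be sketched as a check over saved `smartctl -a` output: look for a "read failure" in the self-test log, then see whether the drive counts any reallocated, pending or offline-uncorrectable sectors (attributes 5, 197 and 198). This is a hypothetical Python helper (`parse_smartctl` and `assess` are illustrative names) that encodes that rule of thumb; it is not a definitive health verdict and assumes smartctl's usual column layout.

```python
import re
import sys

# Self-test log row, e.g. "# 1  Short offline  Completed: read failure  60%  7860  53524".
# Fields are separated by two or more spaces, as in smartctl's output.
SELFTEST_RE = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$")

# Attributes whose raw values count sectors the drive could not read reliably.
SECTOR_ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smartctl(text):
    """Return (raw attribute values by name, list of self-test log entries)."""
    attrs, tests = {}, []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            if fields[9].isdigit():
                attrs[fields[1]] = int(fields[9])
            continue
        match = SELFTEST_RE.match(line.strip())
        if match:
            tests.append({
                "num": int(match.group(1)),
                "test": match.group(2),
                "status": match.group(3),
                "hours": int(match.group(5)),
                "lba": match.group(6),
            })
    return attrs, tests


def assess(attrs, tests):
    """Rule of thumb mirroring the post above; not a definitive health verdict."""
    bad = sum(attrs.get(name, 0) for name in SECTOR_ATTRS)
    if bad:
        return f"{bad} reallocated/pending/uncorrectable sectors"
    if any("read failure" in t["status"] for t in tests):
        return "read failure logged, but no pending/reallocated sectors"
    return "no read failures or bad sectors recorded"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: smartctl -a -d ata /dev/sdd > sdd.txt; python smart_check.py sdd.txt
    with open(sys.argv[1]) as f:
        print(assess(*parse_smartctl(f.read())))
```

Run against the first report in this thread, it lands on the "read failure logged, but no pending/reallocated sectors" branch, which is the situation described in the post.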

Link to comment

Run another short SMART test.
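A sketch of starting the short test and waiting for it to finish, in Python. It assumes `smartctl` is on the PATH and is run as root; `-t short` starts the test and `-c` prints the General SMART Values section, whose `Self-test execution status` byte is decoded here. In that byte the high four bits give the state (15 = in progress, 7 = read element failed, and so on) and the low four bits give the tenths of the test remaining, so 249 (0xF9) means "in progress, 90% remaining" and 118 (0x76) means "read element failed with 60% remaining". The device name, the 30-second poll interval and the function names are illustrative.

```python
import re
import subprocess
import time

# High nibble of the ATA self-test execution status byte, as smartctl describes it.
STATE_NAMES = {
    0x0: "completed without error",
    0x1: "aborted by host",
    0x2: "interrupted by host reset",
    0x3: "fatal or unknown test error",
    0x4: "completed, unknown element failed",
    0x5: "completed, electrical element failed",
    0x6: "completed, servo/seek element failed",
    0x7: "completed, read element failed",
    0x8: "completed, handling damage suspected",
    0xF: "in progress",
}

# Matches e.g. "Self-test execution status:      ( 249) Self-test routine in progress..."
STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")


def decode_selftest_status(byte):
    """Split the status byte into (state, percent of test remaining)."""
    state = STATE_NAMES.get((byte >> 4) & 0xF, "reserved")
    remaining = (byte & 0xF) * 10
    return state, remaining


def run_short_test(device="/dev/sdd", poll_seconds=30):
    """Hypothetical helper: start a short self-test and poll until it stops running.

    Returns the final state string, or None if the status line is not found.
    """
    subprocess.run(["smartctl", "-d", "ata", "-t", "short", device])
    while True:
        time.sleep(poll_seconds)
        out = subprocess.run(["smartctl", "-d", "ata", "-c", device],
                             capture_output=True, text=True).stdout
        match = STATUS_RE.search(out)
        if match is None:
            return None
        state, remaining = decode_selftest_status(int(match.group(1)))
        print(f"{device}: {state}, {remaining}% remaining")
        if state != "in progress":
            return state


if __name__ == "__main__":
    print(decode_selftest_status(249))  # ('in progress', 90)
    print(decode_selftest_status(118))  # ('completed, read element failed', 60)
```

Once the test is no longer in progress, `smartctl -l selftest /dev/sdd` prints the self-test log with the newest entry listed as `# 1`.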

 

Ran it again. The overall result was PASSED, but I see where it did find a read error.

 

smartctl -a -d ata /dev/sdd
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P6B0
Serial Number:    WD-WCAV5S617251
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu May 17 22:37:33 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                 (21600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (  2) minutes.
Extended self-test routine
recommended polling time:        ( 248) minutes.
Conveyance self-test routine
recommended polling time:        (  5) minutes.
SCT capabilities:              (0x3037) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   239   189   021    Pre-fail  Always       -       4016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       433
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7887
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       262
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       223
193 Load_Cycle_Count        0x0032   170   170   000    Old_age   Always       -       92706
194 Temperature_Celsius     0x0022   117   104   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

 

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               90%      7887         -
# 2  Short offline       Completed without error       00%      7887         -
# 3  Short offline       Aborted by host               20%      7887         -
# 4  Short offline       Completed: read failure       60%      7860         53524

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

Archived

This topic is now archived and is closed to further replies.
