Strange behavior: all green, array won't start, disk comm errors, rebuilt - OK?

Some strange things were happening with the server yesterday. 

 

It all started when I tried to rename a directory on the server from Windows, and I couldn't because it said the disk was in use.  I then opened a telnet session and tried to rename it that way.  Again, I couldn't, because it said it was a read-only file system.  After doing some searching, I thought maybe the filesystem was corrupt, so I proceeded to follow the instructions in the wiki.  When I ran the check command, it immediately said it couldn't run due to bad blocks (?).  So I tried to check the filesystems on the other disks in that share, with the same result.  I went ahead and power cycled the machine, and ran into problems with a disk not responding.  After it booted up, I looked at the web interface and noticed that the configuration was valid and all disks were green, but the array wasn't started.  I tried to start the array, and it seemed to try, but when I refreshed the page, it still had not started (or was still attempting to start).

 

I checked the system logs, and was surprised to find two huge (>300MB) logs from before I rebooted, filled with lines like this:

 

Mar 12 04:40:01 Tower syslogd 1.4.1: restart.
Mar 12 04:40:10 Tower kernel: mvsas 0000:03:00.0: mvsas exec failed[-132]!
Mar 12 04:40:10 Tower kernel: sas: lldd_execute_task returned: -132
Mar 12 04:40:10 Tower kernel: ata1: no sense translation for status: 0x50
Mar 12 04:40:10 Tower kernel: ata1: translated ATA stat/err 0x50/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Mar 12 04:40:10 Tower kernel: ata1: status=0x50 { DriveReady SeekComplete }
Mar 12 04:40:10 Tower kernel: mvsas 0000:03:00.0: mvsas exec failed[-132]!
Mar 12 04:40:10 Tower kernel: sas: lldd_execute_task returned: -132
Mar 12 04:40:10 Tower kernel: ata1: no sense translation for status: 0x50
Mar 12 04:40:10 Tower kernel: ata1: translated ATA stat/err 0x50/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Mar 12 04:40:10 Tower kernel: ata1: status=0x50 { DriveReady SeekComplete }
Mar 12 04:40:10 Tower kernel: mvsas 0000:03:00.0: mvsas exec failed[-132]!
Mar 12 04:40:10 Tower kernel: sas: lldd_execute_task returned: -132
Mar 12 04:40:10 Tower kernel: ata1: no sense translation for status: 0x50
Mar 12 04:40:10 Tower kernel: ata1: translated ATA stat/err 0x50/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Mar 12 04:40:10 Tower kernel: ata1: status=0x50 { DriveReady SeekComplete }
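
With logs this large, it can help to collapse the repeats before reading them. The following is a minimal Python sketch, not anything Unraid ships: it assumes the classic `Mon DD HH:MM:SS host source: message` syslog layout seen above, and the name `summarize_syslog` is hypothetical. It drops the timestamp and host and counts how often each distinct message occurs.

```python
import re
from collections import Counter

# Assumed layout: "Mar 12 04:40:10 Tower kernel: message text"
# Group 1 is the source (e.g. "kernel"), group 2 is the message.
SYSLOG_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+([^:]+):\s?(.*)$")


def summarize_syslog(lines):
    """Count occurrences of each (source, message) pair, ignoring timestamps."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_RE.match(line.rstrip("\n"))
        if match:
            counts[match.groups()] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Mar 12 04:40:01 Tower syslogd 1.4.1: restart.",
    "Mar 12 04:40:10 Tower kernel: mvsas 0000:03:00.0: mvsas exec failed[-132]!",
    "Mar 12 04:40:10 Tower kernel: sas: lldd_execute_task returned: -132",
    "Mar 12 04:40:10 Tower kernel: mvsas 0000:03:00.0: mvsas exec failed[-132]!",
    "Mar 12 04:40:10 Tower kernel: sas: lldd_execute_task returned: -132",
]

summary = summarize_syslog(sample)
for (source, message), count in summary.most_common():
    print(f"{count:6d}  {source}: {message}")
```

For a real log, the same function can be fed an open file object (opened with `errors="replace"` in case of stray bytes), since it only iterates line by line and never loads the whole file.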

 

After I rebooted, I checked the new log to try and determine the exact disk that was failing to respond. 

 

Mar 12 22:54:35 Tower kernel: sd 8:0:0:0: [sdo] Attached SCSI disk
Mar 12 22:54:35 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Mar 12 22:54:35 Tower kernel: ata10.00: failed command: READ FPDMA QUEUED
Mar 12 22:54:35 Tower kernel: ata10.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
Mar 12 22:54:35 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 12 22:54:35 Tower kernel: ata10.00: status: { DRDY }
Mar 12 22:54:35 Tower kernel: ata10: hard resetting link
Mar 12 22:54:35 Tower kernel: ata10: link is slow to respond, please be patient (ready=0)
Mar 12 22:54:35 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 12 22:54:35 Tower kernel: ata10.00: configured for UDMA/133
Mar 12 22:54:35 Tower kernel: ata10.00: device reported invalid CHS sector 0
Mar 12 22:54:35 Tower kernel: ata10: EH complete
Mar 12 22:54:35 Tower kernel: ata10.00: NCQ disabled due to excessive errors
Mar 12 22:54:35 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Mar 12 22:54:35 Tower kernel: ata10.00: failed command: READ FPDMA QUEUED
Mar 12 22:54:35 Tower kernel: ata10.00: cmd 60/08:00:00:00:00/00:00:00:00:00/40 tag 0 ncq 4096 in
Mar 12 22:54:35 Tower kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 12 22:54:35 Tower kernel: ata10.00: status: { DRDY }
Mar 12 22:54:35 Tower kernel: ata10: hard resetting link
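
To see which port is misbehaving without scrolling through the whole log, the error events can be tallied per `ataN` port. This is a rough Python sketch under the assumption that the kernel lines look like the ones above; `count_ata_events` is a hypothetical helper, and it only counts events. It does not map `ata10` to a device name such as `sdk`, which still has to be worked out from the rest of the log.

```python
import re
from collections import Counter

# Matches e.g. "kernel: ata10.00: exception ..." or "kernel: ata10: hard resetting link".
ATA_EVENT_RE = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: (exception|failed command|hard resetting link)"
)


def count_ata_events(lines):
    """Count selected libata error events per (port, event) pair."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Lines copied from the excerpt above.
sample = [
    "Mar 12 22:54:35 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen",
    "Mar 12 22:54:35 Tower kernel: ata10.00: failed command: READ FPDMA QUEUED",
    "Mar 12 22:54:35 Tower kernel: ata10: hard resetting link",
    "Mar 12 22:54:35 Tower kernel: ata10: EH complete",
    "Mar 12 22:54:35 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen",
    "Mar 12 22:54:35 Tower kernel: ata10.00: failed command: READ FPDMA QUEUED",
    "Mar 12 22:54:35 Tower kernel: ata10: hard resetting link",
]

events = count_ata_events(sample)
for (port, event), count in sorted(events.items()):
    print(f"{port:6s} {event:22s} {count}")
```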

 

I think it was sdk, the disk with serial WCAVY5476488, since I also noticed this SMART report (including the fact that it showed the disk as only 12 hours old(!), which was not right):

 

Mar 12 22:56:56 Tower smartctl[10433]: === START OF INFORMATION SECTION ===
Mar 12 22:56:56 Tower smartctl[10433]: Model Family:     Western Digital Caviar Green family
Mar 12 22:56:56 Tower smartctl[10433]: Device Model:     WDC WD20EADS-00S2B0
Mar 12 22:56:56 Tower smartctl[10433]: Serial Number:    WD-WCAVY5476488
Mar 12 22:56:56 Tower smartctl[10433]: Firmware Version: 01.00A01
Mar 12 22:56:56 Tower smartctl[10433]: User Capacity:    2,000,398,934,016 bytes
Mar 12 22:56:56 Tower smartctl[10433]: Device is:        In smartctl database [for details use: -P show]
Mar 12 22:56:56 Tower smartctl[10433]: ATA Version is:   8
Mar 12 22:56:56 Tower smartctl[10433]: ATA Standard is:  Exact ATA specification draft version not indicated
Mar 12 22:56:56 Tower smartctl[10433]: Local Time is:    Tue Mar 12 22:56:56 2013 PDT
Mar 12 22:56:56 Tower smartctl[10433]: SMART support is: Available - device has SMART capability.
Mar 12 22:56:56 Tower smartctl[10433]: SMART support is: Enabled
Mar 12 22:56:56 Tower smartctl[10433]: Power mode is:    ACTIVE or IDLE
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]: === START OF READ SMART DATA SECTION ===
Mar 12 22:56:56 Tower smartctl[10433]: SMART overall-health self-assessment test result: PASSED
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]: General SMART Values:
Mar 12 22:56:56 Tower smartctl[10433]: Offline data collection status:  (0x80)        Offline data collection activity
Mar 12 22:56:56 Tower smartctl[10433]:                                         was never started.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Auto Offline Data Collection: Enabled.
Mar 12 22:56:56 Tower smartctl[10433]: Self-test execution status:      (   0)        The previous self-test routine completed
Mar 12 22:56:56 Tower smartctl[10433]:                                         without error or no self-test has ever 
Mar 12 22:56:56 Tower smartctl[10433]:                                         been run.
Mar 12 22:56:56 Tower smartctl[10433]: Total time to complete Offline 
Mar 12 22:56:56 Tower smartctl[10433]: data collection:                  (41580) seconds.
Mar 12 22:56:56 Tower smartctl[10433]: Offline data collection
Mar 12 22:56:56 Tower smartctl[10433]: capabilities:                          (0x7b) SMART execute Offline immediate.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Auto Offline data collection on/off support.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Suspend Offline collection upon new
Mar 12 22:56:56 Tower smartctl[10433]:                                         command.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Offline surface scan supported.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Self-test supported.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Conveyance Self-test supported.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Selective Self-test supported.
Mar 12 22:56:56 Tower smartctl[10433]: SMART capabilities:            (0x0003)        Saves SMART data before entering
Mar 12 22:56:56 Tower smartctl[10433]:                                         power-saving mode.
Mar 12 22:56:56 Tower smartctl[10433]:                                         Supports SMART auto save timer.
Mar 12 22:56:56 Tower smartctl[10433]: Error logging capability:        (0x01)        Error logging supported.
Mar 12 22:56:56 Tower smartctl[10433]:                                         General Purpose Logging supported.
Mar 12 22:56:56 Tower smartctl[10433]: Short self-test routine 
Mar 12 22:56:56 Tower smartctl[10433]: recommended polling time:          (   2) minutes.
Mar 12 22:56:56 Tower smartctl[10433]: Extended self-test routine
Mar 12 22:56:56 Tower smartctl[10433]: recommended polling time:          ( 255) minutes.
Mar 12 22:56:56 Tower smartctl[10433]: Conveyance self-test routine
Mar 12 22:56:56 Tower smartctl[10433]: recommended polling time:          (   5) minutes.
Mar 12 22:56:56 Tower smartctl[10433]: SCT capabilities:                (0x303f)        SCT Status supported.
Mar 12 22:56:56 Tower smartctl[10433]:                                         SCT Feature Control supported.
Mar 12 22:56:56 Tower smartctl[10433]:                                         SCT Data Table supported.
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]: SMART Attributes Data Structure revision number: 16
Mar 12 22:56:56 Tower smartctl[10433]: Vendor Specific SMART Attributes with Thresholds:
Mar 12 22:56:56 Tower smartctl[10433]: ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
Mar 12 22:56:56 Tower smartctl[10433]:   1 Raw_Read_Error_Rate     0x002f   200   194   051    Pre-fail  Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]:   3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]:   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       2
Mar 12 22:56:56 Tower smartctl[10433]:   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]:   7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]:   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12
Mar 12 22:56:56 Tower smartctl[10433]:  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]:  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]:  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
Mar 12 22:56:56 Tower smartctl[10433]: 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]: 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
Mar 12 22:56:56 Tower smartctl[10433]: 194 Temperature_Celsius     0x0022   119   118   000    Old_age   Always       -       33
Mar 12 22:56:56 Tower smartctl[10433]: 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]: 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]: 198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
Mar 12 22:56:56 Tower smartctl[10433]: 199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
Mar 12 22:56:56 Tower smartctl[10433]: 200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]: SMART Error Log Version: 1
Mar 12 22:56:56 Tower smartctl[10433]: ATA Error Count: 2153 (device log contains only the most recent five errors)
Mar 12 22:56:56 Tower smartctl[10433]:         CR = Command Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         FR = Features Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         SC = Sector Count Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         SN = Sector Number Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         CL = Cylinder Low Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         CH = Cylinder High Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         DH = Device/Head Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         DC = Device Command Register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         ER = Error register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]:         ST = Status register [HEX]
Mar 12 22:56:56 Tower smartctl[10433]: Powered_Up_Time is measured from power on, and printed as
Mar 12 22:56:56 Tower smartctl[10433]: DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
Mar 12 22:56:56 Tower smartctl[10433]: SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]: Error 2153 occurred at disk power-on lifetime: 12 hours (0 days + 12 hours)
Mar 12 22:56:56 Tower smartctl[10433]:   When the command that caused the error occurred, the device was active or idle.
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]:   After command completion occurred, registers were:
Mar 12 22:56:56 Tower smartctl[10433]:   ER ST SC SN CL CH DH
Mar 12 22:56:56 Tower smartctl[10433]:   -- -- -- -- -- -- --
Mar 12 22:56:56 Tower smartctl[10433]:   04 51 08 00 00 00 e0  Error: ABRT 8 sectors at LBA = 0x00000000 = 0
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]:   Commands leading to the command that caused the error were:
Mar 12 22:56:56 Tower smartctl[10433]:   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
Mar 12 22:56:56 Tower smartctl[10433]:   -- -- -- -- -- -- -- --  ----------------  --------------------
Mar 12 22:56:56 Tower smartctl[10433]:   c8 00 08 00 00 00 e0 00      00:04:47.765  READ DMA
Mar 12 22:56:56 Tower smartctl[10433]:   ef 10 02 00 00 00 a0 00      00:04:47.765  SET FEATURES [Reserved for Serial ATA]
Mar 12 22:56:56 Tower smartctl[10433]:   ec 00 00 00 00 00 a0 00      00:04:47.762  IDENTIFY DEVICE
Mar 12 22:56:56 Tower smartctl[10433]:   ef 03 42 00 00 00 a0 00      00:04:47.762  SET FEATURES [set transfer mode]
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]: Error 2152 occurred at disk power-on lifetime: 12 hours (0 days + 12 hours)
Mar 12 22:56:56 Tower smartctl[10433]:   When the command that caused the error occurred, the device was active or idle.
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]:   After command completion occurred, registers were:
Mar 12 22:56:56 Tower smartctl[10433]:   ER ST SC SN CL CH DH
Mar 12 22:56:56 Tower smartctl[10433]:   -- -- -- -- -- -- --
Mar 12 22:56:56 Tower smartctl[10433]:   04 51 08 00 00 00 e0  Error: ABRT 8 sectors at LBA = 0x00000000 = 0
Mar 12 22:56:56 Tower smartctl[10433]: 
Mar 12 22:56:56 Tower smartctl[10433]:   Commands leading to the command that caused the error were:
Mar 12 22:56:56 Tower smartctl[10433]:   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
Mar 12 22:56:56 Tower smartctl[10433]:   -- -- -- -- -- -- -- --  ----------------  --------------------
Mar 12 22:56:56 Tower smartctl[10433]:   c8 00 08 00 00 00 e0 00      00:04:47.758  READ DMA
Mar 12 22:56:56 Tower smartctl[10433]:   ef 10 02 00 00 00 a0 00      00:04:47.758  SET FEATURES [Reserved for Serial ATA]
Mar 12 22:56:56 Tower smartctl[10433]:   ec 00 00 00 00 00 a0 00      00:04:47.756  IDENTIFY DEVICE
Mar 12 22:56:56 Tower smartctl[10433]:   ef 03 42 00 00 00 a0 00      00:04:47.756  SET FEATURES [set transfer mode]
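
The two numbers that stand out in that report, 12 power-on hours and an ATA error count of 2153, can be pulled out of the syslog programmatically. This is a small Python sketch that assumes the smartctl output was written to syslog in the format shown above; `parse_smart_from_syslog` is a hypothetical name, and only the leading integer of each raw value is kept.

```python
import re

# Attribute row: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ATTR_RE = re.compile(
    r"smartctl\[\d+\]:\s+(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)
ERROR_COUNT_RE = re.compile(r"ATA Error Count:\s+(\d+)")


def parse_smart_from_syslog(lines):
    """Return ({attribute_name: raw_value}, ata_error_count or None)."""
    attributes = {}
    error_count = None
    for line in lines:
        match = ATTR_RE.search(line)
        if match:
            attributes[match.group(2)] = int(match.group(3))
            continue
        match = ERROR_COUNT_RE.search(line)
        if match:
            error_count = int(match.group(1))
    return attributes, error_count


# Lines copied from the report above.
sample = [
    "Mar 12 22:56:56 Tower smartctl[10433]: ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE",
    "Mar 12 22:56:56 Tower smartctl[10433]:   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0",
    "Mar 12 22:56:56 Tower smartctl[10433]:   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       12",
    "Mar 12 22:56:56 Tower smartctl[10433]: 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0",
    "Mar 12 22:56:56 Tower smartctl[10433]: ATA Error Count: 2153 (device log contains only the most recent five errors)",
    "Mar 12 22:56:56 Tower smartctl[10433]:   04 51 08 00 00 00 e0  Error: ABRT 8 sectors at LBA = 0x00000000 = 0",
]

attributes, error_count = parse_smart_from_syslog(sample)
print("Power_On_Hours:", attributes.get("Power_On_Hours"))
print("ATA Error Count:", error_count)
```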

 

I replaced the disk, rebuilt the new disk, and all seems fine now.  So, my questions are:

 

1) Can I trust the server now?  Monthly parity checks were run, as well as long SMART tests, and nothing seemed to indicate anything out of the ordinary.

2) I think I also read about a data corruption bug in the 4.X release whereby corruption may occur if the array was being written to as a disk was being rebuilt.  Unfortunately, I forgot to shut off the mover script while the disk was being rebuilt.  How can I check to see if there was corruption?  After the disk was rebuilt, it said the parity was correct, with no errors.  Am I supposed to run a NOCORRECT parity check again as well?

 

I've enclosed the full system log (after the reboot with the bad disk) for reference.

 

Thanks!

syslog-20130312-232952.zip

I think you handled it correctly.  Looks to me as if sdk was zapped, perhaps by an electrical spike?  The SMART report appears to show a drive that had been reinitialized, but was having serious issues.  If I ignore the drive and look solely at the errors, they essentially seem to show serious device issues, which could mean a bad or crashed device (in this case, the drive sdk) or bad power to the device.  You might look for a loose power connection, such as a bad power splitter in its power path.  Or the drive firmware just crashed, badly.  Either way, the drive had to be replaced, and your system is probably fine, in my opinion.  Do you have good surge protection for your server?

  • Author

Hi RobJ -

 

Thanks for the speedy reply!  It was a scary few moments until I found that I could rebuild the drive!  To answer your question about the surge protection: Yes, I have whole-house surge protection at the panel, and the server is also on an APC UPS.  The NOCORRECT parity check is almost done and it shows no errors.  I'll pull the server out and replace the power splitter to that backplane.

 

Strange how none of the other drives seemed to be affected, and that I was having issues renaming folders which resided on other drives.  In any case, I appreciate your reassurance that I can breathe easy.

Archived

This topic is now archived and is closed to further replies.
