April 23, 2010

Just finished up this new unRAID. http://lime-technology.com/forum/index.php?topic=2031.msg59139#msg59139

I got a notification today that the array was down. A few minutes later I got another saying it was up again. When I checked the logs, they were polluted with:

md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
mdcmd (23751): spindown 8

I went to the web console but everything looked fine, so I decided to spin up all the drives. Disk8 won't spin up and it doesn't report any temperature.

md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
mdcmd (23752): spinup 0
mdcmd (23753): spinup 1
mdcmd (23754): spinup 2
mdcmd (23755): spinup 3
mdcmd (23756): spinup 4
mdcmd (23757): spinup 5
mdcmd (23758): spinup 6
mdcmd (23759): spinup 7
mdcmd (23760): spinup 8
md: disk8: ATA_OP_SETIDLE1 ioctl error: -5
mdcmd (23761): spinup 9

I then received another notification regarding a faulty drive.

This message is a status update for unRAID babylon
-----------------------------------------------------------------
Server Name: babylon
Status: The unRaid array needs attention. One or more disks are disabled or invalid.
Date: Fri Apr 23 14:35:52 GMT-2 2010

Disk Temperature Status
-----------------------------------------------------------------
Parity Disk [sde]: 20°C (DiskId: SAMSUNG_HD154UI_S1Y6J1MS706123)
Disk 1 [sdf]: 18°C (DiskId: SAMSUNG_HD154UI_S1Y6J1KS802253)
Disk 2 [sdg]: 19°C (DiskId: SAMSUNG_HD154UI_S1Y6J1KS801960)
Disk 3 [sdi]: 17°C (DiskId: SAMSUNG_HD154UI_S1Y6J1KS802247)
Disk 4 [sdj]: 22°C (DiskId: SAMSUNG_HD154UI_S1XWJ1KSC25631)
Disk 5 [sdk]: 24°C (DiskId: WDC_WD5000AACS-0_WD-WCASU3710167)
Disk 6 [sda]: 20°C (DiskId: SAMSUNG_HD501LJ_S0MUJ1KP400270)
Disk 7 [sdb]: 23°C (DiskId: WDC_WD5000AACS-0_WD-WCASU3676884)
Disk 8 [sdc]: Not-Reported (DiskId: WDC_WD15EARS-00Z_WD-WMAVU2365339)
Disk 9 [sdd]: 23°C (DiskId: WDC_WD15EARS-00Z_WD-WMAVU2826445)

Disk SMART Health Status
-----------------------------------------------------------------
Parity Disk PASSED (DiskId: SAMSUNG_HD154UI_S1Y6J1MS706123)
Disk 1 PASSED (DiskId: SAMSUNG_HD154UI_S1Y6J1KS802253)
Disk 2 PASSED (DiskId: SAMSUNG_HD154UI_S1Y6J1KS801960)
Disk 3 PASSED (DiskId: SAMSUNG_HD154UI_S1Y6J1KS802247)
Disk 4 PASSED (DiskId: SAMSUNG_HD154UI_S1XWJ1KSC25631)
Disk 5 PASSED (DiskId: WDC_WD5000AACS-0_WD-WCASU3710167)
Disk 6 PASSED (DiskId: SAMSUNG_HD501LJ_S0MUJ1KP400270)
Disk 7 PASSED (DiskId: WDC_WD5000AACS-0_WD-WCASU3676884)
Disk 8 Not-Reported (DiskId: WDC_WD15EARS-00Z_WD-WMAVU2365339)
Disk 9 PASSED (DiskId: WDC_WD15EARS-00Z_WD-WMAVU2826445)

Output of /proc/mdcmd:
-----------------------------------------------------------------
cmdOper=status
cmdResult=ok
sbName=/boot/config/super.dat
sbVersion=0.95.3
sbCreated=1259677173
sbUpdated=1272026147
sbEvents=210
sbState=0
sbNumDisks=10
sbSynced=1271024709
sbSyncErrs=0
mdVersion=0.95.4
mdState=STARTED
mdNumProtected=10
mdNumDisabled=1
mdDisabledDisk=8
mdNumInvalid=1
mdInvalidDisk=8
mdNumMissing=0
mdMissingDisk=0
mdNumNew=0
mdResync=0
diskNumber.0=0
diskName.0=
diskSize.0=1465138552
diskState.0=7
diskModel.0=SAMSUNG HD154UI
diskSerial.0=S1Y6J1MS706123
diskId.0=SAMSUNG_HD154UI_S1Y6J1MS706123
rdevNumber.0=0
rdevStatus.0=DISK_OK
rdevName.0=sde
rdevSize.0=1465138552
rdevModel.0=SAMSUNG HD154UI
rdevSerial.0=S1Y6J1MS706123
rdevId.0=SAMSUNG_HD154UI_S1Y6J1MS706123
rdevNumErrors.0=0
rdevLastIO.0=1272026148
rdevSpinupGroup.0=0
diskNumber.1=1
diskName.1=md1
diskSize.1=1465138552
diskState.1=7
diskModel.1=SAMSUNG HD154UI
diskSerial.1=S1Y6J1KS802253
diskId.1=SAMSUNG_HD154UI_S1Y6J1KS802253
rdevNumber.1=1
rdevStatus.1=DISK_OK
rdevName.1=sdf
rdevSize.1=1465138552
rdevModel.1=SAMSUNG HD154UI
rdevSerial.1=S1Y6J1KS802253
rdevId.1=SAMSUNG_HD154UI_S1Y6J1KS802253
rdevNumErrors.1=0
rdevLastIO.1=1272026147
rdevSpinupGroup.1=0
diskNumber.2=2
diskName.2=md2
diskSize.2=1465138552
diskState.2=7
diskModel.2=SAMSUNG HD154UI
diskSerial.2=S1Y6J1KS801960
diskId.2=SAMSUNG_HD154UI_S1Y6J1KS801960
rdevNumber.2=2
rdevStatus.2=DISK_OK
rdevName.2=sdg
rdevSize.2=1465138552
rdevModel.2=SAMSUNG HD154UI
rdevSerial.2=S1Y6J1KS801960
rdevId.2=SAMSUNG_HD154UI_S1Y6J1KS801960
rdevNumErrors.2=0
rdevLastIO.2=1272026147
rdevSpinupGroup.2=0
diskNumber.3=3
diskName.3=md3
diskSize.3=1465138552
diskState.3=7
diskModel.3=SAMSUNG HD154UI
diskSerial.3=S1Y6J1KS802247
diskId.3=SAMSUNG_HD154UI_S1Y6J1KS802247
rdevNumber.3=3
rdevStatus.3=DISK_OK
rdevName.3=sdi
rdevSize.3=1465138552
rdevModel.3=SAMSUNG HD154UI
rdevSerial.3=S1Y6J1KS802247
rdevId.3=SAMSUNG_HD154UI_S1Y6J1KS802247
rdevNumErrors.3=0
rdevLastIO.3=1272026147
rdevSpinupGroup.3=0
diskNumber.4=4
diskName.4=md4
diskSize.4=1465138552
diskState.4=7
diskModel.4=SAMSUNG HD154UI
diskSerial.4=S1XWJ1KSC25631
diskId.4=SAMSUNG_HD154UI_S1XWJ1KSC25631
rdevNumber.4=4
rdevStatus.4=DISK_OK
rdevName.4=sdj
rdevSize.4=1465138552
rdevModel.4=SAMSUNG HD154UI
rdevSerial.4=S1XWJ1KSC25631
rdevId.4=SAMSUNG_HD154UI_S1XWJ1KSC25631
rdevNumErrors.4=0
rdevLastIO.4=1272026148
rdevSpinupGroup.4=0
diskNumber.5=5
diskName.5=md5
diskSize.5=488386552
diskState.5=7
diskModel.5=WDC WD5000AACS-0
diskSerial.5=WD-WCASU3710167
diskId.5=WDC_WD5000AACS-0_WD-WCASU3710167
rdevNumber.5=5
rdevStatus.5=DISK_OK
rdevName.5=sdk
rdevSize.5=488386552
rdevModel.5=WDC WD5000AACS-0
rdevSerial.5=WD-WCASU3710167
rdevId.5=WDC_WD5000AACS-0_WD-WCASU3710167
rdevNumErrors.5=0
rdevLastIO.5=1272026147
rdevSpinupGroup.5=0
diskNumber.6=6
diskName.6=md6
diskSize.6=488386552
diskState.6=7
diskModel.6=SAMSUNG HD501LJ
diskSerial.6=S0MUJ1KP400270
diskId.6=SAMSUNG_HD501LJ_S0MUJ1KP400270
rdevNumber.6=6
rdevStatus.6=DISK_OK
rdevName.6=sda
rdevSize.6=488386552
rdevModel.6=SAMSUNG HD501LJ
rdevSerial.6=S0MUJ1KP400270
rdevId.6=SAMSUNG_HD501LJ_S0MUJ1KP400270
rdevNumErrors.6=0
rdevLastIO.6=1272026147
rdevSpinupGroup.6=0
diskNumber.7=7
diskName.7=md7
diskSize.7=488386552
diskState.7=7
diskModel.7=WDC WD5000AACS-0
diskSerial.7=WD-WCASU3676884
diskId.7=WDC_WD5000AACS-0_WD-WCASU3676884
rdevNumber.7=7
rdevStatus.7=DISK_OK
rdevName.7=sdb
rdevSize.7=488386552
rdevModel.7=WDC WD5000AACS-0
rdevSerial.7=WD-WCASU3676884
rdevId.7=WDC_WD5000AACS-0_WD-WCASU3676884
rdevNumErrors.7=0
rdevLastIO.7=1272026147
rdevSpinupGroup.7=0
diskNumber.8=8
diskName.8=md8
diskSize.8=1465138552
diskState.8=4
diskModel.8=WDC WD15EARS-00Z
diskSerial.8=WD-WMAVU2365339
diskId.8=WDC_WD15EARS-00Z_WD-WMAVU2365339
rdevNumber.8=8
rdevStatus.8=DISK_DSBL
rdevName.8=sdc
rdevSize.8=1465138552
rdevModel.8=WDC WD15EARS-00Z
rdevSerial.8=WD-WMAVU2365339
rdevId.8=WDC_WD15EARS-00Z_WD-WMAVU2365339
rdevNumErrors.8=4
rdevLastIO.8=1272026147
rdevSpinupGroup.8=0
diskNumber.9=9
diskName.9=md9
diskSize.9=1465138552
diskState.9=7
diskModel.9=WDC WD15EARS-00Z
diskSerial.9=WD-WMAVU2826445
diskId.9=WDC_WD15EARS-00Z_WD-WMAVU2826445
rdevNumber.9=9
rdevStatus.9=DISK_OK
rdevName.9=sdd
rdevSize.9=1465138552
rdevModel.9=WDC WD15EARS-00Z
rdevSerial.9=WD-WMAVU2826445
rdevId.9=WDC_WD15EARS-00Z_WD-WMAVU2826445
rdevNumErrors.9=0
rdevLastIO.9=1272026147
rdevSpinupGroup.9=0

This drive doesn't contain any data so I stopped the array and unassigned it. I can start the array but without any protection.
This drive is connected to the Adaptec 1430SA. Next step is to identify the root cause (cable, power supply, drive, SATA card, memory...). Any recommendations on where to start? Stay tuned!
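
For readers who would rather not scan the whole status dump by eye, here is a minimal Python sketch that picks out any disk whose rdevStatus is not DISK_OK. It assumes the /proc/mdcmd output is one key=value pair per line; the function names parse_status and unhealthy_disks are made up for illustration, and the sample is a hand-picked subset of the fields from the notification, not the full dump.

```python
# Hand-picked subset of the status fields from the notification (assumed
# to be one key=value pair per line in the real /proc/mdcmd output).
SAMPLE = """\
mdState=STARTED
mdNumDisabled=1
rdevStatus.8=DISK_DSBL
rdevName.8=sdc
rdevNumErrors.8=4
rdevStatus.9=DISK_OK
rdevName.9=sdd
rdevNumErrors.9=0
"""


def parse_status(text):
    """Turn key=value lines into a dict; lines without '=' are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def unhealthy_disks(fields):
    """Return one record per slot whose rdevStatus is not DISK_OK."""
    problems = []
    for key, status in fields.items():
        if key.startswith("rdevStatus.") and status != "DISK_OK":
            slot = key.split(".", 1)[1]
            problems.append({
                "slot": int(slot),
                "device": fields.get(f"rdevName.{slot}", "?"),
                "status": status,
                "errors": int(fields.get(f"rdevNumErrors.{slot}", "0")),
            })
    return sorted(problems, key=lambda p: p["slot"])


if __name__ == "__main__":
    for disk in unhealthy_disks(parse_status(SAMPLE)):
        print(f"disk{disk['slot']} ({disk['device']}): "
              f"{disk['status']}, {disk['errors']} errors")
```

On the sample this reports only slot 8 (sdc), which agrees with mdDisabledDisk=8 in the notification.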
April 23, 2010

Un-assigning the drive does not remove it from the array. To unRAID it is exactly the same as a failed drive until you save a new configuration or replace it with a new drive. What I'm trying to say is that you are currently unprotected from a second drive failure. You can test this by attempting to read or write to the drive you just un-assigned. You'll see it is still possible, aided by parity and the remaining data drives.

It is not unusual for a drive to fail early in its life. It is just as easy for a cable (either data OR power) to work itself loose with changes in heat and/or vibration.

Step 1. Before you do anything more, post a syslog. It will give some clue as to how the drive went offline.

Step 2. Try to get a SMART report on the drive. From what you've said, odds are it will not respond.

Step 3. Stop the array, power down, and re-seat the cables to the drive. Power back up; if you are lucky you'll now be able to get a SMART report.

If the report does not show any problems with the drive's health, you can stop the array and re-assign the drive. When you next start the array by pressing "Start", the disk will be re-constructed. Do NOT press the button labeled "restore", as it is actually a "Delete Existing Disk Configuration and Parity" button.

If you do wish to leave the un-assigned disk out of the array, because there is no data on it and you will not be replacing it, then you will want to press the button labeled "restore", precisely because it is a "Delete Existing Disk Configuration and Parity" function. After pressing it you may then press the "Start" button, which will set a new disk configuration based on the disks that are assigned and working at that point, and the array will begin a full parity calculation on those disks. You will not have parity protection again until that is complete.

Be careful working in your server. Try not to dislodge other cables, as then you'll have multiple failed drives and it is more difficult to recover from that situation (you might lose your ability to re-construct from parity).

Do NOT press a "Format" button if you see one.

If you do have a replacement disk, and the original failed, then after installing the replacement all you need to do to get back up and running is press the "Start" button (you might need to check the checkbox under it to enable it).

Joe L.
April 23, 2010 (Author)

Thanks Joe. For safety reasons, my unRAID was turned off the rest of the day. I'm going to investigate the hardware. In the meantime here is my syslog: http://natzo.com/logs/syslog-20100423-145542.txt
April 23, 2010

The first error seems to be here:

Apr 22 12:20:02 babylon kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 22 12:20:02 babylon kernel: ata3.00: failed command: SMART
Apr 22 12:20:02 babylon kernel: ata3.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Apr 22 12:20:02 babylon kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 22 12:20:02 babylon kernel: ata3.00: status: { DRDY }
Apr 22 12:20:02 babylon kernel: ata3: hard resetting link
Apr 22 12:20:08 babylon kernel: ata3: link is slow to respond, please be patient (ready=0)
Apr 22 12:20:12 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:20:12 babylon kernel: ata3: hard resetting link
Apr 22 12:20:18 babylon kernel: ata3: link is slow to respond, please be patient (ready=0)
Apr 22 12:20:22 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:20:22 babylon kernel: ata3: hard resetting link
Apr 22 12:20:28 babylon kernel: ata3: link is slow to respond, please be patient (ready=0)
Apr 22 12:20:57 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:20:57 babylon kernel: ata3: limiting SATA link speed to 1.5 Gbps
Apr 22 12:20:57 babylon kernel: ata3: hard resetting link
Apr 22 12:21:02 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:21:02 babylon kernel: ata3: reset failed, giving up
Apr 22 12:21:02 babylon kernel: ata3.00: disabled
Apr 22 12:21:02 babylon kernel: ata3: EH complete
Apr 22 12:25:28 babylon kernel: md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
Apr 22 12:25:38 babylon kernel: mdcmd (3430): spindown 8
Apr 22 12:25:38 babylon kernel: md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
Apr 22 12:25:38 babylon kernel: mdcmd (3431): spindown 9
Apr 22 12:25:49 babylon kernel: mdcmd (3433): spindown 8
Apr 22 12:25:49 babylon kernel: md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
Apr 22 12:25:59 babylon kernel: mdcmd (3435): spindown 8
Apr 22 12:25:59 babylon kernel: md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
Apr 22 12:26:09 babylon kernel: mdcmd (3437): spindown 8
Apr 22 12:26:09 babylon kernel: md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5

Basically the drive timed out when given a command, and even though the driver tried several times to reset it, it could not.
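
If you want to find this kind of sequence in a long syslog without reading it line by line, something like the following Python sketch could help. It is only an illustration: it assumes one syslog entry per line in the format quoted above, the function name summarize_ata_errors is made up, and the sample is an excerpt of the quoted entries rather than the full log.

```python
import re

# Excerpt of the kernel messages quoted above (one entry per line).
LOG_EXCERPT = """\
Apr 22 12:20:02 babylon kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 22 12:20:02 babylon kernel: ata3.00: failed command: SMART
Apr 22 12:20:02 babylon kernel: ata3: hard resetting link
Apr 22 12:20:12 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:20:22 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:20:57 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:21:02 babylon kernel: ata3: SRST failed (errno=-16)
Apr 22 12:21:02 babylon kernel: ata3: reset failed, giving up
Apr 22 12:21:02 babylon kernel: ata3.00: disabled
Apr 22 12:25:28 babylon kernel: md: disk8: ATA_OP_STANDBYNOW1 ioctl error: -5
"""

# Matches "kernel: ata3: ..." and "kernel: ata3.00: ..." and captures
# the port name and the rest of the message.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_errors(log_text):
    """Per ATA port: time of first exception, failed resets, disabled flag."""
    summary = {}
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        info = summary.setdefault(
            port, {"first_error": None, "failed_resets": 0, "disabled": False}
        )
        if info["first_error"] is None and message.startswith("exception"):
            info["first_error"] = line[:15]  # syslog timestamp, e.g. "Apr 22 12:20:02"
        if message.startswith("SRST failed"):
            info["failed_resets"] += 1
        if message == "disabled":
            info["disabled"] = True
    return summary


if __name__ == "__main__":
    for port, info in summarize_ata_errors(LOG_EXCERPT).items():
        print(port, info)
```

For the excerpt this gives a single port, ata3, with the first exception at Apr 22 12:20:02, four failed resets, and the device disabled.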
April 23, 2010 (Author)

Powered up the machine and guess what, drive is showing up again. I re-assigned it and got the following reports:

HDParm info

/dev/sdc:

ATA device, with non-removable media
    Model Number: WDC WD15EARS-00Z5B1
    Serial Number: WD-WMAVU2365339
    Firmware Revision: 80.00A80
    Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5
Standards:
    Supported: 8 7 6 5
    Likely used: 8
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors: 16514064
    LBA user addressable sectors: 268435455
    LBA48 user addressable sectors: 2930277168
    device size with M = 1024*1024: 1430799 MBytes
    device size with M = 1000*1000: 1500301 MBytes (1500 GB)
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16 Current = 0
    Recommended acoustic management value: 128, current value: 254
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
        Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
        Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            Power-Up In Standby feature set
       *    SET_FEATURES required to spinup after power up
            SET_MAX security extension
            Automatic Acoustic Management feature set
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    64-bit World wide name
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    SATA-I signaling speed (1.5Gb/s)
       *    SATA-II signaling speed (3.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Host-initiated interface power management
       *    Phy event counters
       *    unknown 76[12]
            DMA Setup Auto-Activate optimization
       *    Software settings preservation
       *    SMART Command Transport (SCT) feature set
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
            unknown 206[12] (vendor specific)
            unknown 206[13] (vendor specific)
Security:
    Master password revision code = 65534
            supported
    not     enabled
    not     locked
            frozen
    not     expired: security count
            supported: enhanced erase
    324min for SECURITY ERASE UNIT. 324min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee0acc3b191
    NAA : 5
    IEEE OUI : 14ee
    Unique ID : 0acc3b191
Checksum: correct

Smart Status Report

Statistics for /dev/sdc

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD15EARS-00Z5B1
Serial Number: WD-WMAVU2365339
Firmware Version: 80.00A80
User Capacity: 1,500,301,910,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Fri Apr 23 22:19:04 2010 GMT-2
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84)
    Offline data collection activity was suspended by an interrupting command from host.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (31800) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3031)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   202   190   021    Pre-fail  Always       -       4866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   001   001   000    Old_age   Always       -       65535
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       37
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       10
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       167
194 Temperature_Celsius     0x0022   127   117   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       36

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.
[To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

From here I can decide to re-assign the drive and rebuild the array. However, I'm a little scared of doing so, since I could endanger the integrity of the array if another drive fails. I really don't know which hardware component caused the drive to stop responding to commands (the drive itself, the Adaptec 1430SA, power...). At this point I would appreciate any advice on what to do.

Thanks
Alphazo

PS: I ran a short SMART test and it went well. I'm now running a long SMART test.
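
As a side note on reading the SMART attribute table in the report: a minimal Python sketch that flags rows where the normalized VALUE has dropped to or below THRESH, plus a few raw counters that many people watch. The watch list (IDs 5, 197, 198, 199) is a common rule of thumb and an assumption here, not something stated in this thread; the function name review_attributes is made up, and the sample holds only a few rows copied from the report.

```python
# A few rows copied from the attribute table in the SMART report.
SAMPLE_ROWS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   001   001   000    Old_age   Always       -       65535
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumption: raw counters commonly treated as worth a look when non-zero
# (reallocated, pending, uncorrectable sectors, and interface CRC errors).
WATCHED_RAW_IDS = {5, 197, 198, 199}


def review_attributes(report_text):
    """Return (id, name, reason) tuples for rows that deserve attention."""
    findings = []
    for line in report_text.splitlines():
        parts = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id, name = int(parts[0]), parts[1]
        value, thresh, raw = int(parts[3]), int(parts[5]), parts[9]
        if value <= thresh:
            findings.append((attr_id, name, f"value {value} <= threshold {thresh}"))
        if attr_id in WATCHED_RAW_IDS and raw.isdigit() and int(raw) > 0:
            findings.append((attr_id, name, f"raw value {raw}"))
    return findings


if __name__ == "__main__":
    findings = review_attributes(SAMPLE_ROWS)
    print(findings if findings else "nothing flagged")
```

With the rows above nothing is flagged, which is consistent with the PASSED overall-health result. It says nothing about cables, the controller, or power, which a SMART table cannot rule out.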
April 23, 2010

I'd let the long test complete, then:

Stop the array.
Re-assign the drive.
Use the "Trust-my-parity" procedure as described in the wiki: after pressing the "restore" button and BEFORE you press "Start", make sure you type the command as described in the wiki and see the Ok response.
Once you see the correct response, press "Start".
Let the parity check complete.

Then:

Stop the array.
Power down.
Check for loose cables.
Power back up.
Press "Start" to start the array if it does not start by itself.

Joe L.
April 26, 2010 (Author)

Long SMART test passed. I checked all SATA and power cables. Thanks for the "I trust my parity drive" trick. However, I restored the array by going through a full (useless in my case) data rebuild process. The array is back online and fully protected. Now let's see how this disk8 behaves in the future. Thanks a lot for the assistance.

Alphazo