January 1, 2013

I need some help. I woke up this morning, not with a hangover, but with issues on my unRAID server. At 00:30 the parity sync started, and apparently there were some issues. This morning I saw the Parity disk with a red ball (Parity had 700 or so write errors). All other disks had a green ball. I stopped the array, and to my horror I now see not only the Parity disk with a red ball but also one of the array disks. Syslog and screenshot attached. Is there a way I can still recover from this?

syslog: https://dl.dropbox.com/u/3121169/syslog.zip
screenshot: https://dl.dropbox.com/u/3121169/unraid.png
January 1, 2013

Since they both say "no device", odds are it is power related, or disk controller related, or one disk is blocking the other on the same controller.

Before doing anything, make a copy of your "config" folder. With it, you can regroup more easily if you need to revert to a known state.

Since a single write error will make a drive go red, I'd power down completely, then try unplugging and re-seating anything common to those two drives first, and reboot. It might be a power splitter, or a backplane... The odds of two disks both going bad at the exact same time are pretty slim. Right now, both are marked in the syslog as either "missing" or "removed".

You've got a lot of drives... is your power supply up to the task? (What exact make/model are you using?)

Joe L.
January 2, 2013 (Author)

Thanks Joe! I did as you said. After powering down and reseating everything, Disk1 came up green and the Parity drive had a blue ball. I rebuilt parity and all is green now. However, I do see some 'weird' things:

1. a really huge number of writes...
2. many errors on Disk 1
3. "Parity has not been checked yet." even though I did a rebuild of the parity...

Are these things to be concerned about?

As for the power supply, it's a Trust 520Watt: http://www.trust.com/products/product.aspx?artnr=14996
There are 6 drives in the server; shouldn't 520 W be enough? (I was thinking of adding some 3 TB drives soon.)
January 3, 2013

I would suspect that is a multi-rail power supply with a capacity of between 16 and 18 amps on the 12 volt rail powering the disks. I see no indication of it being a single-rail supply, and that feature would be prominently marketed if it existed.

With 6 non-green drives, each using 3 amps at spin-up, you are probably at or over the capacity of the power supply, especially since that does not account for the power used by the motherboard or the fans. Typically, the second rail is reserved exclusively for the PCIe connectors and is not available to the disk power connectors.

I would strongly suggest a single-rail power supply of sufficient capacity for your eventual expansion needs.
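The arithmetic behind that estimate can be sketched as follows. This is only a rough illustration: the 3 A per-drive spin-up figure and the 16 to 18 A rail range come from the post above, while the function names are hypothetical and the model ignores motherboard and fan draw, so real headroom would be smaller. Actual drives and supplies should be checked against their datasheets.

```python
def spinup_current_amps(num_drives: int, amps_per_drive: float = 3.0) -> float:
    """Worst-case 12 V current if every drive spins up at the same time."""
    return num_drives * amps_per_drive


def rail_headroom_amps(rail_capacity_amps: float, num_drives: int,
                       amps_per_drive: float = 3.0) -> float:
    """Remaining 12 V capacity after drive spin-up (negative means overloaded).

    Motherboard and fan draw are deliberately left out (an assumption of this
    sketch), so the real headroom is smaller than the value returned.
    """
    return rail_capacity_amps - spinup_current_amps(num_drives, amps_per_drive)


if __name__ == "__main__":
    drives = 6
    for rail in (16.0, 18.0):
        headroom = rail_headroom_amps(rail, drives)
        print(f"{rail:.0f} A rail, {drives} drives: headroom {headroom:+.1f} A")
```

With six drives the spin-up demand alone comes to 18 A, which is the top of the suspected rail range before anything else in the system is counted.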
January 3, 2013

"Post a SMART report for disk1."

+1. We need to see a SMART report for disk1.

A rebuild of parity is not a "check" of parity. (Your rebuild wrote parity to the parity disk, but you have not yet attempted to read it back to verify that it is correct.)
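The difference between a rebuild and a check can be illustrated with a toy model. This sketch assumes simple XOR parity across the data disks and uses short made-up byte strings in place of disks; the function names are hypothetical and this is not unRAID's actual code.

```python
from functools import reduce
from typing import List


def rebuild_parity(data_disks: List[bytes]) -> bytes:
    """Compute parity from the data disks: this is what a rebuild WRITES."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_disks))


def check_parity(data_disks: List[bytes], parity_disk: bytes) -> List[int]:
    """READ the parity disk back and list offsets where it disagrees with the data."""
    expected = rebuild_parity(data_disks)
    return [i for i, (want, got) in enumerate(zip(expected, parity_disk)) if want != got]


if __name__ == "__main__":
    disks = [b"\x01\x02\x03", b"\x10\x20\x30"]
    parity = rebuild_parity(disks)
    # Simulate a sector that was written but does not read back correctly.
    damaged = bytes([parity[0], parity[1] ^ 0xFF, parity[2]])
    print(check_parity(disks, parity))   # clean parity: no mismatches expected
    print(check_parity(disks, damaged))  # damaged parity: one mismatch expected
```

In this model a rebuild never reads the parity disk at all, which is why a successful rebuild says nothing about whether the parity disk can be read back.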
January 3, 2013 (Author)

Thanks! I don't seem to be able to run a SMART report on Disk1 or the Parity disk, though. I first spun up all disks. After doing so, I'm getting:

For Disk 1 (similar with the Parity disk):

root@Tower:~# smartctl -t short /dev/sda
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl open device: /dev/sda failed: No such device
root@Tower:~#

For Disk2 (or all other disks):

root@Tower:~# smartctl -t short /dev/sdb
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu Jan 3 18:55:47 2013

Use smartctl -X to abort test.

Syslog from just before I spun up the disks:

...
Jan 3 18:48:29 Tower kernel: mdcmd (84618): spindown 0
Jan 3 18:48:29 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:29 Tower kernel: mdcmd (84619): spindown 1
Jan 3 18:48:32 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:32 Tower kernel: mdcmd (84620): spindown 0
Jan 3 18:48:32 Tower kernel: mdcmd (84621): spindown 1
Jan 3 18:48:32 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:33 Tower last message repeated 2 times
Jan 3 18:48:33 Tower kernel: mdcmd (84622): spindown 0
Jan 3 18:48:33 Tower kernel: mdcmd (84623): spindown 1
Jan 3 18:48:34 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:34 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:34 Tower kernel: mdcmd (84624): spindown 0
Jan 3 18:48:34 Tower kernel: mdcmd (84625): spindown 1
Jan 3 18:48:37 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:37 Tower kernel: mdcmd (84626): spindown 0
Jan 3 18:48:37 Tower kernel: mdcmd (84627): spindown 1
Jan 3 18:48:37 Tower emhttp: mdcmd: write: No such device or address
Jan 3 18:48:39 Tower last message repeated 2 times
Jan 3 18:48:39 Tower kernel: mdcmd (84628): spindown 0
Jan 3 18:48:39 Tower kernel: mdcmd (84629): spindown 1
Jan 3 18:48:39 Tower emhttp: Spinning up all drives...
Jan 3 18:48:39 Tower emhttp: shcmd (332): /usr/sbin/hdparm -S0 /dev/sde &> /dev/null
Jan 3 18:48:39 Tower kernel: mdcmd (84630): spinup 0
Jan 3 18:48:39 Tower kernel: mdcmd (84631): spinup 1
Jan 3 18:48:39 Tower kernel: mdcmd (84632): spinup 2
Jan 3 18:48:39 Tower kernel: mdcmd (84633): spinup 3
Jan 3 18:48:39 Tower kernel: mdcmd (84634): spinup 5
Jan 3 18:49:29 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
Jan 3 18:49:29 Tower kernel: hdparm: sending ioctl 2285 to a partition!
Jan 3 18:49:33 Tower last message repeated 5 times
Jan 3 18:49:33 Tower kernel: smartctl: sending ioctl 2285 to a partition!
Jan 3 18:49:33 Tower last message repeated 3 times
Jan 3 18:49:34 Tower kernel: scsi_verify_blk_ioctl: 14 callbacks suppressed
Jan 3 18:49:34 Tower kernel: smartctl: sending ioctl 2285 to a partition!
Jan 3 18:49:34 Tower last message repeated 9 times
Jan 3 18:50:34 Tower kernel: scsi_verify_blk_ioctl: 12 callbacks suppressed
Jan 3 18:50:34 Tower kernel: hdparm: sending ioctl 2285 to a partition!
Jan 3 18:50:38 Tower last message repeated 5 times
Jan 3 18:50:38 Tower kernel: smartctl: sending ioctl 2285 to a partition!
Jan 3 18:50:38 Tower last message repeated 3 times
Jan 3 18:51:39 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
Jan 3 18:51:39 Tower kernel: hdparm: sending ioctl 2285 to a partition!
Jan 3 18:51:42 Tower last message repeated 5 times
Jan 3 18:51:42 Tower kernel: smartctl: sending ioctl 2285 to a partition!
Jan 3 18:51:42 Tower last message repeated 3 times
Jan 3 18:52:43 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
Jan 3 18:52:43 Tower kernel: hdparm: sending ioctl 2285 to a partition!
Jan 3 18:52:46 Tower last message repeated 5 times
Jan 3 18:52:46 Tower kernel: smartctl: sending ioctl 2285 to a partition!
Jan 3 18:52:46 Tower last message repeated 3 times
Jan 3 18:53:47 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
Jan 3 18:53:47 Tower kernel: hdparm: sending ioctl 2285 to a partition!
Jan 3 18:53:51 Tower last message repeated 5 times
...

Disk2 is on the same controller as the Parity disk and Disk1. No issues with Disk2 though.

Any suggestions, or should I first check the parity and then shut everything down again and try once more to reseat modules/swap cables/...?
January 3, 2013

It really sounds like your hard disks are underpowered... I would follow Joe's suggestion and look for a PSU replacement!
January 4, 2013

"As for the power supply, it's a Trust 520Watt : http://www.trust.com/products/product.aspx?artnr=14996 There are 6 drives in the server, shouldn't 520Watt be enough? (I was thinking of adding some 3Tb drives soon)."

What are the specs of your build? CPU/MB/RAM? Use http://extreme.outervision.com/psucalculatorlite.jsp to check how much power you need. 520 W should be enough, but I can't say without the other info. Is it a Celeron, or one of the i3/i5 power hogs?

I think a good PSU is the most important part of your build. More drives need more power. I've never heard of Trust, so I can't trust it; go with a name you know:

1. XFX
2. Corsair
3. PC Power & Cooling/SeaSonic
4. Antec
5. OCZ

What happens when you disconnect all drives except the 2 in question? If it's not a power issue, check the cables.

Is a SMART test essentially the same as the SMART option in the BIOS? Sorry if this is a thread hijack, but it seems like a good time to ask.
January 5, 2013

As Joe said, amperage on the 12 volt rail is more important than wattage. The link given for the Trust 520Watt does not include any information on the number of rails or their amperage. Most inexpensive PSUs contain 2 or more 12 volt rails with only 17 or 18 amps per rail. These types of PSUs are not suitable for systems with many HDDs.

See here: http://lime-technology.com/forum/index.php?topic=12219.0
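To see why the headline wattage says little about the drives, the per-rail figures can be converted to watts with P = V × I. The sketch below is just that conversion; the 17 and 18 A values are the ones quoted in the post above, and the function name is hypothetical.

```python
def rail_watts(volts: float, amps: float) -> float:
    """Power available on a single rail: P = V * I."""
    return volts * amps


if __name__ == "__main__":
    for amps in (17.0, 18.0):
        print(f"12 V rail at {amps:.0f} A: {rail_watts(12.0, amps):.0f} W")
```

A single 12 V rail limited to 17 or 18 A can deliver only 204 to 216 W to the disks, whatever total wattage is printed on the label.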
January 8, 2013 (Author)

Finally got some time to do some tests. I have a kit to connect external drives via USB. I used its power supply to power one of those 2 disks (Parity and Disk1). Now, with one of them on external power, things seem better. I'm able to run SMART tests now.

For Disk1, I think (please correct me if I'm wrong) all is fine:

root@Tower:~# smartctl -a -d ata /dev/sda
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2002FAEX-007BA0
Serial Number:    WD-WMAY03237929
Firmware Version: 05.01D05
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan 8 14:04:36 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85)
    Offline data collection activity was aborted by an interrupting command from host.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (30180) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       3500
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1194
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4010
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       947
192 Power-Off_Retract_Count 0x0032   199   199   000    Old_age   Always       -       935
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       258
194 Temperature_Celsius     0x0022   123   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%       4010        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~#

For the Parity disk however, I think the RAW_VALUE numbers are good but there are read errors (did the short smart test 3 times, 3 times read error):

root@Tower:~# smartctl -a -d ata /dev/sdd
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-32S2B0
Serial Number:    WD-WCAVY2809029
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan 8 14:26:49 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85)
    Offline data collection activity was aborted by an interrupting command from host.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121)
    The previous self-test completed having the read element of the test failed.
Total time to complete Offline data collection: (40380) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f)
    SCT Status supported.
    SCT Error Recovery Control supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   204   142   021    Pre-fail  Always       -       6758
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2424
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13808
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1441
192 Power-Off_Retract_Count 0x0032   199   199   000    Old_age   Always       -       1354
193 Load_Cycle_Count        0x0032   172   172   000    Old_age   Always       -       86282
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure        90%      13808        2138713936
# 2  Short offline       Completed: read failure        90%      13807        2138713936
# 3  Short offline       Completed: read failure        80%      13807        2138713936
# 4  Short offline       Completed without error        00%       4524        -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~#

Is this a bad thing, or should I just write and check parity again?

I will get myself a decent power supply. Thanks for the pointers; I will first read up on it.

Another question I have: I enabled an automatic parity check every 1st of the month, and I get an email with the results. Is it possible to run a SMART test on all drives automatically once in a while and also get an email with the results? (Preferably in a user-friendly way, by just enabling something in unmenu or simplefeatures rather than setting up cron scripts, etc. If that's the only way to do it, then I can try to set it up.)

As for the specs of my system:

unRAID Version: unRAID Server Pro, Version 5.0-rc8a
Motherboard: ASUSTeK - P8B75-M
Processor: Intel® Core™ i5-3470 CPU @ 3.20GHz - 3.2 GHz
Cache: L1 = 32 kB L2 = 256 kB L3 = 6144 kB
Memory: 8 GB - DIMM0 = 1600 MHz DIMM1 = 1600 MHz DIMM2 = 1600 MHz DIMM3 = 1600 MHz
Network: 1000Mb/s - Full Duplex

How would this MB be classified on http://extreme.outervision.com/psucalculatorlite.jsp ? Desktop/Server/Regular/High End?

Thanks everybody for all the help I got already on this! Really appreciate it!
January 8, 2013

The parity disk needs to be totally rewritten. Use the New Config button on the Utils tab. After the rebuild, the Current_Pending_Sector RAW_VALUE should be zero. It must be zero.
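One way to watch for this condition is to read the RAW_VALUE column out of the attribute table that smartctl prints. The sketch below is an illustration under stated assumptions: it parses a pasted text sample instead of calling smartctl, the helper names and the list of watched attributes are hypothetical choices, and it assumes the ten-column table layout shown in the reports earlier in this thread. The sample rows are the Parity disk values posted above.

```python
from typing import Dict, List

# Attributes treated as warning signs when their raw value is non-zero
# (a hypothetical selection for this sketch).
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable", "Reallocated_Sector_Ct")


def parse_raw_values(smart_text: str) -> Dict[str, int]:
    """Map attribute name to RAW_VALUE for each row of the attribute table."""
    values: Dict[str, int] = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; RAW_VALUE is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def nonzero_watched(smart_text: str) -> List[str]:
    """Return the watched attributes whose raw value is not zero."""
    raw = parse_raw_values(smart_text)
    return [name for name in WATCHED if raw.get(name, 0) != 0]


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
"""

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

For the sample rows this flags Current_Pending_Sector and Offline_Uncorrectable, the two attributes that were non-zero on the Parity disk.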
January 11, 2013 (Author)

I did a preclear of the parity disk. After that, the SMART report is without errors. Result of the preclear:

1 sector was pending re-allocation before the start of the preclear.
1 sector was pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear, a change of -1 in the number of sectors pending re-allocation.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear, the number of sectors re-allocated did not change.

Then I used the New Config util and synced parity again. I did a parity check afterwards. All is working fine now. I ordered a Corsair AX760 PSU; it will arrive in a few days.

Thanks everybody for the help on this.