October 25, 2009
Red always equals bad, so I am trying to figure out what has happened. If I go to the Main screen in unMENU, beside my one drive it says: DISK_DSBL. I can still browse to the drive, but it looks like it has fallen out of the unRAID array (it's not protected now). I'm also thinking the "errors" at the end may be a big concern. The drive has 1.5TB of data on it and it's only a month old. Suggestions on what to do from here?
October 25, 2009
1. Replace the drive OR copy the data from the virtual drive to somewhere else.
2. Remove the drive from the array and run diagnostics on it... LOTS of diagnostics... and either RMA it or junk it.
October 25, 2009 Author
Is there a specific series of steps I should follow to complete this?
1. Stop the array
2. Shut the machine off
3. I've just removed another 1.5TB drive from my Popcorn Hour, so I can add it to the system and run pre_clear on it.
After I do this, do I just use Windows Explorer to drag and drop the contents from the failing drive over to the new drive? Where do I go after this? http://tower:8080/array_management and then select "Check and Correct Parity"? I don't want to mess this up. Thanks
October 25, 2009
The very first thing you should do is post a copy of your syslog. Since you have unMENU loaded, it is as simple as clicking the link on the syslog plug-in page and then attaching the file to your next post in this thread. Only after looking at how the drive was disabled can we know whether the drive itself is at fault, or something else.

The disk is disabled because a write to it failed. It could be the drive itself, a loose cable, or a loose interface card. Since you have unMENU installed, you can request "SMART" status reports easily through it. Post the "status report" output.

Since you have one disabled drive, it is being "simulated" by reading parity and all the other data drives. Until you get this resolved, do not add or remove ANY other drives, or you will lose the data being simulated by parity and the other disks. You should correct the problem as soon as possible, since if a second drive were to fail, you would lose the data on both failed drives. It is not just the data on the failed drive that is at risk; all of your data is at the same risk from a second concurrent drive failure.

See this link in the wiki: http://lime-technology.com/wiki/index.php?title=Troubleshooting#What_do_I_do_if_I_get_a_red_ball_next_to_a_hard_disk.3F

Do not be misled by the fact that you can still read and write to the drive with a red ball indicator. When writing, you are in fact writing to the parity drive as if the failed drive were working. When reading, you are reading all of the remaining drives and reconstructing the data on the failed drive. If a drive has a red ball on the unRAID management page, it has been taken out of service. You will need to take corrective action, as a second concurrent disk failure will almost certainly result in lost data.

DO NOT press the button labeled "Restore" on the unRAID interface. It does not restore a disk; instead it sets a new initial configuration based on the currently assigned and WORKING disks, and it immediately throws away any old parity data. If you were to press it now, you would erase all knowledge of the failed disk and anything that was on it. If you replace the disk, you only need to press the "Start" button to get your data rebuilt onto it.

So, first post a syslog, before you reboot and before you power down to check the cabling.

Joe L.
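To illustrate why a disabled disk can still be read, and why a second failure is fatal, here is a minimal Python sketch of single-parity reconstruction. It assumes plain XOR parity over equal-sized blocks; the function names and the tiny byte strings are made up for illustration, and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_disks):
    """Parity is the XOR of the same block on every data disk."""
    return xor_blocks(data_disks)


def simulate_missing(parity, surviving_disks):
    """Rebuild the one missing disk from parity plus all surviving disks."""
    return xor_blocks([parity, *surviving_disks])


# Hypothetical 4-byte "disks", purely for demonstration.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk3 = b"\xaa\xbb\xcc\xdd"

parity = compute_parity([disk1, disk2, disk3])

# disk2 is "disabled": its contents are recomputed from everything else.
rebuilt = simulate_missing(parity, [disk1, disk3])
assert rebuilt == disk2
```

With two disks missing, the same XOR only yields the combination of the two lost blocks, which is why every read of the simulated disk depends on all remaining drives staying healthy.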
October 25, 2009
"Is there a specific series of steps I should follow to complete this?"
No, that is NOT what to do. You do not need to copy the files, and in fact you cannot add any drives to the protected array while it is in a degraded state. You will not be allowed to check parity...

You have only one possible solution.
1. Post the syslog.
2. Get SMART and hdparm reports for the failed drive.

If it has actually failed, as opposed to a bad or loose connection, then replace the drive, press "Start" after checking the checkbox under it, and have unRAID rebuild your old contents onto it.

If we suspect a bad connection, you will stop the array, un-assign the drive, re-start the array (it will show the drive as missing, but still simulate its contents), then stop the array once more, re-assign the disk, and press "Start" once more. It will reconstruct the drive.

If you are certain it was a loose cable, you can use the "Trust My Parity" procedure as described in the wiki. (It is a special procedure where you do press the button I said not to, but invoke a special command after pressing it and before starting the array, so it does not invalidate your parity and throw away all your data.)

Ask questions BEFORE you do anything. They are a lot easier to answer BEFORE you do something to endanger your data.

Joe L.
October 25, 200916 yr Author OK... I will stop doing anything with it right now. 1. Syslog Attached 2. Smart Status Report SMART status Info for /dev/sdg smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Device Model: WDC WD15EADS-00P8B0 Serial Number: WD-WMAVU0072618 Firmware Version: 01.00A01 User Capacity: 1,500,301,910,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sun Oct 25 17:34:15 2009 GMT+5 SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (32760) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 255) minutes. Conveyance self-test routine recommended polling time: ( 5) minutes. SCT capabilities: (0x303f) SCT Status supported. SCT Feature Control supported. 
SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0027 200 179 021 Pre-fail Always - 4975 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 119 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 669 10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 57 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 44 193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 3403 194 Temperature_Celsius 0x0022 122 115 000 Old_age Always - 28 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 No self-tests have been logged. [To run self-tests, use: smartctl -t] SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. 3. 
HDParm Info HDParm Info for /dev/sdg /dev/sdg: ATA device, with non-removable media Model Number: WDC WD15EADS-00P8B0 Serial Number: WD-WMAVU0072618 Firmware Revision: 01.00A01 Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5 Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 2930277168 device size with M = 1024*1024: 1430799 MBytes device size with M = 1000*1000: 1500301 MBytes (1500 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, with device specific minimum R/W multiple sector transfer: Max = 16 Current = 0 Recommended acoustic management value: 128, current value: 254 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * NOP cmd * DOWNLOAD_MICROCODE Power-Up In Standby feature set * SET_FEATURES required to spinup after power up SET_MAX security extension Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * 64-bit World wide name * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Host-initiated interface power management * Phy event counters * unknown 76[12] DMA Setup Auto-Activate optimization * Software 
settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) unknown 206[13] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 334min for SECURITY ERASE UNIT. 334min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 50014ee017dfab1 NAA : 5 IEEE OUI : 14ee Unique ID : 017dfab1 Checksum: correct
October 25, 2009 Author
My syslog was too big for one message, so here is everything before today (the first one had today's logs). For reference, all I have done over the past few days is link my Popcorn Hour to the server and then tweak it (i.e. get all the file names proper, add some .nfo files, images, etc.).
October 25, 2009
The syslog you posted is filled with repeating messages. The syslog rotation has already copied your original syslog to an alternate file. Type
ls -l /var/log/syslog*
to see them all. As an example, on my server it looks like this:
ls -l /var/log/syslog*
-rw-r--r-- 1 root root 17347 Oct 25 01:54 /var/log/syslog
-rw-r--r-- 1 root root 1088573 Oct 16 01:36 /var/log/syslog.1
The syslog.1 file is the first part of my syslog. When it filled, it was "rotated" out so a single file would not use up all my RAM.

You may have a few syslog files in your /var/log folder. We need to see the earlier one, where the error first occurred. You may need to copy it to your flash drive first and then upload it.

Joe L.
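Finding where the error first occurred across rotated files can be sketched in Python as below. This is only an illustration: the pattern list is an assumption about what counts as a disk error, the helper names are invented here, and the sample lines are copied from the log excerpts in this thread.

```python
import re

# Assumed markers of a disk problem; adjust to taste.
ERROR_PATTERN = re.compile(r"exception Emask|I/O error|read error|write error")


def first_error(lines):
    """Return the first line that looks like a disk error, or None."""
    for line in lines:
        if ERROR_PATTERN.search(line):
            return line.rstrip("\n")
    return None


def first_error_in_files(paths):
    """Scan files in the order given (oldest first) and return (path, line)."""
    for path in paths:
        with open(path, errors="replace") as handle:
            hit = first_error(handle)
        if hit is not None:
            return path, hit
    return None


sample = [
    "Oct 24 20:57:52 Tower kernel: ata2.00: configured for UDMA/133\n",
    "Oct 24 20:58:02 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0\n",
    "Oct 24 20:58:41 Tower kernel: md: disk1 read error\n",
]
found = first_error(sample)
print(found)
```

On a real server the call would look like first_error_in_files(["/var/log/syslog.1", "/var/log/syslog"]), with the rotated file listed first because it holds the older entries.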
October 25, 200916 yr Author If the issue isn't in the log I just posted, it had to be in this one ( as this the balance of my syslog ) from a week ago. This is a new drive from a few weeks ago that I pre-cleared multiple times. Here are some of the pre-clear results from when I first got it ( if it helps any ) Disk Temperature: 27C, Elapsed Time: 22:33:36 =========================================================================== = unRAID server Pre-Clear disk /dev/sdb = cycle 1 of 1 = Disk Pre-Clear-Read completed DONE = Step 1 of 10 - Copying zeros to first 2048k bytes DONE = Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE = Step 3 of 10 - Disk is now cleared from MBR onward. DONE = Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4 DONE = Step 5 of 10 - Clearing MBR code area DONE = Step 6 of 10 - Setting MBR signature bytes DONE = Step 7 of 10 - Setting partition 1 to precleared state DONE = Step 8 of 10 - Notifying kernel we changed the partitioning DONE = Step 9 of 10 - Creating the /dev/disk/by* entries DONE = Step 10 of 10 - Testing if the clear has been successful. DONE = Post-Read in progress: 99% complete. ( 1,500,291,072,000 of 1,500,301,910,016 bytes read ) 43.7 MB/s Disk Temperature: 27C, Elapsed Time: 22:34:47 =========================================================================== = unRAID server Pre-Clear disk /dev/sdb = cycle 1 of 1 = Disk Pre-Clear-Read completed DONE = Step 1 of 10 - Copying zeros to first 2048k bytes DONE = Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE = Step 3 of 10 - Disk is now cleared from MBR onward. 
DONE = Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4 DONE = Step 5 of 10 - Clearing MBR code area DONE = Step 6 of 10 - Setting MBR signature bytes DONE = Step 7 of 10 - Setting partition 1 to precleared state DONE = Step 8 of 10 - Notifying kernel we changed the partitioning DONE = Step 9 of 10 - Creating the /dev/disk/by* entries DONE = Step 10 of 10 - Testing if the clear has been successful. DONE = Disk Post-Clear-Read completed DONE Disk Temperature: 26C, Elapsed Time: 22:35:56 ============================================================================ == == Disk /dev/sdb has been successfully precleared == ============================================================================ S.M.A.R.T. error count differences detected after pre-clear note, some 'raw' values may change, but not be an indication of a problem 63c63 < 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 49 --- > 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 51 ============================================================================ root@Tower:/boot# =========================================================================== = unRAID server Pre-Clear disk /dev/sdb = cycle 1 of 1 = Disk Pre-Clear-Read completed DONE = Step 1 of 10 - Copying zeros to first 2048k bytes DONE = Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE = Step 3 of 10 - Disk is now cleared from MBR onward. DONE = Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4 DONE = Step 5 of 10 - Clearing MBR code area DONE = Step 6 of 10 - Setting MBR signature bytes DONE = Step 7 of 10 - Setting partition 1 to precleared state DONE = Step 8 of 10 - Notifying kernel we changed the partitioning DONE = Step 9 of 10 - Creating the /dev/disk/by* entries DONE = Step 10 of 10 - Testing if the clear has been successful. DONE = Post-Read in progress: 99% complete. 
( 1,498,646,016,000 of 1,500,301,910,016 bytes read ) 46.1 MB/s Disk Temperature: 25C, Elapsed Time: 22:34:39 =========================================================================== = unRAID server Pre-Clear disk /dev/sdb = cycle 1 of 1 = Disk Pre-Clear-Read completed DONE = Step 1 of 10 - Copying zeros to first 2048k bytes DONE = Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE = Step 3 of 10 - Disk is now cleared from MBR onward. DONE = Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4 DONE = Step 5 of 10 - Clearing MBR code area DONE = Step 6 of 10 - Setting MBR signature bytes DONE = Step 7 of 10 - Setting partition 1 to precleared state DONE = Step 8 of 10 - Notifying kernel we changed the partitioning DONE = Step 9 of 10 - Creating the /dev/disk/by* entries DONE = Step 10 of 10 - Testing if the clear has been successful. DONE = Post-Read in progress: 99% complete. ( 1,500,291,072,000 of 1,500,301,910,016 bytes read ) 46.2 MB/s Disk Temperature: 25C, Elapsed Time: 22:35:48 =========================================================================== = unRAID server Pre-Clear disk /dev/sdb = cycle 1 of 1 = Disk Pre-Clear-Read completed DONE = Step 1 of 10 - Copying zeros to first 2048k bytes DONE = Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE = Step 3 of 10 - Disk is now cleared from MBR onward. DONE = Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4 DONE = Step 5 of 10 - Clearing MBR code area DONE = Step 6 of 10 - Setting MBR signature bytes DONE = Step 7 of 10 - Setting partition 1 to precleared state DONE = Step 8 of 10 - Notifying kernel we changed the partitioning DONE = Step 9 of 10 - Creating the /dev/disk/by* entries DONE = Step 10 of 10 - Testing if the clear has been successful. 
DONE = Disk Post-Clear-Read completed DONE Disk Temperature: 25C, Elapsed Time: 22:36:57 ============================================================================ == == Disk /dev/sdb has been successfully precleared == ============================================================================ S.M.A.R.T. error count differences detected after pre-clear note, some 'raw' values may change, but not be an indication of a problem 58c58 < 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0 --- > 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0 63c63 < 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 76 --- > 193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 77 ============================================================================ root@Tower:/boot#
October 25, 2009
It's in the latter log... look at Oct 24 20:57:52. I'm wondering if the power blip you had had some lingering effects. What make and size of UPS do you have? I suggest changing your disks from AHCI back to IDE.
October 25, 200916 yr You read my mind... nice... Yes, it looks like a poor connection more than a failed drive. The SMART report and hdparm look good. The first errors I see are here: Oct 24 20:57:52 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Oct 24 20:57:52 Tower kernel: ata2.00: irq_stat 0x40000001 Oct 24 20:57:52 Tower kernel: ata2.00: cmd 25/00:08:87:40:d4/00:00:31:00:00/e0 tag 0 dma 4096 in Oct 24 20:57:52 Tower kernel: res 41/04:00:87:40:d4/00:00:31:00:00/e0 Emask 0x1 (device error) Oct 24 20:57:52 Tower kernel: ata2.00: status: { DRDY ERR } Oct 24 20:57:52 Tower kernel: ata2.00: error: { ABRT } Oct 24 20:57:52 Tower kernel: ata2.00: configured for UDMA/133 Oct 24 20:57:52 Tower kernel: ata2: EH complete Oct 24 20:58:02 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Oct 24 20:58:02 Tower kernel: ata2.00: irq_stat 0x40000001 Oct 24 20:58:02 Tower kernel: ata2.00: cmd 25/00:08:87:40:d4/00:00:31:00:00/e0 tag 0 dma 4096 in Oct 24 20:58:02 Tower kernel: res 41/04:00:87:40:d4/00:00:31:00:00/e0 Emask 0x1 (device error) Oct 24 20:58:02 Tower kernel: ata2.00: status: { DRDY ERR } Oct 24 20:58:02 Tower kernel: ata2.00: error: { ABRT } Oct 24 20:58:02 Tower kernel: ata2.00: configured for UDMA/133 Oct 24 20:58:02 Tower kernel: ata2: EH complete Oct 24 20:58:12 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Oct 24 20:58:12 Tower kernel: ata2.00: irq_stat 0x40000001 Oct 24 20:58:12 Tower kernel: ata2.00: cmd 25/00:08:87:40:d4/00:00:31:00:00/e0 tag 0 dma 4096 in Oct 24 20:58:12 Tower kernel: res 41/04:00:87:40:d4/00:00:31:00:00/e0 Emask 0x1 (device error) Oct 24 20:58:12 Tower kernel: ata2.00: status: { DRDY ERR } Oct 24 20:58:12 Tower kernel: ata2.00: error: { ABRT } Oct 24 20:58:12 Tower kernel: ata2.00: configured for UDMA/133 Oct 24 20:58:12 Tower kernel: ata2: EH complete Oct 24 20:58:22 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Oct 24 20:58:22 Tower kernel: 
ata2.00: irq_stat 0x40000001 Oct 24 20:58:22 Tower kernel: ata2.00: cmd 25/00:08:87:40:d4/00:00:31:00:00/e0 tag 0 dma 4096 in Oct 24 20:58:22 Tower kernel: res 41/04:00:87:40:d4/00:00:31:00:00/e0 Emask 0x1 (device error) Oct 24 20:58:22 Tower kernel: ata2.00: status: { DRDY ERR } Oct 24 20:58:22 Tower kernel: ata2.00: error: { ABRT } Oct 24 20:58:22 Tower kernel: ata2.00: configured for UDMA/133 Oct 24 20:58:22 Tower kernel: ata2: EH complete Oct 24 20:58:31 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Oct 24 20:58:31 Tower kernel: ata2.00: irq_stat 0x40000001 Oct 24 20:58:31 Tower kernel: ata2.00: cmd 25/00:08:87:40:d4/00:00:31:00:00/e0 tag 0 dma 4096 in Oct 24 20:58:31 Tower kernel: res 41/04:00:87:40:d4/00:00:31:00:00/e0 Emask 0x1 (device error) Oct 24 20:58:31 Tower kernel: ata2.00: status: { DRDY ERR } Oct 24 20:58:31 Tower kernel: ata2.00: error: { ABRT } Oct 24 20:58:31 Tower kernel: ata2.00: configured for UDMA/133 Oct 24 20:58:31 Tower kernel: ata2: EH complete Oct 24 20:58:41 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 Oct 24 20:58:41 Tower kernel: ata2.00: irq_stat 0x40000001 Oct 24 20:58:41 Tower kernel: ata2.00: cmd 25/00:08:87:40:d4/00:00:31:00:00/e0 tag 0 dma 4096 in Oct 24 20:58:41 Tower kernel: res 41/04:00:87:40:d4/00:00:31:00:00/e0 Emask 0x1 (device error) Oct 24 20:58:41 Tower kernel: ata2.00: status: { DRDY ERR } Oct 24 20:58:41 Tower kernel: ata2.00: error: { ABRT } Oct 24 20:58:41 Tower kernel: ata2.00: configured for UDMA/133 Oct 24 20:58:41 Tower kernel: sd 1:0:0:0: [sdb] Result: hostbyte=0x00 driverbyte=0x08 Oct 24 20:58:41 Tower kernel: sd 1:0:0:0: [sdb] Sense Key : 0xb [current] [descriptor] Oct 24 20:58:41 Tower kernel: Descriptor sense data with sense descriptors (in hex): Oct 24 20:58:41 Tower kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 Oct 24 20:58:41 Tower kernel: 31 d4 40 87 Oct 24 20:58:41 Tower kernel: sd 1:0:0:0: [sdb] ASC=0x0 ASCQ=0x0 Oct 24 20:58:41 Tower 
kernel: end_request: I/O error, dev sdb, sector 835993735
Oct 24 20:58:41 Tower kernel: ata2: EH complete
Oct 24 20:58:41 Tower kernel: md: disk1 read error

I'd stop the array, power down, re-seat the connectors (both power AND data), especially if you are using any kind of drive trays, and then power back up. If the disk still looks good via the hdparm and SMART reports and ALL the other disks are still "green", you are a candidate for the "Trust My Parity" procedure. (If the hdparm and smartctl reports print, the disk is responding; if they are unavailable, communication with the disk has failed again.)

This SHOULD NOT be used if the disk has been written to since the failure since it will assume the physical disk is correct and the virtual disk is not.

The procedure is described here: http://lime-technology.com/wiki/index.php?title=Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily

You must follow it exactly. Read it through and ask questions before starting it. You must issue the mdcmd set invalidslot 99 command and get the expected response AFTER pressing the button labeled "Restore" but BEFORE pressing the "Start" button.
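The repeated "cmd 25/00:08:87:40:d4/00:00:31:00:00/e0" lines can be tied to the failing sector with a little arithmetic. The Python sketch below assumes the usual libata field order (command / feature:count:LBA-low:LBA-mid:LBA-high / the same five fields for the high-order bytes / device); the function name is invented for this example.

```python
def decode_lba48(taskfile):
    """Decode the 48-bit LBA from a libata 'cmd' taskfile string.

    Assumed layout: command/feature:count:lbal:lbam:lbah/
                    hob_feature:hob_count:hob_lbal:hob_lbam:hob_lbah/device
    """
    _command, low, high, _device = taskfile.split("/")
    _feature, _count, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hfeature, _hcount, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    return (
        hob_lbah << 40
        | hob_lbam << 32
        | hob_lbal << 24
        | lbah << 16
        | lbam << 8
        | lbal
    )


lba = decode_lba48("25/00:08:87:40:d4/00:00:31:00:00/e0")
print(lba)  # 835993735
```

That value matches the "I/O error, dev sdb, sector 835993735" line and the "31 d4 40 87" bytes in the sense data, so every retry in the excerpt was a read of the same location.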
October 25, 2009
What mobo do you have... and does it have an Intel chipset? My money is on AHCI issues... changing the drives to IDE will eliminate that.
October 25, 2009 Author
"What make and size of UPS do you have? I suggest changing your disks from AHCI back to IDE."
I have an APC 550V UPS, which seems to handle things fine. If I recall correctly, I changed IDE to AHCI because my server was picking up one of my drives as IDE and not SATA (the port may have been for PATA). Should I worry about this change now or later?
October 25, 2009
"Should I worry about this change now or later?"
I don't think it has anything to do with this (but I've been proven wrong in the past). It looks more like a cabling issue than anything else so far.
October 25, 2009 Author
"What mobo do you have... and does it have an Intel chipset?"
I bought one of the recommended ones in the forums: Gigabyte MA74GM-S2 with AMD
October 25, 2009 Author
"This SHOULD NOT be used if the disk has been written to since the failure since it will assume the physical disk is correct and the virtual disk is not."
I didn't realize it was down until I posted here, so I suspect I have done some minor writing to it. I have been adding .nfo files to some of the folders; however, as they are added through shares/virtual drives, I can't say exactly which drives I have written to (but it can be assumed I hit the drive in question at least once).
October 25, 2009
That's got a good chipset (AMD SB700) and the smartctl output is fine, so the disk itself does not seem to be the problem. See what happens with reseated cables.
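One way to make "the smartctl output is fine" concrete is to check a few attribute rows mechanically. The Python sketch below is a rule of thumb, not an official smartmontools check: the watched IDs (5, 197, 198, 199) and the function names are assumptions chosen for this example, and the rows are copied from the report posted earlier in the thread.

```python
# Attribute IDs whose raw value is commonly expected to be zero (an assumption).
WATCHED_IDS = {5, 197, 198, 199}


def parse_attribute(row):
    """Split one smartctl attribute row into the fields used below."""
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),
    }


def worrying(rows):
    """Return names of attributes that are at/below threshold or have a
    non-zero raw count among the watched IDs."""
    problems = []
    for row in rows:
        attr = parse_attribute(row)
        below = attr["thresh"] > 0 and attr["value"] <= attr["thresh"]
        nonzero = attr["id"] in WATCHED_IDS and attr["raw"] != 0
        if below or nonzero:
            problems.append(attr["name"])
    return problems


rows = [
    "5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0",
    "193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 3403",
    "197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0",
    "198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0",
    "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0",
]
print(worrying(rows))  # []
```

A clean result here only says the drive has not recorded these particular problems; it does not rule out a cable, power, or controller fault.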
October 25, 200916 yr Author I've shutdown, re-seated cables ( even replaced the SATA cable to the drive in question ). I've started it back up and it still shows as DISK_DSBL I've re-ran those two reports below. What next? HDParm Info for /dev/sdb WDC_WD15EADS-00P8B0_WD-WMAVU0072618 /dev/sdb: ATA device, with non-removable media Model Number: WDC WD15EADS-00P8B0 Serial Number: WD-WMAVU0072618 Firmware Revision: 01.00A01 Transport: Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5 Standards: Supported: 8 7 6 5 Likely used: 8 Configuration: Logical max current cylinders 16383 16383 heads 16 16 sectors/track 63 63 -- CHS current addressable sectors: 16514064 LBA user addressable sectors: 268435455 LBA48 user addressable sectors: 2930277168 device size with M = 1024*1024: 1430799 MBytes device size with M = 1000*1000: 1500301 MBytes (1500 GB) Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, with device specific minimum R/W multiple sector transfer: Max = 16 Current = 0 Recommended acoustic management value: 128, current value: 254 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns Commands/features: Enabled Supported: * SMART feature set Security Mode feature set * Power Management feature set * Write cache * Look-ahead * Host Protected Area feature set * WRITE_BUFFER command * READ_BUFFER command * NOP cmd * DOWNLOAD_MICROCODE Power-Up In Standby feature set * SET_FEATURES required to spinup after power up SET_MAX security extension Automatic Acoustic Management feature set * 48-bit Address feature set * Device Configuration Overlay feature set * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT * SMART error logging * SMART self-test * General Purpose Logging feature set * 64-bit World wide name * WRITE_UNCORRECTABLE_EXT command * {READ,WRITE}_DMA_EXT_GPL commands * Segmented 
DOWNLOAD_MICROCODE * SATA-I signaling speed (1.5Gb/s) * SATA-II signaling speed (3.0Gb/s) * Native Command Queueing (NCQ) * Host-initiated interface power management * Phy event counters * unknown 76[12] DMA Setup Auto-Activate optimization * Software settings preservation * SMART Command Transport (SCT) feature set * SCT Long Sector Access (AC1) * SCT LBA Segment Access (AC2) * SCT Error Recovery Control (AC3) * SCT Features Control (AC4) * SCT Data Tables (AC5) unknown 206[12] (vendor specific) unknown 206[13] (vendor specific) Security: Master password revision code = 65534 supported not enabled not locked not frozen not expired: security count supported: enhanced erase 334min for SECURITY ERASE UNIT. 334min for ENHANCED SECURITY ERASE UNIT. Logical Unit WWN Device Identifier: 50014ee017dfab1 NAA : 5 IEEE OUI : 14ee Unique ID : 017dfab1 Checksum: correct Statistics for /dev/sdb WDC_WD15EADS-00P8B0_WD-WMAVU0072618 smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF INFORMATION SECTION === Device Model: WDC WD15EADS-00P8B0 Serial Number: WD-WMAVU0072618 Firmware Version: 01.00A01 User Capacity: 1,500,301,910,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sun Oct 25 18:34:57 2009 GMT+5 SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (32760) seconds. 
Offline data collection capabilities: (0x7b)
  SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities: (0x0003)
  Saves SMART data before entering power-saving mode.
  Supports SMART auto save timer.
Error logging capability: (0x01)
  Error logging supported.
  General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f)
  SCT Status supported.
  SCT Feature Control supported.
  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   200   179   021    Pre-fail  Always       -       4975
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       120
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       670
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       44
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3414
194 Temperature_Celsius     0x0022   123   115   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
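For anyone reading a report like the one above, here is a minimal Python sketch, assuming the report has been saved as text with one attribute per line, that pulls out the attribute rows and flags the ones usually worth a second look. The `WATCHED` list and the function names are my own choices for illustration; they are not part of smartctl or unMENU.

```python
import re

# Attributes whose non-zero raw count often hints at media or cabling
# trouble. This short list is an assumption for the sketch, not an
# official one.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value (leading digits only).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(report):
    """Return {name: (value, worst, thresh, raw)} for each attribute row."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = (
                int(m.group(4)),
                int(m.group(5)),
                int(m.group(6)),
                int(m.group(10)),
            )
    return attrs


def find_warnings(attrs):
    """List attributes at/below threshold or with a non-zero watched raw count."""
    found = []
    for name, (value, worst, thresh, raw) in attrs.items():
        if thresh > 0 and value <= thresh:
            found.append(f"{name}: value {value} is at or below threshold {thresh}")
        elif name in WATCHED and raw > 0:
            found.append(f"{name}: raw count is {raw}")
    return found


# A few rows copied from the report in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for line in find_warnings(attrs) or ["nothing flagged"]:
        print(line)
```

A clean result only says the drive's own counters look fine. It does not rule out a loose cable or interface card, which is why the syslog still matters.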
October 25, 2009

> I've started it back up and it still shows as DISK_DSBL. I've re-run those two reports below. What next?

Once a disk has failed, unRAID will not reuse it unless it "thinks" you have replaced it. To do that, you must make it "forget" the serial number of the old assigned drive. You now only need to:

1. Stop the array
2. Un-assign the disk that has failed
3. Start the array with the disk un-assigned
4. Stop the array again
5. Re-assign the disk that had failed
6. Start the array (using the "Start" button) to allow the failed disk to be rebuilt onto the "replacement". Since the slot was previously "un-assigned", it will now use the same drive as its own "replacement".

When the rebuild is complete, you will have parity protection once more. The "rebuild" will take about the same time as a full parity calc. Until it is done, you are at risk of losing data if another drive were to "really" fail at the same time.

Joe L.

Edit: fixed my cut-and-replace...
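As background on why the same drive can be rebuilt onto itself: the contents of the disabled slot are recomputed from parity plus every other data disk. Below is a toy sketch in Python, assuming simple single-drive XOR parity and tiny byte strings standing in for whole disks. The function names are made up for illustration, and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_disks):
    """Parity is the XOR of all data disks."""
    return xor_blocks(data_disks)


def rebuild_missing(surviving_disks, parity):
    """Recover the one missing disk from the survivors plus parity."""
    return xor_blocks(list(surviving_disks) + [parity])


if __name__ == "__main__":
    # Three pretend data disks, three bytes each (hypothetical contents).
    disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\x0f"]
    parity = compute_parity(disks)

    # Pretend disk 1 is disabled and rebuild it from the rest.
    rebuilt = rebuild_missing([disks[0], disks[2]], parity)
    print(rebuilt == disks[1])
```

The same arithmetic shows why a second failure during the rebuild is fatal: with two disks missing, there is only one parity equation and two unknowns.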
October 25, 2009 Author

> You now only need to:
> 1. Stop the array
> 2. Un-assign the disk that has failed
> 3. Start the array with the disk un-assigned
> 4. Stop the array again
> 5. Re-assign the disk that had failed

--- Done to here ---

> 6. Start the array (using the "Start" button) to allow the failed disk to be rebuilt onto the "replacement". Since the slot was previously "un-assigned", it will now use the same drive as its own "replacement".
>
> When the rebuild is complete, you will have parity protection once more. The "rebuild" will take about the same time as a full parity calc. Until it is done, you are at risk of losing data if another drive were to "really" fail at the same time.
>
> Joe L.
>
> Edit: fixed my cut-and-replace...

Start is "grey" right now, unless I select: "Start will bring the array on-line, start Data-Rebuild, and then expand the file system (if possible). I'm sure I want to do this".

I'm sure.... right? (You are playing the role of me here.)
October 25, 2009 Author

I didn't transfer any new DVDs over to the array today. I just spent all day doing maintenance to get my Popcorn Hour reading the folders properly to generate its index.html file. If it's only new stuff done today that is lost, then I can recover from it.
October 26, 2009

> Start is "grey" right now, unless I select: "Start will bring the array on-line, start Data-Rebuild, and then expand the file system (if possible). I'm sure I want to do this".
>
> I'm sure.... right? (You are playing the role of me here.)

Yes, I think you are "sure" you want to do the "Data-Rebuild". (You might need to check the checkbox under "Start" to enable it.) You must use "Start" to rebuild the disk.

Joe L.
October 26, 2009 Author

Done. While you are being me for the night, do you think I should break up with my girlfriend or let things go a little longer to see if a future does exist for us?
October 26, 2009

> Done. While you are being me for the night, do you think I should break up with my girlfriend or let things go a little longer to see if a future does exist for us?

If she looks like the girl in your avatar pic, then I think you should break up with her and send her my way.