spinbot Posted June 29, 2010
I seem to have run into a problem this morning. I was trying to copy over a few DVDs that I ripped, but I got some errors in the process and, as a result, the parity drive appears to have been disabled. I stopped the transfer that was in progress (the only thing moved over was a jpg) and removed that one folder and one image from the server. I then stopped the array and wanted to run a parity check, but that isn't one of my options. I can only select from:
- Start will bring the array on-line (array will be unprotected).
- Restore will initialize the stored array configuration; all drives will appear as New, but data disk contents are not affected.
Then the usual: Spin Up, Spin Down, Reboot, Shutdown.
I have 5 data drives and 1 parity drive, all of which are around 96% full (that leaves me about 250GB of space remaining). I have one of the 8-port SATA controllers to install, and I have 2 more 1.5TB drives in the mail, along with SATA cables and a breakout cable. I'd like to be cautious right now, as I don't want to risk losing my data. If some of the experts (i.e. those who know 1% more than I) could direct me on what my next step is, I would appreciate it. Thanks!
Here are my most recent syslog entries.
Jun 29 05:52:34 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory
Jun 29 05:53:35 Tower last message repeated 2 times
Jun 29 05:53:36 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15688/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15696/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15704/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15712/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15720/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15728/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15736/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15744/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15752/0, count: 1
Jun 29 05:54:34 Tower kernel: md: disk0 read error
Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15760/0, count: 1
Jun 29 05:54:36 Tower kernel: md: disk0 read error
Jun 29 05:54:36 Tower kernel: handle_stripe read error: 65680/0, count: 1
Jun 29 05:54:36 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15688/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: md: recovery thread woken up ...
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15696/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15704/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15712/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15720/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15728/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15736/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15744/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15752/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 15760/0, count: 1
Jun 29 05:54:43 Tower kernel: md: disk0 write error
Jun 29 05:54:43 Tower kernel: handle_stripe write error: 65680/0, count: 1
Jun 29 05:54:43 Tower kernel: md: recovery thread has nothing to resync
Jun 29 05:54:45 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory
Jun 29 05:55:45 Tower unmenu[1715]: cat: /sys/block/sde/stat: No such file or directory
Jun 29 05:56:47 Tower last message repeated 2 times
Jun 29 05:57:42 Tower last message repeated 2 times
Jun 29 05:57:42 Tower emhttp: disk_temperature: open: No such file or directory
Jun 29 05:57:53 Tower emhttp: Spinning down all drives...
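For readers who want to summarise a longer syslog before posting it, here is a minimal Python sketch that tallies the md driver's read and write errors per disk slot. The regular expression is an assumption based only on the line format quoted in this thread, and `tally_md_errors` is a hypothetical helper name, not part of unRAID or unMENU. In this excerpt every error is against disk0, which is consistent with the parity drive being the one that dropped out.

```python
import re
from collections import Counter

# Matches lines such as:
#   "Jun 29 05:54:34 Tower kernel: md: disk0 read error"
# The format is assumed from the syslog excerpt quoted in this thread.
MD_ERROR = re.compile(r"kernel: md: disk(\d+) (read|write) error")

def tally_md_errors(lines):
    """Count (disk number, operation) pairs across syslog lines."""
    counts = Counter()
    for line in lines:
        match = MD_ERROR.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts

sample = [
    "Jun 29 05:54:34 Tower kernel: md: disk0 read error",
    "Jun 29 05:54:34 Tower kernel: handle_stripe read error: 15688/0, count: 1",
    "Jun 29 05:54:43 Tower kernel: md: disk0 write error",
    "Jun 29 05:54:43 Tower kernel: md: recovery thread woken up ...",
]
for (disk, op), n in sorted(tally_md_errors(sample).items()):
    print(f"disk{disk}: {n} {op} error(s)")
```

To run it against a saved log, pass the open file object instead of `sample`; the path to the log is whatever your system uses.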
Joe L. Posted June 29, 2010
Basically, step 1 is to post a full syslog as an attachment.
Other than that, your parity drive was taken out of service when a "write" to it failed. Your data disks are probably perfectly fine, but obviously they are not protected from a failure by the parity drive. Therefore, be very careful when working on the array not to damage them or dislodge their cables.
The write to the parity drive could have failed because:
1. The drive failed.
2. The power cable to the drive is loose/intermittent, or a connector on a splitter feeding the power cable is loose/intermittent.
3. If plugged into a backplane, the power or SATA connector to the drive could be loose/intermittent.
4. The SATA cable to it could be loose/intermittent/defective.
When checking power connections to the drives, be sure to stop the array, power down, AND unplug the server from the wall/turn off the power supply switch.
Before doing anything, you can attempt to get a smartctl report or hdparm report from the drive. If it is dead, it might not respond at all.
If the drive has died, your only choice is to replace it. The replacement must be as large as or larger than any of your data drives. (Normally, it also has to be as large as or larger than the parity drive, but since the failed drive is the parity drive, it just has to be as large as or larger than the data drives.) Now is not the time to look for a perfect sale on a drive; now is when you spend a few dollars more so you can replace the drive quickly.
Then power down, replace the parity drive, trying hard not to dislodge the cables to the other drives, and power up. The array should present you with the option to start the array and build a new set of parity data. That will write the entire parity drive. Once it is done, you'll want to immediately run a parity check. That will then read the entire parity drive. Until you do this, you'll not know if the parity data you wrote to the drive is readable.
Normally I'd also warn against pressing the button labeled "Restore", as it is actually an "Initialize Configuration and Immediately Invalidate Parity" button. In your case, parity is already invalid.
If you do find a loose cable to the parity drive and you can get a good smartctl report from it after correcting the cabling, you can use the procedure described in the wiki to put it back in service, but you will need to run a full parity check to fix what it could not originally write (remember, it was taken out of service because a "write" to it failed).
Joe L.
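The sizing rule for a replacement parity drive in the reply above can be written down as a one-line check. This is only an illustrative Python sketch: `can_serve_as_parity` is a hypothetical helper, and the data disk sizes are made up, with only the 1.5 TB figure matching the drive model discussed in this thread.

```python
def can_serve_as_parity(candidate_bytes, data_disk_bytes):
    """Return True if the candidate is at least as large as every data disk."""
    return all(candidate_bytes >= size for size in data_disk_bytes)

# Hypothetical array: two 1.5 TB data disks and one 1 TB data disk.
data_disks = [1_500_301_910_016, 1_500_301_910_016, 1_000_204_886_016]

print(can_serve_as_parity(1_500_301_910_016, data_disks))  # True
print(can_serve_as_parity(1_000_204_886_016, data_disks))  # False
```

Comparing exact byte counts rather than the marketing size matters here, since two nominally identical drives can report slightly different capacities.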
spinbot Posted June 29, 2010
Thanks Joe. I will go through this this evening, as I had to leave for work 15 minutes ago. I will report back later with the info you requested.
spinbot Posted July 1, 2010
Here's some of the info I can add (or tried to get). Using UnMenu I tried to run Smart Status Report and got:

Statistics for /dev/sde ST31500341AS_9VS2L1L4
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Smartctl open device: /dev/sde failed: No such file or directory

I then ran HDParm Info:

HDParm Info for /dev/sde ST31500341AS_9VS2L1L4
(nothing appeared)

So... I opened up the case and what do I see: one of the SATA cables had the two plastic parts that snap around the cable end partially come apart, and the cable wasn't plugged in all the way (it wasn't parallel with the drive). I was then able to run smartctl and hdparm. Here is the info:

Statistics for /dev/sde ST31500341AS_9VS2L1L4
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: ST31500341AS
Serial Number: 9VS2L1L4
Firmware Version: CC1H
User Capacity: 1,500,301,910,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jun 30 19:20:00 2010 GMT+5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 609) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 124539259
3 Spin_Up_Time 0x0003 093 092 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 234
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 067 060 030 Pre-fail Always - 5872207
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 6024
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 9
184 Unknown_Attribute 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 106
190 Airflow_Temperature_Cel 0x0022 067 059 045 Old_age Always - 33 (Lifetime Min/Max 31/33)
194 Temperature_Celsius 0x0022 033 041 000 Old_age Always - 33 (0 19 0 0)
195 Hardware_ECC_Recovered 0x001a 051 032 000 Old_age Always - 124539259
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 115440130982907
241 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 385727526
242 Unknown_Attribute 0x0000 100 253 000 Old_age Offline - 1204353787

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

HDParm Info for /dev/sde ST31500341AS_9VS2L1L4

/dev/sde:
ATA device, with non-removable media
  Model Number: ST31500341AS
  Serial Number: 9VS2L1L4
  Firmware Revision: CC1H
  Transport: Serial
Standards:
  Used: unknown (minor revision code 0x0029)
  Supported: 8 7 6 5
  Likely used: 8
Configuration:
  Logical         max     current
  cylinders       16383   16383
  heads           16      16
  sectors/track   63      63
  --
  CHS current addressable sectors: 16514064
  LBA user addressable sectors: 268435455
  LBA48 user addressable sectors: 2930277168
  device size with M = 1024*1024: 1430799 MBytes
  device size with M = 1000*1000: 1500301 MBytes (1500 GB)
Capabilities:
  LBA, IORDY(can be disabled)
  Queue depth: 32
  Standby timer values: spec'd by Standard, no device specific minimum
  R/W multiple sector transfer: Max = 16 Current = ?
  Recommended acoustic management value: 254, current value: 0
  DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
    Cycle time: min=120ns recommended=120ns
  PIO: pio0 pio1 pio2 pio3 pio4
    Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
  Enabled Supported:
  * SMART feature set
    Security Mode feature set
  * Power Management feature set
  * Write cache
  * Look-ahead
  * Host Protected Area feature set
  * WRITE_BUFFER command
  * READ_BUFFER command
  * DOWNLOAD_MICROCODE
    SET_MAX security extension
  * Automatic Acoustic Management feature set
  * 48-bit Address feature set
  * Device Configuration Overlay feature set
  * Mandatory FLUSH_CACHE
  * FLUSH_CACHE_EXT
  * SMART error logging
  * SMART self-test
  * General Purpose Logging feature set
  * WRITE_{DMA|MULTIPLE}_FUA_EXT
  * 64-bit World wide name
    Write-Read-Verify feature set
  * WRITE_UNCORRECTABLE_EXT command
  * {READ,WRITE}_DMA_EXT_GPL commands
  * Segmented DOWNLOAD_MICROCODE
  * SATA-I signaling speed (1.5Gb/s)
  * SATA-II signaling speed (3.0Gb/s)
  * Native Command Queueing (NCQ)
  * Phy event counters
    Device-initiated interface power management
  * Software settings preservation
  * SMART Command Transport (SCT) feature set
  * SCT Long Sector Access (AC1)
  * SCT LBA Segment Access (AC2)
  * SCT Error Recovery Control (AC3)
  * SCT Features Control (AC4)
  * SCT Data Tables (AC5)
    unknown 206[12] (vendor specific)
Security:
  Master password revision code = 65534
      supported
  not enabled
  not locked
  not frozen
  not expired: security count
      supported: enhanced erase
  256min for SECURITY ERASE UNIT. 256min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5000c50015af8095
  NAA : 5
  IEEE OUI : c50
  Unique ID : 015af8095
Checksum: correct

So... now I am following the steps to Restore my configuration, as per the link you provided. All drives are "Green" and the Parity Check is in progress.
After that finishes, I then have to work on removing the HPA feature, as two of my drives are slightly undersized because of it, and I would like to fix it before I add two new drives of equal size to my parity, or before it writes to my parity drive and causes more grief.
As always, thank you Joe for the troubleshooting steps and the links to the wiki pages to fix things.
------------------------------------------------------------------------------------------------
Update: I completed the Parity Check and this is what I got:
(Last checked on 7/1/2010 1:17:31 AM, finding 1068401 errors.)
Could this be from when I initially got the write errors? I removed the items from the server that were partially written (I didn't know which files moved over cleanly, so I was just going to re-copy them all). I am 99% sure the problems were all from the SATA cable not being connected properly (i.e. I trust the parity drive). I'm just trying to figure out now how to get back to where I was before this happened. What would be the recommended next step? Thanks
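As a rough aid for reading a report like the one above, the following Python sketch pulls the raw values out of a smartctl attribute table and lists any non-zero ones from a short watch list. The watch list is an assumption (a commonly used shortlist, not something stated in this thread), `parse_raw_values` is a hypothetical helper, and the sample rows are copied from the report posted here. The last two lines cross-check the hdparm sector count against the smartctl capacity, assuming 512-byte logical sectors.

```python
# Attributes often watched for media or cabling trouble (an assumed shortlist).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

def parse_raw_values(report):
    """Map attribute name -> raw value for rows of a smartctl attribute table."""
    raw = {}
    for line in report.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and carry at least ten columns;
        # the tenth column is the leading number of RAW_VALUE.
        if len(fields) >= 10 and fields[0].isdigit():
            raw[fields[1]] = int(fields[9])
    return raw

report = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""
raw = parse_raw_values(report)
nonzero = {name: raw[name] for name in WATCHED if raw.get(name, 0) != 0}
print(nonzero or "no watched attribute has a non-zero raw value")

# Capacity cross-check, assuming 512-byte logical sectors.
lba48_sectors = 2930277168
print(lba48_sectors * 512)  # 1500301910016, the "User Capacity" smartctl reported
```

A clean result on these attributes does not prove the drive is healthy; it only means this particular report shows nothing alarming in them.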
Joe L. Posted July 1, 2010
The files you were writing to the drive were not written to it once the cable had loosened. They were written to the "simulated" drive, the drive simulated by parity in combination with all the other data drives.
If you followed the "trust" procedure, you basically asked the server to use the actual contents of the failed disk and to update parity based on it. That would erase/undo anything you had done to the disk since the cable loosened.
If you had wanted to keep what you had written to the drive while the cable was disconnected, you could have just let the array re-construct the failed drive. This is described in the wiki at the link I had provided:
A situation where this procedure may NOT apply:
* You wish to keep the data that was in the process of being written when the "write" to the disk failed.
If a disk is off-line, it is because a "write" to it failed. If the drive went off-line when you were saving the only copy of important files/music/pictures/movies, etc., then the failed drive does not contain any of the data that was written to it. Any files written to the failed drive were recorded in parity, but not on the physical data drive. It is entirely possible to load the entire failed disk with files but only update parity and not the physical data disk, because it is disabled.
If you use this procedure, it will force the disabled disk back online, and when the resulting parity check proceeds, parity will be updated to reflect the data on the physical disk. In other words, using this procedure will bring the disk back without the new files written to it when it went off-line or any written to it since that time. You will have effectively rolled back the clock, as if they were never written to the array. A subsequent parity check will report many errors, as parity is brought in sync with the physical disk contents. Remember, the disk was taken off-line because a write to it failed.
If you want to preserve the data that was written to the failed drive, you must un-assign the drive, start the array, stop the array, re-assign the drive, and let unRAID re-construct the contents as created by parity and all the other data drives. They can reconstruct a full copy of what was written to the array based on parity and the other data drives. You will not have parity protection during this period of time, but you will have the latest data written.
Joe L.
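The difference between the two recovery paths described above can be seen in a toy model. This Python sketch assumes single parity computed as a bytewise XOR across the data disks, which is the usual single-parity scheme; the two-byte "disks" and the helper name `xor_blocks` are purely illustrative and not unRAID code. It models the wiki's case of a data disk going off-line: a write to the disabled disk lands only in parity, reconstruction recovers it, and trusting the physical disk discards it and leaves parity mismatches for the next check to correct.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy array: three data disks of two bytes each, plus their parity.
disks = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(disks)

# Disk 1 drops off-line. A write of new_data to it can only be recorded in parity.
new_data = b"\xff\xee"
parity = xor_blocks([new_data, disks[0], disks[2]])

# Option A: reconstruct the disk from parity and the other disks (keeps the write).
rebuilt = xor_blocks([parity, disks[0], disks[2]])
print(rebuilt == new_data)   # True

# Option B: trust the physical disk and recompute parity (drops the write).
trusted_parity = xor_blocks(disks)
mismatch = sum(a != b for a, b in zip(parity, trusted_parity))
print(mismatch)              # 2 parity bytes differ and would be "corrected"
```

In this thread the drive that dropped out was the parity drive itself, so the analogous effect is that writes made while it was off-line reached the data disks but not parity, which a correcting parity check then brings back in sync.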
spinbot Posted July 1, 2010
Sorry..... comprehension was limited, as it was after 3am when I posted (I tried to stay up until the rebuild finished). Considering that the problem all started with some write errors, errors on the rebuild would be expected. If I were to run the parity check now, assuming all the drives are in working order, it should pass ... correct?
Joe L. Posted July 1, 2010
A second parity check should be error free.
spinbot Posted July 1, 2010
As you suggested:
Parity-Check. (Last checked on 7/1/2010 1:51:30 PM, finding 0 errors.)
Glad that's over!
Lessons learned:
1. Get SATA cables that lock
2. Always have 1 spare SATA cable
3. Trust Joe
4. Thank Joe
5. Remember to read the wiki (all of the related wiki page, not just half)
I have two more jobs to work on this week......beware! (Removing the HPA from my motherboard and getting two of my drives to hopefully show equal size to the parity drive, AND installing one of the 8-port SATA expansion cards and 2 new 1.5TB drives on it.) Fun Fun Fun