September 20, 2013 Looks like my parity disk is disabled. Not sure what is going on. Under Health, the parity disk shows it failing a SMART command. Where do I start? This is totally new to me. Thanks.
September 20, 2013 You may have an improperly seated drive or cable, or perhaps a bad SATA cable. Try reseating the cables (ideally replacing the SATA cable), then: Start the array; Stop it; unassign the parity drive; Start the array; Stop the array; re-assign the same drive to parity; and Start the array. That should force parity to be rebuilt on the same drive. If that doesn't resolve it, you need to replace the drive.
September 20, 2013 Author Interesting. I replaced the cable and the system told me "new parity disk found" (my old one), and now a parity check is in progress. Thanks for the help. Hopefully that fixes things and it's not a failing disk. The syslog showed nothing related to disk errors. Assuming it was just a bad cable (the system has been up for 62 days with this parity drive), it's interesting how a cable can just go bad. Thanks again.
September 20, 2013 Have you moved your system at all? Was the original cable a locking cable? [Hopefully the new one is] Your cable may not have been bad, just not seated completely. Reseating it may have been all that was necessary, but if I suspect any cable issues I always just replace the cable with a nice new locking cable.
September 20, 2013 Author System didn't move, but it's a floor-sitting server, and the kids could have easily bumped it with the vacuum cleaner etc. The old cable was locking, and the new cable is too. I've got a drive cage (iStarUSA) but I believe the drive was seated fine in it. Frankly, when I removed the old cable, both ends were locked up tight. But I had an extra NIB cable lying around, so what the heck.
September 21, 2013 What's the SMART output? Do you have a SMART history report available? If not, are you able to telnet into your unRAID server? If you can, run this command and post the output: smartctl -A /dev/hdc This will give more information on which SMART bits failed. However, if SMART has failed, your drive is either toast or will shortly be toast, aka don't trust it. At all. While there is a small chance the data cable or the power cable has failed, there is a greater chance the drive has failed. While this is bad, this is why we have parity-calculating arrays: they support a single drive failure. Right now, if I were in your shoes, I'd buy a new drive and pray nothing else fails while you wait for shipping and a parity rebuild.
September 21, 2013 It's quite possible the SMART output will lead us to a bad power/data cable, rendering my fear mongering invalid ... However, it's always a good idea to keep a spare drive (as large as, if not larger than, your parity drive) for just such issues.
September 21, 2013 Author I have no idea if this will mean anything, but here's what I get from the Health -> Disk Attributes tab:
Attached to port: sdc
ID#  ATTRIBUTE NAME           FLAG    VALUE  WORST  THRESH  TYPE      UPDATED  FAILED  RAW VALUE
  1  Raw Read Error Rate      0x000f  117    099    006     Pre-fail  Always   Never   155874392
  3  Spin Up Time             0x0003  092    091    000     Pre-fail  Always   Never   0
  4  Start Stop Count         0x0032  100    100    020     Old age   Always   Never   203
  5  Reallocated Sector Ct    0x0033  100    100    010     Pre-fail  Always   Never   0
  7  Seek Error Rate          0x000f  062    060    030     Pre-fail  Always   Never   1743018
  9  Power On Hours           0x0032  099    099    000     Old age   Always   Never   1575
 10  Spin Retry Count         0x0013  100    100    097     Pre-fail  Always   Never   0
 12  Power Cycle Count        0x0032  100    100    020     Old age   Always   Never   10
183  Runtime Bad Block        0x0032  100    100    000     Old age   Always   Never   0
184  End-to-End Error         0x0032  100    100    099     Old age   Always   Never   0
187  Reported Uncorrect       0x0032  100    100    000     Old age   Always   Never   0
188  Command Timeout          0x0032  100    100    000     Old age   Always   Never   0
189  High Fly Writes          0x003a  098    098    000     Old age   Always   Never   2
190  Airflow Temperature Cel  0x0022  067    057    045     Old age   Always   Never   33 (Min/Max 33/41)
191  G-Sense Error Rate       0x0032  100    100    000     Old age   Always   Never   0
192  Power-Off Retract Count  0x0032  100    100    000     Old age   Always   Never   6
193  Load Cycle Count         0x0032  100    100    000     Old age   Always   Never   834
194  Temperature Celsius      0x0022  033    043    000     Old age   Always   Never   33 (0 25 0 0)
197  Current Pending Sector   0x0012  100    100    000     Old age   Always   Never   0
198  Offline Uncorrectable    0x0010  100    100    000     Old age   Offline  Never   0
199  UDMA CRC Error Count     0x003e  200    200    000     Old age   Always   Never   0
240  Head Flying Hours        0x0000  100    253    000     Old age   Offline  Never   104728482546885
241  Total LBAs Written       0x0000  100    253    000     Old age   Offline  Never   21661333656
242  Total LBAs Read          0x0000  100    253    000     Old age   Offline  Never   53856634739
The command you listed to run resulted in "no such device" being returned. Thanks for the followup replies.
I may end up with a new disk anyway just to have a spare on hand; this morning's sheer panic tells me I need a spare handy. Hang on, I'll run the short test now. ETA: the short test has been at 90% for about 15 minutes... is that normal?
September 21, 2013 From what I see (and really, take internet advice with a grain of salt) ... The Current Pending Sector count is 0. The Reallocated Sector count is 0. So it looks like your drive decided that any sectors it had previously thought were iffy weren't, and said 'fuck it, good enough for me'. The one bit I don't know about is: 7 Seek Error Rate 0x000f 062 060 030 Pre-fail Always Never 1743018 Otherwise, from what you've posted, your HD has been powered up for 65 days (1575 hours), has had zero reallocated sectors, and has a pending sector count of 0. Anyone else care to chime in? If I'm reading these correctly, I'd run it as is, with a backup drive available in case of failure.
September 21, 2013 I would say that the drive is fine. My understanding is that "Raw Read Error Rate" has to do with the drive's own internal error correction. This apparently happens all the time and is normal given the high data densities of modern hard drives. Drives from different manufacturers also report this value differently. I have read that Seagates tend to report high values for this but other drives do not. The "Seek Error Rate" means that the drive is over- or under-shooting the correct track when it moves the heads, and it has to do another (small) re-seek to acquire the track before it can read or write the data. A problem with the drive in this area is more of a performance concern than a data integrity concern. The more important thing to realize with either of these attributes is that the "RAW VALUE" they report is actually a rate and not an absolute count of actual errors. The best way to gauge what's going on is to look at the other columns for those attributes. The "VALUE" (not RAW VALUE) column can be considered a score, where 100 (or above) would be considered really good. The "WORST" column states the worst score recorded for the drive. The "THRESH" column states the score at which the drive would be considered to be failing for that attribute. So for this particular drive I would say the "Raw Read Error Rate" has a fantastic score (117 against a threshold of 6), while the "Seek Error Rate" is just alright and still in the OK range (62 against a threshold of 30). Ultimately, as far as data integrity goes, you want to pay more attention to the "Reallocated Sector Ct" and "Current Pending Sector" values, as they indicate failures to read the data from the disk itself. Those are solid indicators of the health and reliability of the drive.
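The VALUE / THRESH comparison described above can be sketched in a few lines of Python. This is only a rough sketch, not part of unRAID or smartmontools: the names `SmartAttr`, `parse_attr_line`, `margin`, and `is_failing` are made up for illustration, and it assumes the ten-column attribute layout that `smartctl -A` printed in this thread. It assumes an attribute counts as failing once VALUE has dropped to THRESH or below, and that a THRESH of 0 means the attribute has no failure threshold.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    """One row of the attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    value: int   # normalized score, higher is better
    worst: int   # lowest normalized score recorded so far
    thresh: int  # score at or below which the attribute is failing
    raw: str     # vendor-specific raw value, kept as text


def parse_attr_line(line: str) -> SmartAttr:
    # Assumed columns:
    # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    # Split at most 9 times so a raw value containing spaces stays whole.
    parts = line.split(None, 9)
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        raw=parts[9],
    )


def margin(attr: SmartAttr) -> int:
    """How far the current score sits above the failure threshold."""
    return attr.value - attr.thresh


def is_failing(attr: SmartAttr) -> bool:
    # A threshold of 0 is treated here as "this attribute cannot fail".
    return attr.thresh > 0 and attr.value <= attr.thresh


# Rows copied from the smartctl output posted in this thread.
SAMPLE = [
    "  1 Raw_Read_Error_Rate 0x000f 117 099 006 Pre-fail Always - 155874392",
    "  5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0",
    "  7 Seek_Error_Rate 0x000f 062 060 030 Pre-fail Always - 1813923",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0",
]

for row in SAMPLE:
    a = parse_attr_line(row)
    status = "FAILING" if is_failing(a) else "ok"
    print(f"{a.name}: value={a.value} thresh={a.thresh} margin={margin(a)} {status}")
```

For this drive, none of the sample rows comes anywhere near its threshold, which matches the reading given in the post above. The raw column is deliberately left as text, since its meaning differs by vendor.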
September 21, 2013 Author Thanks to all for the replies. I finally got the command to work:
root@ffs1:~# /usr/sbin/smartctl -A /dev/sdc
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       155874392
  3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       203
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   062   060   030    Pre-fail  Always       -       1813923
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1579
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       10
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age   Always       -       32 (Min/Max 32/41)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       834
194 Temperature_Celsius     0x0022   032   043   000    Old_age   Always       -       32 (0 25 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       198539158226121
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       23794604480
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       53856634739
Interesting to see that the "head flying hours" show 22.66 BILLION years. Ha! The self test is still running, but the parity check is almost done, so maybe that's why it's going so slow. The "short test" has now been running for hours and hours. If it's not finished in the morning I'm not sure what I'm going to do. Probably reboot the server and cross my fingers. I don't have twins, and my kids are older. The server sits next to the entertainment center, across from the dog's bed. So any number of things could have bumped the server, although it is unlikely, as the kids know better and the dog is lazy and sleeps a lot. Still not exactly sure what happened initially, but assuming parity completes and all looks good, I'll likely watch it for a few days and see what happens. I will probably still order another drive to use as a spare, though.
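The "22.66 billion years" figure comes from reading the Head_Flying_Hours raw value as a plain hour count. A possible explanation, which is an assumption on my part and not something confirmed anywhere in this thread, is that the drive packs more than one field into that raw number and only the low 32 bits are hours. The sketch below does both calculations; `naive_years` and `low_32_bits` are hypothetical helper names, and 8760 hours per year is used for the naive figure.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def naive_years(raw: int) -> float:
    """Treat the whole raw value as hours and convert to years."""
    return raw / HOURS_PER_YEAR


def low_32_bits(raw: int) -> int:
    """Assumption: only the low 32 bits of the raw value are the hour count."""
    return raw & 0xFFFFFFFF


# Head_Flying_Hours raw values from the two reports posted in this thread.
first_report = 104728482546885
second_report = 198539158226121

print(f"naive reading: {naive_years(second_report) / 1e9:.2f} billion years")
print(f"low 32 bits, first report:  {low_32_bits(first_report)} hours")
print(f"low 32 bits, second report: {low_32_bits(second_report)} hours")
```

Under that assumption the two reports decode to values a few hours apart and below the drive's Power_On_Hours, which at least looks plausible for a head flying time. Treat it as a guess about the encoding, not a documented fact.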
September 23, 2013 The SMART data all looks good. Different manufacturers report some parameters slightly differently than others. For example, I'd be concerned about a Seek Error Rate value of 62 on a SMART report for a WD drive, but values in the 50s and 60s aren't at all unusual for Seagate drives, so you're fine.
September 24, 2013 "Cool. So probably a bad/unseated cable after all?" Yes, that seems to be a good assumption.