December 9, 2009

I recently replaced two old 500GB disks with two new 1TB disks (disk10 and disk11). All went fine. This was about three weeks ago.

Came home one day and rtorrent (I think) had killed my system. I couldn't access it through telnet, the local keyboard and monitor, or anything else. I pressed the power button, and the safe powerdown script wouldn't even run as it usually does. I couldn't do anything to gain access to the server, so I held down the power button until the computer turned off.

When I rebooted, the array was down because, according to the error log, my super.dat file was corrupt. I simply did the "trust your parity" routine and the system came back online with no parity errors after the check. Everything has been working great.

Today I looked at bubbaQ's SMART parameter tracking database and received some errors and warnings.

disk10 sdg 9VP266CN: *ERROR* - Reallocated_Sector_Ct it is now 2449 (error threshold is 30) and ata_error_count=12
disk10 is showing 32 errors on the unRaid main page.
disk11 sdc 9VP2A6KP: *ERROR* - UDMA_CRC_Error_Count it is now 1160 (error threshold is 75)
disk9 sdm 9TE0NDJZ: WARNING - Reallocated_Sector_Ct it is now 11 (warning threshold is 10)

Parity is Valid. Last parity check 4 days ago with no sync errors.

Red entries from the syslog:

Dec 7 20:08:43 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x7ffffff SErr 0x0 action 0x0
Dec 7 20:08:43 Tower kernel: ata6.00: irq_stat 0x00020002, device error via SDB FIS
Dec 7 20:08:43 Tower kernel: ata6.00: cmd 61/08:00:7f:5b:0f/00:00:4c:00:00/40 tag 0 ncq 4096 out
Dec 7 20:08:43 Tower kernel: res 60/02:00:00:00:00/00:00:00:00:60/00 Emask 0x1 (device error)
Dec 7 20:08:43 Tower kernel: ata6.00: status: { DRDY DF }
Dec 7 20:08:43 Tower kernel: ata6.00: model number mismatch 'ST31000528AS' != ''
Dec 7 20:08:43 Tower kernel: ata6.00: revalidation failed (errno=-19)
Dec 7 20:08:43 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps
Dec 7 20:08:43 Tower kernel: ata6.00: limiting speed to UDMA/100:PIO3
Dec 7 20:08:43 Tower kernel: ata6: hard resetting link
Dec 7 20:08:45 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
Dec 8 06:23:09 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 8 06:23:09 Tower kernel: ata6.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Dec 8 06:23:09 Tower kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50799
Dec 8 12:04:28 Tower kernel: sd 6:0:0:0: [sdg] Result: hostbyte=0x00 driverbyte=0x08
Dec 8 12:04:28 Tower kernel: sd 6:0:0:0: [sdg] Sense Key : 0x4 [current] [descriptor]
Dec 8 12:04:28 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Dec 8 12:04:28 Tower kernel: 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 60 c0
Dec 8 12:04:28 Tower kernel: 00 00 00 00
Dec 8 12:04:28 Tower kernel: sd 6:0:0:0: [sdg] ASC=0x0 ASCQ=0x0
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50815
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50823
Dec 8 12:04:28 Tower kernel: ata6: EH complete
Dec 8 12:04:28 Tower kernel: md: disk10 read error
Dec 8 12:04:28 Tower kernel: handle_stripe read error: 50736/10, count: 1
Dec 8 16:23:21 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 8 16:23:21 Tower kernel: ata8.00: cmd e0/00:00:00:00:00/00:00:00:00:00/40 tag 0
Dec 8 16:23:21 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 8 16:23:21 Tower kernel: ata8.00: status: { DRDY }
Dec 8 16:23:21 Tower kernel: ata8: hard resetting link

I'm thinking of running another parity check to see how that pans out. I should probably verify the physical cable connections on the drives first. They are in Chenbro SATA backplanes, so maybe something came loose.

Does anyone else have a suggestion? I will save the syslog and can post it if necessary.

Thanks very much for the help, and especially for these great diagnostic tools, which are a big part of why unRAID is so great.
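For anyone wading through a large syslog like this one, a rough way to see which ports and devices are involved is to tally the exception and I/O error lines. The following is only a sketch in Python: the sample lines are copied from the excerpt above, while the function name `summarize` and the two regular expressions are illustrative assumptions tuned to this excerpt, not a general kernel log parser.

```python
import re
from collections import Counter

# A few lines copied verbatim from the syslog excerpt in the post above.
SAMPLE = """\
Dec 7 20:08:43 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x7ffffff SErr 0x0 action 0x0
Dec 8 06:23:09 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50799
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50815
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50823
Dec 8 16:23:21 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
"""

# Assumed patterns, written against the lines above only.
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception ")
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def summarize(log_text):
    """Count exceptions per ATA port and collect failing sectors per device."""
    exceptions = Counter()
    bad_sectors = {}
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            exceptions[match.group(1)] += 1
            continue
        match = IO_ERROR_RE.search(line)
        if match:
            bad_sectors.setdefault(match.group(1), []).append(int(match.group(2)))
    return exceptions, bad_sectors


if __name__ == "__main__":
    exceptions, bad_sectors = summarize(SAMPLE)
    for port, count in sorted(exceptions.items()):
        print(f"{port}: {count} exception(s)")
    for device, sectors in sorted(bad_sectors.items()):
        print(f"{device}: I/O errors at sectors {sectors}")
```

On a real system you would read the saved syslog file into `summarize` instead of the embedded sample.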
December 9, 2009

UDMA CRC errors can indicate cabling problems. Please post the syslog so we can see the sequences in full. Given the circumstances, I would recheck the SATA and power connections to disks 9, 10 and 11, then run a parity check.
December 9, 2009

Isn't "*ERROR* - Reallocated_Sector_Ct" increasing over time a sign of a failing disk?
December 9, 2009

Yes. Disk10 /dev/sdg is a prime candidate for replacement.
December 9, 2009

Yes, it is (disks 9 and 10). However, you mentioned that Disk 11 is seeing lots of UDMA CRC errors, which is typically cabling related. Disk 9 is reporting a rising reallocated sector count, the syslog excerpts show ata6 with exception handler issues and ata8 also having trouble, and Disk 10 is reporting read errors. If we could see SMART reports for all drives and the full syslog, we would be better able to advise.

Did you replace the two 500GB drives for a particular reason?
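The distinction drawn in these replies (reallocated sectors point at the disk, CRC errors point at the cabling) can be written down as a small lookup. This is a sketch in Python, not how bubbaQ's tracking tool actually works: the thresholds are the ones quoted in the opening post, the hints paraphrase the replies, and the names `RULES` and `triage` are made up for illustration. It also assumes a reading trips a threshold only when it is strictly above it.

```python
# Thresholds as quoted in the opening post; None means the thread does not state one.
# The hints paraphrase the replies and are not authoritative diagnoses.
RULES = {
    "Reallocated_Sector_Ct": {
        "warning": 10,
        "error": 30,
        "hint": "disk may be failing; plan a replacement",
    },
    "UDMA_CRC_Error_Count": {
        "warning": None,
        "error": 75,
        "hint": "often cabling; recheck SATA and power connections",
    },
}


def triage(attribute, raw_value):
    """Return (level, hint) for one SMART attribute reading."""
    rule = RULES.get(attribute)
    if rule is None:
        return "UNKNOWN", ""
    # Assumption: a reading strictly above a threshold trips it.
    if raw_value > rule["error"]:
        return "ERROR", rule["hint"]
    if rule["warning"] is not None and raw_value > rule["warning"]:
        return "WARNING", rule["hint"]
    return "OK", ""


# Readings reported in the opening post.
READINGS = [
    ("disk10", "Reallocated_Sector_Ct", 2449),
    ("disk11", "UDMA_CRC_Error_Count", 1160),
    ("disk9", "Reallocated_Sector_Ct", 11),
]

if __name__ == "__main__":
    for disk, attribute, value in READINGS:
        level, hint = triage(attribute, value)
        print(f"{disk}: {level} {attribute}={value} ({hint})")
```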
December 10, 2009 (Author)

Thanks very much for the responses and suggestions.

I replaced these two drives for two reasons -- I needed to expand capacity and these drives were giving warnings about too many power on hours (>20K).

Here is a slightly edited version of my syslog. I don't think I left anything important out, but I had to remove some entries for security reasons, mostly mover log info and duplicate file info. Originally, the syslog was a huge 34 MB text file.
http://pastebin.com/m3590e515

Here is a printout of smartctl for my server. This was from 9am this morning.
http://pastebin.com/m17519570

Here is a smartctl report for drive 10 now.
http://pastebin.com/m62b97b3

Uh-oh! A big red X when using the myMain SmartView tab. Looks like the drive is going to fail within 24 hours.

Any suggestions on the best course of action are greatly appreciated. Thanks everyone for your help.
December 10, 2009

Shut the server down and wait for a replacement drive to show up.
December 10, 2009 (Author)

Okay -- thanks a lot. I think you are saying to replace the drive. I can't wait for an RMA, so I think you're saying I should --

shut down the server
buy a new replacement drive
replace the failing drive with the new one
double-check all other drive connections
complete the Replacing a Data Drive routine from the wiki
RMA my bad drive
... etc.

If I misunderstand, please let me know. Thanks very much for the guidance.
December 10, 2009

The 'drive failure in 24 hours' message is a very generic message, not literal, but it should still be taken very seriously. The SMART firmware programmers are supposed to set the failure scales such that when an attribute reaches a number that, according to their internal studies, statistically *could* be within 24 hours of complete failure, that number is considered the failure threshold.

For your drive, the failure threshold for Reallocated_Sector_Ct is 36, which represents 36% of the reserved sectors for remapping still remaining. In other words, at your Reallocated_Sector_Ct VALUE of 32, you have used up 68% of the available spare sectors and, depending on the rate of failing sectors, could be running out very soon.

At 455 Power_On_Hours, the Reallocated_Sector_Ct was 2758. At 467 Power_On_Hours, the Reallocated_Sector_Ct was 2801, which is a rapid and serious degradation. If this were a constant rate of failure (which it definitely is not), then you are losing about 86 sectors per day, which means, I believe, you still have about a two-week supply left (possibly 1200 of the original 4000). At that time, you would run out of spare sectors. Since all other SMART attributes look fine, the drive would still not fail outright, but new bad sectors would not be recoverable and would return hard read and write errors, with an increasing chance of data loss.

Shutting down the server protects you, as there is a risk of complete loss of all data on Disk 10. What I would do is disconnect the drive, and then, if it is absolutely necessary, you can start the array and copy off any critical data from the virtual Disk 10. Keep the drive safe, as it is still readable, so that if anything else goes wrong and you lose the virtual Disk 10, you can reinsert this drive and copy off everything important. Once the drive is replaced and Disk 10 is rebuilt, you can dispose of it.

To decrease the risk of losing the virtual Disk 10, use the array as little as possible. Keeping it shut down is safest.
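The back-of-the-envelope estimate in this reply can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the Power_On_Hours and Reallocated_Sector_Ct readings, the normalized VALUE of 32, and the guess of roughly 1200 spare sectors remaining all come from the post, while the function names are invented here. It also adopts the post's reading that the normalized VALUE is the percentage of spares still available, and its admittedly unrealistic assumption of a constant failure rate.

```python
def spares_used_percent(normalized_value):
    """Percent of spare sectors consumed, reading VALUE as percent remaining (the post's interpretation)."""
    return 100 - normalized_value


def reallocation_rate_per_day(hours_a, count_a, hours_b, count_b):
    """Sectors reallocated per 24 power-on hours between two SMART readings."""
    return (count_b - count_a) * 24 / (hours_b - hours_a)


def days_until_exhausted(spares_left, rate_per_day):
    """Days until the spare pool runs out, assuming the rate stays constant."""
    return spares_left / rate_per_day


if __name__ == "__main__":
    used = spares_used_percent(32)
    rate = reallocation_rate_per_day(455, 2758, 467, 2801)
    days = days_until_exhausted(1200, rate)  # 1200 is the post's rough guess
    print(f"spares used: {used}%")
    print(f"rate: {rate:.0f} sectors/day")
    print(f"estimated days left: {days:.1f}")
```

With the figures from the post this gives 68% used, 86 sectors per day, and just under 14 days, which matches the "about two weeks" estimate.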