Disk Errors and Warnings

December 9, 2009

I recently replaced two old 500GB disks with two new 1TB disks (disk10 and disk11). All went fine. This was about three weeks ago.

I came home one day and rtorrent (I think) had killed my system. I couldn't access it through telnet, the local keyboard and monitor, or anything else. I pressed the power button, but the safe powerdown script wouldn't even run as it usually does. I couldn't do anything to gain access to the server, so I held down the power button until the computer turned off.

When I rebooted, the array was down, and the error log told me that my super.dat file was corrupt. I simply did the "trust your parity" routine and the system came back online with no parity errors after the check. Everything has been working great.

Today I looked at bubbaQ's SMART parameter tracking database and received some errors and warnings.

disk10 sdg 9VP266CN: *ERROR* - Reallocated_Sector_Ct it is now 2449 (error threshold is 30) and ata_error_count=12

disk10 is showing 32 errors on the unRaid main page.

disk11 sdc 9VP2A6KP: *ERROR* - UDMA_CRC_Error_Count it is now 1160 (error threshold is 75)

disk9 sdm 9TE0NDJZ: WARNING - Reallocated_Sector_Ct it is now 11 (warning threshold is 10)

Parity is Valid. Last parity check 4 days ago with no sync errors.

Red entries from the syslog:

Dec 7 20:08:43 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x7ffffff SErr 0x0 action 0x0
Dec 7 20:08:43 Tower kernel: ata6.00: irq_stat 0x00020002, device error via SDB FIS
Dec 7 20:08:43 Tower kernel: ata6.00: cmd 61/08:00:7f:5b:0f/00:00:4c:00:00/40 tag 0 ncq 4096 out
Dec 7 20:08:43 Tower kernel: res 60/02:00:00:00:00/00:00:00:00:60/00 Emask 0x1 (device error)
Dec 7 20:08:43 Tower kernel: ata6.00: status: { DRDY DF }

Dec 7 20:08:43 Tower kernel: ata6.00: model number mismatch 'ST31000528AS' != ''
Dec 7 20:08:43 Tower kernel: ata6.00: revalidation failed (errno=-19)
Dec 7 20:08:43 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps
Dec 7 20:08:43 Tower kernel: ata6.00: limiting speed to UDMA/100:PIO3
Dec 7 20:08:43 Tower kernel: ata6: hard resetting link
Dec 7 20:08:45 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 10)

Dec 8 06:23:09 Tower kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 8 06:23:09 Tower kernel: ata6.00: cmd e5/00:00:00:00:00/00:00:00:00:00/40 tag 0
Dec 8 06:23:09 Tower kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50799
Dec 8 12:04:28 Tower kernel: sd 6:0:0:0: [sdg] Result: hostbyte=0x00 driverbyte=0x08
Dec 8 12:04:28 Tower kernel: sd 6:0:0:0: [sdg] Sense Key : 0x4 [current] [descriptor]
Dec 8 12:04:28 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Dec 8 12:04:28 Tower kernel: 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 60 c0
Dec 8 12:04:28 Tower kernel: 00 00 00 00
Dec 8 12:04:28 Tower kernel: sd 6:0:0:0: [sdg] ASC=0x0 ASCQ=0x0
Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50815

Dec 8 12:04:28 Tower kernel: end_request: I/O error, dev sdg, sector 50823
Dec 8 12:04:28 Tower kernel: ata6: EH complete
Dec 8 12:04:28 Tower kernel: md: disk10 read error
Dec 8 12:04:28 Tower kernel: handle_stripe read error: 50736/10, count: 1

Dec 8 16:23:21 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 8 16:23:21 Tower kernel: ata8.00: cmd e0/00:00:00:00:00/00:00:00:00:00/40 tag 0
Dec 8 16:23:21 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 8 16:23:21 Tower kernel: ata8.00: status: { DRDY }
Dec 8 16:23:21 Tower kernel: ata8: hard resetting link

I'm thinking of running another parity check to see how that pans out. I should probably verify the physical cable connections on the drives first. They are in Chenbro SATA backplanes so maybe something came loose.

Does anyone else have a suggestion? I will save the syslog and can post it if necessary.

Thanks very much for the help and especially for these great diagnostic tools which are a big part of why unRAID is so great.

December 9, 2009

UDMA CRC errors can indicate cabling problems. Please post the syslog so we can see the sequences in full.

Given the circumstances, I would re-check the SATA and power connections to disks 9, 10, and 11, then run a parity check.

December 9, 2009

Isn't "*ERROR* - Reallocated_Sector_Ct" increasing over time a sign of a failing disk?

December 9, 2009

Yes. Disk10 /dev/sdg is a prime candidate for replacement.

December 9, 2009

Yes, it is (disks 9 and 10). However, you mentioned that Disk 11 is seeing lots of UDMA CRC errors, which are typically cabling related. Disk 9 is reporting a rising Reallocated_Sector_Ct, the syslog excerpts show ata6 with exception handler issues and ata8 also having trouble, and Disk 10 is reporting read errors.

If we could see SMART reports for all drives and the full syslog, we would be able to advise better. Did you replace the two 500GB drives for a particular reason?

December 10, 2009

Author

Thanks very much for the responses and suggestions.

I replaced these two drives for two reasons -- I needed to expand capacity and these drives were giving warnings about too many power on hours (>20K).

Here is a slightly edited version of my syslog. I don't think I left anything important out, but I had to remove some entries for security reasons, mostly mover log info and duplicate file info. Originally, the syslog was a huge 34MB text file.

http://pastebin.com/m3590e515

Here is a printout of smartctl for my server. This was from 9am this morning.

http://pastebin.com/m17519570

Here is a smartctl for drive10 now.

http://pastebin.com/m62b97b3

Uh-oh! A big red X when using myMain SmartView tab. Looks like the drive is going to fail within 24 hours.

Any suggestions of what the best course of action would be are greatly appreciated.

Thanks everyone for your help.

December 10, 2009

Shut the server down and wait for a replacement drive to show up.

December 10, 2009

Author

Okay -- thanks a lot. I think you are saying to replace the drive. I can't wait for an RMA, so I think you're saying I should --

shutdown the server
buy a new replacement drive
replace the failing drive with the new
double check all other drive connections
complete the Replacing a Data Drive routine from the wiki
RMA my bad drive ... etc.

If I misunderstand please let me know. Thanks very much for the guidance.

December 10, 2009

The 'drive failure in 24 hours' message is a very generic message, not literal, but it should still be taken very seriously. The SMART firmware programmers are supposed to set the failure scales so that an attribute reaches its failure threshold when, according to their internal studies, the drive statistically *could* be within 24 hours of complete failure. For your drive, the failure threshold for Reallocated_Sector_Ct is 36, which represents 36% of the reserved sectors still available for remapping. In other words, at your Reallocated_Sector_Ct VALUE of 32, you have used up 68% of the available spare sectors and, depending on the rate of failing sectors, could run out very soon.

At 455 Power_On_Hours, the raw Reallocated_Sector_Ct was 2758. At 467 Power_On_Hours, it was 2801: 43 more sectors in 12 hours, which is a rapid and serious degradation. If this were a constant rate of failure (which it definitely is not), you would be losing about 86 sectors per day, which means, I believe, that you still have about a two-week supply left (possibly 1200 of the original 4000). At that point, you would run out of spare sectors. Since all other SMART attributes look fine, the drive would still not fail outright, but new bad sectors could no longer be remapped and would return hard read and write errors, with an increasing chance of data loss.
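
For anyone who wants to redo the arithmetic above with their own readings, here is a rough Python sketch. The two readings are the ones quoted in this post; the 1200 remaining spare sectors is only the guess made above, and the function names are made up for illustration. It assumes a constant failure rate, which real drives do not follow, so treat the result as a ballpark at best.

```python
def sectors_per_day(hours_a, count_a, hours_b, count_b):
    """Reallocation rate between two SMART readings, in sectors per day."""
    return (count_b - count_a) * 24 / (hours_b - hours_a)


def days_until_exhausted(spares_left, rate_per_day):
    """Days until the spare pool runs out, assuming a constant rate."""
    return spares_left / rate_per_day


# Readings quoted in the post: Power_On_Hours and raw Reallocated_Sector_Ct.
rate = sectors_per_day(455, 2758, 467, 2801)  # 43 sectors in 12 hours

# Assumption: roughly 1200 spare sectors remain (the guess made above).
spares_left = 1200
days_left = days_until_exhausted(spares_left, rate)

print(f"rate: {rate:.0f} sectors/day")
print(f"estimated days left: {days_left:.1f}")
```

With these inputs the sketch reproduces the figures in the post: about 86 sectors per day and roughly two weeks of spares.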

Shutting down the server protects you, as there is a risk of complete loss of all data on Disk 10. What I would do is disconnect the drive, and then, if it is absolutely necessary, you can start the array and copy off any critical data from the virtual Disk 10. Keep the drive safe, as it is still readable, so that if anything else goes wrong and you lose the virtual Disk 10, you can reinsert this drive and copy off everything important. Once the drive is replaced and Disk 10 is rebuilt, you can dispose of it. To decrease the risk of losing the virtual Disk 10, use the array as little as possible. Keeping it shut down is safest.
