Read errors on new parity disk

Woodpusherghd · October 17, 2014

I recently replaced my failing 1TB parity disk with a 2TB disk. Parity rebuilt, and a parity check afterwards showed 1 sync error, which was corrected. After a few weeks with no problems, a log entry appeared showing 108 read errors on the parity disk. I haven't noticed any other problems and the read errors haven't reappeared. Is this something I should be concerned about? I ran a SMART report on the parity drive and it seems OK. SMART report and syslog attached. Thanks.

smart.txt

syslog-20141017.txt.zip

SSD · October 17, 2014

SMART data looks fine.

I am not a syslog expert, but the following section seems relevant ...

Oct 15 06:40:20 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Oct 15 06:40:20 Tower kernel: ata12.00: BMDMA stat 0x25

Oct 15 06:40:20 Tower kernel: ata12.00: failed command: READ DMA EXT

Oct 15 06:40:20 Tower kernel: ata12.00: cmd 25/00:28:70:e4:50/00:03:41:00:00/e0 tag 0 dma 413696 in

Oct 15 06:40:20 Tower kernel: res 51/40:28:70:e4:50/40:03:41:00:00/e0 Emask 0x9 (media error)

Oct 15 06:40:20 Tower kernel: ata12.00: status: { DRDY ERR }

Oct 15 06:40:20 Tower kernel: ata12.00: error: { UNC }

Oct 15 06:40:20 Tower kernel: ata12.00: configured for UDMA/133

Oct 15 06:40:20 Tower kernel: sd 12:0:0:0: [sdf] Unhandled sense code

Oct 15 06:40:20 Tower kernel: sd 12:0:0:0: [sdf]

Oct 15 06:40:20 Tower kernel: Result: hostbyte=0x00 driverbyte=0x08

Oct 15 06:40:20 Tower kernel: sd 12:0:0:0: [sdf]

Oct 15 06:40:20 Tower kernel: Sense Key : 0x3 [current] [descriptor]

Oct 15 06:40:20 Tower kernel: Descriptor sense data with sense descriptors (in hex):

Oct 15 06:40:20 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00

Oct 15 06:40:20 Tower kernel: 41 50 e4 70

Oct 15 06:40:20 Tower kernel: sd 12:0:0:0: [sdf]

Oct 15 06:40:20 Tower kernel: ASC=0x11 ASCQ=0x4

Oct 15 06:40:20 Tower kernel: sd 12:0:0:0: [sdf] CDB:

Oct 15 06:40:20 Tower kernel: cdb[0]=0x28: 28 00 41 50 e4 70 00 03 28 00

Oct 15 06:40:20 Tower kernel: end_request: I/O error, dev sdf, sector 1095820400

Oct 15 06:40:20 Tower kernel: md: disk0 read error, sector=1095820336

Oct 15 06:40:20 Tower kernel: md: disk0 read error, sector=1095820344

Oct 15 06:40:20 Tower kernel: md: disk0 read error, sector=1095820352

....

My best guess would be a bad or loose SATA cable.

Second thought is a memory error.
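For reference, the `cmd` line in the excerpt above can be decoded by hand to see which sectors the failed read covered. The following is only a sketch: it assumes the usual libata layout for these strings (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), and the function name `decode_taskfile` is made up for illustration.

```python
def decode_taskfile(tf):
    """Decode a libata 'cmd' taskfile string into command, 48-bit LBA and sector count.

    Assumed field order (not taken from this thread):
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    command, low, high, _device = tf.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # The "hob" (high order byte) fields carry the upper half of the 48-bit LBA
    # and the upper byte of the sector count.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    sectors = (hob_nsect << 8) | nsect
    return {"command": int(command, 16), "lba": lba, "sectors": sectors}


# The failed READ DMA EXT from the Oct 15 excerpt.
read_cmd = decode_taskfile("25/00:28:70:e4:50/00:03:41:00:00/e0")
print(read_cmd["lba"], read_cmd["sectors"], read_cmd["sectors"] * 512)
```

The decoded LBA matches the "I/O error, dev sdf, sector 1095820400" line, and the sector count times 512 bytes matches the "dma 413696 in" figure. The first md read error (sector 1095820336) is 64 lower than the device sector, which would be consistent with a partition that starts at sector 64; that is an inference from the numbers, not something the log states.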

RobJ · October 17, 2014

A "media error" with error flag UNC is always a bad sector, and that comes directly from the drive itself. A bad SATA cable will always (as far as I have ever seen) include the error flag BadCRC or ICRC.
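As a rough illustration of where those flags come from: the second byte of a `res` line is the drive's ATA error register, and the names in braces are just its bits spelled out. The sketch below assumes the standard bit assignments for the flags libata usually prints (ICRC 0x80, UNC 0x40, IDNF 0x10, ABRT 0x04, AMNF 0x01); the helper name `error_flags` is hypothetical.

```python
# Assumed ATA error register bits (the subset libata normally prints).
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error, the cable/connection style failure
    0x40: "UNC",   # uncorrectable media error, i.e. an unreadable sector
    0x10: "IDNF",  # sector ID not found
    0x04: "ABRT",  # command aborted
    0x01: "AMNF",  # address mark not found
}


def error_flags(res):
    """Return the error flag names encoded in a libata 'res' string."""
    _status, rest = res.split("/", 1)
    error = int(rest.split(":", 1)[0], 16)
    return [name for bit, name in ERROR_BITS.items() if error & bit]


# 'res' values copied from the syslog excerpts in this thread.
print(error_flags("51/40:28:70:e4:50/40:03:41:00:00/e0"))  # Oct 15 read error
print(error_flags("51/10:00:00:53:28/10:04:3d:00:00/e0"))  # Oct 14 write error
```

Neither value has the 0x80 (ICRC) bit set, which is why these errors point at the drive rather than at the cable.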

At Oct 14 6:52, you had the first error on sdf:

Oct 14 06:52:01 Tower kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Oct 14 06:52:01 Tower kernel: ata12.00: BMDMA stat 0x25

Oct 14 06:52:01 Tower kernel: ata12.00: failed command: WRITE DMA EXT

Oct 14 06:52:01 Tower kernel: ata12.00: cmd 35/00:00:00:53:28/00:04:3d:00:00/e0 tag 0 dma 524288 out

Oct 14 06:52:01 Tower kernel: res 51/10:00:00:53:28/10:04:3d:00:00/e0 Emask 0x81 (invalid argument)

Oct 14 06:52:01 Tower kernel: ata12.00: status: { DRDY ERR }

Oct 14 06:52:01 Tower kernel: ata12.00: error: { IDNF }

Oct 14 06:52:01 Tower kernel: ata12.00: configured for UDMA/133

Oct 14 06:52:01 Tower kernel: ata12: EH complete

An IDNF error flag is serious; it means a sector cannot be found (sector ID Not Found). It's pretty rare with modern drives, though it used to be common. It is generally associated either with a serious mechanical problem with the drive or with a sector that is permanently damaged, but your SMART report (as Brian said) looks great! I'm rather shocked by that! You have run SMART short tests recently, but what is needed now is the SMART long test. After you start it, wait 5 hours and obtain another SMART report. The test section of it will tell you whether the test has finished. If it hasn't, it will give the approximate time remaining, so get another report once the test does finish. I don't expect the test to be good. I'm out of time at the moment...
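For anyone who prefers the command line, the long test described above can also be started and checked with smartctl. This is a minimal sketch, assuming smartctl is installed and that a report for a running test contains a line of the form "NN% of test remaining"; the function names are made up, and the device path must be whatever the drive currently shows up as.

```python
import re
import subprocess


def start_long_test(device):
    """Ask the drive to begin an extended (long) self-test via 'smartctl -t long'."""
    return subprocess.run(
        ["smartctl", "-t", "long", device],
        capture_output=True, text=True, check=False,
    )


def fetch_report(device):
    """Return the full SMART report text from 'smartctl -a'."""
    result = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def remaining_percent(report_text):
    """Return the percentage of the self-test still to run, or None if not reported.

    Assumes the report wording 'NN% of test remaining' while a test is in progress.
    """
    match = re.search(r"(\d+)% of test remaining", report_text)
    return int(match.group(1)) if match else None
```

Typical use would be `start_long_test("/dev/sdf")`, then some hours later `remaining_percent(fetch_report("/dev/sdf"))`. The drive was sdf in the syslog above, but device letters can change between boots, so check before running anything.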

Woodpusherghd · October 17, 2014

Can I run the long smart test with the array online?

RobJ · October 17, 2014

Can I run the long smart test with the array online?

I'm back, for a bit. It's the parity drive, so you don't want anything accessing it during the test. I would temporarily turn off spin down for the drive, then stop the array, then start the test, and wait for it to finish completely.

An alternative strategy: stop the array, unassign the parity drive, shut down and add another drive as the parity drive, restart, assign it, and build parity. Once parity is good, Preclear the problematic old parity drive.

Woodpusherghd · October 18, 2014

Completed long smart test, I don't see any errors. Smart report attached.

smart.txt

RobJ · October 18, 2014

Completed long smart test, I don't see any errors. Smart report attached.

I have to be honest, I'm a bit nonplussed: that is still a good, clean SMART report with no issues. The drive itself is reporting the problems with a bad sector (always in the same spot) and an IDNF (sector unknown), yet the SMART report shows nothing bad.

All I can suggest is to rebuild parity, which forces a write to all sectors. Unassign the parity drive, start and stop the array, then re-assign the parity drive and start the array to rebuild parity and rewrite all sectors.

Woodpusherghd · October 19, 2014

I stopped the array, unassigned the parity drive, then re-assigned it and rebuilt parity. I then did a parity check, which found over 500 sync errors that were corrected. I then started another parity check, which is still finding sync errors. Could this be caused by bad RAM? Should I run memtest? Thoughts?

Woodpusherghd · October 20, 2014

I've attached a smart report for one of my data disks showing errors. Could this be the cause of my sync errors?

smart.txt

SSD · October 20, 2014

I've attached a smart report for one of my data disks showing errors. Could this be the cause of my sync errors?

The report looks ok. No telltale signs of failure.

This line is typically associated with a cabling problem, which is frequently at the heart of this type of problem:

199 UDMA_CRC_Error_Count 0x0036 099 099 000 Old_age Always - 686

There is also a series of ATA errors in the report, which are often associated with cabling problems too. The ATA errors are quite old, however (the error count itself carries no age). Note that corrective actions will NOT remove these errors or reduce the UDMA_CRC_Error_Count, so if you had a problem and fixed it, there would be no way of knowing unless you had an older SMART report, taken after the fix was made, to compare against.

But my recommendation (and sorry, I haven't read through this entire thread, so this may already have been suggested or done) would be to replace the SATA cable, plug into a different controller port (motherboard would be best), and generally avoid as many potential connection-related issues as you can (e.g., if it is in a drive cage, try connecting directly to the MB). If problems continue to occur, monitor the SMART attributes for changes.
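Since the raw counts never go down, the practical way to monitor them is to keep a report and compare it with a later one. A minimal sketch of that comparison follows, assuming the usual ten-column attribute table layout (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw). The function names are made up, and the "later" report is a hypothetical value, not data from this thread.

```python
def parse_attributes(report):
    """Map attribute name to raw value for each attribute row in a SMART report."""
    attrs = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex flag in column 3.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            try:
                attrs[parts[1]] = int(parts[9])
            except ValueError:
                continue  # skip raw values that are not plain integers
    return attrs


def changed_attributes(old_report, new_report):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    old, new = parse_attributes(old_report), parse_attributes(new_report)
    return {
        name: (old[name], new[name])
        for name in new
        if name in old and new[name] != old[name]
    }


# The first line is the one quoted above; the second is a hypothetical later reading.
before = "199 UDMA_CRC_Error_Count 0x0036 099 099 000 Old_age Always - 686"
after = "199 UDMA_CRC_Error_Count 0x0036 099 099 000 Old_age Always - 690"
print(changed_attributes(before, after))
```

If the CRC count keeps climbing after the cable and port have been changed, the connection is still suspect; if it stays put, the old errors are just history.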

Woodpusherghd · October 29, 2014

I've been getting a lot of read errors on this parity drive, so I finally swapped it out for a new one. Rebuilt parity, checked parity, no errors. I'm returning the drive to Microcenter. Just wondering: the problem drive is a WD Red. Have any other users had problems with these drives, or did I just get a lemon?
