Transient read errors - magically disappeared...


Solved by JorgeB


Wondering if anybody else has seen anything like this.

 

I was doing some file archiving from PC to my RAID array a couple of days ago (using TeraCopy to CRC-verify the moves).

 

Everything appeared to have gone fine (no hint of errors at the OS/app level from the actual file transfers), and I was about to shut down the array when I noticed on the main UI tab that there had been READ errors on the parity drive (14TB WD, /dev/sdb) over the course of the day. Inspecting the syslog showed stanzas like this:

Oct 12 22:06:27 Tower2 kernel: ata3.00: exception Emask 0x0 SAct 0x80f80020 SErr 0x0 action 0x0
Oct 12 22:06:27 Tower2 kernel: ata3.00: irq_stat 0x40000008
Oct 12 22:06:27 Tower2 kernel: ata3.00: failed command: READ FPDMA QUEUED
Oct 12 22:06:27 Tower2 kernel: ata3.00: cmd 60/00:98:80:ac:5a/04:00:a5:05:00/40 tag 19 ncq dma 524288 in
Oct 12 22:06:27 Tower2 kernel:         res 43/40:f8:88:ac:5a/00:03:a5:05:00/40 Emask 0x409 (media error) <F>
Oct 12 22:06:27 Tower2 kernel: ata3.00: status: { DRDY SENSE ERR }
Oct 12 22:06:27 Tower2 kernel: ata3.00: error: { UNC }
Oct 12 22:06:27 Tower2 kernel: ata3.00: configured for UDMA/133
Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=7s
Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 Sense Key : 0x3 [current] 
Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 ASC=0x11 ASCQ=0x4 
Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 05 a5 5a ac 80 00 00 04 00 00 00
Oct 12 22:06:27 Tower2 kernel: I/O error, dev sdb, sector 24249019520 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 2
Oct 12 22:06:27 Tower2 kernel: md: disk0 read error, sector=24249019456
...
Oct 12 22:06:27 Tower2 kernel: md: disk0 read error, sector=24249020472
Oct 12 22:06:27 Tower2 kernel: ata3: EH complete

 

There were about 9 such sections, with a total of 1351 md read errors showing up on the main UI screen. SMART showed NO pending or actual reallocated sectors having cropped up.

 

I then ran an extended SMART self-test, which took almost 24 hours; during it, the normalized Raw_Read_Error_Rate value fell all the way to its threshold (001). From Unraid's SMART-attributes page for the disk, you could watch the attribute steadily ticking down over the course of the run.

  1 Raw_Read_Error_Rate     PO-R--   001   001   001    NOW  8225
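For anyone wanting to watch that attribute from the shell rather than the Unraid page, something like the sketch below works (smartctl from smartmontools; the parsing is done against the sample line above, since the exact column layout can vary by drive):

```shell
# Kick off the drive's extended self-test (runs inside the drive firmware):
#   smartctl -t long /dev/sdb
# ...then poll the attribute table while it runs:
#   smartctl -A /dev/sdb
# The normalized VALUE column is what counts down toward THRESH.
# Parsing the sample attribute line quoted above:
line='  1 Raw_Read_Error_Rate     PO-R--   001   001   001    NOW  8225'
set -- $line
echo "attribute=$2 value=$4 thresh=$6 raw=$8"
```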

 

I assume that since the extended SMART test runs entirely inside the drive, and these errors kept occurring during it, this rules out issues with the cabling or controller card(s).

 

Since the tower was still booted (same boot), I decided to try a dd test to see if I would get failures.

dd if=/dev/sdb of=/dev/null status=progress

 

Sure enough, not long into the run, there were plenty of read-rate stalls and entries like this in syslog:

Oct 14 06:42:35 Tower2 kernel: ata3.00: exception Emask 0x0 SAct 0x80080 SErr 0x0 action 0x0
Oct 14 06:42:35 Tower2 kernel: ata3.00: irq_stat 0x40000008
Oct 14 06:42:35 Tower2 kernel: ata3.00: failed command: READ FPDMA QUEUED
Oct 14 06:42:35 Tower2 kernel: ata3.00: cmd 60/00:98:08:b1:ca/01:00:0d:00:00/40 tag 19 ncq dma 131072 in
Oct 14 06:42:35 Tower2 kernel:         res 43/40:08:00:b2:ca/00:00:0d:00:00/40 Emask 0x409 (media error) <F>
Oct 14 06:42:35 Tower2 kernel: ata3.00: status: { DRDY SENSE ERR }
Oct 14 06:42:35 Tower2 kernel: ata3.00: error: { UNC }
Oct 14 06:42:35 Tower2 kernel: ata3.00: configured for UDMA/133
Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=17s
Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 Sense Key : 0x3 [current] 
Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 ASC=0x11 ASCQ=0x4 
Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 0d ca b1 08 00 00 01 00 00 00
Oct 14 06:42:35 Tower2 kernel: I/O error, dev sdb, sector 231387648 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
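A quick way to tally these during a session is to grep the syslog for the two kernel-message patterns quoted above (sketch below uses an inline excerpt; on a live Unraid box you would point grep at /var/log/syslog instead):

```shell
# Count md-layer read errors and uncorrectable (UNC) media errors in a syslog excerpt.
log=$(cat <<'EOF'
Oct 14 06:42:35 Tower2 kernel: ata3.00: failed command: READ FPDMA QUEUED
Oct 14 06:42:35 Tower2 kernel: ata3.00: error: { UNC }
Oct 12 22:06:27 Tower2 kernel: md: disk0 read error, sector=24249019456
Oct 12 22:06:27 Tower2 kernel: md: disk0 read error, sector=24249020472
EOF
)
echo "$log" | grep -c 'md: disk0 read error'   # md-layer read errors -> 2
echo "$log" | grep -c 'error: { UNC }'         # uncorrectable media errors -> 1
```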

 

Similarly, new failed "READ FPDMA QUEUED" entries appeared in the drive's own internal SMART error log.

 

At this point I was prepared to swap out the drive, but first I rebooted the array just to see what would happen. After rebooting, the SMART Raw_Read_Error_Rate had returned to 100 (as expected). I re-ran the dd test to see if the errors were still happening. NOTHING failed or flaked out.

 

I BELIEVE that all of these read 'errors' were in fact "recoverable" - i.e. the file moves that reported success were ultimately able to read and verify their data.

 

I'm currently running some CRC checks on the files I moved, but so far all reads are good (no ATA-level exceptions) and no CRC mismatches have been detected.
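For anyone wanting to do the same kind of after-the-fact verification from the shell: TeraCopy keeps its own CRC lists, but a plain checksum manifest works too. A sketch with throwaway files (in reality you'd run the first sha256sum over the source tree before the move and the `-c` check over the copies on the array afterwards):

```shell
# Build a checksum manifest before the move, then verify the copies after it.
workdir=$(mktemp -d)
cd "$workdir"
printf 'hello\n' > a.txt
printf 'world\n' > b.txt
sha256sum a.txt b.txt > manifest.sha256   # before the move (on the source)
sha256sum -c manifest.sha256              # after the move: prints "a.txt: OK" etc.
```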

 

I'm at a loss to explain why the disk exhibited this read-failure behaviour during one boot, only for it to vanish on the next. I've never had issues with this disk before, either.

 

Can anybody offer any insights or similar experiences? I'm going to leave the parity drive in-place for now and just keep monitoring it in case this happens again, but for now it seems to have returned to full health?!

34 minutes ago, magic144 said:
(media error)

 

34 minutes ago, magic144 said:
error: { UNC }

 

The most important parts of those errors are the lines quoted above: they usually mean it was a disk error, and that SMART attribute points the same way. This type of error can be intermittent, but most likely there will be new errors soon; if that happens, I recommend replacing the disk.


Thanks @JorgeB.

 

It is surprising, though, that while the previous (with-errors) boot session was still active, these "read errors" could be readily triggered (e.g. via dd), yet since rebooting they have stopped entirely. One would imagine that the intermittent nature of a media error would be just that, intermittent, not confined to one particular boot cycle versus another. This smacks of a weird internal glitch in the drive from boot; it "got out of bed on the wrong side that day", almost!

 

Another question: isn't it the case that if a disk finds a sector unreadable, the sector should be marked "pending reallocation"? No such incidents appear in the SMART data.
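The relevant counters can be pulled with smartctl; a sketch below parses sample lines (with zero raw values, matching what this disk reported), since the real output needs the device. Attribute 197 (Current_Pending_Sector) counts sectors flagged unreadable and awaiting a rewrite; attribute 5 (Reallocated_Sector_Ct) only rises once a remap actually happens on a subsequent write:

```shell
# On the live system:  smartctl -A /dev/sdb | grep -E 'Reallocated|Pending|Uncorrect'
# Sample attribute lines, parsed to show name=raw_value:
sample='  5 Reallocated_Sector_Ct   PO--CK   100   100   140    -    0
197 Current_Pending_Sector  -O--CK   100   100   000    -    0'
echo "$sample" | awk '{print $2 "=" $NF}'
```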

 

Thanks again.


Yeah. I ran the one extended SMART test yesterday while the drive was still "unwell"; it took nearly 24 hours. The problem with it is that it keeps failure info internal, i.e. not exposed to the kernel/OS, but it did run the Raw_Read_Error_Rate down to threshold failure. The extended test did not add anything to the internally logged Comprehensive SMART Error Log or the Device Error Count.

Will probably run a parity check next.

 

Still no errors recorded on this boot while CRC-checking the 1100+ GB moved onto the array during the "rocky" period. Admittedly that doesn't (now) exercise the parity disk itself, but it gives me peace of mind about what was previously written to the array. A parity check will give a more comprehensive result.

