magic144 Posted October 14 Share Posted October 14 (edited) Wondering if anybody else has seen anything like this. I was doing some file archiving from PC to my RAID array a couple of days ago (using TeraCopy to CRC-verify the moves). Everything appeared to have gone fine (no hint of errors at the OS/app level from the actual file transfers) and I was about to shutdown the array when I noticed on the main UI tab that there were READ errors on the parity drive (14TB WD, /dev/sdb) over the course of the day. Inspection of the syslog showed stanzas like this: Oct 12 22:06:27 Tower2 kernel: ata3.00: exception Emask 0x0 SAct 0x80f80020 SErr 0x0 action 0x0 Oct 12 22:06:27 Tower2 kernel: ata3.00: irq_stat 0x40000008 Oct 12 22:06:27 Tower2 kernel: ata3.00: failed command: READ FPDMA QUEUED Oct 12 22:06:27 Tower2 kernel: ata3.00: cmd 60/00:98:80:ac:5a/04:00:a5:05:00/40 tag 19 ncq dma 524288 in Oct 12 22:06:27 Tower2 kernel: res 43/40:f8:88:ac:5a/00:03:a5:05:00/40 Emask 0x409 (media error) <F> Oct 12 22:06:27 Tower2 kernel: ata3.00: status: { DRDY SENSE ERR } Oct 12 22:06:27 Tower2 kernel: ata3.00: error: { UNC } Oct 12 22:06:27 Tower2 kernel: ata3.00: configured for UDMA/133 Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=7s Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 Sense Key : 0x3 [current] Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 ASC=0x11 ASCQ=0x4 Oct 12 22:06:27 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 05 a5 5a ac 80 00 00 04 00 00 00 Oct 12 22:06:27 Tower2 kernel: I/O error, dev sdb, sector 24249019520 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 2 Oct 12 22:06:27 Tower2 kernel: md: disk0 read error, sector=24249019456 ... Oct 12 22:06:27 Tower2 kernel: md: disk0 read error, sector=24249020472 Oct 12 22:06:27 Tower2 kernel: ata3: EH complete There were about 9 such sections, with a total of 1351 md read errors showing up on the main UI screen. SMART showed NO pending or actual reallocated sectors having cropped up. I then proceeded to run an extended SMART test which took almost 24h, during which the Raw_Read_Error_Rate hit its threshold (001). You could see the SMART attribute steadily diminishing over the whole of the run from the Unraid page for the disk's SMART attributes. 1 Raw_Read_Error_Rate PO-R-- 001 001 001 NOW 8225 I assume that since these errors were occurring (continuing to occur) internally during a SMART extended test, this would rule out issues related to the cabling or controller card(s). Since the tower was still booted (same boot), I decided to try a dd test to see if I would get failures. dd if=/dev/sdb of=/dev/null status=progress Sure enough, not long into the run, there were plenty of read-rate stalls and entries like this in syslog: Oct 14 06:42:35 Tower2 kernel: ata3.00: exception Emask 0x0 SAct 0x80080 SErr 0x0 action 0x0 Oct 14 06:42:35 Tower2 kernel: ata3.00: irq_stat 0x40000008 Oct 14 06:42:35 Tower2 kernel: ata3.00: failed command: READ FPDMA QUEUED Oct 14 06:42:35 Tower2 kernel: ata3.00: cmd 60/00:98:08:b1:ca/01:00:0d:00:00/40 tag 19 ncq dma 131072 in Oct 14 06:42:35 Tower2 kernel: res 43/40:08:00:b2:ca/00:00:0d:00:00/40 Emask 0x409 (media error) <F> Oct 14 06:42:35 Tower2 kernel: ata3.00: status: { DRDY SENSE ERR } Oct 14 06:42:35 Tower2 kernel: ata3.00: error: { UNC } Oct 14 06:42:35 Tower2 kernel: ata3.00: configured for UDMA/133 Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=17s Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 Sense Key : 0x3 [current] Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 ASC=0x11 ASCQ=0x4 Oct 14 06:42:35 Tower2 kernel: sd 4:0:0:0: [sdb] tag#19 CDB: opcode=0x88 88 00 00 00 00 00 0d ca b1 08 00 00 01 00 00 00 Oct 14 06:42:35 Tower2 kernel: I/O error, dev sdb, sector 231387648 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2 Similarly, new failed "READ FDPMA QUEUED" entries appeared in the device's own internal SMART history/log. At this point, I was prepared to swap out the drive, however I rebooted the array just to see what would happen. After rebooting, the SMART Raw_Read_Error_Rate had returned to 100 (expected). I re-ran the 'dd' test to see if errors were still happening. NOTHING failed or flaked out. I BELIEVE that all of these read 'errors' were in fact "recoverable" - i.e. the file move operations that succeeded were all able to read and verify the data eventually. I'm currently running some CRC checks on the files I moved, but so far all reads are good (no ATA-level exceptions) and no CRC mismatches have been detected. I'm at a loss to explain why the disk would have exhibited this read-failure behaviour during one boot, and it having disappeared at the next boot. I've never had issues with this disk previously either. Can anybody offer any insights or similar experiences? I'm going to leave the parity drive in-place for now and just keep monitoring it in case this happens again, but for now it seems to have returned to full health?! Edited October 14 by magic144 add recoverable note Quote Link to comment
JorgeB Posted October 14 Share Posted October 14 34 minutes ago, magic144 said: (media error) 34 minutes ago, magic144 said: error: { UNC } The most important part of those errors is the parts above, it usually means that it was a disk error, and that SMART attribute also points to that, this type of error can be intermittent, but most likely there will be new errors soon, and if that happens recommend replacing the disk. Quote Link to comment
magic144 Posted October 14 Author Share Posted October 14 Thanks @JorgeB. It is surprising, though, that whilst the previous (with-errors) boot session was still active, these "read errors" could be readily triggered (e.g. via dd), but since rebooting, they appear to have disappeared. One would imagine that the intermittent nature of a media error would be just that, not limited to one particular boot cycle vs another. This seems to smack of a weird internal glitch in the drive from boot. It "got out of bed on the wrong side that day" almost! Another Q. Isn't it the case that if a disk finds a sector to be unreadable, it would be marked as "pending reallocation"? No such incidents have been noted in the SMART data. Thanks again. Quote Link to comment
Solution JorgeB Posted October 14 Solution Share Posted October 14 9 minutes ago, magic144 said: it would be marked as "pending reallocation"? Not always. Quote Link to comment
magic144 Posted October 14 Author Share Posted October 14 Alright, thanks again. Will just have to keep my eye on it. Maybe stress test it with dd from time-to-time 🙂 Quote Link to comment
JorgeB Posted October 14 Share Posted October 14 An extended SMART test or a parity check are also good tests, since they do a complete surface read. Quote Link to comment
magic144 Posted October 14 Author Share Posted October 14 (edited) Yeah. I ran the one extended SMART last yesterday when it was "unwell". Took nearly 24 hours. Problem with it is it keeps failure info "internal" i.e. not exposed to the kernel/OS, but it did run down the Raw_Read_Error_Rate until threshold failure. It (the extended test) did not add any info to the internally logged Comprehensive Error Log/Device Error Count. Will probably run a parity check next. Still no errors recorded on this boot via CRC checking of the 1100+ GB moved onto it last time during the "rocky" period. Admittedly that's not (now) testing the parity disk, but it's giving me peace of mind on what was written onto the array previously. Parity check will give a more comprehensive result. Edited October 14 by magic144 expanded comment Quote Link to comment
magic144 Posted October 15 Author Share Posted October 15 (edited) Update: 4 more UNC "READ FPDMA QUEUED" errors showed up via the dd test so far during the course of today. Edited October 15 by magic144 Quote Link to comment
magic144 Posted October 22 Author Share Posted October 22 Update: performed full parity check this past weekend - NO errors on disk detected... Quote Link to comment
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.