Read Errors on Drive 1 advice needed

Geck0 · March 1

Hi, the weekend has greeted me with read errors on Drive 1. Unraid was in the middle of a monthly parity check, which I've now paused until I've received some feedback.

Under "Fix Problems", it states this:

Quote

If the disk has not been disabled, then Unraid has successfully rewritten the contents of the offending sectors back to the hard drive. It would be a good idea to look at the S.M.A.R.T. Attributes

The drive hasn't been disabled, and a short SMART test shows no errors.

The disk log shows:

Quote

Feb 19 18:40:39 Nexus kernel: ata10: SATA max UDMA/133 abar m2048@0xfbe00000 port 0xfbe00380 irq 83
Feb 19 18:40:39 Nexus kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 19 18:40:39 Nexus kernel: ata10.00: ATA-11: ST14000VN0008-2JG101, SC60, max UDMA/133
Feb 19 18:40:39 Nexus kernel: ata10.00: 27344764928 sectors, multi 16: LBA48 NCQ (depth 32), AA
Feb 19 18:40:39 Nexus kernel: ata10.00: Features: NCQ-sndrcv
Feb 19 18:40:39 Nexus kernel: ata10.00: configured for UDMA/133
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] 4096-byte physical blocks
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] Write Protect is off
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] Preferred minimum I/O size 4096 bytes
Feb 19 18:40:39 Nexus kernel: sdi: sdi1
Feb 19 18:40:39 Nexus kernel: sd 10:0:0:0: [sdi] Attached SCSI removable disk
Feb 19 18:41:04 Nexus emhttpd: ST14000VN0008-2JG101_ZHZ3DK2T (sdi) 512 27344764928
Feb 19 18:41:04 Nexus kernel: mdcmd (2): import 1 sdi 64 13672382412 0 ST14000VN0008-2JG101_ZHZ3DK2T
Feb 19 18:41:04 Nexus kernel: md: import disk1: (sdi) ST14000VN0008-2JG101_ZHZ3DK2T size: 13672382412
Feb 19 18:41:04 Nexus emhttpd: read SMART /dev/sdi
Feb 19 18:43:59 Nexus emhttpd: shcmd (209): /usr/local/sbin/set_ncq sdi 1
Feb 19 18:43:59 Nexus root: set_ncq: setting sdi queue_depth to 1
Feb 19 18:43:59 Nexus emhttpd: shcmd (210): echo 128 > /sys/block/sdi/queue/nr_requests
Mar 2 01:31:41 Nexus kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 2 01:31:41 Nexus kernel: ata10.00: irq_stat 0x40000001
Mar 2 01:31:41 Nexus kernel: ata10.00: failed command: READ DMA EXT
Mar 2 01:31:41 Nexus kernel: ata10.00: cmd 25/00:80:78:84:e8/00:01:29:06:00/e0 tag 18 dma 196608 in
Mar 2 01:31:41 Nexus kernel: ata10.00: status: { DRDY SENSE ERR }
Mar 2 01:31:41 Nexus kernel: ata10.00: error: { UNC }
Mar 2 01:31:41 Nexus kernel: ata10.00: configured for UDMA/133
Mar 2 01:31:41 Nexus kernel: sd 10:0:0:0: [sdi] tag#18 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=6s
Mar 2 01:31:41 Nexus kernel: sd 10:0:0:0: [sdi] tag#18 Sense Key : 0x3 [current]
Mar 2 01:31:41 Nexus kernel: sd 10:0:0:0: [sdi] tag#18 ASC=0x11 ASCQ=0x4
Mar 2 01:31:41 Nexus kernel: sd 10:0:0:0: [sdi] tag#18 CDB: opcode=0x88 88 00 00 00 00 06 29 e8 84 78 00 00 01 80 00 00
Mar 2 01:31:41 Nexus kernel: I/O error, dev sdi, sector 26472907896 op 0x0:(READ) flags 0x0 phys_seg 48 prio class 2
Mar 2 01:31:41 Nexus kernel: ata10: EH complete
Mar 2 01:31:48 Nexus kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Mar 2 01:31:48 Nexus kernel: ata10.00: irq_stat 0x40000001
Mar 2 01:31:48 Nexus kernel: ata10.00: failed command: READ DMA EXT
Mar 2 01:31:48 Nexus kernel: ata10.00: cmd 25/00:00:f8:85:e8/00:02:29:06:00/e0 tag 6 dma 262144 in
Mar 2 01:31:48 Nexus kernel: ata10.00: status: { DRDY SENSE ERR }
Mar 2 01:31:48 Nexus kernel: ata10.00: error: { UNC }
Mar 2 01:31:48 Nexus kernel: ata10.00: configured for UDMA/133
Mar 2 01:31:48 Nexus kernel: sd 10:0:0:0: [sdi] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=6s
Mar 2 01:31:48 Nexus kernel: sd 10:0:0:0: [sdi] tag#6 Sense Key : 0x3 [current]
Mar 2 01:31:48 Nexus kernel: sd 10:0:0:0: [sdi] tag#6 ASC=0x11 ASCQ=0x4
Mar 2 01:31:48 Nexus kernel: sd 10:0:0:0: [sdi] tag#6 CDB: opcode=0x88 88 00 00 00 00 06 29 e8 85 f8 00 00 02 00 00 00
Mar 2 01:31:48 Nexus kernel: I/O error, dev sdi, sector 26472908280 op 0x0:(READ) flags 0x0 phys_seg 64 prio class 2
Mar 2 01:31:48 Nexus kernel: ata10: EH complete

More importantly, the diagnostics are attached. I would appreciate it if somebody could cast an eye over them. I've taken Docker offline and paused the parity check.

nexus-diagnostics-20240302-0902.zip

JorgeB · March 2

It's logged as a disk error, but it may be corrected now. Try another parity check from the beginning, or run an extended SMART test on disk1.
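
For anyone cross-checking the "disk error" reading against the log: `UNC` together with sense key 0x3 and ASC 0x11 points to a medium error (unrecovered read), and the CDB with opcode 0x88 is a SCSI READ(16), which carries an 8-byte big-endian LBA in bytes 2-9 and a 4-byte transfer length in bytes 10-13. A minimal Python sketch, assuming that layout (the function name `decode_read16` is made up for illustration), decodes the two CDBs from the log:

```python
from typing import Tuple


def decode_read16(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, sector_count) from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("expected a 16-byte CDB starting with opcode 0x88")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    count = int.from_bytes(cdb[10:14], "big")   # bytes 10-13: transfer length in sectors
    return lba, count


# The two CDBs copied from the kernel log above.
for cdb in (
    "88 00 00 00 00 06 29 e8 84 78 00 00 01 80 00 00",
    "88 00 00 00 00 06 29 e8 85 f8 00 00 02 00 00 00",
):
    lba, count = decode_read16(cdb)
    print(f"LBA {lba}, {count} sectors, {count * 512} bytes")
```

The decoded LBAs (26472907896 and 26472908280) match the `I/O error, dev sdi, sector ...` lines, and the byte counts match the `dma 196608 in` and `dma 262144 in` fields, so both failures are reads of adjacent regions near the end of the disk.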

Geck0 · March 2

Hi Jorge,

I cancelled the existing parity check and put the array into maintenance mode. I'm currently running an extended SMART test. I'll revert after it completes. I have a new drive on standby if I need to swap it in.

Thanks for taking the time to respond.

Geck0 · March 4

Hi JorgeB,

The extended SMART test has completed; it came back as "completed without error".

The extended SMART test results:

The parity check completed today and came back with no issues. However, while running a backup of my Nextcloud data, I noticed in the logs that a number of Excel files had a different MD5 hash from the last backup. All of them are on disk1. I've only just found them and still need to compare them to see whether the server-side copies are the problem, as it may be the backup drive that's at fault here. Still, it makes me nervous that there may be other issues as well, because I don't back up the entire drive, just the important data.

I'm not great at reading SMART results. Is it worth swapping out the drive and performing a rebuild from parity?
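
One way to do that comparison is to hash both copies and list the files whose hashes differ. A rough Python sketch, assuming the server-side files and the backup are both reachable as ordinary directories (`md5_of` and `differing_files` are illustrative names, not part of Unraid or Nextcloud):

```python
import hashlib
from pathlib import Path
from typing import List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(server_root: Path, backup_root: Path) -> List[str]:
    """Return relative paths present in both trees whose MD5 hashes differ."""
    mismatched = []
    for src in sorted(server_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(server_root)
        dst = backup_root / rel
        # Files missing from the backup are skipped; only true mismatches are listed.
        if dst.is_file() and md5_of(src) != md5_of(dst):
            mismatched.append(rel.as_posix())
    return mismatched
```

It would be called with the two directory roots, for example `differing_files(Path("/path/to/server/data"), Path("/path/to/backup/data"))`, where both paths are placeholders. A mismatch only shows that the two copies differ; it does not say which copy is damaged, and a file that was legitimately edited since the last backup will also show up.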

Quote

It's logged as a disk error, but it may be corrected now. Try another parity check from the beginning, or run an extended SMART test on disk1.

Do you mean corrected from parity, or corrected by reallocating sectors? I'm not sure what happens in this instance, but have this concern that corrupted files have been written to parity. Any input would be appreciated.

JorgeB · March 4

6 minutes ago, Geck0 said:

but have this concern that corrupted files have been written to parity

No reason to think that. If the SMART test passed, the disk is OK for now. Keep monitoring it; if there are more errors in the near future, you may want to consider replacing it.

Geck0 · March 4

My thanks!

Geck0 · March 6

Hi JorgeB et al, I've had an interesting week.

Drive 5 started failing today. It kicked off with reallocated sectors, which increased from 17 to 126 within 4 hours and then to 215 after another 45 minutes.

It also reported 1 pending sector, which later returned to normal.

The disk then went offline after reporting "uncorrectable is 1" and entering "Disk 5 in error".
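
To put numbers on how quickly that count was climbing, here is a small Python sketch that turns the three readings into sectors per hour. The timings are approximate, taken from the wording of this post rather than from SMART timestamps:

```python
# (hours since the first reading, reallocated sector count).
# Times are approximate: "within 4 hours", then "another 45 mins".
readings = [(0.0, 17), (4.0, 126), (4.75, 215)]


def growth_rates(samples):
    """Return reallocated sectors per hour between consecutive readings."""
    return [
        (c1 - c0) / (t1 - t0)
        for (t0, c0), (t1, c1) in zip(samples, samples[1:])
    ]


rates = growth_rates(readings)
for rate in rates:
    print(f"{rate:.2f} reallocated sectors per hour")
```

That works out to roughly 27 sectors per hour over the first interval and roughly 119 per hour over the second, so the rate itself was accelerating rather than levelling off.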

Fortunately, I still had a brand-new 18TB drive on standby, already hooked up, and I've started a rebuild. The original disk can still be mounted, but I've left it alone for now in case the rebuild fails. I've not had two drives with errors in the same week before.

Can you advise if there is anything else I should consider? I'm not aware that a faulty cable or disk controller could cause this issue, I'm just wondering if there is anything else to look at? The two drives this week are both IronWolf Pro, purchased a couple of years apart. The one that is failing today is only a couple of years old. It failed the extended SMART test and dropped like a rock from there.

I'm starting to rethink the quality of Seagate's drives.

nexus-diagnostics-20240306-1704.zip

itimpi · March 6

1 hour ago, Geck0 said:

I'm not aware that a faulty cable or disk controller could cause this issue, I'm just wondering if there is anything else to look at?

The only possibility I can think of other than the disk itself failing might be some obscure power-related issue. Do not think, however, that could cause the rapidly increasing reallocated sectors value.

JorgeB · March 6

47 minutes ago, itimpi said:

Do not think, however, that could cause the rapidly increasing reallocated sectors value.

I have seen that before: bad power causing reallocated sectors. Most likely, though, it was just a bad disk. If it happens again on a different disk, then I would consider power as the cause.

Geck0 · March 6

Okay, thanks for replying guys. Appreciate it.
