Jump to content

Multiple drive read errors during parity data rebuild after drive swap


Go to solution Solved by JorgeB,

Recommended Posts

I'm working on adding 2 new 16TB disks to my array. I ran the preclear script on both new drives without issue. 

 

One drive is slated to upgrade parity from 8TB to 16TB. The other drive will be added as additional data. 

 

I installed one of the drives. After booting up, I swapped the existing parity drive and the new drive, and started the array/rebuild. The existing parity drive is in unassigned devices. I let it run and went to bed. 

 

I got up to a long string of unraid notifications regarding read errors. 3 existing drives are showing UDMA CRC errors in SMART, with these amounts: 97, 232, 1. 

 

The parity drive shows 1024 errors in the unraid array web interface. The parity drive disk log shows numerous errors, starting with: 

 

Jul 12 01:42:29 solidsnake kernel: ata9.00: exception Emask 0x10 SAct 0x400000 SErr 0x480000 action 0x6 frozen
Jul 12 01:42:29 solidsnake kernel: ata9.00: irq_stat 0x08000000, interface fatal error
Jul 12 01:42:29 solidsnake kernel: ata9: SError: { 10B8B Handshk }
Jul 12 01:42:29 solidsnake kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Jul 12 01:42:29 solidsnake kernel: ata9.00: cmd 61/50:b0:a8:76:08/03:00:c0:00:00/40 tag 22 ncq dma 434176 out
Jul 12 01:42:29 solidsnake kernel: ata9.00: status: { DRDY }
Jul 12 01:42:29 solidsnake kernel: ata9: hard resetting link
Jul 12 01:42:39 solidsnake kernel: ata9: softreset failed (1st FIS failed)
Jul 12 01:42:39 solidsnake kernel: ata9: hard resetting link
Jul 12 01:42:49 solidsnake kernel: ata9: softreset failed (1st FIS failed)
Jul 12 01:42:49 solidsnake kernel: ata9: hard resetting link
Jul 12 01:43:24 solidsnake kernel: ata9: softreset failed (1st FIS failed)
Jul 12 01:43:24 solidsnake kernel: ata9: limiting SATA link speed to 3.0 Gbps
Jul 12 01:43:24 solidsnake kernel: ata9: hard resetting link
Jul 12 01:43:29 solidsnake kernel: ata9: softreset failed (1st FIS failed)
Jul 12 01:43:29 solidsnake kernel: ata9: reset failed, giving up
Jul 12 01:43:29 solidsnake kernel: ata9.00: disabled
Jul 12 01:43:29 solidsnake kernel: ata9: EH complete
Jul 12 01:43:29 solidsnake kernel: sd 9:0:0:0: [sdh] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Jul 12 01:43:29 solidsnake kernel: sd 9:0:0:0: [sdh] tag#2 CDB: opcode=0x8a 8a 00 00 00 00 00 c0 08 87 88 00 00 04 a0 00 00
Jul 12 01:43:29 solidsnake kernel: blk_update_request: I/O error, dev sdh, sector 3221784456 op 0x1:(WRITE) flags 0x0 phys_seg 148 prio class 0

 

After some initial research, the CRC errors are indicative of a physical failure, like a cable. Because I had several drives blow up at once, I'm thinking that maybe I overloaded the sata controller? 

 

I have syslog running, and it looks like the very first error was on ata4, and then everything started blowing up from there. 

 

I'm shutting down the server for now. My recovery idea is to just swap the existing parity drive back in place, removing the new drive, and let the data rebuild to the previous good state. However, I'm apprehensive because I suspect this is not a drive problem, but a motherboard problem.

 

I've never had an issue before today. This system has been live for about two years, with multiple drive upgrades happening in the past. 

 

I'd appreciate any and all advice on next steps. 

 

 

 

image.png

image.png

image.png

image.png

solidsnake-diagnostics-20220712-0649.zip

Edited by Apeiron
added diag
Link to comment

Thank you for your response. I have two questions. 

 

1) Should I attempt to get the array back to a known working state? That would involve removing the new drive, and putting the old parity drive back in place. Once I'm sure everything is working without errors, I can try to swap again. The alternative would be to just swap the cables, and reattempt using the new drive.

 

2) Excuse my ignorance, but if it was a single failed cable, that makes sense to me. Having multiple failures at once doesn't lead me to believe that it is only cables. That's just my intuition talking. Is there any other testing I could perform to help try to isolate any issue? I have a large pile of used sata cables, but I'd order a complete set of new ones if I can confirm that is the problem.

 

edit: ended up ordering a full set of new sata cables anyways. Cheap enough to rule it out as a factor. 

Edited by Apeiron
spelling
Link to comment
  • Solution

For now I would replace the cables and try to sync the new parity again.

 

5 minutes ago, Apeiron said:

Having multiple failures at once doesn't lead me to believe that it is only cables.

CRC errors are usually the SATA cable, for parity and like mentioned need to see the SMART report first, but for now looks like power/connection problem.

  • Like 1
Link to comment
On 7/12/2022 at 7:39 AM, JorgeB said:

Logs are full of ATA errors for disks 1 and 4, check/replace cables, parity dropped offline, probably also a power/connection problem, but would need a SMART report after the disk comes back online to confirm.

The results won't be too scientific, as I made multiple changes at once. I have parity back, with the new 16TB drive. No other disk changes have been made yet. I ran SMART short self-tests on all disks which passed. It does not appear that the CRC error rate increased since the first instance of the problem. Looking at some historical logs, I noticed my UPS has been engaging for a second or two somewhat regularly (once every few days to once daily). I attribute the fluctuations to the local power service constantly adjusting loads during this summer heat. I don't have a correlating UPS log at the time of the read errors, so probably just an anecdote, but maybe not outside of realm of possibility that I had a power issue. 

  1. Added several more cooling fans I had laying around and cleaned the case. The case wasn't very dirty, but hard disk temps would go ~41+ during parity checks. Temps dropped to 36 max during latest parity check. 
  2. Split 2 SATA cables from the motherboard to PCI Syba SI-PEX40139 SATA card. Currently 4x SATA, 3x PCI, previously 6/1.
  3. Matched all SATA cables by manufacturer and connector
  4. Arranged cables to reduce cable touching on long parallel SATA runs
  5. Updated PSU - HX750

On boot up, a read check was executed on all data disks, while the parity was still disabled. After the read check succeeded, I started the parity rebuild, which was also successful. 

 

Next step is to replace one of the data drives with the old parity drive, and then add a new drive to the array. 

 

Finally, posting the diagnostics again, with new smart reports, in case anything may look awry. Thanks for the help. 

solidsnake-diagnostics-20220715-1541.zip

Link to comment

It seems I was premature. I'm not getting a CRC error, but a different error now. I had gotten full parity back with no apparent issues. So I swapped out a drive, and started the data rebuild process again. Several hours in, I received the following errors and then it stabilized and continued. The rebuild is still running right now. 

 

I've been looking into this error using keywords AMD-Vi and IO_PAGE_FAULT, and there are a lot of interesting results. I've seen some comments around IOMMU settings, PCIexpress lanes, BIOS upgrades and kernel bugs. Notably, several comments were found related to my system, Ryzen 2700X and ASMedia X470, having SATA issues with some configurations. 

 

Jul 16 01:00:04 solidsnake kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x407d4db010608040 flags=0x0030]
Jul 16 01:00:35 solidsnake kernel: ata2.00: exception Emask 0x0 SAct 0x80fe7fc0 SErr 0x0 action 0x6 frozen

 

Followed by: 

 

Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/20:30:38:49:7d/02:00:f9:01:00/40 tag 6 ncq dma 278528 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/58:38:58:4b:7d/02:00:f9:01:00/40 tag 7 ncq dma 307200 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/10:40:b0:4d:7d/02:00:f9:01:00/40 tag 8 ncq dma 270336 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/10:48:c0:4f:7d/02:00:f9:01:00/40 tag 9 ncq dma 270336 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/30:50:d0:51:7d/01:00:f9:01:00/40 tag 10 ncq dma 155648 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/f8:58:00:53:7d/01:00:f9:01:00/40 tag 11 ncq dma 258048 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/08:60:f8:54:7d/01:00:f9:01:00/40 tag 12 ncq dma 135168 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/c0:68:00:56:7d/01:00:f9:01:00/40 tag 13 ncq dma 229376 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/40:70:c0:57:7d/01:00:f9:01:00/40 tag 14 ncq dma 163840 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/d8:88:00:59:7d/01:00:f9:01:00/40 tag 17 ncq dma 241664 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/a0:90:d8:5a:7d/01:00:f9:01:00/40 tag 18 ncq dma 212992 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/20:98:78:5c:7d/01:00:f9:01:00/40 tag 19 ncq dma 147456 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/e0:a0:98:5d:7d/00:00:f9:01:00/40 tag 20 ncq dma 114688 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/d8:a8:78:5e:7d/02:00:f9:01:00/40 tag 21 ncq dma 372736 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/68:b0:50:61:7d/02:00:f9:01:00/40 tag 22 ncq dma 315392 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/60:b8:b8:63:7d/02:00:f9:01:00/40 tag 23 ncq dma 311296 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2.00: failed command: READ FPDMA QUEUED
Jul 16 01:00:35 solidsnake kernel: ata2.00: cmd 60/20:f8:18:66:7d/03:00:f9:01:00/40 tag 31 ncq dma 409600 in
Jul 16 01:00:35 solidsnake kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 16 01:00:35 solidsnake kernel: ata2.00: status: { DRDY }
Jul 16 01:00:35 solidsnake kernel: ata2: hard resetting link
Jul 16 01:00:40 solidsnake kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 16 01:00:40 solidsnake kernel: ata2.00: configured for UDMA/133
Jul 16 01:00:40 solidsnake kernel: ata2.00: device reported invalid CHS sector 0
Jul 16 01:00:40 solidsnake kernel: ata2.00: device reported invalid CHS sector 0
Jul 16 01:00:40 solidsnake kernel: ata2.00: device reported invalid CHS sector 0
Jul 16 01:00:40 solidsnake kernel: ata2.00: device reported invalid CHS sector 0
Jul 16 01:00:40 solidsnake kernel: ata2.00: device reported invalid CHS sector 0
Jul 16 01:00:40 solidsnake kernel: ata2: EH complete

 

My next NAS build is definitely going to be a real server. Thoughts on these errors? 

 

Note: can't seem to unmark this as solved, so I can start a new thread if necessary. Thank you. 

 

solidsnake-diagnostics-20220716-0137.zip

Link to comment
Jul 16 01:00:04 solidsnake kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x407d4db010608040 flags=0x0030]

 

Problem with the onboard SATA controller, quite common with some Ryzen servers, especially under load, look for a BIOS update but best bet it use an add-on controller.

  • Like 1
Link to comment
  • 2 weeks later...
On 7/16/2022 at 2:58 AM, JorgeB said:
Jul 16 01:00:04 solidsnake kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x407d4db010608040 flags=0x0030]

 

Problem with the onboard SATA controller, quite common with some Ryzen servers, especially under load, look for a BIOS update but best bet it use an add-on controller.

 

I finally received my second Sata controller. I used the same one I had previously, SI-PEX40139, so now I have 2 of these cards installed. I have 8 total hard drives, and split them to 4 drives on each card. Nothing is connected via the mobo Sata ports. I have (1) 1TB M.2 drive installed as cache. Lastly, the Unraid USB. 

 

After getting the server booted up, I added the new drive as a 7th data drive, and started the array in maintenance mode. I immediately started getting I/O buffer errors again. These started popping up without any load on the system. I have not started the formatting of the new drive yet.

 

I've now replaced the PSU, all Sata cables, added an additional sata pcie card, added cooling fans and cleaned everything. The drive I am adding is new, and no errors were detected during the preclear. 

 

Jul 24 19:09:26 solidsnake emhttpd: Opening encrypted volumes...
Jul 24 19:09:26 solidsnake kernel: Buffer I/O error on dev md7, logical block 0, async page read

 

Suggestions for next steps or isolating the issue? Thank you. 

 

solidsnake-diagnostics-20220724-1921.zip

Edited by Apeiron
grammar
Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...