Write Errors without Writing?

johnsanc · January 30, 2020

I recently invoked the mover script, and after a while I noticed that 2 disks had been marked as disabled due to write errors:

Disk 7 = 52 writes, 2252 errors (Are these read errors?)
Disk 15 = 546,971 writes, 731 errors

However, the mover was only moving files to Disk 15. I do have turbo write enabled. Could read errors on Disk 7 cause write errors on Disk 15?

Also, could a cabling issue cause this? Both disks look fine in the SMART reports.

I should also mention that I just completed a parity check with about 90,000 sync errors, following a mysterious reboot that happened in the middle of the night. The server came back up and started a parity check, so I let it run. Would a parity check log errors for individual disks? If this were a cabling issue, I'm surprised I didn't get individual disk errors during the check.

Errors on Disk7:

Jan 29 22:14:03 Tower kernel: ata2.00: failed command: READ DMA EXT
Jan 29 22:14:03 Tower kernel: ata2.00: cmd 25/00:00:50:1f:34/00:04:88:01:00/e0 tag 28 dma 524288 in
Jan 29 22:14:03 Tower kernel: ata2.00: status: { DRDY }
Jan 29 22:14:03 Tower kernel: ata2: hard resetting link
Jan 29 22:14:13 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:14:13 Tower kernel: ata2: hard resetting link
Jan 29 22:14:23 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:14:23 Tower kernel: ata2: hard resetting link
Jan 29 22:14:58 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:14:58 Tower kernel: ata2: limiting SATA link speed to 3.0 Gbps
Jan 29 22:14:58 Tower kernel: ata2: hard resetting link
Jan 29 22:15:03 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:15:03 Tower kernel: ata2: reset failed, giving up
Jan 29 22:15:03 Tower kernel: ata2.00: disabled
Jan 29 22:15:03 Tower kernel: ata2: EH complete
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#29 CDB: opcode=0x88 88 00 00 00 00 01 88 34 23 50 00 00 05 40 00 00
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdc, sector 6580085584
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#30 CDB: opcode=0x88 88 00 00 00 00 01 88 34 28 90 00 00 05 40 00 00
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdc, sector 6580086928
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#28 CDB: opcode=0x88 88 00 00 00 00 01 88 34 1f 50 00 00 04 00 00 00
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdc, sector 6580084560
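For anyone wondering whether a given failure was a read or a write, the `cmd` field of the libata report can be decoded by hand. The following is a minimal sketch, assuming the field order the kernel uses for that message (command, then the low taskfile registers, then the high-order "hob" registers, then the device byte). The function name and the tiny opcode table are illustrative only and cover just the two commands that appear in this thread.

```python
# Field order assumed from the kernel's libata error message:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
ATA_COMMANDS = {0x25: "READ DMA EXT", 0x35: "WRITE DMA EXT"}  # only the two seen here


def decode_taskfile(cmd: str) -> dict:
    """Decode the 'cmd' field of a libata 'failed command' report."""
    command, low, high, _device = cmd.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    opcode = int(command, 16)
    # 48-bit LBA: the "hob" registers hold the upper 24 bits.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    # A sector count of 0 has a special meaning in 48-bit commands; not handled here.
    sectors = hob_nsect << 8 | nsect
    return {
        "command": ATA_COMMANDS.get(opcode, hex(opcode)),
        "lba": lba,
        "sectors": sectors,
    }


# The Disk 7 (ata2) command quoted above.
print(decode_taskfile("25/00:00:50:1f:34/00:04:88:01:00/e0"))
```

The decoded LBA lines up with the sector in the last `print_req_error` line above, and the sector count times 512 bytes gives the `dma 524288` figure from the same log line, so the failed command on this disk was a read.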

Errors on Disk15:

Jan 29 22:12:28 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 29 22:12:28 Tower kernel: ata1.00: failed command: WRITE DMA EXT
Jan 29 22:12:28 Tower kernel: ata1.00: cmd 35/00:00:30:cb:38/00:04:88:01:00/e0 tag 15 dma 524288 out
Jan 29 22:12:28 Tower kernel: ata1.00: status: { DRDY }
Jan 29 22:12:28 Tower kernel: ata1: hard resetting link
Jan 29 22:12:28 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 29 22:12:33 Tower kernel: ata1.00: qc timeout (cmd 0xec)
Jan 29 22:12:33 Tower kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 29 22:12:33 Tower kernel: ata1.00: revalidation failed (errno=-5)
Jan 29 22:12:33 Tower kernel: ata1: hard resetting link
Jan 29 22:12:43 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:12:43 Tower kernel: ata1: hard resetting link
Jan 29 22:12:53 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:12:53 Tower kernel: ata1: hard resetting link
Jan 29 22:13:28 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:13:28 Tower kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 29 22:13:28 Tower kernel: ata1: hard resetting link
Jan 29 22:13:33 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:13:33 Tower kernel: ata1: reset failed, giving up
Jan 29 22:13:33 Tower kernel: ata1.00: disabled
Jan 29 22:13:33 Tower kernel: ata1: EH complete
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] tag#12 CDB: opcode=0x8a 8a 00 00 00 00 01 88 38 cb 30 00 00 04 00 00 00
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580390704
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Read Capacity(16) failed: Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Sense not available.
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Read Capacity(10) failed: Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Sense not available.
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] 0 512-byte logical blocks: (0 B/0 B)
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] 4096-byte physical blocks
Jan 29 22:13:33 Tower kernel: sdb: detected capacity change from 10000831348736 to 0
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580391728
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580392056
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580393088
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580084024
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580392056
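The `CDB:` lines in the two logs carry the same information at the SCSI layer. As a rough sketch, assuming the standard 16-byte CDB layout (opcode in byte 0, a big-endian 64-bit LBA in bytes 2 to 9, transfer length in bytes 10 to 13), the helper below tells a READ(16) from a WRITE(16) and recovers the starting sector. The function name is made up for illustration, and only the two opcodes seen in this thread are mapped.

```python
SCSI_OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}  # only the opcodes seen in this thread


def decode_cdb16(hex_bytes: str) -> dict:
    """Decode a 16-byte SCSI CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(hex_bytes)  # spaces between byte pairs are ignored
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")
    return {
        "opcode": SCSI_OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:10], "big"),      # starting logical block
        "blocks": int.from_bytes(cdb[10:14], "big"),  # transfer length in blocks
    }


# One CDB from each disk, copied from the logs above.
print("sdc:", decode_cdb16("88 00 00 00 00 01 88 34 23 50 00 00 05 40 00 00"))
print("sdb:", decode_cdb16("8a 00 00 00 00 01 88 38 cb 30 00 00 04 00 00 00"))
```

On these two samples the `sdc` request decodes as a read and the `sdb` request as a write, and each decoded LBA matches the sector reported in the `print_req_error` line that follows it in the log.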

Edited January 30, 2020 by johnsanc

JorgeB · January 30, 2020

Both disks dropped offline. When Unraid gets a read error, it first tries to write that sector back, and if that write fails the disk gets disabled, even if nothing was actively writing to either disk.

Disks dropping offline like that is usually cable or power related. Complete diags could give some more clues.
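To see at a glance which ports dropped and when, the syslog can be summarized per ATA port. This is only a sketch: the regular expression assumes the timestamp and `Tower kernel:` prefix format shown in the logs above, the dictionary keys are arbitrary names, and the sample lines are a small subset copied from the thread.

```python
import re

# Matches e.g. "Jan 29 22:14:03 Tower kernel: ata2.00: failed command: READ DMA EXT"
LINE = re.compile(r"^(\w{3} +\d+ [\d:]+) \S+ kernel: ata(\d+)(?:\.\d+)?: (.*)$")


def summarize_ports(lines):
    """Collect the first failed command, failed resets and disable time per ATA port."""
    ports = {}
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        timestamp, port, msg = m.group(1), int(m.group(2)), m.group(3)
        info = ports.setdefault(
            port, {"failed_command": None, "softresets_failed": 0, "disabled_at": None}
        )
        if msg.startswith("failed command: ") and info["failed_command"] is None:
            info["failed_command"] = msg[len("failed command: "):]
        elif msg.startswith("softreset failed"):
            info["softresets_failed"] += 1
        elif msg == "disabled":
            info["disabled_at"] = timestamp
    return ports


SAMPLE = [
    "Jan 29 22:12:28 Tower kernel: ata1.00: failed command: WRITE DMA EXT",
    "Jan 29 22:12:43 Tower kernel: ata1: softreset failed (1st FIS failed)",
    "Jan 29 22:13:33 Tower kernel: ata1.00: disabled",
    "Jan 29 22:14:03 Tower kernel: ata2.00: failed command: READ DMA EXT",
    "Jan 29 22:14:13 Tower kernel: ata2: softreset failed (1st FIS failed)",
    "Jan 29 22:15:03 Tower kernel: ata2.00: disabled",
]

for port, info in sorted(summarize_ports(SAMPLE).items()):
    print(f"ata{port}: {info}")
```

Run against a full syslog (for example, the lines of the file read with `open(...)`), this makes it easier to check whether several ports on the same controller failed within a short window, which points away from a single bad disk.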

johnsanc · January 30, 2020

Attached are my diagnostics from this morning as well as the relevant portion of my syslog since the time I started mover (a few reboots ago).

Just skimming through it, I can see there are a few issues, but any interpretation and recommendations are welcome.

Also, not sure if it's related or not, but I've had weird issues lately with this new hardware (X570 board) where Unraid sometimes gets stuck in a boot loop and won't make it past extracting /bzimage. I never had these issues with my older hardware just a few weeks ago. I suppose the flash drive could be dying, but it would be a weird coincidence for the flash to start dying right when I get new hardware.

syslog-johnsanc.log tower-diagnostics-20200130-0921.zip

JorgeB · January 30, 2020

Problem with the Asmedia controller (both affected disks are connected there), possibly related to virtualization:

Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80000 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80180 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80280 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80380 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80480 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80580 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80600 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80700 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80800 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80900 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80980 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80a80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80b80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80c80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80d80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80e80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80f80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff81000 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff81180 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff81280 flags=0x0020]
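To check whether every page fault really comes from the same PCI device, the events can be grouped by device address. The sketch below assumes only the two message shapes quoted above (one prefixed with the driver name and full PCI address, one prefixed with `AMD-Vi:` and carrying a `device=` field). The function name is illustrative, and the script does not know which controller sits at an address; that mapping comes from the diagnostics.

```python
import re
from collections import defaultdict

# Optional device= field (AMD-Vi form), then the faulting address.
FAULT = re.compile(
    r"Event logged \[IO_PAGE_FAULT (?:device=(?P<dev>\S+) )?"
    r"domain=\S+ address=(?P<addr>0x[0-9a-f]+)"
)
# Driver-prefixed form, e.g. "kernel: ahci 0000:26:00.0: Event logged".
PCI_PREFIX = re.compile(
    r"kernel: \S+ (?:[0-9a-f]{4}:)?(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): Event logged"
)


def faults_by_device(lines):
    """Group IO_PAGE_FAULT addresses by PCI device (bus:device.function)."""
    faults = defaultdict(list)
    for line in lines:
        m = FAULT.search(line)
        if not m:
            continue
        dev = m.group("dev")
        if dev is None:
            prefix = PCI_PREFIX.search(line)
            dev = prefix.group("dev") if prefix else "unknown"
        faults[dev].append(int(m.group("addr"), 16))
    return dict(faults)


SAMPLE = [
    "Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged "
    "[IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80000 flags=0x0020]",
    "Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged "
    "[IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80980 flags=0x0020]",
]

for dev, addrs in faults_by_device(SAMPLE).items():
    print(f"{dev}: {len(addrs)} faults, {min(addrs):#x} to {max(addrs):#x}")
```

If all faults land on one device and their timestamps precede the ATA resets on the attached ports, that supports a controller or IOMMU problem over bad cables.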

johnsanc · January 30, 2020

Interesting - I wonder why virtualization would cause an issue with that. I obviously am not passing that through to my VM. Are there any BIOS settings or anything I should look into?

JorgeB · January 30, 2020

11 minutes ago, johnsanc said:

Interesting - I wonder why virtualization would cause an issue with that.

Don't know, but it's not the first time I've seen similar errors with AMD-based systems. A BIOS update might help.

johnsanc · January 31, 2020

Thanks, I am already using the latest BIOS. I suppose there are still a few kinks to work out with X570. In the meantime I tried this: https://forum.level1techs.com/t/devops-workstation-fixing-nvme-trim-on-linux/148354

I did not have a TRIM job going at the time, but I figured it's worth a shot to try it anyway before I replace cables.

Also, I believe all of the errors around 5:30 AM were related to TRIM of my cache pool. It seems as if writes are blocked while TRIM is running. I turned off Docker, re-ran TRIM, and got no errors.

The weird boot loop also seems to be resolved as long as I use legacy boot instead of UEFI.

Edited January 31, 2020 by johnsanc
