johnsanc Posted January 30, 2020 (edited)

I recently invoked the mover script, and after a while I noticed that 2 disks had been marked as disabled due to write errors:

Disk 7 = 52 writes, 2252 errors (Are these read errors?)
Disk 15 = 546,971 writes, 731 errors

However, the mover was only moving files to Disk 15. I do have turbo write enabled; could Disk 7 read errors cause Disk 15 write errors? Also, could a cabling issue cause this? Both disks look fine in the SMART reports.

I should also mention that I just completed a parity check with about 90,000 sync errors due to a mysterious reboot that happened in the middle of the night. The server started back up and started a parity check, so I let it run. Would a parity check log errors for individual disks? If it was a cabling issue, I'm surprised I didn't get individual disk errors during the check.

Errors on Disk7:

Jan 29 22:14:03 Tower kernel: ata2.00: failed command: READ DMA EXT
Jan 29 22:14:03 Tower kernel: ata2.00: cmd 25/00:00:50:1f:34/00:04:88:01:00/e0 tag 28 dma 524288 in
Jan 29 22:14:03 Tower kernel: ata2.00: status: { DRDY }
Jan 29 22:14:03 Tower kernel: ata2: hard resetting link
Jan 29 22:14:13 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:14:13 Tower kernel: ata2: hard resetting link
Jan 29 22:14:23 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:14:23 Tower kernel: ata2: hard resetting link
Jan 29 22:14:58 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:14:58 Tower kernel: ata2: limiting SATA link speed to 3.0 Gbps
Jan 29 22:14:58 Tower kernel: ata2: hard resetting link
Jan 29 22:15:03 Tower kernel: ata2: softreset failed (1st FIS failed)
Jan 29 22:15:03 Tower kernel: ata2: reset failed, giving up
Jan 29 22:15:03 Tower kernel: ata2.00: disabled
Jan 29 22:15:03 Tower kernel: ata2: EH complete
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#29 CDB: opcode=0x88 88 00 00 00 00 01 88 34 23 50 00 00 05 40 00 00
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdc, sector 6580085584
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#30 CDB: opcode=0x88 88 00 00 00 00 01 88 34 28 90 00 00 05 40 00 00
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdc, sector 6580086928
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:15:03 Tower kernel: sd 3:0:0:0: [sdc] tag#28 CDB: opcode=0x88 88 00 00 00 00 01 88 34 1f 50 00 00 04 00 00 00
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdc, sector 6580084560

Errors on Disk15:

Jan 29 22:12:28 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 29 22:12:28 Tower kernel: ata1.00: failed command: WRITE DMA EXT
Jan 29 22:12:28 Tower kernel: ata1.00: cmd 35/00:00:30:cb:38/00:04:88:01:00/e0 tag 15 dma 524288 out
Jan 29 22:12:28 Tower kernel: ata1.00: status: { DRDY }
Jan 29 22:12:28 Tower kernel: ata1: hard resetting link
Jan 29 22:12:28 Tower kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 29 22:12:33 Tower kernel: ata1.00: qc timeout (cmd 0xec)
Jan 29 22:12:33 Tower kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 29 22:12:33 Tower kernel: ata1.00: revalidation failed (errno=-5)
Jan 29 22:12:33 Tower kernel: ata1: hard resetting link
Jan 29 22:12:43 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:12:43 Tower kernel: ata1: hard resetting link
Jan 29 22:12:53 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:12:53 Tower kernel: ata1: hard resetting link
Jan 29 22:13:28 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:13:28 Tower kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 29 22:13:28 Tower kernel: ata1: hard resetting link
Jan 29 22:13:33 Tower kernel: ata1: softreset failed (1st FIS failed)
Jan 29 22:13:33 Tower kernel: ata1: reset failed, giving up
Jan 29 22:13:33 Tower kernel: ata1.00: disabled
Jan 29 22:13:33 Tower kernel: ata1: EH complete
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] tag#12 CDB: opcode=0x8a 8a 00 00 00 00 01 88 38 cb 30 00 00 04 00 00 00
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580390704
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Read Capacity(16) failed: Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Sense not available.
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Read Capacity(10) failed: Result: hostbyte=0x04 driverbyte=0x00
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] Sense not available.
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] 0 512-byte logical blocks: (0 B/0 B)
Jan 29 22:13:33 Tower kernel: sd 2:0:0:0: [sdb] 4096-byte physical blocks
Jan 29 22:13:33 Tower kernel: sdb: detected capacity change from 10000831348736 to 0
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580391728
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580392056
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580393088
Jan 29 22:13:33 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580084024
Jan 29 22:15:03 Tower kernel: print_req_error: I/O error, dev sdb, sector 6580392056

Edited January 30, 2020 by johnsanc
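As a cross-check on the log above, the sector numbers in the print_req_error lines can be recovered from the CDB bytes logged just before them. The following is a minimal Python sketch, assuming the standard 16-byte READ(16)/WRITE(16) layout (opcode in byte 0, big-endian LBA in bytes 2-9, transfer length in bytes 10-13). The function name decode_cdb16 is made up for illustration and is not part of any Unraid or kernel tool.

```python
# Opcode names for the two 16-byte commands that appear in the log.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)"}


def decode_cdb16(hex_bytes: str) -> dict:
    """Decode a 16-byte SCSI CDB given as space-separated hex bytes.

    Hypothetical helper: it only understands the READ(16)/WRITE(16)
    layout and reports any other opcode as a raw hex value.
    """
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 16:
        raise ValueError(f"expected 16 bytes, got {len(cdb)}")
    return {
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        # Bytes 2..9 hold the starting logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13 hold the number of blocks to transfer.
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


if __name__ == "__main__":
    # CDB bytes copied from the tag#29 (sdc) and tag#12 (sdb) log lines.
    print(decode_cdb16("88 00 00 00 00 01 88 34 23 50 00 00 05 40 00 00"))
    print(decode_cdb16("8a 00 00 00 00 01 88 38 cb 30 00 00 04 00 00 00"))
```

For tag#29 on sdc this decodes to a READ(16) at LBA 6580085584, the same sector the kernel reports, and the failed sdb command decodes to a WRITE(16) at sector 6580390704. That matches the picture of reads failing on one disk and writes failing on the other.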
JorgeB Posted January 30, 2020

Both disks dropped offline. When Unraid gets a read error, it first tries to write that sector back; with the disk offline that write fails too, which is why Unraid disables the disk even if nothing was actively writing to it. Disks dropping offline like that is usually cable/power related; complete diagnostics could give some more clues.
johnsanc Posted January 30, 2020

Attached are my diagnostics from this morning as well as the relevant portion of my syslog since the time I started mover (a few reboots ago). Just skimming through it, I can see there are a few issues, but any interpretation and recommendations are welcome.

Also, not sure if it's related, but I've had weird issues lately with this new hardware (X570 board) where Unraid sometimes gets stuck in a boot loop and won't make it past extracting /bzimage. I never had these issues with my older hardware just a few weeks ago. I suppose the flash could be dying, but it would be a weird coincidence for the flash to start dying right when I get new hardware.

Attachments: syslog-johnsanc.log, tower-diagnostics-20200130-0921.zip
JorgeB Posted January 30, 2020

Problem with the Asmedia controller (both affected disks are connected there), possibly related to virtualization:

Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80000 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80180 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80280 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80380 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80480 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80580 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80600 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80700 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80800 flags=0x0020]
Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80900 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80980 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80a80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80b80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80c80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80d80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80e80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff80f80 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff81000 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff81180 flags=0x0020]
Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=26:00.0 domain=0x0000 address=0x000007fffff81280 flags=0x0020]
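To check whether every IO_PAGE_FAULT in a syslog points at the same controller, the lines can be tallied by PCI address. The following is a rough Python sketch, assuming only the two message shapes quoted in this thread (the "ahci 0000:26:00.0:" prefix and the "AMD-Vi: ... device=26:00.0" form). The names FAULT_RE and count_page_faults are made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches both message shapes seen in the thread; the PCI address is taken
# either from the "ahci 0000:BB:DD.F:" prefix or from "device=BB:DD.F".
FAULT_RE = re.compile(
    r"(?:ahci (?:[0-9a-f]{4}:)?(?P<dev_a>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])|AMD-Vi)"
    r": Event logged \[IO_PAGE_FAULT "
    r"(?:device=(?P<dev_b>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) )?"
    r"domain=0x[0-9a-f]+ address=0x[0-9a-f]+"
)


def count_page_faults(lines):
    """Return a Counter of IO_PAGE_FAULT events keyed by PCI bus:dev.fn."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            device = match.group("dev_a") or match.group("dev_b")
            if device:
                counts[device] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the log above, plus one unrelated line.
    sample = [
        "Jan 29 22:11:37 Tower kernel: ahci 0000:26:00.0: Event logged "
        "[IO_PAGE_FAULT domain=0x0000 address=0x000007fffff80000 flags=0x0020]",
        "Jan 29 22:11:37 Tower kernel: AMD-Vi: Event logged "
        "[IO_PAGE_FAULT device=26:00.0 domain=0x0000 "
        "address=0x000007fffff80980 flags=0x0020]",
        "Jan 29 22:12:28 Tower kernel: ata1: hard resetting link",
    ]
    print(count_page_faults(sample))
```

Run against the full syslog (for example with open("syslog-johnsanc.log") as the input), a single key in the result would support the reading that one controller is the source of the faults.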
johnsanc Posted January 30, 2020

Interesting - I wonder why virtualization would cause an issue with that. I obviously am not passing that through to my VM. Are there any BIOS settings or anything I should look into?
JorgeB Posted January 30, 2020

johnsanc said: Interesting - I wonder why virtualization would cause an issue with that.

I don't know, but it's not the first time I've seen similar errors on AMD-based systems. A BIOS update might help.
johnsanc Posted January 31, 2020 (edited)

Thanks, I am already using the latest BIOS. I suppose there are still a few kinks to work out with X570. In the meantime I tried this: https://forum.level1techs.com/t/devops-workstation-fixing-nvme-trim-on-linux/148354

I did not have a TRIM job going at the time, but I figured it's worth a shot to try it anyway before I replace cables. Also, I believe all of the errors around 5:30 AM were related to TRIM of my cache pool. It seems as if writes are blocked while TRIM is running. I turned off Docker and re-ran TRIM, and there were no errors.

The weird boot loop also seems to be resolved as long as I use legacy boot instead of UEFI.

Edited January 31, 2020 by johnsanc