Lumpy_BD Posted November 3, 2021
My monthly parity check is currently running, and it has reported 165 errors so far. This is the first time I've ever had any errors reported, and there have been no shutdowns, clean or otherwise, in several months. Can anyone give me some pointers on how to evaluate the cause and severity of the issue? This is unfamiliar territory for me. Thanks. P.S. I'm happy to provide the diagnostics too if that helps.
trurl Posted November 3, 2021
8 minutes ago, Lumpy_BD said: P.s. I'm happy to provide the diagnostics too if that helps.
Probably not going to get any useful responses without them.
Lumpy_BD Posted November 3, 2021 (Author)
4 hours ago, trurl said: Probably not going to get any useful responses without them.
Yeah fair enough lol... tower-diagnostics-20211103-1644.zip
JorgeB Posted November 3, 2021
So far only parity2 was out of sync. That suggests the cause is not a data disk or RAM, since either of those would put both parity drives out of sync. I would start by replacing the Asmedia controller that uses port multipliers: it's generating some ATA errors and, coincidentally or not, parity2 is connected to it. A recommended controller should also improve performance; the last check took almost 3 days, which is a lot longer than it should, though you were using the array at the time. P.S. Please disable mover logging unless you need it to troubleshoot the mover, since it generates a lot of log spam.
trurl Posted November 3, 2021
Nov 2 22:03:59 Tower kernel: md: recovery thread: Q corrected, sector=7038653368
Nov 2 22:03:59 Tower kernel: md: recovery thread: Q corrected, sector=7038653376
Looks like you have correcting parity checks scheduled. It is recommended that you only correct parity once you have determined that it needs correcting. You don't want a problem with another disk to corrupt parity. And you do have a problem with a disk:
Nov 2 19:21:52 Tower kernel: ata5.01: failed command: WRITE FPDMA QUEUED
Nov 2 19:21:52 Tower kernel: ata5.01: cmd 61/20:a8:30:1d:ce/00:00:bc:02:00/40 tag 21 ncq dma 16384 out
Nov 2 19:21:52 Tower kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 2 19:21:52 Tower kernel: ata5.01: status: { DRDY }
Nov 2 19:21:53 Tower kernel: ata5.15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 2 19:21:53 Tower kernel: ata5.00: SATA link up 6.0 Gbps (SStatus 133 SControl 330)
Nov 2 19:21:53 Tower kernel: ata5.01: hard resetting link
Looks like a connection issue. Unfortunately I can't tell which disk it is, because you haven't rebooted in a long time and that information has "rotated" off the syslogs included in the diagnostics. You probably have some older syslogs in /var/log (syslog.3, syslog.4...) that might tell. You should disable mover logging unless you are trying to diagnose an issue with mover, since those entries in the syslog are not anonymized.
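For anyone wanting to repeat this kind of check on their own logs, here is a minimal Python sketch that counts parity corrections per parity disk and failed ATA commands per link. The regular expressions are modelled only on the lines quoted above; the `P` and `PQ` variants of the "corrected" message are an assumption (only `Q corrected` appears in this thread), and `summarize_syslog` is a hypothetical helper name, not part of any Unraid tool.

```python
import re
from collections import Counter

# Modelled on the log lines quoted in this thread.
# ASSUMPTION: "P corrected" / "PQ corrected" follow the same shape as "Q corrected".
PARITY_RE = re.compile(r"md: recovery thread: (PQ|P|Q) corrected, sector=\d+")
ATA_FAIL_RE = re.compile(r"(ata\d+\.\d+): failed command: ")


def summarize_syslog(lines):
    """Return (parity corrections per parity disk, failed ATA commands per link)."""
    parity = Counter()
    ata_failures = Counter()
    for line in lines:
        m = PARITY_RE.search(line)
        if m:
            parity[m.group(1)] += 1
            continue
        m = ATA_FAIL_RE.search(line)
        if m:
            ata_failures[m.group(1)] += 1
    return parity, ata_failures


# Sample lines copied from the post above.
SAMPLE = """\
Nov 2 19:21:52 Tower kernel: ata5.01: failed command: WRITE FPDMA QUEUED
Nov 2 19:21:52 Tower kernel: ata5.01: status: { DRDY }
Nov 2 22:03:59 Tower kernel: md: recovery thread: Q corrected, sector=7038653368
Nov 2 22:03:59 Tower kernel: md: recovery thread: Q corrected, sector=7038653376
""".splitlines()

if __name__ == "__main__":
    parity, ata_failures = summarize_syslog(SAMPLE)
    print("parity corrections:", dict(parity))
    print("failed ATA commands:", dict(ata_failures))
```

To run it against real logs, pass the lines of /var/log/syslog and any rotated files (syslog.3, syslog.4...) instead of `SAMPLE`. If every correction is `Q` and the failures cluster on a single link, that fits the reading that parity2 and its connection are the place to look.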
trurl Posted November 3, 2021
5 minutes ago, JorgeB said: parity2 is connected there
How can you tell?
Lumpy_BD Posted November 3, 2021 (Author)
Thanks for the help all. I hadn't realised I'd left mover logging enabled. This is now disabled. I've also turned off corrective parity checks. Would there be any benefit now to rebooting and starting a new parity check to see if the results are any different?
JorgeB Posted November 3, 2021
11 minutes ago, trurl said: How can you tell?
lsscsi in the diags lists all the devices, and you can cross-check against the device letters, for example in the SMART reports. parity2 is sdh:
[6:0:0:0] disk ATA WDC WD40EFRX-68W 0A80 /dev/sdg /dev/sg7
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/6:0:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata5/host6/target6:0:0/6:0:0:0]
[6:1:0:0] disk ATA WDC WD120EMFZ-11 0A81 /dev/sdh /dev/sg8
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=1 timeout=30
  dir: /sys/bus/scsi/devices/6:1:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata5/host6/target6:1:0/6:1:0:0]
[6:2:0:0] disk ATA WDC WD120EDAZ-11 0A81 /dev/sdi /dev/sg9
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/6:2:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata5/host6/target6:2:0/6:2:0:0]
[7:0:0:0] disk ATA WDC WD120EMFZ-11 0A81 /dev/sdj /dev/sg10
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/7:0:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata6/host7/target7:0:0/7:0:0:0]
These are all the disks connected to the Asmedia controller. parity2 shares the same SATA port (and multiplier) with two other disks; you can see that because all three are under ata5, while the one under ata6 is using the other port.
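The cross-check described above can be sketched in Python: pull each disk's device node and the `ataN` component of its sysfs path out of the lsscsi output, then group the disks by port. This is a rough illustration only. `group_by_ata_port` is a hypothetical helper, the regular expression is fitted to the output format quoted in this thread rather than to every lsscsi variant, and the sample omits the `state=...` lines for brevity.

```python
import re
from collections import defaultdict

# Fitted to the lsscsi output quoted in this thread:
# [H:C:T:L] disk ... /dev/sdX ... [/sys/devices/.../ataN/hostM/...]
ENTRY_RE = re.compile(
    r"\[(\d+:\d+:\d+:\d+)\]\s+disk\s.*?(/dev/sd[a-z]+)\s.*?/(ata\d+)/host\d+/",
    re.DOTALL,
)


def group_by_ata_port(lsscsi_text):
    """Map each ata port (e.g. 'ata5') to a list of (H:C:T:L, device node)."""
    ports = defaultdict(list)
    for hctl, dev, port in ENTRY_RE.findall(lsscsi_text):
        ports[port].append((hctl, dev))
    return dict(ports)


# Entries copied from the post above (state=... lines left out).
SAMPLE = """\
[6:0:0:0] disk ATA WDC WD40EFRX-68W 0A80 /dev/sdg /dev/sg7
  dir: /sys/bus/scsi/devices/6:0:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata5/host6/target6:0:0/6:0:0:0]
[6:1:0:0] disk ATA WDC WD120EMFZ-11 0A81 /dev/sdh /dev/sg8
  dir: /sys/bus/scsi/devices/6:1:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata5/host6/target6:1:0/6:1:0:0]
[6:2:0:0] disk ATA WDC WD120EDAZ-11 0A81 /dev/sdi /dev/sg9
  dir: /sys/bus/scsi/devices/6:2:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata5/host6/target6:2:0/6:2:0:0]
[7:0:0:0] disk ATA WDC WD120EMFZ-11 0A81 /dev/sdj /dev/sg10
  dir: /sys/bus/scsi/devices/7:0:0:0 [/sys/devices/pci0000:00/0000:00:1c.7/0000:08:00.0/ata6/host7/target7:0:0/7:0:0:0]
"""

if __name__ == "__main__":
    for port, disks in sorted(group_by_ata_port(SAMPLE).items()):
        print(port, disks)
```

On the sample, three disks (sdg, sdh, sdi) land under ata5 and one (sdj) under ata6, matching the explanation above. Reading the second number in `6:1:0:0` as the port-multiplier link behind the `.01` in `ata5.01` is an assumption worth confirming against the SMART reports before pulling a drive.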