Greyberry Posted April 4, 2023 (edited)

Hi. I had a failing drive a few days ago. 24 errors in the write column indicate that the disk went offline immediately, so fortunately it did not have much time to corrupt the data. After seeing this, I swapped the disk for a new one and rebuilt the data from parity. Note that I did use the array while this process was running. There were no errors whatsoever during the rebuild, so I thought I was fine, but out of curiosity I started a parity check. The parity check finished today and threw 1024 errors. Now I don't know whether the corrupt data is on my freshly inserted disk or on the parity disk. What should I do now?

Since I am using only two drives in the array (one data, one parity), it is theoretically possible to see which files are affected. How can I see which ones?

server-diagnostics-20230404-1254.zip
JorgeB Posted April 4, 2023

There are multiple ATA errors with disk1, suggesting a power/connection problem. Check/replace those cables and run a correcting check.
Greyberry Posted April 4, 2023

Thank you for your reply. I also see them in the syslog, but how do I know that this is in fact disk1? (Not that I don't believe you, but so that I can debug it myself in the future.)

If disk1 (the data disk) is the problem, wouldn't it be better to remove disk1 from the array and start a new rebuild from parity once the issues are resolved, instead of doing a correcting parity check, which would write the corrupt data from the data disk to parity?
JorgeB Posted April 4, 2023

5 minutes ago, Greyberry said: "but how do I know that this is in fact disk1?"

On the main GUI page, click on the disk icon for that disk to see the related log.

5 minutes ago, Greyberry said: "If disk1 (data-disk) is the problem, wouldn't it be better to remove disk1 from the array and start a new sync/repair process from parity once the issues are resolved?"

It doesn't look like a disk problem, more like a power/connection problem, but if errors persist after replacing the cables (both power and SATA), it could be the disk.
Greyberry Posted April 4, 2023

4 minutes ago, JorgeB said: "On the main GUI page click on the disk icon for that disk to see the related log."

You saw the ATA errors in the syslog. What I wanted to know is: how do you know that these are related to disk1 (and not the parity disk)?

Apr 3 19:21:28 SERVER kernel: ata3.00: failed command: READ FPDMA QUEUED
Apr 3 19:21:28 SERVER kernel: ata3.00: cmd 60/28:58:10:12:1e/02:00:e3:00:00/40 tag 11 ncq dma 282624 in
Apr 3 19:21:28 SERVER kernel: res 40/00:60:38:14:1e/00:00:e3:00:00/40 Emask 0x10 (ATA bus error)
Apr 3 19:21:28 SERVER kernel: ata3.00: status: { DRDY }
Apr 3 19:21:28 SERVER kernel: ata3.00: failed command: READ FPDMA QUEUED
Apr 3 19:21:28 SERVER kernel: ata3.00: cmd 60/d0:60:38:14:1e/02:00:e3:00:00/40 tag 12 ncq dma 368640 in
Apr 3 19:21:28 SERVER kernel: res 40/00:60:38:14:1e/00:00:e3:00:00/40 Emask 0x10 (ATA bus error)
Apr 3 19:21:28 SERVER kernel: ata3.00: status: { DRDY }
Apr 3 19:21:28 SERVER kernel: ata3: hard resetting link
Apr 3 19:21:29 SERVER kernel: ata3: SATA link down (SStatus 0 SControl 300)
Apr 3 19:21:34 SERVER kernel: ata3: hard resetting link
Apr 3 19:21:39 SERVER kernel: ata3: link is slow to respond, please be patient (ready=0)
Apr 3 19:21:42 SERVER kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 3 19:21:42 SERVER kernel: ata3.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK) filtered out
Apr 3 19:21:42 SERVER kernel: ata3.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION OVERLAY) filtered out
Apr 3 19:21:42 SERVER kernel: ata3.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK) filtered out
Apr 3 19:21:42 SERVER kernel: ata3.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION OVERLAY) filtered out

4 minutes ago, JorgeB said: "Doesn't look like a disk problem, more a power/connection problem, but if errors persist after replacing the cables (both power and SATA) it could be the disk."

Yes, I know. But:

Quote: "check/replace those cables and run a correcting check."

Doesn't it make more sense to do a disk rebuild (parity --> data) instead of a correcting parity check (data --> parity)? In this case it is more likely that the data is corrupt and the parity is intact.
JorgeB Posted April 4, 2023

3 minutes ago, Greyberry said: "What i wanted to know is, how do you know that these are related to disk1?"

Click here: [screenshot missing]

7 minutes ago, Greyberry said: "Doesn't it make more sense to do a DISK-REBUILD (parity --> data) instead of a correcting parity-check? (data --> parity) Because in this case it is more likely that the data is corrupt and the parity is in tact."

You can do that, but unless you have checksums or were using btrfs/zfs, there is no way to know for certain.
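To illustrate the checksum point above: a minimal sketch of a per-file checksum manifest, which is one way to tell later whether file contents have changed. This is not something used in the thread; the function names `sha256_of`, `build_manifest`, and `changed_files` are hypothetical, and the sketch assumes a manifest was taken while the data was known to be good.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = sha256_of(full_path)
    return manifest


def changed_files(old_manifest, new_manifest):
    """Return paths present in both manifests whose digests differ."""
    return sorted(
        path
        for path, digest in old_manifest.items()
        if path in new_manifest and new_manifest[path] != digest
    )
```

In use, `build_manifest` would be run against the data disk's mount point (on Unraid presumably something like `/mnt/disk1`, which is an assumption here) before and after a rebuild, and `changed_files` would list the candidates. A differing digest only shows that the content changed; files that were legitimately modified in between show up as well.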
Greyberry Posted April 4, 2023

39 minutes ago, JorgeB said: "Click here:"

You couldn't do that on my machine, could you? I wanted to know how you knew FROM THE LOGS that the errors were from disk1.

39 minutes ago, JorgeB said: "You can do that, but unless you have checksums or were using btrfs/zfs no way to know for certain."

Yeah, disk1 (the data disk) is corrupt, so I think it is better to rebuild the data from parity.
JorgeB Posted April 4, 2023

52 minutes ago, Greyberry said: "You couldn't do that on my machine, could you?"

Not sure what you mean, why not? Misread it as "you couldn't do that".

I can see which disk it is from the full diagnostics; how depends on which controller the disk is using. In this case I used lsscsi.txt, but for you it's easier to just click that.
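For readers who want to do the mapping from a terminal: a rough sketch that counts "failed command" lines per ATA port in a syslog excerpt and then looks up which block device sits on each port. This is not JorgeB's method (he used lsscsi.txt from the diagnostics). It assumes a disk attached to an onboard AHCI/libata controller, where the resolved `/sys/block/<dev>` path contains an `ataN` component; disks behind a SAS HBA usually will not match. All function names are hypothetical.

```python
import os
import re
from collections import Counter

# Matches kernel lines such as "kernel: ata3.00: failed command: READ FPDMA QUEUED".
FAILED_COMMAND = re.compile(r"kernel: (ata\d+)\.\d+: failed command")
# Matches the "ataN" directory inside a resolved sysfs device path.
ATA_IN_PATH = re.compile(r"/(ata\d+)/")


def count_failed_commands(log_lines):
    """Count 'failed command' kernel messages per ATA port (e.g. 'ata3')."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_COMMAND.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def ata_port_of(sysfs_path):
    """Return the ataN component of a resolved sysfs path, or None."""
    match = ATA_IN_PATH.search(sysfs_path)
    return match.group(1) if match else None


def block_devices_by_ata_port(sys_block="/sys/block"):
    """Map ATA port names to block device names (Linux only)."""
    mapping = {}
    for dev in sorted(os.listdir(sys_block)):
        port = ata_port_of(os.path.realpath(os.path.join(sys_block, dev)))
        if port:
            mapping.setdefault(port, []).append(dev)
    return mapping
```

Running `count_failed_commands` over the syslog and then `block_devices_by_ata_port()` on the server gives a device name such as `sdX` for the noisy port; which array slot that device belongs to still has to be read off the Main page or the diagnostics.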
Greyberry Posted April 4, 2023

2 hours ago, JorgeB said: "Not sure what you mean, why not? Misread as you couldn't do that. I see which disk it is based on the full diags, depending on which controller it is using, in this case using lsscsi.txt, but for you it's easier to just click that."

Thank you! 🙂 Now I know. Sometimes it is faster to do it via the terminal or to look into the diagnostics when you have them open anyway.