June 23, 2023

I was getting CRC error count warnings for one of my disks (sdi in the attached diagnostics), so I moved it between drive bays to use different SATA ports and cables, but it is always the same disk that shows CRC errors. Nothing was wrong with the data or filesystems during this, until the latest reboot. After the reboot, a parity check started automatically.

1. Can I know what caused this? I assume an unclean shutdown, with a timeout exceeded on something, but I can't see what.
2. The trouble disk was in a btrfs pool of 2 HDDs. Why did the array get affected?
3. Is there anything else I am missing? Is there another underlying issue?

I upgraded from 6.11.5 to 6.12.1 yesterday, and it went smoothly, including importing a ZFS pool. I don't believe this is the culprit, but it is still worth a mention.

The log is full of disk errors, but I cannot pinpoint which disk. I am assuming sdi based on the creeping CRC errors.

Diagnostics attached: godaam-diagnostics-20230623-1312.zip
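A side note for anyone following along: on most SATA drives the CRC counter mentioned here is SMART attribute 199 (UDMA_CRC_Error_Count), which `smartctl -A` prints in its attribute table. Below is a minimal Python sketch, assuming the usual ten-column attribute table layout, that pulls the raw value out of that output so it can be compared before and after a cable or port swap. The function name is hypothetical and the sample row is made up for illustration; it is not taken from the attached diagnostics.

```python
def crc_error_count(smartctl_output: str):
    """Return the raw value of SMART attribute 199, or None if it is not listed.

    Assumes the standard `smartctl -A` table, whose rows look like:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0] == "199" and fields[9].isdigit():
            return int(fields[9])
    return None


# Illustrative sample row only (made-up numbers, not from the diagnostics).
SAMPLE = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
    "  9 Power_On_Hours          0x0032   070   070   000    Pre-fail  Always       -       22000\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12\n"
)

if __name__ == "__main__":
    print(crc_error_count(SAMPLE))
```

The raw value never resets, so what matters is whether it keeps climbing after a change. Feeding the function the output of `smartctl -A /dev/sdX` before and after a swap gives a direct comparison.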
June 23, 2023

2 hours ago, apandey said:
1. can I know what caused this? I assume unclean shutdown

Correct, there should be diags saved in the flash drive /logs folder from that shutdown that might show the issue.
June 23, 2023 Author

6 hours ago, JorgeB said:
Correct, there should be diags saved in the flash drive /logs folder from that shutdown that might show the issue.

Nice, I didn't know diagnostics are saved automatically in such a case. I have attached the one from the shutdown. It seems the troubled btrfs pool drive had problems unmounting.

Diagnostics attached: godaam-diagnostics-20230623-1255.zip

I am still not clear why this should mark the array dirty. It is a bit scary if issues with a cache pool can affect array operations.

So what should my next steps be? I am not very familiar with btrfs recovery. I ran a scrub before the reboot and it seemed OK, so I am not sure how to see the same issue that the logs show. If the disk needs replacing, how do I replace it?
June 23, 2023 Solution

The issues with the tsdb2 disk start right from boot and are constantly spamming the log. This looks more like a connection issue: replace the cables for that disk or swap it to a different controller.
June 23, 2023 Author

13 minutes ago, JorgeB said:
The issues with the tsdb2 disk start right from boot and are constantly spamming the log. This looks more like a connection issue: replace the cables for that disk or swap it to a different controller.

OK, I will swap things out once the current parity check finishes. I have also increased the shutdown timeout in disk settings for now, to avoid this on the next shutdown.

I did move the drive from motherboard SATA to my LSI controller when the CRC errors first showed up. The drive bays use different cables too. I will also examine the drive-side connectors this time when I swap it. I have one more spare slot where I can try the disk.

If this continues, is there a way I can downgrade tsdb to a single-drive pool temporarily? I would like to take the disk out and test it outside the system if swapping cables etc. doesn't work.
June 23, 2023

27 minutes ago, apandey said:
I did move the drive from motherboard SATA to my LSI controller

It's not using the LSI controller. It's connected to the Intel SCU controller, which is the secondary Intel SATA/SAS controller.
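One way to check which controller a disk really hangs off, sketched in Python under the assumption of a Linux sysfs layout: resolving the `/sys/block/<dev>` symlink gives a device path that contains the PCI address of the controller, and that address can then be looked up with `lspci -s <address>`. The function names and the sample path below are hypothetical, chosen only to illustrate the idea; they are not taken from the diagnostics in this thread.

```python
import os
import re

# PCI addresses look like domain:bus:device.function, e.g. 0000:02:00.0
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def controller_pci_address(sysfs_path: str):
    """Return the last PCI address in a resolved sysfs path, or None.

    The last address is the device closest to the disk, i.e. the
    storage controller; earlier ones are bridges on the way to it.
    """
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def controller_for_disk(dev: str):
    """Resolve /sys/block/<dev> (e.g. "sdi") and extract the controller address."""
    return controller_pci_address(os.path.realpath(f"/sys/block/{dev}"))


# Hypothetical resolved path, for illustration only.
SAMPLE_PATH = (
    "/sys/devices/pci0000:00/0000:00:01.0/0000:02:00.0/"
    "host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdi"
)

if __name__ == "__main__":
    print(controller_pci_address(SAMPLE_PATH))
```

Running the lookup for two disks and comparing the addresses shows quickly whether they share a controller, which is the kind of mix-up described above.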
June 24, 2023 Author

10 hours ago, JorgeB said:
It's not using the LSI controller, it's connected to the Intel SCU controller, that's the secondary Intel SATA/SAS controller

Arrrrgggh. Sorry, and thanks for spotting this. I mistakenly moved the other 2TB drive to the LSI controller, so no wonder the CRC errors didn't stop. I have a spare port on the motherboard SATA controller, so I will rewire to that and report back.
June 24, 2023 Author

I have switched the affected drive to the other motherboard SATA controller. So far so good.

I started a scrub on the tsdb pool, and it is reporting some errors. I did not tick the "fix errors" checkbox.

Scrub started:    Sat Jun 24 11:12:36 2023
Status:           running
Duration:         0:13:38
Time left:        3:06:22
ETA:              Sat Jun 24 14:32:36 2023
Total to scrub:   2.48TiB
Bytes scrubbed:   173.21GiB  (6.82%)
Rate:             216.83MiB/s
Error summary:    verify=2 csum=120266
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

How do I go about fixing the errors? Should I run the scrub with the "fix errors" checkbox ticked? Do I need to wait for the initial scrub to finish, or can I just cancel it and do the fix steps?

EDIT: started a correcting scrub

Edited June 24, 2023 by apandey: updated on scrub
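For keeping an eye on a long scrub like the one above, here is a minimal Python sketch, assuming the `key=value` form of the "Error summary" line shown in the post, that turns the status text into counters. The function name is hypothetical; the sample text is the error-summary portion of the output quoted in the post.

```python
import re


def scrub_error_summary(status_text: str) -> dict:
    """Parse the 'Error summary:' line of scrub status output into {name: count}.

    Returns an empty dict if the line is missing or carries no key=value
    pairs (for example when no errors were found).
    """
    match = re.search(r"Error summary:\s*(.*)", status_text)
    if not match:
        return {}
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", match.group(1))}


# Error-summary portion of the status output quoted in the post.
STATUS = """\
Error summary:    verify=2 csum=120266
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0
"""

if __name__ == "__main__":
    print(scrub_error_summary(STATUS))
```

Comparing these counters between a read-only pass and a correcting pass shows whether the correcting scrub is actually repairing what the first pass found.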