jademonkee Posted February 6, 2023 (edited)

Hi all,

On Sunday morning I awoke to see notifications on my phone saying:

    Warning: crc error count is 1
    Warning: crc error count is 2
    Warning: crc error count is 5

(there have been none since this initial escalation)

My BTRFS script was also producing "ERRORS on cache pool" (I've since disabled its hourly schedule). I wasn't able to attend to the problem until late this morning (Monday), and the system log has filled up in that time.

Now when I go into SMART, my 2nd cache disk has the following message:

    A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

There's also no temperature showing on the front page for that disk (the other cache disk does show temp).

My drives are well beyond their rated write endurance (warranty of 75TB written, and I've written ~130TB), so I assumed they're dying and have ordered 2x 1TB replacements, arriving today.

To replace them, I am planning on following this guide:

That is, adding one of the new disks to the chassis, assigning it in place of the 2nd cache drive, and starting the array. Then, once that's finished, shutting down, removing the dead drive, connecting the other new 1TB SSD, assigning that new disk in place of the final old disk, and starting the array. Seems easy.

However, before I start, I note that the cache disk page says under "Balance Status":

    Current usage ratio: 44.9 % --- Full Balance recommended

Is that because my second disk has dropped out? Or should I perform a balance before I attach the new disk?

I've attached my diagnostics for your perusal. Are there any problems with the steps I plan to take? Am I at risk of data loss?

Also: should I have just shut down and tried re-seating my SATA cable instead of assuming a dead disk? i.e. is the disk definitely dead?

Thanks for your help and insight.

EDIT: I should also add that I ran this command before I started typing up this post, then again afterwards.
As you can see, the error rates are increasing:

    root@Percy:~# btrfs dev stats /mnt/cache
    [/dev/sdb1].write_io_errs    2469391
    [/dev/sdb1].read_io_errs     1065906
    [/dev/sdb1].flush_io_errs    0
    [/dev/sdb1].corruption_errs  0
    [/dev/sdb1].generation_errs  0
    [/dev/sdc1].write_io_errs    0
    [/dev/sdc1].read_io_errs     0
    [/dev/sdc1].flush_io_errs    0
    [/dev/sdc1].corruption_errs  0
    [/dev/sdc1].generation_errs  0

    root@Percy:~# btrfs dev stats /mnt/cache
    [/dev/sdb1].write_io_errs    2488143
    [/dev/sdb1].read_io_errs     1067924
    [/dev/sdb1].flush_io_errs    0
    [/dev/sdb1].corruption_errs  0
    [/dev/sdb1].generation_errs  0
    [/dev/sdc1].write_io_errs    0
    [/dev/sdc1].read_io_errs     0
    [/dev/sdc1].flush_io_errs    0
    [/dev/sdc1].corruption_errs  0
    [/dev/sdc1].generation_errs  0

percy-diagnostics-20230206-1131.zip

Edited February 6, 2023 by jademonkee: Added btrfs dev stats command output
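Editorial aside: the growth between the two `btrfs dev stats` runs in the post above can be computed instead of eyeballed. The following is a minimal sketch, not part of the original thread; the function names `parse_dev_stats` and `counter_deltas` are hypothetical, and the sample strings are excerpts of the output posted above.

```python
import re

# Matches lines such as "[/dev/sdb1].write_io_errs 2469391".
STAT_RE = re.compile(r"\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)")


def parse_dev_stats(text):
    """Return {(device, counter): value} parsed from `btrfs dev stats` output."""
    return {(m["dev"], m["name"]): int(m["value"]) for m in STAT_RE.finditer(text)}


def counter_deltas(before, after):
    """Return only the counters that grew between two snapshots of the output."""
    old, new = parse_dev_stats(before), parse_dev_stats(after)
    return {
        key: new[key] - old.get(key, 0)
        for key in new
        if new[key] > old.get(key, 0)
    }


# Excerpts of the two runs posted above (hypothetical variable names).
first_run = """
[/dev/sdb1].write_io_errs 2469391
[/dev/sdb1].read_io_errs 1065906
[/dev/sdc1].write_io_errs 0
"""
second_run = """
[/dev/sdb1].write_io_errs 2488143
[/dev/sdb1].read_io_errs 1067924
[/dev/sdc1].write_io_errs 0
"""

deltas = counter_deltas(first_run, second_run)
for (dev, name), grown in sorted(deltas.items()):
    print(f"{dev} {name} grew by {grown}")
```

Counters that stay flat (everything on the second device here) are left out of the result, so an empty dictionary means nothing changed between the two runs.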
JorgeB Posted February 6, 2023

Replace both cables for both devices and post new diags after array start.
jademonkee Posted February 6, 2023 Author Share Posted February 6, 2023 Not so easy to just swap the cable (as it's a SAS > x4 SATA cable), so I placed the old disks on the two spare SATA connections on my expansion card. After entering my encryption key to start the array, Firefox said that it will have to resend the info to show the page, I hit ok, but now all the disk slots are selectable, but there's no option to start/stop the array, only shutdown (see attached) Diagnostics attached. percy-diagnostics-20230206-1347.zip Quote Link to comment
JorgeB Posted February 6, 2023

22 minutes ago, jademonkee said:
> After entering my encryption key to start the array, Firefox said that it will have to resend the info to show the page, I hit ok, but now all the disk slots are selectable

That's a known Firefox problem; reboot and use a different browser (or don't hit resend).
jademonkee Posted February 6, 2023 Author Share Posted February 6, 2023 🤦♂️ Ok, have rebooted. Latest diags attached. I should note that last time I booted, I received a warning that the crc error count was now 18. percy-diagnostics-20230206-1422.zip Quote Link to comment
JorgeB Posted February 6, 2023

Pool is mounting and everything looks good so far. Run a correcting scrub and post new diags if there are uncorrectable errors.
jademonkee Posted February 6, 2023 Author Share Posted February 6, 2023 Result of scrub: UUID: 54142ec0-63e0-4706-afde-ebb28ee3d5d1 Scrub started: Mon Feb 6 15:14:53 2023 Status: finished Duration: 0:02:26 Total to scrub: 104.18GiB Rate: 730.69MiB/s Error summary: verify=6869 csum=314706 Corrected: 321575 Uncorrectable: 0 Unverified: 0 Under Balance Status it still says: Current usage ratio: 44.8 % --- Full Balance recommended Should I balance it? I'm not entirely sure what it does... I'll reset the error count on the User Script and reschedule it to run hourly. And now I have to decide on if I'll keep or return the new SSD, too... Quote Link to comment
jademonkee Posted February 6, 2023 Author Share Posted February 6, 2023 Hrmm. I cleared the errors using: root@Percy:~# btrfs dev stats -z /mnt/cache [/dev/sde1].write_io_errs 2539524 [/dev/sde1].read_io_errs 1083292 [/dev/sde1].flush_io_errs 0 [/dev/sde1].corruption_errs 314706 [/dev/sde1].generation_errs 6869 [/dev/sdd1].write_io_errs 0 [/dev/sdd1].read_io_errs 0 [/dev/sdd1].flush_io_errs 0 [/dev/sdd1].corruption_errs 0 [/dev/sdd1].generation_errs 0 And re-scheduled the hourly script to check the pool for errors. Then I went and re-enabled the Docker service, but now I get the error on the Docker page: Docker Service failed to start. Diags attached. percy-diagnostics-20230206-1528.zip Quote Link to comment
JorgeB Posted February 6, 2023 (Solution)

No need to balance for now, and the devices look OK; more likely it was a cable/connection problem. Keep monitoring the stats. You will also need to recreate the docker image.
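Editorial aside: "keep monitoring the stats" is what the hourly pool-check script mentioned in this thread is for. The thread does not show that script, so the following is only a rough sketch of the idea, not the script the poster runs. All function names are hypothetical, the alert wording borrows the "ERRORS on cache pool" message quoted in the first post, and `read_stats` assumes `btrfs` is on the PATH and the caller has root privileges.

```python
import re
import subprocess

# Matches lines such as "[/dev/sde1].corruption_errs 314706".
STAT_RE = re.compile(r"\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)")


def nonzero_counters(stats_text):
    """Return {(device, counter): value} for every counter above zero."""
    return {
        (m["dev"], m["name"]): int(m["value"])
        for m in STAT_RE.finditer(stats_text)
        if int(m["value"]) > 0
    }


def build_alert(stats_text, pool="cache"):
    """Return an alert message, or None when every counter is zero."""
    bad = nonzero_counters(stats_text)
    if not bad:
        return None
    lines = [f"ERRORS on {pool} pool"]
    lines += [f"{dev} {name} = {value}" for (dev, name), value in sorted(bad.items())]
    return "\n".join(lines)


def read_stats(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on a mounted pool and return its text output."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A scheduled job could call `build_alert(read_stats())` and send a notification whenever the result is not `None`. Because the counters are cumulative, they need to be reset with `btrfs dev stats -z` after a problem is fixed, as was done earlier in the thread; otherwise the alert keeps firing on old errors.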
jademonkee Posted February 6, 2023 Author Share Posted February 6, 2023 Great stuff, thanks. I've now recreated the Docker image (as well as the custom Docker network for Swag etc), and everything seems to be working well. I'll keep an eye on everything over the coming days. Thanks so much for your help. 1 Quote Link to comment