Cache error - best way forward to preserve data?




Hi all,

On Sunday morning I awoke to see notifications on my phone saying:

Warning: crc error count is 1

Warning: crc error count is 2

Warning: crc error count is 5

(there have been none since this initial escalation)

And my BTRFS script was also producing "ERRORS on cache pool" (I've since disabled its hourly schedule).
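
(For context, the hourly check is basically a wrapper around btrfs dev stats; a rough sketch of the idea is below. It's not my exact script, and the Unraid notify helper path and flags are from memory, so treat those as assumptions.)

#!/bin/bash
# Rough sketch of an hourly cache-pool error check (not the exact script).
# Sums the btrfs device error counters and raises an Unraid notification
# if any of them are non-zero. The notify helper path/flags are assumptions.

MOUNT=/mnt/cache
ERRORS=$(btrfs dev stats "$MOUNT" | awk '{ total += $2 } END { print total }')

if [ "$ERRORS" -gt 0 ]; then
    /usr/local/emhttp/plugins/dynamix/scripts/notify \
        -i warning -s "ERRORS on cache pool" \
        -d "btrfs dev stats reports $ERRORS errors on $MOUNT"
fi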

I wasn't able to attend to the problem until late this morning (Monday), and the system log has filled up in that time :/

Now when I go into SMART, my 2nd Cache disk has the following message:

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
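
(For reference, that message comes from smartctl, and the report can be retried from the console with the flag it suggests -- sdb here is just a placeholder for the affected device:)

# Retry the SMART report with the option the error message suggests
# (sdb is a placeholder for whichever device is affected)
smartctl -a -T permissive /dev/sdb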

There's also no temperature showing on the front page for that disk (the other cache disk does show temp).

My drives are well past their rated write endurance (the warranty covers 75TB written, and I've written ~130TB), so I assumed it's them dying and have ordered 2x 1TB replacements, arriving today.

To replace them, I am planning on following this guide:

That is, adding one of the new disks to the chassis, assigning it in place of the 2nd cache drive, and starting the array. Then, once that rebuild has finished, shutting down, removing the dead drive, connecting the other new 1TB SSD, assigning it in place of the remaining old disk, and starting the array again.
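
(For my own reference, I believe the manual btrfs equivalent of swapping out one pool member is roughly the following, though I'm planning to let the Unraid GUI handle it -- device names are placeholders:)

# Manual btrfs equivalent of swapping one pool member (placeholder device names);
# the GUI will do this for me, so this is just for reference.
btrfs replace start /dev/sdX1 /dev/sdY1 /mnt/cache   # old partition -> new device
btrfs replace status /mnt/cache                      # monitor progress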

Seems easy.

However, before I start, I note on the cache disk page that it says under "Balance Status":

Current usage ratio: 44.9 % --- Full Balance recommended

Is that because my second disk has dropped out? Or should I perform a balance before I attach the new disk?

I've attached my diagnostics for your perusal.

Are there any problems with the steps I plan to take? Am I at risk of data loss?

Also: should I have just shut down and tried re-seating my SATA cable instead of assuming a dead disk? i.e. is the disk definitely dead?
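
(For what it's worth, these are the checks I probably should have run first to tell a cable problem from a dying disk -- sdb is a placeholder for the affected device, and attribute 199 being the UDMA CRC counter is my understanding:)

# Kernel log: link resets / CRC complaints from the SATA layer usually point at cabling
dmesg | grep -iE 'ata[0-9]+.*(reset|icrc|crc)'

# SMART attribute 199 (UDMA_CRC_Error_Count) climbing also tends to mean cable/backplane
smartctl -A /dev/sdb | grep -i crc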

Thanks for your help and insight.

 

EDIT:

I should also add that I ran this command before I started typing up this post, then again afterwards. As you can see, the error counts are increasing:

root@Percy:~# btrfs dev stats /mnt/cache
[/dev/sdb1].write_io_errs    2469391
[/dev/sdb1].read_io_errs     1065906
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
root@Percy:~# btrfs dev stats /mnt/cache
[/dev/sdb1].write_io_errs    2488143
[/dev/sdb1].read_io_errs     1067924
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0

percy-diagnostics-20230206-1131.zip


It's not so easy to just swap the cable (it's a SAS-to-4x-SATA breakout cable), so I placed the old disks on the two spare SATA connections on my expansion card.

After entering my encryption key to start the array, Firefox said it would have to resend the info to show the page. I hit OK, but now all the disk slots are selectable, and there's no option to start/stop the array, only to shut down (see attached).

Diagnostics attached.

 

ArrayScreenshot.png

percy-diagnostics-20230206-1347.zip


Result of scrub:

UUID:             54142ec0-63e0-4706-afde-ebb28ee3d5d1
Scrub started:    Mon Feb  6 15:14:53 2023
Status:           finished
Duration:         0:02:26
Total to scrub:   104.18GiB
Rate:             730.69MiB/s
Error summary:    verify=6869 csum=314706
  Corrected:      321575
  Uncorrectable:  0
  Unverified:     0
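
(For anyone following along, the scrub itself was just a standard btrfs scrub of the pool -- the pool page's Scrub button or the console equivalent:)

btrfs scrub start /mnt/cache     # start the scrub (runs in the background)
btrfs scrub status /mnt/cache    # progress, then the summary shown above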

 

Under Balance Status it still says:

Current usage ratio: 44.8 % --- Full Balance recommended

Should I balance it? I'm not entirely sure what it does...
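
(If I do end up running one, I believe the console equivalent of the pool page's Balance button is something like the below -- a full balance that rewrites the allocated chunks across both devices:)

btrfs balance start --full-balance /mnt/cache   # rewrite all allocated chunks across the pool
btrfs balance status /mnt/cache                 # check progress while it runs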

I'll reset the error count on the User Script and reschedule it to run hourly.

 

And now I have to decide whether to keep or return the new SSD, too...


Hrmm. I cleared the errors using:

root@Percy:~# btrfs dev stats -z /mnt/cache
[/dev/sde1].write_io_errs    2539524
[/dev/sde1].read_io_errs     1083292
[/dev/sde1].flush_io_errs    0
[/dev/sde1].corruption_errs  314706
[/dev/sde1].generation_errs  6869
[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  0
[/dev/sdd1].generation_errs  0

And re-scheduled the hourly script to check the pool for errors.

Then I re-enabled the Docker service, but now I get this error on the Docker page:

Docker Service failed to start.

Diags attached.
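
(In the meantime, a couple of things I'm checking myself -- assuming the docker image is in Unraid's default location, which I think mine is:)

# Is the docker image file present and readable? (default location; adjust if yours differs)
ls -lh /mnt/user/system/docker/docker.img

# Any loop-device / btrfs complaints about it in the syslog?
grep -iE 'docker|loop' /var/log/syslog | tail -n 40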

percy-diagnostics-20230206-1528.zip

