Cache error - best way forward to preserve data?




Hi all,

On Sunday morning I awoke to see notifications on my phone saying:

Warning: crc error count is 1

Warning: crc error count is 2

Warning: crc error count is 5

(there have been none since this initial escalation)

And my BTRFS script was also producing "ERRORS on cache pool" (I've since disabled its hourly schedule).
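
(For context, the hourly check is basically a wrapper around btrfs dev stats; a rough sketch of the idea is below. It's not my exact script, and the Unraid notify helper path and flags are from memory, so treat those as assumptions.)

#!/bin/bash
# Rough sketch of an hourly cache-pool error check (not the exact script).
# Sums the btrfs device error counters and raises an Unraid notification
# if any of them are non-zero. The notify helper path/flags are assumptions.

MOUNT=/mnt/cache
ERRORS=$(btrfs dev stats "$MOUNT" | awk '{ total += $2 } END { print total }')

if [ "$ERRORS" -gt 0 ]; then
    /usr/local/emhttp/plugins/dynamix/scripts/notify \
        -i warning -s "ERRORS on cache pool" \
        -d "btrfs dev stats reports $ERRORS errors on $MOUNT"
fi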

I wasn't able to attend to the problem until late this morning (Monday), and the system log has filled up in that time :/

Now when I go into SMART, my 2nd Cache disk has the following message:

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
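
(For reference, that message comes from smartctl, and the report can be retried from the console with the flag it suggests -- sdb here is just a placeholder for the affected device:)

# Retry the SMART report with the option the error message suggests
# (sdb is a placeholder for whichever device is affected)
smartctl -a -T permissive /dev/sdb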

There's also no temperature showing on the front page for that disk (the other cache disk does show temp).

My drives are well past their rated write endurance (the warranty covers 75TB written, and I've written ~130TB), so I assumed it's them dying and have ordered 2x 1TB replacements, arriving today.

To replace them, I am planning on following this guide:

That is, adding one of the new disks to the chassis, assigning it in place of the 2nd cache drive, and starting the array. Then, once that rebuild has finished, shutting down, removing the dead drive, connecting the other new 1TB SSD, assigning it in place of the remaining old disk, and starting the array again.
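
(For my own reference, I believe the manual btrfs equivalent of swapping out one pool member is roughly the following, though I'm planning to let the Unraid GUI handle it -- device names are placeholders:)

# Manual btrfs equivalent of swapping one pool member (placeholder device names);
# the GUI will do this for me, so this is just for reference.
btrfs replace start /dev/sdX1 /dev/sdY1 /mnt/cache   # old partition -> new device
btrfs replace status /mnt/cache                      # monitor progress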

Seems easy.

However, before I start, I note on the cache disk page that it says under "Balance Status":

Current usage ratio: 44.9 % --- Full Balance recommended

Is that because my second disk has dropped out? Or should I perform a balance before I attach the new disk?

I've attached my diagnostics for your perusal.

Are there any problems with the steps I plan to take? Am I at risk of data loss?

Also: should I have just shut down and tried re-seating my SATA cable instead of assuming a dead disk? i.e. is the disk definitely dead?
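
(For what it's worth, these are the checks I probably should have run first to tell a cable problem from a dying disk -- sdb is a placeholder for the affected device, and attribute 199 being the UDMA CRC counter is my understanding:)

# Kernel log: link resets / CRC complaints from the SATA layer usually point at cabling
dmesg | grep -iE 'ata[0-9]+.*(reset|icrc|crc)'

# SMART attribute 199 (UDMA_CRC_Error_Count) climbing also tends to mean cable/backplane
smartctl -A /dev/sdb | grep -i crc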

Thanks for your help and insight.

 

EDIT:

I should also add that I ran this command before I started typing up this post, then again afterwards. As you can see, the error counts are increasing:

root@Percy:~# btrfs dev stats /mnt/cache
[/dev/sdb1].write_io_errs    2469391
[/dev/sdb1].read_io_errs     1065906
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
root@Percy:~# btrfs dev stats /mnt/cache
[/dev/sdb1].write_io_errs    2488143
[/dev/sdb1].read_io_errs     1067924
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0

percy-diagnostics-20230206-1131.zip


It's not so easy to just swap the cable (it's a SAS-to-4x-SATA breakout cable), so I placed the old disks on the two spare SATA connections on my expansion card.

After entering my encryption key to start the array, Firefox said it would have to resend the info to show the page. I hit OK, but now all the disk slots are selectable, and there's no option to start/stop the array, only to shut down (see attached).

Diagnostics attached.

 

ArrayScreenshot.png

percy-diagnostics-20230206-1347.zip


Result of scrub:

UUID:             54142ec0-63e0-4706-afde-ebb28ee3d5d1
Scrub started:    Mon Feb  6 15:14:53 2023
Status:           finished
Duration:         0:02:26
Total to scrub:   104.18GiB
Rate:             730.69MiB/s
Error summary:    verify=6869 csum=314706
  Corrected:      321575
  Uncorrectable:  0
  Unverified:     0
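
(For anyone following along, the scrub itself was just a standard btrfs scrub of the pool -- the pool page's Scrub button or the console equivalent:)

btrfs scrub start /mnt/cache     # start the scrub (runs in the background)
btrfs scrub status /mnt/cache    # progress, then the summary shown above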

 

Under Balance Status it still says:

Current usage ratio: 44.8 % --- Full Balance recommended

Should I balance it? I'm not entirely sure what it does...
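
(If I do end up running one, I believe the console equivalent of the pool page's Balance button is something like the below -- a full balance that rewrites the allocated chunks across both devices:)

btrfs balance start --full-balance /mnt/cache   # rewrite all allocated chunks across the pool
btrfs balance status /mnt/cache                 # check progress while it runs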

I'll reset the error count on the User Script and reschedule it to run hourly.

 

And now I have to decide whether to keep or return the new SSD, too...


Hrmm. I cleared the errors using:

root@Percy:~# btrfs dev stats -z /mnt/cache
[/dev/sde1].write_io_errs    2539524
[/dev/sde1].read_io_errs     1083292
[/dev/sde1].flush_io_errs    0
[/dev/sde1].corruption_errs  314706
[/dev/sde1].generation_errs  6869
[/dev/sdd1].write_io_errs    0
[/dev/sdd1].read_io_errs     0
[/dev/sdd1].flush_io_errs    0
[/dev/sdd1].corruption_errs  0
[/dev/sdd1].generation_errs  0

And re-scheduled the hourly script to check the pool for errors.

Then I re-enabled the Docker service, but now I get this error on the Docker page:

Docker Service failed to start.

Diags attached.
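
(In the meantime, a couple of things I'm checking myself -- assuming the docker image is in Unraid's default location, which I think mine is:)

# Is the docker image file present and readable? (default location; adjust if yours differs)
ls -lh /mnt/user/system/docker/docker.img

# Any loop-device / btrfs complaints about it in the syslog?
grep -iE 'docker|loop' /var/log/syslog | tail -n 40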

percy-diagnostics-20230206-1528.zip

