Cache drive showed as unassigned? Docker won't start now



Got an alert that one of my cache drives was missing, but it showed as an unassigned device. 

 

I rebooted the server and the drive is detected as part of the cache pool again, but the Docker service now fails to start.

 

What's the proper way to solve an issue like this? Is the rest of the data in the cache pool likely to be corrupt?

Link to comment
3 hours ago, johnnie.black said:

The Docker image is likely corrupt, and so is any other data on NOCOW shares; the rest should be fixable with a scrub. More info here.

Thanks, the NOCOW detail was key: my /system/ and /vms/ shares are NOCOW (both were this way by default from what I recall).
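For anyone wanting to confirm which folders are NOCOW, here is a minimal sketch. It relies on lsattr showing the C (No_COW) attribute in its flags column. The helper name nocow_paths and the /mnt/cache/system and /mnt/cache/vms paths are assumptions based on the share names in this thread, and the parsing will not handle paths containing spaces.

```shell
#!/bin/bash
# Sketch: list which directories carry the No_COW attribute.
# Reads `lsattr -d` output on stdin (flags in column 1, path in column 2)
# and prints the paths whose flags include C.
nocow_paths() {
    awk '$1 ~ /C/ { print $2 }'
}

# Example call (paths are assumptions, adjust to your own shares):
# lsattr -d /mnt/cache/system /mnt/cache/vms | nocow_paths
```

Anything this prints is not checksummed by btrfs, so a scrub cannot verify or repair it.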

 

docker.img is rebuilt and I'm running btrfs scrub start -B /mnt/cache/ now. No output so far; I'm not sure how long to expect it to take, or whether I can keep rebuilding Docker containers while it runs.

 

The odd thing is this same disk blinked offline before, but it was OVER a year ago (early April of 2018).

Strange....

Link to comment
3 hours ago, johnnie.black said:

The Docker image is likely corrupt, and so is any other data on NOCOW shares; the rest should be fixable with a scrub. More info here.

Scrubbing didn't take long, but btrfs dev stats /mnt/cache now shows corruption errors that weren't there before the scrub. Not sure what to make of this.

 

[Attached screenshot: image.png]

Link to comment
2 minutes ago, CorneliousJD said:

both were this way by default from what I recall

They are.

 

3 minutes ago, CorneliousJD said:

not sure how long to expect it to take

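Depends on how much data there is and how fast the cache is. Since you used -B, the scrub stays in the foreground and prints nothing until it finishes, so check progress from another terminal with: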

btrfs scrub status /mnt/cache

And you can continue to use the cache normally while the scrub runs, just with reduced performance.
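If you would rather not keep re-running the status command by hand, a rough sketch of a polling loop follows. The function names and the INTERVAL variable are hypothetical, and it assumes the status text contains the word "running" while a scrub is in progress, which may not hold for every btrfs-progs version.

```shell
#!/bin/bash
# Sketch: wait for a scrub to finish, then print the final status.
POOL="${POOL:-/mnt/cache}"   # mount point used in this thread

# Reads `btrfs scrub status` output on stdin; succeeds while a scrub is running.
# Assumption: in-progress output contains the word "running".
scrub_is_running() {
    grep -qi 'running'
}

# Poll every INTERVAL seconds (hypothetical knob, default 60) until done.
wait_for_scrub() {
    while btrfs scrub status "$POOL" | scrub_is_running; do
        sleep "${INTERVAL:-60}"
    done
    btrfs scrub status "$POOL"
}

# Call wait_for_scrub from a second terminal while the scrub is running.
```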

Link to comment
1 minute ago, CorneliousJD said:

but btrfs dev stats /mnt/cache now shows corruption errors that weren't there before the scrub

That's normal. What matters is that there were no uncorrectable errors: all data on COW shares is now fine and correctly mirrored. Data on NOCOW shares can still be corrupt; as mentioned in the link, there's no way to check or correct it since it's not checksummed.

 

You should also reset current stats and use a script to monitor for the future.
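As a rough illustration of what such a monitor could look like (this is a sketch, not the script from the link), the snippet below filters btrfs dev stats output down to the counters that are not zero. The function names are hypothetical, and it assumes the usual two-column output of counter name followed by value. How you raise the alert is left out on purpose.

```shell
#!/bin/bash
# Sketch: flag non-zero btrfs error counters on the cache pool.
POOL="${POOL:-/mnt/cache}"   # mount point used in this thread

# Reads `btrfs dev stats` output on stdin, e.g.
#   [/dev/sdb1].corruption_errs  12
# and prints only the counters whose value is not zero.
nonzero_stats() {
    awk 'NF == 2 && $2 + 0 != 0 { print $1, $2 }'
}

# Returns 1 and prints the offending counters if any are non-zero.
check_pool() {
    errors=$(btrfs dev stats "$POOL" | nonzero_stats)
    if [ -n "$errors" ]; then
        echo "btrfs errors on $POOL:"
        echo "$errors"
        return 1
    fi
}

# To reset the counters once the pool is healthy again:
#   btrfs dev stats -z "$POOL"    # prints the stats, then zeroes them
```

Run on a schedule, check_pool stays silent on a healthy pool and only produces output when a counter has moved off zero.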

Link to comment
1 minute ago, johnnie.black said:

That's normal. What matters is that there were no uncorrectable errors: all data on COW shares is now fine and correctly mirrored. Data on NOCOW shares can still be corrupt; as mentioned in the link, there's no way to check or correct it since it's not checksummed.

 

You should also reset current stats and use a script to monitor for the future.

Awesome, good to know :) I can at least get back to normal work-mode today for now, I'll circle back on the scripts in a few hours.

 

It was awesome waking up to the fix. Thanks @johnnie.black, you've helped me before as well, much appreciated!

Link to comment

So I'm setting up the scripts for ease of use. The one that generates a warning seemed to work once, but now it doesn't. I had deleted the scripts so I could recreate them with better names/descriptions (and so the folders would get named correctly in /flash/config/plugins/user.scripts/scripts/).

 

Now when I copy/paste your script in I get the following error, and it doesn't actually generate the warning.

 

/tmp/user.scripts/tmpScripts/Cache - Error Check/script: line 4: syntax error: unexpected end of file

 

The script is only 3 lines long, so there's nothing on line 4.

Is there a way I should be closing out this script to run cleanly?
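For context, bash reports "unexpected end of file" when it reaches the end of the script while still waiting for something to be closed, such as an if without its fi or an unterminated quote, which is why the line number points one past the last line. Windows line endings picked up during copy/paste can trigger the same message. A small sketch of two checks follows; the function names are hypothetical.

```shell
#!/bin/bash
# Sketch: quick checks for a script that fails with
# "syntax error: unexpected end of file".

# Succeeds if bash can parse the file; -n parses without executing anything.
syntax_ok() {
    bash -n "$1"
}

# Succeeds if the file contains carriage returns (Windows line endings).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Example, using the script path from the error message:
# syntax_ok "/tmp/user.scripts/tmpScripts/Cache - Error Check/script"
```

If syntax_ok fails, compare the pasted script against the original line by line for a dropped closing fi.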

Link to comment
1 minute ago, johnnie.black said:

If you're using User Scripts there's nothing else you need to do.

Yep, I am, just thought it was odd that it was the only one showing that error - the rest of the scripts I have exit cleanly. 

 

I'll go ahead and schedule that for hourly runs. I also created a scrub script and a script to reset the error counts in case this happens again.

The SSD is on a backplane so I can't really change any cables, and it hadn't happened for over a year until now. I'll be prepared this time, though, with the new scripts in place.

 

Thank you again for your time and info!

Link to comment
