CorneliousJD Posted July 22, 2019
Got an alert that one of my cache drives was missing, but it showed up as an unassigned device. I rebooted the server and the drive is detected as part of the cache again, but the Docker service is now failing to start. What's the proper way to solve an issue like this? Is the rest of the data in the cache pool likely to be corrupt?
JorgeB Posted July 22, 2019
3 hours ago, CorneliousJD said: Is the rest of the data in cache pool likely to be corrupt?
The Docker image is likely corrupt, and so, likely, is any other data on NOCOW shares. The rest should be fixable with a scrub. More info here.
CorneliousJD Posted July 22, 2019
3 hours ago, johnnie.black said: Docker image is likely corrupt, so likely will any more data using NOCOW shares, rest should be fixable with a scrub, more info here.
Thanks, the NOCOW part was key: my /system/ and /vms/ shares are NOCOW (both were this way by default, from what I recall). I rebuilt docker.img and am running btrfs scrub start -B /mnt/cache/ now. So far there's no output, and I'm not sure how long to expect it to take, or whether I can keep rebuilding Docker containers while it runs. The odd thing is that this same disk blinked offline once before, but that was over a year ago (early April 2018). Strange...
CorneliousJD Posted July 22, 2019
3 hours ago, johnnie.black said: Docker image is likely corrupt, so likely will any more data using NOCOW shares, rest should be fixable with a scrub, more info here.
Scrubbing didn't take long, but btrfs dev stats /mnt/cache now shows corruption errors that weren't there before the scrub. Not sure what to make of this.
JorgeB Posted July 22, 2019
2 minutes ago, CorneliousJD said: both were this way by default from what I recall
They are.
3 minutes ago, CorneliousJD said: not sure how long to expect it to take
Depends on how much data there is and how fast the cache is. Since you used -B, the scrub stays in the foreground, so you can check progress from a second terminal session with: btrfs scrub status /mnt/cache
And you can continue to use the cache normally, just with reduced performance.
JorgeB Posted July 22, 2019
1 minute ago, CorneliousJD said: but now btrfs dev stats /mnt/cache shows corrupt errors now when it didn't before the scrub
That's normal. What's important is that there were no uncorrectable errors: all data on COW shares is now fine and correctly mirrored. Any data on NOCOW shares can still be corrupt; as mentioned in the link, there's no way to check or correct it since it isn't checksummed. You should also reset the current stats and use a script to monitor them in the future.
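The monitoring script linked in the thread is not reproduced here, so the following is only a minimal sketch of the idea, not the script JorgeB refers to. It assumes the pool is mounted at /mnt/cache as in the posts above; the names POOL and report_nonzero are made up for illustration. It reads the output of btrfs dev stats and prints a warning when any counter is non-zero. Sending an actual Unraid notification is deliberately left out.

```shell
#!/bin/bash
# Sketch only: flag non-zero btrfs device error counters for a pool.
# POOL and report_nonzero are illustrative names, not from the thread.
POOL="${POOL:-/mnt/cache}"

# Read `btrfs dev stats` output on stdin and print every counter line
# whose value (the last field) is not 0.
# Returns 1 if any non-zero counter was found, 0 otherwise.
report_nonzero() {
    awk 'NF >= 2 && $NF != 0 { print; found = 1 } END { exit (found ? 1 : 0) }'
}

# Only query the pool when btrfs is installed and the path is a mount point.
if command -v btrfs >/dev/null 2>&1 && mountpoint -q "$POOL"; then
    if ! btrfs dev stats "$POOL" | report_nonzero; then
        echo "WARNING: btrfs error counters are non-zero on $POOL"
        # Once the cause is understood and a scrub has completed, the
        # counters can be reset to zero with:
        #   btrfs dev stats -z "$POOL"
    fi
fi
```

Run on a schedule, a check like this stays silent while all counters are zero, which is why resetting the stats after a successful scrub matters: otherwise the old counts would trigger the warning on every run.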
CorneliousJD Posted July 22, 2019
1 minute ago, johnnie.black said: That's normal, what's important is that there were no uncorrectable errors, all data on COW shares is now fine and correctly mirrored, any data on NOCOW shares can still be corrupt, like mentioned in the link, no way to check or correct it since it's not checksummed. You should also reset current stats and use a script to monitor for the future.
Awesome, good to know I can at least get back to normal work mode for today. I'll circle back to the scripts in a few hours. It was awesome waking up to the fix. Thanks @johnnie.black, you've helped me in the past as well, much appreciated!
CorneliousJD Posted July 22, 2019
So I'm working on setting up the scripts for ease of use. The one that generates a warning seemed to work once, but now it doesn't. I had deleted the scripts so I could recreate them with better names/descriptions (and so the folders would be named correctly in /flash/config/plugins/user.scripts/scripts/). Now when I copy/paste your script in, I get the following error, and it doesn't actually generate the warning:
/tmp/user.scripts/tmpScripts/Cache - Error Check/script: line 4: syntax error: unexpected end of file
The script is only 3 lines long, so there's nothing on line 4. Is there something I should add to close out this script so it runs cleanly?
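The thread does not show the pasted script, so the actual cause of this error is unknown. As a hedged aside, bash commonly reports "unexpected end of file" when a block is left unclosed (for example an if without its fi) or when copy/paste introduced Windows-style line endings, which hide keywords such as then and fi from the parser. The sketch below checks a script file for both; check_script is a made-up helper name.

```shell
#!/bin/bash
# Sketch only: two quick checks on a pasted script that fails with
# "syntax error: unexpected end of file". check_script is an illustrative name.
check_script() {
    # A literal carriage-return character, built portably.
    cr="$(printf '\r')"

    # Carriage returns (Windows line endings) can make bash miss
    # keywords such as "then" and "fi".
    if grep -q "$cr" "$1"; then
        echo "CRLF line endings found"
    fi

    # bash -n parses the file without executing any of it.
    if bash -n "$1" 2>/dev/null; then
        echo "syntax OK"
    else
        echo "syntax error"
        return 1
    fi
}

# Example call, using the path from the error message above:
#   check_script "/tmp/user.scripts/tmpScripts/Cache - Error Check/script"
```

The path must be quoted because it contains spaces. If the first check fires, re-saving the script with Unix line endings is the usual remedy; if only the second fires, the likely culprit is a block that was cut off during copy/paste.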
JorgeB Posted July 22, 2019
If you're using user scripts, there's nothing else you need to do.
CorneliousJD Posted July 22, 2019
1 minute ago, johnnie.black said: If you're using user scrips there's nothing needed to do.
Yep, I am. I just thought it was odd that it was the only one showing that error; the rest of my scripts exit cleanly. I'll go ahead and schedule it for hourly runs. I also created a scrub script and a script to reset the error counts in case this happens again. The SSD is on a backplane, so I can't really change any cables, and it hadn't happened for over a year until now. I'll be prepared this time, though, with the new scripts in place. Thank you again for your time and info!