July 22, 2019
Got an alert that one of my cache drives was missing, but it showed up as an unassigned device. I rebooted the server and the drive is detected as part of the cache pool again, but the Docker service now fails to start. What's the proper way to solve an issue like this? Is the rest of the data in the cache pool likely to be corrupt?
July 22, 2019, Community Expert
3 hours ago, CorneliousJD said: Is the rest of the data in cache pool likely to be corrupt?
The Docker image is likely corrupt, and so is any other data on NOCOW shares. The rest should be fixable with a scrub, more info here.
July 22, 2019, Author
3 hours ago, johnnie.black said: Docker image is likely corrupt, so likely will any more data using NOCOW shares, rest should be fixable with a scrub, more info here.
Thanks, the NOCOW part was key. My /system/ and /vms/ shares are NOCOW (both were this way by default from what I recall). I rebuilt docker.img and I'm running btrfs scrub start -B /mnt/cache/ now. So far there's no output, and I'm not sure how long to expect it to take, or whether I can keep rebuilding Docker containers while it runs.
The odd thing is that this same disk blinked offline before, but that was over a year ago (early April of 2018). Strange...
July 22, 2019, Author
3 hours ago, johnnie.black said: Docker image is likely corrupt, so likely will any more data using NOCOW shares, rest should be fixable with a scrub, more info here.
Scrubbing didn't take long, but btrfs dev stats /mnt/cache now shows corruption errors that weren't there before the scrub. Not sure what to make of this.
July 22, 2019, Community Expert
2 minutes ago, CorneliousJD said: both were this way by default from what I recall
They are.
3 minutes ago, CorneliousJD said: not sure how long to expect it to take
That depends on how much data there is and how fast the cache is. Since you used -B, the scrub stays in the foreground and only prints its summary at the end, so check progress from another terminal session with: btrfs scrub status /mnt/cache
You can continue to use the cache normally while it runs, with reduced performance.
July 22, 2019, Community Expert
1 minute ago, CorneliousJD said: but now btrfs dev stats /mnt/cache shows corrupt errors now when it didn't before the scrub
That's normal: the scrub counts every checksum mismatch it finds, including the ones it repairs from the other mirror. What matters is that there were no uncorrectable errors, so all data on COW shares is now fine and correctly mirrored. Data on NOCOW shares can still be corrupt, as mentioned in the link, and there's no way to check or correct it since it isn't checksummed. You should also reset the current stats and use a script to monitor for future errors.
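For anyone who wants a starting point for the monitoring part, here is a minimal sketch. It is not the script from the linked post. The function name check_btrfs_stats is made up for illustration, and it assumes btrfs dev stats prints one "[device].counter value" pair per line, which is the usual layout. It reads that output on stdin, prints any counter above zero, and returns non-zero if it found one.

```shell
#!/bin/bash
# Hypothetical helper (not from the linked post): reads `btrfs dev stats`
# output on stdin, prints every counter whose value is above zero, and
# returns 1 if at least one such counter was found, 0 otherwise.
# Assumes lines of the form "[/dev/sdX1].corruption_errs  0".
check_btrfs_stats() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ && $2 > 0 { print; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Typical use on the server (pool mounted at /mnt/cache, as in this thread):
#   btrfs dev stats /mnt/cache | check_btrfs_stats || echo "cache pool reports errors"
#
# Resetting the counters after a clean scrub, so new errors stand out:
#   btrfs dev stats -z /mnt/cache
```

The echo is only a placeholder for whatever alerting you prefer. Resetting with -z after a scrub with no uncorrectable errors means any non-zero counter on a later run points to a new problem.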
July 22, 2019, Author
1 minute ago, johnnie.black said: That's normal, what's important is that there were no uncorrectable errors, all data on COW shares is now fine and correctly mirrored, any data on NOCOW shares can still be corrupt, like mentioned in the link, no way to check or correct it since it's not checksummed. You should also reset current stats and use a script to monitor for the future.
Awesome, good to know I can at least get back to normal work mode today. I'll circle back on the scripts in a few hours. It was great waking up to the fix. Thanks @johnnie.black, you've helped me before as well, much appreciated!
July 22, 2019, Author
So I'm working on setting up the scripts for ease of use. The one that generates a warning seemed to work once, but now it doesn't. I had deleted the scripts so I could recreate them with better names and descriptions (and so the folders would get named correctly in /flash/config/plugins/user.scripts/scripts/).
Now when I copy/paste your script in, I get the following error and it doesn't actually generate the warning:
/tmp/user.scripts/tmpScripts/Cache - Error Check/script: line 4: syntax error: unexpected end of file
The script is only 3 lines long, so there's nothing on line 4. Is there a way I should be closing out this script so it runs cleanly?
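The thread doesn't say what caused this particular error. In general, bash reports "unexpected end of file" one line past the end of a script when a block such as if/then is never closed with fi, and Windows line endings picked up during copy/paste can have the same effect. Here is a small sketch for checking a pasted script before scheduling it. Both function names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: parse a script without running it.
# `bash -n` only reads the file and reports syntax errors.
check_script_syntax() {
    bash -n "$1"
}

# Hypothetical helper: succeeds if the file contains carriage returns,
# which usually means it was saved or pasted with Windows (CRLF) line endings.
has_crlf() {
    grep -q $'\r' "$1"
}

# Example use, with the script path taken from the error message above:
#   check_script_syntax "/tmp/user.scripts/tmpScripts/Cache - Error Check/script"
#   has_crlf "/tmp/user.scripts/tmpScripts/Cache - Error Check/script" && echo "CRLF line endings found"
```

If the syntax check passes and no carriage returns are found, the problem is more likely in how the script was saved than in its contents.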
July 22, 2019, Author
1 minute ago, johnnie.black said: If you're using user scripts there's nothing needed to do.
Yep, I am. I just thought it was odd that it was the only one showing that error, since the rest of my scripts exit cleanly. I'll go ahead and schedule it to run hourly.
I also created a scrub script and a script to reset the error counts in case this happens again. The SSD is on a backplane, so I can't really change any cables, and it hadn't happened for over a year until now. I'll be prepared this time with the new scripts if it happens again.
Thank you again for your time and info!