Giggity_Grant Posted May 27, 2020 (edited)

Hi all,

I'm running into significant cache disk IO errors, and I'm trying to determine whether one (or both) of my cache drives is failing and needs replacement.

I rebooted my server this morning, only to find that my two cache NVMe drives (2 x Inland Premium 1TB NVMe in a RAID 1 cache pool) were not recognized at all. After rebooting again, the system was able to identify and mount both cache drives. However, after array startup, a large number of cache disk IO errors kept appearing in the syslog, and Docker would not start.

I balanced and scrubbed the cache drives (with "repair corrupted blocks" enabled) via the GUI and rebooted, which appears to have stabilized things. Everything appears to be working fine as of now.

That said, the SMART data for both cache drives shows a high error count, and running "btrfs device stats /mnt/cache" shows a huge number of IO errors on one of the cache disks. I've tried to run a SMART check, but neither the quick nor the extended SMART check will run on either cache drive.

Does the high number of disk IO errors indicate cache corruption, or is at least one of the cache disks failing and in need of replacement?

Edit: Small update for clarity

Attachments: tower-diagnostics-20200527-0935.zip, tower-diagnostics-20200527-0834.zip

Edited May 27, 2020 by Giggity_Grant: Updated scrub description for more clarity
Giggity_Grant Posted May 27, 2020 (Author)

Also, these NVMe drives have been in service for about 1 year. They were in a RAID 0 config for about 6 months, and I moved back to RAID 1 a few weeks ago. Granted, I've downloaded a lot of Linux ISOs (cache use for the download and media folders is set to "YES"), but 127 TB of writes seems excessive. I currently have only 39.8 TB total stored on my array.
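To put the figures in the post above in perspective, here is a rough back-of-the-envelope sketch. It assumes "about 1 year" means 365 days and uses decimal units (1 TB = 1000 GB); the function names are made up for illustration and do not come from any tool mentioned in the thread.

```python
# Rough arithmetic on the numbers quoted in the post.
# Assumptions: 365 days in service, decimal units (1 TB = 1000 GB).

def average_daily_writes_gb(total_written_tb: float, days_in_service: float) -> float:
    """Average amount written per day, in GB."""
    return total_written_tb * 1000 / days_in_service


def writes_to_stored_ratio(total_written_tb: float, stored_tb: float) -> float:
    """How many times over the currently stored data has been written."""
    return total_written_tb / stored_tb


if __name__ == "__main__":
    daily = average_daily_writes_gb(127, 365)
    ratio = writes_to_stored_ratio(127, 39.8)
    print(f"Average writes per day: {daily:.0f} GB")
    print(f"Writes relative to data stored on the array: {ratio:.2f}x")
```

Whether that rate is a concern depends on the endurance rating of the specific drive, which is not given in the thread.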
JorgeB Posted May 28, 2020

The btrfs device errors are most likely the result of one NVMe device dropping offline. See here for how to reset them and set up monitoring so you'll be warned if it happens again. If/when it does, grab the diagnostics before rebooting and post them here.
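As a minimal sketch of the kind of monitoring described above (not the method from the linked post, which is not reproduced here), the script below parses the output of "btrfs device stats" and reports any device with non-zero error counters. The parsing assumes the usual "[device].counter value" line format; the function names and the sample device paths in the usage note are hypothetical.

```python
import re
import subprocess

# Matches lines such as "[/dev/nvme0n1p1].write_io_errs    0"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text: str) -> dict:
    """Turn `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            device_counters = stats.setdefault(match["dev"], {})
            device_counters[match["counter"]] = int(match["value"])
    return stats


def devices_with_errors(stats: dict) -> dict:
    """Keep only devices with at least one non-zero counter."""
    return {
        dev: {name: value for name, value in counters.items() if value}
        for dev, counters in stats.items()
        if any(counters.values())
    }


def read_device_stats(mount_point: str = "/mnt/cache") -> dict:
    """Run `btrfs device stats` on the pool (usually needs root) and parse it."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


if __name__ == "__main__":
    failing = devices_with_errors(read_device_stats())
    if failing:
        for dev, counters in failing.items():
            print(f"WARNING: {dev} has btrfs errors: {counters}")
    else:
        print("No btrfs device errors recorded.")
```

Run on a schedule, this would flag a device as soon as its counters move off zero. The counters persist across reboots, so after the cause is dealt with they can be zeroed with "btrfs device stats -z /mnt/cache"; otherwise the old errors keep triggering the warning.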
Giggity_Grant Posted May 28, 2020 (Author)

Thank you @johnnie.black!! What are your thoughts on the total TBW over one year?
JorgeB Posted May 28, 2020

Giggity_Grant said: What are your thoughts on the total TBW over one year?

On the low side, this is mine after about 3 years: