tcharron Posted July 13, 2020

I just saw a huge amount of errors logged. I never got a warning or error from unraid about this, and the web interface shows nothing concerning.

This is just a small portion of what I'm seeing flood my logs...

Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527220, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527221, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527222, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527223, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527224, flush 476552, corrupt 0, gen 0
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4

Here is the relevant drive information:

DEVICE | IDENTIFICATION | TEMP. | READS | WRITES | ERRORS | FS | SIZE | USED | FREE | VIEW
Cache | KINGSTON_SV300S37A240G_50026B726706EDE1 - 240 GB (sdk) | 25 C | 41,038,636 | 27,137,657 | 0 | btrfs | 496 GB | 210 GB | 286 GB | Browse /mnt/cache
Cache 2 | INTEL_SSDSC2KW512G8_PHLA8222024A512DGN - 512 GB (sdm) | * | 41,038,636 | 27,137,657 | 0 | Device is part of cache pool
Cache 3 | KINGSTON_SV300S37A240G_50026B776407138B - 240 GB (sdn) | 22 C | 36,990,993 | 20,943,909 | 0 | Device is part of cache pool

This seems like a failing (failed?) drive, but I don't know if it is sdk or sdm, or if it is something else entirely. Any ideas?
JorgeB Posted July 13, 2020

1 minute ago, tcharron said: Any ideas?

Not without the full diagnostics: Tools -> Diagnostics, please.
JorgeB Posted July 14, 2020

Syslog doesn't show the beginning of the problem, but it shows that one of your cache devices (sdm) dropped offline:

Jul 11 04:40:37 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 279461563, rd 207240495, flush 384145, corrupt 0, gen 0

Judging by the number of errors, it happened some time ago, or multiple times. See here for more info.
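For readers unsure which device such a line points at: the name in parentheses (sdk1) only identifies the pool, while the device after `bdev` (/dev/sdm1) is the member whose error counters are being reported. The following is a minimal Python sketch, not part of the original thread, that pulls the device and counters out of a line in the format shown above. The function name `parse_btrfs_errs` is hypothetical, and the pattern is assumed from the log lines quoted in this thread only.

```python
import re

# Pattern assumed from the 'bdev ... errs:' lines quoted in this thread.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, counters) for a 'bdev ... errs:' line, else None.

    Hypothetical helper for illustration; not an Unraid or btrfs tool.
    """
    match = ERRS_RE.search(line)
    if match is None:
        return None
    counters = {
        name: int(value)
        for name, value in match.groupdict().items()
        if name != "dev"
    }
    return match.group("dev"), counters


# The line quoted in the post above.
sample = (
    "Jul 11 04:40:37 Tower kernel: BTRFS error (device sdk1): "
    "bdev /dev/sdm1 errs: wr 279461563, rd 207240495, flush 384145, "
    "corrupt 0, gen 0"
)

device, counters = parse_btrfs_errs(sample)
print(device, counters)
```

Large `wr`, `rd`, and `flush` counts with `corrupt` and `gen` at zero fit a device that stopped responding to I/O, which matches the diagnosis above that sdm dropped offline.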
tcharron (Author) Posted July 14, 2020

So a run of 'btrfs scrub /mnt/cache' has found about a million uncorrectable errors within 30 seconds.

Does this mean that my entire cache drive is toast, and now I need to rebuild it? Given it is a pool, I would have expected to be able to rebuild (much like when array drives fail), should any single drive have errors. Is there any way to do this? Can I just remove sdm and replace it?
tcharron (Author) Posted July 14, 2020

2 minutes ago, tcharron said: So a run of 'btrfs scrub /mnt/cache' has found about a million uncorrectable errors within 30 seconds. Does this mean that my entire cache drive is toast, and now I need to rebuild it? Given it is a pool, I would have expected to be able to rebuild (much like when array drives fail), should any single drive have errors. Is there any way to do this? Can I just remove sdm and replace it?

Actually.. When I try to get a smart report for sdm, it tells me that the drive is offline! This may be good news!

Why doesn't the unraid interface show anywhere that the drive is gone?? I get (from the link you provided) that the error count is wrong, but a red ball on this page would go a long way...
JorgeB Posted July 14, 2020

4 hours ago, tcharron said: Actually.. When I try to get a smart report for sdm, it tells me that the drive is offline!

I already mentioned that:

11 hours ago, johnnie.black said: but it shows that one of your cache devices (sdm) dropped offline:

The link I posted also says what you should do.
tcharron (Author) Posted July 15, 2020

So I checked the cables and the device seems stable now. I was able to rebuild the drive.

root@Tower:~# btrfs scrub status /mnt/cache
UUID:             100735db-0e88-4450-a406-40f3efdd2bb7
Scrub started:    Tue Jul 14 20:31:42 2020
Status:           finished
Duration:         0:16:16
Total to scrub:   434.15GiB
Rate:             455.50MiB/s
Error summary:    verify=17806 csum=6157411
  Corrected:      6175217
  Uncorrectable:  0
  Unverified:     0
root@Tower:~#

Thanks for your help!
JorgeB Posted July 15, 2020

As long as there are no uncorrectable errors, all should be fine.
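The check described above (no uncorrectable errors after the scrub) can be sketched in Python. This is only an illustration, not part of the original thread: the function name `scrub_summary` is hypothetical, and the field names it looks for are assumed from the `btrfs scrub status` output pasted in this thread, which may differ between btrfs-progs versions.

```python
import re

# Scrub report as posted in the thread (prompt lines omitted).
SCRUB_STATUS = """\
UUID:             100735db-0e88-4450-a406-40f3efdd2bb7
Scrub started:    Tue Jul 14 20:31:42 2020
Status:           finished
Duration:         0:16:16
Total to scrub:   434.15GiB
Rate:             455.50MiB/s
Error summary:    verify=17806 csum=6157411
  Corrected:      6175217
  Uncorrectable:  0
  Unverified:     0
"""


def scrub_summary(text):
    """Extract error totals from scrub status text (hypothetical helper)."""

    def field(name):
        # Reads a 'Name: <integer>' field; returns None if it is absent.
        match = re.search(rf"{name}:\s*(\d+)", text)
        return int(match.group(1)) if match else None

    # Sum every 'kind=count' pair on the 'Error summary' line.
    summary_line = re.search(r"Error summary:(.*)", text)
    found = 0
    if summary_line:
        found = sum(int(n) for n in re.findall(r"\w+=(\d+)", summary_line.group(1)))

    return {
        "found": found,
        "corrected": field("Corrected"),
        "uncorrectable": field("Uncorrectable"),
        "unverified": field("Unverified"),
    }


summary = scrub_summary(SCRUB_STATUS)
healthy = summary["uncorrectable"] == 0 and summary["found"] == summary["corrected"]
print(summary, "healthy" if healthy else "needs attention")
```

In the report above, the verify and csum counts add up to the Corrected figure and Uncorrectable is 0, so every error the scrub found was repaired from the good copy on the other pool members.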
tcharron (Author) Posted July 15, 2020

My docker.img was somehow corrupted. I deleted it and restored my dockers using CA, and all seems well.
JorgeB Posted July 15, 2020

That's normal if you're using the default system share: it doesn't checksum the data, so there is no way to verify or repair it. That's also mentioned in the FAQ link.