[SOLVED] BTRFS issues

tcharron · July 13, 2020

I just saw a huge number of errors logged. I never got a warning or error from Unraid about this, and the web interface shows nothing concerning.

This is just a small portion of what I'm seeing flood my logs...

Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527220, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527221, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527222, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527223, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527224, flush 476552, corrupt 0, gen 0
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4

Here is the relevant drive information:

DEVICE	IDENTIFICATION	TEMP.	READS	WRITES	ERRORS	FS	SIZE	USED	FREE	VIEW
Cache	KINGSTON_SV300S37A240G_50026B726706EDE1 - 240 GB (sdk)	25 C	41,038,636	27,137,657	0	btrfs	496 GB	210 GB	286 GB	Browse /mnt/cache
Cache 2	INTEL_SSDSC2KW512G8_PHLA8222024A512DGN - 512 GB (sdm)	*	41,038,636	27,137,657	0	Device is part of cache pool	
Cache 3	KINGSTON_SV300S37A240G_50026B776407138B - 240 GB (sdn)	22 C	36,990,993	20,943,909	0	Device is part of cache pool

This seems like a failing (failed?) drive, but I don't know if it is sdk or sdm, or if it is something else entirely. Any ideas?

JorgeB · July 13, 2020

1 minute ago, tcharron said:

Any ideas?

Not without the full diagnostics: Tools -> Diagnostics, please.

tcharron · July 13, 2020

tower-diagnostics-20200713-1449.zip

JorgeB · July 14, 2020

Syslog doesn't show the beginning of the problem, but it shows that one of your cache devices (sdm) dropped offline:

Jul 11 04:40:37 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 279461563, rd 207240495, flush 384145, corrupt 0, gen 0

Judging by the number of errors, it happened some time ago or multiple times; see here for more info.
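To make the point about the counters concrete, here is a minimal Python sketch that parses two of the `bdev ... errs:` lines quoted in this thread and shows how much the counters grew between them. The regular expression and the `parse_bdev_errors` helper are illustrative assumptions based only on the log format shown above, not part of any btrfs tooling. In these messages the `bdev` field names the pool member the counters belong to (sdm1 here), while `(device sdk1)` appears to be just the label of the pool the message comes from.

```python
import re

# Assumed pattern, derived from the log lines quoted in this thread.
BDEV_RE = re.compile(
    r"BTRFS error \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_bdev_errors(line):
    """Return the pool label, the affected member and its counters, or None."""
    m = BDEV_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    result = {"fs": fields["fs"], "bdev": fields["bdev"]}
    result.update({name: int(fields[name]) for name in COUNTERS})
    return result


# Two lines copied from the posts above (Jul 11 and Jul 13).
earlier = parse_bdev_errors(
    "Jul 11 04:40:37 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 "
    "errs: wr 279461563, rd 207240495, flush 384145, corrupt 0, gen 0"
)
later = parse_bdev_errors(
    "Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 "
    "errs: wr 323772515, rd 238527220, flush 476552, corrupt 0, gen 0"
)

# Growth of each counter between the two log lines.
growth = {name: later[name] - earlier[name] for name in COUNTERS}
print(later["bdev"], growth)
```

Write, read and flush counts that keep climbing while `corrupt` and `gen` stay at 0 fit a device that is unreachable rather than one returning bad data.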

tcharron · July 14, 2020

So a run of 'btrfs scrub /mnt/cache' has found about a million uncorrectable errors within 30 seconds.

Does this mean that my entire cache drive is toast, and now I need to rebuild it? Given it is a pool, I would have expected to be able to rebuild (much like when array drives fail), should any single drive have errors. Is there any way to do this? Can I just remove sdm and replace it?

tcharron · July 14, 2020

2 minutes ago, tcharron said:

So a run of 'btrfs scrub /mnt/cache' has found about a million uncorrectable errors within 30 seconds.

Does this mean that my entire cache drive is toast, and now I need to rebuild it? Given it is a pool, I would have expected to be able to rebuild (much like when array drives fail), should any single drive have errors. Is there any way to do this? Can I just remove sdm and replace it?

Actually.. When I try to get a smart report for sdm, it tells me that the drive is offline! This may be good news!

Why doesn't the unraid interface show anywhere that the drive is gone?? I get (from the link you provided) that the error count is wrong, but a red ball on this page would go a long way...

JorgeB · July 14, 2020

4 hours ago, tcharron said:

Actually.. When I try to get a smart report for sdm, it tells me that the drive is offline!

I already mentioned that:

11 hours ago, johnnie.black said:

but it shows that one of your cache devices (sdm) dropped offline:

The link I posted also says what you should do.

tcharron · July 15, 2020

So I checked the cables and the device seems stable now. I was able to rebuild the drive.

root@Tower:~# btrfs scrub status /mnt/cache
UUID:             100735db-0e88-4450-a406-40f3efdd2bb7
Scrub started:    Tue Jul 14 20:31:42 2020
Status:           finished
Duration:         0:16:16
Total to scrub:   434.15GiB
Rate:             455.50MiB/s
Error summary:    verify=17806 csum=6157411
  Corrected:      6175217
  Uncorrectable:  0
  Unverified:     0
root@Tower:~#

Thanks for your help!

JorgeB · July 15, 2020

As long as there are no uncorrectable errors all should be fine.
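If you want to check this automatically after a scrub, a small Python sketch along these lines could read the summary. The `parse_scrub_summary` and `scrub_is_clean` helpers are hypothetical, and the text layout is assumed to match the `btrfs scrub status` output pasted above; other btrfs-progs versions may format it differently.

```python
import re

# Sample text copied from the scrub status output posted above.
SAMPLE = """\
UUID:             100735db-0e88-4450-a406-40f3efdd2bb7
Scrub started:    Tue Jul 14 20:31:42 2020
Status:           finished
Duration:         0:16:16
Total to scrub:   434.15GiB
Rate:             455.50MiB/s
Error summary:    verify=17806 csum=6157411
  Corrected:      6175217
  Uncorrectable:  0
  Unverified:     0
"""


def parse_scrub_summary(text):
    """Extract the Corrected / Uncorrectable / Unverified counts, if present."""
    counts = {}
    for key in ("Corrected", "Uncorrectable", "Unverified"):
        m = re.search(rf"^\s*{key}:\s+(\d+)\s*$", text, re.MULTILINE)
        if m:
            counts[key.lower()] = int(m.group(1))
    return counts


def scrub_is_clean(text):
    """True only if the scrub finished and reported zero uncorrectable errors."""
    finished = re.search(r"^Status:\s+finished\s*$", text, re.MULTILINE)
    return finished is not None and parse_scrub_summary(text).get("uncorrectable") == 0


print(parse_scrub_summary(SAMPLE))
print(scrub_is_clean(SAMPLE))
```

A missing `Uncorrectable` line is deliberately treated as "not clean", so an unexpected output format fails safe instead of passing silently.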

tcharron · July 15, 2020

My docker.img was somehow corrupted. I deleted it and restored my dockers using CA, and all seems well.

JorgeB · July 15, 2020

That's normal if you're using the default system share: it doesn't checksum the data, so there is no way to verify or fix it. That's also mentioned in the FAQ link.
