[SOLVED] BTRFS issues



I just saw a huge number of errors in my logs. I never got a warning or error from Unraid about this, and the web interface shows nothing concerning.

This is just a small portion of what is flooding my logs:

Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527220, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527221, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527222, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527223, flush 476552, corrupt 0, gen 0
Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 323772515, rd 238527224, flush 476552, corrupt 0, gen 0
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS warning (device sdk1): lost page write due to IO error on /dev/sdm1
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4
Jul 13 14:26:43 Tower kernel: BTRFS error (device sdk1): error writing primary super block to device 4

Here is the relevant drive information:

DEVICE	IDENTIFICATION	TEMP.	READS	WRITES	ERRORS	FS	SIZE	USED	FREE	VIEW
Cache	KINGSTON_SV300S37A240G_50026B726706EDE1 - 240 GB (sdk)	25 C	41,038,636	27,137,657	0	btrfs	496 GB	210 GB	286 GB	Browse /mnt/cache
Cache 2	INTEL_SSDSC2KW512G8_PHLA8222024A512DGN - 512 GB (sdm)	*	41,038,636	27,137,657	0	Device is part of cache pool
Cache 3	KINGSTON_SV300S37A240G_50026B776407138B - 240 GB (sdn)	22 C	36,990,993	20,943,909	0	Device is part of cache pool

This seems like a failing (failed?) drive, but I don't know if it is sdk or sdm, or if it is something else entirely.  Any ideas?
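One way to tell the two device names apart is to pull the fields out of the kernel lines themselves. In these messages the `(device sdk1)` prefix labels the filesystem (the pool), while `bdev /dev/sdm1` names the pool member the counters belong to. The following is a minimal Python sketch that parses one of the lines quoted above; the regular expression and the function name `parse_bdev_errors` are illustrative assumptions, not part of any Unraid or btrfs tool.

```python
import re

# Matches the per-device error counter lines the kernel prints for btrfs.
# "device sdk1" labels the filesystem (pool); "bdev /dev/sdm1" is the
# member device the counters belong to.
BDEV_LINE = re.compile(
    r"BTRFS error \(device (?P<fs>\w+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_bdev_errors(line):
    """Return the counters from one syslog line, or None if it does not match."""
    m = BDEV_LINE.search(line)
    if m is None:
        return None
    result = {"fs": m.group("fs"), "bdev": m.group("bdev")}
    for key in ("wr", "rd", "flush", "corrupt", "gen"):
        result[key] = int(m.group(key))
    return result


# Line copied from the log excerpt in this post.
sample = (
    "Jul 13 14:26:42 Tower kernel: BTRFS error (device sdk1): "
    "bdev /dev/sdm1 errs: wr 323772515, rd 238527220, flush 476552, "
    "corrupt 0, gen 0"
)
info = parse_bdev_errors(sample)
print(info["fs"], info["bdev"], info["wr"], info["rd"])
```

Read this way, the errors in the excerpt are all attributed to sdm1, not to sdk1.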

 


Syslog doesn't show the beginning of the problem, but it shows that one of your cache devices (sdm) dropped offline:

Jul 11 04:40:37 Tower kernel: BTRFS error (device sdk1): bdev /dev/sdm1 errs: wr 279461563, rd 207240495, flush 384145, corrupt 0, gen 0

 

Judging by the number of errors, it either happened some time ago or has happened multiple times; see here for more info.
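To put numbers on that, the counters from the Jul 11 line above can be compared with the first Jul 13 line from the original post. The btrfs per-device error counters are cumulative until they are reset, so the difference gives a rough idea of how many errors accumulated between the two samples. This is only an arithmetic sketch in Python; the dictionary layout and the name `counter_growth` are assumptions made for illustration.

```python
# Counters copied from the two syslog lines quoted in this thread.
jul11 = {"wr": 279461563, "rd": 207240495, "flush": 384145}
jul13 = {"wr": 323772515, "rd": 238527220, "flush": 476552}


def counter_growth(earlier, later):
    """Return how much each cumulative counter grew between two samples."""
    return {name: later[name] - earlier[name] for name in earlier}


growth = counter_growth(jul11, jul13)
print(growth)
# → {'wr': 44310952, 'rd': 31286725, 'flush': 92407}
```

The counters were already in the hundreds of millions on Jul 11, which is consistent with the device having been missing for a while before the logged period.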


So a run of 'btrfs scrub start /mnt/cache' found about a million uncorrectable errors within 30 seconds.

Does this mean that my entire cache is toast and I now need to rebuild it? Since it is a pool, I would have expected to be able to rebuild it when any single drive has errors, much like when an array drive fails. Is there any way to do this? Can I just remove sdm and replace it?

 


Actually, when I try to get a SMART report for sdm, it tells me that the drive is offline! This may be good news!

Why doesn't the Unraid interface show anywhere that the drive is gone? I get (from the link you provided) that the error count is wrong, but a red ball on this page would go a long way...

 

[Attached image]
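Until the interface flags this kind of problem, one option is to check the per-device counters yourself. The `btrfs device stats <mountpoint>` command prints one `[device].counter value` line per counter, and the Python sketch below scans such text for non-zero values. The sample text is an assumption: it is built from the numbers in the kernel log above, not captured from this system, and the function name `devices_with_errors` is hypothetical.

```python
import re

# One line of `btrfs device stats` output looks like:
#   [/dev/sdm1].write_io_errs    323772515
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def devices_with_errors(stats_text):
    """Map each device to its non-zero counters; devices with no errors are omitted."""
    bad = {}
    for line in stats_text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m and int(m.group("value")) > 0:
            bad.setdefault(m.group("dev"), {})[m.group("name")] = int(m.group("value"))
    return bad


# Illustrative sample (assumed, not real output from this server).
sample_stats = """\
[/dev/sdk1].write_io_errs    0
[/dev/sdk1].read_io_errs     0
[/dev/sdk1].flush_io_errs    0
[/dev/sdk1].corruption_errs  0
[/dev/sdk1].generation_errs  0
[/dev/sdm1].write_io_errs    323772515
[/dev/sdm1].read_io_errs     238527224
[/dev/sdm1].flush_io_errs    476552
[/dev/sdm1].corruption_errs  0
[/dev/sdm1].generation_errs  0
"""

bad = devices_with_errors(sample_stats)
for dev, counters in bad.items():
    print(dev, counters)
```

A script like this could be fed the real command output on a schedule and send a notification whenever the result is non-empty.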

4 hours ago, tcharron said:

Actually.. When I try to get a smart report for sdm, it tells me that the drive is offline! 

I already mentioned that:

11 hours ago, johnnie.black said:

but it shows that one of your cache devices (sdm) dropped offline:

The link I posted also says what you should do.


So I checked the cables, and the device seems stable now. I was able to rebuild the drive.

root@Tower:~# btrfs scrub status /mnt/cache
UUID:             100735db-0e88-4450-a406-40f3efdd2bb7
Scrub started:    Tue Jul 14 20:31:42 2020
Status:           finished
Duration:         0:16:16
Total to scrub:   434.15GiB
Rate:             455.50MiB/s
Error summary:    verify=17806 csum=6157411
  Corrected:      6175217
  Uncorrectable:  0
  Unverified:     0
root@Tower:~#

Thanks for your help!
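For anyone who wants to check a result like this automatically, the fields that matter are the error summary, `Corrected`, and `Uncorrectable`. Below is a minimal Python sketch that parses the relevant lines of the output above and confirms that every detected error was corrected. The function name `parse_scrub_status` and the returned dictionary layout are assumptions for illustration, and the layout of the status text may differ between btrfs-progs versions.

```python
import re

# Relevant lines copied from the scrub status output in this post.
scrub_output = """\
Status:           finished
Error summary:    verify=17806 csum=6157411
  Corrected:      6175217
  Uncorrectable:  0
  Unverified:     0
"""


def parse_scrub_status(text):
    """Extract the error summary and the corrected/uncorrectable/unverified totals."""
    errors = {}
    m = re.search(r"Error summary:\s+(.*)", text)
    if m:
        for pair in m.group(1).split():
            # Skip words such as "no errors found" that are not name=value pairs.
            if "=" in pair:
                name, _, value = pair.partition("=")
                errors[name] = int(value)
    result = {"errors": errors}
    for field in ("Corrected", "Uncorrectable", "Unverified"):
        m = re.search(rf"{field}:\s+(\d+)", text)
        result[field.lower()] = int(m.group(1)) if m else None
    return result


status = parse_scrub_status(scrub_output)
total_errors = sum(status["errors"].values())
print(total_errors, status["corrected"], status["uncorrectable"])
# → 6175217 6175217 0
```

Here the verify and csum errors add up to exactly the corrected count, and nothing is left uncorrectable.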

  • JorgeB changed the title to [SOLVED] BTRFS issues
