(SOLVED) - Cache drive in BTRFS pool gone bad, how to detect and resolve

ctrlbreak · February 18, 2019

I have a cache pool of two drives, which has been working fine.

Today I got an email saying the regularly scheduled TRIM had failed, so I checked the logs and there are LOADS of errors related to one of the drives, going back at least 2 days, like this:

Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): bdev /dev/sdm1 errs: wr 21253557, rd 11461666, flush 205573, corrupt 0, gen 0
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#22 CDB: opcode=0x2a 2a 00 00 16 32 50 00 00 20 00
Feb 18 01:18:11 bigboi kernel: print_req_error: I/O error, dev sdm, sector 1454672
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#23 CDB: opcode=0x2a 2a 00 00 16 6f 90 00 00 40 00
Feb 18 01:18:11 bigboi kernel: print_req_error: I/O error, dev sdm, sector 1470352
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): error writing primary super block to device 1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): error writing primary super block to device 1

However, the web GUI gives no indication that one of the cache drives is bad, and everything (unRaid itself, all my dockers) appears to be functioning fine. If there were a serious disk issue, I would have expected it to be reported to me.
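Since the GUI stayed quiet here, one way to catch this earlier is to scan the syslog for the per-device counter summary that btrfs writes (the "bdev ... errs:" line above). The following is only a rough sketch: the function names are made up for illustration, and the regex assumes the line format shown in this log excerpt, which may differ on other kernel versions.

```python
import re

# Matches the counter summary btrfs prints in the kernel log, e.g.
# "bdev /dev/sdm1 errs: wr 21253557, rd 11461666, flush 205573, corrupt 0, gen 0"
# (format assumed from the excerpt in this thread).
BDEV_ERRS = re.compile(
    r"bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)
COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_bdev_errors(lines):
    """Return {device: counters}, keeping the last summary line seen per device."""
    latest = {}
    for line in lines:
        match = BDEV_ERRS.search(line)
        if match:
            latest[match.group("bdev")] = {
                name: int(match.group(name)) for name in COUNTERS
            }
    return latest


def failing_devices(lines):
    """Return the sorted devices whose most recent summary has a non-zero counter."""
    return sorted(
        dev for dev, counters in latest_bdev_errors(lines).items()
        if any(counters.values())
    )
```

To try it against a live system you could pass the lines of the syslog file, for example `failing_devices(open("/var/log/syslog", errors="replace"))`. That path is an assumption about where the log lives on your box.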

Any ideas how best to proceed? The good news is I have a full daily backup of the appdata share, plus I'm running a pool, so I should be fine long-term. But I'm not sure how best to deal with the issue right now.

I'm running unRaid 6.6.6

Any help appreciated!

Edited February 23, 2019 by ctrlbreak

JorgeB · February 18, 2019

Please post the diagnostics: Tools -> Diagnostics

ctrlbreak · February 23, 2019

Thanks. Slow reply as I was out of town.

Diagnostics attached. Still getting the errors, still no indication in the web GUI of any issues, and everything is ostensibly functioning fine.

bigboi-diagnostics-20190223-0934.zip

JorgeB · February 23, 2019

The syslog has rotated and doesn't show the start of the problem, but it looks like the cache1 SSD dropped offline. See here for more info on what to do and how to better monitor a pool:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

ctrlbreak · February 23, 2019

Thanks.

I've run the command and it's showing a bunch of errors for /dev/sdm1 (the drive in question). I tried a bunch of stuff but then rebooted from the command line as I couldn't get a clean shutdown. Now the device isn't picked up at all, and the btrfs dev stats command shows:

[devid:1].write_io_errs    66715124
[devid:1].read_io_errs     47572829
[devid:1].flush_io_errs    367590
[devid:1].corruption_errs  0
[devid:1].generation_errs  0
[/dev/sdk1].write_io_errs    0
[/dev/sdk1].read_io_errs     0
[/dev/sdk1].flush_io_errs    0
[/dev/sdk1].corruption_errs  0
[/dev/sdk1].generation_errs  0
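For anyone wanting to check this output automatically, here is a minimal sketch that parses text in the format shown above and lists the pool members with any non-zero counter. The function names are hypothetical, and the parser assumes the "[device].counter  value" layout pasted in this post.

```python
import re

# One counter per line, e.g. "[/dev/sdk1].write_io_errs    0"
# or "[devid:1].read_io_errs     47572829" for a missing device.
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Parse btrfs dev stats output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            device = stats.setdefault(match.group("dev"), {})
            device[match.group("counter")] = int(match.group("value"))
    return stats


def devices_with_errors(stats):
    """Return the sorted devices that have at least one non-zero counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))
```

Fed the output above, this flags only the missing device (shown as devid:1) and leaves /dev/sdk1 alone. The text itself would come from running the stats command against the pool's mount point, which I'd assume is /mnt/cache on a default setup.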

And the logs are showing a whole load of relocating messages:

Feb 23 11:38:37 bigboi kernel: BTRFS info (device sdk1): relocating block group 3785372205056 flags data|raid1
Feb 23 11:38:42 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): relocating block group 3784298463232 flags data|raid1
Feb 23 11:38:50 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:57 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:57 bigboi kernel: BTRFS info (device sdk1): relocating block group 3783224721408 flags data|raid1
Feb 23 11:39:02 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:06 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:06 bigboi kernel: BTRFS info (device sdk1): relocating block group 3782150979584 flags data|raid1
Feb 23 11:39:11 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:15 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:15 bigboi kernel: BTRFS info (device sdk1): relocating block group 3781077237760 flags data|raid1
Feb 23 11:39:19 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:23 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:23 bigboi kernel: BTRFS info (device sdk1): relocating block group 3780003495936 flags data|raid1
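If you want a rough feel for how far along the relocation is, the "relocating block group" lines can be tallied from the log. This is just a sketch with a made-up function name, assuming the message format shown above; it only counts what has already been logged and cannot say how much is left (btrfs balance status on the pool reports that).

```python
import re

# Matches e.g. "BTRFS info (device sdk1): relocating block group 3785372205056 flags data|raid1"
RELOCATING = re.compile(
    r"BTRFS info \(device (?P<dev>[^)]+)\): "
    r"relocating block group (?P<group>\d+) flags (?P<flags>\S+)"
)


def relocation_summary(lines):
    """Count the relocated block groups seen in the log and report the latest one."""
    groups = []
    for line in lines:
        match = RELOCATING.search(line)
        if match:
            groups.append((int(match.group("group")), match.group("flags")))
    return {
        "count": len(groups),
        "last": groups[-1] if groups else None,
    }
```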

Is this to be expected because of the missing device in the pool?

Should I run a scrub, or see if replacing the cable restores the device first? Is there a good way of establishing if an SSD is bad? I'm pretty new to SSD drives so the various admin tools aren't obvious to me.

Thanks for your help!

(latest diagnostics attached).

bigboi-diagnostics-20190223-1141.zip

JorgeB · February 23, 2019

4 minutes ago, ctrlbreak said:

Is this to be expected because of the missing device in the pool?

Yes, it's balancing to a single device. Wait for the balance to finish; you can later add the 2nd device back if you get it to come up (try a different port/cable), but wipe the SSD before re-adding it to the pool. You can do that with:

blkdiscard /dev/sdX

ctrlbreak · February 23, 2019

Great, thanks for the help.

Hope the SSD is not bad, though it's under warranty if it is.

Good job I had a pool - only got the second drive about 3 months ago!
