(SOLVED) - Cache drive in BTRFS pool gone bad, how to detect and resolve



I have a cache pool of two drives, which has been working fine.

 

Today I got an email saying the regularly scheduled TRIM had failed, so I checked the logs and found LOADS of errors related to one of the drives, going back at least 2 days, like this:

 

Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): bdev /dev/sdm1 errs: wr 21253557, rd 11461666, flush 205573, corrupt 0, gen 0
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#22 CDB: opcode=0x2a 2a 00 00 16 32 50 00 00 20 00
Feb 18 01:18:11 bigboi kernel: print_req_error: I/O error, dev sdm, sector 1454672
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#23 CDB: opcode=0x2a 2a 00 00 16 6f 90 00 00 40 00
Feb 18 01:18:11 bigboi kernel: print_req_error: I/O error, dev sdm, sector 1470352
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): error writing primary super block to device 1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): error writing primary super block to device 1

 

However, in the web GUI there is no indication that one of the cache drives is bad, and everything (Unraid itself, all my Docker containers) appears to be functioning fine. If there was a serious disk issue, I would have expected it to be reported to me.
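Since nothing flagged this automatically, here is a minimal sketch of pulling the per-device error counters out of a `bdev ... errs:` kernel log line like the first one above. The function name `parse_bdev_errs` is hypothetical, and the pattern only assumes the line format shown in this log; other kernel versions may word it differently.

```python
import re

# Matches the counter summary btrfs writes to the kernel log, e.g.
# "bdev /dev/sdm1 errs: wr 21253557, rd 11461666, flush 205573, corrupt 0, gen 0"
BDEV_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_bdev_errs(line):
    """Return (device, counters) for a 'bdev ... errs:' line, or None."""
    m = BDEV_ERRS.search(line)
    if m is None:
        return None
    # Everything except the device name is an integer counter.
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters


line = (
    "Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): "
    "bdev /dev/sdm1 errs: wr 21253557, rd 11461666, flush 205573, corrupt 0, gen 0"
)
print(parse_bdev_errs(line))
```

Run against the whole syslog, the last matching line per device gives the most recent counts it reported.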

 

Any ideas how best to proceed? The good news is I have a full daily backup of the appdata share, plus I'm running a pool, so I should be fine long-term. But I'm not sure how best to deal with the issue right now.

 

I'm running Unraid 6.6.6.

 

Any help appreciated! :)

 

Edited by ctrlbreak

Thanks.

 

I've run the command and it's showing a bunch of errors for /dev/sdm1 (the drive in question). I tried a bunch of stuff, but then rebooted from the command line as I couldn't get a clean shutdown. Now the device isn't picked up at all, and the btrfs dev stats command shows:

 

[devid:1].write_io_errs    66715124
[devid:1].read_io_errs     47572829
[devid:1].flush_io_errs    367590
[devid:1].corruption_errs  0
[devid:1].generation_errs  0
[/dev/sdk1].write_io_errs    0
[/dev/sdk1].read_io_errs     0
[/dev/sdk1].flush_io_errs    0
[/dev/sdk1].corruption_errs  0
[/dev/sdk1].generation_errs  0

 

And the logs are showing a whole load of relocating messages:

 

Feb 23 11:38:37 bigboi kernel: BTRFS info (device sdk1): relocating block group 3785372205056 flags data|raid1
Feb 23 11:38:42 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): relocating block group 3784298463232 flags data|raid1
Feb 23 11:38:50 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:57 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:57 bigboi kernel: BTRFS info (device sdk1): relocating block group 3783224721408 flags data|raid1
Feb 23 11:39:02 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:06 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:06 bigboi kernel: BTRFS info (device sdk1): relocating block group 3782150979584 flags data|raid1
Feb 23 11:39:11 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:15 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:15 bigboi kernel: BTRFS info (device sdk1): relocating block group 3781077237760 flags data|raid1
Feb 23 11:39:19 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:23 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:23 bigboi kernel: BTRFS info (device sdk1): relocating block group 3780003495936 flags data|raid1

 

Is this to be expected because of the missing device in the pool?
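To get a rough idea of how much relocation has happened so far, the `relocating block group` lines can be counted straight from the syslog. A minimal sketch, assuming the message format shown above; the helper name `relocated_block_groups` is hypothetical, and this only counts what has already been logged, not what remains.

```python
import re

# "relocating block group <number> flags <profile>" as it appears in the log above
RELOC = re.compile(r"relocating block group (?P<bg>\d+) flags (?P<flags>\S+)")


def relocated_block_groups(log_text):
    """Return (block_group, flags) for every relocation message in the text."""
    return [(int(m.group("bg")), m.group("flags")) for m in RELOC.finditer(log_text)]


LOG = """\
Feb 23 11:38:37 bigboi kernel: BTRFS info (device sdk1): relocating block group 3785372205056 flags data|raid1
Feb 23 11:38:42 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): relocating block group 3784298463232 flags data|raid1
"""

groups = relocated_block_groups(LOG)
print(len(groups), "block groups relocated so far")
```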

 

Should I run a scrub, or first see if replacing the cable restores the device? Is there a good way of establishing whether an SSD is bad? I'm pretty new to SSDs, so the various admin tools aren't obvious to me.

 

Thanks for your help!

 

(latest diagnostics attached).

 

 

bigboi-diagnostics-20190223-1141.zip

4 minutes ago, ctrlbreak said:

Is this to be expected because of the missing device in the pool?

Yes, it's balancing to a single device. Wait for the balance to finish; you can add the 2nd device back later if you get it to come up (try a different port/cable). Wipe the SSD before re-adding it to the pool, which you can do with:

 

blkdiscard /dev/sdX

 
