ctrlbreak · Posted February 18, 2019 (edited)

I have a cache pool of two drives, which has been working fine. Today I got an email saying the regularly scheduled TRIM had failed, so I checked the logs, and there are loads of errors related to one of the drives, going back at least two days, like this:

Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): bdev /dev/sdm1 errs: wr 21253557, rd 11461666, flush 205573, corrupt 0, gen 0
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#22 CDB: opcode=0x2a 2a 00 00 16 32 50 00 00 20 00
Feb 18 01:18:11 bigboi kernel: print_req_error: I/O error, dev sdm, sector 1454672
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Feb 18 01:18:11 bigboi kernel: sd 12:0:0:0: [sdm] tag#23 CDB: opcode=0x2a 2a 00 00 16 6f 90 00 00 40 00
Feb 18 01:18:11 bigboi kernel: print_req_error: I/O error, dev sdm, sector 1470352
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): error writing primary super block to device 1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS warning (device sdm1): lost page write due to IO error on /dev/sdm1
Feb 18 01:18:11 bigboi kernel: BTRFS error (device sdm1): error writing primary super block to device 1

However, in the web GUI there is no indication that one of the cache drives is bad, and everything (unRaid itself, all my dockers) appears to be functioning fine. If there was a serious disk issue, I would have expected it to be reported to me. Any ideas how best to proceed?

The good news is I have a full daily backup of the appdata share, plus I'm running a pool, so I should be fine long-term. But I'm not sure how best to deal with the issue right now. I'm running unRaid 6.6.6.

Any help appreciated!

Edited February 23, 2019 by ctrlbreak
JorgeB · Posted February 18, 2019

Please post the diagnostics: Tools -> Diagnostics
ctrlbreak (Author) · Posted February 23, 2019

Thanks. Sorry for the slow reply; I was out of town. Diagnostics attached. Still getting the errors, still no indication in the web GUI of any issues, and ostensibly everything is functioning fine.

bigboi-diagnostics-20190223-0934.zip
JorgeB · Posted February 23, 2019

The syslog rotated and doesn't show the start of the problem, but it looks like the cache1 SSD dropped offline. See here for more info on what to do and how to better monitor a pool:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
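The monitoring the FAQ describes boils down to watching the `btrfs dev stats` error counters. A hedged sketch of a cron-able check, assuming the pool is mounted at /mnt/cache (the parsing is split into a helper so it can also be exercised on captured output, as in the sample below):

```shell
#!/bin/sh
# Flags any nonzero btrfs device-stats counter. Against a live pool you
# would feed it real output:  check_btrfs_stats "$(btrfs dev stats /mnt/cache)"
# (the /mnt/cache mount point is an assumption; adjust to your pool).
check_btrfs_stats() {
    printf '%s\n' "$1" | awk '$2 != 0 { bad = 1 } END { exit bad ? 1 : 0 }'
}

# Sample taken from the stats posted later in this thread:
sample='[/dev/sdk1].write_io_errs 0
[/dev/sdk1].read_io_errs 0
[devid:1].write_io_errs 66715124'

if check_btrfs_stats "$sample"; then
    echo "pool healthy"
else
    echo "errors detected"
fi
```

Run against the sample it prints "errors detected"; a pool with all counters at zero prints "pool healthy".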
ctrlbreak (Author) · Posted February 23, 2019

Thanks. I've run the command and it's showing a bunch of errors for /dev/sdm1 (the drive in question). I tried a bunch of stuff but then rebooted from the command line, as I couldn't get a clean shutdown. Now the device isn't picked up at all; the btrfs dev stats command shows:

[devid:1].write_io_errs 66715124
[devid:1].read_io_errs 47572829
[devid:1].flush_io_errs 367590
[devid:1].corruption_errs 0
[devid:1].generation_errs 0
[/dev/sdk1].write_io_errs 0
[/dev/sdk1].read_io_errs 0
[/dev/sdk1].flush_io_errs 0
[/dev/sdk1].corruption_errs 0
[/dev/sdk1].generation_errs 0

And the logs are showing a whole load of relocating messages:

Feb 23 11:38:37 bigboi kernel: BTRFS info (device sdk1): relocating block group 3785372205056 flags data|raid1
Feb 23 11:38:42 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:46 bigboi kernel: BTRFS info (device sdk1): relocating block group 3784298463232 flags data|raid1
Feb 23 11:38:50 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:57 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:38:57 bigboi kernel: BTRFS info (device sdk1): relocating block group 3783224721408 flags data|raid1
Feb 23 11:39:02 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:06 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:06 bigboi kernel: BTRFS info (device sdk1): relocating block group 3782150979584 flags data|raid1
Feb 23 11:39:11 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:15 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:15 bigboi kernel: BTRFS info (device sdk1): relocating block group 3781077237760 flags data|raid1
Feb 23 11:39:19 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:23 bigboi kernel: BTRFS info (device sdk1): found 4 extents
Feb 23 11:39:23 bigboi kernel: BTRFS info (device sdk1): relocating block group 3780003495936 flags data|raid1

Is this to be expected because of the missing device in the pool? Should I run a scrub, or first see if replacing the cable restores the device? Is there a good way of establishing whether an SSD is bad? I'm pretty new to SSDs, so the various admin tools aren't obvious to me. Thanks for your help! (Latest diagnostics attached.)

bigboi-diagnostics-20190223-1141.zip
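On the question of establishing whether an SSD is bad: smartmontools is available on unRAID, and `smartctl -H` prints an overall pass/fail verdict from the drive's SMART data. A hedged sketch, with the verdict parsing split into a helper purely so it can be exercised on captured output:

```shell
#!/bin/sh
# Returns 0 when captured `smartctl -H` output contains the PASSED verdict.
# Note a PASSED verdict is not a clean bill of health; interface/cable
# problems in particular will not show up here.
smart_verdict_ok() {
    printf '%s\n' "$1" | grep -q 'PASSED'
}

# Real usage (/dev/sdm is the suspect drive from this thread; adjust):
#   smart_verdict_ok "$(smartctl -H /dev/sdm)" && echo "SMART: healthy"
# Also worth reading: smartctl -A /dev/sdm for the full attribute table
# (reallocated sectors, wear indicators, CRC/interface error counts).
```

The attribute table is usually more informative than the one-line verdict, since wear and interface errors show up there first.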
JorgeB · Posted February 23, 2019

Quoting ctrlbreak: "Is this to be expected because of the missing device in the pool?"

Yes, it's balancing to a single device. Wait for the balance to finish; you can add the second device back later if you get it to come up (try a different port/cable), but wipe the SSD before re-adding it to the pool. You can do that with:

blkdiscard /dev/sdX
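A hedged sketch of that wipe step. blkdiscard discards every block on the device, destroying all data on it, so the wrapper below refuses the /dev/sdX placeholder and forces an explicit device name:

```shell
#!/bin/sh
# Wipes an SSD with a whole-device discard before re-adding it to the pool.
# DANGER: this destroys all data on the target device.
wipe_ssd() {
    dev="$1"
    if [ -z "$dev" ] || [ "$dev" = "/dev/sdX" ]; then
        echo "pass the actual device, e.g. wipe_ssd /dev/sdm" >&2
        return 1
    fi
    lsblk -o NAME,SIZE,MODEL,SERIAL "$dev" &&  # confirm it is the right SSD
    blkdiscard "$dev"                          # full-device discard (wipe)
}
```

After the wipe, the drive can be re-added to the pool from the unRAID GUI.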
ctrlbreak (Author) · Posted February 23, 2019

Great, thanks for the help. Hope the SSD is not bad, though it's under warranty if it is. Good job I had a pool; I only got the second drive about three months ago!