[Solved] Probable physical cache drive issue

fritzdis · December 12, 2021

Quick summary since this turned into a very detailed post:

There's probably a physical issue with one of my cache drives (RAID1), but the SMART info looks OK. I'm hoping someone can chime in on what happened and what I should do next.

I shut down my server to install a new drive. When I booted up & started the array, I started getting repeating errors in the log (didn't check the log right away):

Dec 11 23:52:12 SF-unRAID kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 44138, rd 22, flush 0, corrupt 0, gen 0
Dec 11 23:52:17 SF-unRAID kernel: handle_bad_sector: 23 callbacks suppressed
Dec 11 23:52:17 SF-unRAID kernel: attempt to access beyond end of device
Dec 11 23:52:17 SF-unRAID kernel: sdd1: rw=34817, want=889003608, limit=838860736
Dec 11 23:52:17 SF-unRAID kernel: btrfs_dev_stat_print_on_error: 23 callbacks suppressed

This device is part of a BTRFS RAID1 cache pool. Searching for that error brought me to this thread indicating a drive error. Given the info there, I started a non-correcting BTRFS scrub, which produced this output:

Scrub started:    Sun Dec 12 00:26:12 2021
Status:           finished
Duration:         0:03:31
Total to scrub:   205.88GiB
Rate:             995.87MiB/s
Error summary:    read=7228792
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

The log also showed some new errors:

Dec 12 00:28:53 SF-unRAID kernel: BTRFS warning (device sdd1): i/o error at logical 1534910537728 on dev /dev/sdd1, physical 455527899136, root 5, inode 259, offset 6868205568, length 4096, links 1 (path: system/docker/docker.img)
Dec 12 00:28:53 SF-unRAID kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 1996317, rd 7292557, flush 0, corrupt 0, gen 0

Since everything pointed to a drive issue, I ran short & long SMART self-tests, which completed without error. The SMART attributes look good to me, except for 233:

#	ATTRIBUTE NAME	FLAG	VALUE	WORST	THRESHOLD	TYPE	UPDATED	FAILED	RAW VALUE
5	Reallocated sector count	0x0033	100	100	005	Pre-fail	Always	Never	0
9	Power on hours			0x0032	091	091	000	Old age		Always	Never	44866 (5y, 1m, 12d, 10h)
12	Power cycle count		0x0032	099	099	000	Old age		Always	Never	38
175	Program fail count chip		0x0032	100	100	000	Old age		Always	Never	0
176	Erase fail count chip		0x0032	100	100	000	Old age		Always	Never	0
177	Wear leveling count		0x0012	094	094	000	Old age		Always	Never	961
178	Used rsvd block count chip	0x0013	095	095	005	Pre-fail	Always	Never	472
179	Used rsvd block count tot	0x0012	095	095	000	Old age		Always	Never	253
180	Unused rsvd block count tot	0x0012	095	095	000	Old age		Always	Never	10024
181	Program fail count total	0x0032	100	100	000	Old age		Always	Never	0
182	Erase fail count total		0x0032	100	100	000	Old age		Always	Never	0
194	Temperature celsius		0x0032	069	049	000	Old age		Always	Never	31
195	ECC error rate			0x0032	100	100	000	Old age		Always	Never	0
198	Offline uncorrectable		0x0030	100	100	000	Old age		Offline	Never	0
199	CRC error count			0x003e	100	100	000	Old age		Always	Never	0
202	Exception mode status		0x0033	100	100	090	Pre-fail	Always	Never	0
233	Media wearout indicator		0x0032	001	001	000	Old age		Always	Never	518009717538

Notably, #177 to #180 don't appear to show an issue, so I'm not sure if #233 is a good indicator of the SSD life or not. I have logs from 2016 & 2019 showing the #233 value already at 001 (much lower raw than current), whereas #177 decreased from 099 to 097 to 094. I'm inclined to think #177 is the standard wearout indicator for this drive. But #178 to #180 are unchanged, so I'm not sure what to think.

I viewed the logs very recently, so I'm basically certain this started with the reboot. I did a few things:

Move the server slightly
Added a drive to the hot-swap bay
Replaced the UPS battery (had been putting it off unwisely)
Booted with an external drive attached (which now comes up as sda, though I doubt it's relevant)

So my questions are:

Is this definitely a physical drive issue?
If so, I'll plan to replace with a larger SSD to match the 2nd one - is there any issue with that? (I'll look for instructions if that's the case, but if anyone wants to point me in the right direction or offer tips, that would be appreciated)
If not, what further diagnostic steps should I take? Should I try the correcting scrub?
Why did this start upon reboot? Maybe the SSD went into read-only mode when power cycled (how would I tell?)

Presumably, this will fill up my log unless I shut down my dockers, so I may have to reboot. Please let me know if there's any other info I should capture before doing so.

sf-unraid-diagnostics-20211212-0145.zip

JorgeB · December 12, 2021

Dec 11 11:05:46 SF-unRAID kernel: attempt to access beyond end of device
Dec 11 11:05:46 SF-unRAID kernel: sdd1: rw=34817, want=896828480, limit=838860736

This is the main issue, but also a strange one, please post output of:

lsblk -b

fritzdis · December 12, 2021

NAME   MAJ:MIN RM           SIZE RO TYPE MOUNTPOINT
loop0    7:0    0       12562432  1 loop /lib/modules
loop1    7:1    0       20738048  1 loop /lib/firmware
loop2    7:2    0    42949672960  0 loop /var/lib/docker
sda      8:0    0  4000787029504  0 disk 
└─sda1   8:1    0  4000785104896  0 part 
sdb      8:16   1     4012900352  0 disk 
└─sdb1   8:17   1     4010803200  0 part /boot
sdc      8:32   0   960197124096  0 disk 
└─sdc1   8:33   0   960197091328  0 part 
sdd      8:48   0   480103981056  0 disk 
└─sdd1   8:49   0   429496696832  0 part /mnt/cache
sde      8:64   0 12000138625024  0 disk 
└─sde1   8:65   0 12000138575360  0 part

That discrepancy for sdd1 definitely stands out. Like I said, I'm basically certain it wasn't an issue before the reboot, but maybe the power cycle on the drive triggered something.

sdc is the other drive in the pool. I would kinda like to have a bigger cache, though it would be nice to identify the issue. Happy to do any suggested troubleshooting, but if it's going to be a major pain to figure out, I don't mind just moving on.

fritzdis · December 12, 2021

I also forgot that I just installed the My Servers plugin (before the reboot). But I was looking at the log because the flash drive didn't seem to backup immediately (it's fine now). I did not see those errors occurring, and I'm pretty sure I had dockers running at the time that would have triggered the drive issue if it existed before the reboot.

JorgeB · December 12, 2021

31 minutes ago, fritzdis said:

sdd      8:48   0   480103981056  0 disk 
└─sdd1   8:49   0   429496696832  0 part /mnt/cache

Partition isn't using the full device capacity, you'll need to resize it or probably better yet, and to make sure all gets solved, is to backup cache, wipe the drives and re-format, then restore the data.

fritzdis · December 12, 2021

What do you think the chances are the drive will be totally fine?

I might just set all the shares to cache yes to clear things off, then to cache no once it's cleared (until I have more time to deal with this). If I had thought of doing that earlier, I think I wouldn't have posted right away (hard not to panic a bit when errors like that show up).

So assuming I'm able to move everything without issue, I'll probably mark this solved for now and follow up once I'm better able to address it.

fritzdis · December 12, 2021

OK, new (confusing) update:

After another reboot (because log file filled up from mover errors), here's the relevant output of lsblk -b:

NAME   MAJ:MIN RM           SIZE RO TYPE MOUNTPOINT
sdd      8:48   0   480103981056  0 disk 
└─sdd1   8:49   0   480103948288  0 part /mnt/cache

Somehow it fixed itself? No more of the BTRFS erros in the log yet. I certainly don't trust the filesystem on the drive to be in a good state considering all the previous errors, so I'll continue clearing off the cache in order to reinitialize it.

JorgeB · December 13, 2021

10 hours ago, fritzdis said:

Somehow it fixed itself?

That's very strange, partition could be damaged, you should re-format.

fritzdis · December 13, 2021

Agreed, no way I would use the cache pool as is, given the errors and prior partition status.

I've moved shares off the cache. There were some mover issues with Plex (seems like a known issue with broken symlinks). I've set all shares to cache : no, so the only thing left on the pool is the orphaned Plex files, which I haven't had time to deal with.

Without any of the shares using the cache, I don't think I'm at risk any more, except for maybe Plex breaking, but I can just reinstall that if needed. Not sure if I want to trust the drive going forward, but at least now I can take my time (going out of town soon).

[Solved] Probable physical cache drive issue

Recommended Posts

fritzdis

Link to comment

JorgeB

Link to comment

fritzdis

Link to comment

fritzdis

Link to comment

JorgeB

Link to comment

fritzdis

Link to comment

fritzdis

Link to comment

JorgeB

Link to comment

fritzdis

Link to comment

Join the conversation