fritzdis Posted December 12, 2021

Quick summary, since this turned into a very detailed post: there's probably a physical issue with one of my cache drives (BTRFS RAID1), but the SMART info looks OK. I'm hoping someone can chime in on what happened and what I should do next.

I shut down my server to install a new drive. When I booted up and started the array, I started getting repeating errors in the log (I didn't check the log right away):

Dec 11 23:52:12 SF-unRAID kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 44138, rd 22, flush 0, corrupt 0, gen 0
Dec 11 23:52:17 SF-unRAID kernel: handle_bad_sector: 23 callbacks suppressed
Dec 11 23:52:17 SF-unRAID kernel: attempt to access beyond end of device
Dec 11 23:52:17 SF-unRAID kernel: sdd1: rw=34817, want=889003608, limit=838860736
Dec 11 23:52:17 SF-unRAID kernel: btrfs_dev_stat_print_on_error: 23 callbacks suppressed

This device is part of a BTRFS RAID1 cache pool. Searching for that error brought me to this thread indicating a drive error. Given the info there, I started a non-correcting BTRFS scrub, which produced this output:

Scrub started:    Sun Dec 12 00:26:12 2021
Status:           finished
Duration:         0:03:31
Total to scrub:   205.88GiB
Rate:             995.87MiB/s
Error summary:    read=7228792
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

The log also showed some new errors:

Dec 12 00:28:53 SF-unRAID kernel: BTRFS warning (device sdd1): i/o error at logical 1534910537728 on dev /dev/sdd1, physical 455527899136, root 5, inode 259, offset 6868205568, length 4096, links 1 (path: system/docker/docker.img)
Dec 12 00:28:53 SF-unRAID kernel: BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 1996317, rd 7292557, flush 0, corrupt 0, gen 0

Since everything pointed to a drive issue, I ran short and long SMART self-tests, which completed without error. The SMART attributes look good to me, except for 233:

#    ATTRIBUTE NAME               FLAG    VALUE  WORST  THRESHOLD  TYPE      UPDATED  FAILED  RAW VALUE
5    Reallocated sector count     0x0033  100    100    005        Pre-fail  Always   Never   0
9    Power on hours               0x0032  091    091    000        Old age   Always   Never   44866 (5y, 1m, 12d, 10h)
12   Power cycle count            0x0032  099    099    000        Old age   Always   Never   38
175  Program fail count chip      0x0032  100    100    000        Old age   Always   Never   0
176  Erase fail count chip        0x0032  100    100    000        Old age   Always   Never   0
177  Wear leveling count          0x0012  094    094    000        Old age   Always   Never   961
178  Used rsvd block count chip   0x0013  095    095    005        Pre-fail  Always   Never   472
179  Used rsvd block count tot    0x0012  095    095    000        Old age   Always   Never   253
180  Unused rsvd block count tot  0x0012  095    095    000        Old age   Always   Never   10024
181  Program fail count total     0x0032  100    100    000        Old age   Always   Never   0
182  Erase fail count total       0x0032  100    100    000        Old age   Always   Never   0
194  Temperature celsius          0x0032  069    049    000        Old age   Always   Never   31
195  ECC error rate               0x0032  100    100    000        Old age   Always   Never   0
198  Offline uncorrectable        0x0030  100    100    000        Old age   Offline  Never   0
199  CRC error count              0x003e  100    100    000        Old age   Always   Never   0
202  Exception mode status        0x0033  100    100    090        Pre-fail  Always   Never   0
233  Media wearout indicator      0x0032  001    001    000        Old age   Always   Never   518009717538

Notably, #177 to #180 don't appear to show an issue, so I'm not sure whether #233 is a good indicator of the SSD's remaining life. I have logs from 2016 and 2019 showing the #233 value already at 001 (with a much lower raw value than now), whereas #177 decreased from 099 to 097 to 094. I'm inclined to think #177 is the standard wearout indicator for this drive.
But #178 to #180 are unchanged, so I'm not sure what to think.

I viewed the logs very recently, so I'm basically certain this started with the reboot. During the shutdown I did a few things:

- Moved the server slightly
- Added a drive to the hot-swap bay
- Replaced the UPS battery (I had been putting it off, unwisely)
- Booted with an external drive attached (which now comes up as sda, though I doubt it's relevant)

So my questions are:

- Is this definitely a physical drive issue?
- If so, I'll plan to replace it with a larger SSD to match the 2nd one. Is there any issue with that? (I'll look for instructions if that's the case, but if anyone wants to point me in the right direction or offer tips, that would be appreciated.)
- If not, what further diagnostic steps should I take? Should I try the correcting scrub?
- Why did this start upon reboot? Maybe the SSD went into read-only mode when power cycled (how would I tell?)

Presumably this will keep filling up my log unless I shut down my dockers, so I may have to reboot. Please let me know if there's any other info I should capture before doing so.

sf-unraid-diagnostics-20211212-0145.zip
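For reference, here's roughly what I ran to get the above, in case it helps (device and mount names are from my system; the scrub was started read-only so nothing gets written to the suspect drive):

# read-only (non-correcting) scrub of the pool, then check its progress/result
btrfs scrub start -r /mnt/cache
btrfs scrub status /mnt/cache

# per-device error counters (the wr/rd/flush/corrupt/gen numbers from the syslog)
btrfs device stats /mnt/cache

# short and long SMART self-tests on the suspect SSD, then the full report
smartctl -t short /dev/sdd
smartctl -t long /dev/sdd
smartctl -a /dev/sdd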
JorgeB Posted December 12, 2021

Dec 11 11:05:46 SF-unRAID kernel: attempt to access beyond end of device
Dec 11 11:05:46 SF-unRAID kernel: sdd1: rw=34817, want=896828480, limit=838860736

This is the main issue, but also a strange one. Please post the output of:

lsblk -b
fritzdis Posted December 12, 2021

NAME   MAJ:MIN RM           SIZE RO TYPE MOUNTPOINT
loop0    7:0    0       12562432  1 loop /lib/modules
loop1    7:1    0       20738048  1 loop /lib/firmware
loop2    7:2    0    42949672960  0 loop /var/lib/docker
sda      8:0    0  4000787029504  0 disk
└─sda1   8:1    0  4000785104896  0 part
sdb      8:16   1     4012900352  0 disk
└─sdb1   8:17   1     4010803200  0 part /boot
sdc      8:32   0   960197124096  0 disk
└─sdc1   8:33   0   960197091328  0 part
sdd      8:48   0   480103981056  0 disk
└─sdd1   8:49   0   429496696832  0 part /mnt/cache
sde      8:64   0 12000138625024  0 disk
└─sde1   8:65   0 12000138575360  0 part

That discrepancy for sdd1 definitely stands out. Like I said, I'm basically certain it wasn't an issue before the reboot, but maybe the power cycle on the drive triggered something. sdc is the other drive in the pool.

I would kinda like to have a bigger cache anyway, though it would be nice to identify the issue. Happy to do any suggested troubleshooting, but if it's going to be a major pain to figure out, I don't mind just moving on.
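Doing the math on the kernel message (assuming, as I understand it, that want= and limit= are counts of 512-byte sectors):

838860736 × 512 = 429496696832 bytes  (exactly the sdd1 partition size above)
896828480 × 512 = 459176181760 bytes  (past the partition end, but within the drive's 480103981056 bytes)

So the filesystem is trying to reach addresses beyond where the partition now ends, even though the drive itself is big enough.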
fritzdis Posted December 12, 2021

I also forgot that I had just installed the My Servers plugin (before the reboot). But I was only looking at the log because the flash drive didn't seem to back up immediately (it's fine now). I did not see those errors occurring then, and I'm pretty sure I had dockers running at the time that would have triggered the drive issue if it had existed before the reboot.
JorgeB Posted December 12, 2021

31 minutes ago, fritzdis said:

sdd      8:48   0   480103981056  0 disk
└─sdd1   8:49   0   429496696832  0 part /mnt/cache

The partition isn't using the full device capacity. You'll need to resize it, or, probably better yet (to make sure everything gets solved), back up the cache, wipe the drives and re-format, then restore the data.
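The rough shape of that, leaving aside the unRAID-specific steps of stopping the array and re-assigning the pool (paths and device names here are just examples from this thread, so adjust to your setup):

# with docker/VMs stopped, copy everything off the pool to an array disk
rsync -a /mnt/cache/ /mnt/disk1/cache_backup/

# with the pool unmounted/unassigned, clear the old partition signatures
wipefs -a /dev/sdd1 /dev/sdd
wipefs -a /dev/sdc1 /dev/sdc

# after re-creating and formatting the pool, copy the data back
rsync -a /mnt/disk1/cache_backup/ /mnt/cache/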
fritzdis Posted December 12, 2021

What do you think the chances are that the drive will be totally fine? I might just set all the shares to cache: yes to clear things off, then to cache: no once it's cleared (until I have more time to deal with this). If I had thought of doing that earlier, I think I wouldn't have posted right away (it's hard not to panic a bit when errors like that show up).

So, assuming I'm able to move everything off without issue, I'll probably mark this solved for now and follow up once I'm better able to address it.
fritzdis Posted December 12, 2021 (Solution)

OK, new (confusing) update: after another reboot (because the log file filled up with mover errors), here's the relevant output of lsblk -b:

NAME   MAJ:MIN RM          SIZE RO TYPE MOUNTPOINT
sdd      8:48   0  480103981056  0 disk
└─sdd1   8:49   0  480103948288  0 part /mnt/cache

Somehow it fixed itself? No more of the BTRFS errors in the log yet. I certainly don't trust the filesystem on the drive to be in a good state considering all the previous errors, so I'll continue clearing off the cache in order to reinitialize it.
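In the meantime, a few read-only checks along these lines (again, names are from my system) should show whether btrfs now agrees with the partition table and whether the error counters are still climbing:

btrfs filesystem show /mnt/cache   # device sizes as btrfs sees them
btrfs device stats /mnt/cache      # counters persist across reboots until reset with -z
btrfs scrub start -r /mnt/cache    # another non-correcting scrub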
JorgeB Posted December 13, 2021

10 hours ago, fritzdis said:

Somehow it fixed itself?

That's very strange. The partition could be damaged; you should re-format.
fritzdis Posted December 13, 2021

Agreed, there's no way I would use the cache pool as is, given the errors and the prior partition status. I've moved the shares off the cache. There were some mover issues with Plex (it seems like a known issue with broken symlinks). I've set all shares to cache: no, so the only thing left on the pool is the orphaned Plex files, which I haven't had time to deal with.

Without any of the shares using the cache, I don't think I'm at risk any more, except for maybe Plex breaking, but I can just reinstall that if needed. I'm not sure I want to trust the drive going forward, but at least now I can take my time (I'm going out of town soon).
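For anyone hitting the same mover/symlink issue, something like this should list any broken symlinks left on the pool (an untested suggestion; adjust the path to your pool mount):

find /mnt/cache -xtype l

-xtype l matches symlinks whose target no longer exists, which is what the mover seems to choke on.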