KDP Posted February 23, 2023 (edited)

Two days ago I swapped out some drives from my array. When I rebooted the server I started getting a lot of errors on one of the SSDs in my cache pool, and my docker image was corrupted. I went back into the server and made sure that all of the cables on the SSDs were secure. I have not seen any errors in my log since, and I have rebuilt all my dockers.

I have run btrfs dev stats /mnt/cache/ and it shows the following:

[/dev/sdb1].write_io_errs 32262152
[/dev/sdb1].read_io_errs 30052334
[/dev/sdb1].flush_io_errs 90551
[/dev/sdb1].corruption_errs 460309
[/dev/sdb1].generation_errs 4267
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0

Those numbers have not changed at all in the 24 hours that the system has been running since.

I also ran a SMART extended test and it shows:

1 Raw read error rate 0x0032 100 100 050 Old age Always Never 0
5 Reallocated sector count 0x0032 100 100 050 Old age Always Never 0
9 Power on hours 0x0032 100 100 050 Old age Always Never 34814 (3y, 11m, 17d, 14h)
12 Power cycle count 0x0032 100 100 050 Old age Always Never 34
160 Unknown attribute 0x0032 100 100 050 Old age Always Never 0
161 Unknown attribute 0x0033 100 100 050 Pre-fail Always Never 100
163 Unknown attribute 0x0032 100 100 050 Old age Always Never 10
164 Unknown attribute 0x0032 100 100 050 Old age Always Never 286621
165 Unknown attribute 0x0032 100 100 050 Old age Always Never 2130
166 Unknown attribute 0x0032 100 100 050 Old age Always Never 339
167 Unknown attribute 0x0032 100 100 050 Old age Always Never 581
168 Unknown attribute 0x0032 100 100 050 Old age Always Never 7000
169 Unknown attribute 0x0032 100 100 050 Old age Always Never 92
175 Program fail count chip 0x0032 100 100 050 Old age Always Never 0
176 Erase fail count chip 0x0032 100 100 050 Old age Always Never 0
177 Wear leveling count 0x0032 100 100 050 Old age Always Never 0
178 Used rsvd block count chip 0x0032 100 100 050 Old age Always Never 0
181 Program fail count total 0x0032 100 100 050 Old age Always Never 0
182 Erase fail count total 0x0032 100 100 050 Old age Always Never 0
192 Power-off retract count 0x0032 100 100 050 Old age Always Never 15
194 Temperature celsius 0x0022 100 100 050 Old age Always Never 40
195 Hardware ECC recovered 0x0032 100 100 050 Old age Always Never 278670
196 Reallocated event count 0x0032 100 100 050 Old age Always Never 0
197 Current pending sector 0x0032 100 100 050 Old age Always Never 0
198 Offline uncorrectable 0x0032 100 100 050 Old age Always Never 0
199 UDMA CRC error count 0x0032 100 100 050 Old age Always Never 1
232 Available reservd space 0x0032 100 100 050 Old age Always Never 100
241 Total lbas written 0x0030 100 100 050 Old age Offline Never 2296155
242 Total lbas read 0x0030 100 100 050 Old age Offline Never 1118873
245 Unknown attribute 0x0032 100 100 050 Old age Always Never 2687976

Warning: ATA error count 0 inconsistent with error log pointer 1
ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error -4 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 00 00 00 00 00 00

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
e5 00 00 00 00 00 00 08 00:00:00.000 CHECK POWER MODE
b0 d5 01 00 4f c2 00 08 00:00:00.000 SMART READ LOG
b0 d1 01 01 4f c2 00 08 00:00:00.000 SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
ec 00 01 00 00 00 00 08 00:00:00.000 IDENTIFY DEVICE
b0 d5 01 01 4f c2 00 08 00:00:00.000 SMART READ LOG

I will keep a close eye on it for a few more days, but can I chalk it up to my bumbling in the case, or should I be concerned about the drive?

Edited February 23, 2023 by KDP

KDP (Author) Posted February 23, 2023

Diagnostics: elvis-diagnostics-20230223-0852.zip

JorgeB Posted February 23, 2023

The server was rebooted after the problem, so we can't see what happened. Most often when a device drops offline like that it's a cable problem. I suggest replacing the cables to rule that out, then running a scrub. Also see here for better pool monitoring.

KDP (Author) Posted February 23, 2023

I have already run a scrub and implemented the noted script. I believe that I dislodged a cable while working in the case, and I am hoping that reseating the cables has fixed the issue. I just wanted a second opinion. Thank you for taking the time to take a look at everything!
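
For anyone who wants to script the "have the numbers changed?" check described in the first post, here is a minimal Python sketch. It is not the monitoring script linked above. It only parses the text printed by btrfs dev stats, lists devices with any non-zero counter, and compares two snapshots. The function names and the SAMPLE constant are made up for illustration, and the sample text is copied from the output quoted in this thread.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/sdb1].write_io_errs 32262152"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_dev_stats(text):
    """Return {device: {counter: value}} parsed from `btrfs dev stats` text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match["dev"]][match["counter"]] = int(match["value"])
    return dict(stats)


def devices_with_errors(stats):
    """Sorted list of devices that have at least one non-zero counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))


def changed_counters(before, after):
    """Counters that differ between two snapshots: {(dev, counter): (old, new)}."""
    changes = {}
    for dev, counters in after.items():
        for name, new in counters.items():
            old = before.get(dev, {}).get(name, 0)
            if new != old:
                changes[(dev, name)] = (old, new)
    return changes


# Hypothetical sample: the output quoted in the first post of this thread.
SAMPLE = """\
[/dev/sdb1].write_io_errs 32262152
[/dev/sdb1].read_io_errs 30052334
[/dev/sdb1].flush_io_errs 90551
[/dev/sdb1].corruption_errs 460309
[/dev/sdb1].generation_errs 4267
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0
"""

if __name__ == "__main__":
    snapshot = parse_dev_stats(SAMPLE)
    print(devices_with_errors(snapshot))         # → ['/dev/sdb1']
    print(changed_counters(snapshot, snapshot))  # → {}
```

The counters are cumulative and stay at their old values until they are reset (btrfs dev stats -z does that), so a device that once had errors keeps showing up in devices_with_errors. Comparing snapshots, as changed_counters does, is what tells you whether new errors are still occurring.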