nmills3 · Posted November 8, 2023

I just started getting CRC errors on one of my cache drives. I've never had issues with this drive before, but in the last hour or so it's racked up about 4,000 errors. I'm assuming I need to replace the drive, but I'm not sure how to replace a cache drive. It's part of a 3-drive RAID 1 pool, so am I safe to just stop the array, rip out the old drive and throw a new one in, or do I need to move all the files off the cache drives first?
JorgeB · Posted November 8, 2023

Replace the SATA cable first.
nmills3 (Author) · Posted November 8, 2023

Well, it's using one of those SAS-to-4x-SATA breakout cables from the HBA. Is it likely that the cable suddenly went bad?
nmills3 (Author) · Posted November 8, 2023

Also, once I replace the cable, do I need to do anything else? If it were the array I'd run a parity check afterwards, but I don't know if the cache needs anything like that.
JorgeB · Posted November 8, 2023

UDMA CRC errors are almost always a SATA cable problem, and any cable can go bad at any time.
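If you want to confirm the cable swap worked, you can watch the raw SMART counter; a quick sketch, assuming the suspect drive is /dev/sdg (adjust the device name for your system). The counter is cumulative and never resets, so what matters is whether it keeps climbing. And to answer the earlier question: the cache-pool equivalent of a parity check is a btrfs scrub.

```shell
# Read the UDMA CRC error counter (SMART attribute 199) for a drive.
# The raw value only ever increases; after replacing the cable, check
# it again later and make sure the number has stopped growing.
smartctl -A /dev/sdg | awk '$1 == 199 {print "CRC errors:", $NF}'

# The btrfs analogue of a parity check: scrub verifies the checksums
# of all data and metadata across every member of the pool.
btrfs scrub start -B /mnt/cache
btrfs scrub status /mnt/cache
```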
nmills3 (Author) · Posted November 8, 2023

So I just tried stopping the array so it wouldn't keep accumulating errors while I wait for a replacement cable. The server sat on the loading spinner for a while and wouldn't load any other pages. Now the web UI is responsive again, but the array still isn't stopped and the log is filled with this message:

Nov 8 19:32:50 Tower kernel: BTRFS error (device sdf1): error writing primary super block to device 2

sdf is the first drive in the cache pool, but sdg, the second drive in that pool, is the one that was giving the CRC errors.
nmills3 (Author) · Posted November 8, 2023

OK, the array finally stopped. Hopefully my cache data still exists.
nmills3 (Author) · Posted November 8, 2023

I'll attach diagnostics just in case they're useful.

Attachment: tower-diagnostics-20231108-1938.zip
nmills3 (Author) · Posted November 8, 2023

So I just reseated the cables and restarted the server, and now most of my Docker containers that use appdata won't start, and a few of them have errors about a read-only file system. The syslog also has some checksum errors and a lot of I/O errors from the same drive as before. I'll attach a new diagnostic to this post. So on a scale of 1-10, how screwed am I in terms of the data on the cache pool?

Attachment: tower-diagnostics-20231108-2045.zip
nmills3 (Author) · Posted November 8, 2023

Also, ignore the drive errors from sde; that's an unassigned drive in a slot that I know is bad.
nmills3 (Author) · Posted November 8, 2023

@JorgeB any other advice on what I can do would be appreciated.
JorgeB · Posted November 9, 2023

Looks like the pool is corrupt, but there's a lot of log spam because of the bad drive. Disconnect it and post new diags after the array starts.
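To separate real corruption from the log noise a dying drive generates, you can also ask btrfs for its per-device error counters; a sketch, assuming the pool mounts at /mnt/cache:

```shell
# Per-device counters: write_io_errs, read_io_errs, flush_io_errs,
# corruption_errs and generation_errs for every member of the pool.
# Non-zero corruption_errs indicates actual bad data, not just I/O noise.
btrfs device stats /mnt/cache

# If the pool still mounts, a read-only scrub (-r) reports how much
# data fails checksum verification without modifying anything.
btrfs scrub start -Bdr /mnt/cache
```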
nmills3 (Author) · Posted November 9, 2023

OK, I've disconnected both of the bad drives and it seems to be working now. The cache pool had enough free space to convert to a 2-drive RAID 1, and the main array has already been running with a missing disk for about two months, so everything seems to be working. I'm going to try reinstalling the SSD once the replacement cable shows up, but until then at least the server can limp along. I'll attach new diagnostics anyway in case there's anything useful.

Attachment: tower-diagnostics-20231109-1056.zip
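In case it helps anyone searching later: the pool shrink the GUI performed appears to correspond to a btrfs device removal under the hood. A rough sketch, with the caveat that this is an assumption about what Unraid runs and the device names are examples:

```shell
# Remove a member from a mounted btrfs pool; btrfs migrates its chunks
# to the remaining devices before releasing it, so the pool must have
# enough free space to absorb the removed drive's data.
btrfs device remove /dev/sdg1 /mnt/cache

# If a drive vanished while the pool was offline, the pool has to be
# mounted degraded first and the absent member removed as 'missing':
mount -o degraded /dev/sdf1 /mnt/cache
btrfs device remove missing /mnt/cache
```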
nmills3 (Author) · Posted November 11, 2023

@JorgeB so my situation has gotten even stranger. I just replaced the cable and moved the server to a new case without the cheap hot-swap backplane, and now the normal drives all seem to be working fine, but my cache pool is completely dead. I moved everything, checked that all the disks showed up, re-assigned a disk that had had a faulty cable back to the main array, and started the array. Now the cache pool that was working this morning as a 2-drive RAID 1 just shows two disks with the error "Unmountable: Unsupported or no file system". I have no idea what could have happened to them. I did a clean shutdown beforehand, and everything had seemed fine since my last update; nothing appeared corrupted and everything was working, and now this.

Attachment: tower-diagnostics-20231111-1817.zip
nmills3 (Author) · Posted November 12, 2023

So I think I found a bug in Unraid. It seems the SSD I removed a few days ago was preventing the cache from mounting. With that drive installed but not assigned to the cache pool (it was a 3-drive pool with the 2nd slot unassigned), the pool would fail to mount. There was also an issue with the superblock size, but that was fixed with "btrfs rescue fix-device-size" after running a check on the pool. Then, when running "btrfs filesystem show", I noticed it was listing the installed-but-unassigned drive as part of the pool, but with a different size (about 80 GB less than the drives that were actually assigned to the pool). So I shut down and pulled that drive again, and now the server starts fine and the pool seems to be functioning properly. My guess is that when Unraid removed the disk from the pool the first time, it didn't actually remove it properly, and it was still being treated as part of the pool.
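For anyone hitting the same thing, the sequence I ran was roughly this (against the unmounted pool; device names are examples for my system):

```shell
# Non-destructive check of the filesystem (pool must be unmounted):
btrfs check --readonly /dev/sdf1

# Repair the superblock device-size mismatch reported by the check:
btrfs rescue fix-device-size /dev/sdf1

# List every device btrfs believes belongs to each filesystem; this is
# where the removed-but-still-present SSD showed up as an extra member
# with a smaller size than the assigned drives:
btrfs filesystem show
```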
nmills3 (Author) · Posted November 12, 2023

Should I make this into a new post? I think it might be a bit outside the scope of the original issue at this point.
JorgeB · Posted November 12, 2023

Nov 11 15:42:30 Tower emhttpd: Total devices 3 FS bytes used 847.91GiB
Nov 11 15:42:30 Tower emhttpd: devid 1 size 931.51GiB used 787.03GiB path /dev/sdc1
Nov 11 15:42:30 Tower emhttpd: devid 2 size 931.51GiB used 718.03GiB path /dev/sdd1
Nov 11 15:42:30 Tower emhttpd: devid 3 size 931.51GiB used 787.03GiB path /dev/sdb1
Nov 11 15:42:30 Tower emhttpd: cache: invalid config: total_devices 3 num_misplaced 1 num_missing 0

The pool currently consists of 3 devices, but only two are assigned, so it doesn't mount. This is not a bug; it's by design.
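In other words, emhttpd compares the member count recorded in the btrfs metadata against the number of slots assigned in the GUI and refuses to mount on a mismatch. A toy shell re-creation of that check (the log line is copied from above; the assigned count of 2 and the logic itself are a hypothetical illustration, not Unraid's actual code):

```shell
# Hypothetical re-creation: btrfs metadata says the pool has 3 members,
# but only 2 drives are assigned to the pool in the Unraid GUI.
log='cache: invalid config: total_devices 3 num_misplaced 1 num_missing 0'

# Pull the total_devices count out of the log line.
total=$(printf '%s\n' "$log" | grep -o 'total_devices [0-9]*' | awk '{print $2}')
assigned=2   # drives currently assigned to the pool (assumption)

if [ "$total" -ne "$assigned" ]; then
    echo "refusing to mount: pool has $total members, $assigned assigned"
fi
```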
nmills3 (Author) · Posted November 12, 2023

But when I removed the SSD and started the array a few days ago, it asked me if I wanted to remove the device and I said yes. Shouldn't that mean it was removed from the pool?
JorgeB · Posted November 13, 2023

Possibly it should have been, but it was not, so all three devices are required now. You can re-import the pool with all 3 assigned (not just by adding the unassigned device; if you need help re-importing, let me know), then post new diags.
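As an optional sanity check before re-importing, you can try mounting the pool by hand with all three members connected, outside the array; a sketch (device name and mount point are examples):

```shell
# Make sure the kernel has registered all btrfs member devices.
btrfs device scan

# Mount read-only via any one member; with a multi-device btrfs pool,
# the kernel pulls in the other registered members automatically.
mkdir -p /mnt/test
mount -o ro /dev/sdc1 /mnt/test

# If the data is visible here, re-importing the 3-device pool in the
# GUI should succeed. Unmount before starting the array.
ls /mnt/test
umount /mnt/test
```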