nmills3 · Posted November 8, 2023

I just started getting CRC errors on one of my cache drives. I've never had issues with this drive before, but in the last hour or so it's racked up about 4,000 errors. I'm assuming I need to replace the drive, but I'm not sure how to replace a cache drive. It's part of a 3-drive RAID 1 pool, so am I safe to just stop the array, rip out the old drive and throw a new one in, or do I need to move all the files off the cache drives first?
JorgeB · Posted November 8, 2023

Replace the SATA cable first.
nmills3 (Author) · Posted November 8, 2023

Well, it's using one of those SAS-to-4x-SATA breakout cables from the HBA. Is it likely that the cable suddenly went bad?
nmills3 (Author) · Posted November 8, 2023

Also, once I replace the cable, do I need to do anything else? If it were the array I'd run a parity check afterwards, but I don't know if the cache needs anything like that.
JorgeB · Posted November 8, 2023

UDMA CRC errors are almost always a SATA cable problem, and any cable can go bad at any time.
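If you want to confirm the cable swap worked, you can watch the raw SMART counter; a quick sketch, assuming the suspect drive is /dev/sdg (adjust the device name for your system). The counter is cumulative and never resets, so what matters is whether it keeps climbing. And to answer the earlier question: the cache-pool equivalent of a parity check is a btrfs scrub.

```shell
# Read the UDMA CRC error counter (SMART attribute 199) for a drive.
# The raw value only ever increases; after replacing the cable, check
# it again later and make sure the number has stopped growing.
smartctl -A /dev/sdg | awk '$1 == 199 {print "CRC errors:", $NF}'

# The btrfs analogue of a parity check: scrub verifies the checksums
# of all data and metadata across every member of the pool.
btrfs scrub start -B /mnt/cache
btrfs scrub status /mnt/cache
```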
nmills3 (Author) · Posted November 8, 2023

So I just tried stopping the array so it wouldn't keep accumulating errors while I wait for a replacement cable. The server sat on the loading spinner for a while and wouldn't load any other pages. Now the web UI is responsive again, but the array still isn't stopped and the log is filled with this message:

Nov 8 19:32:50 Tower kernel: BTRFS error (device sdf1): error writing primary super block to device 2

sdf is the first drive in the cache pool, but sdg, the second drive in that pool, is the one that was giving the CRC errors.
nmills3 (Author) · Posted November 8, 2023

OK, the array finally stopped. Hopefully my cache data still exists.
nmills3 (Author) · Posted November 8, 2023

I'll attach diagnostics just in case they're useful.

Attachment: tower-diagnostics-20231108-1938.zip
nmills3 (Author) · Posted November 8, 2023

So I just reseated the cables and restarted the server, and now most of my Docker containers that use appdata won't start, and a few of them have errors about a read-only file system. The syslog also has some checksum errors and a lot of I/O errors from the same drive as before. I'll attach a new diagnostic to this post. So on a scale of 1-10, how screwed am I in terms of the data on the cache pool?

Attachment: tower-diagnostics-20231108-2045.zip
nmills3 (Author) · Posted November 8, 2023

Also, ignore the drive errors from sde; that's an unassigned drive in a slot that I know is bad.
nmills3 (Author) · Posted November 8, 2023

@JorgeB any other advice on what I can do would be appreciated.
JorgeB · Posted November 9, 2023

Looks like the pool is corrupt, but there's a lot of log spam because of the bad drive. Disconnect it and post new diags after the array starts.
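To separate real corruption from the log noise a dying drive generates, you can also ask btrfs for its per-device error counters; a sketch, assuming the pool mounts at /mnt/cache:

```shell
# Per-device counters: write_io_errs, read_io_errs, flush_io_errs,
# corruption_errs and generation_errs for every member of the pool.
# Non-zero corruption_errs indicates actual bad data, not just I/O noise.
btrfs device stats /mnt/cache

# If the pool still mounts, a read-only scrub (-r) reports how much
# data fails checksum verification without modifying anything.
btrfs scrub start -Bdr /mnt/cache
```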
nmills3 (Author) · Posted November 9, 2023

OK, I've disconnected both of the bad drives and it seems to be working now. The cache pool had enough free space to convert to a 2-drive RAID 1, and the main array has already been running with a missing disk for about two months, so everything seems to be working. I'm going to try reinstalling the SSD once the replacement cable shows up, but until then at least the server can limp along. I'll attach new diagnostics anyway in case there's anything useful.

Attachment: tower-diagnostics-20231109-1056.zip
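In case it helps anyone searching later: the pool shrink the GUI performed appears to correspond to a btrfs device removal under the hood. A rough sketch, with the caveat that this is an assumption about what Unraid runs and the device names are examples:

```shell
# Remove a member from a mounted btrfs pool; btrfs migrates its chunks
# to the remaining devices before releasing it, so the pool must have
# enough free space to absorb the removed drive's data.
btrfs device remove /dev/sdg1 /mnt/cache

# If a drive vanished while the pool was offline, the pool has to be
# mounted degraded first and the absent member removed as 'missing':
mount -o degraded /dev/sdf1 /mnt/cache
btrfs device remove missing /mnt/cache
```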
nmills3 (Author) · Posted November 11, 2023

@JorgeB so my situation has gotten even stranger. I just replaced the cable and moved the server to a new case without the cheap hot-swap backplane, and now the normal drives all seem to be working fine, but my cache pool is completely dead. I moved everything, checked that all the disks showed up, re-assigned a disk that had had a faulty cable back to the main array, and started the array. Now the cache pool that was working this morning as a 2-drive RAID 1 just shows two disks with the error "Unmountable: Unsupported or no file system". I have no idea what could have happened to them. I did a clean shutdown beforehand, and everything had seemed fine since my last update; nothing appeared corrupted and everything was working, and now this.

Attachment: tower-diagnostics-20231111-1817.zip
nmills3 (Author) · Posted November 12, 2023

So I think I found a bug in Unraid. It seems the SSD I removed a few days ago was preventing the cache from mounting. With that drive installed but not assigned to the cache pool (it was a 3-drive pool with the 2nd slot unassigned), the pool would fail to mount. There was also an issue with the superblock size, but that was fixed with "btrfs rescue fix-device-size" after running a check on the pool. Then, when running "btrfs filesystem show", I noticed it was listing the installed-but-unassigned drive as part of the pool, but with a different size (about 80 GB less than the drives that were actually assigned to the pool). So I shut down and pulled that drive again, and now the server starts fine and the pool seems to be functioning properly. My guess is that when Unraid removed the disk from the pool the first time, it didn't actually remove it properly, and it was still being treated as part of the pool.
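For anyone hitting the same thing, the sequence I ran was roughly this (against the unmounted pool; device names are examples for my system):

```shell
# Non-destructive check of the filesystem (pool must be unmounted):
btrfs check --readonly /dev/sdf1

# Repair the superblock device-size mismatch reported by the check:
btrfs rescue fix-device-size /dev/sdf1

# List every device btrfs believes belongs to each filesystem; this is
# where the removed-but-still-present SSD showed up as an extra member
# with a smaller size than the assigned drives:
btrfs filesystem show
```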
nmills3 (Author) · Posted November 12, 2023

Should I make this into a new post? I think it might be a bit outside the scope of the original issue at this point.
JorgeB · Posted November 12, 2023

Nov 11 15:42:30 Tower emhttpd: Total devices 3 FS bytes used 847.91GiB
Nov 11 15:42:30 Tower emhttpd: devid 1 size 931.51GiB used 787.03GiB path /dev/sdc1
Nov 11 15:42:30 Tower emhttpd: devid 2 size 931.51GiB used 718.03GiB path /dev/sdd1
Nov 11 15:42:30 Tower emhttpd: devid 3 size 931.51GiB used 787.03GiB path /dev/sdb1
Nov 11 15:42:30 Tower emhttpd: cache: invalid config: total_devices 3 num_misplaced 1 num_missing 0

The pool currently consists of 3 devices, but only two are assigned, so it doesn't mount. This is not a bug; it's by design.
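In other words, emhttpd compares the member count recorded in the btrfs metadata against the number of slots assigned in the GUI and refuses to mount on a mismatch. A toy shell re-creation of that check (the log line is copied from above; the assigned count of 2 and the logic itself are a hypothetical illustration, not Unraid's actual code):

```shell
# Hypothetical re-creation: btrfs metadata says the pool has 3 members,
# but only 2 drives are assigned to the pool in the Unraid GUI.
log='cache: invalid config: total_devices 3 num_misplaced 1 num_missing 0'

# Pull the total_devices count out of the log line.
total=$(printf '%s\n' "$log" | grep -o 'total_devices [0-9]*' | awk '{print $2}')
assigned=2   # drives currently assigned to the pool (assumption)

if [ "$total" -ne "$assigned" ]; then
    echo "refusing to mount: pool has $total members, $assigned assigned"
fi
```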
nmills3 (Author) · Posted November 12, 2023

But when I removed the SSD and started the array a few days ago, it asked me if I wanted to remove the device and I said yes. Shouldn't that mean it was removed from the pool?
JorgeB · Posted November 13, 2023

Possibly it should have been, but it was not, so all three devices are required now. You can re-import the pool with all 3 assigned (not just by adding the unassigned device; if you need help re-importing, let me know), then post new diags.
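As an optional sanity check before re-importing, you can try mounting the pool by hand with all three members connected, outside the array; a sketch (device name and mount point are examples):

```shell
# Make sure the kernel has registered all btrfs member devices.
btrfs device scan

# Mount read-only via any one member; with a multi-device btrfs pool,
# the kernel pulls in the other registered members automatically.
mkdir -p /mnt/test
mount -o ro /dev/sdc1 /mnt/test

# If the data is visible here, re-importing the 3-device pool in the
# GUI should succeed. Unmount before starting the array.
ls /mnt/test
umount /mnt/test
```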