Ongoing SSD BTRFS Errors

February 1, 2024

Day 3 of the errors continuing. Diags attached.

I have previously erased / formatted the drives and rebuilt docker, restored everything back to where it was - just for the errors to return again overnight.

Next step was to wipe and reformat the drives again. Same issue came back while restoring files: I have some beefy extra files (~45GB worth) and the restore never completed.

RAM: ran Memtest for ~1 hour, 2 complete test cycles, no errors (2 x 4GB).

Based on some other comments I've seen, I decided to replace the SATA cables. Erased / reformatted again, and while restoring the extra files the system froze up. I restarted and now have CRC errors on one of the drives (they had shown up before on the other drive in the pool). The pool mounts and everything, so I'm reinstalling my containers to see what it does or doesn't do. The space used on the pool is in the right range, ~137GB out of 1TB. Interestingly enough, it's now showing one of these drives both as part of the pool and in Unassigned Devices, not mounted. This didn't happen before I swapped the cables.

One of the SSDs is a little over a year old, the other shows ~2+ years powered on. The daily stuff hitting the cache 24/7 would be the Plex transcode share and my Omada SDN working folders. Not sure if that would put excessive wear and tear on the drives. I did run SMART tests on both of them when they were working earlier and both reported no issues.

Thoughts or suggestions for next troubleshooting steps?

A couple of things come to mind: the cables I swapped in are "older" but never really used previously. The ones I took out, I believe, came with a motherboard from a build last year (not this PC build; this MB / CPU is ~10 years old). I did use the same ports on the board that I've been using for over a year now. Trying to change just one thing at a time where feasible.

rlyeh-diagnostics-20240201-1431.zip

February 1, 2024

Author

FWIW I'm thinking of putting the previous cables back since they're newer, then trying to rebuild everything with just a single drive (the newer one) to see what that does or doesn't do...

I also have a small 240GB drive that's been sitting on a shelf for a while. Another option to try.

Edited February 1, 2024 by Azreal

February 2, 2024

Community Expert

Crucial SSD is dropping offline:

Feb  1 11:59:50 Rlyeh kernel: ata1: COMRESET failed (errno=-32)
Feb  1 11:59:50 Rlyeh kernel: ata1: reset failed, giving up
Feb  1 11:59:50 Rlyeh kernel: ata1.00: disable device
Feb  1 11:59:50 Rlyeh kernel: ata1: EH complete
Feb  1 11:59:50 Rlyeh kernel: sd 1:0:0:0: rejecting I/O to offline device
Feb  1 11:59:50 Rlyeh kernel: I/O error, dev sdb, sector 73832 op 0x1:(WRITE) flags 0x1800 phys_seg 88 prio class 2

Could be cables, but since it's an MX500 see here:

https://forums.unraid.net/topic/137795-6115-issues-with-btrfs-cache-after-ssds-replacement-and-hw-upgrade/?do=findComment&comment=1265949
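To check whether a drive has dropped like this before, or drops again after a cable swap, the syslog can be scanned for the same kernel messages. Below is a minimal Python sketch, not an official tool: the patterns are taken only from the lines quoted above, and the function name `find_drop_events` and the event labels are made up for illustration. Other kernels or controllers may word these messages differently.

```python
import re

# Patterns copied from the kernel messages quoted in this thread.
# Event names on the left are illustrative labels, not kernel terminology.
PATTERNS = {
    "comreset_failed": re.compile(r"(ata\d+): COMRESET failed"),
    "reset_gave_up": re.compile(r"(ata\d+): reset failed, giving up"),
    "device_disabled": re.compile(r"(ata\d+)\.\d+: disable device"),
    "io_error": re.compile(r"I/O error, dev (\w+), sector \d+"),
}


def find_drop_events(lines):
    """Return (event_name, port_or_device, line) for each matching log line."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((name, match.group(1), line.rstrip()))
                break  # one label per line is enough
    return events


# The log excerpt from this post, used as sample input.
SAMPLE = """\
Feb  1 11:59:50 Rlyeh kernel: ata1: COMRESET failed (errno=-32)
Feb  1 11:59:50 Rlyeh kernel: ata1: reset failed, giving up
Feb  1 11:59:50 Rlyeh kernel: ata1.00: disable device
Feb  1 11:59:50 Rlyeh kernel: ata1: EH complete
Feb  1 11:59:50 Rlyeh kernel: sd 1:0:0:0: rejecting I/O to offline device
Feb  1 11:59:50 Rlyeh kernel: I/O error, dev sdb, sector 73832 op 0x1:(WRITE) flags 0x1800 phys_seg 88 prio class 2
""".splitlines()

if __name__ == "__main__":
    for name, subject, _line in find_drop_events(SAMPLE):
        print(f"{name}: {subject}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, e.g. the syslog text file inside the diagnostics zip.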

Also see here for better pool monitoring:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
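One common way to watch a btrfs pool is to check the per-device error counters from `btrfs device stats`. Here is a rough Python sketch of that idea, which is not necessarily what the linked post describes. Assumptions: the pool is mounted at `/mnt/cache`, the helper names are hypothetical, and the sample counter values and the `/dev/sdc1` device are invented for illustration, not taken from the diagnostics.

```python
import re
import subprocess

# Matches lines such as "[/dev/sdb1].write_io_errs    0"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Turn `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            counters = stats.setdefault(match["dev"], {})
            counters[match["counter"]] = int(match["value"])
    return stats


def nonzero_counters(stats):
    """List (device, counter, value) for every counter that is not zero."""
    return [
        (dev, counter, value)
        for dev, counters in stats.items()
        for counter, value in counters.items()
        if value
    ]


def read_pool_stats(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on a mounted pool (mount point is an assumption)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


# Made-up sample output, only to show the format being parsed.
SAMPLE_OUTPUT = """\
[/dev/sdb1].write_io_errs    88
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
"""

if __name__ == "__main__":
    for dev, counter, value in nonzero_counters(parse_device_stats(SAMPLE_OUTPUT)):
        print(f"{dev}: {counter} = {value}")
```

Any non-zero counter means that device has had errors since the counters were last reset, so a check like this could be run on a schedule and wired to a notification.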

February 2, 2024

Author

6 hours ago, JorgeB said:
Crucial SSD is dropping offline:
Feb  1 11:59:50 Rlyeh kernel: ata1: COMRESET failed (errno=-32)
Feb  1 11:59:50 Rlyeh kernel: ata1: reset failed, giving up
Feb  1 11:59:50 Rlyeh kernel: ata1.00: disable device
Feb  1 11:59:50 Rlyeh kernel: ata1: EH complete
Feb  1 11:59:50 Rlyeh kernel: sd 1:0:0:0: rejecting I/O to offline device
Feb  1 11:59:50 Rlyeh kernel: I/O error, dev sdb, sector 73832 op 0x1:(WRITE) flags 0x1800 phys_seg 88 prio class 2
Could be cables, but since it's an MX500 see here:

https://forums.unraid.net/topic/137795-6115-issues-with-btrfs-cache-after-ssds-replacement-and-hw-upgrade/?do=findComment&comment=1265949

Thanks, I do have that firmware. I ran into issues with that immediately when I set it up and got pointed in the right direction.

Reading the pool monitoring info now.

Set up my 240GB as a single cache drive and got everything up and working. Set up another cache pool with the 2 1TB drives that started all of this, and I've been able to copy large amounts of files over without any errors. Left it powered on overnight, no errors this morning. Once my regularly scheduled parity check finishes I'm going to try switching back to the drives that started all of this and see how it goes. I'm getting good at tearing it down and spinning it back up at least :).
