Lately I've been seeing a lot of errors like the ones below for my cache drives. Sometimes they show up for the first drive, and sometimes for the second.
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
### [PREVIOUS LINE REPEATED 8 TIMES] ###
Nov 12 10:32:50 Tower kernel: print_req_error: 1288 callbacks suppressed
Nov 12 10:32:50 Tower kernel: print_req_error: I/O error, dev sdd, sector 31464
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Nov 12 10:32:50 Tower kernel: btrfs_dev_stat_print_on_error: 4120 callbacks suppressed
Nov 12 10:32:50 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdd1 errs: wr 40584686, rd 14313628, flush 2491, corrupt 0, gen 0
Nov 12 10:32:50 Tower kernel: print_req_error: I/O error, dev sdd, sector 2336
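For anyone who wants to pull the numbers out of the `errs:` line above, here is a rough Python sketch. The function name `parse_btrfs_errs` is made up for illustration, and the regex assumes the exact message format shown in my syslog, so treat it as a starting point rather than a proper tool.

```python
import re

# Matches the per-device error counters in a BTRFS kernel log line.
# Assumption: the line follows the "bdev <dev> errs: wr N, rd N, ..." format seen above.
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def parse_btrfs_errs(line):
    """Return (device, counters dict), or None if the line has no errs summary."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    # Convert every counter except the device name to an integer.
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters

line = ("Nov 12 10:32:50 Tower kernel: BTRFS error (device sdb1): "
        "bdev /dev/sdd1 errs: wr 40584686, rd 14313628, flush 2491, corrupt 0, gen 0")
dev, counters = parse_btrfs_errs(line)
print(dev, counters)
```

As far as I understand it, `wr`, `rd`, and `flush` count failed write, read, and flush I/O operations, while `corrupt` and `gen` count checksum and generation mismatches. In my log the first three are huge and the last two are 0.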
I've tried replacing the cables, but I still get these errors. Until now, though, nothing had actually stopped working.
EDIT: (adding a small section with more context that I forgot to include...)
Today I noticed that a share already set to not use the cache still had some data on the cache drive. I manually copied those files off, but in the process /mnt/user became unreachable, with an error something like `transport has no endpoint`... That's why I decided to take the server down for checks.
End of edit
I took my server down just so I could run a check, but then realized my 2nd cache drive wouldn't show up. After I reconnected it, it showed up only under Unassigned Devices. On top of that, I was getting the "/dev/sdx is size zero" error in the syslog.
I rebooted one more time, and this time it auto-started with only 1 of the cache drives... and for some reason it was reporting something like 1.2 TB used / 800 GB free, even though my drives are 1 TB each...
Now this is the part where I'm worried I might have screwed up through my own stupidity. I started it again, and now it says Unmountable: No File System. It's now stuck stopping, endlessly retrying to unmount the user shares.
Did I permanently screw up? Or is this still recoverable?
tower-diagnostics-20201112-1841.zip
Edit 2: Adding a picture of how it looks now after another reboot.
I'm able to access the cache drive, so I'm backing it up for now.