Yesterday evening, I updated to 6.12.0. Everything came back online fine and I went to bed. I woke up this morning to find that all my Docker containers had stopped, as well as the Docker engine itself. Looking through the syslog I can see a ton of "BTRFS error" reports on the NVMe cache drive that holds /appdata and /system. The main array looks OK. I note that a scheduled TRIM task ran just prior to these errors, but I'm not sure if that is related.
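In case the extent matters, this is roughly how I tallied the errors from the console (a quick sketch; /var/log/syslog and /mnt/cache are the standard Unraid locations, adjust if yours differ):

# Count the BTRFS error/critical lines in the live syslog
grep -cE 'BTRFS (error|critical)' /var/log/syslog

# btrfs's own per-device error counters for the cache pool
btrfs dev stats /mnt/cache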
Sample of the syslog:
Jun 19 04:56:22 Server kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=2 block=156609576960 slot=44, invalid key objectid, have 18446612688409594768 expect to be aligned to 4096
Jun 19 04:56:22 Server kernel: BTRFS info (device nvme0n1p1): leaf 156609576960 gen 15881817 total ptrs 105 free space 8093 owner 2
Jun 19 04:56:22 Server kernel: item 0 key (1956950016 168 45056) itemoff 16230 itemsize 53
Jun 19 04:56:22 Server kernel: extent refs 1 gen 13815472 flags 1
Jun 19 04:56:22 Server kernel: ref#0: extent data backref root 5 objectid 265178326 offset 24576 count 1
...
Jun 19 04:56:26 Server kernel: I/O error, dev loop2, sector 31924224 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
Jun 19 04:56:26 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 36, rd 0, flush 0, corrupt 0, gen 0
Jun 19 04:56:26 Server kernel: loop: Write error at byte offset 20109533184, length 4096.
...
Jun 19 07:10:50 Server kernel: BTRFS error (device nvme0n1p1: state EA): parent transid verify failed on logical 156753166336 mirror 1 wanted 15881817 found 15881803
### [PREVIOUS LINE REPEATED 7 TIMES] ###
Jun 19 07:11:05 Server kernel: verify_parent_transid: 3 callbacks suppressed
I noticed that the timing of the syslog errors corresponded to my nightly rsync job that copies /mnt/user to a standby server. The rsync log from that job reports that most files copied across to the standby server, but around 800 failed due to read errors, so I now have a mixed set of files on the standby as well. I do have a full archive from a week ago, so I should be able to reconstruct a reasonable set of /appdata and /system files. Thankfully, only around 50 of the 800 impacted files are what I would call important (e.g. InfluxDB data, Home Assistant storage, etc.) and those should be recoverable from what I have. The rest are things like Plex metadata files and the like.
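For reference, this is roughly how I pulled the list of failed files out of the rsync log (the log path is from my setup and the exact message format can vary between rsync versions, so treat it as a sketch):

# rsync reports source-side read failures as lines like:
#   rsync: read errors mapping "/mnt/user/...": Input/output error (5)
grep 'read errors mapping' /boot/logs/rsync-standby.log \
  | sed 's/.*mapping "\(.*\)": Input\/output error.*/\1/' \
  > /boot/failed-files.txt

wc -l /boot/failed-files.txt   # comes out at around 800 for me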
I haven't yet rebooted the server or tried to repair the NVMe drive. I have, however, taken a copy of the remaining readable files from the cache drive onto a spare unassigned drive, so that is positive at least. I think my options are either: 1) try a Scrub/Repair of the cache drive from the Unraid GUI, or 2) reboot the machine and see what happens. If both of those fail, then I assume I need to reformat the drive and start the reconstruction process. Any advice on the best way forward?
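For what it's worth, if I go down the scrub route first, my understanding is that the GUI button is roughly equivalent to the following from the console (a sketch only; the device and mount names are mine, and on a single-device pool a scrub can detect data corruption but only repair metadata where a DUP copy exists):

# Option 1: scrub the mounted pool; -B stays in the foreground and
# prints a summary (errors found / corrected) when it finishes
btrfs scrub start -B /mnt/cache
btrfs scrub status /mnt/cache

# A read-only consistency check is only valid on an unmounted filesystem,
# so this would have to wait until the array is stopped
btrfs check --readonly /dev/nvme0n1p1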
I have attached my diagnostic files as I would really like to know what caused this issue in the first place. Is it coincidental with the OS upgrade? Appreciate any insight.
server-diagnostics-20230619-1635.zip