Uggh, cache drive borked


Solved by JorgeB

Yesterday evening I updated to 6.12.0. Everything came back online fine and I went to bed. This morning I woke up to find that all my Docker containers had stopped, along with the Docker engine. Looking through the syslog, I can see a ton of "BTRFS error" reports on my NVMe cache drive, which holds /appdata and /system. The main array looks OK. I note that a scheduled trim task ran just before these errors, but I'm not sure whether that is related.

 

Sample of the syslog:

 

Jun 19 04:56:22 Server kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: root=2 block=156609576960 slot=44, invalid key objectid, have 18446612688409594768 expect to be aligned to 4096
Jun 19 04:56:22 Server kernel: BTRFS info (device nvme0n1p1): leaf 156609576960 gen 15881817 total ptrs 105 free space 8093 owner 2
Jun 19 04:56:22 Server kernel: 	item 0 key (1956950016 168 45056) itemoff 16230 itemsize 53
Jun 19 04:56:22 Server kernel: 		extent refs 1 gen 13815472 flags 1
Jun 19 04:56:22 Server kernel: 		ref#0: extent data backref root 5 objectid 265178326 offset 24576 count 1
...
Jun 19 04:56:26 Server kernel: I/O error, dev loop2, sector 31924224 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
Jun 19 04:56:26 Server kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 36, rd 0, flush 0, corrupt 0, gen 0
Jun 19 04:56:26 Server kernel: loop: Write error at byte offset 20109533184, length 4096.
...
Jun 19 07:10:50 Server kernel: BTRFS error (device nvme0n1p1: state EA): parent transid verify failed on logical 156753166336 mirror 1 wanted 15881817 found 15881803
### [PREVIOUS LINE REPEATED 7 TIMES] ###
Jun 19 07:11:05 Server kernel: verify_parent_transid: 3 callbacks suppressed
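
For anyone who would rather tally these messages than eyeball them, here is a minimal Python sketch that counts BTRFS kernel messages per device and severity. The function name `summarize_btrfs_messages` and the regular expression are my own assumptions based only on the line shapes in the sample above; other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Assumption: lines look like the sample above, e.g.
#   "... kernel: BTRFS error (device nvme0n1p1: state EA): ..."
# Only critical/error/warning are counted; "info" lines are ignored.
BTRFS_MSG = re.compile(r"BTRFS (critical|error|warning) \(device ([^\s:)]+)")


def summarize_btrfs_messages(lines):
    """Return a Counter keyed by (device, severity) for matching lines."""
    counts = Counter()
    for line in lines:
        match = BTRFS_MSG.search(line)
        if match:
            severity, device = match.groups()
            counts[(device, severity)] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the syslog sample in this post.
    sample = [
        "Jun 19 04:56:26 Server kernel: BTRFS error (device loop2): "
        "bdev /dev/loop2 errs: wr 36, rd 0, flush 0, corrupt 0, gen 0",
        "Jun 19 07:10:50 Server kernel: BTRFS error (device nvme0n1p1: state EA): "
        "parent transid verify failed on logical 156753166336 mirror 1 "
        "wanted 15881817 found 15881803",
    ]
    for (device, severity), n in sorted(summarize_btrfs_messages(sample).items()):
        print(f"{device} {severity}: {n}")
```

Note that a line such as "[PREVIOUS LINE REPEATED 7 TIMES]" is not expanded by this sketch, so the counts are a lower bound.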

 

I noticed that the timing of the syslog errors corresponds to my nightly rsync job, which copies "/mnt/user" to a standby server. The rsync log from that job reports that most files were copied across to the standby server, but around 800 failed due to read errors, so I now have a mixed set of files on the standby as well. I do have a full archive from a week ago, so I should be able to reconstruct a reasonable set of /appdata and /system files. Thankfully, only around 50 of the 800 affected files are what I would call important (e.g. influxdb data, home-assistant storage) and should be recoverable from what I have. The rest are things like Plex metadata files.

 

I haven't yet rebooted the server or tried to repair the NVMe drive. I have, however, copied the remaining files from the cache drive onto a spare unassigned drive, so that is positive at least. I think my options are either: 1) try a Scrub/Repair of the cache drive from the Unraid GUI, or 2) reboot the machine and see what happens. If both of those fail, then I assume I need to reformat the drive and start the reconstruction process. Any advice on the best way forward?

 

I have attached my diagnostic files, as I would really like to know what caused this issue in the first place. Is it coincidental with the OS upgrade? I'd appreciate any insight.

server-diagnostics-20230619-1635.zip

@JorgeB, many thanks for the suggestion. I ran memtest a few times and got clean passes, so it doesn't seem to be a persistent memory problem at least. I restarted Unraid, and the cache filesystem looks fine and I can access the files that were previously erroring out, which is great. The experience does leave me a bit wary, and I wonder what could have caused the problem. Is there any other testing tool I can use, or any way to monitor this going forward? Thanks again for the help.
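
On the monitoring question, one possible approach (a sketch, not something confirmed in this thread) is to check the per-device error counters that `btrfs device stats` reports; these are the same wr/rd/flush/corrupt/gen counters that appear in the syslog line for loop2 above. The Python below only parses the command's text output. The helper names `parse_device_stats` and `nonzero_counters` are hypothetical, the mount point `/mnt/cache` in the comment is an assumption about where the cache pool is mounted, and the sample text is illustrative rather than real output from this server.

```python
import re

# Matches lines such as "[/dev/nvme0n1p1].write_io_errs    0"
STAT_LINE = re.compile(
    r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)\s*$"
)


def parse_device_stats(text):
    """Parse `btrfs device stats` output into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            device_stats = stats.setdefault(match["device"], {})
            device_stats[match["counter"]] = int(match["value"])
    return stats


def nonzero_counters(stats):
    """Return (device, counter, value) tuples for every counter above zero."""
    return [
        (device, counter, value)
        for device, counters in stats.items()
        for counter, value in counters.items()
        if value > 0
    ]


if __name__ == "__main__":
    # Illustrative text only. In practice it could come from something like
    #   subprocess.run(["btrfs", "device", "stats", "/mnt/cache"],
    #                  capture_output=True, text=True).stdout
    # where /mnt/cache is an assumed mount point.
    sample = """\
[/dev/loop2].write_io_errs    36
[/dev/loop2].read_io_errs     0
[/dev/loop2].flush_io_errs    0
[/dev/loop2].corruption_errs  0
[/dev/loop2].generation_errs  0
"""
    for device, counter, value in nonzero_counters(parse_device_stats(sample)):
        print(f"{device}: {counter} = {value}")
```

Run from a scheduled script, any non-empty result from `nonzero_counters` could trigger a notification. The counters are cumulative, so they stay non-zero until they are reset.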
