gmight Posted December 10, 2023

I've been working on a problem with my Unraid server for about three weeks: the infamous tainted-CPU (macvlan) problem. Kernel panic attached for giggles. Note that the server had been up and stable since 6.9.2. Then I had some power issues in my area and found out that my UPS has a bad battery, so when the power hit, the server shut down, as expected with a bad battery. Powered back on, everything was good, then within 12 hours, bam: macvlan kernel errors and a locked UI. I upgraded to 6.12.4 since it promised to address the macvlan issue. Nope, didn't work. At that point I did a bunch of reading and switched from macvlan to ipvlan; I made the necessary adjustments on my switch and the kernel panics ended. Yay.

One thing worth mentioning: with all the reboots and lockups, my docker image got corrupted (or so I think). So last weekend I spent some time restoring my docker image. Fairly easy, no issues restoring, and my VMs were there and worked just fine.

Fast forward to today. Everything was good and normal, I ran an errand, and when I got back my cache was erroring out.
Syslog reports the following:

Dec 9 13:01:16 semperfi kernel: BTRFS critical (device nvme0n1p1): corrupt leaf: block=3821052706816 slot=11 extent bytenr=3749963485184 len=8192 unknown inline ref type: 129
Dec 9 13:01:16 semperfi kernel: BTRFS info (device nvme0n1p1): leaf 3821052706816 gen 2451645 total ptrs 166 free space 3335 owner 2
Dec 9 13:01:16 semperfi kernel: BTRFS error (device nvme0n1p1): block=3821052706816 write time tree block corruption detected
Dec 9 13:01:16 semperfi kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2494: errno=-5 IO failure (Error while writing out transaction)
Dec 9 13:01:16 semperfi kernel: BTRFS info (device nvme0n1p1: state E): forced readonly

grep syslog for nvme after reboot:

Dec 9 18:48:19 semperfi kernel: nvme nvme0: pci function 0000:01:00.0
Dec 9 18:48:19 semperfi kernel: nvme nvme1: pci function 0000:04:00.0
Dec 9 18:48:19 semperfi kernel: nvme nvme1: 24/0/0 default/read/poll queues
Dec 9 18:48:19 semperfi kernel: nvme1n1: p1
Dec 9 18:48:19 semperfi kernel: BTRFS: device fsid 1d7002b3-d702-4b6d-9706-05ab85e49ff2 devid 2 transid 2451644 /dev/nvme1n1p1 scanned by udevd (1166)
Dec 9 18:48:19 semperfi kernel: nvme nvme0: 24/0/0 default/read/poll queues
Dec 9 18:48:19 semperfi kernel: nvme0n1: p1
Dec 9 18:48:19 semperfi kernel: BTRFS: device fsid 1d7002b3-d702-4b6d-9706-05ab85e49ff2 devid 1 transid 2451644 /dev/nvme0n1p1 scanned by udevd (1166)
Dec 9 18:48:54 semperfi emhttpd: INTEL_SSDPE2MD800G4_CVFT72450037800QGN (nvme0n1) 512 1562824368
Dec 9 18:48:54 semperfi emhttpd: INTEL_SSDPE2MD800G4_CVFT5200000X800HGN (nvme1n1) 512 1562824368
Dec 9 18:48:54 semperfi emhttpd: import 30 cache device: (nvme1n1) INTEL_SSDPE2MD800G4_CVFT5200000X800HGN
Dec 9 18:48:54 semperfi emhttpd: import 31 cache device: (nvme0n1) INTEL_SSDPE2MD800G4_CVFT72450037800QGN
Dec 9 18:48:54 semperfi emhttpd: read SMART /dev/nvme0n1
Dec 9 18:48:54 semperfi emhttpd: read SMART /dev/nvme1n1
Dec 9 18:52:12 semperfi kernel: nvme1n1: p1
Dec 9 18:52:17 semperfi kernel: nvme0n1: p1
Dec 9 18:55:31 semperfi emhttpd: devid 1 size 745.21GiB used 374.06GiB path /dev/nvme0n1p1
Dec 9 18:55:31 semperfi emhttpd: devid 2 size 745.21GiB used 374.06GiB path /dev/nvme1n1p1
Dec 9 18:55:31 semperfi kernel: BTRFS info (device nvme0n1p1): using crc32c (crc32c-intel) checksum algorithm
Dec 9 18:55:31 semperfi kernel: BTRFS info (device nvme0n1p1): using free space tree
Dec 9 18:55:31 semperfi kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Dec 9 18:55:31 semperfi kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Dec 9 18:55:31 semperfi kernel: BTRFS info (device nvme0n1p1): enabling ssd optimizations
Dec 9 18:55:31 semperfi kernel: BTRFS info (device nvme0n1p1): start tree-log replay
Dec 9 18:55:31 semperfi kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on logical 3821051052032 mirror 2 wanted 2451649 found 2451645
Dec 9 18:55:31 semperfi kernel: BTRFS error (device nvme0n1p1): parent transid verify failed on logical 3821051052032 mirror 1 wanted 2451649 found 2451645
Dec 9 18:55:31 semperfi kernel: BTRFS warning (device nvme0n1p1): failed to read log tree
Dec 9 18:55:31 semperfi root: mount: /mnt/cache: can't read superblock on /dev/nvme0n1p1.
Dec 9 18:55:31 semperfi kernel: BTRFS error (device nvme0n1p1): open_ctree failed
Dec 9 18:57:43 semperfi kernel: nvme1n1: p1
Dec 9 18:59:29 semperfi kernel: nvme0n1: p1

Not cool. So naturally I rebooted, and when I manually brought up the array it came up without the cache. So no docker or VMs, but the data in the array is untouched, which is a huge plus. When I went to Fix Common Problems I saw a warning, and my dual-NVMe cache reported an error (screenshots attached). All my docker and VM data should still be there, but of course I don't know what to do at this point. The error above links me to this particular forum for anyone dealing with the "unmountable disk..." error. So here I am.
I hope I can get some help with this one; I'd rather not rebuild both docker and the VMs after a format. I've attached my diagnostics to the thread.

System: MB=Supermicro X11SPi-TF, CPU=Xeon 4214, MEM=256GB (4x64GB) Micron Tech @ 2400MT/s, Cache=2x Intel P3700 U.2 NVMe

dmesg | grep error
[  473.318420] BTRFS error (device nvme0n1p1): parent transid verify failed on logical 3821051052032 mirror 2 wanted 2451649 found 2451645
[  473.318587] BTRFS error (device nvme0n1p1): parent transid verify failed on logical 3821051052032 mirror 1 wanted 2451649 found 2451645
[  473.321780] BTRFS error (device nvme0n1p1): open_ctree failed

I've read a few threads so far and ended up here. If this is where I need to go, then I'll just wipe the disk and get my containers back; I'll get my VMs up as needed. Also, people point to memory? I can easily swap that out if that's a thing. I'll start a memtest tonight when I go to sleep to see if anything pops. I doubt it, but I'll do it anyway.

Again, I was fully recovered from the macvlan/kernel panics, but all of a sudden this happens within 24 hours of adding VLANs. I've never had this kind of issue before, so I'm hoping I can get some guidance from my fellow Unraid users. Sorry if I'm all over the place; naturally I'm panicking, but trying to keep my cool the best I can. I'm still going through logs, trying to find a clue. Thanks for your feedback.

semperfi-diagnostics-20231209-1939.zip
kernel panic dec5.txt
JorgeB Posted December 10, 2023 (Solution)

If the log tree is the only problem, this may help:

btrfs rescue zero-log /dev/nvme0n1p1

Then restart the array.
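For anyone landing here later, the same repair can be sketched as a small script for a two-device pool. This is a sketch, not something JorgeB prescribed beyond the zero-log command itself: the read-only `btrfs check` pass is an extra precaution I've added, the device paths are the ones from this thread, and the script defaults to only printing the commands so nothing touches your devices until you opt in. The array should be stopped (pool unmounted) before running any of it for real.

```shell
#!/bin/sh
# Sketch of the recovery sequence for a two-device btrfs pool.
# Device paths match this thread; adjust for your system.
DEV1=/dev/nvme0n1p1
DEV2=/dev/nvme1n1p1

# DRY_RUN defaults to 1: commands are printed for review, not executed.
# Set DRY_RUN=0 only with the array stopped and the pool unmounted.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

run btrfs check --readonly "$DEV1"   # read-only consistency check, writes nothing
run btrfs rescue zero-log "$DEV1"    # discard the log tree that cannot be replayed
run btrfs rescue zero-log "$DEV2"    # clear the stale log on the second device too
```

Run as-is it just prints the three commands; that gives you a chance to sanity-check the device names before committing.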
gmight Posted December 10, 2023 (Author)

Nice, that brought the array up on the first go.

Commands:

root@semperfi:/mnt/user# btrfs rescue zero-log /dev/nvme0n1p1
parent transid verify failed on 3821051052032 wanted 2451649 found 2451645
parent transid verify failed on 3821051052032 wanted 2451649 found 2451645
Couldn't setup log root tree
Clearing log on /dev/nvme0n1p1, previous log_root 3821051052032, level 0
root@semperfi:/mnt/user# btrfs rescue zero-log /dev/nvme1n1p1
Clearing log on /dev/nvme1n1p1, previous log_root 0, level 0

Ran a scrub immediately after: "no errors found". Ran a Fix Common Problems scan: "nothing to report".

grep for nvme:

root@semperfi:~# cat /var/log/syslog | grep -E "nvme"
Dec 10 06:19:50 semperfi emhttpd: read SMART /dev/nvme0n1
Dec 10 06:19:50 semperfi emhttpd: read SMART /dev/nvme1n1
Dec 10 06:19:56 semperfi emhttpd: INTEL_SSDPE2MD800G4_CVFT72450037800QGN (nvme0n1) 512 1562824368
Dec 10 06:19:56 semperfi emhttpd: INTEL_SSDPE2MD800G4_CVFT5200000X800HGN (nvme1n1) 512 1562824368
Dec 10 06:19:56 semperfi emhttpd: import 30 cache device: (nvme1n1) INTEL_SSDPE2MD800G4_CVFT5200000X800HGN
Dec 10 06:19:56 semperfi emhttpd: import 31 cache device: (nvme0n1) INTEL_SSDPE2MD800G4_CVFT72450037800QGN
Dec 10 06:19:57 semperfi emhttpd: read SMART /dev/nvme0n1
Dec 10 06:19:57 semperfi emhttpd: read SMART /dev/nvme1n1
Dec 10 06:20:31 semperfi emhttpd: devid 1 size 745.21GiB used 374.06GiB path /dev/nvme0n1p1
Dec 10 06:20:31 semperfi emhttpd: devid 2 size 745.21GiB used 374.06GiB path /dev/nvme1n1p1
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): using crc32c (crc32c-intel) checksum algorithm
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): using free space tree
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): enabling ssd optimizations
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): cleaning free space cache v1
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): checking UUID tree
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1: state M): turning on async discard
Dec 10 06:23:56 semperfi emhttpd: spinning down /dev/nvme1n1
Dec 10 06:23:56 semperfi emhttpd: sdspin /dev/nvme1n1 down: 1
Dec 10 06:29:46 semperfi kernel: BTRFS info (device nvme0n1p1): scrub: started on devid 1
Dec 10 06:29:46 semperfi kernel: BTRFS info (device nvme0n1p1): scrub: started on devid 2
Dec 10 06:33:49 semperfi kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 2 with status: 0
Dec 10 06:33:51 semperfi kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 1 with status: 0

root@semperfi:~# dmesg | grep error
[  473.318420] BTRFS error (device nvme0n1p1): parent transid verify failed on logical 3821051052032 mirror 2 wanted 2451649 found 2451645
[  473.318587] BTRFS error (device nvme0n1p1): parent transid verify failed on logical 3821051052032 mirror 1 wanted 2451649 found 2451645
[  473.321780] BTRFS error (device nvme0n1p1): open_ctree failed
[41584.904323] BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[41584.906278] BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[41584.909332] BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[41584.909725] BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

At this point I wonder if there's anything else I need to look at. Do you recommend any other diagnostics I should run on the cache to make sure it's healthy? And if it's not too much to ask, how does the log tree get messed up like that? This time I was away from my system, with no user intervention.
In other words, how can I prevent this from happening again? I'm trying to learn a bit more. I threw a fresh diag on the thread. I appreciate the assist, JorgeB, hell of a save, bro! Thank you!

semperfi-diagnostics-20231210-0701.zip
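On the diagnostics question: the `bdev ... errs:` lines in the syslog above are btrfs's persistent per-device error counters (the same numbers `btrfs device stats /mnt/cache` reports), and watching them for increases is a cheap ongoing health check. A small sketch that flags any nonzero corrupt count in a saved syslog; the inline sample reproduces two lines from this thread, and on a live box you would point the awk command at /var/log/syslog instead:

```shell
#!/bin/sh
# Flag nonzero "corrupt" counters in BTRFS "errs:" syslog lines.
# Writes a small sample log for illustration; use /var/log/syslog on a live system.
cat > /tmp/btrfs-errs.log <<'EOF'
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Dec 10 06:20:31 semperfi kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
EOF

# Each matching line ends: ... bdev <path> errs: wr N, rd N, flush N, corrupt N, gen N
report=$(awk '/BTRFS.*errs:/ {
    for (i = 1; i <= NF; i++) if ($i == "bdev") dev = $(i + 1)
    c = $(NF - 2); sub(/,/, "", c)   # corrupt count sits two fields from the end
    if (c + 0 > 0) printf "%s corrupt=%s\n", dev, c
}' /tmp/btrfs-errs.log)
echo "$report"
```

A nonzero counter does not pinpoint the cause, but a counter that keeps climbing between scrubs is a strong hint the pool (or the hardware under it) still has a problem.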
JorgeB Posted December 11, 2023

Make sure backups are up to date; this issue can sometimes reoccur. If it does in the near future, the same command should still work, but then I'd recommend backing up and recreating the pool.
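To make the "backup before recreating the pool" step concrete, here is one way to sketch it. Assumptions are mine, not JorgeB's: the destination share `/mnt/user/backups` is hypothetical (substitute whatever array share you use), docker and VMs should be stopped first so files are not changing mid-copy, and the script defaults to only printing the rsync command.

```shell
#!/bin/sh
# Hedged sketch: copy the cache pool contents to an array share before
# recreating the pool. /mnt/user/backups is an assumed share name.
SRC=/mnt/cache/
DST="/mnt/user/backups/cache-$(date +%Y%m%d)/"

# DRY_RUN defaults to 1: the command is printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

# -a preserves permissions and ownership; -H keeps hard links intact
run rsync -aH "$SRC" "$DST"
```

After the pool is recreated and formatted, the same rsync with source and destination swapped restores the data.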
gmight Posted December 12, 2023 (Author)

Yeah, with the corruption of my docker image last week and the log tree issue this weekend, I think you're right about this one circling back. If it does, I'll rebuild and maybe go with a ZFS cache. I've been looking at what benefits I might gain; probably not many, right? ZFS is best at protecting against the human factor, not the file system itself or other unknowns. Regardless, I appreciate the advice and the fix. Marked solved; you rock, dude.