October 22, 2024
Unraid has become very unstable. This is a new issue after updating. I think I went from 6.12.5 to 6.12.13, so a big step up. I have installed a couple of new dockers (audiobookshelf, fail2ban), but really have not done much with them yet. Looking at my logs, it seems to be pointing to issues with my cache drive and/or my docker file. I just don't know what specifically is causing the issue or how to fix it. Anyone have any ideas? A reboot temporarily fixes it, but I'm back with the same issues about a day later. Doing a software reboot just hangs the system and I have to manually power down. So I get an unclean shutdown, and a parity check that takes about 19 hours starts on reboot. Therefore, I can't use mover to try to move things off my cache. I've disabled basically all my dockers except for plex and pihole. I just cleaned up some space on the cache drive, but it wasn't full. I had over 100GB left on it. Any advice on where to start here would be greatly appreciated. I can follow guides, but when it comes to troubleshooting Linux, I'm at a loss. tower-diagnostics-20241021-1916.zip
October 22, 2024, Community Expert
There's filesystem corruption on the pool. I recommend backing up and recreating it, but since btrfs was also detecting data corruption, you should run memtest first.
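For readers who want to see what btrfs itself has counted, a rough sketch follows. It assumes the per-device counters have been captured as text from `btrfs device stats <mountpoint>`; the device name and numbers in `SAMPLE` are made up for illustration and are not from this system's diagnostics.

```python
import re

# Text in the layout printed by `btrfs device stats <mountpoint>`.
# Device name and counts are hypothetical, for illustration only.
SAMPLE = """\
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  12
[/dev/nvme0n1p1].generation_errs  0
"""

# One counter per line: [device].counter_name   value
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<count>\d+)$")


def nonzero_stats(text):
    """Return {(device, counter): count} for every counter above zero."""
    found = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m and int(m["count"]) > 0:
            found[(m["dev"], m["name"])] = int(m["count"])
    return found


if __name__ == "__main__":
    for (dev, name), count in nonzero_stats(SAMPLE).items():
        print(f"{dev}: {name} = {count}")
```

A non-zero `corruption_errs` only says that checksum mismatches were seen at some point; it does not say whether the drive or the RAM caused them, which is why memtest still comes first.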
October 22, 2024, Author
Thanks for your reply. The data corruption detected by btrfs is on my cache, correct? Those are the errors I'm seeing in the logs involving my cache pool. So is everything on my cache drive potentially corrupted? And when you say I have filesystem corruption, are you talking about my flash drive? I'm not sure what exactly I need to recreate or how to go about doing that. I know how to back up my appdata, flash drive, etc. My concern is backing up corrupted data and restoring that. I'm assuming that the memory test is to try to figure out what caused the data corruption? I don't know what would have happened during my update, but that's my best guess as to when my issues started. Sorry for my ignorance. (Edited October 22, 2024 by gath2)
October 22, 2024, Community Expert
gath2 said: "So is everything on my cache drive potentially corrupted?"
Most likely not. You can run a scrub to confirm, but I recommend running memtest first.
October 22, 2024, Author
So am I correct in that it's my cache drive that I need to figure out what exactly is corrupted and then ultimately why it got corrupted to hopefully prevent it again? The cache gets locked to read-only after it finds corrupted files, which pretty much hoses all the dockers because the appdata folder is stored on the cache pool. Is there a way to identify the corrupted files using my logs? I'm still not sure what I am rebuilding. Entire AppData folder, individual dockers, docker.img, something else? Thank you for all you do here, JorgeB. Your input is invaluable to novice users like myself. (Edited October 22, 2024 by gath2)
October 23, 2024, Community Expert
gath2 said: "So am I correct in that it's my cache drive that I need to figure out what exactly is corrupted and then ultimately why it got corrupted to hopefully prevent it again?"
Yes, just run a scrub; it will list any corrupt files.
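As a rough illustration of how the scrub results can be read back out of the syslog: when a scrub hits a bad checksum in file data, the kernel normally logs a "checksum error at logical ..." line that ends with the affected path. The sketch below assumes that wording, which may differ between kernel versions; the sample lines, device name, and file path are hypothetical and not taken from the diagnostics in this thread.

```python
import re

# Matches kernel lines such as:
#   BTRFS warning (device X): checksum error at logical N ... (path: some/file)
# The exact wording is an assumption and may vary by kernel version.
CSUM = re.compile(
    r"BTRFS (?:warning|error) \(device (?P<dev>[^)]+)\): "
    r"checksum error at logical \d+.*\(path: (?P<path>.+)\)\s*$"
)

# Hypothetical syslog excerpt, for illustration only.
SAMPLE = [
    "Oct 23 05:00:01 Tower kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 123456 on dev /dev/nvme0n1p1, physical 123456, root 5, inode 257, offset 0, length 4096, links 1 (path: appdata/example/poster.jpg)",
    "Oct 23 05:00:01 Tower kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 127552 on dev /dev/nvme0n1p1, physical 127552, root 5, inode 257, offset 4096, length 4096, links 1 (path: appdata/example/poster.jpg)",
    "Oct 23 05:00:02 Tower kernel: some unrelated message",
]


def corrupt_paths(lines):
    """Return each path named in a checksum-error line, once, in first-seen order."""
    seen = []
    for line in lines:
        m = CSUM.search(line)
        if m and m["path"] not in seen:
            seen.append(m["path"])
    return seen


if __name__ == "__main__":
    for path in corrupt_paths(SAMPLE):
        print(path)
```

The paths are relative to the root of the pool, so on a pool mounted at `/mnt/cache` (an assumption about the mount point) the file above would be `/mnt/cache/appdata/example/poster.jpg`.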
October 23, 2024, Author
Parity finished rebuilding. I ran mover. I deleted all my old downloads that were on cache and ran the scrub. It only found a single posterart.jpg file that was corrupted. I deleted it, ran the scrub again, and everything was fine. Ran Fix Common Problems and it found no issues. Mind you, I have basically all my dockers off to prevent writing to the cache drive, so nothing new should really be written to the drive, to hopefully prevent new problems until I figure out what is going on. Watched a few things on Plex last night without any errors in the logs. This morning before work I attempted to run a manual AppData Backup, and the following showed up in my syslog shortly after starting. It was backing up the first docker, which was already stopped. The web interface became unresponsive. The only dockers running were plex and pihole. Unfortunately, I can't pull a new diagnostic without the interface. This was a HARD lock. Previously, the dockers just stopped responding and I could get most things to shut down before rebooting. Hitting the power button did nothing. I had to hold it in and do a hard reset. Possibly a RAM issue, but looking at other people reporting issues it seems that it typically turns out to be plugin or docker related. Currently swapped memory out with a spare set I had and running a memory test. That's all I had time to do before going into work. I checked my previous AppData backup, which is from 3 days ago. Everything backed up except for Radarr, which had the corrupted file and didn't back up. I have to go back a week to get a good backup there, but that's not a big deal. Assuming my new memory tests well and I get the system back up, I'm not sure what to look at next if and when it crashes again. Hopefully this error log means more to someone else than it does to me.
Oct 23 07:02:35 Tower kernel: BUG: unable to handle page fault for address: 0000000000008000
Oct 23 07:02:35 Tower kernel: #PF: supervisor read access in kernel mode
Oct 23 07:02:35 Tower kernel: #PF: error_code(0x0000) - not-present page
Oct 23 07:02:35 Tower kernel: PGD 18d170067 P4D 18d170067 PUD 18d181067 PMD 0
Oct 23 07:02:35 Tower kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Oct 23 07:02:35 Tower kernel: CPU: 8 PID: 0 Comm: swapper/8 Tainted: P O 6.1.106-Unraid #1
Oct 23 07:02:35 Tower kernel: Hardware name: To Be Filled By O.E.M. Z790 Pro RS/D4/Z790 Pro RS/D4, BIOS 2.05 10/05/2022
Oct 23 07:02:35 Tower kernel: RIP: 0010:percpu_ref_get_many+0xd/0x2a
Oct 23 07:02:35 Tower kernel: Code: 31 c0 48 89 42 78 eb 0b 48 8b 02 48 89 42 70 eb 02 0f 0b 48 89 d8 5b 5d c3 cc cc cc cc 55 48 89 fd 53 48 89 f3 e8 f0 0e ea ff <48> 8b 45 00 a8 03 74 0a 48 8b 45 08 f0 48 01 18 eb 04 65 48 01 18
Oct 23 07:02:35 Tower kernel: RSP: 0018:ffffc900003a4e08 EFLAGS: 00010002
Oct 23 07:02:35 Tower kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
Oct 23 07:02:35 Tower kernel: RDX: ffff8881002c0000 RSI: 0000000000000001 RDI: 0000000000008000
Oct 23 07:02:35 Tower kernel: RBP: 0000000000008000 R08: ffff8882b8c61500 R09: ffffffff810da269
Oct 23 07:02:35 Tower kernel: R10: ffff8882b8c61500 R11: 0000000000032d40 R12: 0000000000008000
Oct 23 07:02:35 Tower kernel: R13: 0000000000000202 R14: ffff888100f64700 R15: 00000000fffffff8
Oct 23 07:02:35 Tower kernel: FS: 0000000000000000(0000) GS:ffff88904fa00000(0000) knlGS:0000000000000000
Oct 23 07:02:35 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 23 07:02:35 Tower kernel: CR2: 0000000000008000 CR3: 00000001766c6002 CR4: 0000000000770ee0
Oct 23 07:02:35 Tower kernel: PKRU: 55555554
Oct 23 07:02:35 Tower kernel: Call Trace:
Oct 23 07:02:35 Tower kernel: <IRQ>
Oct 23 07:02:35 Tower kernel: ? __die_body+0x1a/0x5c
Oct 23 07:02:35 Tower kernel: ? page_fault_oops+0x329/0x376
Oct 23 07:02:35 Tower kernel: ? do_user_addr_fault+0x12e/0x465
Oct 23 07:02:35 Tower kernel: ? exc_page_fault+0xfb/0x11d
Oct 23 07:02:35 Tower kernel: ? asm_exc_page_fault+0x22/0x30
Oct 23 07:02:35 Tower kernel: ? rcu_do_batch+0x27b/0x51f
Oct 23 07:02:35 Tower kernel: ? percpu_ref_get_many+0xd/0x2a
Oct 23 07:02:35 Tower kernel: ? percpu_ref_get_many+0xd/0x2a
Oct 23 07:02:35 Tower kernel: refill_obj_stock+0x60/0x138
Oct 23 07:02:35 Tower kernel: memcg_slab_free_hook+0x80/0xcf
Oct 23 07:02:35 Tower kernel: kmem_cache_free+0xb7/0x154
Oct 23 07:02:35 Tower kernel: ? rcu_do_batch+0x27b/0x51f
Oct 23 07:02:35 Tower kernel: rcu_do_batch+0x27b/0x51f
Oct 23 07:02:35 Tower kernel: rcu_core+0x265/0x2ac
Oct 23 07:02:35 Tower kernel: handle_softirqs+0x129/0x271
Oct 23 07:02:35 Tower kernel: __irq_exit_rcu+0x5e/0xb8
Oct 23 07:02:35 Tower kernel: sysvec_apic_timer_interrupt+0x85/0xa6
Oct 23 07:02:35 Tower kernel: </IRQ>
Oct 23 07:02:35 Tower kernel: <TASK>
Oct 23 07:02:35 Tower kernel: asm_sysvec_apic_timer_interrupt+0x16/0x20
Oct 23 07:02:35 Tower kernel: RIP: 0010:cpuidle_enter_state+0x11d/0x202
Oct 23 07:02:35 Tower kernel: Code: 86 c1 9f ff 45 84 ff 74 1b 9c 58 0f 1f 40 00 0f ba e0 09 73 08 0f 0b fa 0f 1f 44 00 00 31 ff e8 0a 7b a4 ff fb 0f 1f 44 00 00 <45> 85 e4 0f 88 ba 00 00 00 48 8b 04 24 49 63 cc 48 6b d1 68 49 29
Oct 23 07:02:35 Tower kernel: RSP: 0018:ffffc900001a7e98 EFLAGS: 00000246
Oct 23 07:02:35 Tower kernel: RAX: ffff88904fa00000 RBX: ffff88904fa37200 RCX: 0000000000000000
Oct 23 07:02:35 Tower kernel: RDX: 000073d5dfe18993 RSI: ffffffff820d99e1 RDI: ffffffff820d9eea
Oct 23 07:02:35 Tower kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
Oct 23 07:02:35 Tower kernel: R10: 0000000000000020 R11: 0000000000000092 R12: 0000000000000001
Oct 23 07:02:35 Tower kernel: R13: ffffffff82321760 R14: 000073d5dfe18993 R15: 0000000000000000
Oct 23 07:02:35 Tower kernel: ? cpuidle_enter_state+0xf7/0x202
Oct 23 07:02:35 Tower kernel: cpuidle_enter+0x2a/0x38
Oct 23 07:02:35 Tower kernel: do_idle+0x18d/0x1fb
Oct 23 07:02:35 Tower kernel: cpu_startup_entry+0x2a/0x2c
Oct 23 07:02:35 Tower kernel: start_secondary+0x101/0x101
Oct 23 07:02:35 Tower kernel: secondary_startup_64_no_verify+0xce/0xdb
Oct 23 07:02:35 Tower kernel: </TASK>
Oct 23 07:02:35 Tower kernel: Modules linked in: udp_diag cmac cifs asn1_decoder cifs_arc4 cifs_md4 dns_resolver nvidia_uvm(PO) ipvlan veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tun nf_tables nfnetlink ip6table_nat iptable_nat nf_nat tcp_diag inet_diag i915 drm_buddy i2c_algo_bit ttm drm_display_helper intel_gtt agpgart xt_connmark nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_mark iptable_mangle xt_comment xt_addrtype iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc bonding tls nvidia_drm(PO) nvidia_modeset(PO) intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel nvidia(PO) kvm
Oct 23 07:02:35 Tower kernel: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 input_leds sha1_ssse3 led_class joydev drm_kms_helper aesni_intel crypto_simd cryptd mei_hdcp mei_pxp rapl wmi_bmof drm intel_cstate mpt3sas intel_uncore nvme i2c_i801 mei_me i2c_smbus r8169 ahci i2c_core nvme_core mei syscopyarea sysfillrect libahci raid_class sysimgblt realtek fb_sys_fops scsi_transport_sas video wmi intel_pmc_core backlight acpi_pad acpi_tad button unix
Oct 23 07:02:35 Tower kernel: CR2: 0000000000008000
Oct 23 07:02:35 Tower kernel: ---[ end trace 0000000000000000 ]---
Oct 23 07:02:35 Tower kernel: RIP: 0010:percpu_ref_get_many+0xd/0x2a
Oct 23 07:02:35 Tower kernel: Code: 31 c0 48 89 42 78 eb 0b 48 8b 02 48 89 42 70 eb 02 0f 0b 48 89 d8 5b 5d c3 cc cc cc cc 55 48 89 fd 53 48 89 f3 e8 f0 0e ea ff <48> 8b 45 00 a8 03 74 0a 48 8b 45 08 f0 48 01 18 eb 04 65 48 01 18
Oct 23 07:02:35 Tower kernel: RSP: 0018:ffffc900003a4e08 EFLAGS: 00010002
Oct 23 07:02:35 Tower kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
Oct 23 07:02:35 Tower kernel: RDX: ffff8881002c0000 RSI: 0000000000000001 RDI: 0000000000008000
Oct 23 07:02:35 Tower kernel: RBP: 0000000000008000 R08: ffff8882b8c61500 R09: ffffffff810da269
Oct 23 07:02:35 Tower kernel: R10: ffff8882b8c61500 R11: 0000000000032d40 R12: 0000000000008000
Oct 23 07:02:35 Tower kernel: R13: 0000000000000202 R14: ffff888100f64700 R15: 00000000fffffff8
Oct 23 07:02:35 Tower kernel: FS: 0000000000000000(0000) GS:ffff88904fa00000(0000) knlGS:0000000000000000
Oct 23 07:02:35 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
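For anyone skimming a trace like this, the three details that usually matter first are the `BUG:` message, the function named on the first `RIP:` line, and the kernel version. The sketch below pulls those out of syslog lines; the sample lines are copied from the log above, while the function name `summarize_oops` and the regular expressions are illustrative assumptions and not part of any Unraid tool.

```python
import re

# Three lines copied from the kernel log in this post.
SAMPLE = [
    "Oct 23 07:02:35 Tower kernel: BUG: unable to handle page fault for address: 0000000000008000",
    "Oct 23 07:02:35 Tower kernel: CPU: 8 PID: 0 Comm: swapper/8 Tainted: P O 6.1.106-Unraid #1",
    "Oct 23 07:02:35 Tower kernel: RIP: 0010:percpu_ref_get_many+0xd/0x2a",
    "Oct 23 07:02:35 Tower kernel: RIP: 0010:cpuidle_enter_state+0x11d/0x202",
]

BUG = re.compile(r"kernel: BUG: (.+)$")
RIP = re.compile(r"kernel: RIP: [0-9a-f]{4}:(\w+)\+0x")
VER = re.compile(r"Tainted: .*?(\d+\.\d+\.\d+\S*) #")


def summarize_oops(lines):
    """Collect the first BUG message, first RIP function and kernel version."""
    summary = {"bug": None, "rip": None, "kernel": None}
    patterns = {"bug": BUG, "rip": RIP, "kernel": VER}
    for line in lines:
        for key, pattern in patterns.items():
            if summary[key] is None:  # keep only the first hit per field
                m = pattern.search(line)
                if m:
                    summary[key] = m.group(1).strip()
    return summary


if __name__ == "__main__":
    for key, value in summarize_oops(SAMPLE).items():
        print(f"{key}: {value}")
```

Only the first `RIP:` line is kept because it names the function that faulted; the later one in the `<TASK>` section just shows what the CPU was doing when the interrupt arrived.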
October 23, 2024
Perhaps this is unrelated, but I had a disk pool (2x SSDs) and the only thing on it was my dockers, and I had nothing but issues there: dockers wouldn't update, and other strange issues. After moving the dockers back to the array, I haven't had any of these issues since. You might try doing just that; my faith in btrfs running on Unraid is shaky at best.
October 23, 2024
If you do have two disks in your cache pool, please consider what I posted above.
October 23, 2024, Community Expert
gath2 said: "Possibly a RAM issue, but looking at other people reporting issues it seems that it typically turns out to be plugin or docker related. Currently swapped memory out with a spare set I had and running a memory test. That's all I had time to do before going into work."
RAM is the main suspect for data corruption; plugin/container issues shouldn't cause that. But see how it behaves with the new RAM.
October 23, 2024, Author
@mathomas3 I appreciate the advice, but I only have one NVMe cache drive in the pool. It's been working just fine. I'm really hoping it's just a RAM issue, but the issues seem too closely related to the time of me upgrading Unraid. The latest hard crash is worse than previous ones. There were a couple of lines that didn't show up in the log that I copied off the screen before rebooting:
PKRU: 55555554
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
The only thing I have that has changed recently is the update to the latest version of Unraid. It seemed to upgrade smoothly without issues. I don't know if something was corrupted in the update, if I'm running something that doesn't like the newest build, or if it's just something with the newest build. I may look at updating my BIOS as well, as I've not done that in a while.
October 24, 2024, Community Expert
You can downgrade to the previous release you were using and retest.