kdwg Posted June 23, 2022 (edited)

I am not making any progress. My server randomly loses connection on all interfaces. This already happened early on, then it passed and the machine ran flawlessly for months (at times two months without me having to reboot for an upgrade or similar). Now, on 6.10+, it does not survive 24 hours: connectivity is gone and only a hard reset helps. I have tried many things. The custom bridge runs as ipvlan; the second NIC sits in another network as active/passive with a connection to another router.

Curiously, it has also happened that I could still access the second interface in its separate network and work on the server. There was nothing in the log. On the other hand, a ping on the second NIC (192.168.100.0/24 as the planned network; the current primary is 192.168.10.5/24, and by this time ping on the primary was already not possible) stopped working after the second ICMP echo:

PING 192.168.100.5 (192.168.100.5): 56 data bytes
64 bytes from 192.168.100.5: seq=0 ttl=64 time=1.391 ms
64 bytes from 192.168.100.5: seq=1 ttl=64 time=0.884 ms
...

--- 192.168.100.5 ping statistics ---
20 packets transmitted, 1 packets received, 95% packet loss

The system has syslog mirroring to flash enabled and also runs a local syslog with forwarding to an external rsyslog (of no benefit when the network is down, of course; that is why the local syslog with mirror is now active). I have currently deactivated the VM service to narrow down the investigation and am running just a few containers. When this happens, I cannot reach the system's WebGUI, SSH, the containers, nor the VMs (when they were still active). Today I also checked whether the server is listed in my notebook's ARP cache: it is not! That may be a hint.
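As a small aid for eyeballing logged ping runs like the one above, here is a sketch of a shell helper (the function name and usage are my own, not from the thread) that pulls the packet-loss percentage out of a ping summary line:

```shell
# extract_loss: print the packet-loss percentage from a ping summary line.
# Helper name is illustrative; it relies only on the standard ping output
# format "N packets transmitted, M packets received, P% packet loss".
extract_loss() {
    # the third comma-separated field looks like " 95% packet loss";
    # strip everything that is not a digit or dot
    echo "$1" | awk -F',' '{ gsub(/[^0-9.]/, "", $3); print $3 }'
}

extract_loss "20 packets transmitted, 1 packets received, 95% packet loss"
# → 95
```

Fed the summary from the post, it prints 95, matching the near-total loss kdwg saw on the second NIC.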
Build, currently on 6.10.3:

Ryzen 4650G Pro
64 GB ECC (memtest for hours, no issues)
ASRock X570M Pro4 (running BIOS v3.70)
2x 1 TB NVMe drives in btrfs RAID1
2x 120 GB SATA SSD: 1 XFS cache pool, 1 unassigned device
1x 500 GB SATA SSD, unassigned device
Array: 1x 18 TB Seagate Exos X18 parity, 2x 14 TB WD Red Plus, 1x 6 TB Seagate IronWolf
PSU: BeQuiet Pure Power 11 400W CM
Onboard i211 LAN connected to a Fritzbox 6591 (Vodafone OEM); 1x i350-T2 connected to a different (Wi-Fi AX) router running OpenWrt that I was not able to finish setting up yet. It is just in this setup to test whether I can still reach the other NIC when the issue appears.

I attached the latest diagnostics; maybe someone has an idea what may be wrong. I would be very thankful to get the system running like a fridge again. Happy to answer more questions and add more details, just writing this in my lunch break. Best

citadel-diagnostics-20220623-1448.zip

Edited July 1, 2022 by kdwg: added PSU info
JorgeB Posted June 23, 2022

27 minutes ago, kdwg said: "The system has Syslog mirror enabled and running a local syslog as well"

Post this after the issue occurs.
kdwg Posted June 23, 2022 (Author)

11 minutes ago, JorgeB said: "Post this after the issue occurs."

The latest "hang" today, with the whole boot process: 20220623-syslog-flashmirrorexport.txt
JorgeB Posted June 23, 2022

Not seeing anything related logged. Did the server hang, or did you just lose LAN connectivity?
kdwg Posted June 23, 2022 (Author, edited)

54 minutes ago, JorgeB said: "Not seeing anything related logged, did the server hang or you just lost LAN connectivity?"

That is my problem: it occurs randomly. I have had a display connected since boot now and will check whether it appears again; it's just a matter of time. Since in some cases the other NIC was reachable, I think it is just losing LAN connectivity, or at least ARP is somehow not working. Could it somehow be related to sleep? I haven't set up anything like that yet. Any specific log I should dive deeper into?

Edited June 23, 2022 by kdwg: logs & sleep
JorgeB Posted June 23, 2022

Looks to me more like the server is crashing, and when there's nothing logged it's usually hardware related. Make sure the correct Power Supply Idle Control is set.
kdwg Posted June 23, 2022 (Author, edited)

Disabled Global C-States now and will check. But there was no problem a while back, which is a little strange. Hopefully kernel 5.17 with amd-pstate is coming soon... Thanks so far. I will update later; I think the system does log somewhere else, which would be a sign that it is not a full crash. And the other NIC was partly accessible. Also, of course, it asks for a parity check after every hard reset. As I am not doing any writes currently, I cancel it most of the time.

Edited June 23, 2022 by kdwg
kdwg Posted June 24, 2022 (Author, edited)

Okay, I am trying the following now: I set Global C-States back to auto. What I did instead was disable ACP Power Gating. I have a feeling this can cause trouble or be a reason. I will have to see how the system behaviour changes.

------------------------------

Of course, I had already applied this before I posted the part above, and the following happened: after just a little while, the system hung. The display shows as attached here, but I can't make any input or log in. Also, I did the "ping test" again (attached as well); only the first echo on the second NIC came back. The system is also unresponsive over IPv6. I will reset now and revert to C-States disabled and Power Gating enabled. (Of course, the monitor I attached is dusty as hell.)

Edited June 24, 2022 by kdwg
JorgeB Posted June 24, 2022

You don't need to fully disable C-states, just enable the correct power supply control setting, unless it doesn't exist in your board's BIOS.
kdwg Posted June 24, 2022 (Author, edited)

There is no option in the X570M Pro4 BIOS v3.70 to set Power Supply Idle Control, at least I cannot find it in the usual place. For Global C-States there is enable, disable, and auto; I disabled it for now.

Edited June 24, 2022 by kdwg
kdwg Posted June 24, 2022 (Author, edited)

Okay, I probably have something. This time the display showed something; maybe that is a different issue or cause... but anyway:

Jun 24 20:30:01 citadel docker: RAM-Disk synced
Jun 24 20:45:06 citadel cache_dirs: Stopping cache_dirs process 16227
Jun 24 20:45:07 citadel cache_dirs: cache_dirs service rc.cachedirs: Stopped
Jun 24 20:45:07 citadel cache_dirs: Arguments=-p 1 -u -i audio -i backup -i documents -i games -i import -i isos -i misc -i movies -i pictures -i shows -i temp -i veeam -i zcrap -W 150 -X 300 -Y 600 -U 55000 -l off -a -noleaf -name .Recycle.Bin -prune -o -name -temp
Jun 24 20:45:07 citadel cache_dirs: Max Scan Secs=10, Min Scan Secs=1
Jun 24 20:45:07 citadel cache_dirs: Scan Type=adaptive
Jun 24 20:45:07 citadel cache_dirs: Min Scan Depth=4
Jun 24 20:45:07 citadel cache_dirs: Max Scan Depth=none
Jun 24 20:45:07 citadel cache_dirs: Use Command='find -noleaf -name .Recycle.Bin -prune -o -name -temp'
Jun 24 20:45:07 citadel cache_dirs: ---------- Caching Directories ---------------
Jun 24 20:45:07 citadel cache_dirs: audio
Jun 24 20:45:07 citadel cache_dirs: backup
Jun 24 20:45:07 citadel cache_dirs: documents
Jun 24 20:45:07 citadel cache_dirs: games
Jun 24 20:45:07 citadel cache_dirs: import
Jun 24 20:45:07 citadel cache_dirs: isos
Jun 24 20:45:07 citadel cache_dirs: misc
Jun 24 20:45:07 citadel cache_dirs: movies
Jun 24 20:45:07 citadel cache_dirs: pictures
Jun 24 20:45:07 citadel cache_dirs: shows
Jun 24 20:45:07 citadel cache_dirs: temp
Jun 24 20:45:07 citadel cache_dirs: veeam
Jun 24 20:45:07 citadel cache_dirs: zcrap
Jun 24 20:45:07 citadel cache_dirs: ----------------------------------------------
Jun 24 20:45:07 citadel cache_dirs: Setting Included dirs: audio,backup,documents,games,import,isos,misc,movies,pictures,shows,temp,veeam,zcrap
Jun 24 20:45:07 citadel cache_dirs: Setting Excluded dirs:
Jun 24 20:45:07 citadel cache_dirs: min_disk_idle_before_restarting_scan_sec=150
Jun 24 20:45:07 citadel cache_dirs: scan_timeout_sec_idle=300
Jun 24 20:45:07 citadel cache_dirs: scan_timeout_sec_busy=600
Jun 24 20:45:07 citadel cache_dirs: scan_timeout_sec_stable=30
Jun 24 20:45:07 citadel cache_dirs: frequency_of_full_depth_scan_sec=604800
Jun 24 20:45:07 citadel cache_dirs: Including /mnt/user in scan
Jun 24 20:45:07 citadel cache_dirs: cache_dirs service rc.cachedirs: Started: '/usr/local/emhttp/plugins/dynamix.cache.dirs/scripts/cache_dirs -p 1 -u -i "audio" -i "backup" -i "documents" -i "games" -i "import" -i "isos" -i "misc" -i "movies" -i "pictures" -i "shows" -i "temp" -i "veeam" -i "zcrap" -W 150 -X 300 -Y 600 -U 55000 -l off -a '-noleaf -name .Recycle.Bin -prune -o -name -temp' 2>/dev/null'
Jun 24 20:46:33 citadel kernel: BUG: kernel NULL pointer dereference, address: 0000000000000088
Jun 24 20:46:33 citadel kernel: #PF: supervisor read access in kernel mode
Jun 24 20:46:33 citadel kernel: #PF: error_code(0x0000) - not-present page
Jun 24 20:46:33 citadel kernel: PGD 1526c9067 P4D 1526c9067 PUD 104273067 PMD 0
Jun 24 20:46:33 citadel kernel: Oops: 0000 [#1] SMP NOPTI
Jun 24 20:46:33 citadel kernel: CPU: 9 PID: 32715 Comm: shfs Not tainted 5.15.46-Unraid #1
Jun 24 20:46:33 citadel kernel: Hardware name: To Be Filled By O.E.M. X570M Pro4/X570M Pro4, BIOS P3.70 02/23/2022
Jun 24 20:46:33 citadel kernel: RIP: 0010:__mod_lruvec_state+0x13/0x44
Jun 24 20:46:33 citadel kernel: Code: 25 28 00 00 00 74 05 e8 0c 95 61 00 48 83 c4 10 5b 5d 41 5c 41 5d c3 0f 1f 44 00 00 41 55 48 63 d2 55 48 89 fd 49 89 d5 41 50 <48> 8b bf 88 00 00 00 89 74 24 04 e8 f1 08 fa ff e8 c7 c1 ff ff 8b
Jun 24 20:46:33 citadel kernel: RSP: 0018:ffffc90002f479d0 EFLAGS: 00010046
Jun 24 20:46:33 citadel kernel: RAX: 0000000000000000 RBX: ffffea00197c79c0 RCX: 000000000000000e
Jun 24 20:46:33 citadel kernel: RDX: 0000000000000050 RSI: 0000000000000013 RDI: 0000000000000000
Jun 24 20:46:33 citadel kernel: RBP: 0000000000000000 R08: 0000000000000013 R09: ffff88901e2fc000
Jun 24 20:46:33 citadel kernel: R10: ffffc90002f47a40 R11: 0000000000000286 R12: 0000000000000001
Jun 24 20:46:33 citadel kernel: R13: 0000000000000050 R14: ffffc90002f47ac0 R15: 0000000000000001
Jun 24 20:46:33 citadel kernel: FS: 0000147f12a0b640(0000) GS:ffff888fde240000(0000) knlGS:0000000000000000
Jun 24 20:46:33 citadel kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 20:46:33 citadel kernel: CR2: 0000000000000088 CR3: 00000001011dc000 CR4: 0000000000350ee0
Jun 24 20:46:33 citadel kernel: Call Trace:
Jun 24 20:46:33 citadel kernel: <TASK>
Jun 24 20:46:33 citadel kernel: __mod_lruvec_page_state+0x65/0x70
Jun 24 20:46:33 citadel kernel: ? __mod_lruvec_page_state+0x65/0x70
Jun 24 20:46:33 citadel kernel: ? __add_to_page_cache_locked+0x1cb/0x296
Jun 24 20:46:33 citadel kernel: ? lruvec_page_state+0x36/0x36
Jun 24 20:46:33 citadel kernel: ? add_to_page_cache_lru+0x56/0xbb
Jun 24 20:46:33 citadel kernel: ? pagecache_get_page+0x1ac/0x1ff
Jun 24 20:46:33 citadel kernel: ? prepare_pages+0x77/0x143
Jun 24 20:46:33 citadel kernel: ? btrfs_buffered_write+0x2cc/0x5ec
Jun 24 20:46:33 citadel kernel: ? btrfs_file_write_iter+0x2d0/0x360
Jun 24 20:46:33 citadel kernel: ? do_iter_readv_writev+0x99/0xdc
Jun 24 20:46:33 citadel kernel: ? do_iter_write+0x81/0xc2
Jun 24 20:46:33 citadel kernel: ? iter_file_splice_write+0x143/0x2e5
Jun 24 20:46:33 citadel kernel: ? pipe_read+0x300/0x327
Jun 24 20:46:33 citadel kernel: ? do_splice+0x3a1/0x4ad
Jun 24 20:46:33 citadel kernel: ? __do_sys_splice+0x14e/0x1e8
Jun 24 20:46:33 citadel kernel: ? do_syscall_64+0x83/0xa5
Jun 24 20:46:33 citadel kernel: ? entry_SYSCALL_64_after_hwframe+0x44/0xae
Jun 24 20:46:33 citadel kernel: </TASK>
Jun 24 20:46:33 citadel kernel: Modules linked in: ipvlan xt_nat xt_tcpudp veth xt_conntrack nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter xfs ip6table_nat nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod nct6775 hwmon_vid efivarfs iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables igb amdgpu amd64_edac edac_mce_amd gpu_sched drm_ttm_helper ttm drm_kms_helper drm agpgart kvm_amd kvm crct10dif_pclmul wmi_bmof crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl syscopyarea k10temp i2c_piix4 ccp nvme sysfillrect input_leds ahci i2c_algo_bit sysimgblt led_class tpm_crb fb_sys_fops libahci i2c_core nvme_core tpm_tis tpm_tis_core video tpm backlight wmi button acpi_cpufreq [last unloaded: igb]
Jun 24 20:46:33 citadel kernel: CR2: 0000000000000088
Jun 24 20:46:33 citadel kernel: ---[ end trace fbbf4b794c1bbf9d ]---
Jun 24 20:46:33 citadel kernel: RIP: 0010:__mod_lruvec_state+0x13/0x44
Jun 24 20:46:33 citadel kernel: Code: 25 28 00 00 00 74 05 e8 0c 95 61 00 48 83 c4 10 5b 5d 41 5c 41 5d c3 0f 1f 44 00 00 41 55 48 63 d2 55 48 89 fd 49 89 d5 41 50 <48> 8b bf 88 00 00 00 89 74 24 04 e8 f1 08 fa ff e8 c7 c1 ff ff 8b
Jun 24 20:46:33 citadel kernel: RSP: 0018:ffffc90002f479d0 EFLAGS: 00010046
Jun 24 20:46:33 citadel kernel: RAX: 0000000000000000 RBX: ffffea00197c79c0 RCX: 000000000000000e
Jun 24 20:46:33 citadel kernel: RDX: 0000000000000050 RSI: 0000000000000013 RDI: 0000000000000000
Jun 24 20:46:33 citadel kernel: RBP: 0000000000000000 R08: 0000000000000013 R09: ffff88901e2fc000
Jun 24 20:46:33 citadel kernel: R10: ffffc90002f47a40 R11: 0000000000000286 R12: 0000000000000001
Jun 24 20:46:33 citadel kernel: R13: 0000000000000050 R14: ffffc90002f47ac0 R15: 0000000000000001
Jun 24 20:46:33 citadel kernel: FS: 0000147f12a0b640(0000) GS:ffff888fde240000(0000) knlGS:0000000000000000
Jun 24 20:46:33 citadel kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 20:46:33 citadel kernel: CR2: 0000000000000088 CR3: 00000001011dc000 CR4: 0000000000350ee0
Jun 24 20:47:36 citadel kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Jun 24 20:47:36 citadel kernel: rcu: 11-...0: (1 GPs behind) idle=0ad/1/0x4000000000000000 softirq=1139001/1139066 fqs=13275
Jun 24 20:47:36 citadel kernel: (detected by 1, t=60005 jiffies, g=1714169, q=759796)
Jun 24 20:47:36 citadel kernel: Sending NMI from CPU 1 to CPUs 11:
Jun 24 20:47:36 citadel kernel: NMI backtrace for cpu 11
Jun 24 20:47:36 citadel kernel: CPU: 11 PID: 80 Comm: kcompactd0 Tainted: G D 5.15.46-Unraid #1
Jun 24 20:47:36 citadel kernel: Hardware name: To Be Filled By O.E.M. X570M Pro4/X570M Pro4, BIOS P3.70 02/23/2022
Jun 24 20:47:36 citadel kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x78/0x18f
Jun 24 20:47:36 citadel kernel: Code: 2a 08 8b 02 0f 92 c1 0f b6 c9 c1 e1 08 30 e4 09 c8 a9 00 01 ff ff 74 0c 0f ba e0 08 72 1a c6 42 01 00 eb 14 85 c0 74 0a 8b 02 <84> c0 74 04 f3 90 eb f6 66 c7 02 01 00 c3 48 c7 c1 40 c8 02 00 65
Jun 24 20:47:36 citadel kernel: RSP: 0018:ffffc90000483ae8 EFLAGS: 00000002
Jun 24 20:47:36 citadel kernel: RAX: 0000000000000101 RBX: 0000000000000001 RCX: 0000000000000000
Jun 24 20:47:36 citadel kernel: RDX: ffff8884074964d0 RSI: 0000000000000000 RDI: ffff8884074964d0
Jun 24 20:47:36 citadel kernel: RBP: ffffea00040a0d80 R08: ffffea003a0e9580 R09: 0000000000000008
Jun 24 20:47:36 citadel kernel: R10: 0000000000000000 R11: 00000000000305e0 R12: ffffea003a0e9580
Jun 24 20:47:36 citadel kernel: R13: 0000000000000003 R14: ffff8884074964c8 R15: ffff88901e2fc000
Jun 24 20:47:36 citadel kernel: FS: 0000000000000000(0000) GS:ffff888fde2c0000(0000) knlGS:0000000000000000
Jun 24 20:47:36 citadel kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 24 20:47:36 citadel kernel: CR2: 0000151a1793150c CR3: 000000014ee42000 CR4: 0000000000350ee0
Jun 24 20:47:36 citadel kernel: Call Trace:
Jun 24 20:47:36 citadel kernel: <TASK>
Jun 24 20:47:36 citadel kernel: queued_spin_lock_slowpath+0x7/0xa
Jun 24 20:47:36 citadel kernel: migrate_page_move_mapping+0x15e/0x4c0
Jun 24 20:47:36 citadel kernel: btrfs_migratepage+0x1c/0xc4
Jun 24 20:47:36 citadel kernel: move_to_new_page+0x7d/0x204
Jun 24 20:47:36 citadel kernel: ? memcg_rstat_updated+0x12/0x45
Jun 24 20:47:36 citadel kernel: ? free_unref_page_prepare+0x127/0x156
Jun 24 20:47:36 citadel kernel: ? free_unref_page_commit.constprop.0+0x19/0xd9
Jun 24 20:47:36 citadel kernel: migrate_pages+0x605/0xa08
Jun 24 20:47:36 citadel kernel: ? compact_lock_irqsave+0x5e/0x5e
Jun 24 20:47:36 citadel kernel: ? release_freepages+0x8f/0x8f
Jun 24 20:47:36 citadel kernel: compact_zone+0x84f/0xa29
Jun 24 20:47:36 citadel kernel: ? set_next_entity+0x65/0x84
Jun 24 20:47:36 citadel kernel: ? __raw_spin_unlock+0x5/0x6
Jun 24 20:47:36 citadel kernel: proactive_compact_node+0x7f/0xac
Jun 24 20:47:36 citadel kernel: kcompactd+0x24e/0x29c
Jun 24 20:47:36 citadel kernel: ? init_wait_entry+0x29/0x29
Jun 24 20:47:36 citadel kernel: ? kcompactd_do_work+0x1bd/0x1bd
Jun 24 20:47:36 citadel kernel: kthread+0xde/0xe3
Jun 24 20:47:36 citadel kernel: ? set_kthread_struct+0x32/0x32
Jun 24 20:47:36 citadel kernel: ret_from_fork+0x22/0x30
Jun 24 20:47:36 citadel kernel: </TASK>
Jun 24 20:47:56 citadel emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/checkall

Global C-States were disabled as described. I think this is a different issue.

Edit: Got some more; I forgot I had taken photos with my phone. I added a picture and used OCR to extract the logs; I may check later whether the OCR output is usable, as I am in a little hurry. It mentions eth0, which is the "primary" NIC.

citadel-diagnostics-20220624-2108.zip
ocr-panic.jpg.txt

Edited June 24, 2022 by kdwg: added more info
JorgeB Posted June 25, 2022

That looks more like a hardware problem.
mgutt Posted June 25, 2022

I would:
- remove all changes from your go file
- update the BIOS
- load default BIOS settings
- avoid using a USB3 port
- test with only one RAM module
- disable PCIe 4.0 (if the NVMe is using it)
- repair all partitions of all array disks and pools.
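For the last point, a sketch of what the filesystem-check step could look like: a hypothetical helper (my own naming) that prints the read-only `xfs_repair` invocations for a list of devices. On Unraid, array disks typically appear as /dev/md1, /dev/md2, ... and checks should be run from Maintenance mode; the device names below are examples, adjust to your array.

```shell
# print_xfs_checks: emit the read-only xfs_repair command for each given device.
# Helper name and device list are illustrative; -n means "no modify", i.e.
# report problems without changing anything on disk.
print_xfs_checks() {
    for dev in "$@"; do
        echo "xfs_repair -n $dev"
    done
}

# example: Unraid array disks as seen in Maintenance mode
print_xfs_checks /dev/md1 /dev/md2
# → xfs_repair -n /dev/md1
# → xfs_repair -n /dev/md2
```

Reviewing the printed commands before running them keeps the check deliberately non-destructive; only after inspecting the -n output would you rerun without -n to actually repair.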
kdwg Posted June 26, 2022 (Author)

On 6/25/2022 at 1:51 PM, mgutt said: "I would: - remove all changes from your go file - update the bios - load default bios settings - avoid using an USB3 port - test with only one RAM module - disable PCIe 4.0 (if NVMe is using it) - repair all partitions of all Array Disks and Pools."

I will run a blank go file now and see how the system behaves. Version 3.70, which I am running, is the latest; I could think about downgrading to 3.60, but first I will see whether more people running 3.70 on this board have issues. The board itself only has USB3+ ports; I will try two ports I have on the front instead. Both NVMe drives run at PCIe 3.0. For now I will keep both RAM modules, to change fewer things at once. Thank you for the advice. I will keep the status posted here.
JonathanM Posted June 27, 2022

8 hours ago, kdwg said: "I will run a blank Go file"

If you do that, Unraid won't start properly. Revert instead to the go file packaged with the installation zip archive.
kdwg Posted June 27, 2022 (Author, edited)

On 6/27/2022 at 3:31 AM, JonathanM said: "If you do that Unraid won't start properly. Revert instead to the go file packaged with the installation zip archive."

Sorry for the wrong expression. I meant that I removed all custom modifications.

Edited June 29, 2022 by kdwg
kdwg Posted June 29, 2022 (Author, edited)

Okay, I have gotten all kinds of errors in the meantime; the system has become unbelievably unstable. I will try running the boot drive on a front USB port now; I will receive an internal USB slot module in the next few days. Here is a collection of the panics I got (OCR-extracted, so some characters may have been misrecognized; I have cleaned up the obvious misreads):

cpu_startup_entry+0x1d/0x1f
secondary_startup_64_no_verify+0xb0/0xbb
</TASK>
---[ end trace fbbf4b794c1bbf9e ]---
igb 0000:06:00.0 eth0: Reset adapter
igb 0000:06:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
igb 0000:06:00.0: exceed max 2 second
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 11-...0: (1 GPs behind) idle=0ad/1/0x4000000000000000 softirq=1139001/1139066 fqs=52882
(detected by 4, t=240019 jiffies, g=1714169, q=2275007)
Sending NMI from CPU 4 to CPUs 11:
NMI backtrace for cpu 11
CPU: 11 PID: 80 Comm: kcompactd0 Tainted: G D 5.15.46-Unraid #1
Hardware name: To Be Filled By O.E.M. X570M Pro4/X570M Pro4, BIOS P3.70 02/23/2022
RIP: 0010:native_queued_spin_lock_slowpath+0x7e/0x18f

That one was today at noon; interestingly, I had accidentally clicked on the "Plugins" header menu and the system hung just as it searched for plugin updates. I was connected via WireGuard at that time. What syslog says:

Jun 28 11:22:17 citadel webGUI: Unsuccessful login user root from 10.253.0.3
Jun 28 11:22:23 citadel webGUI: Successful login user root from 10.253.0.3
Jun 28 11:24:06 citadel emhttpd: cmd: /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/checkall

Here I had to hard-reset again:

Jun 28 17:36:32 citadel kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Jun 28 17:36:32 citadel kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

Surprisingly, there was also a panic today without a full lock-up, right after I tried stopping the File Activity plugin, which is in charge of the inotify watches. This time it appears as cache_dirs. The system kept running:

Jun 29 02:52:09 citadel file.activity: Stopping File Activity
Jun 29 02:52:10 citadel file.activity: File Activity inotify exiting
Jun 29 02:55:23 citadel kernel: BUG: unable to handle page fault for address: ffffffffffffff89
Jun 29 02:55:23 citadel kernel: #PF: supervisor write access in kernel mode
Jun 29 02:55:23 citadel kernel: #PF: error_code(0x0002) - not-present page
Jun 29 02:55:23 citadel kernel: PGD 520e067 P4D 520e067 PUD 5210067 PMD 0
Jun 29 02:55:23 citadel kernel: Oops: 0002 [#1] SMP NOPTI
Jun 29 02:55:23 citadel kernel: CPU: 3 PID: 13697 Comm: cache_dirs Not tainted 5.15.46-Unraid #1
Jun 29 02:55:23 citadel kernel: Hardware name: To Be Filled By O.E.M. X570M Pro4/X570M Pro4, BIOS P3.70 02/23/2022
Jun 29 02:55:23 citadel kernel: RIP: 0010:error_entry+0x83/0xe0
Jun 29 02:55:23 citadel kernel: Code: 44 00 00 48 25 ff e7 ff ff 0f 22 d8 41 5c 48 89 e7 e8 a1 e6 e0 ff 48 89 c4 41 54 c3 48 8d 0d 1b fd ff ff 48 39 8c 24 88 00 00 <00> 74 29 89 c8 48 39 84 24 88 00 00 00 74 15 48 81 bc 24 88 00 00
Jun 29 02:55:23 citadel kernel: RSP: 0000:ffffc90007c9ff58 EFLAGS: 00010002
Jun 29 02:55:23 citadel kernel: RAX: ffffc90007c9ff58 RBX: 0000000000000000 RCX: 0000000000000000
Jun 29 02:55:23 citadel kernel: RDX: 0000000000000000 RSI: fffffe00000b4000 RDI: ffffc90007c9ff58
Jun 29 02:55:23 citadel kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jun 29 02:55:23 citadel kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81a00ab8
Jun 29 02:55:23 citadel kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jun 29 02:55:23 citadel kernel: FS: 0000150fb67bf740(0000) GS:ffff888fde0c0000(0000) knlGS:0000000000000000
Jun 29 02:55:23 citadel kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 29 02:55:23 citadel kernel: CR2: ffffffffffffff89 CR3: 00000001f57f2000 CR4: 0000000000350ee0
Jun 29 02:55:23 citadel kernel: Call Trace:
Jun 29 02:55:23 citadel kernel: WARNING: stack recursion on stack type 1
Jun 29 02:55:23 citadel kernel: WARNING: can't access registers at error_entry+0x83/0xe0
Jun 29 02:55:23 citadel kernel: <TASK>
Jun 29 02:55:23 citadel kernel: ? restore_regs_and_return_to_kernel+0x27/0x27
Jun 29 02:55:23 citadel kernel: </TASK>
Jun 29 02:55:23 citadel kernel: Modules linked in: xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod nct6775 hwmon_vid efivarfs wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding igb amdgpu gpu_sched drm_ttm_helper ttm drm_kms_helper amd64_edac edac_mce_amd drm agpgart kvm_amd syscopyarea sysfillrect sysimgblt kvm crct10dif_pclmul crc32_pclmul crc32c_intel wmi_bmof ghash_clmulni_intel aesni_intel crypto_simd cryptd i2c_piix4 i2c_algo_bit rapl nvme k10temp fb_sys_fops i2c_core ccp input_leds ahci led_class tpm_crb libahci nvme_core tpm_tis tpm_tis_core video tpm backlight wmi button acpi_cpufreq [last unloaded: igb]
Jun 29 02:55:23 citadel kernel: CR2: ffffffffffffff89
Jun 29 02:55:23 citadel kernel: ---[ end trace 40a36246779029d3 ]---
Jun 29 02:55:23 citadel kernel: RIP: 0010:error_entry+0x83/0xe0
Jun 29 02:55:23 citadel kernel: Code: 44 00 00 48 25 ff e7 ff ff 0f 22 d8 41 5c 48 89 e7 e8 a1 e6 e0 ff 48 89 c4 41 54 c3 48 8d 0d 1b fd ff ff 48 39 8c 24 88 00 00 <00> 74 29 89 c8 48 39 84 24 88 00 00 00 74 15 48 81 bc 24 88 00 00
Jun 29 02:55:23 citadel kernel: RSP: 0000:ffffc90007c9ff58 EFLAGS: 00010002
Jun 29 02:55:23 citadel kernel: RAX: ffffc90007c9ff58 RBX: 0000000000000000 RCX: 0000000000000000
Jun 29 02:55:23 citadel kernel: RDX: 0000000000000000 RSI: fffffe00000b4000 RDI: ffffc90007c9ff58
Jun 29 02:55:23 citadel kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Jun 29 02:55:23 citadel kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81a00ab8
Jun 29 02:55:23 citadel kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Jun 29 02:55:23 citadel kernel: FS: 0000150fb67bf740(0000) GS:ffff888fde0c0000(0000) knlGS:0000000000000000
Jun 29 02:55:23 citadel kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 29 02:55:23 citadel kernel: CR2: ffffffffffffff89 CR3: 00000001f57f2000 CR4: 0000000000350ee0
Jun 29 02:57:56 citadel ool www[16754]: /usr/local/emhttp/plugins/file.activity/scripts/rc.file.activity 'update'

And now the latest hang, just a few seconds after the scheduled mover started. Nothing else in the syslog mirrored to flash, of course... Could this be a hint towards the boot drive? Shall I replace it? I checked it with chkdsk, with no errors of course. I had no monitor connected, so no info here. And: shall I boot in GUI mode?

Jun 29 06:22:14 citadel emhttpd: read SMART /dev/sdh
Jun 29 06:36:48 citadel webGUI: Successful login user root from 192.168.10.135
Jun 29 06:37:28 citadel emhttpd: shcmd (31847): /usr/local/sbin/mover &> /dev/null &
Jun 29 06:37:29 citadel emhttpd: read SMART /dev/sdf

I see no SMART-related errors. I will start an XFS filesystem check in a few minutes and let it run in Maintenance mode. Sure, I still need to test with only one DIMM at a time, but I couldn't do that yet as I ran another memtest (okay, only 9 hours this time, but...). It is on the TODO list. Could it be related to a Docker container? I have my Docker image in directory mode on an unassigned device formatted as XFS. @mgutt, could this cause trouble? I have been running it like this for a while now. Otherwise I don't see anything special. I am still not sure whether it might be related to the "AMD GPU reset bug"; I haven't looked into it yet, and as far as I know it only applies when passing a GPU through to a VM, not with the iGPU. I just pass /dev/dri through to the Emby and Jellyfin Docker containers. But my syslog during boot shows this:

fbcon: amdgpudrmfb (fb0) is primary device
Jun 29 02:36:41 citadel kernel: Console: switching to colour frame buffer device 240x67
Jun 29 02:36:41 citadel kernel: amdgpu 0000:0a:00.0: [drm] fb0: amdgpudrmfb frame buffer device
Jun 29 02:36:41 citadel kernel: amdgpu 0000:0a:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0

I am still wondering why these two crashes in the last 24 hours happened immediately after the plugin update search and the mover kickoff. If it's helpful, some more info: when I run the NUT v2 plugin for my UPS, it disconnects and reconnects after a few seconds unless the service is restarted after boot. I switched to the built-in tool until this issue is closer to being solved, or at least a step closer to the cause. What else could be causing this, or what other options could I check or try? Any idea what I should do? I really have no other clue so far. Many thanks in advance for any idea or help.

citadel-diagnostics-20220629-0737.zip

Edited June 29, 2022 by kdwg
kdwg Posted June 29, 2022 (Author, edited)

Okay, I may have found something. The system has been running for 12 hours without a lock-up or panic now; that is a record compared to the previous days. I will keep you posted, and if, fingers crossed, the system stays stable until Sunday, I will publish what I changed.

EDIT: Forget what I said. Five minutes after posting, I started a manual parity check; it didn't go well after a while. citadel-diagnostics-20220701-2028.zip

Edited July 1, 2022 by kdwg: added recent diag
Solution — kdwg Posted July 8, 2022 (Author)

I may have found the issue: there seems to be a relationship between Unraid 6.10+ and AMD fTPM, which was enabled by default with UEFI 3.60 on the X570M. With 6.9.2 I was not facing this behavior. Anyway, so far it looks good on 6.10.3 with fTPM disabled; if I run into another freeze, I will let you know. Hopefully it helps others too, since the board is relatively widespread.
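To confirm whether the firmware TPM is actually exposed to Linux before and after toggling the BIOS setting, a minimal sketch (the helper name is my own; /dev/tpm0 and /sys/class/tpm are the standard kernel TPM interfaces, so if fTPM is disabled they should be absent):

```shell
# check_tpm: report whether the kernel exposes a TPM/fTPM device.
# Helper name is illustrative; /dev/tpm0 and /sys/class/tpm/tpm0 are the
# standard paths created by the kernel TPM drivers (tpm_crb for AMD fTPM).
check_tpm() {
    if [ -e /dev/tpm0 ] || [ -d /sys/class/tpm/tpm0 ]; then
        echo "TPM device present"
    else
        echo "no TPM device exposed"
    fi
}

check_tpm
```

`dmesg | grep -i tpm` is another quick cross-check: with fTPM enabled, the tpm_crb driver lines seen in the module list earlier in this thread should appear there.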
afl Posted June 10, 2023

So did you just disable fTPM, or what settings did you change?
kdwg Posted June 15, 2023 (Author, edited)

On 6/10/2023 at 7:12 AM, afl said: "So did you just disable fTPM or what settings have you done"

Hey. From today's perspective, I'm not 100% sure whether disabling fTPM alone fixed this entirely. Meanwhile, I decided to remove "powertop autotune" from my go file and applied the adjustments manually instead. The Global C-State settings (or equivalent) in the BIOS were not modified. Fortunately, I'm not having any stability problems anymore. Do you have similar symptoms?

Edit: Worth mentioning that I have since added some drives and changed the power supply (the old one is running in another box without any problems).

Edited June 15, 2023 by kdwg
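As an illustration of what "applying the adjustments manually" instead of a blanket `powertop --auto-tune` can look like, a hypothetical go-file excerpt. The specific sysfs paths and values below are common powertop tunables, not kdwg's actual settings; verify each one on your own hardware before adopting it.

```shell
#!/bin/sh
# Hypothetical Unraid go-file excerpt: apply selected power tunables one by
# one instead of `powertop --auto-tune`, so each change is explicit and can
# be rolled back individually. Paths/values are examples only.

# SATA link power management, per host adapter
for h in /sys/class/scsi_host/host*/link_power_management_policy; do
    if [ -w "$h" ]; then
        echo med_power_with_dipm > "$h"
    fi
done

# USB autosuspend for all devices
for d in /sys/bus/usb/devices/*/power/control; do
    if [ -w "$d" ]; then
        echo auto > "$d"
    fi
done
```

The writability checks make the snippet a no-op on systems where a given tunable does not exist, which keeps the go file from erroring out across hardware changes like the ones mentioned in the edit above.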