GUI / network locks up



Weird issue: after updating to 6.12.8 I'm getting GUI lockups. I can log in and do a couple of things, and then it locks. Same with SSH: it locks up for a period and then continues. Clearing the cache gets the web GUI back for a minute or two, and then it locks up again.

 

Videos and NVIDIA conversions are still running in the background with no issues. Only the interface and management functions seem to trip up and lock.

 

Has anyone else had this?

tower-diagnostics-20240314-1403.zip

Link to comment

I can't issue SSH commands like reboot. Other commands work, but it will not shut down, etc. Same with the GUI: try to do a reboot and it locks up the page, but bouncing around the Docker images and other items is fine.

 

weird

Link to comment

There's a call trace logged, but I can't see what caused it, hardware or software. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Link to comment

I can't even reboot, though; from the GUI or SSH it hangs. I can hard reset it, but the last time that happened it borked the cache drives and I had to redo all the dockers, etc.

 

Guess that's my only option at this point. It's still functioning in the background, transcoding shows etc.; I just can't keep the GUI or SSH up long enough to run certain commands after the update.

Link to comment

I tried to do a parity check last night and it just hangs. Everything locks up, and there's no way to know what's happening, since I can't log in to it and have to use the drac to hard power cycle it.

 

What are my options here to help diagnose what this is?
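One low-risk thing to check while a session still responds is whether any processes are stuck in uninterruptible sleep (state `D`), since commands that wait on blocked I/O tend to hang the way reboot does here. The sketch below is only an illustration under that assumption, not something taken from this thread: it reads the standard Linux `/proc/<pid>/stat` files, and the helper names `parse_stat` and `blocked_processes` are made up for the example.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the text of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar])
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    if not os.path.isdir(proc_root):
        return []
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


for pid, comm in blocked_processes():
    print(pid, comm)
```

A process that shows up in this list run after run is a reasonable thing to mention alongside the diagnostics, but an empty list does not rule anything out.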

Link to comment

It dawned on me, looking at the logs page, that I have a 4-NIC board set up in LACP, and having the 4 NICs tied to the single IP assigned to Unraid might be an issue. So I removed the LACP bond, both here and on my switch, and removed the br0 bond route. That fixed it, and I was able to nmap port 514 UDP and see it open now. I will log this to another physical box (Proxmox running LibreNMS), do the flash copy as well, and post them here when this happens next.
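Because UDP is connectionless, an nmap result on port 514 only says so much; a more direct check is to send a test message and confirm it appears on the collector. This is a minimal sketch, assuming the collector accepts plain BSD-style syslog over UDP. The function name `send_test_syslog`, the tag `unraid-test`, and the address in the usage comment are placeholders, not values from this thread.

```python
import socket


def send_test_syslog(host, port=514, text="syslog path check"):
    """Send one syslog-style UDP datagram and return the number of bytes sent."""
    # <14> is the priority value: facility user (1) * 8 + severity info (6).
    payload = f"<14>Tower unraid-test: {text}".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


# Usage (replace the placeholder address with the real collector):
# send_test_syslog("192.0.2.10")
```

A successful send proves nothing by itself, since UDP reports no delivery; the test passes only if the line shows up in the collector's log.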

 

 

 

 

Link to comment

OK, so in the past few days I've enabled the syslog. I ended up doing a new config and resyncing parity to get rid of an old drive (it was added but not used) that said it had errors.

 

During the rebuild, I was updating some dockers and lost the GUI again. I have the syslog, but isn't there some sensitive material in there? RSA keys for SSH, etc.?
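On the question of sensitive material, one option is to scrub the syslog before posting it. The sketch below is an assumption-laden illustration, not an official Unraid tool: the patterns (PEM private-key blocks, SSH public-key blobs, MAC addresses, IPv4 addresses) are simply a guess at what someone might want masked, and the function name `redact` is made up. As far as I know sshd logs key fingerprints rather than private keys, so the key patterns are a precaution.

```python
import re

# (pattern, replacement) pairs, applied in order.
_RULES = [
    # Whole PEM private-key blocks, in case one was ever echoed into a log.
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
                re.S), "[REDACTED PRIVATE KEY]"),
    # SSH public-key blobs: keep the key type, drop the base64 body.
    (re.compile(r"\b(ssh-rsa|ssh-ed25519) [A-Za-z0-9+/=]{20,}"), r"\1 [REDACTED KEY]"),
    # MAC addresses: six hex pairs, so hh:mm:ss timestamps do not match.
    (re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"), "xx:xx:xx:xx:xx:xx"),
    # IPv4 addresses.
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "x.x.x.x"),
]


def redact(text):
    """Return text with keys, MAC addresses and IPv4 addresses masked."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


print(redact('client: 192.168.22.15, host: "192.168.22.20" at 06:17:17'))
```

It is still worth skimming the result by hand, since share names, hostnames and usernames are not covered by these patterns.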

 

Right now, dockers seem to be up, cannot kill them.  I was able to kill vm manager and shut it down.  No gui response and the way I'm understanding unraid is setup, I can't reset nginx etc to try and force restart the webserver side?

 

I see this in the logs

 

Mar 25 19:32:12 Tower kernel: general protection fault, probably for non-canonical address 0xefffffff81e42c70: 0000 [#1] PREEMPT SMP PTI
Mar 25 19:32:12 Tower kernel: CPU: 1 PID: 27726 Comm: kworker/u16:3 Tainted: P           O       6.1.74-Unraid #1
Mar 25 19:32:12 Tower kernel: Hardware name: Supermicro Super Server/X11SSH-LN4F, BIOS 2.7 12/07/2021
Mar 25 19:32:12 Tower kernel: Workqueue: writeback wb_workfn (flush-btrfs-5)
Mar 25 19:32:12 Tower kernel: RIP: 0010:do_writepages+0xad/0x124
Mar 25 19:32:12 Tower kernel: Code: 00 00 4c 89 b3 00 01 00 00 48 85 c0 48 0f 48 c2 48 89 83 10 01 00 00 e8 96 e0 6c 00 49 8b 84 24 90 00 00 00 48 89 ee 4c 89 e7 <48> 8b 40 10 48 85 c0 74 07 ff d0 0f 1f 00 eb 05 e8 db e5 ff ff 83
Mar 25 19:32:12 Tower kernel: RSP: 0018:ffffc90025dbfc10 EFLAGS: 00010297
Mar 25 19:32:12 Tower kernel: RAX: efffffff81e42c60 RBX: ffff88821637c860 RCX: 0000000000000000
Mar 25 19:32:12 Tower kernel: RDX: 0000000107354800 RSI: ffffc90025dbfcb0 RDI: ffff88812be607e8
Mar 25 19:32:12 Tower kernel: RBP: ffffc90025dbfcb0 R08: ffffffff82206510 R09: ffffffffffffffff
Mar 25 19:32:12 Tower kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88812be607e8
Mar 25 19:32:12 Tower kernel: R13: ffff88812be607e8 R14: 0000000107354800 R15: ffff88821637d000
Mar 25 19:32:12 Tower kernel: FS:  0000000000000000(0000) GS:ffff888867a40000(0000) knlGS:0000000000000000
Mar 25 19:32:12 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 19:32:12 Tower kernel: CR2: 0000154f0619b000 CR3: 0000000175192006 CR4: 00000000003726e0
Mar 25 19:32:12 Tower kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 19:32:12 Tower kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 19:32:12 Tower kernel: Call Trace:
Mar 25 19:32:12 Tower kernel: <TASK>
Mar 25 19:32:12 Tower kernel: ? __die_body+0x1a/0x5c
Mar 25 19:32:12 Tower kernel: ? die_addr+0x38/0x51
Mar 25 19:32:12 Tower kernel: ? exc_general_protection+0x30f/0x345
Mar 25 19:32:12 Tower kernel: ? asm_exc_general_protection+0x22/0x30
Mar 25 19:32:12 Tower kernel: ? do_writepages+0xad/0x124
Mar 25 19:32:12 Tower kernel: __writeback_single_inode+0x7a/0x2cb
Mar 25 19:32:12 Tower kernel: writeback_sb_inodes+0x24f/0x40f
Mar 25 19:32:12 Tower kernel: __writeback_inodes_wb+0x82/0xc0
Mar 25 19:32:12 Tower kernel: wb_writeback+0x135/0x24a
Mar 25 19:32:12 Tower kernel: wb_workfn+0x21a/0x39e
Mar 25 19:32:12 Tower kernel: ? sched_clock_cpu+0x12/0xa1
Mar 25 19:32:12 Tower kernel: ? __smp_call_single_queue+0x23/0x35
Mar 25 19:32:12 Tower kernel: ? paravirt_write_msr+0xb/0x11
Mar 25 19:32:12 Tower kernel: ? ttwu_queue_wakelist+0x9a/0xcf
Mar 25 19:32:12 Tower kernel: process_one_work+0x1a8/0x295
Mar 25 19:32:12 Tower kernel: worker_thread+0x18b/0x244
Mar 25 19:32:12 Tower kernel: ? rescuer_thread+0x281/0x281
Mar 25 19:32:12 Tower kernel: kthread+0xe4/0xef
Mar 25 19:32:12 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Mar 25 19:32:12 Tower kernel: ret_from_fork+0x1f/0x30
Mar 25 19:32:12 Tower kernel: </TASK>
Mar 25 19:32:12 Tower kernel: Modules linked in: vhost_net tun vhost tap kvm_intel kvm md_mod cmac cifs asn1_decoder cifs_arc4 cifs_md4 dns_resolver veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_iotlb xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc bonding tls igb intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp ast drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul drm_kms_helper crc32c_intel ghash_clmulni_intel ipmi_ssif sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel drm crypto_simd cryptd rapl intel_cstate
Mar 25 19:32:12 Tower kernel: intel_uncore mpt3sas i2c_i801 agpgart syscopyarea acpi_ipmi i2c_smbus mei_me i2c_algo_bit sysfillrect ahci sysimgblt input_leds raid_class fb_sys_fops i2c_core joydev led_class libahci intel_pch_thermal scsi_transport_sas mei thermal fan ipmi_si video wmi backlight intel_pmc_core acpi_power_meter acpi_pad button unix [last unloaded: md_mod]
Mar 25 19:32:12 Tower kernel: ---[ end trace 0000000000000000 ]---
Mar 25 19:32:12 Tower kernel: RIP: 0010:do_writepages+0xad/0x124
Mar 25 19:32:12 Tower kernel: Code: 00 00 4c 89 b3 00 01 00 00 48 85 c0 48 0f 48 c2 48 89 83 10 01 00 00 e8 96 e0 6c 00 49 8b 84 24 90 00 00 00 48 89 ee 4c 89 e7 <48> 8b 40 10 48 85 c0 74 07 ff d0 0f 1f 00 eb 05 e8 db e5 ff ff 83
Mar 25 19:32:12 Tower kernel: RSP: 0018:ffffc90025dbfc10 EFLAGS: 00010297
Mar 25 19:32:12 Tower kernel: RAX: efffffff81e42c60 RBX: ffff88821637c860 RCX: 0000000000000000
Mar 25 19:32:12 Tower kernel: RDX: 0000000107354800 RSI: ffffc90025dbfcb0 RDI: ffff88812be607e8
Mar 25 19:32:12 Tower kernel: RBP: ffffc90025dbfcb0 R08: ffffffff82206510 R09: ffffffffffffffff
Mar 25 19:32:12 Tower kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88812be607e8
Mar 25 19:32:12 Tower kernel: R13: ffff88812be607e8 R14: 0000000107354800 R15: ffff88821637d000
Mar 25 19:32:12 Tower kernel: FS:  0000000000000000(0000) GS:ffff888867a40000(0000) knlGS:0000000000000000
Mar 25 19:32:12 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 19:32:12 Tower kernel: CR2: 0000154f0619b000 CR3: 0000000175192006 CR4: 00000000003726e0
Mar 25 19:32:12 Tower kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 19:32:12 Tower kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 19:32:12 Tower kernel: ------------[ cut here ]------------
Mar 25 19:32:12 Tower kernel: WARNING: CPU: 1 PID: 27726 at kernel/exit.c:814 do_exit+0x87/0x923
Mar 25 19:32:12 Tower kernel: Modules linked in: vhost_net tun vhost tap kvm_intel kvm md_mod cmac cifs asn1_decoder cifs_arc4 cifs_md4 dns_resolver veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_iotlb xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc bonding tls igb intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp ast drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul drm_kms_helper crc32c_intel ghash_clmulni_intel ipmi_ssif sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel drm crypto_simd cryptd rapl intel_cstate
Mar 25 19:32:12 Tower kernel: intel_uncore mpt3sas i2c_i801 agpgart syscopyarea acpi_ipmi i2c_smbus mei_me i2c_algo_bit sysfillrect ahci sysimgblt input_leds raid_class fb_sys_fops i2c_core joydev led_class libahci intel_pch_thermal scsi_transport_sas mei thermal fan ipmi_si video wmi backlight intel_pmc_core acpi_power_meter acpi_pad button unix [last unloaded: md_mod]
Mar 25 19:32:12 Tower kernel: CPU: 1 PID: 27726 Comm: kworker/u16:3 Tainted: P      D    O       6.1.74-Unraid #1
Mar 25 19:32:12 Tower kernel: Hardware name: Supermicro Super Server/X11SSH-LN4F, BIOS 2.7 12/07/2021
Mar 25 19:32:12 Tower kernel: Workqueue: writeback wb_workfn (flush-btrfs-5)
Mar 25 19:32:12 Tower kernel: RIP: 0010:do_exit+0x87/0x923
Mar 25 19:32:12 Tower kernel: Code: 24 74 04 75 13 b8 01 00 00 00 41 89 6c 24 60 48 c1 e0 22 49 89 44 24 70 4c 89 ef e8 31 ed 80 00 48 83 bb b0 07 00 00 00 74 02 <0f> 0b 48 8b bb d8 06 00 00 e8 33 ec 80 00 48 8b 83 d0 06 00 00 83
Mar 25 19:32:12 Tower kernel: RSP: 0018:ffffc90025dbfee0 EFLAGS: 00010286
Mar 25 19:32:12 Tower kernel: RAX: 0000000000000000 RBX: ffff8882f767e180 RCX: 0000000000000000
Mar 25 19:32:12 Tower kernel: RDX: 0000000000000001 RSI: 0000000000002710 RDI: 00000000ffffffff
Mar 25 19:32:12 Tower kernel: RBP: 000000000000000b R08: 0000000000000000 R09: ffffffff82245f30
Mar 25 19:32:12 Tower kernel: R10: 00007fffffffffff R11: ffffffff8296f001 R12: ffff8881f9a0c400
Mar 25 19:32:12 Tower kernel: R13: ffff888184d34200 R14: 0000000000000000 R15: 0000000000000000
Mar 25 19:32:12 Tower kernel: FS:  0000000000000000(0000) GS:ffff888867a40000(0000) knlGS:0000000000000000
Mar 25 19:32:12 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 19:32:12 Tower kernel: CR2: 0000154f0619b000 CR3: 0000000175192006 CR4: 00000000003726e0
Mar 25 19:32:12 Tower kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 25 19:32:12 Tower kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 25 19:32:12 Tower kernel: Call Trace:
Mar 25 19:32:12 Tower kernel: <TASK>
Mar 25 19:32:12 Tower kernel: ? __warn+0xab/0x122
Mar 25 19:32:12 Tower kernel: ? report_bug+0x109/0x17e
Mar 25 19:32:12 Tower kernel: ? do_exit+0x87/0x923
Mar 25 19:32:12 Tower kernel: ? handle_bug+0x41/0x6f
Mar 25 19:32:12 Tower kernel: ? exc_invalid_op+0x13/0x60
Mar 25 19:32:12 Tower kernel: ? asm_exc_invalid_op+0x16/0x20
Mar 25 19:32:12 Tower kernel: ? do_exit+0x87/0x923
Mar 25 19:32:12 Tower kernel: ? worker_thread+0x18b/0x244
Mar 25 19:32:12 Tower kernel: make_task_dead+0x11c/0x11c
Mar 25 19:32:12 Tower kernel: rewind_stack_and_make_dead+0x17/0x17
Mar 25 19:32:12 Tower kernel: RIP: 0000:0x0
Mar 25 19:32:12 Tower kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Mar 25 19:32:12 Tower kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
Mar 25 19:32:12 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Mar 25 19:32:12 Tower kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Mar 25 19:32:12 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 25 19:32:12 Tower kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Mar 25 19:32:12 Tower kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 25 19:32:12 Tower kernel: </TASK>
Mar 25 19:32:12 Tower kernel: ---[ end trace 0000000000000000 ]---
Ma
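A trace this long is easier to discuss once it is boiled down to a few lines. The sketch below, with made-up helper names `summarize` and `flipped_bits`, pulls out the fault line, the tainted-kernel line, the workqueue and the faulting RIP from syslog text, then does one piece of arithmetic on the faulting register. The `expected` pointer used for the comparison is an assumption: RAX (`efffffff81e42c60`) looks like an ordinary kernel address starting with `ffffffff81`, and the comparison only shows how many bits differ from that guess.

```python
FAULT_MARKERS = ("general protection fault", "WARNING: CPU", "BUG:", "Oops:")
CONTEXT_PREFIXES = ("RIP: 0010:", "Workqueue:", "Hardware name:")

# A few lines copied from the log above.
SAMPLE = [
    "Mar 25 19:32:12 Tower kernel: general protection fault, probably for non-canonical address 0xefffffff81e42c70: 0000 [#1] PREEMPT SMP PTI",
    "Mar 25 19:32:12 Tower kernel: CPU: 1 PID: 27726 Comm: kworker/u16:3 Tainted: P           O       6.1.74-Unraid #1",
    "Mar 25 19:32:12 Tower kernel: Workqueue: writeback wb_workfn (flush-btrfs-5)",
    "Mar 25 19:32:12 Tower kernel: RIP: 0010:do_writepages+0xad/0x124",
    "Mar 25 19:32:12 Tower kernel: ? do_writepages+0xad/0x124",
    "Mar 25 19:32:12 Tower kernel: RIP: 0010:do_writepages+0xad/0x124",
    "Mar 25 19:32:12 Tower kernel: WARNING: CPU: 1 PID: 27726 at kernel/exit.c:814 do_exit+0x87/0x923",
]


def summarize(lines):
    """Return the distinct kernel messages that identify a fault, in order."""
    seen = []
    for line in lines:
        _, sep, msg = line.partition(" kernel: ")
        if not sep:
            continue  # not a kernel message (nginx, sshd, ...)
        msg = msg.strip()
        keep = (any(marker in msg for marker in FAULT_MARKERS)
                or msg.startswith(CONTEXT_PREFIXES)
                or " Tainted: " in msg)
        if keep and msg not in seen:
            seen.append(msg)
    return seen


def flipped_bits(observed, expected):
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = observed ^ expected
    return [i for i in range(64) if (diff >> i) & 1]


for message in summarize(SAMPLE):
    print(message)

# observed is RAX from the trace; expected is the assumed intended pointer.
print(flipped_bits(0xEFFFFFFF81E42C60, 0xFFFFFFFF81E42C60))
```

If the guess about the intended pointer is right, the two values differ in a single bit, which is the kind of corruption a memory test is meant to catch. That is a hint worth checking, not a diagnosis.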

 

Then this in the morning

Mar 26 06:17:17 Tower nginx: 2024/03/26 06:17:17 [error] 7356#7356: *1479212 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.22.15, server: , request: "GET /Dashboard/Main/Settings/Device?name=disk6 HTTP/1.1", subrequest: "/auth-request.php", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.22.20"
Mar 26 06:17:17 Tower nginx: 2024/03/26 06:17:17 [error] 7356#7356: *1479212 auth request unexpected status: 504 while sending to client, client: 192.168.22.15, server: , request: "GET /Dashboard/Main/Settings/Device?name=disk6 HTTP/1.1", host: "192.168.22.20"


 

Link to comment

Also, just to follow up: SMART checks found no errors on any of the spinning disks or SSDs. I had then started the new config to get rid of the old, bad 1 TB drive (which had no data on it) with a new rebuild. Based on the estimates, it should have completed around 7 or 8 pm on 3/25, before I lost GUI access.

Link to comment

Nothing else of note that I can see. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Link to comment

I'll give that a shot. It happened in safe mode too, after a day or two. I just scp'd all the log files off the USB drive and ran them through ChatGPT, and nothing came back as a hardware issue. At this point, if it happens in safe mode again, I'm not sure what to do other than blow away the USB drive and rebuild? This only started happening after the update to Unraid's latest version, which also coincided with an NVIDIA driver update for the system.

Link to comment

I was digging through the logs and saw this. Note that I'm using the NVIDIA plugin, and the dockers have been running fine for 5 years on this system.

 

logs/nvidia-smi.txt
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
 

I can rip and replace the NVIDIA plugin and re-download the latest stable drivers.
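Before and after reinstalling, it may help to confirm whether the library named in that message can be found at all. This is a rough sketch, not part of the NVIDIA plugin: `find_nvml` is a made-up helper, and the directories it searches are guesses at common library locations rather than paths confirmed for Unraid.

```python
import ctypes.util
import glob
import os


def find_nvml(extra_dirs=("/usr/lib64", "/usr/lib", "/usr/lib/x86_64-linux-gnu")):
    """Return a name or path for libnvidia-ml if one can be located, else None."""
    # First ask the normal library lookup (ldconfig cache on Linux).
    name = ctypes.util.find_library("nvidia-ml")
    if name:
        return name
    # Fall back to looking for the file in a few assumed directories.
    for directory in extra_dirs:
        hits = sorted(glob.glob(os.path.join(directory, "libnvidia-ml.so*")))
        if hits:
            return hits[0]
    return None


print(find_nvml() or "libnvidia-ml not found")
```

If this finds nothing while the containers are still transcoding on the GPU, that points at where the check was run rather than at a missing driver, so the result should be read with that in mind.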

Link to comment
