Unraid unresponsive / no ssh once a month


Solved by JorgeB

Hi. I have a recurring problem with my Unraid setup. Every month or so, Unraid becomes completely unresponsive: I cannot log into the web UI, the Docker containers are not accessible, and I cannot connect over SSH.

 

I do not have a graphics card, so I cannot hook up a monitor and keyboard to check the logs.

 

Recently, I enabled the option that mirrors the syslog to flash for troubleshooting.

 

 

Today, I powered off my Unraid server. I restarted everything before noon and everything was working great. At about 17h00, Unraid became unresponsive again.

 

My diagnostics files are attached.

 

At the end of syslog-previous you can see what seems to happen at 16h54: the USB connection to my UPS disconnects and reconnects, and a few minutes later there is a general protection fault.

 

Feb  5 16:54:41 Tower kernel: usb 3-4: USB disconnect, device number 3
Feb  5 16:54:42 Tower kernel: usb 3-4: new full-speed USB device number 4 using xhci_hcd
Feb  5 16:54:42 Tower kernel: hid-generic 0003:0764:0501.0002: hiddev96,hidraw0: USB HID v1.10 Device [CPS CP1000AVRLCDa] on usb-0000:0c:00.3-4/input0
Feb  5 16:54:43 Tower kernel: usb 3-4: USB disconnect, device number 4
Feb  5 16:54:44 Tower kernel: usb 3-4: new full-speed USB device number 5 using xhci_hcd
Feb  5 16:54:44 Tower kernel: hid-generic 0003:0764:0501.0003: hiddev96,hidraw0: USB HID v1.10 Device [CPS CP1000AVRLCDa] on usb-0000:0c:00.3-4/input0
Feb  5 16:54:48 Tower apcupsd[6577]: Communications with UPS restored.
Feb  5 16:54:48 Tower sSMTP[2223]: Creating SSL connection to host
Feb  5 16:54:48 Tower sSMTP[2223]: SSL connection using TLS_AES_256_GCM_SHA384
Feb  5 16:54:50 Tower sSMTP[2223]: Sent mail for [email protected] (221 2.0.0 closing connection s18-20020a05622a019200b0042a8a626e3esm330044qtw.53 - gsmtp) uid=0 username=xxx outbytes=652
Feb  5 16:57:27 Tower kernel: general protection fault, probably for non-canonical address 0xffff088157418fb8: 0000 [#1] PREEMPT SMP NOPTI
Feb  5 16:57:27 Tower kernel: CPU: 12 PID: 216 Comm: kswapd0 Tainted: P           O       6.1.64-Unraid #1
Feb  5 16:57:27 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Steel Legend, BIOS P2.90 09/11/2019
Feb  5 16:57:27 Tower kernel: RIP: 0010:remove_extent_mapping+0x3b/0x6e
Feb  5 16:57:27 Tower kernel: Code: 0f 0b 48 89 fe 48 89 df e8 cc f9 ff ff 48 8b 43 68 a8 08 75 2a 48 8b 8b 80 00 00 00 48 8d 83 80 00 00 00 48 8b 93 88 00 00 00 <48> 89 51 08 48 89 0a 48 89 83 80 00 00 00 48 89 83 88 00 00 00 48
Feb  5 16:57:27 Tower kernel: RSP: 0018:ffffc900009cfa28 EFLAGS: 00010246
Feb  5 16:57:27 Tower kernel: RAX: ffff888157418fb0 RBX: ffff888157418f30 RCX: ffff088157418fb0
Feb  5 16:57:27 Tower kernel: RDX: ffff888157418fb0 RSI: ffff88846af9bbd0 RDI: ffff88846af9b900
Feb  5 16:57:27 Tower kernel: RBP: ffff888404e9c028 R08: ffff888404e9be40 R09: 0000000000000000
Feb  5 16:57:27 Tower kernel: R10: 0000000000000402 R11: ffff8884180bf478 R12: 0000000000000cc0
Feb  5 16:57:27 Tower kernel: R13: 000000000021a378 R14: ffff888157418f30 R15: 00000000019c3000
Feb  5 16:57:27 Tower kernel: FS:  0000000000000000(0000) GS:ffff8887eeb00000(0000) knlGS:0000000000000000
Feb  5 16:57:27 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  5 16:57:27 Tower kernel: CR2: 000014a790000010 CR3: 0000000418562000 CR4: 00000000003506e0
Feb  5 16:57:27 Tower kernel: Call Trace:
Feb  5 16:57:27 Tower kernel: <TASK>
Feb  5 16:57:27 Tower kernel: ? __die_body+0x1a/0x5c
Feb  5 16:57:27 Tower kernel: ? die_addr+0x38/0x51
Feb  5 16:57:27 Tower kernel: ? exc_general_protection+0x30f/0x345
Feb  5 16:57:27 Tower kernel: ? asm_exc_general_protection+0x22/0x30
Feb  5 16:57:27 Tower kernel: ? remove_extent_mapping+0x3b/0x6e
Feb  5 16:57:27 Tower kernel: ? remove_extent_mapping+0x1e/0x6e
Feb  5 16:57:27 Tower kernel: try_release_extent_mapping+0x12e/0x20f
Feb  5 16:57:27 Tower kernel: __btrfs_release_folio+0xf/0x31
Feb  5 16:57:27 Tower kernel: shrink_folio_list+0x7ab/0x993
Feb  5 16:57:27 Tower kernel: ? cgroup_rstat_updated+0x21/0xa5
Feb  5 16:57:27 Tower kernel: shrink_lruvec+0x61a/0x9b5
Feb  5 16:57:27 Tower kernel: shrink_node+0x301/0x549
Feb  5 16:57:27 Tower kernel: balance_pgdat+0x4e9/0x6a2
Feb  5 16:57:27 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Feb  5 16:57:27 Tower kernel: ? raw_spin_rq_unlock_irq+0x5/0x10
Feb  5 16:57:27 Tower kernel: ? finish_task_switch.isra.0+0x140/0x218
Feb  5 16:57:27 Tower kernel: kswapd+0x2f0/0x333
Feb  5 16:57:27 Tower kernel: ? _raw_spin_rq_lock_irqsave+0x20/0x20
Feb  5 16:57:27 Tower kernel: ? balance_pgdat+0x6a2/0x6a2
Feb  5 16:57:27 Tower kernel: kthread+0xe7/0xef
Feb  5 16:57:27 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb  5 16:57:27 Tower kernel: ret_from_fork+0x22/0x30
Feb  5 16:57:27 Tower kernel: </TASK>
Feb  5 16:57:27 Tower kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth ipvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc bonding tls edac_mce_amd edac_core intel_rapl_msr intel_rapl_common iosf_mbi kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel mvsas nvme crypto_simd i2c_piix4 cryptd ch341 wmi_bmof libsas r8168(O) rapl i2c_core usbserial k10temp ccp scsi_transport_sas nvme_core ahci libahci wmi button acpi_cpufreq unix
Feb  5 16:57:27 Tower kernel: ---[ end trace 0000000000000000 ]---
Feb  5 16:57:27 Tower kernel: RIP: 0010:remove_extent_mapping+0x3b/0x6e
Feb  5 16:57:27 Tower kernel: Code: 0f 0b 48 89 fe 48 89 df e8 cc f9 ff ff 48 8b 43 68 a8 08 75 2a 48 8b 8b 80 00 00 00 48 8d 83 80 00 00 00 48 8b 93 88 00 00 00 <48> 89 51 08 48 89 0a 48 89 83 80 00 00 00 48 89 83 88 00 00 00 48
Feb  5 16:57:27 Tower kernel: RSP: 0018:ffffc900009cfa28 EFLAGS: 00010246
Feb  5 16:57:27 Tower kernel: RAX: ffff888157418fb0 RBX: ffff888157418f30 RCX: ffff088157418fb0
Feb  5 16:57:27 Tower kernel: RDX: ffff888157418fb0 RSI: ffff88846af9bbd0 RDI: ffff88846af9b900
Feb  5 16:57:27 Tower kernel: RBP: ffff888404e9c028 R08: ffff888404e9be40 R09: 0000000000000000
Feb  5 16:57:27 Tower kernel: R10: 0000000000000402 R11: ffff8884180bf478 R12: 0000000000000cc0
Feb  5 16:57:27 Tower kernel: R13: 000000000021a378 R14: ffff888157418f30 R15: 00000000019c3000
Feb  5 16:57:27 Tower kernel: FS:  0000000000000000(0000) GS:ffff8887eeb00000(0000) knlGS:0000000000000000
Feb  5 16:57:27 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  5 16:57:27 Tower kernel: CR2: 000014a790000010 CR3: 0000000418562000 CR4: 00000000003506e0
Feb  5 16:57:27 Tower kernel: note: kswapd0[216] exited with preempt_count 1
Feb  5 16:57:27 Tower kernel: ------------[ cut here ]------------
Feb  5 16:57:27 Tower kernel: WARNING: CPU: 12 PID: 216 at kernel/exit.c:814 do_exit+0x87/0x923
Feb  5 16:57:27 Tower kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth ipvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc bonding tls edac_mce_amd edac_core intel_rapl_msr intel_rapl_common iosf_mbi kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel mvsas nvme crypto_simd i2c_piix4 cryptd ch341 wmi_bmof libsas r8168(O) rapl i2c_core usbserial k10temp ccp scsi_transport_sas nvme_core ahci libahci wmi button acpi_cpufreq unix
Feb  5 16:57:27 Tower kernel: CPU: 12 PID: 216 Comm: kswapd0 Tainted: P      D    O       6.1.64-Unraid #1
Feb  5 16:57:27 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450 Steel Legend, BIOS P2.90 09/11/2019
Feb  5 16:57:27 Tower kernel: RIP: 0010:do_exit+0x87/0x923
Feb  5 16:57:27 Tower kernel: Code: 24 74 04 75 13 b8 01 00 00 00 41 89 6c 24 60 48 c1 e0 22 49 89 44 24 70 4c 89 ef e8 76 dd 80 00 48 83 bb b0 07 00 00 00 74 02 <0f> 0b 48 8b bb d8 06 00 00 e8 78 dc 80 00 48 8b 83 d0 06 00 00 83
Feb  5 16:57:27 Tower kernel: RSP: 0018:ffffc900009cfee0 EFLAGS: 00010286
Feb  5 16:57:27 Tower kernel: RAX: 0000000080000000 RBX: ffff888102d01f80 RCX: 0000000080000000
Feb  5 16:57:27 Tower kernel: RDX: 0000000000000001 RSI: 0000000000002710 RDI: 00000000ffffffff
Feb  5 16:57:27 Tower kernel: RBP: 000000000000000b R08: 0000000000000000 R09: 0720072007200720
Feb  5 16:57:27 Tower kernel: R10: 0720072007200720 R11: 0720072007200720 R12: ffff888102d10800
Feb  5 16:57:27 Tower kernel: R13: ffff888102d09080 R14: 0000000000000000 R15: 0000000000000000
Feb  5 16:57:27 Tower kernel: FS:  0000000000000000(0000) GS:ffff8887eeb00000(0000) knlGS:0000000000000000
Feb  5 16:57:27 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  5 16:57:27 Tower kernel: CR2: 000014a790000010 CR3: 0000000418562000 CR4: 00000000003506e0
Feb  5 16:57:27 Tower kernel: Call Trace:
Feb  5 16:57:27 Tower kernel: <TASK>
Feb  5 16:57:27 Tower kernel: ? __warn+0xab/0x122
Feb  5 16:57:27 Tower kernel: ? report_bug+0x109/0x17e
Feb  5 16:57:27 Tower kernel: ? do_exit+0x87/0x923
Feb  5 16:57:27 Tower kernel: ? handle_bug+0x41/0x6f
Feb  5 16:57:27 Tower kernel: ? exc_invalid_op+0x13/0x60
Feb  5 16:57:27 Tower kernel: ? asm_exc_invalid_op+0x16/0x20
Feb  5 16:57:27 Tower kernel: ? do_exit+0x87/0x923
Feb  5 16:57:27 Tower kernel: make_task_dead+0x11c/0x11c
Feb  5 16:57:27 Tower kernel: rewind_stack_and_make_dead+0x17/0x17
Feb  5 16:57:27 Tower kernel: RIP: 0000:0x0
Feb  5 16:57:27 Tower kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Feb  5 16:57:27 Tower kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
Feb  5 16:57:27 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Feb  5 16:57:27 Tower kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Feb  5 16:57:27 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Feb  5 16:57:27 Tower kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Feb  5 16:57:27 Tower kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb  5 16:57:27 Tower kernel: </TASK>
Feb  5 16:57:27 Tower kernel: ---[ end trace 0000000000000000 ]---
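
For anyone reading a mirrored syslog after a similar hang, here is a minimal sketch of how the first kernel fault and the lines just before it could be pulled out of the file. The function name `first_fault` and the marker list are illustrative assumptions, not part of Unraid; the sample lines are copied from the log above.

```python
from collections import deque

# Substrings that usually mark a kernel fault in a syslog line.
# This list is an assumption for illustration and is not exhaustive.
FAULT_MARKERS = ("general protection fault", "BUG:", "Oops:", "Kernel panic")


def first_fault(lines, context=5):
    """Return (preceding_lines, fault_line) for the first fault, or None."""
    recent = deque(maxlen=context)  # rolling window of the last few lines
    for line in lines:
        line = line.rstrip("\n")
        if any(marker in line for marker in FAULT_MARKERS):
            return list(recent), line
        recent.append(line)
    return None


# Sample lines taken from the syslog excerpt in this post.
sample = [
    "Feb  5 16:54:44 Tower kernel: usb 3-4: new full-speed USB device number 5 using xhci_hcd",
    "Feb  5 16:54:48 Tower apcupsd[6577]: Communications with UPS restored.",
    "Feb  5 16:57:27 Tower kernel: general protection fault, probably for "
    "non-canonical address 0xffff088157418fb8: 0000 [#1] PREEMPT SMP NOPTI",
    "Feb  5 16:57:27 Tower kernel: CPU: 12 PID: 216 Comm: kswapd0 Tainted: P           O       6.1.64-Unraid #1",
]

result = first_fault(sample, context=2)
if result is not None:
    before, fault = result
    print("first fault:", fault)
    for line in before:
        print("  preceded by:", line)
```

To run it against a real file, pass an open file object instead of the list, for example `first_fault(open(path))`, where `path` is wherever the mirrored syslog was saved. Note that this only shows which events came first in time; the last lines before a fault are not necessarily its cause.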

 

I'm not sure whether the problem really is the USB connection to my UPS. Hopefully someone smarter than me can confirm from the logs.

 

I don't absolutely need my UPS connected over USB, so I will disconnect it for now. It was just a quick way to know when the power at my house went out.

 

Also, in the "Fix Common Problems" settings, the only warnings I have are "Marvell Hard Drive Controller Installed" and "Syslog mirrored to flash".

 

Thank you for your time.

anotower-diagnostics-20240205-1853.zip

Link to comment

I just disabled C-states globally and set "Power Supply Idle Control" to "Typical Current Idle".

 

Sorry for the delay. I had to install a GPU, and it took more time than I like to admit to figure out that the second PCIe slot was disabled because I use two NVMe drives :(

 

I will report back if I have the same problem in the next 30 or 40 days.

 

Thank you

Link to comment

Today, I'm having trouble with my Docker setup. I cannot update certain containers.

 

In the Fix Common Problems plugin, I have "Unable to write to Docker Image" - "Docker Image either full or corrupted".

 

The Docker image was not full. I tried giving it more space, but the same problem persisted.

I read on this forum that it could be caused by bad RAM.

 

I just did a memtest and the test failed within 15 minutes.

 

I guess bad RAM can cause crashes and instability.

I will keep the system down until I buy new RAM.

 

I hope my Docker containers will be OK after that.
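
In hindsight, the register dump in the February 5 trace is consistent with a memory error. RAX and RDX both hold ffff888157418fb0, while RCX holds ffff088157418fb0, and the kernel faulted on the non-canonical address 0xffff088157418fb8, which is RCX + 8. The following is a minimal sketch that compares the two values; the helper names `flipped_bits` and `is_canonical` are made up for illustration, and the canonical-address check assumes 48-bit virtual addresses (4-level paging).

```python
# Register values copied from the general protection fault trace.
RAX = 0xFFFF888157418FB0
RCX = 0xFFFF088157418FB0
FAULT_ADDR = 0xFFFF088157418FB8  # "non-canonical address" reported by the kernel


def flipped_bits(a: int, b: int) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(64) if (diff >> i) & 1]


def is_canonical(addr: int) -> bool:
    """Assuming 48-bit virtual addresses: bits 63..47 must all be equal."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


print(flipped_bits(RAX, RCX))   # [47]
print(is_canonical(RAX))        # True
print(is_canonical(RCX))        # False
print(RCX + 8 == FAULT_ADDR)    # True
```

The two pointers differ in exactly one bit (bit 47), which is what turns a valid kernel address into a non-canonical one. A single flipped bit in a pointer does not prove bad RAM on its own, but it fits the Memtest failure better than the UPS USB reconnect does.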

Link to comment
1 hour ago, mikamap said:

I just did a memtest and the test failed within 15 minutes.

 

I guess bad RAM can cause crashes and instability.

I will keep the system down until I buy new RAM

 

If you have more than one RAM stick, it can be worth testing them individually to see whether each one passes memtest on its own.

 

Link to comment

Yes, I have 4 sticks. Since they are dual-channel kits, I could test each pair individually. The first pair failed almost instantly in Memtest.

I let the second pair run for 48 hours and everything is fine.

 

So at the moment, I am running Unraid with only 16 GB of RAM, and I'm trying to RMA the other pair.

 

At least my system is up at the moment.

Link to comment
