  • [6.12.4] Server hangs once a day since updating to 6.12.4


    bastl
    • Urgent

    Hello everyone,

     

    Coming from 6.12.2 with a stable server, the 6.12.4 update I did a week ago broke something. Once a day, mostly in the morning, I find the server frozen: no WebUI, no SMB access, no SSH, no ping response. I have to force reboot the system.

     

    The server is mainly used for light media consumption with Jellyfin, Nextcloud sync from my phone (CalDAV, CardDAV), Unifi etc., and from time to time some media conversion with the Tdarr or Handbrake dockers, and rarely some remote access with WG. Most dockers sit idle, as do a VM or two. The server is idle most of the day. I made no config changes with the last update, and no custom scripts are running during this time.

     

    On 6.12.2 the server never had any issues or crashes. It started the night after the update.

     

    I activated the syslog server and caught the latest crash.

    Sep 29 19:13:48 mini root: /mnt/cache: 284 GiB (304924037120 bytes) trimmed on /dev/nvme0n1p1
    Sep 30 02:44:09 mini kernel: general protection fault, maybe for address 0xffffc900033abe6c: 0000 [#1] PREEMPT SMP NOPTI
    Sep 30 02:44:09 mini kernel: CPU: 6 PID: 31855 Comm: ps Tainted: P           O       6.1.49-Unraid #1
    Sep 30 02:44:09 mini kernel: Hardware name: BESSTAR TECH LIMITED HM90/HM90, BIOS 5.16 10/13/2021
    Sep 30 02:44:09 mini kernel: RIP: 0010:mntput_no_expire+0x59/0x1f2
    Sep 30 02:44:09 mini kernel: Code: 2e e7 ff 48 8b 83 e8 00 00 00 48 85 c0 74 16 48 8b 7b 50 83 ce ff e8 2f ef ff ff e8 cc 7a e7 ff e9 78 01 00 00 e8 91 ed ff ff <f0> 83 44 24 fc 00 48 8b 7b 50 83 ce ff e8 0e ef ff ff 48 89 df e8
    Sep 30 02:44:09 mini kernel: RSP: 0018:ffffc900033abe70 EFLAGS: 00010286
    Sep 30 02:44:09 mini kernel: RAX: 0000000000000000 RBX: ffff888134bf0838 RCX: 0000000000000064
    Sep 30 02:44:09 mini kernel: RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffff888134bf09c8
    Sep 30 02:44:09 mini kernel: RBP: ffff888106220b00 R08: 0000000000000000 R09: ffff888134bf0858
    Sep 30 02:44:09 mini kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000a801d
    Sep 30 02:44:09 mini kernel: R13: ffff888134bf0858 R14: ffff88818ab54e40 R15: 0000000000000000
    Sep 30 02:44:09 mini kernel: FS:  0000147c21ef77c0(0000) GS:ffff888712d80000(0000) knlGS:0000000000000000
    Sep 30 02:44:09 mini kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Sep 30 02:44:09 mini kernel: CR2: 00000cdb8942f000 CR3: 000000033b1e2000 CR4: 0000000000350ee0
    Sep 30 02:44:09 mini kernel: Call Trace:
    Sep 30 02:44:09 mini kernel: <TASK>
    Sep 30 02:44:09 mini kernel: ? __die_body+0x1a/0x5c
    Sep 30 02:44:09 mini kernel: ? die_addr+0x38/0x51
    Sep 30 02:44:09 mini kernel: ? exc_general_protection+0x30f/0x345
    Sep 30 02:44:09 mini kernel: ? asm_exc_general_protection+0x22/0x30
    Sep 30 02:44:09 mini kernel: ? mntput_no_expire+0x59/0x1f2
    Sep 30 02:44:09 mini kernel: ? mntput_no_expire+0x6b/0x1f2
    Sep 30 02:44:09 mini kernel: ? dput+0x39/0x17b
    Sep 30 02:44:09 mini kernel: ? __fput+0x19f/0x1d2
    Sep 30 02:44:09 mini kernel: ? task_work_run+0x6b/0x80
    Sep 30 02:44:09 mini kernel: ? exit_to_user_mode_prepare+0x75/0x10d
    Sep 30 02:44:09 mini kernel: ? syscall_exit_to_user_mode+0x18/0x2c
    Sep 30 02:44:09 mini kernel: ? do_syscall_64+0x77/0x81
    Sep 30 02:44:09 mini kernel: ? entry_SYSCALL_64_after_hwframe+0x64/0xce
    Sep 30 02:44:09 mini kernel: </TASK>
    Sep 30 02:44:09 mini kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth macvlan xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter dm_crypt dm_mod xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) it87 tcp_diag inet_diag hwmon_vid vendor_reset(O) iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs bridge stp llc igc r8169 realtek amdgpu edac_mce_amd edac_core intel_rapl_msr intel_rapl_common iosf_mbi gpu_sched drm_buddy kvm_amd i2c_algo_bit drm_ttm_helper ttm drm_display_helper kvm
    Sep 30 02:44:09 mini kernel: drm_kms_helper drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 btusb btrtl aesni_intel btbcm btintel crypto_simd cryptd bluetooth agpgart i2c_piix4 syscopyarea rapl ahci ecdh_generic nvme sysfillrect i2c_core k10temp libahci amd_sfh ecc sysimgblt ccp fb_sys_fops nvme_core tpm_crb tpm_tis video tpm_tis_core wmi tpm backlight acpi_cpufreq button unix [last unloaded: igc]
    Sep 30 02:44:09 mini kernel: ---[ end trace 0000000000000000 ]---
    Sep 30 02:44:09 mini kernel: RIP: 0010:mntput_no_expire+0x59/0x1f2
    Sep 30 02:44:09 mini kernel: Code: 2e e7 ff 48 8b 83 e8 00 00 00 48 85 c0 74 16 48 8b 7b 50 83 ce ff e8 2f ef ff ff e8 cc 7a e7 ff e9 78 01 00 00 e8 91 ed ff ff <f0> 83 44 24 fc 00 48 8b 7b 50 83 ce ff e8 0e ef ff ff 48 89 df e8
    Sep 30 02:44:09 mini kernel: RSP: 0018:ffffc900033abe70 EFLAGS: 00010286
    Sep 30 02:44:09 mini kernel: RAX: 0000000000000000 RBX: ffff888134bf0838 RCX: 0000000000000064
    Sep 30 02:44:09 mini kernel: RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffff888134bf09c8
    Sep 30 02:44:09 mini kernel: RBP: ffff888106220b00 R08: 0000000000000000 R09: ffff888134bf0858
    Sep 30 02:44:09 mini kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000a801d
    Sep 30 02:44:09 mini kernel: R13: ffff888134bf0858 R14: ffff88818ab54e40 R15: 0000000000000000
    Sep 30 02:44:09 mini kernel: FS:  0000147c21ef77c0(0000) GS:ffff888712d80000(0000) knlGS:0000000000000000
    Sep 30 02:44:09 mini kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Sep 30 02:44:09 mini kernel: CR2: 00000cdb8942f000 CR3: 000000033b1e2000 CR4: 0000000000350ee0
    Sep 30 02:44:09 mini kernel: note: ps[31855] exited with preempt_count 2

    mini-diagnostics-20230930-1231.zip

     

    No idea how to fix this issue. Any help is appreciated.

     

    syslog-10.0.0.4.log
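For readers who capture a remote syslog the same way, a small script can pull the kernel fault lines out of a long log. The following is a minimal Python sketch and not part of the original report: the function name `find_kernel_faults` and the pattern list are assumptions, with the patterns taken only from the kernel messages quoted in this thread.

```python
import re

# Patterns copied from kernel messages quoted in this thread (assumed list;
# extend it with whatever your own log shows).
FAULT_PATTERNS = [
    re.compile(r"general protection fault"),
    re.compile(r"segfault at"),
    re.compile(r"Call Trace:"),
]


def find_kernel_faults(lines):
    """Return (line_number, text) pairs for kernel lines matching a fault pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Only look at messages logged by the kernel.
        if " kernel: " not in line:
            continue
        if any(pattern.search(line) for pattern in FAULT_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Any iterable of lines works as input, so an open log file can be passed directly; the result gives the line numbers to jump to in the full log.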




    User Feedback

    Recommended Comments



    Hi, I'm having the same issue here with my new install on 6.12.6, which I at some point manually downgraded to 6.12.4.

    This is a new server, maybe 1 month old, and it WAS NOT doing this at first, for about 1-2 weeks.

    Then issues began:

    - The server freezes every 1-2 days

    - Nothing is logged when it freezes

    - I was there once when it happened (connected to the UI, watching the dashboard): the CPU goes up and then nothing works

    - On the attached monitor I still see the (login: ) prompt. I can type the username and press Enter, then nothing...

    - To reboot, once or twice I had luck pressing the power button once: the server went into shutdown mode (I could see it on the console monitor) and then forced the shutdown after 90 seconds. I tested giving it plenty of time, but that does not work; I have to hold the power button to power the server off. Now, most of the time, nothing appears on the screen when I press the power button.

     

    I added syslog on the boot drive and the only thing I see is a black hole during the lockups (see attached)
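When the log simply goes silent, the silence itself can at least be measured to bound when a lockup began. The following is a minimal Python sketch under stated assumptions: the log uses the classic syslog timestamp seen in this thread (`Sep 30 02:44:09`, no year, English month names), the year is supplied by the caller, and the 30-minute threshold and the names `parse_stamp` and `find_gaps` are illustrative choices, not anything Unraid provides.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp; return None if there is none."""
    try:
        # Syslog omits the year, so the caller has to supply it (assumption).
        return datetime.strptime(f"{year} {line.lstrip()[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=30)):
    """Return (last_seen, next_seen, gap) for every silence longer than threshold.

    Logs that cross a year boundary are not handled by this sketch.
    """
    gaps = []
    previous = None
    for line in lines:
        current = parse_stamp(line, year)
        if current is None:
            continue
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps
```

An idle server can be quiet for long stretches on its own, so a gap is only a hint; the `last_seen` value is the latest moment the system was demonstrably still logging.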

    I did multiple diags but here's the latest one (see attached)

    I also installed Netdata to see whether I could get logs or graphs of what actually happens right before the crash, and I got some data out of it!
    This morning the issue seems to have happened around 5:40 AM, and at that time all kinds of things seem to happen in terms of RAM utilization and CPU usage. See attached!

    Please help us. Having to force a shutdown every 1-2 days is not good for the systems, devices and hard drives.

    I might try 6.12.7rc2, but if that doesn't work I might go back to 6.11.5, on which my previous server had no issues.

     

    tower-diagnostics-20240214-0727.zip syslog (1) Netdata Graphs of crash Feb14 5.40AM.7z

    Link to comment
    29 minutes ago, trurl said:

    Have you done memtest?

    No, I haven't.

    I assume the RAM is good because this is my old gaming PC, and it ran flawlessly.

    Oh, the other thing I forgot to mention is my hardware listing:

    [8086:191f]    00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
    [8086:1901]    00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
    [8086:1912]    00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
    [8086:a12f]    00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
    [8086:a13a]    00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
    [8086:a102]    00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
    [8086:a167]    00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
    [8086:a16a]    00:1b.3 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #20 (rev f1)
    [8086:a110]    00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
    [8086:a118]    00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
    [8086:a145]    00:1f.0 ISA bridge: Intel Corporation Z170 Chipset LPC/eSPI Controller (rev 31)
    [8086:a121]    00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
    [8086:a170]    00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
    [8086:a123]    00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
    [8086:15b8]    00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
    [1000:0072]    01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
    [1b21:0612]    03:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
    [1b21:1242]    04:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
    [144d:a802]    05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 (rev 01)

    - I disabled C-State in BIOS

    - I disabled XMP

    - I reset the overclocking to Default

    - All my dockers are in Docker custom network type: ipvlan



    I could run a mem test, but I doubt it's that.
    Do you think the network card bug applies to me?

    Link to comment
    3 hours ago, David Grenon said:

    other thing I forgot to mention is my hardware listing

    We can see that in your diagnostics in system/lspci.txt

     

    3 hours ago, David Grenon said:

    could run a mem test, but I doubt

    If only to eliminate that. Better safe than sorry.

     

    3 hours ago, David Grenon said:

    Do you think the network card bug applies to me?

    Which bug are you referring to?

    Link to comment

    Small update from my side. As long as I close any active session to Unraid's web UI on my main desktop, the server won't freeze. If I actively manage something on the server, there are no freezes. It only happens when I'm logged in to the web UI from my Win10 PC and the PC isn't really in use. But even when idle, it happens randomly, only every 2-3 days. I'm still not sure how to fix this. 😒

    Link to comment
    19 minutes ago, bastl said:

    Small update from my side. As long as I close any active session to Unraid's web UI on my main desktop, the server won't freeze. If I actively manage something on the server, there are no freezes. It only happens when I'm logged in to the web UI from my Win10 PC and the PC isn't really in use. But even when idle, it happens randomly, only every 2-3 days. I'm still not sure how to fix this. 😒

    It's interesting that you mention this, because twice the issue happened while I was away from home. The server was down all day because I couldn't hard reboot it manually, and both times it came back up at the exact moment I got home from work!

    The second time I even turned off my phone before coming home to test this, but something else might have triggered the server to come back up, such as my work laptop in my backpack, which is sometimes not asleep. It could have the web UI open in a cached tab, or a Google remote desktop tab to my VM (on a different host) with the Unraid web UI open.

    I basically open my Unraid UI on multiple devices, and that might have something to do with it...

    How do I troubleshoot this, though? There are no logs...

    To be clear, both times I came back home and the server started working again as if nothing had happened. Plex and everything else were up!
     

    Link to comment
    3 minutes ago, trurl said:

    It is already configured. You have all the logs since I set them up on my flash drive. But as you can see, nothing is logged when the freeze begins. The only thing I can check is the history in Netdata, which I installed on my server and which clearly shows CPU/RAM activity right before the issue happens.

    Are you suggesting that putting logs elsewhere (not on boot drive) would give me more info?

    thanks

    Link to comment
    4 minutes ago, David Grenon said:

    putting logs elsewhere (not on boot drive) would give me more info?

    no

    Link to comment
    On 2/14/2024 at 1:18 PM, trurl said:

    If only to eliminate that. Better safe than sorry.

     

     

    Voila!

    What's next?

    I think I'm going to try the new release... nothing to lose.

    PXL_20240216_005331026.jpg

    PXL_20240216_000113504.jpg

    Link to comment

    Oh, and by the way, I had the issue again today (I didn't have it yesterday). I tried closing all browsers and reopening them on all my devices, and the server did not come back as it did in my earlier post. Maybe those two times when I came home and everything came back up like magic at the same moment were just coincidence, I don't know.

    Link to comment

    Not sure if it's related to the update, but I installed the new release yesterday (6.12.8? or 7RC2) and it has already crashed. Tonight I might check when it actually happened (in case it's related to something running inside my containers that triggers after X hours or at a specific time and crashes the whole thing...), but if I don't see anything like a pattern I'm straight up reverting to 6.11.5, which was not causing issues on my old server (that old server had this same issue on 6.12.x and later).

    I'm a little worried about hard rebooting my NAS every freaking day, with no parity check completing because 22 TB takes longer than the time until the next crash. I don't feel well these days...

    At this point, if 6.11.5 on my new server does the same thing, I might test with all containers stopped...

    If this doesn't work... I think I might consider another OS until 6.13.x. This is ridiculous. A NAS is supposed to be stable. And from what I see (and for different reasons), multiple users have had issues with 6.12.x.

     

    Anything to help me ?

    Link to comment
    6 minutes ago, David Grenon said:

    Anything to help me ?

    If there's nothing relevant logged, one thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

    Link to comment
    19 hours ago, JorgeB said:

    If there's nothing relevant logged, one thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

     

     

    Yeah. My colleague also suggested that I export process usage every 15 minutes so that, during the roughly one-hour period when CPU and RAM ramp up, I can see which process is active and maybe pinpoint the issue.
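The periodic export suggested here could look roughly like the following. This is a minimal Python sketch under stated assumptions, not a tested Unraid recipe: it assumes Python and a procps-style `ps` are available on the server, the function names `top_processes` and `append_snapshot` are hypothetical, and the log path is whatever persistent location survives a hard reboot.

```python
import subprocess
from datetime import datetime


def top_processes(ps_output, limit=5):
    """Parse 'pid pcpu pmem comm' rows; return the busiest rows by CPU, then memory."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # the command name may contain spaces
        if len(parts) != 4:
            continue
        pid, cpu, mem, command = parts
        try:
            rows.append((int(pid), float(cpu), float(mem), command.strip()))
        except ValueError:
            continue
    rows.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return rows[:limit]


def append_snapshot(log_path, limit=5):
    """Run ps once and append a timestamped top-N list to log_path."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,comm"],
        capture_output=True, text=True, check=True,
    )
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as handle:
        for pid, cpu, mem, command in top_processes(result.stdout, limit):
            handle.write(f"{stamp} pid={pid} cpu={cpu}% mem={mem}% {command}\n")
```

Calling `append_snapshot` from a scheduler every 15 minutes (or more often, since the ramp-up window is short) leaves a trail that can be read after the next forced reboot.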

     

    I also updated to 6.12.8 and rebooted this morning because the server hung again.

    If it crashes today, I'll try running without the Docker and VM services enabled and see.

     

    Thanks for the reply.

    Link to comment

    Didn't get the bug until this morning...

    One thing to mention is that I opened the WebGUI on a laptop yesterday afternoon to do a few things (open the dockers' web UIs).

    That laptop then went to sleep.

    I also had the WebGUI open on my desktop, for the whole weekend.

     

    Netdata is able to show me these logs (see attachment).
    (That's what happens before each crash.)

    When it crashes again, I think I'll stop everything but the Plex docker to see, and if it happens again I'll disable everything. If it still happens after that, I'll go back to 6.11.5.

    I guess it's a container issue, but even if it is... a Docker glitch, misconfiguration or whatever shouldn't lock up the whole system to the point that we need a hard reset. No?

     


     

    2024-02-19 11_54_48-SonarrRadarrJackettDeluge — Mozilla Firefox.png

    Link to comment
    5 minutes ago, David Grenon said:

    I guess it's a container issue, but even if it is... a Docker glitch, misconfiguration or whatever shouldn't lock up the whole system to the point that we need a hard reset. No?

    It shouldn't, but it's been known to happen.

    Link to comment

    I also sometimes have this in syslogs-Previous:
     

    Feb 19 02:03:42 Tower kernel: PMS LoudnessCmd[26943]: segfault at 0 ip 000014f36c8db080 sp 000014f36738a0c8 error 4 in libswresample.so.4[14f36c8d3000+18000] likely on CPU 7 (core 3, socket 0)
    Feb 19 02:03:42 Tower kernel: Code: 01 cf 4c 39 c7 72 e3 c3 cc cc 8d 04 49 48 98 4d 89 c1 49 29 c1 48 63 c2 48 63 c9 49 39 f9 76 75 f2 0f 10 05 02 05 ff ff 66 90 <0f> bf 16 0f 57 c9 f2 0f 2a ca f2 0f 59 c8 f2 0f 11 0f 0f bf 14 06
    Feb 19 02:04:07 Tower kernel: PMS LoudnessCmd[27486]: segfault at 0 ip 0000148a53886fc3 sp 0000148a4e1380c8 error 4 in libswresample.so.4[148a53885000+18000] likely on CPU 4 (core 0, socket 0)
    Feb 19 02:04:07 Tower kernel: Code: 0f 00 00 00 0f 85 73 ff ff ff 48 f7 c6 0f 00 00 00 0f 85 66 ff ff ff 48 8d 34 56 48 8d 3c 97 48 f7 da 66 0f 6f 2d 7d 64 ff ff <66> 0f 6f 04 56 66 0f 6f 4c 56 10 66 0f ef d2 66 0f ef db 66 0f 61

    I've read elsewhere that I shouldn't worry about it, but should I?

    These two entries are the only ones that show up before the crash...

    Link to comment
    1 hour ago, JorgeB said:

    It shouldn't, but it's been known to happen.

    How can I prevent this? Is there a way to specifically tell Unraid something like:

    Hey, keep at least 2-4 GB of RAM and 2 CPU cores UNUSED by anything but the system itself, so that Unraid doesn't freeze and I can troubleshoot when everything hangs?

    Or what can I do on my side to help you understand which docker (I don't run any VMs) or plugin might be misconfigured? Screenshots of every docker config?

    Would that help you identify it?
    Thank you,
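One partial answer to the question above is per-container limits: `docker run` accepts `--cpuset-cpus` to pin a container to specific cores and `--memory` to cap its RAM. The following is a minimal Python sketch that only builds such a flag string; the helper name `docker_limit_flags` is hypothetical, and whether these limits are enough to keep Unraid responsive during the freezes described in this thread is an assumption nobody here has confirmed.

```python
def docker_limit_flags(total_cores, reserved_cores, memory_limit):
    """Build docker run flags that leave the first reserved_cores cores to the host.

    memory_limit uses Docker's own notation, for example "4g".
    """
    if not 0 <= reserved_cores < total_cores:
        raise ValueError("reserved_cores must leave at least one core for the container")
    first, last = reserved_cores, total_cores - 1
    # A single core is written as "3", a range as "2-7".
    cpuset = str(first) if first == last else f"{first}-{last}"
    return f"--cpuset-cpus={cpuset} --memory={memory_limit}"
```

The limits apply to each container separately, so they would have to be added to every container that should stay off the reserved cores, and the memory caps add up across containers rather than reserving a fixed amount for the host.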

    Link to comment
    3 minutes ago, David Grenon said:

    Would that help you identify it?

    You can start the containers one at a time and retest to see if you find the culprit.

    Link to comment

    I had a very similar issue, and even though my memtest for 8 hours passed, it was still bad/misconfigured ram.  

     

    I took out one of the ram dimms, and it completely resolved my problem 

    Link to comment
    41 minutes ago, Terebi said:

    I had a very similar issue, and even though my memtest for 8 hours passed, it was still bad/misconfigured ram.  

     

    I took out one of the ram dimms, and it completely resolved my problem 

    I understand, and it might be worth a try.

    The thing is, I had this specific issue on my old server with different RAM sticks, and that old server is now running flawlessly on 6.11.5.

    As I said earlier, this was my old gaming PC, and I never had any issues with it.

     

    Link to comment
    1 hour ago, JorgeB said:

    You can start the containers one at a time and retest to see if you find the culprit.

    If it crashes again, I'll start only Plex.

    Are plugins something to be worried about?

    What should I do if I want to start only the Plex docker and nothing else?
    How can I make sure that nothing else is running, even among plugins?

    These are what I currently have running; ignore Netdata, because it was installed after the issue began.
     

    2024-02-19 15_28_30-SonarrRadarrJackettDeluge — Mozilla Firefox.png

    Link to comment





