Server Unresponsive

GatorMB · February 13

Good day all,

Disclaimer: I'm new to this and honestly fumbling along as best I can. I exhausted multiple resources before posting here.

My system specs are in my signature.

I initially built my server with a Supermicro X9SCL mobo, a Xeon E3-1230 v2 and 16GB of RAM. It ran well, but transcoding was an issue, so I decided to replace a few items. I changed to the X11SSH-F mobo, a Xeon E3-1285 v6 and 64GB of UDIMM, and then added a Tesla P4. It is a headless server. With the new setup it runs well, and the Tdarr and Plex containers now use the GPU for transcoding.

However, when I go to bed, I wake up and can no longer access the server from anywhere. The MAC address doesn't even show up in the router. I end up cutting power to the server and rebooting, which lets me access it from a desktop again. Sometimes I can access it for hours, and sometimes it locks up again within minutes. As of right now, it's been up for 6h57m. It's as if it just times out when it's not being accessed. I have brought it up to my office and connected it to a monitor, and it made no difference.

I have tried another mobo, tried new RAM, and reloaded everything one step at a time. The only thing I haven't replaced is the CPU. Is this a config issue or a hardware issue? I have absolutely no idea and am unsure where to find more info.

Any help is greatly appreciated!!

trurl · February 13

Attach Diagnostics to your NEXT post in this thread.

Also set up the syslog server.

GatorMB · February 13

Hope this helps.

goldraid-diagnostics-20240213-1617.zip goldraid-syslog-20240213-2221.zip

JorgeB · February 14

Nothing obvious so far, post the persistent syslog after a crash.

11 hours ago, trurl said:

Also set up the syslog server.

GatorMB · February 14

I have enabled the persistent syslog server. I will hopefully have better info soon. I woke up this AM and it had locked up again.

goldraid-diagnostics-20240214-0912.zip

JorgeB · February 14

And where is it? It only comes with the diags if you chose the mirror to flash drive option.

GatorMB · February 14

syslog.txt

I have enabled the syslog, and when I get more data I will forward it after the next crash. For now, this is all I have. Are my shares set up incorrectly?

GatorMB · February 14

More specifically, here is a snippet of the log that I think might be the issue, though I have no idea how to resolve it.

Feb 14 09:38:01 GOLDRAID root: Fix Common Problems Version 2024.01.18

Feb 14 09:38:02 GOLDRAID root: Fix Common Problems: Warning: Share appdata set to cache-only, but files / folders exist on the array

Feb 14 09:38:02 GOLDRAID root: Fix Common Problems: Warning: Docker Application bazarr has an update available for it

Feb 14 09:38:10 GOLDRAID kernel: ------------[ cut here ]------------

Feb 14 09:38:10 GOLDRAID kernel: WARNING: CPU: 2 PID: 270 at net/netfilter/nf_conntrack_core.c:1210 __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]

Feb 14 09:38:10 GOLDRAID kernel: Modules linked in: nvidia_uvm(PO) macvlan xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag jc42 regmap_i2c ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs 8021q garp mrp bridge stp llc bonding tls igb intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp nvidia_drm(PO) nvidia_modeset(PO) kvm_intel i915 kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 ast sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd drm_vram_helper iosf_mbi drm_ttm_helper drm_buddy ipmi_ssif cryptd nvidia(PO) drm_display_helper rapl ttm mpt3sas drm_kms_helper intel_cstate raid_class intel_uncore drm intel_gtt i2c_i801 joydev i2c_smbus scsi_transport_sas agpgart syscopyarea ahci

Feb 14 09:38:10 GOLDRAID kernel: i2c_algo_bit mei_me input_leds acpi_ipmi led_class video sysfillrect i2c_core mei libahci sysimgblt intel_pch_thermal wmi fb_sys_fops thermal fan ipmi_si backlight intel_pmc_core acpi_power_meter acpi_pad button unix [last unloaded: igb]

Feb 14 09:38:10 GOLDRAID kernel: CPU: 2 PID: 270 Comm: kworker/u16:5 Tainted: P IO 6.1.64-Unraid #1

Feb 14 09:38:10 GOLDRAID kernel: Hardware name: Supermicro PIO-1UUP-ND1-AI036/X11SSH-F, BIOS 3.0 06/06/2023

Feb 14 09:38:10 GOLDRAID kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]

Feb 14 09:38:10 GOLDRAID kernel: RIP: 0010:__nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]

Feb 14 09:38:10 GOLDRAID kernel: Code: 44 24 10 e8 e2 e1 ff ff 8b 7c 24 04 89 ea 89 c6 89 04 24 e8 7e e6 ff ff 84 c0 75 a2 48 89 df e8 9b e2 ff ff 85 c0 89 c5 74 18 <0f> 0b 8b 34 24 8b 7c 24 04 e8 18 dd ff ff e8 93 e3 ff ff e9 72 01

Feb 14 09:38:10 GOLDRAID kernel: RSP: 0000:ffffc90000178d98 EFLAGS: 00010202

Feb 14 09:38:10 GOLDRAID kernel: RAX: 0000000000000001 RBX: ffff88879ccdf800 RCX: 0c5d9175f44708e9

Feb 14 09:38:10 GOLDRAID kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88879ccdf800

Feb 14 09:38:10 GOLDRAID kernel: RBP: 0000000000000001 R08: fe5f0278be059942 R09: 76ba4064315dc017

Feb 14 09:38:10 GOLDRAID kernel: R10: d444fff6b20efcb2 R11: ffffc90000178d60 R12: ffffffff82a14d00

Feb 14 09:38:10 GOLDRAID kernel: R13: 0000000000034e1b R14: ffff888416ba0400 R15: 0000000000000000

Feb 14 09:38:10 GOLDRAID kernel: FS: 0000000000000000(0000) GS:ffff889055280000(0000) knlGS:0000000000000000

Feb 14 09:38:10 GOLDRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Feb 14 09:38:10 GOLDRAID kernel: CR2: 00001fb78b21b000 CR3: 00000002e2362001 CR4: 00000000003706e0

Feb 14 09:38:10 GOLDRAID kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

Feb 14 09:38:10 GOLDRAID kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Feb 14 09:38:10 GOLDRAID kernel: Call Trace:

Feb 14 09:38:10 GOLDRAID kernel: <IRQ>

Feb 14 09:38:10 GOLDRAID kernel: ? __warn+0xab/0x122

Feb 14 09:38:10 GOLDRAID kernel: ? report_bug+0x109/0x17e

Feb 14 09:38:10 GOLDRAID kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]

Feb 14 09:38:10 GOLDRAID kernel: ? handle_bug+0x41/0x6f

Feb 14 09:38:10 GOLDRAID kernel: ? exc_invalid_op+0x13/0x60

Feb 14 09:38:10 GOLDRAID kernel: ? asm_exc_invalid_op+0x16/0x20

Feb 14 09:38:10 GOLDRAID kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]

Feb 14 09:38:10 GOLDRAID kernel: ? __nf_conntrack_confirm+0x9e/0x2b0 [nf_conntrack]

Feb 14 09:38:10 GOLDRAID kernel: ? nf_nat_inet_fn+0x60/0x1a8 [nf_nat]

Feb 14 09:38:10 GOLDRAID kernel: nf_conntrack_confirm+0x25/0x54 [nf_conntrack]

Feb 14 09:38:10 GOLDRAID kernel: nf_hook_slow+0x3a/0x96

Feb 14 09:38:10 GOLDRAID kernel: ? ip_protocol_deliver_rcu+0x164/0x164

Feb 14 09:38:10 GOLDRAID kernel: NF_HOOK.constprop.0+0x79/0xd9

Feb 14 09:38:10 GOLDRAID kernel: ? ip_protocol_deliver_rcu+0x164/0x164

Feb 14 09:38:10 GOLDRAID kernel: __netif_receive_skb_one_core+0x77/0x9c

Feb 14 09:38:10 GOLDRAID kernel: process_backlog+0x8c/0x116

Feb 14 09:38:10 GOLDRAID kernel: __napi_poll.constprop.0+0x28/0x124

Feb 14 09:38:10 GOLDRAID kernel: net_rx_action+0x159/0x24f

Feb 14 09:38:10 GOLDRAID kernel: __do_softirq+0x126/0x288

Feb 14 09:38:10 GOLDRAID kernel: do_softirq+0x7f/0xab

Feb 14 09:38:10 GOLDRAID kernel: </IRQ>

Feb 14 09:38:10 GOLDRAID kernel: <TASK>

Feb 14 09:38:10 GOLDRAID kernel: __local_bh_enable_ip+0x4c/0x6b

Feb 14 09:38:10 GOLDRAID kernel: netif_rx+0x52/0x5a

Feb 14 09:38:10 GOLDRAID kernel: macvlan_broadcast+0x10a/0x150 [macvlan]

Feb 14 09:38:10 GOLDRAID kernel: ? _raw_spin_unlock+0x14/0x29

Feb 14 09:38:10 GOLDRAID kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

Feb 14 09:38:10 GOLDRAID kernel: process_one_work+0x1a8/0x295

Feb 14 09:38:10 GOLDRAID kernel: worker_thread+0x18b/0x244

Feb 14 09:38:10 GOLDRAID kernel: ? rescuer_thread+0x281/0x281

Feb 14 09:38:10 GOLDRAID kernel: kthread+0xe4/0xef

Feb 14 09:38:10 GOLDRAID kernel: ? kthread_complete_and_exit+0x1b/0x1b

Feb 14 09:38:10 GOLDRAID kernel: ret_from_fork+0x1f/0x30

Feb 14 09:38:10 GOLDRAID kernel: </TASK>

Feb 14 09:38:10 GOLDRAID kernel: ---[ end trace 0000000000000000 ]---

Feb 14 09:38:12 GOLDRAID root: Fix Common Problems: Warning: Syslog mirrored to flash

JorgeB · February 14

39 minutes ago, GatorMB said:

Feb 14 09:38:10 GOLDRAID kernel: macvlan_broadcast+0x10a/0x150 [macvlan]

Feb 14 09:38:10 GOLDRAID kernel: ? _raw_spin_unlock+0x14/0x29

Feb 14 09:38:10 GOLDRAID kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

Macvlan call traces will usually end up crashing the server. Switching to ipvlan should fix it: go to Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right), then reboot.
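
For anyone who wants to check their own log for the same thing, here is a minimal sketch, not an official Unraid tool. It counts the syslog lines tagged with the `[macvlan]` module, like the ones quoted above. The function name `count_macvlan_lines` is made up for illustration, and the log path is whatever file you saved.

```shell
#!/bin/sh
# Hypothetical helper (not part of Unraid): count the lines in a saved
# syslog that are tagged with the macvlan kernel module, e.g.
#   "... kernel: macvlan_broadcast+0x10a/0x150 [macvlan]"
count_macvlan_lines() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps this usable under "set -e".
    grep -c '\[macvlan\]' "$1" || true
}
```

Called as `count_macvlan_lines syslog.txt`, it prints a single number. After the switch to ipvlan and a reboot, a fresh log would be expected to show 0, although a zero count alone does not prove the crashes are fixed.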

trurl · February 14

1 hour ago, GatorMB said:

Feb 14 09:38:02 GOLDRAID root: Fix Common Problems: Warning: Share appdata set to cache-only, but files / folders exist on the array

Your appdata share has files on the array, and that share isn't configured to allow mover to move it to cache.

Also, your system share is all on the array.

Ideally, appdata, domains, and system shares would have all files on fast pool such as cache, with nothing on the array, and configured to stay there, so Dockers/VMs will perform better, and so array disks can spin down since these files are always open.

Nothing can move open files. You will have to disable Docker and VM Manager in Settings to get these moved to cache.
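
To see where a share's files actually sit, something like the following rough sketch can help. It assumes each disk and pool is mounted as its own directory under one root (for example `/mnt/disk2` and `/mnt/cache`); the function name `share_file_counts` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: for every top-level directory under ROOT that
# contains SHARE, print the share path and how many regular files it holds.
# The body runs in a subshell ( ... ) so its variables do not leak out.
share_file_counts() (
    root=$1
    share=$2
    for dir in "$root"/*/; do
        # Skip mounts that do not contain this share at all.
        [ -d "$dir$share" ] || continue
        count=$(find "$dir$share" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$dir$share" "$count"
    done
)
```

For example, `share_file_counts /mnt appdata` prints one line per location. Any array disk that shows a non-zero count still holds files for that share. A merged view such as `/mnt/user` will show the combined total rather than a separate copy.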

GatorMB · February 14

So, it crashed again. Here is the syslog info.

syslog-previous.txt syslog.txt

trurl · February 14

What do you get from command line with this?

docker network ls

GatorMB · February 14

1 minute ago, trurl said:
docker network ls

NETWORK ID     NAME      DRIVER    SCOPE
400cca84e0d2   br0       ipvlan    local
06eb05e8c95b   bridge    bridge    local
74b5ac950c5a   host      host      local
8c0d0933753b   none      null      local
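
The DRIVER column in that listing is the part that matters here. As a small sketch, the filter below reads `docker network ls` output on stdin and prints the names of any networks still using the macvlan driver; the function name `macvlan_networks` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "docker network ls" output on stdin and print
# the NAME of every network whose DRIVER column is "macvlan".
macvlan_networks() {
    # NR > 1 skips the header row; in data rows the columns are
    # $1 = NETWORK ID, $2 = NAME, $3 = DRIVER, $4 = SCOPE.
    awk 'NR > 1 && $3 == "macvlan" { print $2 }'
}
```

Run as `docker network ls | macvlan_networks`. With the listing above it prints nothing, since br0 is already on ipvlan.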

GatorMB · February 14

4 hours ago, trurl said:

Your appdata share has files on the array, and that share isn't configured to allow mover to move it to cache.

Also, your system share is all on the array.

Ideally, appdata, domains, and system shares would have all files on fast pool such as cache, with nothing on the array, and configured to stay there, so Dockers/VMs will perform better, and so array disks can spin down since these files are always open.

Nothing can move open files. You will have to disable Docker and VM Manager in Settings to get these moved to cache.

I went to the command line (bash) and made sure all files in the appdata / domains / system shares were moved to the cache using:

rsync -av --remove-source-files /mnt/disk2/appdata/ /mnt/cache/appdata/

It moved all files. I then removed all empty folders left behind:
find /mnt/disk2/appdata/ -type d -empty -delete

I corrected appdata folder permissions:
chmod -R 755 /mnt/cache/appdata/

chown -R nobody:users /mnt/cache/appdata/

I made sure that the appdata / domains / system shares were all now cache only in my shares menu.

I made sure that the data & iCloud-drive-sync shares were all now pointing to array in my shares menu.

I reloaded the nvidia driver, and installed the gpu statistics plug-in.

I changed macvlan to ipvlan.

I have checked the cache pool and it's still pretty full (I think). I have a 2TB SSD and a 256GB SSD. Do I need a larger cache pool?

What am I missing?

And before I forget, huge thanks to trurl & JorgeB for all your help. I really appreciate the time you are taking!

trurl · February 14

34 minutes ago, GatorMB said:

I corrected appdata folder permissions:
chmod -R 755 /mnt/cache/appdata/

chown -R nobody:users /mnt/cache/appdata/

That might not be appropriate for all containers. That is why there is a Docker Safe New Permissions tool, so that New Permissions can be run while excluding appdata.
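
Before running a recursive chmod or chown on appdata, it can be worth recording what each container had set. This is a rough sketch, assuming a `stat` that supports `-c` (GNU coreutils or BusyBox); the function name `snapshot_perms` is made up for illustration, and it only records permissions, it does not restore them.

```shell
#!/bin/sh
# Hypothetical helper: print "mode owner:group path" for DIR and everything
# below it, sorted by path, so two snapshots can be compared with diff.
# Assumes stat accepts -c with %a, %U, %G and %n (GNU coreutils or BusyBox).
snapshot_perms() {
    find "$1" -exec stat -c '%a %U:%G %n' {} + | sort -k3
}
```

For example, `snapshot_perms /mnt/cache/appdata > appdata-perms.txt` saves the current state to a file. Running it again later and diffing the two files shows exactly which entries a bulk change touched.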

GatorMB · February 14

19 minutes ago, trurl said:

That might not be appropriate for all containers. That is why there is a Docker Safe New Permissions tool, so that New Permissions can be run while excluding appdata.

What would you suggest?

trurl · February 14

Just now, GatorMB said:

What would you suggest?

Let each container manage their appdata permissions.

GatorMB · February 14

1 minute ago, trurl said:

Let each container manage their appdata permissions.

Ok, then how would I reset it to that?

trurl · February 15

There is no "reset" for that. If one of your containers seems to be broken you might have to delete its appdata and let it recreate.

trurl · February 15

Do you have appdata backup?

GatorMB · February 15

4 minutes ago, trurl said:

Do you have appdata backup?

I will after this learning experience! lol

It’s running fine now. I’ll post in the morning and let you know if it crashed again. Thanks again for all your help!

GatorMB · February 17

So yesterday I woke up to an unresponsive server again. No network connectivity, nothing. So I decided to try two more things. I changed the PSU from a 600W unit to a new Corsair 850W 80+ Gold. I then added a Corsair water cooler for the CPU. Again this morning, I woke to an unresponsive server. But this time it was still showing on the router as connected.

If it crashes over the weekend, I'll upload a new set of logs.

goldraid-diagnostics-20240216-2022.zip syslog-2.txt

bastl · February 17

@GatorMB Do you have an idle session to Unraid's WebUI open somewhere on your network? I have an issue with the server freezing randomly if I have a web session open from a Windows box with Firefox. Randomly, every 1-2 days with root logged in to Unraid, the server will freeze without any errors caught in the logs. If I log off or shut down the Windows PC, the server won't crash.

GatorMB · February 17

1 hour ago, bastl said:

@GatorMB Do you have an idle session to Unraid's WebUI open somewhere on your network? I have an issue with the server freezing randomly if I have a web session open from a Windows box with Firefox. Randomly, every 1-2 days with root logged in to Unraid, the server will freeze without any errors caught in the logs. If I log off or shut down the Windows PC, the server won't crash.

No, I don't leave any active connections. It's a headless server and I only log in to run a process or to try to figure out an issue such as this. I'm going to enable the IPMI function and connect that LAN port for diagnostics later this weekend.

GatorMB · February 18

Ok, so I went away to the lake yesterday morning and left the server running. I got back an hour ago and it's all locked up again. Not showing on the network either. Did a hard reset and it came up fine. Here are the logs. I can't figure this out! Someone please point me in the right direction!

syslog syslog-previous
