UNRAID server freeze/unresponsive about every 24h~ with error "rcu_sched self-detected stall on CPU" in the logs

Ingenioes · February 6, 2021

Hey guys,

my UNRAID server (6.8.3) has started becoming completely unresponsive over the last few days, and I would love it if you guys could help me solve the problem.

Installed Docker: binhex-qbittorrent, bitwardenrs, apacheguacamole, duplicati, binhex-krusader, mariadb, nextcloud, Grafana, JDownloader2, plex, unifi-controller, Influxdb, tautulli, Varken, telegraf, swag, redis, Authelia. (All of them are set to autostart and running 24/7.)

(Firefox and binhex-teamspeak are installed but I actually never use them.)

Installed Applications: CA User Scripts, CA GUI Links, CA Fix Common Problems, CA Auto Update, CA Auto Turbo Write Mode, CA Appdata Backup/Restore V2, Community Applications, Custom Tab, Dynamix Active Streams, Dynamix Cache Dirs, Dynamix Schedules, Dynamix SSD Trim, Dynamix System Info, Dynamix System Stats, Dynamix System Temp, Dynamix Wireguard, File Activity, IPMI Tools, Mover Tuning, NerdPack GUI, Preclear Disk, rclone, Recycle Bin, Tips and Tweaks, Unassigned Devices, Unassigned Devices Plus, Unlimited Width, Virtual Machine Wake on Lan, Wake on Lan ... long list, sorry

Hardware: Dell R730 XD; 2x E5 2667 v3; 4x8GB DDR4 2133 ECC (HMA41GR7MFR8N-TF), BIOS 2.12.1, NIC: Dell 099GTM 99GTM (one 10G port for LAN and both 1G ports passed through to the Untangle VM).

Parity: WDC_WD140EDFZ; WDC_WD140EMFZ

Data: 4x WDC_WD80EFZX; WDC_WD80EFZX; ST8000AS0002; WDC_WD120EDAZ

Cache: Corsair MP510 via EZDIY-FAB PCI Express M.2 SSD NGFF PCIe Card to PCIe 3.0 x4 Adapter (Support M.2 PCIe 22110,2280, 2260, 2242) from German Amazon: Link

Problem:

It started on 25.01 and happened again on 27.01, more or less exactly 48h apart (25.01 18:46 // 27.01 19:10).

Afterwards I turned on the syslog to USB option so I could have a look at the logs, and I captured a screenshot via iDRAC when the server crashed again on 31.01 at 17:41 (image attached).
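To put numbers on the intervals between the three crashes listed so far, here is a small Python sketch. The year 2021 is an assumption taken from the date of this post, and `hours_between` is just a helper name made up for the example.

```python
from datetime import datetime

# Crash times as reported in this post; the year 2021 is assumed.
crashes = [
    datetime(2021, 1, 25, 18, 46),
    datetime(2021, 1, 27, 19, 10),
    datetime(2021, 1, 31, 17, 41),
]

def hours_between(times):
    """Return the gap between consecutive timestamps, in hours."""
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]

for gap in hours_between(crashes):
    print(f"{gap:.1f} h")  # prints 48.4 h, then 94.5 h
```

So the first two crashes were about 48.4h apart and the third came roughly 94.5h after the second, which does not look like a strict fixed interval.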

When it has crashed, the server is completely unresponsive: I can't reach my VMs/Dockers, nor can I reach the UNRAID server itself. No ping, no SSH, no web UI; even the monitor/mouse/keyboard on the UNRAID server is completely frozen.

Sadly I didn't have a lot of time lately to look into this, so I created a user script that restarted my server every night, which did "fix" the problem. When I disabled that restart, my server would crash after more or less 24h.

What I've tried:

I already ran the hardware test utility of the Dell Lifecycle Controller/iDRAC (also the thorough version). Sadly (or luckily?) it didn't show any problems.

Ran memtest last night (the one from UNRAID itself didn't start, so I used V5.31b from the memtest.org website); that also did not show any errors.

It also crashed sometime in the night from Thursday to Friday, as it was unresponsive on Friday morning, 5.2.21. (I had turned off the nightly restart because I did a BIOS update two nights before and wanted to see if it helped.)

Thoughts:

As the error "rcu_sched self-detected stall on CPU" shows up in the syslog for every crash I can recall, I was thinking maybe it is a dead CPU?
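For anyone who wants to check a saved syslog for the same message, a minimal Python sketch could look like this. It only does a plain substring search; the file name in the usage comment is the attachment from this post, and `find_matches` / `scan_file` are hypothetical helper names, not part of any Unraid tool.

```python
# Message text as quoted in this thread; add more strings to the tuple as needed.
PATTERNS = ("rcu_sched self-detected stall on CPU",)

def find_matches(lines, patterns=PATTERNS):
    """Yield (line_number, line) for every line containing one of the patterns."""
    for number, line in enumerate(lines, start=1):
        if any(pattern in line for pattern in patterns):
            yield number, line.rstrip("\n")

def scan_file(path, patterns=PATTERNS):
    """Scan a syslog file on disk and return all matches as a list."""
    with open(path, errors="replace") as handle:
        return list(find_matches(handle, patterns))

# Usage (path is an assumption, point it at your own saved syslog):
# for number, line in scan_file("syslog_5.2.21"):
#     print(number, line)
```

Comparing the line numbers and timestamps of the hits against the known crash times shows whether the stall message really precedes every lockup or only some of them.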

I have two other CPUs on hand that I could install and also have another NIC daughter board (0FM487) in case you guys think it's a hardware problem.

I was also thinking about running Ultimate Boot CD or something like that the next night to stress test my CPUs?

Let me know if I forgot any important info that you need.

Thank you very much in advance, from what I've seen so far the UNRAID family is really helpful!

Have a nice weekend everyone

vault-diagnostics-20210206-1216.zip syslog_5.2.21

JorgeB · February 6, 2021

Jan 30 22:28:42 Vault kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Jan 30 22:28:42 Vault kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]

The problem starts with macvlan call traces. These are usually the result of having dockers with a custom IP address, and they may end up locking up Unraid, see below for more info:

Ingenioes · February 6, 2021

Thanks for the heads-up JorgeB,

I did indeed have one Docker running on a dedicated IP via br0. I just swapped it back to bridge mode and will keep an eye on it.

If there is no error for the next few days I'll mark this post as solved, thank you very much.
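To double check that no other container is still attached to br0 or another custom network, something like the following Python sketch could be used. It assumes Python 3 and the `docker` CLI are available on the host, and it reads the `HostConfig.NetworkMode` field from `docker inspect`. The function names are made up for this example. Note that it flags every non-default network, so a user-defined bridge network would also be listed and has to be judged by hand.

```python
import json
import subprocess

# Network modes that do not give a container its own address on the LAN.
SHARED_MODES = {"bridge", "host", "none", "default"}

def containers_with_custom_network(inspect_data):
    """Return (name, network_mode) for containers not using a shared mode."""
    flagged = []
    for entry in inspect_data:
        mode = entry.get("HostConfig", {}).get("NetworkMode", "")
        if mode in SHARED_MODES or mode.startswith("container:"):
            continue
        flagged.append((entry.get("Name", "").lstrip("/"), mode))
    return flagged

def inspect_all_containers():
    """Run `docker ps -aq` and `docker inspect`, return the parsed JSON list."""
    ids = subprocess.run(
        ["docker", "ps", "-aq"], capture_output=True, text=True, check=True
    ).stdout.split()
    if not ids:
        return []
    output = subprocess.run(
        ["docker", "inspect", *ids], capture_output=True, text=True, check=True
    ).stdout
    return json.loads(output)

# Usage on the server itself:
# for name, mode in containers_with_custom_network(inspect_all_containers()):
#     print(f"{name}: {mode}")
```

An empty result would mean every container is on bridge, host, none, or sharing another container's network.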

Ingenioes · February 7, 2021

Hey guys,

so sadly I just had another lockup of my system, again same circumstances, no ping, no docker/vm reachable, no web ui, no ssh, no physical monitor/mouse/keyboard working ...

After the hard reboot I instantly checked the syslog, but there are no logs around that time? The crash was around ~ 11:38 and there is literally nothing in the syslog.
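One way to pin down where the logging actually stopped is to look for the longest silences between consecutive syslog lines. Below is a rough Python sketch of that idea. It assumes the usual `Jan 30 22:28:42` style timestamp at the start of each line and an English locale for month names; since syslog lines carry no year, the year has to be passed in. `parse_timestamp` and `largest_gaps` are hypothetical names for this example.

```python
from datetime import datetime

def parse_timestamp(line, year):
    """Parse a leading 'Jan 30 22:28:42' style timestamp; return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def largest_gaps(lines, year, top=3):
    """Return the `top` longest silences as (seconds, before, after) tuples."""
    stamps = [t for t in (parse_timestamp(line, year) for line in lines) if t]
    gaps = [((b - a).total_seconds(), a, b) for a, b in zip(stamps, stamps[1:])]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]

# Usage (file name is the attachment from this post, year assumed to be 2021):
# with open("syslog_7.2.21", errors="replace") as handle:
#     for seconds, before, after in largest_gaps(handle, 2021):
#         print(f"{seconds / 60:.0f} min of silence between {before} and {after}")
```

The "before" timestamp of the biggest gap is the last thing that was written before the freeze, so the lines just ahead of it are the ones worth reading.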

I have now opened the syslog on my hardware monitor again, so I can have a look at that for the next crash. Stupid me didn't do this after the last lockup.

Any ideas what might have caused it this time?

I attached a new syslog and diagnostics.

Thank you guys very much in advance.

syslog_7.2.21 vault-diagnostics-20210207-1200.zip

JorgeB · February 7, 2021

47 minutes ago, Ingenioes said:

The crash was around ~ 11:38 and there is literally nothing in the syslog.

This could point to a hardware problem. Another thing you can try is to boot the server in safe mode with all dockers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Ingenioes · February 7, 2021

Hey JorgeB, thank you for your input again. Will try your suggestion and keep you updated.

Before I go into safe mode I just fully disabled my guacamole docker; installing that was the last bigger change I made before the server started crashing, iirc.

Thank you again for the fast response
