
Server Crashes Frequently - Kernel Panic - Persistent across HW migration (Ryzen > Intel)



Good evening,

 

I've spent the last couple of months (or longer) trying to troubleshoot this issue on a previously stable system (~6 months uptime, only that short because I have been updating/upgrading some components). I am only a low-quality cosplayer of a sysadmin, so I apologise in advance for likely missing something obvious. I have performed a number of troubleshooting steps, the latest of which was changing Docker networking from macvlan to ipvlan, as I had seen some call trace errors in the logs; the system then crashed within 24 hours anyway. I have been able to resolve all my issues across three Unraid installs, but this one has eluded me. *disappointed Hercules* I have tried to look through the diagnostics, but between work and two young kids, my tired brain has not been able to make much sense of them.

 

I previously had an issue I thought was the source: CRC/read errors on two drives I was near certain were good (during a parity check, read errors suddenly appeared across many drives). I tried cable replacements etc.; it all stopped when I mounted an 80mm fan on the HBA card's (9300-16i) heatsink.

 

I only found the kernel panic error recently, as the system would hard lock and it lives under the house. I have a PiKVM (BLIKVM) which rarely got video output from the Ryzen system, but it does on the Intel one (see screenshot), where I can see the kernel panic error.

Hence the recent macvlan to ipvlan change; it had been stable on macvlan previously, and I have a UniFi Dream Machine Pro, so I didn't want to lose the traffic monitoring data.

I am reading up on the HBA card's firmware, in case it is an older version that might need upgrading.
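If it helps, I believe the current firmware can be listed from the console with Broadcom's sas3flash utility (assuming it is installed on the system):

sas3flash -listall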

 

Unraid version 6.12.6, but the issue has persisted across all of 6.12.x at least, and potentially later 6.11.x releases as well.

I do have quite a lot of plugins installed; I'm not sure if there's an easy way to generate a list to add here.
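In case it's useful, I believe the installed plugin packages can be listed straight off the flash drive (assuming the standard Unraid location for the .plg files):

ls -1 /boot/config/plugins/*.plg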

 

Troubleshooting activities latest to oldest:

Macvlan to ipvlan docker networking

USB replacement (16GB SanDisk micro to 8GB Kanguru Blue)

Whole system migration (Ryzen to Intel workstation - details below; I needed the Ryzen for a third kid's PC, and had read some users had stability issues with Ryzen)

Multiple memory replacements while on Ryzen (including some unregistered ECC, 2x16GB DDR4, 2x32GB DDR4)

Unraid version updates

Removing some plugins (e.g. System Fan)

A number of other steps.

 

 


 

Major hardware components

 

Current:

HP Z640 workstation (Intel platform, BIOS March 2023)

Intel® Xeon® CPU E5-2620 v4

32GB (2x16GB) registered ECC

USB migrated from the SanDisk micro to the Kanguru 8GB Blue (which I had hoped was the issue), in a USB 2.0 port (both systems)

 

 

4-port Intel PCIe NIC

Nvidia GTX 1050

HP PCIe NVMe M.2 (Sabrent Rocket 4TB)

LSI 9300-16i HBA (IT mode)

12x mixed HDDs: 8TB IronWolfs, 12TB WD Whites, and 14-20TB Exos X16-X20s

Plenty of fans

HDDs are running on a separate PSU (1200W Corsair)

The rest on HP's PSU

 

Previous:

Ryzen 3900X

MSI B550 Tomahawk

64GB (2x32GB) DDR4 3200

GTX 1050 Ti (instead of the 1050 above)

All else the same.

 

 

Apologies if I've missed some information, it's getting late (well, early I suppose) here.

 

I appreciate you reading through the wall of text.

 

Kind regards,

Andrew

 

Screenshot 2024-01-15 at 5.15.30 pm.png

darkstar-diagnostics-20240115-2345.zip

Link to comment

The previous syslog still shows a macvlan call trace:

 

Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

 

Note that if you changed to ipvlan but didn't reboot, it could still crash because of macvlan. Assuming the syslog server is still enabled, post new diags if it crashes again now.
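After the reboot you can also confirm which driver the custom network is actually using, for example (replace br0 with the name of your custom Docker network):

docker network inspect br0 --format '{{.Driver}}'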

Link to comment
54 minutes ago, JorgeB said:

The previous syslog still shows a macvlan call trace:

 

Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

 

Note that if you changed to ipvlan but didn't reboot, it could still crash because of macvlan. Assuming the syslog server is still enabled, post new diags if it crashes again now.

Ah, I did not realise I needed to reboot, as the Docker service was off when I made the change. Now that it has rebooted after the hard lock, I'll keep an eye on it.

Thank you!

Link to comment

The system locked up again this afternoon, unfortunately. Kernel panic visible on the KVM-over-IP console.

I can't seem to select eth0 or bond0 for the containers that previously had static IPs, so they fail to start. I suspect it's a setting I need to change in network settings or Docker settings, but I'll have to wait until later when Plex is no longer needed.

 

Would the Docker bridge setting for the containers be causing the issues? I do have 802.3ad bonding across 5 NICs (one port has a disconnected cable). Not sure if any of that is related.
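For reference, I was planning to check the bond from the console with something like this (assuming bond0 is the active bond):

cat /proc/net/bonding/bond0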

darkstar-diagnostics-20240116-2129.zip

Link to comment
Jan 16 16:03:25 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 16 16:03:25 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 16 16:03:25 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

 

There are still macvlan call traces; these should not happen with ipvlan. Post the output of:

 

docker network ls

 

Link to comment

Hi JorgeB,

 

Thanks for the fast reply!

 

root@Darkstar:~# docker network ls
NETWORK ID     NAME             DRIVER    SCOPE
a669bba618aa   br0              ipvlan    local
d13245982de2   bridge           bridge    local
9249e41856c0   host             host      local
8fc93cafebd7   none             null      local
ae863481cf62   proxynet         bridge    local
16e18470056d   pterodactyl_nw   bridge    local

 

Before I saw your message I noticed that 'enable host access to containers' was still enabled, and its extra info specified macvlan even though I had ipvlan selected. So I disabled that, started Docker again, then saw your message and ran the command. Happy to revert and run the command again though.

 

Containers with specified IPs are back, but they still show br0 rather than bond0 or eth0, so I'm not sure I've done it all correctly now.
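In the meantime I'll keep an eye on the syslog for any new macvlan traces, with something along these lines:

grep -i macvlan /var/log/syslog | tail -n 20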

Link to comment
  • 4 weeks later...

Hi all,

Assuming it's not a macvlan issue, and someone (i.e. me) has gone over countless existing threads concerning the same kernel panic issue, what would be the next step in troubleshooting?
 

My Docker setup doesn't seem to have any macvlan entries:
 

root@Server:~# docker network ls
NETWORK ID     NAME                    DRIVER    SCOPE
49a19b8efe85   br0                     ipvlan    local
fcf8b97dc48a   bridge                  bridge    local
8aef9fee9896   host                    host      local
d6284db37b39   none                    null      local
9dcd05e2c5a4   pihole_unraid_default   bridge    local
68560f9ac751   rrmedia                 bridge    local
cdd1d30a2ba1   wg0                     bridge    local

 

I am on the Unraid version shown in the attached screenshot.

I migrated the hardware last year, and it continues happening: roughly every 2-3 weeks (without an Unraid reboot), but lately it has happened 4 times in 3 days.


The thing I haven't changed in a while (1 year) is my USB stick, though I am not sure if this could be related.

It's a Samsung 32GB BAR Plus Titan Gray 200MB/s

 

I'd greatly appreciate some assistance!

Thanks in advance

Link to comment
2 hours ago, secretstorage said:

Since it's not boot related, would you still recommend mirroring the logs to flash, or will `appdata` storage suffice in this instance?

If it's only going to be for a short period, I would say mirror to flash, as that is the easiest and catches the most information. If it will be for any significant length of time, then use the remote syslog server field to log to a location you specify, to avoid wear on the flash drive.
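If you do mirror to flash, after a crash the mirrored log should be readable straight off the flash device, for example (the exact path can vary by version, but /boot/logs is where I'd expect it):

tail -n 100 /boot/logs/syslog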

Link to comment

syslog-10.10.10.99.logserver-diagnostics-20240213-1804.zip

Please see the diagnostics attached.

Second, I am including the logs from the last 24 hours, which may provide some clues or add further confusion. Something that has not happened before: for an unknown reason, my whole Docker service simply stopped yesterday and stayed disabled until I enabled it about an hour ago.

 

The logs are also flooded with Server kernel: device vethe9dd15e entered promiscuous mode

Please feel free to have a look and see if anything stands out as relevant.

 

Thank you in advance!!

Link to comment

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

 

1 hour ago, secretstorage said:

The logs are also flooded with Server kernel: device vethe9dd15e entered promiscuous mode

This can be normal, but if it keeps spamming the log there could be a container constantly restarting; you can confirm that by looking at the containers' uptime.
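For example, something like this should show each container's status:

docker ps --all --format 'table {{.Names}}\t{{.Status}}'

A container that is constantly restarting will show a very short "Up ..." time or a "Restarting" status.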

Link to comment
2 hours ago, JorgeB said:

if it still crashes it's likely a hardware problem

Thanks very much for looking into it so quickly!

 

1) You may have missed it in an earlier post: I've migrated ALL hardware (including the case 😉) bar the USB drive, and the issue continues happening as it did before.

2) I understand that may be a route for some, but since Unraid is hosting my home automation server it's not an option to bring it down for several days, as that would affect many systems. Unfortunately I don't have hardware to host it in parallel.

3) With many threads on this, and people having already corrected the macvlan issues, it's really difficult to understand how everyone could be having a hardware issue that produces the same outcome. I understand it's very difficult to chase down such intermittent problems, but I also feel there is a lot of an 'it works on my end' vibe going around.

 

I'll await another kernel panic crash and send in the logs, as it hasn't happened yet.

That may give us additional information.

Fingers crossed! 🤞

 

Link to comment
10 minutes ago, secretstorage said:

You may have missed it in an earlier post: I've migrated ALL hardware (including the case 😉) bar the USB drive, and the issue continues happening as it did before.

Although unlikely, it's not impossible that there is also a problem with the new hardware.

 

11 minutes ago, secretstorage said:

2) I understand that may be a route for some, but since Unraid is hosting my home automation server it's not an option to bring it down for several days, as that would affect many systems. Unfortunately I don't have hardware to host it in parallel.

That will complicate things, but it's still the best suggestion I have.

 

12 minutes ago, secretstorage said:

3) With many threads on this, and people having already corrected the macvlan issues, it's really difficult to understand how everyone could be having a hardware issue that produces the same outcome. I understand it's very difficult to chase down such intermittent problems, but I also feel there is a lot of an 'it works on my end' vibe going around.

There can be a lot of different issues causing similar symptoms, and most users don't have any issues at all. One other possibility that comes to mind is this one; see if it applies to you:

 

https://forums.unraid.net/bug-reports/stable-releases/crashes-since-updating-to-v611x-for-qbittorrent-and-deluge-users-r2153/

 

14 minutes ago, secretstorage said:

I'll await another kernel panic crash and send in the logs, as it hasn't happened yet.

That may give us additional information.

So this syslog was from before a crash? If yes, post a new one after a crash; it can help if there's something about the panic logged in the syslog.

 

Link to comment
  • 2 weeks later...
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: Looking for event-dma 0000000102666710 trb-start 0000000102666720 trb-end 0000000102666720 seg-start 0000000102666000 seg-end 0000000102666ff0

 

This is the last thing logged before the crash. It looks like a USB controller related issue; it's not good, but I'm not sure it would be a reason for the server to crash. Try again, and if the same thing happens it might be.
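To narrow down which controller and devices are involved, standard tools should help, for example:

lspci -s 03:00.0     # identify the USB controller at that PCI address
lsusb -t             # show which devices hang off each controller and port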

Link to comment
  • 1 month later...

My 3D printer is powering down and up maybe 2-3 times a day, and I am running an OctoPrint container, so this could be the cause.
Outside of that, nothing else is connecting/disconnecting.

 

I need to remove the GPU, as it has been giving me grief recently, and try running the server headless.
The issue with that is that I will not know what state the server is in when it locks up.

 

 

Link to comment
