
Server Crashes Frequently - Kernel Panic - Persistent across HW migration (Ryzen > Intel)



Good evening,

 

I've spent the last couple of months (or longer) trying to troubleshoot this issue on a previously stable system (~6 months uptime, only that short because I have been updating/upgrading some components). I am only a low-quality cosplayer of a sysadmin, so I apologise in advance for likely missing something obvious. I have performed a number of troubleshooting steps, the latest of which was changing Docker networking from macvlan to ipvlan, as I had seen some call trace errors in the logs; the system then crashed within 24 hours anyway. I have been able to resolve all my issues across three Unraid installs, but this one has eluded me. *disappointed Hercules* I have tried to look through the diagnostics, but between work and two young kids, my tired brain has not been able to make much sense of them.

 

I previously had an issue I thought was the source: CRC/read errors on two drives I was near certain were good (during a parity check, read errors suddenly appeared across many drives). I tried cable replacements etc.; it all stopped when I mounted an 80mm fan on the HBA card's (9300-16i) heatsink.

 

I only found the kernel panic error recently, as the system would hard lock and it lives under the house. I have a PiKVM (BLIKVM) which rarely got video output from the Ryzen system, but it does on the Intel one (see screenshot), where I can see the kernel panic error.

Hence the recent macvlan to ipvlan change; it had been stable on macvlan previously, and I have a UniFi Dream Machine Pro, so I didn't want to lose the traffic monitoring data.

I am reading up on the HBA card's firmware, in case it is an older version that might need upgrading.
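If it helps, I believe the current firmware can be listed from the console with Broadcom's sas3flash utility (assuming it is installed on the system):

sas3flash -listall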

 

Unraid version 6.12.6, but the issue has persisted across all of 6.12.x at least, and potentially later 6.11.x releases as well.

I do have quite a lot of plugins installed; I'm not sure if there's an easy way to generate a list to add here.
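In case it's useful, I believe the installed plugin packages can be listed straight off the flash drive (assuming the standard Unraid location for the .plg files):

ls -1 /boot/config/plugins/*.plg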

 

Troubleshooting activities latest to oldest:

Macvlan to ipvlan docker networking

USB replacement (16GB SanDisk micro to 8GB Kanguru Blue)

Whole system migration (Ryzen to Intel workstation - details below; I needed the Ryzen for a third kid's PC, and had read some users had stability issues with Ryzen)

Multiple memory replacements while on Ryzen (including some unregistered ECC, 2x16GB DDR4, 2x32GB DDR4)

Unraid version updates

Removing some plugins (e.g. System Fan)

A number of other steps.

 

 


 

Major hardware components

 

Current:

HP Z640 workstation (Intel platform, BIOS March 2023)

Intel® Xeon® CPU E5-2620 v4

32GB (2x16GB) registered ECC

USB migrated from the SanDisk micro to the Kanguru 8GB Blue (which I had hoped was the issue), in a USB 2.0 port (both systems)

 

 

4-port Intel PCIe NIC

Nvidia GTX 1050

HP PCIe NVMe M.2 (Sabrent Rocket 4TB)

LSI 9300-16i HBA (IT mode)

12x mixed HDDs: 8TB IronWolfs, 12TB WD Whites, and 14-20TB Exos X16-X20s

Plenty of fans

HDDs are running on a separate PSU (1200W Corsair)

The rest on HP's PSU

 

Previous:

Ryzen 3900X

MSI B550 Tomahawk

64GB (2x32GB) DDR4 3200

GTX 1050 Ti (instead of the 1050 above)

All else the same.

 

 

Apologies if I've missed some information, it's getting late (well, early I suppose) here.

 

I appreciate you reading through the wall of text.

 

Kind regards,

Andrew

 

Screenshot 2024-01-15 at 5.15.30 pm.png

darkstar-diagnostics-20240115-2345.zip

Link to comment

The previous syslog still shows a macvlan call trace:

 

Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

 

Note that if you changed to ipvlan but didn't reboot, it could still crash because of macvlan. Assuming the syslog server is still enabled, post new diags if it crashes again now.
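After the reboot you can also confirm which driver the custom network is actually using, for example (replace br0 with the name of your custom Docker network):

docker network inspect br0 --format '{{.Driver}}'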

Link to comment
54 minutes ago, JorgeB said:

The previous syslog still shows a macvlan call trace:

 

Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

 

Note that if you changed to ipvlan but didn't reboot, it could still crash because of macvlan. Assuming the syslog server is still enabled, post new diags if it crashes again now.

Ah, I did not realise I needed to reboot, as the Docker service was off when I made the change. Now that it has rebooted after the hard lock, I'll keep an eye on it.

Thank you!

Link to comment

The system locked up again this afternoon, unfortunately. Kernel panic visible on the KVM-over-IP console.

I can't seem to select eth0 or bond0 for the containers that previously had static IPs, so they fail to start. I suspect it's a setting I need to change in network settings or Docker settings, but I'll have to wait until later when Plex is no longer needed.

 

Would the Docker bridge setting for the containers be causing the issues? I do have 802.3ad bonding across 5 NICs (one port has a disconnected cable). Not sure if any of that is related.
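For reference, I was planning to check the bond from the console with something like this (assuming bond0 is the active bond):

cat /proc/net/bonding/bond0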

darkstar-diagnostics-20240116-2129.zip

Link to comment
Jan 16 16:03:25 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 16 16:03:25 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 16 16:03:25 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

 

There are still macvlan call traces; these should not happen with ipvlan. Post the output of:

 

docker network ls

 

Link to comment

Hi JorgeB,

 

Thanks for the fast reply!

 

root@Darkstar:~# docker network ls
NETWORK ID     NAME             DRIVER    SCOPE
a669bba618aa   br0              ipvlan    local
d13245982de2   bridge           bridge    local
9249e41856c0   host             host      local
8fc93cafebd7   none             null      local
ae863481cf62   proxynet         bridge    local
16e18470056d   pterodactyl_nw   bridge    local

 

Before I saw your message I noticed that 'enable host access to containers' was still enabled, and its extra info specified macvlan even though I had ipvlan selected. So I disabled that, started Docker again, then saw your message and ran the command. Happy to revert and run the command again though.

 

Containers with specified IPs are back, but they still show br0 rather than bond0 or eth0, so I'm not sure I've done it all correctly now.
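In the meantime I'll keep an eye on the syslog for any new macvlan traces, with something along these lines:

grep -i macvlan /var/log/syslog | tail -n 20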

Link to comment
  • 4 weeks later...

Hi all,

Assuming it's not a macvlan issue, and someone (i.e. me) has gone over countless existing threads concerning the same kernel panic issue, what would be the next step in troubleshooting?
 

My Docker setup doesn't seem to have any macvlan entries:
 

root@Server:~# docker network ls
NETWORK ID     NAME                    DRIVER    SCOPE
49a19b8efe85   br0                     ipvlan    local
fcf8b97dc48a   bridge                  bridge    local
8aef9fee9896   host                    host      local
d6284db37b39   none                    null      local
9dcd05e2c5a4   pihole_unraid_default   bridge    local
68560f9ac751   rrmedia                 bridge    local
cdd1d30a2ba1   wg0                     bridge    local

 

I am on the Unraid version shown in the attached screenshot.

I migrated the hardware last year, and it continues happening: roughly every 2-3 weeks (without an Unraid reboot), but lately it has happened 4 times in 3 days.


The thing I haven't changed in a while (1 year) is my USB stick, though I am not sure if this could be related.

It's a Samsung 32GB BAR Plus Titan Gray 200MB/s

 

I'd greatly appreciate some assistance!

Thanks in advance

Link to comment
2 hours ago, secretstorage said:

Since it's not boot related, would you still recommend mirroring the logs to flash, or will `appdata` storage suffice in this instance?

If it's only going to be for a short period, I would say mirror to flash, as that is the easiest and catches the most information. If it will be for any significant length of time, then use the remote syslog server field to log to a location you specify, to avoid wear on the flash drive.
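If you do mirror to flash, after a crash the mirrored log should be readable straight off the flash device, for example (the exact path can vary by version, but /boot/logs is where I'd expect it):

tail -n 100 /boot/logs/syslog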

Link to comment

syslog-10.10.10.99.logserver-diagnostics-20240213-1804.zip

Please see the diagnostics attached.

Second, I am including the logs from the last 24 hours, which may provide some clues or add further confusion. Something that has not happened before: for an unknown reason, my whole Docker service simply stopped yesterday and stayed disabled until I enabled it about an hour ago.

 

The logs are also flooded with Server kernel: device vethe9dd15e entered promiscuous mode

Please feel free to have a look and see if anything stands out as relevant.

 

Thank you in advance!!

Link to comment

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

 

1 hour ago, secretstorage said:

The logs are also flooded with Server kernel: device vethe9dd15e entered promiscuous mode

This can be normal, but if it keeps spamming the log there could be a container constantly restarting; you can confirm that by looking at the containers' uptime.
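For example, something like this should show each container's status:

docker ps --all --format 'table {{.Names}}\t{{.Status}}'

A container that is constantly restarting will show a very short "Up ..." time or a "Restarting" status.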

Link to comment
2 hours ago, JorgeB said:

if it still crashes it's likely a hardware problem

Thanks very much for looking into it so quickly!

 

1) You may have missed it in an earlier post: I've migrated ALL hardware (including the case 😉) bar the USB drive, and the issue continues happening as it did before.

2) I understand that may be a route for some, but since Unraid is hosting my home automation server it's not an option to bring it down for several days, as that would affect many systems. Unfortunately I don't have hardware to host it in parallel.

3) With many threads on this, and people having already corrected the macvlan issues, it's really difficult to understand how everyone could be having a hardware issue that produces the same outcome. I understand it's very difficult to chase down such intermittent problems, but I also feel there is a lot of an 'it works on my end' vibe going around.

 

I'll await another kernel panic crash and send in the logs, as it hasn't happened yet.

That may give us additional information.

Fingers crossed! 🤞

 

Link to comment
10 minutes ago, secretstorage said:

You may have missed it in an earlier post: I've migrated ALL hardware (including the case 😉) bar the USB drive, and the issue continues happening as it did before.

Although unlikely, it's not impossible that there is also a problem with the new hardware.

 

11 minutes ago, secretstorage said:

2) I understand that may be a route for some, but since Unraid is hosting my home automation server it's not an option to bring it down for several days, as that would affect many systems. Unfortunately I don't have hardware to host it in parallel.

That will complicate things, but it's still the best suggestion I have.

 

12 minutes ago, secretstorage said:

3) With many threads on this, and people having already corrected the macvlan issues, it's really difficult to understand how everyone could be having a hardware issue that produces the same outcome. I understand it's very difficult to chase down such intermittent problems, but I also feel there is a lot of an 'it works on my end' vibe going around.

There can be a lot of different issues causing similar symptoms, and most users don't have any issues at all. One other possibility that comes to mind is this one; see if it applies to you:

 

https://forums.unraid.net/bug-reports/stable-releases/crashes-since-updating-to-v611x-for-qbittorrent-and-deluge-users-r2153/

 

14 minutes ago, secretstorage said:

I'll await another kernel panic crash and send in the logs, as it hasn't happened yet.

That may give us additional information.

So this syslog was from before a crash? If yes, post a new one after a crash; it can help if there's something about the panic logged in the syslog.

 

Link to comment
  • 2 weeks later...
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: Looking for event-dma 0000000102666710 trb-start 0000000102666720 trb-end 0000000102666720 seg-start 0000000102666000 seg-end 0000000102666ff0

 

This is the last thing logged before the crash. It looks like a USB controller related issue; it's not good, but I'm not sure it would be a reason for the server to crash. Try again, and if the same thing happens it might be.
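To narrow down which controller and devices are involved, standard tools should help, for example:

lspci -s 03:00.0     # identify the USB controller at that PCI address
lsusb -t             # show which devices hang off each controller and port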

Link to comment
  • 1 month later...

My 3D printer is powering down and up maybe 2-3 times a day, and I am running an OctoPrint container, so this could be the cause.
Outside of that, nothing else is connecting/disconnecting.

 

I need to remove the GPU, as it has been giving me grief recently, and try running the server headless.
The issue with that is that I will not know what state the server is in when it locks up.

 

 

Link to comment
