Velicoma Posted January 15 (edited)

Good evening,

I've spent the last couple of months (or longer) troubleshooting this issue on a previously stable system (~6 months uptime, and only that short because I had been updating/upgrading some components). I am only a low-quality cosplayer of a sysadmin, so I apologise in advance for likely missing something obvious.

I have worked through a number of troubleshooting steps, the last of which was changing Docker networking from macvlan to ipvlan, as I had seen some call trace errors in the logs; the system still crashed within 24 hours. I have been able to resolve all my issues across three Unraid installs, but this one has eluded me. *disappointed hercules*

I have tried to look through the diagnostics, but between work and two young kids, my tired brain has not been able to make much sense of them.

Earlier I thought I had found the source: CRC read errors on two drives that I was near certain were good (during a parity check, read errors suddenly appeared across many drives). I replaced cables, etc. The errors all stopped when I mounted an 80mm fan on the heatsink of the HBA card (9300-16i).

I only found the kernel panic error recently, as the system hard locks and lives under the house. I have a PiKVM (BLIKVM) which would rarely get a video output from the Ryzen system, but does not have that problem on the Intel one (see screenshot), where I can see the kernel panic error. Hence the recent macvlan to ipvlan change; it had been stable on macvlan previously, and I have a Unifi Dream Machine Pro, so I didn't want to lose the traffic data monitoring.

I am reading up on the firmware of the HBA card, in case it is on an older version and needs upgrading.

Unraid version is 6.12.6, but the problem has persisted across all of 6.12.x at least, and potentially the later 6.11.x releases as well.

Troubleshooting activities, latest to oldest:
Macvlan to ipvlan Docker networking
USB replacement (16GB micro SanDisk to 8GB Kanguru Blue)
Whole system migration (Ryzen to Intel workstation, details below; I needed the Ryzen for a third kids' PC, and had read that some users had stability issues with Ryzen)
Multiple memory replacements while on Ryzen (including some unregistered ECC, 2x16GB DDR4, 2x32GB DDR4)
Unraid version updates
Removing some plugins (System Fan, for example)
A number of other steps

I do have quite a lot of plugins installed; not sure if there's an easy list I can create to add here.

Major hardware components

Current:
Intel Z640 (BIOS March 2023)
Intel® Xeon® CPU E5-2620 v4
32GB (2x16GB) ECC Registered
USB migrated from SanDisk micro to Kanguru 8GB Blue (which I had hoped was the issue), in a USB 2.0 port (both systems)
4-port Intel NIC (PCIe)
Nvidia GTX 1050
HP PCIe NVMe M.2 (Sabrent Rocket 4TB)
LSI 9300-16i HBA (IT mode)
12x mixed HDDs: 8TB IronWolfs, 12TB WD Whites, 14-20TB X16-20 Exos
Plenty of fans
HDDs are running on a separate PSU (1200W Corsair)
The rest on HP's PSU

Previous:
Ryzen 3900X
MSI B550 Tomahawk
64GB (2x32GB) DDR4 3200
GTX 1050 Ti (instead of the 1050 above)
All else the same.

Apologies if I've missed some information; it's getting late (well, early I suppose) here. Thanks for reading the wall of text.

Kind regards,
Andrew

darkstar-diagnostics-20240115-2345.zip
JorgeB Posted January 15

The previous syslog still shows a macvlan call trace:

Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

Note that if you changed to ipvlan but didn't reboot, it could still crash because of macvlan. Assuming the syslog server is still enabled, post new diags if it crashes again now.
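For anyone who wants to check their own syslog for the same symptom, here is a minimal sketch in Python. It assumes that frames belonging to the macvlan kernel module end in `[macvlan]`, as they do in the excerpt above; the function name `find_macvlan_frames` is made up for illustration and is not part of Unraid.

```python
import re

# A stack frame that belongs to the macvlan kernel module ends in "[macvlan]".
MACVLAN_FRAME = re.compile(r"kernel: .*\[macvlan\]\s*$")


def find_macvlan_frames(lines):
    """Return the syslog lines whose call-trace frame is in the macvlan module."""
    return [line.rstrip("\n") for line in lines if MACVLAN_FRAME.search(line)]


# Sample lines copied from the syslog excerpt quoted in this thread.
sample = [
    "Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]",
    "Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29",
    "Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]",
]

for hit in find_macvlan_frames(sample):
    print(hit)
```

To run it against a saved log, pass an open file instead of the sample list, for example `find_macvlan_frames(open("syslog.txt"))`, where the file name is whatever you saved the log as.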
Velicoma Posted January 15 (Author)

54 minutes ago, JorgeB said:
The previous syslog still shows a macvlan call trace:
Jan 11 13:55:37 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 11 13:55:37 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 11 13:55:37 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
Note that if you changed to ipvlan but didn't reboot, it could still crash because of macvlan. Assuming the syslog server is still enabled, post new diags if it crashes again now.

Ah, I did not realise I needed to reboot, as the Docker service was off when I made the change. Now that it has rebooted from the hard lock, I'll keep an eye on it. Thank you!
Velicoma Posted January 16 (Author)

Unfortunately, the system locked up again this afternoon. The KVM-over-IP console shows a kernel panic.

I can't seem to select eth0 or bond0 for the containers that previously had static IPs, so they fail to start. I suspect it's a setting I need to change in the network or Docker settings, but I'll have to wait until later, when Plex is no longer needed.

Would the Docker bridge setting for the containers be causing the issues? I do have 802.3ad bonding on 5 NICs (one port has its cable disconnected). Not sure if any of that is related.

darkstar-diagnostics-20240116-2129.zip
JorgeB Posted January 16

Jan 16 16:03:25 Darkstar kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Jan 16 16:03:25 Darkstar kernel: ? _raw_spin_unlock+0x14/0x29
Jan 16 16:03:25 Darkstar kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

There are still macvlan call traces. These should not happen with ipvlan. Post the output of:

docker network ls
Velicoma Posted January 16 (Author)

Hi JorgeB,

Thanks for the fast reply!

root@Darkstar:~# docker network ls
NETWORK ID     NAME             DRIVER    SCOPE
a669bba618aa   br0              ipvlan    local
d13245982de2   bridge           bridge    local
9249e41856c0   host             host      local
8fc93cafebd7   none             null      local
ae863481cf62   proxynet         bridge    local
16e18470056d   pterodactyl_nw   bridge    local

Before I saw your message, I noticed that "Host access to custom networks" style access to containers was still enabled (the "enable host access to containers" option), and its extra info specified macvlan even though I had ipvlan enabled. So I disabled that and started Docker again. I have since seen your message and run the command above. Happy to revert and run the command again, though.

Containers with specified IPs are back, but they still show br0 rather than bond0 or eth0, so I'm not sure I've done it all right now.
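The check being done by eye on that output can be sketched in Python as follows. This is only an illustration that parses the text printed by `docker network ls`, assuming the default four-column layout shown above; `network_drivers` and `macvlan_networks` are hypothetical helper names, not Docker or Unraid functions.

```python
def network_drivers(ls_output):
    """Map network name -> driver from the text printed by `docker network ls`."""
    drivers = {}
    rows = ls_output.strip().splitlines()
    for row in rows[1:]:  # skip the "NETWORK ID  NAME  DRIVER  SCOPE" header
        fields = row.split()
        if len(fields) == 4:  # network id, name, driver, scope
            _network_id, name, driver, _scope = fields
            drivers[name] = driver
    return drivers


def macvlan_networks(ls_output):
    """Return the names of networks that still use the macvlan driver."""
    return sorted(n for n, d in network_drivers(ls_output).items() if d == "macvlan")


# Output copied from the post above.
sample_output = """\
NETWORK ID     NAME             DRIVER    SCOPE
a669bba618aa   br0              ipvlan    local
d13245982de2   bridge           bridge    local
9249e41856c0   host             host      local
8fc93cafebd7   none             null      local
ae863481cf62   proxynet         bridge    local
16e18470056d   pterodactyl_nw   bridge    local
"""

print(macvlan_networks(sample_output))
```

An empty list means no Docker network is using the macvlan driver. It does not by itself prove the macvlan kernel module is unused, which is why the syslog is still worth checking after a reboot.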
JorgeB Posted January 16

No macvlan now. I suggest rebooting one more time to see if it no longer crashes.
Velicoma Posted January 17 (Author)

Great, I gave it a reboot last night and will keep an eye on it. Thanks for your help again, Jorge! Much appreciated.
secretstorage Posted February 12

Hi all,

Assuming it's not a macvlan issue, and someone (i.e. me) has gone over countless existing threads concerning the same kernel panic issue, what would be the next step in troubleshooting?

My Docker doesn't seem to have any macvlan entries:

root@Server:~# docker network ls
NETWORK ID     NAME                    DRIVER    SCOPE
49a19b8efe85   br0                     ipvlan    local
fcf8b97dc48a   bridge                  bridge    local
8aef9fee9896   host                    host      local
d6284db37b39   none                    null      local
9dcd05e2c5a4   pihole_unraid_default   bridge    local
68560f9ac751   rrmedia                 bridge    local
cdd1d30a2ba1   wg0                     bridge    local

I am on:

I migrated hardware last year, and it continues to happen. It usually occurs every 2–3 weeks (without an Unraid reboot), but lately it happened 4 times in 3 days.

The one thing I haven't changed in a while (1 year) is my USB stick, though I am not sure whether this could be related. It's a Samsung 32GB BAR Plus Titan Gray 200MB/s.

I'd greatly appreciate some assistance! Thanks in advance.
JorgeB Posted February 12

5 minutes ago, secretstorage said:
Assuming it's not a macvlan issue

Most likely not. Enable the syslog server and post the log after a crash.
secretstorage Posted February 12

Thank you for the prompt reply. Will do!!
secretstorage Posted February 12

One additional question: since it's not boot related, would you still recommend mirroring the logs to flash, or will `appdata` storage suffice in this instance?
JorgeB Posted February 12

Appdata should be fine, but you need to enter the main Unraid server IP in the remote syslog server field, or nothing will be logged.
itimpi Posted February 12

2 hours ago, secretstorage said:
Since it's not boot related, would you still recommend mirroring the logs to flash, or will `appdata` storage suffice in this instance?

If it's only going to be for a short period, I would say mirror to flash, as that is easiest and catches the most information. If it will be for any significant length of time, then use the remote syslog server field to log to the location you specify, to avoid wear on the flash drive.
trurl Posted February 12

You should go ahead and post diagnostics now, since we don't have them yet in this thread.
secretstorage Posted February 13

syslog-10.10.10.99.log
server-diagnostics-20240213-1804.zip

Please see the diagnostics attached. I am also including the logs from the last 24 hours, which may provide some clues or further confusion.

Something that has not happened before: for an unknown reason, the whole Docker service simply stopped yesterday and stayed disabled until I enabled it about an hour ago.

The logs are also flooded with:

Server kernel: device vethe9dd15e entered promiscuous mode

Please feel free to have a look and let me know if you see anything relevant. Thank you in advance!!
JorgeB Posted February 13

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

1 hour ago, secretstorage said:
The logs are also flooded with
Server kernel: device vethe9dd15e entered promiscuous mode

This can be normal, but if it keeps spamming the log there could be a container constantly restarting. You can confirm that by looking at the containers' uptime.
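Besides checking container uptime, the syslog itself can give a rough idea of how often containers are being started. The Python sketch below counts "entered promiscuous mode" messages per hour, on the assumption that each container start on a bridge network brings up a fresh veth interface. The timestamps and all veth names except `vethe9dd15e` in the sample are invented placeholders, and `promisc_events_per_hour` is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures "Mon DD HH" (the hour bucket) and the veth interface name.
PROMISC = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}):\d{2}:\d{2}\s+\S+\s+"
    r"kernel: device (veth\w+) entered promiscuous mode"
)


def promisc_events_per_hour(lines):
    """Count veth 'entered promiscuous mode' messages per hour of the log."""
    per_hour = Counter()
    for line in lines:
        match = PROMISC.match(line)
        if match:
            per_hour[match.group(1)] += 1
    return per_hour


# Placeholder lines: only the message text and vethe9dd15e come from the thread.
sample = [
    "Feb 13 17:01:10 Server kernel: device vethe9dd15e entered promiscuous mode",
    "Feb 13 17:31:42 Server kernel: device veth0000001 entered promiscuous mode",
    "Feb 13 18:02:05 Server kernel: device veth0000002 entered promiscuous mode",
]

for hour, count in sorted(promisc_events_per_hour(sample).items()):
    print(hour, count)
```

A steady count every hour, around the clock, would be consistent with a container stuck in a restart loop; a burst only around a reboot or an update would be normal.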
secretstorage Posted February 13

2 hours ago, JorgeB said:
if it still crashes it's likely a hardware problem

Thanks very much for looking into it so quickly!

1) You may have missed it in an earlier post: I've migrated ALL hardware (including the case 😉) bar the USB drive, and the issue continues to happen as it did before.

2) I understand it may be a route for some, but since Unraid is hosting my home automation server, bringing it down for several days is not an option, as it would affect many systems. Unfortunately I don't have hardware to host it in parallel.

3) With so many threads on this, and with people having already corrected the macvlan issues, it's really difficult to understand how everyone could be having a hardware issue that produces the same outcome. I understand it's very difficult to chase down such intermittent problems, but I also feel there are a lot of "it works on my end" vibes going around.

I'll await another kernel panic crash and send in the logs, as it hasn't happened yet. This may give us additional information. Fingers crossed! 🤞
JorgeB Posted February 13

10 minutes ago, secretstorage said:
You may have missed it in an earlier post: I've migrated ALL hardware (including the case 😉) bar the USB drive, and the issue continues to happen as it did before.

Although unlikely, it's not impossible that there is also a problem with the new hardware.

11 minutes ago, secretstorage said:
2) I understand it may be a route for some, but since Unraid is hosting my home automation server, bringing it down for several days is not an option, as it would affect many systems. Unfortunately I don't have hardware to host it in parallel.

That will complicate things, but it's still the best suggestion I have.

12 minutes ago, secretstorage said:
3) With so many threads on this, and with people having already corrected the macvlan issues, it's really difficult to understand how everyone could be having a hardware issue that produces the same outcome. I understand it's very difficult to chase down such intermittent problems, but I also feel there are a lot of "it works on my end" vibes going around.

There can be a lot of different issues causing similar symptoms, and most users don't have any issues. One other possibility that comes to mind is this one; see if it applies to you:
https://forums.unraid.net/bug-reports/stable-releases/crashes-since-updating-to-v611x-for-qbittorrent-and-deluge-users-r2153/

14 minutes ago, secretstorage said:
I'll await another kernel panic crash and send in the logs, as it hasn't happened yet. This may give us additional information.

So this syslog was from before a crash? If so, post a new one after a crash; it can help if something about the panic gets logged in the syslog.
secretstorage Posted February 28

The crash occurred today around 22:20 CET, and I recovered the server around 00:20 CET, so roughly 2 hours later. Please see the log attached.

syslog-10.10.10.99.log
JorgeB Posted February 28

Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: Looking for event-dma 0000000102666710 trb-start 0000000102666720 trb-end 0000000102666720 seg-start 0000000102666000 seg-end 0000000102666ff0

This is the last thing logged before the crash. It looks like a USB controller related issue. That's not good, but I'm not sure it would be enough to crash the server. Try again; if the same thing is logged before the next crash, it might be the cause.
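Finding "the last thing logged before the crash" in a long syslog can be sketched in Python by looking for a large jump between consecutive timestamps. This is a rough heuristic, not an Unraid feature: it assumes the classic `Mon DD HH:MM:SS` syslog prefix (which carries no year, so one has to be supplied), a quiet server can also produce long gaps, and `find_gaps` is a made-up name. The second sample line is a placeholder standing in for the first line after the reboot.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if it is absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, min_gap=timedelta(minutes=30)):
    """Return (last line before, first line after) for every silent period."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines without a timestamp prefix
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_line.rstrip("\n"), line.rstrip("\n")))
        prev_time, prev_line = stamp, line
    return gaps


sample = [
    # Copied from the excerpt above.
    "Feb 27 22:13:28 Server kernel: xhci_hcd 0000:03:00.0: WARN Successful completion on short TX",
    # Placeholder for whatever is logged first after the reboot.
    "Feb 28 00:20:05 Server kernel: (first line after reboot, placeholder)",
]

for before, after in find_gaps(sample, year=2024):
    print("last before gap:", before)
    print("first after gap:", after)
```

The lines just before each reported gap are the ones worth posting alongside the diagnostics.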
secretstorage Posted February 28

OK, I'll continue logging and will come back to this thread after another crash. It typically happens every 10-14 days. Thanks for looking into it!
secretstorage Posted April 6

It happened 3 times in the last two weeks, with the latest one about 20 minutes ago. I can look up the other two, which took place while I was on leave. Thanks for looking into it.

syslog-10.10.10.99.log
JorgeB Posted April 6

There are lots of GPU related errors and also a lot of USB disconnects. Are you disconnecting/reconnecting any USB devices?
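To see which USB port the disconnects come from, the syslog can be tallied with a short Python sketch like the one below. It assumes the usual kernel message wording, `usb <port>: USB disconnect, device number <n>`; the sample lines, including their timestamps and port numbers, are invented placeholders rather than lines from the attached log, and `disconnects_per_port` is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the port path (for example "1-4") from a kernel disconnect message.
USB_DISCONNECT = re.compile(r"kernel: usb (\S+): USB disconnect, device number \d+")


def disconnects_per_port(lines):
    """Count USB disconnect messages per port path."""
    ports = Counter()
    for line in lines:
        match = USB_DISCONNECT.search(line)
        if match:
            ports[match.group(1)] += 1
    return ports


# Placeholder lines for illustration only.
sample = [
    "Apr  6 09:12:01 Server kernel: usb 1-4: USB disconnect, device number 7",
    "Apr  6 13:40:18 Server kernel: usb 1-4: USB disconnect, device number 8",
    "Apr  6 14:05:33 Server kernel: usb 3-2: USB disconnect, device number 3",
]

for port, count in disconnects_per_port(sample).most_common():
    print(port, count)
```

If nearly all disconnects land on one port, and a device on that port is known to power-cycle, those entries are probably expected and the remaining ones deserve more attention.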
secretstorage Posted April 6

My 3D printer powers down and up maybe 2-3 times a day, and I am running an OctoPrint container, so this could be one source. Outside of that, nothing else is connecting/disconnecting.

I need to remove the GPU, as it's been giving me grief recently, and try to run the server headless. The issue with that is that I will not know what state the server is in when it locks up.