
6.10.3 Server crashing and trying to pinpoint the problem.


cwrivers


Hey all,

I have been experiencing crashes every 30 days, almost exactly down to the hour. The only thing running on a monthly schedule was CA Backup/Restore, but that runs on the 2nd of each month and my crashes usually occur around the 28th-31st. About 17 days ago I updated from 6.9.2 to 6.10.3 in the hope that switching the Docker network type from macvlan to ipvlan would change something. From the research I could find and what was showing in the syslogs, this was my best path forward, since most of my kernel panics pointed to macvlan issues. In the process I also cleaned up old and unused dockers and updated all my dockers and plugins, and for the most part everything has been running smoothly. Until today....
I was transferring some files from my cache drive to my main PC over a direct 10Gb connection between the two. About halfway through, the transfer stopped and all WebGUI connections became unresponsive. I checked the display connected to the server and was able to navigate somewhat, so I pulled up the syslog and found a mess of messages being written faster than I could read them. Luckily I had enabled "Write syslogs to flash" and it captured all of this, but there is so much there that I need help deciphering it.

 

Hopefully this is something that can be diagnosed before I pull the plug and start over from scratch. I have attached the diagnostics.zip from this crash. Looking through the logs, it appears that the issue started around 15:12:00 today.

 

Thanks in advance!

thevault-diagnostics-20220918-1518.zip

Link to comment

There are PCIe errors affecting multiple devices

Sep 18 15:12:49 TheVault kernel: xhci_hcd 0000:01:00.0: AER: can't recover (no error_detected callback)
Sep 18 15:12:49 TheVault kernel: ahci 0000:01:00.1: AER: can't recover (no error_detected callback)
Sep 18 15:12:49 TheVault kernel: pci 0000:03:00.0: AER: can't recover (no error_detected callback)
Sep 18 15:12:49 TheVault kernel: igb 0000:05:00.0 eth0: PCIe link lost

 

Then

Sep 18 15:13:30 TheVault kernel: ahci 0000:01:00.1: AHCI controller unavailable!

 

Look for a BIOS update. You can also try updating to v6.11.0-rc5; a newer kernel might help.
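
For anyone facing a similarly noisy syslog, here is a minimal Python sketch of how the PCIe-related lines quoted above could be pulled out and grouped by device address. It is only an illustration, not an Unraid tool: the function name `pcie_errors_by_device`, the keyword list, and the regular expression are assumptions based solely on the line format shown in this thread, and other kernels or drivers may log these events differently.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpts in this thread:
#   "<Mon> <day> <hh:mm:ss> <host> kernel: <driver> <domain:bus:dev.fn>[ <ifname>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    r"(?:\s+\S+)?:\s+(?P<msg>.*)$"
)

# Hypothetical keyword list: only the messages quoted in this thread.
KEYWORDS = ("AER:", "PCIe link lost", "controller unavailable")


def pcie_errors_by_device(lines):
    """Group matching kernel messages by (PCI address, driver)."""
    by_device = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and any(k in m["msg"] for k in KEYWORDS):
            by_device[(m["bdf"], m["driver"])].append((m["ts"], m["msg"]))
    return dict(by_device)


# Sample input: the five lines quoted in the post above.
SAMPLE = [
    "Sep 18 15:12:49 TheVault kernel: xhci_hcd 0000:01:00.0: AER: can't recover (no error_detected callback)",
    "Sep 18 15:12:49 TheVault kernel: ahci 0000:01:00.1: AER: can't recover (no error_detected callback)",
    "Sep 18 15:12:49 TheVault kernel: pci 0000:03:00.0: AER: can't recover (no error_detected callback)",
    "Sep 18 15:12:49 TheVault kernel: igb 0000:05:00.0 eth0: PCIe link lost",
    "Sep 18 15:13:30 TheVault kernel: ahci 0000:01:00.1: AHCI controller unavailable!",
]

for (bdf, driver), events in sorted(pcie_errors_by_device(SAMPLE).items()):
    print(f"{bdf} ({driver}): {len(events)} event(s), first at {events[0][0]}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `pcie_errors_by_device(open(path))`, where `path` is wherever the syslog was saved. Grouping by address makes it easier to see that `0000:01:00.0` and `0000:01:00.1` are two functions of the same PCIe device, and that several unrelated devices reported errors within the same second.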

 

Link to comment
5 hours ago, JorgeB said:

There are PCIe errors affecting multiple devices

Sep 18 15:12:49 TheVault kernel: xhci_hcd 0000:01:00.0: AER: can't recover (no error_detected callback)
Sep 18 15:12:49 TheVault kernel: ahci 0000:01:00.1: AER: can't recover (no error_detected callback)
Sep 18 15:12:49 TheVault kernel: pci 0000:03:00.0: AER: can't recover (no error_detected callback)
Sep 18 15:12:49 TheVault kernel: igb 0000:05:00.0 eth0: PCIe link lost

 

Then

Sep 18 15:13:30 TheVault kernel: ahci 0000:01:00.1: AHCI controller unavailable!

 

Look for a BIOS update. You can also try updating to v6.11.0-rc5; a newer kernel might help.

 

 

I'm on the second-to-last BIOS version; I'll run an update today.

As far as the Unraid version goes, is there anything else that I should try before upgrading again?

Link to comment
