
Frequent Server Crashes and Segfault Errors



Hey everyone,

 

I'm dealing with frequent server crashes and having a tough time figuring out what's causing them. When the server crashes, I'm left with no choice but to hard reboot it. Sometimes, the unRaid GUI remains responsive, but basic functions like stopping dockers or rebooting the server don't work or get stuck.

 

I've been going through the server's syslogs and noticed some common errors, mainly segfault errors, whenever the crashes happen. After looking around this forum, I found that many people suggest running a memtest on the server's RAM when encountering these errors. I've already done this, and it didn't report any issues. I even tried replacing the RAM with a new set, but the problem persists.
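For anyone going through syslogs the same way, here is a minimal Python sketch that tallies segfault lines by process and by the binary or library they occurred in. It assumes the usual x86 kernel message layout (`name[pid]: segfault at ... ip ... sp ... error N in object[...]`). The sample lines and the `examplehost` name are made up for illustration and are not taken from the attached logs. As a rough rule of thumb (not a diagnosis), faults spread across many unrelated processes and libraries fit hardware trouble better than faults that always hit the same binary.

```python
import re
from collections import Counter

# Assumed layout of an x86 kernel segfault message in syslog.
# Process names may contain spaces, so the name is matched lazily up to "[pid]".
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>.+?)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<obj>[^\[\s]+))?"
)

def tally_segfaults(lines):
    """Count segfault messages per process name and per faulting object."""
    by_proc = Counter()
    by_obj = Counter()
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if not m:
            continue  # not a segfault line
        by_proc[m["proc"]] += 1
        by_obj[m["obj"] or "unknown"] += 1
    return by_proc, by_obj

# Hypothetical sample lines, for illustration only.
SAMPLE_LINES = [
    "Apr 01 00:00:01 examplehost kernel: python3[1234]: segfault at 0 ip 00007f0000001000 sp 00007ffc00000000 error 4 in libc.so.6[7f0000000000+1000]",
    "Apr 01 00:05:10 examplehost kernel: python3[1250]: segfault at 8 ip 00007f0000001004 sp 00007ffc00000020 error 4 in libc.so.6[7f0000000000+1000]",
    "Apr 01 00:07:30 examplehost kernel: Plex Media Serv[4321]: segfault at 10 ip 000055d000002000 sp 00007ffd00000010 error 6 in libexample.so[55d000000000+2000]",
    "Apr 01 00:08:00 examplehost kernel: eth0: link up",
]

if __name__ == "__main__":
    by_proc, by_obj = tally_segfaults(SAMPLE_LINES)
    for name, count in by_proc.most_common():
        print(f"{count:4d}  {name}")
    for obj, count in by_obj.most_common():
        print(f"{count:4d}  {obj}")
```

To run it against a real log, pass an open file instead of the sample list, e.g. `tally_segfaults(open("syslog.txt", errors="replace"))`.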

 

So far, I've tried:

1. Running a memtest on the previous RAM, which showed no errors.
2. Buying new RAM, but it didn't solve the issue.
3. Removing all unRaid plugins except the essential ones (like CA Plugins).
4. Disabling all Docker containers except for essential ones like Plex.

 

Despite these steps, I'm still struggling to identify the root cause of the problem. I'm not sure if it's related to a specific docker/plugin, an unraid function, or failing hardware. Any advice or insight into this issue would be greatly appreciated.

 

I've attached the server's diagnostics and a couple of recent syslogs showing the common errors I'm encountering.

Thanks in advance for any help with this issue.

 

tower-diagnostics-20240421-0738.zip syslog.txt syslog2.txt

syslog-previous.txt

Edited by Mistershiverz
Link to comment

About a year and a half ago I had something similar happen to my machine, but I also had a USB failure at the same time. Once the USB failure was resolved, I still had crashes like yours. I had some spare HDDs, so I disconnected the array disks from the machine and installed various Linux-based OSes and Windows on the spares. All were very, very unstable. The failure was always tied to the same 2 CPU cores (I can't remember which ones, but it was always one or the other).

I tested with 2 different RAM sets (although MemTest always passed on both sets), 2 different motherboards with different BIOS versions, different HDDs, even a different PSU. The only thing that made those unstable OSes stable was a different CPU that I borrowed from a local PC repair shop. After some back and forth with Intel support, the conclusion was that my CPU was in a pre-fail condition, and it was replaced under warranty.

Your mileage may vary.

Link to comment

@JorgeB @Lavoslav Thanks for the feedback. I'm beginning to share the same suspicion, but I want to understand these errors thoroughly before investing in new hardware. Is it safe to conclude that segfault errors are typically indicative of failing hardware?

 

I've noticed that when I decrease the number of dockers running, the crashes happen less frequently. However, could this simply be because there's less strain on the hardware?

 

Link to comment

I had a failing USB drive that caused the system to keep crashing and have trouble booting, although one of the staff told me from the diagnostics that it was a USB port failure, which is unheard of.

I changed the boot USB drive to a new one and everything has worked nicely since then. You could be having a similar issue if all your diagnostics passed.

Link to comment
2 hours ago, Mistershiverz said:

I've noticed that when I decrease the number of dockers running, the crashes happen less frequently. However, could this simply be because there's less strain on the hardware?

 

If it's the same thing that I had with my CPU, that's exactly it. It was not even a matter of IF the machine would crash but how long it would take to crash. With my CPU I could not even install any OS unless I disabled most of the cores, and even then it was unstable and any OS would eventually crash simply by virtue of the machine being turned on. Idle, it would take multiple hours; surfing, a few hours at most; anything more intensive would crash it somewhere between instantly and a handful of minutes...

Just keep in mind I'm not saying it's 100% the CPU's fault on your end, but the symptoms look eerily familiar.

FYI, my dud was an i9-13900K and it was working like a charm for 4 months before the problems started.

Edited by Lavoslav
Link to comment
Apr 21 03:00:21 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
Apr 21 03:00:21 Tower kernel: RIP: 0010:uncharge_folio+0xe/0xd8
Apr 21 03:00:21 Tower kernel: __mem_cgroup_uncharge_list+0x49/0x84
Apr 21 03:00:21 Tower kernel: release_pages+0x2de/0x30e
Apr 21 03:00:21 Tower kernel: truncate_inode_pages_range+0x15a/0x382

 

These lines in particular from your logs are interesting, as they all pertain to memory management.

The kernel hit a memory access fault (a NULL pointer dereference), and the messages that follow all relate back to physical memory (only some relate back to swap).
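To make that easier to see, here is a small Python sketch that pulls the `BUG:` message and the call-trace frames out of the excerpt quoted above. Each frame has the form `symbol+0xoffset/0xsize`, i.e. the offset into the function and the function's total size. The function and regex names (`summarize_oops`, `FRAME_RE`, `BUG_RE`) are my own and purely illustrative. In the mainline kernel source, `uncharge_folio` and `__mem_cgroup_uncharge_list` belong to the memory cgroup accounting code, while `release_pages` and `truncate_inode_pages_range` handle page cache pages, which is why the trace reads as memory management from top to bottom.

```python
import re

# The five log lines quoted in this post, copied verbatim.
EXCERPT = """\
Apr 21 03:00:21 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
Apr 21 03:00:21 Tower kernel: RIP: 0010:uncharge_folio+0xe/0xd8
Apr 21 03:00:21 Tower kernel: __mem_cgroup_uncharge_list+0x49/0x84
Apr 21 03:00:21 Tower kernel: release_pages+0x2de/0x30e
Apr 21 03:00:21 Tower kernel: truncate_inode_pages_range+0x15a/0x382
"""

BUG_RE = re.compile(r"kernel: BUG: (?P<msg>.+)$")
# Matches "RIP: 0010:sym+0xoff/0xsize" as well as plain (or "? "-prefixed) trace frames.
FRAME_RE = re.compile(
    r"kernel:\s*(?:RIP: [0-9a-f]{4}:)?\s*\??\s*"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

def summarize_oops(text):
    """Return (bug_message, [(symbol, offset, size), ...]) from a syslog excerpt."""
    bug = None
    frames = []
    for line in text.splitlines():
        m = BUG_RE.search(line)
        if m and bug is None:
            bug = m["msg"]
            continue
        m = FRAME_RE.search(line)
        if m:
            frames.append((m["sym"], int(m["off"], 16), int(m["size"], 16)))
    return bug, frames

if __name__ == "__main__":
    bug, frames = summarize_oops(EXCERPT)
    print("BUG:", bug)
    for sym, off, size in frames:
        print(f"  {sym} (offset {off:#x} of {size:#x} bytes)")
```

Running the same function over several crashes and comparing the symbol lists is a quick way to check whether the faults keep landing in the same code path or move around.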

 

Also you updated the BIOS recently as your revision is from March 2024

Apr 21 03:00:21 Tower kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3401 03/19/2024

 

The release notes for that BIOS mention memory-related changes:

PRO WS W680-ACE BIOS 3401     Version 3401     12.5 MB     2024/03/22
1. Improved DDR5 compatibility
2. Further optimized CEP settings when disabled

 

I see in your logs you're running them at 4800 MHz; did you run them with XMP disabled (which IIRC is 3200 MHz)?
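If you want to check this from the running system rather than the BIOS screen, a rough Python sketch like the one below compares each DIMM's rated speed with the speed it is actually configured at. It assumes `dmidecode -t memory` is available and prints `Speed:` and `Configured Memory Speed:` fields for each `Memory Device` block (older dmidecode versions label the second field differently). The sample text is made up for illustration and is not from your diagnostics, and `memory_speeds` is a hypothetical helper name.

```python
import re

def memory_speeds(dmidecode_text):
    """Return [(rated_speed, configured_speed), ...] for populated DIMM slots."""
    results = []
    # Each DIMM slot is reported under its own "Memory Device" heading.
    blocks = re.split(r"^Memory Device[ \t]*$", dmidecode_text, flags=re.M)[1:]
    for block in blocks:
        rated = re.search(r"^[ \t]*Speed: (.+)$", block, re.M)
        configured = re.search(r"^[ \t]*Configured Memory Speed: (.+)$", block, re.M)
        if not rated or rated[1].strip() == "Unknown":
            continue  # empty slot or no speed reported
        results.append(
            (rated[1].strip(), configured[1].strip() if configured else None)
        )
    return results

# Hypothetical sample in the assumed dmidecode layout, for illustration only.
SAMPLE = """\
Memory Device
\tSize: 32 GB
\tType: DDR5
\tSpeed: 4800 MT/s
\tConfigured Memory Speed: 4800 MT/s

Memory Device
\tSize: No Module Installed
\tType: Unknown
\tSpeed: Unknown
\tConfigured Memory Speed: Unknown
"""

if __name__ == "__main__":
    # On a live system (as root) the text could instead come from:
    #   subprocess.run(["dmidecode", "-t", "memory"], capture_output=True, text=True).stdout
    for rated, configured in memory_speeds(SAMPLE):
        print(f"rated {rated}, configured {configured}")
```

If the configured speed is higher than what you expect with XMP off, that would be worth retesting at the lower setting.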

Check that the RAM slots are free from any debris or dust as well.

 

You could also try rolling back the BIOS revision; otherwise it could be the CPU on its way out, with the IMC (Integrated Memory Controller) faulting.

I had an issue with one of my Ryzen 9s... it was always a bit buggy and the other day it just completely died (thankfully replaced under warranty).

Link to comment

@Jarsky appreciate the feedback.
 

Quote

Also you updated the BIOS recently as your revision is from March 2024

Yes, as part of the troubleshooting, and after running the memtest, I did update the board's BIOS in the hope that it would resolve some of the errors.

 

Quote

I see in your logs you're running them at 4800 MHz; did you run them with XMP disabled (which IIRC is 3200 MHz)?

No, this is worth a look, but when I was in this board's BIOS I could not see any mention of the RAM using an XMP profile.

Hmm, OK, I may have to start looking at hardware replacements, as all the feedback points to failing hardware.

Link to comment
  • 2 weeks later...

Don't forget to check the PSU cables for the GPU in case they are bent too hard. I also had crashes because I overdid the cable management on my PC. The PSU in use was a Corsair AX1000; I have 2, and both of them crashed when the GPU cable was slightly bent. After replacing them with another PSU that has braided, non-flat GPU cables, my PC never crashed again.

 

So basically, if the crash happens under heavier loads, check your PSU.

Link to comment

Hi guys, this is my first post on this forum. I just wanted to say I'm having the same errors: segfaults and hard crashes. I installed unRAID today with a used CPU and random crashes keep happening. I'm running a memtest right now; let's hope it's a faulty stick of RAM, as I don't want to toss this CPU... :(

Link to comment
