Frequent Server Crashes and Segfault Errors

April 20, 2024

Hey everyone,

I'm dealing with frequent server crashes and having a tough time figuring out what's causing them. When the server crashes, I'm left with no choice but to hard reboot it. Sometimes, the unRaid GUI remains responsive, but basic functions like stopping dockers or rebooting the server don't work or get stuck.

I've been going through the server's syslogs and noticed some common errors, mainly segfault errors, whenever the crashes happen. After looking around this forum, I found that many people suggest running a memtest on the server's RAM when encountering these errors. I've already done this, and it didn't report any issues. I even tried replacing the RAM with new ones, but the problem persists.

So far, I've tried:

1. Running a memtest on the previous RAM, which showed no errors.
2. Buying new RAM, but it didn't solve the issue.
3. Removing all unRaid plugins except the essential ones (like CA Plugins).
4. Disabling all Docker containers except for essential ones like Plex.

Despite these steps, I'm still struggling to identify the root cause of the problem. I'm not sure if it's related to a specific docker/plugin, an unraid function, or failing hardware. Any advice or insight into this issue would be greatly appreciated.

I've attached the server's diagnostics and a couple of recent syslogs showing the common errors I'm encountering.

Thanks in advance for any help with this issue.

tower-diagnostics-20240421-0738.zip syslog.txt syslog2.txt

syslog-previous.txt

Edited April 20, 2024 by Mistershiverz

April 21, 2024

Community Expert

Multiple apps segfaulting like that suggests to me that there's still a hardware issue. If it's not the RAM, it could be a board or CPU issue.
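A quick way to see how widespread the segfaults are is to count them per process in the syslog. Below is a minimal Python sketch, not an Unraid tool: the helper name `count_segfaults` and the sample lines are made up for illustration, and the pattern assumes the usual kernel format `name[pid]: segfault at ...`. If many unrelated programs show up, that points more towards hardware than towards one bad container.

```python
import re
from collections import Counter

# Kernel segfault lines normally look like:
#   "<date> <host> kernel: name[pid]: segfault at <addr> ip <ip> sp <sp> error <n> ..."
SEGFAULT_RE = re.compile(r"kernel: (?P<proc>.+?)\[(?P<pid>\d+)\]: segfault at ")


def count_segfaults(lines):
    """Count segfault lines per process name."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts


# Made-up sample lines for illustration only (not taken from the attached logs).
SAMPLE_LINES = [
    "Apr 21 02:10:01 Tower kernel: python3[1111]: segfault at 0 ip 0000000000000000 sp 00007ffd00000000 error 14",
    "Apr 21 02:15:42 Tower kernel: node[2222]: segfault at 8 ip 000055d000000000 sp 00007ffc00000000 error 4",
    "Apr 21 02:20:09 Tower kernel: python3[3333]: segfault at 10 ip 00007f0000000000 sp 00007ffe00000000 error 6",
    "Apr 21 02:21:00 Tower kernel: eth0: Link is Up",
]

# Print the processes with the most segfaults first.
for proc, n in count_segfaults(SAMPLE_LINES).most_common():
    print(f"{n:3d}  {proc}")
```

To run it on a real log, pass an open file instead of the sample list, for example `count_segfaults(open("syslog.txt", errors="replace"))`.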

April 21, 2024

About a year and a half ago I had something similar happen to my machine, but I also had a USB failure at the same time. Once the USB failure was resolved, I still had crashes like yours. I had some spare HDDs, so I disconnected the array disks from the machine and installed various Linux-based OSes and Windows on the spares. All were very, very unstable. The failure was always connected to the same 2 CPU cores (can't remember which ones, but it was always one or the other).

I tested with 2 different RAM sets (although MemTest was always a PASS on both sets), 2 different motherboards with different BIOS versions, different HDDs, even a different PSU. The only thing that made those unstable OSes stable was a different CPU that I borrowed from a local PC repair shop. After some back and forth with Intel support, the conclusion was that my CPU was in a pre-fail condition, and it got replaced under warranty.
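One way to check whether crashes cluster on particular cores, as happened here, is to tally the `CPU: N` field that the kernel prints in the header of each oops or BUG report. A rough Python sketch, assuming the header looks like `CPU: 5 PID: 1234 Comm: ...`; the helper name `tally_cpus` and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Oops/BUG headers in the kernel log carry the CPU the fault happened on,
# e.g. "kernel: CPU: 5 PID: 1234 Comm: find ..."
CPU_RE = re.compile(r"kernel: CPU: (\d+)\b")


def tally_cpus(lines):
    """Count how many kernel fault headers were reported per CPU number."""
    counts = Counter()
    for line in lines:
        match = CPU_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    "Apr 20 11:02:13 Tower kernel: CPU: 5 PID: 1234 Comm: find",
    "Apr 20 19:45:50 Tower kernel: CPU: 12 PID: 4321 Comm: python3",
    "Apr 21 03:00:21 Tower kernel: CPU: 5 PID: 9876 Comm: mover",
    "Apr 21 03:00:22 Tower kernel: Call Trace:",
]

# Print the CPUs with the most reported faults first.
for cpu, n in tally_cpus(SAMPLE_LINES).most_common():
    print(f"CPU {cpu}: {n} fault(s)")
```

If the same one or two CPU numbers dominate across many crashes, that would fit a failing core. An even spread says less either way.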

Your mileage may vary.

April 21, 2024

Author

@JorgeB @Lavoslav Thanks for the feedback. I'm beginning to share the same suspicion, but I want to understand these errors thoroughly before investing in new hardware. Is it safe to conclude that segfault errors are typically indicative of failing hardware?

I've noticed that when I decrease the number of dockers running, the crashes happen less frequently. However, could this simply be because there's less strain on the hardware?

April 21, 2024

I had a failing USB drive that caused the system to keep crashing and have issues booting up, although one of the staff told me from the diagnostics that it was a USB port failure, which is unheard of.

I changed the boot USB drive to a new one and all has worked nicely since then. You could be having a similar issue if all diags passed.

April 21, 2024

Author

@PSYCHOPATHiO thanks for that. I have also replaced the boot USB during this troubleshooting process, as I had read another thread where a user with similar errors reported that this resolved it for them. Unfortunately, in my case it has made no change.

April 21, 2024

2 hours ago, Mistershiverz said:

I've noticed that when I decrease the number of dockers running, the crashes happen less frequently. However, could this simply be because there's less strain on the hardware?

If it's the same thing that I had with my CPU, that's exactly it. It was not even a matter of IF the machine would crash but how long it would take to crash. With my CPU I could not even install any OS unless I disabled most of the cores, and even then it was unstable and any OS would eventually crash simply by virtue of the machine being turned on. Idle, it would take multiple hours; surfing, a few hours at most; anything more intensive would crash it somewhere between instantly and a handful of minutes...

Just keep in mind I'm not saying it's 100% CPU fault on your end, but the symptoms look eerily familiar.

FYI, my dud was an i9-13900K and it was working like a charm for 4 months before the problems started.

Edited April 21, 2024 by Lavoslav

April 21, 2024

Author

@Lavoslav cool, appreciate the feedback. Similar story here: the server was rock solid for 12 months and is now struggling to get 48 hours of uptime.

April 22, 2024

Apr 21 03:00:21 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030
Apr 21 03:00:21 Tower kernel: RIP: 0010:uncharge_folio+0xe/0xd8
Apr 21 03:00:21 Tower kernel: __mem_cgroup_uncharge_list+0x49/0x84
Apr 21 03:00:21 Tower kernel: release_pages+0x2de/0x30e
Apr 21 03:00:21 Tower kernel: truncate_inode_pages_range+0x15a/0x382

These lines in particular from your logs are interesting, as they all pertain to memory management.

The kernel panicked with a memory access issue; then the following messages all relate back to physical memory (only some relate back to swap).
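To see at a glance which kernel functions were on the stack, the symbol names can be pulled out of the trace. A small Python sketch using the lines quoted above; `extract_symbols` is just an illustrative helper name, and the pattern assumes the usual `symbol+0xoffset/0xsize` form of trace entries.

```python
import re

# Trace entries have the form "symbol+0xOFFSET/0xSIZE".
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+")


def extract_symbols(lines):
    """Return the function names found in kernel trace lines, in order."""
    symbols = []
    for line in lines:
        match = SYMBOL_RE.search(line)
        if match:
            symbols.append(match.group(1))
    return symbols


# The lines quoted from the log in this post.
TRACE_LINES = [
    "Apr 21 03:00:21 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000030",
    "Apr 21 03:00:21 Tower kernel: RIP: 0010:uncharge_folio+0xe/0xd8",
    "Apr 21 03:00:21 Tower kernel: __mem_cgroup_uncharge_list+0x49/0x84",
    "Apr 21 03:00:21 Tower kernel: release_pages+0x2de/0x30e",
    "Apr 21 03:00:21 Tower kernel: truncate_inode_pages_range+0x15a/0x382",
]

# Print one function name per line, innermost frame first.
for name in extract_symbols(TRACE_LINES):
    print(name)
```

Each name can then be looked up in the kernel source to see which subsystem it belongs to.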

Also you updated the BIOS recently as your revision is from March 2024

Apr 21 03:00:21 Tower kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3401 03/19/2024

Reading the notes for that BIOS, they have made changes to memory management:

PRO WS W680-ACE BIOS 3401     Version 3401     12.5 MB     2024/03/22
1. Improved DDR5 compatibility
2. Further optimized CEP settings when disabled

I see in your logs you're running them at 4800 MHz; did you run them with XMP disabled (which IIRC is 3200 MHz)?

Check that the RAM slots are free from any debris or dust as well.

You could also try rolling back the BIOS revision. Otherwise it could be the CPU on its way out, with the IMC (Integrated Memory Controller) faulting.

I had an issue with one of my Ryzen 9s... it was always a bit buggy, and the other day it just completely died (thankfully replaced under warranty).

April 22, 2024

Author

@Jarsky appreciate the feedback.

Quote

Also you updated the BIOS recently as your revision is from March 2024

Yes, during the troubleshooting and after running the memtest I did update the board's BIOS in the hope that it would resolve some of the errors.

Quote

I see in your logs you're running them at 4800 MHz; did you run them with XMP disabled (which IIRC is 3200 MHz)?

No, this is worth a look, but in this board's BIOS I could not see any mention of the RAM using an XMP profile.

Hmm, OK. I may have to start looking at hardware replacements, as all the feedback points to failing hardware.

May 1, 2024

Another thing to note: with my old hardware, a Ryzen 1700X, it used to crash frequently if all 4 DIMM slots were filled. It took me a long time to figure that one out. It could also be RAM incompatibility, so try setting the RAM to default stock speeds.

Edited May 1, 2024 by PSYCHOPATHiO

May 1, 2024

Author

@PSYCHOPATHiO will do, thank you. That is what I will test next before looking at a potential new mobo or CPU.

May 1, 2024

Don't forget to check the PSU cables for the GPU in case they are bent too sharply. I also had crashes because I overdid the cable management on my PC. The PSU in use was a Corsair AX1000; I have two, and both crashed when the GPU cable was slightly bent. After replacing them with another PSU that has braided, non-flat GPU cables, my PC never crashed again.

So basically, if the crash happens under heavier loads, check your PSU.

May 2, 2024

Hi guys, this is my first post on this forum. I just wanted to say I'm having the same errors: segfaults and hard crashes. I installed unRAID today with a used CPU and random crashes keep happening. I'm running a memtest as of now. Let's hope it's a faulty stick of RAM, I don't wanna toss away this CPU... :(

May 2, 2024

Community Expert

2 minutes ago, apool said:

I'm having the same error

Start your own thread with your Diagnostics

May 2, 2024

6 minutes ago, trurl said:

Start your own thread with your Diagnostics

Will do, whenever memtest finishes
