Kernel traps, call traces and overall system instability

December 29, 2024

Hello all, this is my first post for support on this forum, so please advise if there is anything I am leaving out. I seem to be having pervasive problems and a variety of different errors that seem related to RAM, despite the RAM passing memtest86.

Unraid Version: 6.12.14

Hardware: Aoostar WTR Pro w/ Intel N100, 32GB OWC DDR4 3200MHz RAM, Western Digital 1TB WD Blue SN580 NVMe, 2x 18TB Seagate Exos

Plugins: Community Applications, Appdata Backup, Intel GPU Top, Tailscale, Unassigned Devices

Docker Containers: binhex-jellyfin, binhex-prowlarr, binhex-radarr, binhex-readarr, binhex-sabnzbd, binhex-sonarr, calibre, homarr, jellyseer

Description of issues: I am trying to assess if I have a hardware or software issue. I am running into a variety of issues with my setup. I have a record of a few of these from copying them down (syslogs before the most recent crash and diagnostics attached)

Dec 19 19:24:26 Tower kernel: traps: .NET TP Worker[2445] general protection fault ip:1532c2bc5575 sp:14f186b5e380 error:0 in libc.so.6[1532c2b44000+171000]
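For anyone reading a line like this, here is a minimal Python sketch that splits it into its fields and works out where inside the mapped library region the faulting instruction pointer landed. The function name `parse_trap` and the regular expression are my own illustration, written only against the line format shown above; other kernel messages (or other kernel versions) may be laid out differently. The computed offset is relative to the start of the mapping printed in brackets, which is not necessarily the same as an offset into the library file.

```python
import re

# Assumption: matches only the "traps:" line layout quoted above.
TRAP_RE = re.compile(
    r"traps: (?P<comm>.+?)\[(?P<pid>\d+)\] (?P<reason>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r"(?: in (?P<module>\S+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_trap(line):
    """Return a dict describing a kernel 'traps:' line, or None if no match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    result = {
        "comm": m.group("comm"),        # thread/process name
        "pid": int(m.group("pid")),
        "reason": m.group("reason"),
        "module": m.group("module"),    # None if the kernel printed no mapping
        "offset": None,
        "in_range": None,
    }
    if m.group("base") is not None:
        ip = int(m.group("ip"), 16)
        base = int(m.group("base"), 16)
        size = int(m.group("size"), 16)
        # Offset of the faulting instruction within the printed mapping.
        result["offset"] = ip - base
        result["in_range"] = 0 <= result["offset"] < size
    return result


if __name__ == "__main__":
    line = (
        "Dec 19 19:24:26 Tower kernel: traps: .NET TP Worker[2445] "
        "general protection fault ip:1532c2bc5575 sp:14f186b5e380 error:0 "
        "in libc.so.6[1532c2b44000+171000]"
    )
    info = parse_trap(line)
    print(info["comm"], info["pid"], info["module"], hex(info["offset"]))
```

The process name is the useful part for triage: it shows which application's thread faulted, while the fault landing inside libc says little on its own about whether the root cause is software or hardware.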

These errors sometimes lead to docker containers going down or being unresponsive. On several occasions my server has become completely unresponsive both in the GUI and via SSH and I have had to physically power cycle the machine.

I have configured the syslogs to be backed up and have attached the most recent syslog backup from before my most recent full crash requiring a power cycle. It does not appear to capture everything I was seeing in the logs when the server became unresponsive, which I copied and pasted out at the time of the crash. So you will see the logs attached as two files, the syslog backup (I've edited to remove file names) and a "crash log copy and paste" of the errors that took the server offline. I'm also attaching a diagnostics file from after when I had to power cycle. This most recent error and server crash happened when I attempted to update my binhex-radarr container.

Just to reiterate, I have run memtest86 and my RAM passed.

Is there anything obvious from the logs or diagnostics that is misconfigured that could be causing pervasive instability on my server? I appreciate the help and can answer any questions!

One other note: I am currently visiting family out of state so am troubleshooting this all remotely and relying on a roommate to power cycle the server when it's down. I note this to explain the delay between the crash and power cycling.

syslog-192.168.0.141 2.log Crash log copy and paste.txt tower-diagnostics-20241229-1319.zip


December 29, 2024

Community Expert

It's difficult to say from the logs whether this is a hardware or software issue, since memtest is only definitive if it finds errors. If you have multiple sticks, try using the server with just one; if the problem persists, try a different one. That will basically rule out bad RAM. Another thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the docker containers.


December 31, 2024

Author

Thanks @JorgeB. I tried running in safe mode for about 24 hours, and when I connected this morning the log was completely filled with rsyslog read-only errors. When I tried to navigate to Tools > Diagnostics to download diagnostics, the entire server crashed.

That leads me to seriously believe I have a hardware issue.


January 6, 2025

Quote

.NET TP Worker[2445]

Unraid doesn't use .NET, so this general protection fault must have come from a different app you're running on your server. Likely one of the *arr apps (Lidarr, Radarr, etc) since they're all .NET apps.

The attached syslog doesn't show a general protection fault, but it does have a "BUG: unable to handle page fault" which is usually indicative of either memory or disk swap issues (if you have swap enabled), but you said you ran memtest already. Hmm.
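To check a saved syslog for the kinds of messages discussed in this thread, a small script can count how often each one appears. This is only a sketch: the signature strings are the ones quoted in this thread ("general protection fault", "BUG: unable to handle page fault") plus "Call Trace" from the thread title, and the function name `count_signatures` and the file path passed on the command line are hypothetical.

```python
import sys
from collections import Counter

# Assumption: substrings taken from messages mentioned in this thread.
SIGNATURES = (
    "general protection fault",
    "BUG: unable to handle page fault",
    "Call Trace",
)


def count_signatures(lines, signatures=SIGNATURES):
    """Count how many lines contain each signature (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for sig in signatures:
            if sig.lower() in lowered:
                counts[sig] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical path): python scan_syslog.py /path/to/syslog.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for sig, n in count_signatures(fh).items():
                print(f"{n:6d}  {sig}")
```

A count like this does not diagnose anything by itself, but comparing the counts between a safe-mode run and a normal run can show whether the errors follow a particular service or occur regardless.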


January 6, 2025

Author

OK, a bit of user error and poor troubleshooting on my part here. I was not running safe mode; I simply had my docker containers disabled, as I was troubleshooting remotely and needed Tailscale up. When I came back to my server, I actually put it in safe mode and then, under pressure to get media available, I updated my containers and enabled one of them. This led to a full crash, and I'm attaching diagnostics where the logs are filled with errors.

I am now fully running safe mode with all containers disabled, and have been stable for approximately 14 hours. I'm still interested in whether anyone sees a pattern in the logs, as the zpool error that caused the latest full crash was new.

tower-diagnostics-20250106-1010.zip


January 13, 2025

Author

Quick update - I replaced my RAM about 36 hours ago and have not had a single error since. I will try to report back on whether this continues to work. This may be a lesson that, as @JorgeB mentioned, memtest86 is only definitive if it finds errors.
