
Tower Crashing Intermittently



Folks. Syslog and diagnostics attached. 

 

Symptom: the server runs for some number of days or weeks, and then crashes. This has happened maybe 3 times, but I only just moved the syslog to a different machine. Dockers stop running, VMs stop running, the webGUI is non-responsive, and it does not respond to ping; the machine itself remains powered with fans and lights on. It's on a UPS with auto-shutdown, so I do not think it's a power blip (I had one of those a week ago, where the machine died and did not reboot when power came back).

 

I see the warnings/errors at 23:21:42, but do not know how to interpret them.

 

Server is back online now, hence the diagnostics file. It isn't a major issue, but I would prefer not to have to run parity checks all the time lol.

vesper-diagnostics-20231112-1717.zip syslog.txt

  • 1 month later...
  • 1 month later...

Not resolved. Unassigned Devices is still uninstalled; qBit runs in a VM (I have it shut down now, which I didn't before, but still, it's in a Windows 10 VM).

 

Two crashes in the past week: one after nearly a month of uptime, the second less than 24h after the first's parity check finished. Same symptoms: machine still running (fans and lights on, HDDs spinning) but no access to Shares/Web/Dockers (Tailscale, Homebridge)/VMs; requires a hard shutdown and reboot.

 

I only managed to capture the syslog from crash 2 (the syslog server is on my backup OMV NAS, which needed some attention).

syslog (2).txt vesper-diagnostics-20240129-0831.zip


And crashed again today. 

Running in bare-bones mode with no VMs and no Dockers except Homebridge (still running; nice for cameras). I'll try safe mode next; I'll run a memtest on the 7th if it's crashed by then.

 

Nothing mentioned about the crash in the syslog at all.  
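
When the syslog is silent about a crash, the crash time can still be estimated from where the logging stops and later resumes. A minimal Python sketch of that idea, assuming standard `Mon DD HH:MM:SS` syslog timestamps; the function name `largest_gap`, the `year` default, and the sample lines are hypothetical and not taken from the attached logs.

```python
from datetime import datetime

def largest_gap(lines, year=2024):
    """Return (gap_seconds, last_line_before, first_line_after) for the longest silence.

    Syslog timestamps carry no year, so one is supplied (an assumption).
    Lines without a parseable timestamp are skipped. Returns None if fewer
    than two timestamped lines are found.
    """
    stamped = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            ts = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue
        stamped.append((ts, line))

    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# Hypothetical lines for illustration only.
sample = [
    "Feb  1 02:00:00 Vesper example: hypothetical message A",
    "Feb  1 02:10:05 Vesper example: hypothetical message B",
    "Feb  1 09:15:00 Vesper example: hypothetical first line after reboot",
    "Feb  1 09:15:02 Vesper example: hypothetical message C",
]

gap, before, after = largest_gap(sample)
print(f"longest silence: {gap:.0f} s")
print("last line before:", before)
print("first line after:", after)
```

The last line before the longest silence is the best available estimate of when the machine stopped responding, even when that line itself says nothing about a fault.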


Edit 2: Unable to reboot; logs stalled at "Array stopping: unmounting disks" with no further unmounting attempts. Tried "reboot" from the CLI and the GUI; it did not reboot. This is getting ridiculous. No way to force a reboot or force an unmount?!?

 

Edit 1: Found this one: a corrupt file wouldn't transfer; something hung while handling it, which also prevented the Array from unmounting.

 

Data point: removed nearly all my plugins, all VMs off, only Plex and Homebridge running. 

 

Caught the server with random CPUs pinned while idle (I had a sync running from another VM pulling files onto my backup NAS; I terminated that and shut down the VM; no change). Not sure if it's related, but it doesn't seem right. No crash yet. Nothing in "Processes" shows any significant CPU usage; only 2 processes are above 0.1% (at 6% and 2%), and I'm not sure how else to determine what is using the CPUs.
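
One way to see what a pinned CPU is doing when no process accounts for it is to compare two snapshots of the kernel's per-CPU counters in `/proc/stat`, which split time into user, system, idle, iowait, irq and softirq. A minimal Python sketch, assuming a Linux host; the function names and the sample numbers are hypothetical, and whether the dashboard bars include iowait is an assumption, not something confirmed in this thread.

```python
# Column order of the per-CPU lines in /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def read_cpu_times(stat_text):
    """Map 'cpuN' -> dict of cumulative tick counters parsed from /proc/stat text."""
    cpus = {}
    for line in stat_text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and every non-CPU line.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            cpus[parts[0]] = dict(zip(FIELDS, values))
    return cpus

def busy_breakdown(before, after):
    """Percent of elapsed ticks spent in each category, per CPU, between two snapshots."""
    report = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = {k: new[k] - old[k] for k in new}
        total = sum(delta.values())
        if total > 0:
            report[cpu] = {k: round(100.0 * v / total, 1) for k, v in delta.items()}
    return report

# Hypothetical snapshots for illustration. On the server itself, each one would
# come from open("/proc/stat").read(), taken a few seconds apart.
snapshot_1 = """cpu  100 0 100 800 0 0 0 0 0 0
cpu0 50 0 50 400 0 0 0 0 0 0
cpu1 50 0 50 400 0 0 0 0 0 0
"""
snapshot_2 = """cpu  112 0 113 880 95 0 0 0 0 0
cpu0 60 0 60 480 0 0 0 0 0 0
cpu1 52 0 53 400 95 0 0 0 0 0
"""

report = busy_breakdown(read_cpu_times(snapshot_1), read_cpu_times(snapshot_2))
for cpu, shares in sorted(report.items()):
    print(cpu, shares)
```

In the made-up numbers above, `cpu1` is almost entirely in iowait: a core waiting on storage rather than running any process, which would not show up as per-process CPU usage.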

 

image.png.1df064f234eb5062b883ced111c694ac.png

vesper-diagnostics-20240131-2055.zip

Edited by bjsmith911

No issues since last boot; took it offline yesterday to run a memtest. I'll upload the full report later (don't have the memtest USB with me) but the memory checks out. (Possibly vulnerable to high frequency row hammer bit flips, but no errors). 

 

I've removed several, but not all, plugins, including the Nvidia plugin; also removed the GT1030 that was in there, left over from my quad-monitor workstation days.

20240208_122708741_iOS.jpg

  • 5 months later...

I keep seeing posts re: intermittent crashing. These folks are lambasted by people saying it isn't a problem, but then why do the posts keep coming? Anyway. I am not a developer, and I have only inconsistently downloaded and examined my syslogs, but they do consistently show kernel BUG entries at the last timestamp before the system goes unresponsive.

 

E.g.:

image.thumb.png.4ff04774b892cb4a8585c0defd98b0fb.png

Nov 11 23:21:42 Vesper kernel: BUG: unable to handle page fault for address: 00000200636d12d0

Nov 11 23:21:42 Vesper kernel: #PF: supervisor read access in kernel mode

Nov 11 23:21:42 Vesper kernel: #PF: error_code(0x0000) - not-present page

Nov 11 23:21:42 Vesper kernel: PGD 0 P4D 0

Nov 11 23:21:42 Vesper kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI

Nov 11 23:21:42 Vesper kernel: CPU: 2 PID: 163 Comm: kswapd0 Tainted: P     U     O       6.1.49-Unraid #1

Nov 11 23:21:42 Vesper kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D06/MPG Z590 GAMING CARBON WIFI (MS-7D06), BIOS 1.B0 06/12/2023

 

and 

 

image.thumb.png.c9f74d9a076738855fcae6fc8f62652c.png

Jul 29 3:05:44 Vesper kernel: BUG: kernel NULL pointer dereference, address: 0000000000000081

Jul 29 3:05:44 Vesper kernel: #PF: supervisor read access in kernel mode

Jul 29 3:05:44 Vesper kernel: #PF: error_code(0x0000) - not-present page

Jul 29 3:05:44 Vesper kernel: PGD 15fcd5067 P4D 15fcd5067 PUD 15fcd4067 PMD 0

Jul 29 3:05:44 Vesper kernel: Oops: 0000 [#2] PREEMPT SMP NOPTI

Jul 29 3:05:44 Vesper kernel: CPU: 7 PID: 15899 Comm: shfs Tainted: P     UD    O       6.1.79-Unraid #1

Jul 29 3:05:44 Vesper kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D06/MPG Z590 GAMING CARBON WIFI (MS-7D06), BIOS 1.B0 06/12/2023
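
Pulling these entries out of a long syslog by hand is tedious, so a small script can list every kernel BUG together with the task, CPU and kernel release reported alongside it. A minimal Python sketch, assuming the line layout seen in the two excerpts above; the function name `summarize_oops` and the regular expressions are my own and only cover this layout. The sample lines are the `BUG:` and `CPU:` lines from those excerpts.

```python
import re

# Matches the "BUG: ..." line that opens each kernel oops report.
BUG_RE = re.compile(r"kernel:\s*BUG: (?P<bug>.+)")
# Matches the "CPU: ... Comm: ... Tainted: ... <release> #N" line of the report.
CPU_RE = re.compile(
    r"kernel:\s*CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"Tainted: (?P<flags>.*?)\s+(?P<release>\d+\.\d+\.\d+\S*) #"
)

def summarize_oops(lines):
    """Return one dict per BUG line, enriched with the CPU line that follows it."""
    events = []
    current = None
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            current = {"bug": m.group("bug").strip()}
            events.append(current)
            continue
        m = CPU_RE.search(line)
        if m and current is not None and "comm" not in current:
            current.update(
                cpu=int(m.group("cpu")),
                pid=int(m.group("pid")),
                comm=m.group("comm"),
                # Collapse the column padding between taint flags.
                tainted=" ".join(m.group("flags").split()),
                release=m.group("release"),
            )
    return events

SAMPLE = [
    "Nov 11 23:21:42 Vesper kernel: BUG: unable to handle page fault for address: 00000200636d12d0",
    "Nov 11 23:21:42 Vesper kernel: CPU: 2 PID: 163 Comm: kswapd0 Tainted: P     U     O       6.1.49-Unraid #1",
    "Jul 29 3:05:44 Vesper kernel: BUG: kernel NULL pointer dereference, address: 0000000000000081",
    "Jul 29 3:05:44 Vesper kernel: CPU: 7 PID: 15899 Comm: shfs Tainted: P     UD    O       6.1.79-Unraid #1",
]

for event in summarize_oops(SAMPLE):
    print(event)
```

In the second excerpt, the `[#2]` counter on the Oops line and the `D` taint flag both indicate that an earlier oops had already occurred during that boot, so the first BUG entry in that syslog is likely the more informative one to read.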

 

 

These posts seem to be layered with animosity from both the Unraid faithful and the "victims", but what is getting lost in the dialogue is that there is an issue. I understand it may ultimately be related to some pairing of Linux kernel and specific hardware, and not be an Unraid issue, but it is an issue nonetheless.

 

Examples:


https://www.facebook.com/groups/217132562182318/posts/1593856457843248/
https://www.facebook.com/groups/217132562182318/posts/1594972577731636/
https://www.facebook.com/groups/217132562182318/posts/1598660484029512/
