
Unraid random crashes / hardware check errors



Hi all, 

 

Earlier this year I got my Unraid system running just fine. It was running several Dockers and 2 virtual machines. After getting some trouble with the system (one VM crashing and losing the audio passthrough, which could only be fixed by a full system restart) and wanting to split the system into 2 systems, I took it all apart and built one gaming system using the hardware that Unraid previously ran on, and a new Unraid server which now ran on my old gaming hardware. This setup worked just fine. 

 

A few weeks ago, I got a new GPU and decided to try and see if this new GPU and Windows 11 would run smoothly on Unraid. Since it ran perfectly, I decided to consolidate both systems into one. The hardware that I use now is the same hardware as earlier this year. But now it has problems. 

 

I get random crashes. When the system crashes, it locks up. Nothing works and I can't get into Unraid at all. It requires a hard reset. This can have multiple causes, but in my case it could be the Ryzen bug. I have followed the guide to disabling global C-states and have set the PSU mode in the BIOS to "idle power" or something like that. I have also used the fix that Spaceinvaderone posted on YouTube:

 

 

But my system still acts weird. I haven't had a full lockup since disabling the C-states, but I have had the system randomly reboot. It looked like a clean reboot. I did, however, just get this error message:

image.thumb.png.ea1585c287a0f48a524180a765f2879b.png

 

I downloaded the nerdpack, installed mcelog (I have no idea if that also means that it runs because I can't figure out if I need to start that logging thing somewhere) and downloaded the diagnostics. 
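For anyone with the same question, one way to tell whether a tool is merely installed or actually running is to look for its process. The sketch below is only an illustration: it assumes a Linux /proc filesystem, the helper name `is_running` is made up for this example, and the process name `mcelog` is an assumption about how the daemon shows up.

```shell
#!/bin/bash
# Sketch: report whether a process with a given name is currently running.
# Assumption: a Linux /proc filesystem; "mcelog" as the process name is a guess.

is_running() {
    local name="$1" f
    for f in /proc/[0-9]*/comm; do
        # Processes can exit while we iterate, so ignore unreadable entries.
        if [ "$(cat "$f" 2>/dev/null)" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

if is_running mcelog; then
    echo "mcelog is running"
else
    echo "mcelog is not running"
fi
```

A "not running" result only says that no process with that exact name exists right now. It does not say whether hardware errors are being logged some other way.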

 

Does anyone know what the hardware error is?

tower-diagnostics-20210830-2208.zip

Link to comment
50 minutes ago, workermaster said:

wondering how it got my email address

Everything on your server that sends emails uses the same process, and it uses the email address you have configured in Notifications. It is good that you get email Notifications, though that one looks like there is something wrong with it. Probably it comes from Fix Common Problems.

 

Those previous Diagnostics were taken just after you installed mcelog.

 

Post new Diagnostics.

 

 

Link to comment
1 hour ago, trurl said:

Everything on your server that sends emails uses the same process, and it uses the email address you have configured in Notifications. It is good that you get email Notifications, though that one looks like there is something wrong with it. Probably it comes from Fix Common Problems.

 

Those previous Diagnostics were taken just after you installed mcelog.

 

Post new Diagnostics.

 

 

That sounds good. This means that the email system works as it should. 

Here are the new diagnostics. 

 

 

tower-diagnostics-20210831-1503.zip

Link to comment

The MCEs occur just after boot and don't continue, so they can be ignored.

 

Also noticed this in syslog

Aug 31 09:15:08 Tower kernel: traps: Radarr[10112] general protection fault ip:14c1f0b8a6f3 sp:14c1eb2876e0 error:0 in libcoreclr.so[14c1f0849000+36a000]

GPF is an excuse to run memtest.
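A minimal sketch of how lines like the one above can be pulled out of a syslog, so it is easier to see whether they keep occurring after boot. It assumes the live log is a plain-text file at /var/log/syslog. The function names `scan_log` and `count_hits` and the search patterns are illustrative choices, not an official list of error strings.

```shell
#!/bin/bash
# Sketch: find kernel lines that often point to memory or CPU trouble.
# Assumption: the log is a plain-text syslog at /var/log/syslog.

# Print matching lines (general protection faults, machine checks, segfaults).
scan_log() {
    local file="$1"
    grep -iE 'general protection|machine check|mce:|segfault' "$file" || true
}

# Print only the number of matching lines.
count_hits() {
    local file="$1"
    scan_log "$file" | grep -c . || true
}

# Run against the live syslog only if it exists and is readable.
if [ -r /var/log/syslog ]; then
    echo "Suspicious kernel lines: $(count_hits /var/log/syslog)"
    scan_log /var/log/syslog | tail -n 20
fi
```

A count that stays flat after boot fits the "can be ignored" reading. A count that keeps growing, or GPFs showing up in several unrelated programs, is a stronger reason to run memtest.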

 

Real question is what is wrong with your cron setup causing the email.

 

What do you get from the command line with this?

ls /etc/cron.daily


 

Link to comment
16 minutes ago, trurl said:

What do you get from the command line with this?

ls /etc/cron.daily

 

Not that much:

image.png.593430b88152b08f7a2bda460f9b82b7.png

 

The server still hasn't crashed but the gaming VM has. It suddenly freezes and after a few seconds, the screen goes black. I get the feeling that I am going to run into every single problem that has ever existed. 

Link to comment
8 minutes ago, trurl said:

That suggests the user scripts plugin. Do you have any scripts used by that plugin?

 

See if @Squid has an idea.

I have 3 scripts in there. One of them was for resetting my RX 6800 XT GPU. I have deleted that one since I do not need it. 

There is also another script for removing the 3 transcode limit on the GTX 1050 Ti for Plex. But that one doesn't work I think because I am still limited. 

And there is a script to dump the vbios of my GPU. 

 

At this moment the only script running is the one for removing the limiter of the 1050 Ti and that one only runs at the array startup. 

Link to comment
21 hours ago, Squid said:

This is a known issue with 6.10-rc1 and I believe is fixed on the next release.

 

I did not get any new mail last night or the last few reboots. 

 

I did get another VM crash, and last night Unraid locked up again. Very annoying that I do not have any logfiles. There do not appear to be any errors logged. It all just suddenly stops. 

Currently I suspect that the Unraid hard lock / crash is caused by the Binhex Syncthing container. The crashes did not happen before that. 

I also think that the VM crashes are mostly caused by buggy games. It happens when I play older Assassin's Creed games but not when running Furmark or Prime95 (or both).
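Part of the reason nothing is left after a hard lock is that the live syslog sits in RAM and is gone after a reset. A rough sketch of one workaround is to copy the log to persistent storage at intervals, for example from a scheduled user script. The paths are assumptions (/var/log/syslog for the live log, /boot/logs on the flash drive) and `snapshot_log` is a made-up helper name.

```shell
#!/bin/bash
# Sketch: keep a timestamped copy of the syslog somewhere that survives a reboot.
# Assumptions: /var/log/syslog is the live log and /boot/logs is on the flash drive.

# Copy SRC into DEST_DIR as syslog-YYYYmmdd-HHMMSS.txt and print the new path.
snapshot_log() {
    local src="$1" dest_dir="$2"
    local stamp
    stamp="$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/syslog-$stamp.txt" || return 1
    echo "$dest_dir/syslog-$stamp.txt"
}

# Only touch the real paths when they exist.
if [ -r /var/log/syslog ] && [ -d /boot/logs ]; then
    snapshot_log /var/log/syslog /boot/logs
fi
```

Frequent writes wear out a flash drive, so this is better kept as a temporary measure while chasing a crash, with old snapshots cleaned up afterwards.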

 

I am going to leave the Syncthing container off for a few days to test that. And I am also going to try playing a newer game to see if that solves the crashing. 

 

The last few days have been rough and I am almost at the point of giving up. I have spent over 2 weeks debugging stuff to get it all running and I feel like I am close to being done but I am also very close to giving up and throwing everything in the garbage. 

Link to comment

Since I shut down the Syncthing Docker from Binhex, my system has not crashed again. It made it through last night. So far so good. I will test this for a few more days before I contact Binhex to find out what causes this. 

 

I did get another email telling me the same thing as before. Something about the size of a VM not being correct. 

Edited by workermaster
Link to comment
