How to troubleshoot instability?


Cliff


I have an Unraid server which has been running great for over a year. I am also running a Windows 11 VM, but after a while I stopped getting updates because I needed to pass through a TPM module. I then upgraded to the next branch of Unraid and everything seemed to work well.

 

But after a while I started getting random reboots where the whole server crashed, and these happened more and more frequently.

My first suspicion was that the next branch caused some problem with the GPU my VM was using (a GTX 1060 with a dumped BIOS), as I sometimes noticed that the log was at 100% before a crash and was filling up with:

2021-12-11T22:59:01.085965Z qemu-system-x86_64: vfio_region_write(0000:0a:00.0:region1+0x16ff78, 0x0,1) failed: Device or resource busy

 

And if I look in my device info, it looks like my 1060 GPU has the address 0000:0a:00.0.
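For anyone who wants to check how much of a log is taken up by this message, here is a minimal Python sketch. It only assumes that the lines look like the one quoted above; the function names `count_vfio_failures` and `summarize_log` are made up for illustration, and the log location is not something I know for certain, so the path is left as a parameter.

```python
import re
from collections import Counter

# Matches QEMU lines shaped like the one quoted in this post, e.g.
# "... vfio_region_write(0000:0a:00.0:region1+0x16ff78, 0x0,1) failed: Device or resource busy"
# The pattern is derived from that single sample line; other message variants are not covered.
VFIO_WRITE_FAILURE = re.compile(
    r"vfio_region_write\("
    r"(?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):"
    r"region\d+\+0x[0-9a-fA-F]+.*?\) failed: (?P<reason>.+)$"
)


def count_vfio_failures(lines):
    """Count failed vfio_region_write messages per (PCI address, reason)."""
    counts = Counter()
    for line in lines:
        match = VFIO_WRITE_FAILURE.search(line)
        if match:
            counts[(match.group("addr"), match.group("reason").strip())] += 1
    return counts


def summarize_log(path):
    """Read a log file (path supplied by the caller) and return the failure counts."""
    with open(path, "r", errors="replace") as handle:
        return count_vfio_failures(handle)
```

Calling `summarize_log()` with the path to the VM's log file returns a counter keyed by PCI address and error text, so it is easy to see whether every failure points at 0000:0a:00.0 or whether other devices are involved too.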

 

Most of the time the server starts normally after a crash/reboot, and the Windows 11 VM starts up fine with video output from the GPU. But after anywhere from about 5 minutes to 8 hours the server crashes, and it seems to be getting worse: lately the server mostly crashes after a few minutes.

 

So I first tried disabling the VMs, but the server still crashed.

I downgraded to Unraid stable, updated the BIOS and removed my 1060 GPU from the server, but it still crashes after a few minutes.

I am currently running memtest to rule out any memory problems. But what else can I do if the memory is fine?

Could there be some problem with the Unraid USB stick, or something else, that causes these problems?

 

My server specs:

Asus TUF Gaming X570-Plus

64GB RAM

AMD Ryzen 9 3900X

Nvidia 1060 6GB

Nvidia GT 710

250GB SSD cache

2x12TB HDD +12TB parity

512GB M2 for VM

 

The only strange thing I noticed, even when the server was running fine, is that I always have to unplug all cables from the 1060 GPU when rebooting the server, otherwise I can't use it in any VMs. I am also using an Nvidia GT 710 with an HDMI dummy plug, as Unraid needs a GPU as I understand it.

 

unraid-diagnostics-20211212-1451.zip

Link to comment

Not necessarily your problem, but something of interest

 

You bought Corsair Dominator memory (64GB kit, 4x16GB). Corsair's compatibility listing for it only shows Intel chipsets, and Asus doesn't list that memory in their QVL either.

 

While I will usually only buy from the motherboard's QVL (as I don't trust memory manufacturers' compatibility lists), I find it curious that neither of them lists it as compatible.

Link to comment

I ordered new USB drives to check if that solves anything.

Edit: I just remembered that I am passing through the M.2 drive as a raw disk to the Windows 11 VM. So I tried booting from it directly, without Unraid, and right now it looks like there are no problems. I did some testing for about an hour where I ran CPU/GPU stress tests, and there were no reboots. I will do some more testing when I get home from work.

 

But if it works without problems outside of Unraid, what could have caused all this instability? Will everything be fixed if I migrate to a new USB key, or could there still be something else that is corrupted in some way?

And can I still transfer all my settings (Docker, VMs, etc.) to the new USB, or do I risk corrupting something again if I try to reuse all my settings?
