Server randomly crashes



Hello all,

My situation: a fresh system, installed three months ago with a new Unraid V6. It worked fine for weeks. I used Docker (lancache, folding@home, jdownloader [to queue downloads on a slow internet connection]), ran Windows and Linux VMs, and had some SMB and AFP shares.

Problem: My system crashes randomly after some time, even with no Docker containers and no VMs started. When it crashes, it has to be shut down manually: I can't log in to the webUI and the console does not respond, but I still get my hourly Array Status and can ping the system's IP address.

I have a USB network card plugged in; I will run a test without it.

- No change! -

There are no logs in the logs folder.

 

Greetings

mps-diagnostics-20200524-1440.zip

Edited by D'n'S137
added diagnostics ZIP

On the last boot I got no errors, but after two hours the system crashed again and I had to reset it (hard shutdown). Today I booted the system and got some errors in the log, see below:

May 24 17:36:58 MPS kernel: Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
May 24 17:36:58 MPS kernel: tsc: Fast TSC calibration failed
May 24 17:36:58 MPS kernel: ACPI: Early table checksum verification disabled
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: Machine check events logged
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8168b28a MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1590334600 SOCKET 0 APIC 8 microcode 8001138
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: Machine check events logged
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000000000108
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8168b28a MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 24 17:36:58 MPS kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1590334600 SOCKET 0 APIC 1 microcode 8001138
May 24 17:36:58 MPS kernel: floppy0: no floppy controllers found
May 24 17:36:58 MPS kernel: ccp 0000:07:00.2: psp initialization failed
May 24 17:36:58 MPS kernel: ACPI Warning: SystemIO range 0x0000000000000B00-0x0000000000000B08 conflicts with OpRegion 0x0000000000000B00-0x0000000000000B0F (\GSA1.SMBI) (20180810/utaddress-204)
May 24 17:36:58 MPS kernel: ata5.00: failed to enable Sense Data Reporting, Emask 0x1
May 24 17:36:58 MPS kernel: ata5.00: failed to enable Sense Data Reporting, Emask 0x1
May 24 17:36:58 MPS kernel: ata6.00: failed to enable Sense Data Reporting, Emask 0x1
May 24 17:36:58 MPS kernel: ata6.00: failed to enable Sense Data Reporting, Emask 0x1
May 24 17:37:11 MPS rpc.statd[1972]: Failed to read /var/lib/nfs/state: Success
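
For anyone trying to read the `mce:` lines above: the hex value after `Bank 5:` is the machine check status register, and its top bits are architectural flags on x86. Below is a minimal Python sketch that pulls the CPU, bank, and status out of such a line and names the flags that are set. The function name `decode_mce_line` and the dictionary layout are made up for illustration; the lower 16 bits are only extracted as a raw error code, not interpreted, because their meaning depends on the CPU vendor and bank.

```python
import re

# Architectural flag bits in the upper part of an x86 machine check
# status register (MCi_STATUS / MCA_STATUS).
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Matches kernel lines such as:
#   mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def decode_mce_line(line):
    """Return CPU, bank, status, and set flags for one mce log line.

    Returns None if the line is not a "Machine Check" status line.
    """
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group("status"), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        "status": status,
        "flags": flags,
        # Raw low 16 bits; vendor- and bank-specific, not interpreted here.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    sample = ("May 24 17:36:58 MPS kernel: mce: [Hardware Error]: "
              "CPU 4: Machine Check: 0 Bank 5: bea0000000000108")
    result = decode_mce_line(sample)
    print(f"CPU {result['cpu']}, bank {result['bank']}")
    for flag in result["flags"]:
        print("  " + flag)
```

For the status value in the log, this reports the UC and PCC flags as set, i.e. the hardware logged the error as uncorrected. That fits a hard hang, but the flags alone do not say whether the CPU, the board, or a BIOS setting is at fault.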

 

Do I have a bad CPU? Is there a reliable hardware test? I bought the CPU *used* to save a buck; maybe this was a scam or a mistake.


I ran into MCEs with my old server. The recommendations from here were that I contact the CPU & motherboard vendors to see if there was anything they could do/help with. I ended up having to replace the CPU.

 

Not saying that's guaranteed to be your only course of action here, but bracing you for the worst. Wait for someone more knowledgeable to chime in, but you may want to at least start checking with your vendors.

Thanks for sharing your experience. It led me to look into my problem on the AMD side. Some other people had similar issues with the same CPU; it seems to be a problem with C-states and the SMT feature, so I disabled C-states in my BIOS. Now I am testing whether this solved my problem. I will also reach out to AMD support, because they seem to know about this issue with the Ryzen 7 1700(X) CPUs.




21 hours ago, Frank1940 said:

Google: unraid.net amd ryzen problems

 

And read the posts from Unraid.net first. You can read the ones on Reddit if you want more information.

Okay, thanks for your advice.

I disabled C-states and the SMT feature; since then I have had no problems at all.

But I am still wondering why I could run the server for days before and only now get these errors, since C-states were active the whole time.
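
To double-check from the running system that the BIOS changes actually took effect, one option is to look at what the Linux kernel exposes in sysfs. The sketch below is only an illustration: the function name `idle_and_smt_state` is made up, and it assumes the kernel provides the standard `cpuidle` directory and the `smt/active` file under `/sys/devices/system/cpu` (the latter only exists on reasonably recent kernels). If a file is missing, the corresponding value is simply reported as unknown.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def idle_and_smt_state(cpu_root="/sys/devices/system/cpu"):
    """Report the idle states of cpu0 and whether SMT is active.

    cpu_root is a parameter so the function can be pointed at a copy of
    the sysfs tree for testing.
    """
    root = Path(cpu_root)
    state_dirs = sorted(
        root.glob("cpu0/cpuidle/state*"),
        key=lambda p: (len(p.name), p.name),  # state2 before state10
    )
    idle_states = []
    for state_dir in state_dirs:
        name = read_text(state_dir / "name")
        if name is not None:
            idle_states.append(name)

    smt = read_text(root / "smt" / "active")
    return {
        "idle_states": idle_states,
        # None means the kernel does not expose the SMT control file.
        "smt_active": None if smt is None else smt == "1",
    }


if __name__ == "__main__":
    info = idle_and_smt_state()
    print("Idle states on cpu0:", ", ".join(info["idle_states"]) or "none listed")
    print("SMT active:", info["smt_active"])
```

With C-states disabled in the BIOS, I would expect the deeper idle states to disappear from the list, and `SMT active` to read `False` with SMT off, but exactly what remains listed depends on the firmware and kernel version.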
