Server freezes without further notice



Hi all,

 

Lately my server has started to behave strangely: it freezes without any warning. I checked the logs but did not see any problem. I have an out-of-band NIC so that I can connect to the server remotely to control and reboot it, but when the server freezes, the IPMI interface is also unreachable.

 

You can find the diagnostic files attached to this post.

 

Thanks.

 

Denis

 

ket-diagnostics-20200316-1113.zip


Dear Jonathan,

 

Thank you for your comment. It helped me a lot with the investigation. By checking the IPMI events, I managed to find that the same error always appears before the freeze.

Here is the extract:  "742  | 03/16/2020, 19:11:49 | CPU_CATERR       | Processor                          | State Asserted - Asserted". By googling, I found that CATERR stands for catastrophic error. I read this article, which explains the error handling process pretty well. Now I have to find out what is causing my problem.
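
To check a whole exported event log for this entry rather than reading it by eye, something like the following could work. It is only a minimal sketch: it assumes every line has the same five pipe-separated fields and the same date format as the extract above, and the names `SelEvent`, `parse_sel_line` and `find_caterr` are made up for illustration, not part of any IPMI tool.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class SelEvent(NamedTuple):
    """One event log entry (hypothetical structure for this sketch)."""
    event_id: int
    timestamp: datetime
    sensor: str
    sensor_type: str
    description: str


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Parse one 'id | date, time | sensor | type | event' line.

    Returns None for lines that do not match the assumed layout.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 5:
        return None
    try:
        return SelEvent(
            event_id=int(parts[0]),
            # Assumed format, taken from the extract: MM/DD/YYYY, HH:MM:SS
            timestamp=datetime.strptime(parts[1], "%m/%d/%Y, %H:%M:%S"),
            sensor=parts[2],
            sensor_type=parts[3],
            description=parts[4],
        )
    except ValueError:
        return None


def find_caterr(lines: Iterable[str]) -> List[SelEvent]:
    """Keep only the entries whose sensor name contains CATERR."""
    events = (parse_sel_line(line) for line in lines)
    return [e for e in events if e is not None and "CATERR" in e.sensor]


# The line quoted in this post, used as the only input.
sample = (
    "742  | 03/16/2020, 19:11:49 | CPU_CATERR       | "
    "Processor                          | State Asserted - Asserted"
)
for event in find_caterr([sample]):
    print(event.event_id, event.timestamp.isoformat(), event.description)
```

Feeding it the lines of a saved log file instead of `sample` would list every CATERR entry with its timestamp, which can then be compared with the times of the freezes.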

 

Denis

Edited by Unraid_Noob
  • 3 months later...

Hi all,

 

I contacted ASRock support to get an answer about my CPU_CATERR error. After several weeks of investigation and trial and error, they seem to be pointing to a problem with the OS.

To summarize the problem: I experience random server freezes with a CPU_CATERR error in the IPMI logs. The only solution is a hard reset of the server. Another interesting aspect is that when the problem occurs, the switch my server is connected to goes haywire and blocks all its other ports as well, meaning all the other hosts connected to the same switch become unreachable. After the hard reset, everything returns to normal.

 

I tried:

- Removing all the PCI cards

- Removing all the RAM modules except one

- Replacing the RAM modules with ones from another brand

- Updating the BIOS with a special version provided by ASRock

- Replacing the PSU with another one

None of this helped.

Sometimes the server runs for weeks without any issues, but it also happens that it freezes 2 to 3 times in the same day.
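
To put numbers on that pattern, the freeze timestamps could be summarized per day and as gaps between consecutive freezes. This is only a rough sketch: the function names are invented for illustration and the dates in the example are placeholders, not my real freeze times.

```python
from collections import Counter
from datetime import date, datetime, timedelta
from typing import Dict, Iterable, List


def freezes_per_day(timestamps: Iterable[datetime]) -> Dict[date, int]:
    """Count how many freezes fall on each calendar day."""
    return dict(Counter(ts.date() for ts in timestamps))


def gaps_between(timestamps: Iterable[datetime]) -> List[timedelta]:
    """Time elapsed between each freeze and the next one, in order."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Placeholder timestamps for illustration only (NOT real data):
# two freezes on one day, then a quiet period of almost three weeks.
example = [
    datetime(2020, 1, 1, 8, 0),
    datetime(2020, 1, 1, 14, 30),
    datetime(2020, 1, 20, 9, 15),
]

for day, count in sorted(freezes_per_day(example).items()):
    print(day, count)
for gap in gaps_between(example):
    print(gap)
```

A table like this, built from the real IPMI timestamps, is easier to hand to support than a description in words.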

 

What can I do to prove to ASRock that the problem is the motherboard itself and nothing else?

 

Thank you for your help.


Denis

 

  • 1 month later...

I have somewhat the same issue:
image.png.0ec6c8969da609f4035293d0a93e3a48.png

 

I have a dual-socket motherboard with onboard error LEDs, and when these errors happen the LEDs point at the 2nd CPU.

I still have some troubleshooting to do (re-seat the CPU in its socket, re-seat the RAM, and if that fails, swap the CPUs between sockets), but it is hard when the system continues to run without errors for months even though I did not change anything.

 



 


Hi NeoJoris,

 

Sad news for you. After weeks of investigation and testing, I ended up requesting a replacement, which was granted. I just received it. Because the investigation took so long, I bought a Supermicro board in the meantime, which will be my permanent motherboard. I will resell the ASRock one.

 

Cheers,

 

Denis
