Unraid_Noob Posted March 16, 2020 Hi all, Lately my server has started to behave strangely: it freezes without warning. I checked the logs but did not see any problem. I have an out-of-band NIC so that I can connect to the server remotely to control and reboot it. When the server freezes, the IPMI interface is also unreachable. You can find the diagnostic files attached to this post. Thanks. Denis ket-diagnostics-20200316-1113.zip
JonathanM Posted March 16, 2020 4 hours ago, Unraid_Noob said: "When the server freezes the ipmi interface is also unreachable." With that symptom, the first things I would investigate are CPU cooling, PSU issues, and memory issues. If it's hanging so hard that the IPMI is dead, that almost has to be hardware.
Unraid_Noob (Author) Posted March 17, 2020 Dear Jonathan, Thank you for your comment. It helped me a lot with the investigation. By checking the IPMI events, I found that the same error always appears before the freeze. Here is the extract: "742 | 03/16/2020, 19:11:49 | CPU_CATERR | Processor | State Asserted - Asserted". By googling, I found that CATERR stands for catastrophic error. I read this article, which explains the error handling process pretty well. Now I have to find what is causing my problem. Denis Edited March 17, 2020 by Unraid_Noob
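For anyone who wants to check their own event log for the same pattern, here is a minimal Python sketch. It assumes the events have been exported as text lines in the same pipe-separated layout as the extract above (record ID, date and time, sensor name, sensor type, description); other BMCs and tools may lay the fields out differently. The names `SelEvent`, `parse_sel_line` and `find_caterr` are made up for this illustration and are not part of any IPMI tool.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class SelEvent(NamedTuple):
    """One event log entry in the layout quoted in the post above."""
    record_id: int
    timestamp: datetime
    sensor: str
    sensor_type: str
    description: str


def parse_sel_line(line: str) -> Optional[SelEvent]:
    """Parse one pipe-separated event line; return None if it does not fit."""
    parts = [p.strip() for p in line.strip().strip('"').split("|")]
    if len(parts) != 5:
        return None
    try:
        record_id = int(parts[0])
        # Assumption: dates are month/day/year, as in "03/16/2020, 19:11:49".
        timestamp = datetime.strptime(parts[1], "%m/%d/%Y, %H:%M:%S")
    except ValueError:
        return None
    return SelEvent(record_id, timestamp, parts[2], parts[3], parts[4])


def find_caterr(lines: Iterable[str]) -> List[SelEvent]:
    """Return every parsed event whose sensor name mentions CATERR."""
    events = (parse_sel_line(line) for line in lines)
    return [e for e in events if e is not None and "CATERR" in e.sensor.upper()]


# The only sample line here is the one quoted in the post.
sample = [
    "742 | 03/16/2020, 19:11:49 | CPU_CATERR | Processor | State Asserted - Asserted",
]

for event in find_caterr(sample):
    print(event.timestamp.isoformat(), event.sensor, event.description)
```

Comparing the timestamps this prints against the times of the freezes is one way to confirm that the CATERR event really does precede each hang.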
Unraid_Noob (Author) Posted July 9, 2020 Hi all, I contacted ASRock support to get an answer about my CPU_CATERR error. After several weeks of investigation and trial and error, they seem to be pointing to a problem with the OS. To summarize the problem: I experience random server freezes with a CPU_CATERR error in the IPMI logs. The only solution is a hard reset of the server. Another interesting aspect is that when the problem occurs, the switch my server is connected to misbehaves and blocks all its other ports as well, meaning all the other hosts connected to the same switch become unreachable. After the hard restart everything returns to normal. I tried:
- Removing all the PCI cards
- Removing all the RAM modules except one
- Replacing the RAM modules with another brand
- Updating the BIOS with a special version provided by ASRock
- Replacing the PSU with another one
None of this helped. Sometimes I can run the server for weeks without any issues, but at other times the server freezes 2 to 3 times in the same day. What can I do to prove to ASRock that the problem is the motherboard itself and nothing else? Thank you for your help. Denis
NeoJoris Posted August 20, 2020 I have much the same issue: I have a dual-socket motherboard with onboard error LEDs, and when these errors happen the LEDs point to the 2nd CPU. I still have some troubleshooting to do (re-seat the CPU in the socket, re-seat the RAM, and if that fails swap the CPUs between sockets), but it is hard when the system keeps running without errors for months even though I did not change anything.
Unraid_Noob (Author) Posted August 26, 2020 Hi NeoJoris, Sad news for you. After weeks of investigation and testing, I ended up requesting a replacement, which was granted. I just received it. Because of the time the investigation process took, I bought a Supermicro board, which will be my permanent motherboard. I will resell the ASRock one. Cheers, Denis
Benedict Eich Posted January 20, 2022 Hi, I got the same freezing and CAT_ERR errors out of nowhere. Sometimes it runs stable, sometimes not. I also have a dual-socket ASUS Z10PA-D8 server board. Did you solve the problem?