(Solved) Random crashes, I'm now stumped


Ixel


Hi,

Over the last few weeks I've been having random crashes that have become steadily more frequent. What started as perhaps one crash every two weeks became one a week, and now it's virtually every day.

 

I've run a memtest, which sadly found no errors. I also temporarily removed the NVMe cache drives, because during some of the crashes I could only write to the HDDs and not the NVMe drives until I did a hard reset, but that made no difference. I've upgraded to 6.10rc1 because of the macvlan crash; I originally had host access to containers enabled (that's now disabled and I'm using ipvlan instead). The voltages reported via IPMI look fine, so I doubt it's a developing power supply fault. At this point I'm stumped. The hardware (other than the hard drives) is pretty new too, and it was reliable for quite some time.

 

I've now enabled mirroring of the syslog to flash, and I've attached a snippet of what I was able to retrieve before I had to hard reset again. Next time it crashes I'll be able to get a full syslog.
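For anyone sifting through a mirrored syslog after a crash like this, a rough sketch of a first-pass filter might look like the following. The pattern list is only an assumption about strings that often show up in Linux kernel logs around a crash (call traces, panics, machine-check errors, macvlan or NVMe messages), and `scan_syslog`, `scan_file` and the sample lines are hypothetical illustrations rather than anything taken from the attached files.

```python
import re
from collections import Counter

# Strings that commonly appear in Linux kernel logs around a crash or hang.
# This list is an illustrative assumption, not an exhaustive or official set.
PATTERNS = {
    "call_trace": re.compile(r"Call Trace:"),
    "kernel_panic": re.compile(r"Kernel panic"),
    "kernel_bug": re.compile(r"kernel BUG at"),
    "machine_check": re.compile(r"Hardware Error|Machine check", re.IGNORECASE),
    "macvlan": re.compile(r"macvlan"),
    "nvme": re.compile(r"nvme.*(timeout|reset|I/O error)", re.IGNORECASE),
}


def scan_syslog(lines):
    """Return (counts, hits): matches per pattern and the matching lines."""
    counts = Counter()
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                hits.append((number, name, line.rstrip()))
    return counts, hits


def scan_file(path):
    """Scan a syslog file on disk; the path is whatever the mirrored log is called."""
    with open(path, errors="replace") as handle:
        return scan_syslog(handle)


# Made-up sample lines, only to show the shape of the result.
sample = [
    "Sep  1 16:00:01 Tower kernel: Call Trace:",
    "Sep  1 16:00:02 Tower kernel: nvme nvme0: I/O 12 QID 3 timeout, aborting",
    "Sep  1 16:00:03 Tower kernel: all fine",
]
counts, hits = scan_syslog(sample)
for number, name, line in hits:
    print(f"{number}: [{name}] {line}")
```

This only narrows down where to look; the lines just before the first match are usually the ones worth reading by hand.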

 

I'm hoping someone possibly has an idea of what might be causing this problem, beyond the "it could be the motherboard, CPU, memory, hard drives or power supply" which sadly doesn't narrow things down much.

 

Let me know if you have any questions. Thanks in advance.

 

Basic summary of specs:

AMD Threadripper Pro 3995WX 64-Core CPU

512GB DDR4 ECC RDIMM (64GB x 8 at 3200MHz), Kingston Server Premier

ASUS WRX80-E SAGE Wifi Motherboard

Corsair AXi 1200 PSU

ASUS ROG 1080Ti OC GPU

Samsung 970 Pro 512GB NVMe x 4

Western Digital Red NAS drives for general storage and parity, 2 x 10TB and 3 x 4TB

 

EDIT: It looks like lowering the memory clock speed (below the speed that Kingston's website officially lists as compatible with my motherboard), disabling global C-states control, or both, has solved the instability. I've not had any issues so far since changing those settings. Thanks for the help! Fingers crossed it stays this way.

unraid_syslog_snippet.txt tower-diagnostics-20210901-1659.zip

Edited by Ixel
Possibly solved
1 minute ago, ChatNoir said:

Hello,

Did you check this part of the FAQ?

 

 

Hi,

Thanks for replying. I did not, sorry. I have just read it now though and will see what I can find in the BIOS related to that and make the appropriate changes. I'll let you know how it goes, thanks! 👍

2 minutes ago, Ixel said:

Thanks for replying. I did not, sorry. I have just read it now though and will see what I can find in the BIOS related to that and make the appropriate changes. I'll let you know how it goes, thanks! 👍

Also consider memory speed. I glanced at it and it seems you have 8x dual rank DIMMs.

Not sure you are running it at a speed supported by the memory controller.

39 minutes ago, ChatNoir said:

Also consider memory speed. I glanced at it and it seems you have 8x dual rank DIMMs.

Not sure you are running it at a speed supported by the memory controller.

 

According to Kingston it should be supported; however, I've manually set them to 2666 now. Global C-states control is now disabled too. Fingers crossed it solves the problem.
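To double-check that a BIOS change like this actually took effect, one option is to compare the rated and configured speeds that `dmidecode -t memory` reports for each DIMM. The sketch below assumes the output contains `Speed:` and `Configured Memory Speed:` fields (older dmidecode versions label the latter `Configured Clock Speed:` and may use MHz); `memory_speeds` and the sample text are hypothetical, not output from this server.

```python
import re

# Matches e.g. "Speed: 3200 MT/s" or "Configured Memory Speed: 2666 MT/s".
# Assumption: field names follow dmidecode's usual output; both the newer
# "Configured Memory Speed" and older "Configured Clock Speed" are accepted.
FIELD = re.compile(
    r"^\s*(Speed|Configured (?:Memory|Clock) Speed):\s*(\d+)\s*(?:MT/s|MHz)",
    re.MULTILINE,
)


def memory_speeds(dmidecode_text):
    """Return (rated, configured) speed lists parsed from dmidecode text."""
    rated, configured = [], []
    for label, value in FIELD.findall(dmidecode_text):
        if label == "Speed":
            rated.append(int(value))
        else:
            configured.append(int(value))
    return rated, configured


# Made-up sample resembling one populated slot and one empty slot.
sample = (
    "Memory Device\n"
    "\tSize: 64 GB\n"
    "\tSpeed: 3200 MT/s\n"
    "\tConfigured Memory Speed: 2666 MT/s\n"
    "Memory Device\n"
    "\tSize: No Module Installed\n"
    "\tSpeed: Unknown\n"
)
rated, configured = memory_speeds(sample)
print("rated:", rated, "configured:", configured)
```

If the configured values still match the rated ones after the change, the manual setting probably wasn't saved or is being overridden by a memory profile.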

  • Ixel changed the title to (Solved) Random crashes, I'm now stumped
