Need help with random system crashes (Threadripper)


krh1009


I've been battling random system lockups for the last month and need help. I upgraded my RAM from 16 GB to 32 GB and started having crashes after 12 to 24 hours of uptime. Things I've tried so far:

  • Updated the BIOS to the latest version
  • Reset the BIOS to its default config
  • Replaced my Marvell HBA with an LSI
  • Switched the HBA to a different PCIe slot
  • Updated Unraid to 6.8.2

None of these helped. I've switched syslog logging to the flash drive. A copy of the syslog from right after a crash and the diagnostics files are attached.

 

Any help would be appreciated!

syslog crash mediaserv-diagnostics-20200204-0523.zip


Thank you for the suggestion! I didn't think brand-new RAM could have a problem, but once again I was wrong. I have 4 × 8 GB modules installed. Is there a way to figure out which stick is bad, or should I just start over with all new RAM?

 

Using: G.SKILL Ripjaws V Series DDR4 PC4-25600 3200 MHz, model F4-3200C16D-16GVKB

 

Any recommendations on a replacement?

 


22 minutes ago, Gragorg said:

Try reseating the RAM first and running memtest. Sometimes that is all it is. If that doesn't fix it, you can pull RAM sticks until memtest no longer shows errors, which isolates the bad stick.

Thanks. I'm going to reseat the RAM, test each pair separately, and see if I get errors.


So I tested each pair separately (two tests of 16 GB). Both tests showed no errors, so I know the modules themselves are not the problem. When I insert all four at once, I get a ton of errors.

 

11 hours ago, jpowell8672 said:

Also make sure the CPU is seated properly. This is an easily overlooked problem with Threadripper CPUs. All three CPU hold-down screws must be torqued in the proper sequence and to the proper amount for the CPU to seat correctly. I had a memory issue with mine at first, and reseating the CPU fixed it for me.

 

I think reseating the CPU might be what is needed. Do I need to remove the CPU completely and remount it, or can I just loosen the bracket and retighten the screws in the proper sequence?

 

Thanks again for the help

9 minutes ago, jpowell8672 said:

If you are unable to resolve the issue with your current RAM and decide to buy new RAM, Samsung B-die chips work best with first- and second-generation Threadripper, if you can find some.

 

https://benzhaomin.github.io/bdiefinder/

 

Thanks for the video and the brand recommendation. My plan is to 1) lower the memory speed; if no luck, 2) reseat the CPU; if still no luck, 3) buy more RAM.


OK... so I took the coward's way out. I didn't feel like removing the cooler, cleaning off the TIM, and reseating the CPU, so I ordered 2 × 16 GB sticks of RAM and placed them in the lower two (working) slots. Memtest showed a 100% pass, so I think I'm good for now. One long rainy weekend I'll attempt to reseat the CPU, which I think will solve the problem completely.
