New Server in the Making [Advice]


Jackal


Hi all,

 

I recently got my hands on a SuperMicro X10DAI with (2x) E5-2630v4 CPUs and (6x) 8GB RDIMMs of RAM, which will become a multipurpose server. So far so good ;-)

 

What I found out today, after testing the RAM for the second time within a week, is that I get some ECC errors. Well, it is just 2, but should I be worried, not worried, semi-worried, or just forget about it and re-check after some number of days/months? It seems that the system responds and corrects the errors, but...?

 

Furthermore, since I cannot find an answer in the manual: which one is Channel 4? Other manuals I have seen give detailed information about RAM population. In this one I cannot make sense of the "Fill First" method. My guess is:

------------------------- DIMM_00 ------- DIMM_02

Channel_00 --> P1-DIMMA1 & P1-DIMMA2

Channel_01 --> P1-DIMMB1 & P1-DIMMB2

Channel_02 --> P1-DIMMC1 & P1-DIMMC2

Channel_03 --> P1-DIMMD1 & P1-DIMMD2

 

Channel_00 --> P2-DIMME1 & P2-DIMME2

Channel_01 --> P2-DIMMF1 & P2-DIMMF2

Channel_02 --> P2-DIMMG1 & P2-DIMMG2

Channel_03 --> P2-DIMMH1 & P2-DIMMH2
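
For anyone who wants to work with this mapping in a script, here is a minimal Python sketch of the guess above. It is only the guessed layout from this post, not something confirmed by the manual, and the names `SlotInfo` and `parse_slot` are made up for illustration. Channels are numbered from 0 here, the same way as in the list above.

```python
import re
from typing import NamedTuple


class SlotInfo(NamedTuple):
    cpu: int        # 1 for P1-..., 2 for P2-...
    channel: int    # 0..3, following the guessed layout (A/E -> 0, D/H -> 3)
    position: int   # 1 or 2, the trailing digit of the slot label


# Slot labels as silkscreened on the board, e.g. "P1-DIMMA1" or "P2-DIMMH2".
_LABEL = re.compile(r"^P(\d)-DIMM([A-H])(\d)$")


def parse_slot(label: str) -> SlotInfo:
    """Map a slot label to (cpu, channel, position) under the guessed layout."""
    match = _LABEL.match(label.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised slot label: {label!r}")
    cpu = int(match.group(1))
    # ASSUMPTION: letters A-D are channels 0-3 of CPU1, E-H are channels 0-3 of CPU2.
    channel = (ord(match.group(2)) - ord("A")) % 4
    return SlotInfo(cpu, channel, int(match.group(3)))


if __name__ == "__main__":
    for name in ("P1-DIMMA1", "P1-DIMMD2", "P2-DIMME1", "P2-DIMMH2"):
        print(name, parse_slot(name))
```

Whether the tool reporting the error counts channels from 0 or from 1 is not known here, so a reported "Channel 4" cannot be pinned to one slot from this sketch alone.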

 

[Attached image: IMG_9501.jpg]


Both of them are on the same channel/slot. Either there is something wrong with that DIMM Module, the board, the slot or the CPU.

Because this is an LGA2011 socket, I would **HIGHLY** recommend swapping a known good memory module into that slot.
Retest with the memory swapped. If the issue follows the DIMM, you've got a bad stick of RAM.
If the issue stays with the slot, I would then pull the CPU associated with that memory slot (look in the motherboard manual to figure out which CPU slot 4 is attached to), clean the CPU contacts **NOT THE SOCKET CONTACTS** with isopropyl alcohol and put it all back together. It's extremely common to get a false seat on the CPU on the LGA2011 socket, resulting in issues just like this one.
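
The "follows the DIMM or stays with the slot" reasoning above can be written down as a small Python sketch. This is only an illustration of the logic, not a diagnostic tool: the `Run` record and the `interpret` function are hypothetical names, and the error counts have to be noted by hand after each test run.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Run(NamedTuple):
    dimm: str    # your own name for the physical module, e.g. a sticker number
    slot: str    # slot it was installed in for this run, e.g. "P1-DIMMD1"
    errors: int  # ECC errors reported for that slot during the run


def interpret(runs: Iterable[Run]) -> str:
    """Say whether errors follow a module or stay with a slot."""
    failing = [r for r in runs if r.errors > 0]
    if not failing:
        return "no errors observed"

    slots_per_dimm: Dict[str, Set[str]] = {}
    dimms_per_slot: Dict[str, Set[str]] = {}
    for r in failing:
        slots_per_dimm.setdefault(r.dimm, set()).add(r.slot)
        dimms_per_slot.setdefault(r.slot, set()).add(r.dimm)

    # A module that errors in two or more different slots points at the module.
    follows_dimm = sorted(d for d, s in slots_per_dimm.items() if len(s) >= 2)
    # A slot that errors with two or more different modules points at the
    # slot, the board, or the CPU seating.
    stays_with_slot = sorted(s for s, d in dimms_per_slot.items() if len(d) >= 2)

    if follows_dimm and not stays_with_slot:
        return "suspect DIMM: " + ", ".join(follows_dimm)
    if stays_with_slot and not follows_dimm:
        return "suspect slot/CPU seating: " + ", ".join(stays_with_slot)
    return "inconclusive, keep swapping"


if __name__ == "__main__":
    print(interpret([
        Run("stick-3", "P1-DIMMD1", 2),
        Run("stick-1", "P1-DIMMD1", 1),
        Run("stick-3", "P1-DIMMA1", 0),
    ]))
```

A single failing run is reported as inconclusive on purpose: one data point cannot separate the module from the slot.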


Thanks a lot @Xaero. It seems that I have some testing in front of me.

 

The thing is that I do not have a known good memory module to swap in. I am testing the ones the system came with to see which are bad and which are not.

 

20 minutes ago, Xaero said:

Both of them are on the same channel/slot. Either there is something wrong with that DIMM Module, the board, the slot or the CPU.

 

Now you are scaring me!!! I hope this turns out to be nothing more than a bad memory module...

 

19 hours ago, JonathanM said:

This.

 

Zero is the only acceptable number of errors.

 

I have now started testing the modules in pairs to find the faulty one. So far 2 of the 6 DIMMs have passed the stress tests without producing any ECC errors. I am now wondering whether something is wrong with a specific RAM slot, as @Xaero suggested. Let us see... 4 more hours remain before I know what is going on with these two DIMMs.

 

On 11/17/2021 at 10:59 AM, Jackal said:

 

I have now started testing the modules in pairs to find the faulty one. So far 2 of the 6 DIMMs have passed the stress tests without producing any ECC errors. I am now wondering whether something is wrong with a specific RAM slot, as @Xaero suggested. Let us see... 4 more hours remain before I know what is going on with these two DIMMs.

 

Keep in mind that they may very well all pass in those first two slots. The slot in question may be where the error occurs, but that doesn't always mean the slot itself is bad. The memory controller isn't on the northbridge anymore; it's on the CPU itself. Because the LGA2011 socket (and its child variants, both narrow and wide ILM) is so large, it's really easy to get too little mounting pressure on the first go-around and miss just a couple of pins, or end up with slightly too much resistance on a couple of pins, and lose a memory channel/slot. If you find out that just one slot has the problem, pop that CPU out, clean it and put it back in. With any luck the problem will just be gone.
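
To check whether the errors really concentrate on one slot, the corrected-error counters can be read per DIMM. The sketch below assumes a Linux host with an EDAC driver loaded, which this thread does not confirm, and it reads the `dimm_label` and `dimm_ce_count` files that the kernel EDAC sysfs interface exposes under `/sys/devices/system/edac/mc`. The function name `corrected_errors_by_dimm` is made up for illustration, and labels are often empty unless the board vendor or the admin has set them.

```python
from pathlib import Path
from typing import Dict

# ASSUMPTION: Linux with an EDAC driver loaded; this directory does not
# exist on other systems or when no EDAC driver is active.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def corrected_errors_by_dimm(root: Path = EDAC_ROOT) -> Dict[str, int]:
    """Return {dimm label or 'mcN/dimmM': corrected error count}."""
    counts: Dict[str, int] = {}
    for dimm_dir in sorted(root.glob("mc*/dimm*")):
        count_file = dimm_dir / "dimm_ce_count"
        if not count_file.is_file():
            continue  # not a per-DIMM directory
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        # Fall back to the sysfs location when no label has been set.
        key = label or f"{dimm_dir.parent.name}/{dimm_dir.name}"
        counts[key] = counts.get(key, 0) + int(count_file.read_text().strip())
    return counts


if __name__ == "__main__":
    for name, count in corrected_errors_by_dimm().items():
        flag = "  <-- errors here" if count else ""
        print(f"{name}: {count}{flag}")
```

If all the non-zero counts land on the same entry across several runs, that matches the "just one slot has the problem" case described above.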

24 minutes ago, Xaero said:

Keep in mind that they may very well all pass in those first two slots. The slot in question may be where the error occurs, but that doesn't always mean the slot itself is bad. The memory controller isn't on the northbridge anymore; it's on the CPU itself. Because the LGA2011 socket (and its child variants, both narrow and wide ILM) is so large, it's really easy to get too little mounting pressure on the first go-around and miss just a couple of pins, or end up with slightly too much resistance on a couple of pins, and lose a memory channel/slot. If you find out that just one slot has the problem, pop that CPU out, clean it and put it back in. With any luck the problem will just be gone.

 

Hi @Xaero. I really appreciate you following up on my problem. Had it not been for this problem, I would not have gathered all this information.

 

So far all the DIMMs have tested successfully in pairs, as well as 4 of them at the same time. Now I am waiting for a test with all 6 of them installed, using every slot except the "problematic" one. That may not be the motherboard's recommended mode of operation, but what else can I do? The next step is to test the "problematic" slot once again and see if the error is reproduced. After that I will definitely follow your advice and pop the CPU out for inspection and cleaning during the weekend. I have done that many times before, but the last time I did it, I discovered 2 bent pins in the CPU socket and it took me a while to bring the system back to a working state. Not a very pleasant memory ;-). But I will definitely do it if necessary.

 

Thankfully all this will be over pretty soon, as I want to move on, choose a case, and start putting the parts together...!!!
