Identifying which slot the bad RAM is in

Spartacus09 · July 16, 2019

Is there a command to identify which memory slot this error refers to, or to list all of the slots? I'm assuming it's likely A1 of the 8 slots.

I'm not sure whether the numbering starts at channel #1 or channel #0, though, so it might be A2 (the motherboard manual calls A1 channel A).

Jul 14 21:33:07 unRAID kernel: mce: [Hardware Error]: Machine check events logged
Jul 14 21:33:07 unRAID kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jul 14 21:33:07 unRAID kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0x109a826 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:1 channel_mask:2 rank:0)
Jul 15 04:21:17 unRAID kernel: mce: [Hardware Error]: Machine check events logged
Jul 15 04:21:17 unRAID kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jul 15 04:21:17 unRAID kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0x109a826 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:1 channel_mask:2 rank:0)
Jul 15 04:40:06 unRAID root: Fix Common Problems: Error: Machine Check Events detected on your server
Jul 15 11:31:07 unRAID kernel: mce: [Hardware Error]: Machine check events logged
Jul 15 11:31:07 unRAID kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jul 15 11:31:07 unRAID kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0x109a826 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:1 channel_mask:2 rank:0)
Jul 15 18:31:29 unRAID kernel: mce: [Hardware Error]: Machine check events logged
Jul 15 18:31:29 unRAID kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jul 15 18:31:29 unRAID kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0x109a826 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:1 channel_mask:2 rank:0)
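For reading these lines, it can help to pull the location fields out of each EDAC message and compare them across entries. Below is a minimal Python sketch that does this for the line format quoted above. The regular expression is an assumption based only on these sample lines (other EDAC drivers format their messages differently), and `parse_edac_line` is a hypothetical helper name. It only extracts the controller, channel, and slot numbers; mapping them to the silkscreen label on the board (A1, A2, ...) still depends on the motherboard.

```python
import re

# Assumed pattern, derived only from the sample syslog lines in this thread.
# Other EDAC drivers or kernel versions may format the message differently.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<severity>CE|UE) .*? on "
    r"(?P<label>\S+) \(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+)"
)


def parse_edac_line(line):
    """Return the location fields of one EDAC error line, or None if no match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the purely numeric fields; label and page stay as strings.
    for key in ("mc", "count", "channel", "slot"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "Jul 14 21:33:07 unRAID kernel: EDAC MC1: 1 CE memory scrubbing error on "
    "CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:1 slot:0 page:0x109a826 "
    "offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 "
    "socket:0 ha:1 channel_mask:2 rank:0)"
)
print(parse_edac_line(sample))
```

All four errors above report the same controller, channel, slot, and page, which suggests a single DIMM rather than several.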

Edited July 17, 2019 by Spartacus09

Frank1940 · July 16, 2019

You are not alone in this problem. What I would suggest you do is start by googling CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0

That will get you started. Apparently, you already have your MB manual for reference. You could try modifying the search parameters to include your MB spec and see if that gives you more specific help. Probably, by applying logic and knowledge to your particular situation, some type of pattern will become obvious. (Nobody intentionally confused how the information is displayed in the syslog, but each manufacturer seems to have its own slot numbering nomenclature ...)

Spartacus09 · July 16, 2019

57 minutes ago, Frank1940 said:

You are not alone in this problem. What I would suggest you do is start by googling CE memory scrubbing error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0

That will get you started. Apparently, you already have your MB manual for reference. You could try modifying the search parameters to include your MB spec and see if that gives you more specific help. Probably, by applying logic and knowledge to your particular situation, some type of pattern will become obvious. (Nobody intentionally confused how the information is displayed in the syslog, but each manufacturer seems to have its own slot numbering nomenclature ...)

Thanks. A guy here was also getting a channel 0 DIMM 0 error on a Supermicro board, and it sounds like Supermicro labels start at 0: https://serverfault.com/questions/792225/how-to-find-which-memory-has-ce-error

Looks like it's likely slot A2 then, I'll give that a shot.

What are the steps to clear that hardware error out of the logs so I can see if it comes back? (I updated Unraid versions previously, which cleared it, and the error didn't recur until a week or so later.)

Edited July 16, 2019 by Spartacus09

Spartacus09 · July 19, 2019

On 7/16/2019 at 11:10 AM, Spartacus09 said:

What are the steps to clear that hardware error out of the logs so I can see if it comes back? (I updated Unraid versions previously, which cleared it, and the error didn't recur until a week or so later.)

Restarting apparently clears the associated errors, or at least there are no errors now after replacing the RAM in A1/A2.
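To watch for the error coming back without searching the syslog, the kernel's EDAC counters can be read from sysfs. The following is a minimal Python sketch, assuming an EDAC driver is loaded and exposes per-DIMM entries (`dimm_label`, `dimm_location`, `dimm_ce_count`) under `/sys/devices/system/edac/mc`. That layout depends on the kernel version and driver, so treat the paths as an assumption to verify on your own system. `read_dimm_ce_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_dimm_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Collect (label, location, corrected-error count) for each DIMM entry.

    Assumes the per-DIMM sysfs layout mc*/dimm*/; returns an empty dict if
    the directory is missing (for example, when no EDAC driver is loaded).
    """
    results = {}
    root = Path(edac_root)
    if not root.is_dir():
        return results
    for dimm in sorted(root.glob("mc*/dimm*")):
        try:
            label = (dimm / "dimm_label").read_text().strip()
            location = (dimm / "dimm_location").read_text().strip()
            ce_count = int((dimm / "dimm_ce_count").read_text().strip())
        except (OSError, ValueError):
            # Skip entries that lack these files or hold unexpected content.
            continue
        results[f"{dimm.parent.name}/{dimm.name}"] = (label, location, ce_count)
    return results
```

Calling `read_dimm_ce_counts()` now and again a few days later, then comparing the counts, shows whether any DIMM is still accumulating corrected errors.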
