New Server - Machine Errors in Log


smdl


Hi, folks.

 

I've been evaluating unRAID for a few months (I like it!) on a very old machine, and am just in the process of setting up an all-new server.  I completed the setup and pre-clear last night, and just brought the array online today.  One of my first steps was to add the CA Fix Common Problems app, and that reported a single hardware error when I ran it.  However, I saw some notes in the forum indicating that a simple re-scan, to see whether the problem persists, is the normal first step.  When I did that, the problem was no longer indicated, so I carried on configuring the system.

 

As things sit currently, unRAID seems happy, with everything in the green.  However, while checking status, I happened to notice hardware errors showing up in the log.  All say that the errors were corrected and that no action is required, but they persist.  I have attached the Diagnostics file to this post, and would really appreciate any help with figuring out what isn't right.  I'm still a novice with unRAID, so apologies if I am missing something obvious.

 

Just in case the Diagnostic file doesn't include the hardware errors, I have included a sample below:

 

Thanks,

Shaun

 

Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000107a9c0
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 14 14:39:33 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 14 14:45:01 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000106ce40
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 14 14:45:01 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x000000000109e1c0
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 14 14:45:01 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 14 14:50:28 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 14 14:50:28 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 14 14:50:28 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]:

gisl-cr1-ur1-diagnostics-20191014-2201.zip

Link to comment

Hmm, just checked, and this error still seems to be occurring about every 5-6 minutes.  Hoping someone has some idea of how to look further into these.  I'll keep searching, as well.

 

Thanks,

Shaun

 

0xdc20000000020151
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Error Addr: 0x00000000011541c0
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 15 14:19:30 GISL-CR1-uR1 kernel: mce: [Hardware Error]: Machine check events logged
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: Corrected error, no action required.
Oct 15 14:19:30 GISL-CR1-uR1 kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]:
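
To put a number on "about every 5-6 minutes" and see which cores are reporting, the syslog can be tallied with a short script. This is only a minimal sketch: the function name `summarize_mce`, the regular expression, and the default year are assumptions made for illustration, and the pattern is written against the line format in the excerpts above, not against any guaranteed kernel log format.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Matches the per-event status line as it appears in the excerpts above, e.g.
# "Oct 14 14:39:33 host kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[...]: 0x..."
STATUS_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*"
    r"\[Hardware Error\]: CPU:(?P<cpu>\d+) .*MC(?P<bank>\d+)_STATUS"
)


def summarize_mce(lines, year=2019):
    """Count events per (cpu, bank) and return the gaps in seconds between bursts.

    Syslog timestamps carry no year, so one is assumed via the `year` argument.
    """
    per_cpu = Counter()
    stamps = []
    for line in lines:
        match = STATUS_LINE.search(line)
        if not match:
            continue
        per_cpu[(int(match["cpu"]), int(match["bank"]))] += 1
        stamps.append(
            datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
        )
    # Several events often share one timestamp, so measure gaps between
    # distinct timestamps rather than between individual events.
    bursts = sorted(set(stamps))
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(bursts, bursts[1:])]
    return per_cpu, gaps


if __name__ == "__main__":
    # Usage (the path is an assumption; pass whichever syslog file you have):
    #   python mce_summary.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as handle:
        counts, intervals = summarize_mce(handle)
    for (cpu, bank), count in sorted(counts.items()):
        print(f"CPU {cpu} bank MC{bank}: {count} event(s)")
    if intervals:
        print(f"mean gap between bursts: {sum(intervals) / len(intervals):.0f} s")
```

If the same two cores keep turning up, as CPU 6 and CPU 14 do in the samples here, that is worth mentioning when asking for help.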

Link to comment

Thanks, folks -- I really appreciate the help. 

 

I actually did update the BIOS, right after I posted my last message, as there was a newer version available.  The update was successful, which is great, but now the system won't boot.  It consistently gives the following message:

 

SYSLINUX 6.03 EDD Load error - Boot error

 

I was just starting to look at that when I had to take my son to an event, so I am just getting back to it now.  Seems it might have something to do with the USB flash drive, but it seems suspicious that the problem started immediately after I updated the BIOS.

 

I'll see what I can dig up.  Any suggestions or advice would be gratefully accepted.

 

Cheers,

Shaun

Link to comment

Okay, I was finally able to get the system back up and working again, but I had to recreate the USB boot drive to do it.  Also good news, the errors don't seem to be recurring... yet.  Will need to keep watching this for a while to see if it comes back in time.

 

Thanks again for the help with this!

 

Cheers,

Shaun

Link to comment

Uh oh.  Spoke too soon. 

 

Again, unRAID seems to be working great, but these errors continue to appear in the background.  I saw a note in CA Fix Common Problems:

 

You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged

 

I'm not sure I should be playing in NerdPack too much yet, and am not sure I actually have it installed.   I did find NerdPack GUI, and have that installed, but can't really figure out how to do much in there, yet.  Will keep searching as it would be nice to have the errors decoded.

 

Here is another sample of the errors.

 

Oct 15 20:20:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Corrected error, no action required.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: CPU:14 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Error Addr: 0x0000000001073ac0
Oct 15 20:20:01 Tower kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Oct 15 20:20:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Corrected error, no action required.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: CPU:6 (17:8:2) MC1_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-]: 0xdc20000000020151
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Error Addr: 0x0000000001071bc0
Oct 15 20:20:01 Tower kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000002a010002
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 2
Oct 15 20:20:01 Tower kernel: [Hardware Error]: Instruction Fetch Unit Error: IC full tag parity.
Oct 15 20:20:01 Tower kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
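
For anyone wanting to see how the bracketed flags relate to the raw `MC1_STATUS` value, here is a minimal sketch that unpacks `0xdc20000000020151` by hand. The bit positions are assumptions taken from the Linux kernel's x86 machine-check definitions as I understand them, and the function and table names are made up for illustration. This is not mcelog and not the kernel's decoder. Check it against the AMD documentation for your CPU before relying on it.

```python
# Flag bit positions (assumed, from the Linux kernel's x86 MCE definitions).
FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "SyndV": 53,
    "Deferred": 44, "Poison": 43,
}

# Lookup tables for the low 16-bit error code when it is in the
# "memory error" format 0000 0001 RRRR TTLL (assumed layout).
CACHE_LEVEL = ["resv", "L1", "L2", "L3/GEN"]
TRANSACTION = ["INSN", "DATA", "GEN", "RESV"]
MEM_TRANSACTION = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_mca_status(status):
    """Unpack a 64-bit MCA status value into flags and error-code fields."""
    error_code = status & 0xFFFF
    decoded = {
        "flags": [name for name, bit in FLAGS.items() if status >> bit & 1],
        # Reported as corrected when neither the UC nor the Deferred bit is set.
        "corrected": not (status >> 61 & 1) and not (status >> 44 & 1),
        # Six-bit extended error code field, assumed to sit at bits 16..21.
        "extended_error_code": (status >> 16) & 0x3F,
        "error_code": error_code,
    }
    if error_code & 0xFF00 == 0x0100:  # memory-error format only
        rrrr = (error_code >> 4) & 0xF
        decoded["cache_level"] = CACHE_LEVEL[error_code & 0x3]
        decoded["transaction"] = TRANSACTION[(error_code >> 2) & 0x3]
        decoded["memory_transaction"] = (
            MEM_TRANSACTION[rrrr] if rrrr < len(MEM_TRANSACTION) else "-"
        )
    return decoded


if __name__ == "__main__":
    for key, value in decode_mca_status(0xDC20000000020151).items():
        print(f"{key}: {value}")
```

The fields it produces for this value line up with what the kernel already printed in the log (Over, MiscV, AddrV, SyndV, extended error code 2, L1 / INSN / IRD), so the sketch adds no new information. It only shows where each piece of the log line comes from.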

 

Cheers,

Shaun

Link to comment
  • 2 weeks later...

Greetings, all.

 

Well, the system continues to operate without apparent issue, although it's not really in heavy use, yet.  Until I have a sense of what might be causing this, I'm hesitant to depend upon it.

 

I've looked up the errors online, but all I really see are posts relating to Linux kernel and AMD error reporting conventions, which really go over my head.  I'll keep trying to understand it, but these seem to be more about how errors are reported, rather than what their potential meaning might be.  Hoping someone out there might have additional ideas as to where to look.

 

Any help would be sincerely appreciated.

 

Cheers,

Shaun

Link to comment

As mentioned, this system seems to operate fine from an unRAID perspective, but I will add that the array takes much longer to start than on my old machine.  The old machine is an AMD Phenom 9500 quad core at 2200 MHz with 4GB DDR2 and 5 SATA drives (largest 2TB).  The new one is a Ryzen 7 2700X at 3700 MHz with 32GB DDR4 and 4 SATA drives (largest 4TB).  It seems odd to me that the new array would take about twice as long to start, and I wonder if this is an indication of a performance problem related to the errors.  Are there other factors that might contribute to slower startup?

 

Thanks,

Shaun

 

 

Edited by smdl
Link to comment
  • 4 weeks later...

Hi, folks.

 

Just to close the loop on this, after extensive testing and parts-swapping, I was able to confirm that the problem was with the processor.  I have since replaced the processor with a new one (same model), and the errors have stopped. 

 

Sincere thanks to those who offered input.

 

Cheers,

Shaun

Link to comment
