Intel Socket 1151 Motherboards with IPMI AND Support for iGPU



Thank you @Hoopster for this deal/thread. I upgraded my almost 9-year-old server.

 

I'm running into an odd issue and trying to determine whether it's hardware or software. My system runs fine until I start a parity check. During the parity check, unRAID freezes, but I am still able to reach the IPMI interface. I'm using BMC Firmware 1.80.00 and BIOS Firmware L2.21A. My IPMI event log shows CPU_CATERR.

 3    | 07/04/2020, 08:34:35 | CPU_CATERR       | Processor                          | State Asserted - Asserted
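
(For anyone who wants to pull the same detail without going through the web UI, ipmitool can read the SEL remotely; the host and credentials below are placeholders.)

# list all SEL entries with timestamps
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist
# full detail for a single record (ID 3 above)
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel get 3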

I ran a syslog server but I don't see anything related.

<30>Jul  4 02:46:03 DeathStar ool www[17368]: /usr/local/emhttp/plugins/dynamix/scripts/rsyslog_config    192.168.1.5    04/07 02:46:03.013    
<46>Jul  4 02:46:06 DeathStar rsyslogd: [origin software="rsyslogd" swVersion="8.1908.0" x-pid="17979" x-info="https://www.rsyslog.com"] start    192.168.1.5    04/07 02:46:05.473    
<4>Jul  4 02:46:32 DeathStar kernel: mdcmd (88): spindown 3    192.168.1.5    04/07 02:46:31.123    
<4>Jul  4 02:46:32 DeathStar kernel: mdcmd (89): spindown 6    192.168.1.5    04/07 02:46:32.001    
<4>Jul  4 02:46:34 DeathStar kernel: mdcmd (90): spindown 4    192.168.1.5    04/07 02:46:33.832    
<4>Jul  4 02:46:35 DeathStar kernel: mdcmd (91): spindown 5    192.168.1.5    04/07 02:46:34.483    
<4>Jul  4 02:46:45 DeathStar kernel: mdcmd (92): check     192.168.1.5    04/07 02:46:44.087    
<4>Jul  4 02:46:45 DeathStar kernel: md: recovery thread: check P ...    192.168.1.5    04/07 02:46:44.087    
<13>Jul  4 03:00:01 DeathStar Recycle Bin: Scheduled: Files older than 7 days have been removed    192.168.1.5    04/07 03:00:00.645    
<77>Jul  4 03:40:21 DeathStar crond[2295]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null    192.168.1.5    04/07 03:40:20.584    

I also downloaded the MCA Log (attached). It looks like the CPU may be defective, but I've never seen anything like it.

CPU1 Do not Present!!
Get CPU0 MCA Error Source Log failed: CC = 0x81
Get CPU core number failed: CC = 0x81. Default catch 28 cores.
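
(The corresponding machine-check events, if the kernel captured any before the hang, can also be cross-checked from the unRAID console with standard Linux commands; nothing here is IPMI-specific.)

# machine-check entries in the kernel ring buffer
dmesg | grep -iE 'mce|machine check'
# and anything already written to the running syslog
grep -i mce /var/log/syslog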

Diagnostics also attached, run just after rebooting.

 

Any help is appreciated, and I can move this to its own topic if needed.

MCALog.txt deathstar-diagnostics-20200704-1144.zip

Edited by LateNight
Link to comment
58 minutes ago, LateNight said:

During the parity check, unRAID freezes, but I am still able to reach the IPMI interface. I'm using BMC Firmware 1.80.00 and BIOS Firmware L2.21A. My IPMI event log shows CPU_CATERR.


 3    | 07/04/2020, 08:34:35 | CPU_CATERR       | Processor                          | State Asserted - Asserted

I have seen the CPU_CATERR as well, but never during a parity check. 

 

Since I am running BOINC 24x7 and have most of the cores/threads assigned to it, my CPU has been under a constant heavy load.  I saw the CPU_CATERR when I had Turbo Boost enabled, as the CPU was constantly running in the mid-80s °C range at clock speeds from 4.7 GHz up to 5.0 GHz.

 

After some time (many hours to several days) running at these temps and speeds, the server would lock up.

 

I disabled Turbo Boost through the Tips and Tweaks plugin and have not had any more problems.  Temps have never exceeded 56°C since. When I am done with BOINC, I intend to enable it again and see what happens in "normal" usage patterns.
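
For anyone who wants to check or flip this outside the plugin, my understanding is that it comes down to the intel_pstate no_turbo switch (this assumes the kernel is using the intel_pstate driver, which it should be on these CPUs):

# 1 = Turbo Boost disabled, 0 = enabled
cat /sys/devices/system/cpu/intel_pstate/no_turbo
# disable Turbo Boost until the next reboot
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo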

 

Fortunately, I have never seen the errors reported in your MCA log, but it does not look good.

 

Whether or not your CPU has problems, I can't say for sure, but a parity check should not lock it up and cause a CPU_CATERR.  I have run several parity checks since I started running BOINC and have had zero problems with both running simultaneously.  Of course, I also have Turbo Boost disabled, so perhaps that makes a difference.

Link to comment

Thank you @Hoopster. I read earlier in the thread that you had disabled Turbo Boost, but I wrote it off since it seemed temperature-related.  I have not seen anything in my system reach over 40°C (except that MB sensor showing 85°C), and watching most of the parity check, the CPU never went above 38°C. However, after disabling Turbo Boost, I was able to complete a parity check. 

Link to comment

I have started getting lockups on my setup as well.  I checked the IPMI logs and it showed "CPU_CATERR" as well.   My temps at idle are now around 39°C to 42°C.  I just now disabled Turbo Boost. Hopefully this will help until I get my case temps down.

Link to comment
50 minutes ago, JM2005 said:

I have started getting lockups on my setup as well.  I checked the IPMI logs and it showed "CPU_CATERR" as well.   My temps at idle are now around 39°C to 42°C.  I just now disabled Turbo Boost. Hopefully this will help until I get my case temps down.

@JM2005 - when you click on that entry in the IPMI log what does it show? On my system it shows:

[screenshot of the expanded CPU_CATERR entry in the IPMI event log]

Link to comment

Greetings, Hoopster. First off, thank you for being the trailblazer for this specific unRAID build, and to everyone else who has shared their experiences in this thread (I've read through page 8 so far). I bought pretty much your same setup, except my drives are 12TB WD Reds. I must say, IPMI is so refreshing. Having that kind of insight and control at the hardware level is really handy!

 

I was originally running unRAID on an ASRock P67 board and an Intel i5-2500K with 16GB of RAM (my old gaming rig repurposed for data hoarding :)). That ran very stably for a few months before I decided to go for more horsepower and scalability. After doing some research, I came upon your build and thought, "This is exactly what I need!" I bought my package from IMC on eBay as you had done. Really a helpful, honest and all-around nice guy. The ASRock E3C246D4U rack board with an Intel Xeon E-2288G and 64GB of ECC RAM offers so much performance overhead (hoping for more stability)!

With that said, the transition to the new hardware went pretty flawlessly, thanks in part to the brilliant software engineers at unRAID, but I'm having the same intermittent lockups as you (over a dozen in the last 3 days). IPMI logs the event as "CPU_CATERR | Processor | State Asserted - Deasserted". They have occurred when the server was virtually idle (just pre-clearing a 12TB HDD overnight), and it locked up with just an hour left of the clear (I had to power the server off and reboot it through IPMI). I've since attempted to re-preclear that same drive over six times; it seems I can't stay stable long enough to complete it, as the intermittent lockups plague me.

Running Folding@home definitely accelerates the lockup, but my temps don't seem high enough to cause it. The PCH, board and CPU don't get above 76°C per the probes reporting in IPMI. Running Emby and a nearly idle Windows 10 VM results in a lockup anywhere from 15 minutes to nearly 2 hours into direct streaming a movie through Emby (no one else on the server at the time). You would think IPMI reporting would flag a thermal event if that were the root cause, but all I see in the log are CPU_CATERR events that coincide with these lockups. I have since removed my heatsink, reapplied thermal paste, and reseated the heatsink along with the power connectors to the board, and did a spot check of the rest of the hardware. No idea what is causing these intermittent lockups. I'm in contact with IMC through eBay, who has a ticket in with ASRock already.
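
If anyone wants to keep an eye on the same probes from the unRAID console while a preclear or Folding run is going, ipmitool can read them locally (assuming the IPMI kernel modules are loaded, e.g. by the IPMI plugin):

# dump every temperature sensor the BMC reports
ipmitool sdr type Temperature
# re-read them every 10 seconds while the load is running
watch -n 10 ipmitool sdr type Temperature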

 

By the way, can you send me that beta BIOS so I can flash it for HW transcoding with the iGPU? Transcoding one 1080p remux to 5 Mbps has me running at nearly 75% CPU utilization, which seems super high for a Xeon E-2288G, but maybe not? Not sure why this feature isn't already unlocked in a current public BIOS release.

 

Yikes... so I just read the posts above me (I'm only through page 8 of this thread so far) and it seems I'm not the only one intermittently getting lockups with CPU_CATERR errors. So, it appears I need to disable Turbo Boost? Seems counterproductive when we bought the top-bin Xeon for this application.

Edited by realdiel
Link to comment
11 minutes ago, realdiel said:

So, it appears I need to disable turbo boost?

That's what stopped the CPU_CATERR for me. 

 

With Turbo Boost disabled, I have run several parity checks, preclears and other tasks without issue.

 

Two days ago, I stopped running BOINC and re-enabled Turbo Boost.  My server is mostly idle; however, it has run some backup, recording and other tasks that occasionally pushed CPU usage into the 40% range.

 

I may try a manual parity check with Turbo Boost enabled just to see what happens.  It is scheduled to happen on the 15th but I may advance it as a test.

 

I may also try stressing it a bit with HandBrake encodes.

 

@JM2005 @LateNight and you might want to report this to William at ASRock.  Perhaps there is something in the board or BIOS contributing to the problem.  I don't think it was all heat-related.  Turbo Boost seems to be the issue, and if the error goes away for all of you with Turbo Boost disabled, perhaps the board is not properly handling Turbo Boost.

Link to comment
49 minutes ago, Hoopster said:

That's what stopped the CPU_CATERR for me. 

With Turbo Boost disabled, I have run several parity checks, preclears and other tasks without issue.

 

Good luck on the stress testing. Hopefully, one day, we can go full bore with this setup without any stability issues. I installed the Tips and Tweaks plugin so I could disable Turbo Boost in an attempt to avoid these intermittent lockups and CPU_CATERR errors. *fingers crossed*

 

Can someone hook me up with the latest beta BIOS version that allows me to use the iGPU for Quick Sync? I was hoping to drop my prior discrete GPU into this setup (650 Ti BOOST or GTX 980), but the Silverstone CS380 case I bought won't accommodate their length (iGPU transcoding to the rescue!). Now I'm having buyer's remorse on the case versus the Fractal Design 7XL or even the Node 804. Both have much better cooling options as well.

 

@kaiguy

I've gotten the 5-beep warning at random during POST as well. No idea what that's about, as the system is headless and works fine in unRAID apart from the intermittent lockups followed by the CPU_CATERR errors generated in IPMI reporting (hoping disabling Turbo Boost fixes that).

 

Hoopster, how do I link a user so they're notified when I mention them in a post, like you did before with JM2005 and LateNight? I thought the "@" symbol would do it for "kaiguy", but I guess not. :)

Edited by realdiel
Link to comment
7 minutes ago, realdiel said:

@kaiguy

I've gotten the 5-beep warning at random during POST as well. No idea what that's about, as the system is headless and works fine in unRAID apart from the intermittent lockups followed by the CPU_CATERR errors generated in IPMI reporting (hoping disabling Turbo Boost fixes that).

Weird, right? I can't seem to figure out why that started happening. Everything log-wise is fine on my end and nothing seems to be out of the norm. Oh well, I guess it doesn't matter.

 

Knock on wood, I haven't run into the CPU_CATERR issue yet, though my server is pretty darn idle overall with QSV enabled. But seeing how everyone else is running into this, I'm sure I'll get it on my next parity check.

Link to comment
9 minutes ago, realdiel said:

Hoopster, how do I link a user so they're notified when I mention them in a post, like you did before with JM2005 and LateNight? I thought the "@" symbol would do it for "kaiguy", but I guess not

Type the '@' symbol and then start typing the username you want to tag.  When the name appears in the pop-up user list, you must select it from the list. 

 

The user is not tagged unless selected from the list.

Link to comment
31 minutes ago, realdiel said:

Now I'm having buyer's remorse on the case over the Fractal Design 7XL or even the Node 804

The GTX 1650 has some pretty good specs for such a small card.  It is a dual-slot card, though.

 

Many who are relying on a discrete card for hardware transcoding are opting for the Quadro P2000 (many used ones on eBay) if they want "unlimited" streams.  Best of all, it is a single-slot card.

 

Both of those fit easily in the CS380, but, yeah, the iGPU relieves that concern if you don't need a discrete card for a VM.

Edited by Hoopster
Link to comment
13 minutes ago, Hoopster said:

Many who are relying on a discrete card for hardware transcoding are opting for the Quadro P2000

I was eyeballing the Quadro P2000. If I can't get the iGPU going for QSV, that will be my plan B.

 

Are these the best instructions to follow, after flashing my BIOS, for enabling the iGPU in unRAID and assigning it to my Emby container for Quick Sync (I'll just swap Emby in for Plex in the instructions)?

 

 
Edited by realdiel
Link to comment
3 minutes ago, realdiel said:

Are these the best instructions to follow, after flashing my BIOS, for enabling the iGPU in unRAID and assigning it to my Emby container for Quick Sync (I'll just swap Emby in for Plex in the instructions)?

I don't exactly know how QSV is enabled in Emby, but that would be the only change.

 

Other than that, this guide will help you get it going from an unRAID/Linux perspective; the rest is whatever settings you need to change in Emby (QSV requires Emby Premiere just as Plex requires Plex Pass).
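
From memory, the unRAID-side steps in that guide boil down to something like this (treat it as a sketch, not gospel):

# load the Intel graphics driver (usually added to /boot/config/go so it survives reboots)
modprobe i915
# confirm the iGPU render device shows up
ls -l /dev/dri
# make the device accessible to the container
chmod -R 777 /dev/dri

After that, /dev/dri gets passed through to the Emby (or Plex) container as a device and hardware transcoding is switched on in the app itself.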

Edited by Hoopster
Link to comment
18 hours ago, LateNight said:

Interesting. I have already been in contact with William at ASRock and he's walking me through the steps of eliminating other devices in my system (10GbE card and flashed IBM M1015). Are you seeing the same output when clicking on the CPU_CATERR event in the log? 

@LateNight how's your troubleshooting coming along with William?  Any news to report?

Link to comment

Just to give a heads-up to fellow users of this HW who might be dealing with random CPU_CATERR lockups too: setting "Enable Intel Turbo" to "No" in the Tips and Tweaks plugin for unRAID has given me the longest run of stability so far (no lockups, and I'm running all of my dockers and a VM at full clip). If this continues, I'll be quite happy with the HW upgrades; I'm just hoping for a long-term fix from whoever is the root cause of this (Intel, ASRock, someone else?).
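
If anyone else wants to confirm the setting actually took, the reported clocks are easy to spot-check from the console; with Turbo off they should sit at or below the base clock (3.7 GHz on the E-2288G) even under load:

# per-core clocks as seen by the kernel
grep "cpu MHz" /proc/cpuinfo
# the scaling driver's view for core 0, in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq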

Edited by realdiel
Link to comment
On 6/20/2020 at 8:56 AM, Hoopster said:

A PCIe NVMe SSD in the M.2 slot will disable the x4 slot on the board but not disable any SATA ports.

Hey @Hoopster.  Sorry I have been away for a while dealing with real life stuff.

 

So, are you saying you can have an NVMe SSD AND 8 SATA drives connected?  From what I was reading in the manual, it sounded like when using the NVMe slot you would not be able to use SATA_0.  I am using the "SAMSUNG 970 EVO PLUS M.2 2280 1TB PCIe Gen 3.0" as my NVMe device and was assuming it takes the place of SATA_0.

 

If you or anyone else is using the NVMe slot and all 8 SATA ports without issue, this would be awesome news!

 

Quoted from the manual, page 32:

"These SATA3 connectors support SATA data cables for internal storage devices with up to 6.0 Gb/s data transfer rate. *The M.2 slot (M2_1) is shared with the SATA_0 connector. When M2_1 is populated with a M.2 SATA3/ PCIE3.0(x4 or x2) module, SATA_0 is disabled."

 

...and an image of where I am seeing this:

 

[screenshot of the SATA3/M.2 note on page 32 of the E3C246D4U manual]

Link to comment
13 minutes ago, Burizado said:

So, are you saying you can have an NVMe SSD AND 8 SATA drives connected

Yeah, I probably misread this statement on the MB specs page.

 

[screenshot of the M.2 note from the motherboard specs page]

 

On many motherboards it all depends on whether the M.2 SSD uses the SATA or PCIe (NVMe) interface.  If it is a SATA M.2 SSD, a SATA port is disabled.  If it is a PCIe x4 NVMe SSD, it uses PCIe lanes and disables an x4 PCIe slot.

 

It looks like on this motherboard both PCIe and SATA M.2 SSDs will result in disabling SATA_0 (the red one) on the motherboard.  If that is the case, I guess the good news is that the x4 PCIe slot remains available, but you only have 7 usable SATA ports on the MB.

 

I do not use all 8 MB SATA connectors.  I am using five: one for an optical disc drive and four for 2.5" SATA SSDs.  I have nothing connected to the red SATA_0 port.

 

My 8 array drives (even though I am currently using only 5) in the 8-bay hotswap cage are attached to a Dell H310 in IT mode installed in the x8 PCIe slot.

Link to comment
1 minute ago, Hoopster said:

It looks like on this motherboard both PCIe and SATA M.2 SSDs will result in disabling SATA_0 (the red one) on the motherboard.  If that is the case, I guess the good news is that the x4 PCIe slot remains available, but you only have 7 usable SATA ports on the MB.

I had an NVMe SSD installed in my board along with a 12TB SATA HDD on SATA port 0. Both were detected in the BIOS and in unRAID. I could write to the 12TB HDD, but I didn't try writing to the NVMe SSD. I have since removed the SSD since it was small and of no use to me. Perhaps both the NVMe slot and SATA port 0 are usable concurrently? That would go against the ASRock documentation, but it would be pretty cool if true. :)
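
If it helps settle the port-sharing question, both devices and the bus each one sits on are visible straight from the unRAID console with standard Linux tools:

# the TRAN column shows sata vs nvme for every block device
lsblk -o NAME,TRAN,SIZE,MODEL
# confirm the NVMe controller is enumerated on PCIe
lspci -nn | grep -i 'non-volatile'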

 

I need to get me one of those H310s as I'm already running 8 drives. I suppose I could Velcro my SSDs within the case and run them cabled, freeing up a few 3.5" hot plug carriers for big spinners, and/or use the 5.25" bays for 3.5" spinning drives with a SATA pass-through adapter if needed.

Link to comment
16 minutes ago, Hoopster said:

It looks like on this motherboard both PCIe and SATA M.2 SSDs will result in disabling SATA_0 (the red one) on the motherboard.  If that is the case, I guess the good news is that the x4 PCIe slot remains available, but you only have 7 usable SATA ports on the MB.

Ah, ok.  Thanks for the extended info.  I had my hopes up there a bit. hahaha.

 

I just ordered 3 more drives since they were on sale, bringing my total up to 7, and thought I would get another one if I could do 8. 😁

Link to comment
