
MCE error amd


Solved by bmartino1


Hi Community.

 

I'm posting the diagnostics for my third Unraid (Basic key) system. Fix Common Problems (FCP) was reporting an mcelog error telling me to use a different module. Per Squid, the error is informative but means nothing.

 

I'm trying to build a "forbidden router" and have the dual 1G NIC successfully passed through via VFIO using the System Devices bind. This system also runs a Samba server and a few Docker containers. If I start the VM, after about 1 hour the entire system freezes and crashes. This hardware worked fine under Windows for quite some time and as a TrueNAS SCALE instance, but I moved to Unraid for Docker.

 

I have 3x 2TB drives in a ZFS raidz1.

I have 2x 120 GB drives as the array and parity drive.

I have a ?xGB size drive as the cache disk.  

 

I'm running without a graphics card and sometimes slap a spare one in the machine to get visual diagnostics. The CPU doesn't have integrated graphics, so there is no onboard video. The system has run Unraid in the past and was stable for a good 3 months with the VM service enabled but no VM running. Not sure what is causing the crash. A reboot fixes it, but I have to stop the VM and keep it from starting to keep Unraid stable. Docker seems fine. I may try an LXC container with the NIC passed to it to see if it's just the VM. IOMMU is enabled and working.

 

Please see diagnostic below.

Carry over from Nerdtools chat.

 

On 3/10/2024 at 2:50 PM, bmartino1 said:

 

Hi Squid, I'm continuing from an error found earlier in the post. Not sure what's wrong with that machine; I ran a memory test and pulled things out to test and check. The only problem FCP reported was this error about the wrong module. At the moment the machine is offline and in pieces. I will get a diagnostic later.

 

The system works fine with LXC and Docker, and started crashing when I tried to run a VM with a dual NIC passed to the VM.

 

 

 

  

On 3/9/2024 at 4:48 PM, Squid said:

That message is informative and a tad misleading.  The module is being used.

 

zoltar-diagnostics-20240313-1743.zip


Is your system experiencing overheating? Memory issues? 

 

I'm looking at your syslog, why do you have the following boot params?

Quote

pcie_acs_override=downstream,multifunction vfio_iommu_type1.allow_unsafe_interrupts=1

Did you turn on iommu in your motherboard? Have you turned on sr-iov as well?

 

For AMD cpu you need at least the following params:

Quote

iommu=pt amd_iommu=on
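
A minimal sketch for confirming that these parameters actually reached the running kernel, assuming a Linux host that exposes /proc/cmdline. The function names has_param and report_params are illustrative helpers, not part of Unraid or any package.

```shell
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_params CMDLINE PARAM...: print "present" or "missing" for each PARAM.
report_params() {
    cmdline=$1
    shift
    for p in "$@"; do
        if has_param "$cmdline" "$p"; then
            echo "$p: present"
        else
            echo "$p: missing"
        fi
    done
}

# On a live system, check the command line the kernel was booted with.
if [ -r /proc/cmdline ]; then
    report_params "$(cat /proc/cmdline)" iommu=pt amd_iommu=on
fi
```

A "missing" line means the parameter was never passed at boot, whatever the boot config file says.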

I also turn off all video buffering so unraid doesn't steal my gpus:

Quote

video=efifb:off,vesafb:off,vesa:off

I also reserve hugepages for my VM's:

Quote

default_hugepagesz=1G hugepagesz=1G hugepages=64 hugepagesz=2M hugepages=8192 transparent_hugepage=never
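
As a rough check on what the line above costs in RAM, here is a small sketch of the arithmetic. Each hugepages= count applies to the hugepagesz= that precedes it; the helper name hugepage_reserved_mib is made up for this example.

```shell
# hugepage_reserved_mib COUNT SIZE: MiB reserved by COUNT pages of SIZE (e.g. 1G, 2M).
hugepage_reserved_mib() {
    case $2 in
        *G) size_mib=$(( ${2%G} * 1024 )) ;;
        *M) size_mib=${2%M} ;;
        *)  echo "unsupported size: $2" >&2; return 1 ;;
    esac
    echo $(( $1 * size_mib ))
}

one_g=$(hugepage_reserved_mib 64 1G)     # 64 pages of 1 GiB
two_m=$(hugepage_reserved_mib 8192 2M)   # 8192 pages of 2 MiB
echo "1G pages: ${one_g} MiB, 2M pages: ${two_m} MiB, total: $(( (one_g + two_m) / 1024 )) GiB"
```

With the counts quoted above this comes to 80 GiB set aside at boot, so the same numbers only make sense on a host with considerably more RAM than that.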

And specify in the VM xml to use hugepages:

Quote

  <memoryBacking>
    <hugepages/>
    <nosharepages/>
  </memoryBacking>

Your crashes could be caused by the ACS override. In general it should be avoided and only used as a last resort.


This build is AMD Zen 1. I'm more concerned with the QEMU VM breaking. In the BIOS: Secure VM (SVM) enabled, SR-IOV enabled, AMD DPS?, memory IOMMU enabled. These are the board firmware settings that allow PCIe device passthrough.

 

This is off-topic, as Unraid has pretty much perfected its IOMMU settings... (I run many devices with PCIe passthrough on Unraid.)

off topic:

Grub Config

https://wiki.ubuntu.com/Kernel/KernelBootParameters
https://manpages.ubuntu.com/manpages/bionic/man7/kernel-command-line.7.html
save default grub: https://manpages.ubuntu.com/manpages/bionic/man8/grub-set-default.8.html (to boot to pve 5.15 kernel)

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="initcall_blacklist=sysfb_init libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 kvm.ignore_msrs=1 intel_iommu=on pcie_acs_override=downstream,multifunction nvme_core.default_ps_max_latency_us=5500 default_hugepagesz=1G hugepagesz=1G transparent_hugepage=always rootflags=noatime pci=noaer pcie_aspm=off intremap=no_x2apic_optout video=vesafb:off,efifb:off,simplefb:off,astdrmfb vfio-pci.ids= "

###############################
So why these Linux kernel commands:

*Order is everything!

initcall_blacklist=sysfb_init 
* This is a fix for the simple framebuffer that is attached to the graphics card at first IRQ registration and initiates an FB / (X11?) session to get card information. To pass a graphics card through to a VM, NO framebuffer or active session may be on the card.
It is more a fix for issues with video=simplefb:off.

video=vesafb:off,efifb:off,simplefb:off,astdrmfb
* This is the line used to turn off framebuffers so they don't touch the card we will use for passthrough.

libata.allow_tpm=1 
* This is a TPM and SATA fix from TrueNAS to help with HBA and SATA operation.

*YES, AMD THREADRIPPER AND EPYC REQUIRE BOTH AMD AND INTEL IOMMU ON!...
amd_iommu=on iommu=pt intel_iommu=on pcie_acs_override=downstream,multifunction
* This turns on IOMMU / SR-IOV so cards can be passed through via memory address.

vfio-pci.ids=
* Use lspci -v and other commands to get the IOMMU group and PCI device ID to be used with the vfio kernel driver.
-- The device, especially a graphics card, must only be in use by the vfio kernel driver for passthrough to work.

kvm_amd.npt=1 kvm_amd.avic=1 kvm.ignore_msrs=1
* This is a fix for TrueNAS and other KVM error-log messages.

nvme_core.default_ps_max_latency_us=5500 default_hugepagesz=1G hugepagesz=1G transparent_hugepage=always 
* These are fixes and settings for NVMe, storage, and memory...

rootflags=noatime 
* This is a file system setting that stops the access time from being updated each time a file is accessed.

pci=noaer pcie_aspm=off 
* These are other PCI settings, such as power management (ASPM) and Advanced Error Reporting (AER), that have been known to spam the log.

intremap=no_x2apic_optout
* This is a fix for USB 2.0 and 3.0 to manage USB passthrough.
###############################

 

Check IOMMU and remapping:
dmesg | grep -e DMAR -e IOMMU
dmesg | grep 'remapping'
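
Beyond the dmesg check, it often helps to see which PCI addresses share an IOMMU group. The following is a minimal sketch, assuming the standard sysfs layout under /sys/kernel/iommu_groups; the function name list_iommu_groups and its optional root argument are illustrative only.

```shell
# list_iommu_groups [ROOT]: print "group N: PCI-address" for every device in every group.
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for group in "$root"/*; do
        # Skip the unexpanded glob when IOMMU is off or the directory is absent.
        [ -d "$group/devices" ] || continue
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            echo "group ${group##*/}: ${dev##*/}"
        done
    done
}

# Prints nothing if the host exposes no IOMMU groups.
list_iommu_groups
```

If both ports of a dual NIC land in one group together with unrelated devices, that is the situation the ACS override (multifunction) option is meant to split.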

 

^Remember, though, this is Proxmox. Unraid is immutable (it runs from RAM), so settings made this way are gone at reboot.

 

If you go to Main > click Flash (your flash drive), at the bottom is where you can edit your boot settings (on Unraid this is the Syslinux configuration rather than grub). WARNING: DON'T DO IT. There are many web GUI options that set these settings, and editing them here can break your system...

image.thumb.png.43ced9b0db5197823c69183cd1f907d0.png


I ran a Threadripper system with Proxmox and Ubuntu. Those operating systems needed some special grub commands with kernel 5. But those grub commands depend on the kernel version (version 5 specifically...). We are now using version 6, and a lot of these commands are now dead or no longer do what they should.

 

Then there are other weird Proxmox settings to follow for version 7, which uses kernel 5:

https://pve.proxmox.com/wiki/PCI_Passthrough

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf
#echo "options snd-hda-intel enable_msi=1" >> /etc/modprobe.d/snd-hda-intel.conf

##############################
for amd proxmox modprobe options and configs...

-- may need to install a lib file or g++

cd /etc/modprobe.d

* To stop kernel-level drivers from starting, as the host should not use them for GPU passthrough.
# lspci -v to get the driver in use -- this can be dangerous...

nano blacklist.conf 
blacklist nouveau
blacklist amdgpu
blacklist radeon
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
blacklist nvidia
blacklist nvidia_drm
blacklist i2c_nvidia_gpu
blacklist nvidia-gpu

* To fix Intel sound and PulseAudio sound passthrough. Mainly turns off the power-save sleep settings...

nano alsa-base.conf
# Disable snd-hda-intel power saving
options snd-hda-intel power_save=0 power_save_controller=N

* Force SR-IOV
nano iommu_unsafe_interrupts.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1

*Fix kvm spam of message log:
nano kvm.conf
options kvm ignore_msrs=1

*Your VFIO LSPCI device to pass to VM
nano vfio.conf

(for a graphics card, it should have this:)
options vfio-pci ids= disable_vga=1

(for a USB card / HBA, it should have this:)
options vfio-pci ids= disable_idle_d3=1 enable_sriov disable_denylist

##################################

List VFIO PCI IDs:
lspci -v: to list the driver in use
lspci -n -s 01:00: to list the vfio hardware ID

Example VFIO option in config:
root@BMM-PVE:~# lspci -n -s 36:00.0
to get the vfio ID
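
The step from lspci -n output to an ids= line can be sketched as below. This assumes the usual lspci -n layout of "address class: vendor:device", and vfio_ids is a hypothetical helper name; the sample IDs in the comment are the ones used in the modprobe example later in this post.

```shell
# vfio_ids: read `lspci -n` lines on stdin and print a matching modprobe options line.
# The third whitespace-separated field of each line is the vendor:device ID.
vfio_ids() {
    awk '{ ids = ids (ids ? "," : "") $3 }
         END { if (ids) print "options vfio-pci ids=" ids }'
}

# Typical use on the host (not run here):
#   lspci -n -s 01:00 | vfio_ids
# For a GPU and its audio function this would give something like
#   options vfio-pci ids=1002:67df,1002:aaf0
```

Review the result before writing it to /etc/modprobe.d/vfio.conf, since every device with a listed ID gets claimed by vfio-pci.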

VFIO command options:

parm:           ids: Initial PCI IDs to add to the vfio driver, format is "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" and multiple comma separated entries can be specified (string)

parm:           nointxmask: Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci@vger.kernel.org so the device can be fixed automatically via the broken_intx_masking flag. (bool)

parm:           disable_vga: Disable VGA resource access through vfio-pci (bool)

parm:           disable_idle_d3: Disable using the PCI D3 low power state for idle, unused devices (bool)

parm:           enable_sriov: Enable support for SR-IOV configuration.  Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF. (bool)

parm:           disable_denylist: Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users. (bool)

Don't forget to update the initramfs (and grub) after editing:
nano /etc/modprobe.d/vfio.conf
update-initramfs -u -k all

REBOOT!

Step 6 Side fixes in VM configs.

Users have reported that NVIDIA Kepler K80 GPUs need this in vmid.conf:
args: -machine pc,max-ram-below-4g=1G

#https://forum.proxmox.com/threads/quadro-gpu-passthrough-to-windows-shows-code-43.88788/
For NVIDIA passthrough, to fix some error code 43 cases:
VM QEMU config:
NVIDIA GPU: nano ###.conf and add:
###########
////////////////////////////////////
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
cpu: host,hidden=1,flags=+pcid

///////////////////////////////////
###########

Otherwise a hookscript may be needed:
#https://forum.proxmox.com/threads/how-to-use-new-hookscript-feature.53388/

and softdep to get the Linux drivers to use VFIO:
Also See:
https://forum.proxmox.com/threads/pci-gpu-passthrough-on-proxmox-ve-8-installation-and-configuration.130218/


New modprobe config to fix vfio driver load order:

Example:
echo "options vfio-pci ids=1002:67df,1002:aaf0" >> /etc/modprobe.d/vfio.conf
For AMD
echo "softdep radeon pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep amdgpu pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
For Nvidia
echo "softdep nouveau pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep nvidia pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep nvidiafb pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep nvidia_drm pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep drm pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
For Intel
echo "softdep snd_hda_intel pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep snd_hda_codec_hdmi pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
echo "softdep i915 pre: vfio-pci" >> /etc/modprobe.d/vfio.conf
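
Because the echo lines above append with >>, running them twice leaves duplicate entries in vfio.conf. A small sketch of an idempotent variant follows; write_softdeps is a made-up helper name, and the target path is passed in so it can be tried on a scratch file first.

```shell
# write_softdeps FILE DRIVER...: append one softdep line per DRIVER so vfio-pci loads first.
write_softdeps() {
    conf=$1
    shift
    for drv in "$@"; do
        line="softdep $drv pre: vfio-pci"
        # Only append when the exact line is not already in the file.
        grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
    done
}

# Example for an AMD GPU on the Proxmox host (not run here):
#   write_softdeps /etc/modprobe.d/vfio.conf radeon amdgpu
```

Only list the drivers for hardware that is actually being passed through.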


With Unraid that is done under the VM settings, with the advanced toggle:
image.thumb.png.dc0f51d63f641aeadcdada1ab4afde43.png

 

And to bind: make the PCI device pass through by making it a vfio device.

Tools > System Devices.

image.thumb.png.9408dbbb2267b9eade683f41063a9197.png

 

image.thumb.png.3e763fea83f8228823538006b6b6b576.png

Here I'm passing the multifunction option to grub and other options to the initramfs... and for the vfio ID bind, Unraid has already set up the modprobe blacklist to be persistent across reboots.

I also required this to pass a USB hub PCIe device, to get some bare-bones direct access.

 

Unraid explain:

 

For Unraid I recommend watching Spaceinvader One's video.

 

My machine was torn down and the parts tested without finding a hardware issue. Just weird that QEMU is not working. Going to try LXC without VFIO binding...

 

Per squid this is informative:

mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
CPU is unsupported
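
To check from a terminal whether the module named in that message is in use, a minimal sketch is shown below. It reads /proc/modules, so it only sees loadable modules, not drivers built into the kernel; module_loaded is an illustrative helper name.

```shell
# module_loaded NAME [FILE]: succeed if NAME is listed in /proc/modules (or in FILE).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded edac_mce_amd; then
    echo "edac_mce_amd is loaded"
else
    echo "edac_mce_amd is not listed in /proc/modules"
fi
```

On an AMD system where the first message prints, the mcelog warning can be read as informational, which matches what Squid describes.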

 

but means nothing. Squid also claimed that the module is installed? ... The FCP plugin warning, per the NerdTools thread, took me down a rabbit hole. MCE is more of an Intel check and error set. Squid also said he needed the Unraid diagnostics zip to check any mcelog errors.

 

mcelog is the Linux program that has been integrated with Unraid for some time.

 

For AMD EPYC I'm sure you may need a similar set of BIOS settings and grub settings. As posted above for my Threadripper and Proxmox, passing the same GPU and USB PCIe device, that was my full grub command line to get Proxmox v7 with kernel 5 to report IOMMU as enabled and working as it should, and then to set the PCIe add-on cards to use the vfio driver so the devices could be passed to a VM.

 


The above system is my test system. I plan to poke around the Zoltar Unraid machine later today. Yes, the IOMMU settings are correct; as I said, this was working fine under TrueNAS with an Ubuntu VM. IOMMU shows as enabled in the web GUI check and reads fine from the terminal.

 

Bios settings is not the issue.

For Zoltar I have not added anything extra to the grub config. Using the VM advanced settings only, I was required to enable multifunction to separate the two ports of the NIC PCIe card so it could be bound.

 

I need to check what the Unraid settings have done to the grub config, but the PCIe NIC, lib log, and bind log show no errors.

 

I'm not passing a GPU, and there's no need for the storage settings. Apart from ticking the PCI NIC check box, no XML edits are made.

I will post that XML config later... and confirm the grub options Zoltar is using and show that the BIOS / IOMMU settings are there and fine.

 

I'm not sure how else to check the system and grab the log or error that happens when the VM has been running for 1 hour and then crashes, as Zoltar has no video out and I have to put a GPU in to see any display.


AMD pictures of the BIOS and Unraid system info to check IOMMU, from when Zoltar was AMD.

 

I didn't grab the XML template for the VM... as I decided to do a board swap with extra stuff lying around the shop / friend's PC repair... and now Zoltar is running an Intel system and is much more stable.

 

The Intel system was running an Unraid instance before; I decided to swap the USB flash drive over after turning off settings, removing driver plugins, and deleting the network config file from the flash drive.

 

So Zoltar is functional, as it is more of a priority machine. Borg, with this same AMD motherboard, freezes within 1 hour with just Docker containers running, so I think this 8+ year old motherboard may be on its way out.

advance - chipset settings - svm.jpg

advance - pci sub settings - sriov.jpg

advance--amdcbs-Iommu.jpg

if power out and power restore turn on.jpg

sys info.jpg


Hardware issues, maybe... The Threadripper system I have has dmesg log spam of memory errors that self-correct. Still trying to see if it's the ASUS Sage motherboard or one of the memory controllers on the Threadripper processor, but that's a different project.


The module can be confirmed with newer versions of Unraid under Tools > System Drivers:

image.thumb.png.227c821eb93bf1bf186d41ddab48740b.png

The solution to the MCE errors in this post, PER SQUID:

  

On 3/9/2024 at 4:48 PM, Squid said:

That message is informative and a tad misleading.  The module is being used.

 

  

On 3/10/2024 at 3:18 PM, Squid said:

MCE's you need to post your diagnostics

 

The edac message is informative and doesn't actually mean anything

 

Thank you Squid!

  • 2 weeks later...

So my friend's system started randomly throwing this error today. FCP says machine check event. So I decided to run diagnostics.

 

I rebooted the server and got a kernel panic. Booting to safe mode succeeded; I fixed the mod changes and rebooted to a normal boot, and it's up with no errors...

Not sure what setting changed, or what issue with keeping a system up on an AMD platform is causing this.

 

My friend also has 3 Unraid licenses and multiple computer systems, but doesn't have a forum account...

This machine is called SGC; the names are from Star Trek.

 

There seems to be a problem with this kernel module driver...

 

image.png.1d7a0c256bde65a743cbb968b458cb54.png

Unraid 6.12.8 > System Drivers:

image.thumb.png.633f71d35e307c0d9fdc725d025f746c.png

 

but terminal shows error:
image.thumb.png.a0221404f40ab89e92c3cd6a07994112.png

 

Thoughts? Ideas?

 

after_reboot_all_systems_go_-_sgc-diagnostics-20240326-1658.zip sgc-diagnostics-20240326-1157.zip
