May 19, 2024
Hi all,
I need some advice on what could be causing this. I keep getting one core stuck at 100%, even when the array is not started or running. This is my second Unraid server, and the other one is working fine. I have one Docker container (Plex, with a GPU installed) and one Windows VM, and both run with no issues, but I keep noticing one core at 100% even before I start the array or any Docker container or VM. Sometimes it also happens after I have started them, and it is a random core each time.
I have attached screenshots and diagnostics for the smart people to dive into. This is starting to drive me mad, as I have double-checked everything, removed cards, moved them around, etc., and it was all working in the other server, and when installed in another PC, before I moved it across.
Unraid version: 6.12.10
Many thanks,
Bannie
dell-echo-diagnostics-20240519-1957.zip
Edited May 19, 2024 by Bannie
May 19, 2024 Community Expert
These are signs of a stuck core thread, a bug that can be caused by a hardware fault in memory. The CPU was sent a task that it got stuck on, so the core locks up and shows 100% as displayed. Please run a memory test.
May 19, 2024 Community Expert
It could also be CPU overclocking, or the frequency being stuck on the Xeon core thread: https://community.intel.com/t5/Processors/Xeon-E5-2403-CPU-Frenquecy-stuck-at-1-2-GHz-why/m-p/320273
May 19, 2024 Author
Thanks for the reply, I will try a memory test. It is just a bit odd, as this machine ran Windows Server 2022 and VMs fine for a few weeks, with no sign of a CPU thread at 100%. I will let you know whether or not it finds any errors.
May 19, 2024 Author
Ran Memtest: no issues, PASS. The CPUs are set to stock. Could it be a corrupt file on the USB stick?
Many thanks,
Bannie
May 20, 2024 Community Expert
https://unix.stackexchange.com/questions/653292/htop-shows-one-core-at-100-cpu-usage-but-no-processes-are-using-much-cpu
Run htop, not top: htop shows ||| bars per CPU for the total. This may be an emhttp bug in the data it displays.
It is potentially an ACPI error that we can work around with a boot option. Go to Main > Flash > Syslinux and add this to the end of the append line in the config: acpi=off
so:
kernel /bzimage
append initrd=/bzroot acpi=off
Depending on your other vfio / VM settings you may already have other options here. Don't lose them!
Also, check the BIOS settings. Make sure the OS type is set to Other OS, and confirm that Secure Boot is off.
^Known to cause this bug through the CPU instruction set configured by the hardware firmware at boot.
Edited May 20, 2024 by bmartino1
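For anyone who would rather script the edit than type it by hand, here is a minimal sketch of the "add it to the end, but don't lose the other options" step. It only works on text you pass in. The label name "Unraid OS" and the sample config layout are assumptions about a typical Unraid flash drive, and `add_kernel_param` is a made-up helper name, not an Unraid tool. Back up syslinux.cfg before changing the real file.

```python
# Sketch: append a kernel parameter to the "append" line of one syslinux
# label, keeping whatever options (vfio, etc.) are already there.
# ASSUMPTIONS: the label is called "Unraid OS" and the config looks like
# SAMPLE_CFG below; check your own syslinux.cfg before relying on this.

SAMPLE_CFG = """\
default menu.c32
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
"""


def add_kernel_param(cfg_text, param, label="Unraid OS"):
    """Return cfg_text with param added to the append line of `label`."""
    out = []
    in_target = False
    for line in cfg_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "label":
            # A new label block starts; only the exact label name matches.
            in_target = " ".join(words[1:]) == label
        elif in_target and words and words[0].lower() == "append":
            # Existing options stay in place; skip if param is already set.
            if param not in words[1:]:
                line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(add_kernel_param(SAMPLE_CFG, "acpi=off"))
```

Running the same call twice leaves the file unchanged the second time, and the Safe Mode entry is not touched, so there is still a clean boot entry to fall back on if the option causes trouble.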
May 20, 2024 Author
17 hours ago, bmartino1 said: "run htop not top ... add this to end of config: acpi=off ... Also - check bios settings. Make sure it is set to Other OS also confirm that secure boot is off."
So last night, before I went to the land of nod, I tried a few things, and after a few minutes it was still doing it, sometimes straight away, before the array had even been started manually (I wanted to see whether it happened with the array not running, and it did). In the end I headed up to nod, brain still thinking.
During the day I had a little brain fart and wondered whether one of the drives could be causing this, so when I got home today I went on a drive mission, lol. Very odd. Before powering up the server I removed all my drives, thinking a drive could possibly be causing this random core to go to 100%. So, with parity and the drives missing, I booted into Unraid, then put them all back so Unraid could see them, restarted the array, and rebooted.
Then I added the acpi=off code where you said and rebooted. After that it only showed one CPU and one core; I ran htop and it also showed the CPU all as one. So I removed the code, rebooted, and started the array, and so far it is still acting normally. Only the array is running, with no real data on it yet, so I am not worried about the array layout yet. I am still a bit new to Unraid (and Linux commands) and still learning as I go what is what and what it does.
The htop screen grab was taken before Docker was running, and afterwards it was still all OK. Then I started the VM, expecting it to go to one core at 100%, but nope, it is running fine.
Either that magic code worked and did something, or it was possibly a drive error caused by one not being fully in. Just in case, I have ordered a couple of drives to replace them, because I did have a little doubt about two of them, so I will wait and see what happens. At the moment there is no data on the array; it is all on the NVMe drives. Then I started the Plex Docker container: still all OK. I started to play a film: still good to go.
Just so you know, it is running on a Dell R630 with dual CPUs in HBA mode, with a couple of SAS drives in the array for testing. The cache pool is two CT1000P3PSSD8 1TB drives, which I had in the other Unraid box before porting them across to this server, and they were working fine with no issues that I know of.
Again, thank you for all your help, as it now seems to be running, along with Docker and the VM. Before you ask what Digifort is: it is an iVMS server for a CCTV NVR.
Thank you,
Bannie
May 20, 2024 Author
Nope, spoke too soon... it's back. I tested the VM on another Unraid server with no issues. I had Plex running, streaming a film; I stopped the film after 30 minutes and it was still OK. I went off for five minutes, checked, and bang: core 26 at 100%.
Edited May 20, 2024 by Bannie
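When a single core such as core 26 shows 100% but no process seems to be using it, it can help to see what kind of time that core is accumulating. Below is a rough sketch that compares two snapshots of /proc/stat and splits each core's time into user, system, iowait, irq and so on. The function names are made up for illustration. Whether a given dashboard counts iowait as "busy" is an assumption to verify on your own system; the field order follows the Linux proc(5) layout.

```python
# Sketch: per-core time breakdown from two /proc/stat snapshots.
# Helper names are illustrative only; this is not part of Unraid or htop.
import time

# First eight per-CPU columns of /proc/stat (guest time is already
# included in user/nice, so it is not counted again here).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Map 'cpuN' -> {field: jiffies} for every per-core line in text."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu26 ..."; skip the aggregate "cpu" line.
        if parts and parts[0].startswith("cpu") and parts[0][3:].isdigit():
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            values += [0] * (len(FIELDS) - len(values))  # pad short lines
            cores[parts[0]] = dict(zip(FIELDS, values))
    return cores


def usage_between(before, after):
    """Return per-core percentage share of each field between snapshots."""
    result = {}
    for core, new in after.items():
        old = before.get(core)
        if old is None:
            continue
        delta = {f: new[f] - old[f] for f in FIELDS}
        total = sum(delta.values())
        if total > 0:
            result[core] = {f: 100.0 * delta[f] / total for f in FIELDS}
    return result


def busy_percent(shares):
    """CPU actually executing work: everything except idle and iowait."""
    return 100.0 - shares["idle"] - shares["iowait"]


if __name__ == "__main__":
    with open("/proc/stat") as fh:
        first = parse_proc_stat(fh.read())
    time.sleep(2)
    with open("/proc/stat") as fh:
        second = parse_proc_stat(fh.read())
    for core, shares in sorted(usage_between(first, second).items(),
                               key=lambda kv: int(kv[0][3:])):
        print(core, f"busy={busy_percent(shares):5.1f}%",
              f"iowait={shares['iowait']:5.1f}%", f"irq={shares['irq']:5.1f}%")
```

If the pegged core turns out to be mostly iowait or irq rather than user/system time, that points at a device or interrupt problem rather than a runaway process, which would fit with a core being stuck before the array, Docker, or any VM is started.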
May 20, 2024 Community Expert Solution
Glad that worked. Sometimes we just need to clear the ACPI table created by the motherboard BIOS...
Correct, this is mainly about debugging ACPI: https://wiki.ubuntu.com/DebuggingACPI
Mainly it is a kernel bug caused by the BIOS. Make sure the motherboard is on the latest BIOS revision.
Some other users reported that keeping a keyboard/mouse plugged in, or replugging it, fixes the issue: https://forums.unraid.net/topic/116612-692-cpu-cores-stuck-at-100-2-coresthreads/?do=findComment&comment=1076132
Others blame faulty driver plugins at boot: https://forums.unraid.net/bug-reports/stable-releases/692-cpu-usage-stuck-at-100-2-corethread-r1279/?do=findComment&comment=13471
Edited May 20, 2024 by bmartino1
May 20, 2024 Community Expert
These can break the system... use the commands at your own risk. To recover, hit e at the Unraid boot menu and delete the option so you can boot back in and change the boot options.
Yeah, DELL...
Reviewing the diag file, per the syslog:
May 19 19:38:51 Dell-Echo kernel: x2apic: IRQ remapping doesn't support X2APIC mode
^You have the IOMMU and DMAR BIOS options for vfio enabled. The recommended boot option to fix that is: intremap=no_x2apic_optout
Continued: ?thermal / queue wait on the Xeon chip:
May 19 19:38:51 Dell-Echo kernel: cblist_init_generic: Setting adjustable number of callback queues.
May 19 19:38:51 Dell-Echo kernel: cblist_init_generic: Setting shift to 6 and lim to 1.
May 19 19:38:51 Dell-Echo kernel: cblist_init_generic: Setting adjustable number of callback queues.
May 19 19:38:51 Dell-Echo kernel: cblist_init_generic: Setting shift to 6 and lim to 1.
May 19 19:38:51 Dell-Echo kernel: cblist_init_generic: Setting adjustable number of callback queues.
May 19 19:38:51 Dell-Echo kernel: cblist_init_generic: Setting shift to 6 and lim to 1.
May 19 19:38:51 Dell-Echo kernel: Performance Events: PEBS fmt2+, Broadwell events, 16-deep LBR, full-width counters, Intel PMU driver.
Errors:
May 19 19:38:51 Dell-Echo kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
May 19 19:38:51 Dell-Echo kernel: TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
May 19 19:38:51 Dell-Echo kernel: MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
^Bug you are most likely fighting...
May 19 19:38:51 Dell-Echo kernel: ACPI: Using IOAPIC for interrupt routing
You may also want to run with this boot option, per the ACPI debugging guide: noapic
Continued errors:
May 19 19:38:51 Dell-Echo kernel: ACPI: button: Power Button [PWRF]
May 19 19:38:51 Dell-Echo kernel: ACPI Error: No handler for Region [SYSI] (000000000f718182) [IPMI] (20220331/evregion-130)
May 19 19:38:51 Dell-Echo kernel: ACPI Error: Region IPMI (ID=7) has no handler (20220331/exfldio-261)
May 19 19:38:51 Dell-Echo kernel: ACPI Error: Aborting method \_SB.PMI0._GHL due to previous error (AE_NOT_EXIST) (20220331/psparse-529)
May 19 19:38:51 Dell-Echo kernel: ipmi_si: IPMI System Interface driver
May 19 19:38:51 Dell-Echo kernel: ACPI Error: Aborting method \_SB.PMI0._PMC due to previous error (AE_NOT_EXIST) (20220331/psparse-529)
May 19 19:38:51 Dell-Echo kernel: ipmi_si dmi-ipmi-si.0: ipmi_platform: probing via SMBIOS
May 19 19:38:51 Dell-Echo kernel: ACPI: \_SB_.PMI0: _PMC evaluation failed: AE_NOT_EXIST
^Manufacturer ACPI standard error. Hard to fix; Dell is at fault. Install the IPMI plugin...
?Looks like you may have had an MCE event:
May 19 19:39:02 Dell-Echo mcelog: Kernel does not support page offline interface
Is mcelog empty? mcelog
Plugins next, via the syslog installs: it looks like you have this installed and it may be causing your errors:
Edited May 20, 2024 by bmartino1 (forum breaking...)
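Going through a syslog by eye like this is slow, so here is a small sketch of how the same triage could be scripted: it sorts syslog lines into a few buckets by pattern. The bucket names and patterns are made up for this example and only cover the message types quoted in this thread, and the sample text is a subset of the lines above. Note that the "CPU bug present" lines are the kernel reporting speculative-execution vulnerability status (see the kernel.org links in the messages), so treat that bucket as informational.

```python
# Sketch: group syslog lines into rough categories by pattern.
# The categories and patterns are illustrative assumptions, chosen to
# match only the message types quoted in this thread.
import re

LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<tag>[^:\s]+):\s+(?P<msg>.*)$"
)

RULES = [
    ("acpi_error", re.compile(r"ACPI Error|evaluation failed")),
    ("cpu_vulnerability", re.compile(r"CPU bug present")),
    ("irq_remapping", re.compile(r"x2apic", re.IGNORECASE)),
]

SAMPLE = r"""May 19 19:38:51 Dell-Echo kernel: x2apic: IRQ remapping doesn't support X2APIC mode
May 19 19:38:51 Dell-Echo kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
May 19 19:38:51 Dell-Echo kernel: ACPI: Using IOAPIC for interrupt routing
May 19 19:38:51 Dell-Echo kernel: ACPI Error: Region IPMI (ID=7) has no handler (20220331/exfldio-261)
May 19 19:38:51 Dell-Echo kernel: ACPI: \_SB_.PMI0: _PMC evaluation failed: AE_NOT_EXIST
May 19 19:39:02 Dell-Echo mcelog: Kernel does not support page offline interface
"""


def group_kernel_messages(log_text):
    """Return {category: [message, ...]} for kernel lines matching RULES."""
    groups = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group("tag") != "kernel":
            continue  # not a syslog line, or not from the kernel
        msg = match.group("msg")
        for category, pattern in RULES:
            if pattern.search(msg):
                groups.setdefault(category, []).append(msg)
                break  # first matching rule wins
    return groups


if __name__ == "__main__":
    for category, messages in group_kernel_messages(SAMPLE).items():
        print(f"{category}: {len(messages)}")
        for msg in messages:
            print("   ", msg)
```

To run it against a real diagnostics zip, read the extracted syslog file into a string and pass that in place of SAMPLE.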
May 20, 2024 Author
38 minutes ago, bmartino1 said: "... Plugins next via syslog installs: looks like you have this installed and it may be causing your errors:"
I have now removed that plugin, and also one of the GPUs, which I was going to use for VM passthrough (nothing special, only a K620), and I am now monitoring. So far it is working: Plex ran for 25 minutes with no issues, and the VM is running; both are working fine now. I will let it run for a while to check that all is OK.
May 20, 2024 Community Expert
Cool, good luck. GPU passthrough may still be viable. Hopefully the next release fixes the Unraid kernel 6 issues...
GPU passthrough docs:
Other unread noob stuff: