CPU Load building up with no indication (after upgrading to AMD)



Hi everyone,

I was a happy camper with my i5-8600K: no issues besides a lack of cores for my usage.
So I upgraded to an AMD 3900X. I miss the integrated GPU, but that's a different story.

 

Since the upgrade to AMD I can watch the CPU load rise without any cause I can make out: no processes hogging CPU, no abnormal disk writes, and enough RAM (64 GB). Eventually the server becomes unresponsive. I'm not able to force stop the docker process; libvirtd does work.
So in the end I force stop the machine via the power button.
I already tried disabling C-states and adding the kernel parameter (adjusted to 24 threads, i.e. 0-23) as recommended by spaceinvaderone (starting around 9:50 min).


I'll attach the diagnostics within the next hour.

I was gone for 4 days and the load was at 1300 oO

This morning, about 30 minutes after starting up, I was at 127 already; see the screenshot with htop, top, iotop and vmstat.
IMG_20191008_082556.thumb.jpg.c6caad8f354246ca47a76fde39d6c671.jpg
Running X570 Taichi with the latest BIOS (2.10 - 2019/9/17).
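
A note on the symptom above: on Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable ones, so it can climb while nothing shows up as using CPU. Below is a minimal Python sketch, assuming a Linux /proc filesystem, that lists processes currently in that state. The function names are made up for illustration, and it only looks at each process's main thread.

```python
import glob


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    command = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, command, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for path in glob.glob(f"{proc_root}/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, command))
    return blocked


if __name__ == "__main__":
    try:
        with open("/proc/loadavg") as handle:
            print("load averages:", " ".join(handle.read().split()[:3]))
    except OSError:
        print("no /proc/loadavg here; this sketch assumes Linux")
    for pid, command in blocked_tasks():
        print(f"D state: pid={pid} comm={command}")
```

If the list keeps growing while the load climbs, the blocked processes are a better lead than the CPU graphs.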

 

 

IMG_20191006_170204.jpg

 

Edited by MrMister
spelling
1 hour ago, SnickySnacks said:

Diagnostics, etc, but I recall seeing this issue being related to certain dockers before:
 

 

Thanks! Just started reading the topic.

 

BTW, I'm currently running Unraid 6.7.2.
Diagnostics attached. (Couldn't do it before as the topic was under moderation or something like that.)

Also ran Fix Common Problems; the only thing appearing is that shares are using the cache although they shouldn't.
I'm not running any additional docker containers compared to before. I just made the switch from Intel to AMD about a week ago and this issue popped up.

homebase-diagnostics-20191008-0826.zip

Edited by MrMister
additional infos

At the time those diagnostics were created, were any CPUs showing 100% load? If so, which ones?

Also, have you tried booting in safe mode and seeing if this occurs with no plugins/dockers/vms loaded?

There does seem to be some corruption on one of your disks:

Oct  7 23:56:33 Homebase kernel: BTRFS critical (device sdj1): corrupt leaf: root=5 block=1953586397184 slot=84, bad key order, prev (288230376157862467 96 4) current (6150723 96 5)
### [PREVIOUS LINE REPEATED 4 TIMES] ###
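
For anyone who wants to pull lines like this out of their own syslog, here is a rough Python sketch that counts BTRFS critical/error messages per device. It assumes the log format shown above, including the "PREVIOUS LINE REPEATED" marker; the syslog path is passed on the command line because where it lives depends on your setup.

```python
import re
import sys
from collections import Counter

# Matches e.g. "... kernel: BTRFS critical (device sdj1): corrupt leaf: ..."
BTRFS_LINE = re.compile(r"kernel: BTRFS (critical|error) \(device (\w+)\):")
# Matches the marker the log uses for collapsed duplicate lines.
REPEAT_MARKER = re.compile(r"PREVIOUS LINE REPEATED (\d+) TIMES")


def btrfs_problems(lines):
    """Count BTRFS critical/error messages per device, honouring repeat markers."""
    counts = Counter()
    last_device = None
    for line in lines:
        match = BTRFS_LINE.search(line)
        if match:
            last_device = match.group(2)
            counts[last_device] += 1
            continue
        repeat = REPEAT_MARKER.search(line)
        if repeat and last_device is not None:
            # The marker stands for N extra copies of the previous line.
            counts[last_device] += int(repeat.group(1))
        last_device = None
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], errors="replace") as handle:
        for device, count in btrfs_problems(handle).most_common():
            print(f"{device}: {count} BTRFS critical/error lines")
```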



Should probably run a check on that one, as it looks like it eventually causes a kernel fault:
 

Oct  8 02:02:49 Homebase kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
Oct  8 02:02:49 Homebase kernel: PGD 4ad0b1067 P4D 4ad0b1067 PUD 4ad0b0067 PMD 0 
Oct  8 02:02:49 Homebase kernel: Oops: 0000 [#1] SMP NOPTI
Oct  8 02:02:49 Homebase kernel: CPU: 15 PID: 1848 Comm: fstrim Tainted: P           O      4.19.56-Unraid #1
Oct  8 02:02:49 Homebase kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Taichi, BIOS P2.10 09/09/2019
Oct  8 02:02:49 Homebase kernel: RIP: 0010:btrfs_trim_fs+0x166/0x369
Oct  8 02:02:49 Homebase kernel: Code: 00 00 48 c7 44 24 38 00 00 00 00 49 8b 45 10 48 c7 44 24 40 00 00 00 00 48 c7 44 24 30 00 00 00 00 48 89 44 24 20 48 8b 43 68 <48> 8b 80 80 00 00 00 48 8b 80 f8 03 00 00 48 8b 80 a8 01 00 00 0f
Oct  8 02:02:49 Homebase kernel: RSP: 0018:ffffc9001294fc90 EFLAGS: 00010297
Oct  8 02:02:49 Homebase kernel: RAX: 0000000000000000 RBX: ffff888f5db68200 RCX: ffff888fbf604878
Oct  8 02:02:49 Homebase kernel: RDX: ffff888cac98de80 RSI: ffff888f5d718c00 RDI: ffff888fbf604858
Oct  8 02:02:49 Homebase kernel: RBP: 0000000000000000 R08: ffff888f5911fa70 R09: ffff888f5911fa68
Oct  8 02:02:49 Homebase kernel: R10: ffffea0022918ec0 R11: ffff888ffe9e0b80 R12: ffff888fbfafe000
Oct  8 02:02:49 Homebase kernel: R13: ffffc9001294fd20 R14: 0000000000000000 R15: 0000000000000000
Oct  8 02:02:49 Homebase kernel: FS:  000014b7fa3ac780(0000) GS:ffff888ffe9c0000(0000) knlGS:0000000000000000
Oct  8 02:02:49 Homebase kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  8 02:02:49 Homebase kernel: CR2: 0000000000000080 CR3: 00000001a6aa2000 CR4: 0000000000340ee0
Oct  8 02:02:49 Homebase kernel: Call Trace:
Oct  8 02:02:49 Homebase kernel: ? dput.part.6+0x24/0xf6
Oct  8 02:02:49 Homebase kernel: btrfs_ioctl_fitrim.isra.7+0xfe/0x135
Oct  8 02:02:49 Homebase kernel: btrfs_ioctl+0x4f6/0x28ad
Oct  8 02:02:49 Homebase kernel: ? queue_var_show+0x12/0x15
Oct  8 02:02:49 Homebase kernel: ? _copy_to_user+0x22/0x28
Oct  8 02:02:49 Homebase kernel: ? cp_new_stat+0x14b/0x17a
Oct  8 02:02:49 Homebase kernel: ? vfs_ioctl+0x19/0x26
Oct  8 02:02:49 Homebase kernel: vfs_ioctl+0x19/0x26
Oct  8 02:02:49 Homebase kernel: do_vfs_ioctl+0x526/0x54e
Oct  8 02:02:49 Homebase kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct  8 02:02:49 Homebase kernel: ksys_ioctl+0x39/0x58
Oct  8 02:02:49 Homebase kernel: __x64_sys_ioctl+0x11/0x14
Oct  8 02:02:49 Homebase kernel: do_syscall_64+0x57/0xf2
Oct  8 02:02:49 Homebase kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct  8 02:02:49 Homebase kernel: RIP: 0033:0x14b7fa4de397
Oct  8 02:02:49 Homebase kernel: Code: 00 00 90 48 8b 05 f9 2a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2a 0d 00 f7 d8 64 89 01 48
Oct  8 02:02:49 Homebase kernel: RSP: 002b:00007ffc52c9f358 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct  8 02:02:49 Homebase kernel: RAX: ffffffffffffffda RBX: 00007ffc52c9f4b0 RCX: 000014b7fa4de397
Oct  8 02:02:49 Homebase kernel: RDX: 00007ffc52c9f360 RSI: 00000000c0185879 RDI: 0000000000000003
Oct  8 02:02:49 Homebase kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000415fd0
Oct  8 02:02:49 Homebase kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000415740
Oct  8 02:02:49 Homebase kernel: R13: 00000000004156c0 R14: 0000000000415740 R15: 000014b7fa3ac6b0
Oct  8 02:02:49 Homebase kernel: Modules linked in: veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap macvlan xt_nat ipt_MASQUERADE iptable_nat nf_nat_ipv4 iptable_filter ip_tables nf_nat xfs dm_crypt algif_skcipher af_alg dm_mod dax md_mod bonding edac_mce_amd kvm_amd nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) drm_kms_helper btusb btrtl btbcm drm kvm btintel igb bluetooth agpgart syscopyarea sysfillrect crct10dif_pclmul sysimgblt fb_sys_fops crc32_pclmul crc32c_intel ghash_clmulni_intel i2c_piix4 i2c_algo_bit pcbc i2c_core aesni_intel aes_x86_64 crypto_simd wmi_bmof mxm_wmi ahci ecdh_generic cryptd ccp libahci glue_helper wmi button pcc_cpufreq acpi_cpufreq
Oct  8 02:02:49 Homebase kernel: CR2: 0000000000000080
Oct  8 02:02:49 Homebase kernel: ---[ end trace 9bdd9e618dc0d9c2 ]---
Oct  8 02:02:49 Homebase kernel: RIP: 0010:btrfs_trim_fs+0x166/0x369
Oct  8 02:02:49 Homebase kernel: Code: 00 00 48 c7 44 24 38 00 00 00 00 49 8b 45 10 48 c7 44 24 40 00 00 00 00 48 c7 44 24 30 00 00 00 00 48 89 44 24 20 48 8b 43 68 <48> 8b 80 80 00 00 00 48 8b 80 f8 03 00 00 48 8b 80 a8 01 00 00 0f
Oct  8 02:02:49 Homebase kernel: RSP: 0018:ffffc9001294fc90 EFLAGS: 00010297
Oct  8 02:02:49 Homebase kernel: RAX: 0000000000000000 RBX: ffff888f5db68200 RCX: ffff888fbf604878
Oct  8 02:02:49 Homebase kernel: RDX: ffff888cac98de80 RSI: ffff888f5d718c00 RDI: ffff888fbf604858
Oct  8 02:02:49 Homebase kernel: RBP: 0000000000000000 R08: ffff888f5911fa70 R09: ffff888f5911fa68
Oct  8 02:02:49 Homebase kernel: R10: ffffea0022918ec0 R11: ffff888ffe9e0b80 R12: ffff888fbfafe000
Oct  8 02:02:49 Homebase kernel: R13: ffffc9001294fd20 R14: 0000000000000000 R15: 0000000000000000
Oct  8 02:02:49 Homebase kernel: FS:  000014b7fa3ac780(0000) GS:ffff888ffe9c0000(0000) knlGS:0000000000000000
Oct  8 02:02:49 Homebase kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  8 02:02:49 Homebase kernel: CR2: 0000000000000080 CR3: 00000001a6aa2000 CR4: 0000000000340ee0
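
A dump like the one above is easier to scan once reduced to three facts: the BUG message, the process that triggered it (Comm:) and the kernel function it died in (RIP:). A small Python sketch that extracts those, assuming the line layout seen in this log; the field names in the returned dict are my own:

```python
import re

BUG = re.compile(r"kernel: BUG: (.*)")
COMM = re.compile(r"Comm: (\S+)")
# Kernel-mode RIP lines look like "RIP: 0010:btrfs_trim_fs+0x166/0x369".
RIP = re.compile(r"RIP: 0010:([\w.]+)\+0x")


def summarize_oops(lines):
    """Return the first BUG message, process name and faulting function found."""
    patterns = (("bug", BUG), ("process", COMM), ("function", RIP))
    summary = {key: None for key, _ in patterns}
    for line in lines:
        for key, pattern in patterns:
            if summary[key] is None:
                match = pattern.search(line)
                if match:
                    summary[key] = match.group(1).strip()
    return summary
```

For the trace above this gives fstrim as the process and btrfs_trim_fs as the function, i.e. the crash happened while trimming the btrfs cache pool.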


 

Edited by SnickySnacks
Quote

At the time those diagnostics were created, were any CPUs showing 100% load? If so, which ones?

As you can see in the bad photo I took of the web UI, 21/24 threads were at 100%...
 

Quote

Also, have you tried booting in safe mode and seeing if this occurs with no plugins/dockers/vms loaded?

I did, and the issue does not occur!
 

Quote

There does seem to be some corruption on one of your disks:

The plot thickens!
I tried clearing the cache disk (downloads that weren't moved yet because the mover didn't get to run, etc.)
--> then the load starts to rise again!
sdj is one of the cache disks (total: 2 x 1 TB Crucial SSDs).
They were RAID1 before the upgrade!
I remember that on the first boot after the new build I had forgotten to plug in one of the PCI SATA cards(!) and started the array anyway (1 array disk and 1 cache drive were missing). As the array drive was emulated and the cache drives were RAID1, I thought, let's give it a spin!

I then installed the card. Before, the cache was roughly 1 TB; now it's showing 2 TB total capacity!
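
One possible reading of the 1 TB vs 2 TB jump, which nothing in the diagnostics quoted here confirms: a two-device btrfs pool stores every block twice under the raid1 profile, so usable space is about one drive, while the single profile gives the sum of both. A quick Python sketch of that arithmetic; the function name and the sizes are illustrative only:

```python
def usable_capacity_gb(device_sizes_gb, profile):
    """Approximate usable data capacity of a btrfs pool for two simple profiles."""
    if len(device_sizes_gb) < 2:
        raise ValueError("this sketch expects a pool of at least two devices")
    total = sum(device_sizes_gb)
    if profile == "single":
        # Each block is stored once, so all space counts.
        return total
    if profile == "raid1":
        # Each block is stored on two different devices: at most half the
        # total, and never more than what the other devices can mirror.
        return min(total / 2, total - max(device_sizes_gb))
    raise ValueError(f"profile not covered by this sketch: {profile}")


print(usable_capacity_gb([1000, 1000], "raid1"))   # two 1 TB SSDs, mirrored
print(usable_capacity_gb([1000, 1000], "single"))  # same SSDs, no mirroring
```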

So I need to check if I can grab everything from the cache drive ASAP (all my appdata and vdisks), maybe with it connected externally to another machine.
Then remove the failing cache drive from the config and add it again?

Thank you so far!

14 hours ago, SnickySnacks said:

I don't know much about cache disks.

Obvious suggestion is to fix the error on the disk and then enable plugins/dockers/VMs one at a time until you can determine which one is causing the problem, I suppose.

Thank you so much for your support!

As presumed, one of the cache drives seems to have been corrupted. What I did to fix it:

- grabbed everything that I could from the healthy drive onto another SSD
- hit new configuration, keeping only parity and data and removing the cache drives
- disabled docker and libvirt via settings
- started the array
- stopped the array
- removed the partitions of the cache drive (unassigned devices)
- added both cache drives again
- started the array
- formatted the cache drives
- copied the data back from the other SSD
- enabled docker and libvirt again

--> et voilà

now it also shows the correct size (raid 1, 9xx GB) again
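
Since the steps above copy appdata and vdisks off a pool that had corruption and then back again, it can be worth checking that the restored files match the backup. A minimal Python sketch that compares two directory trees by SHA-256; this is not something done in the thread, and both paths are supplied by whoever runs it:

```python
import hashlib
import os
import sys


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large vdisks fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path relative to root to its SHA-256 digest."""
    digests = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            digests[os.path.relpath(path, root)] = file_digest(path)
    return digests


def compare_trees(backup_root, restored_root):
    """Return (missing, changed): files absent from or different in the restore."""
    backup = tree_digests(backup_root)
    restored = tree_digests(restored_root)
    missing = sorted(set(backup) - set(restored))
    changed = sorted(
        path for path in backup.keys() & restored.keys()
        if backup[path] != restored[path]
    )
    return missing, changed


if __name__ == "__main__" and len(sys.argv) == 3:
    missing, changed = compare_trees(sys.argv[1], sys.argv[2])
    print("missing from restore:", missing)
    print("content differs:", changed)
```

Run it with docker and libvirt still disabled, otherwise files that the containers and VMs are writing to will show up as changed.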
