MrMister Posted October 8, 2019 (edited)

Hi everyone, I was a happy camper with my i5-8600K, no issues besides a lack of cores for my usage. So I upgraded to an AMD 3900X. I miss the integrated GPU, but that's a different story.

Since the upgrade to AMD I can watch the CPU load rise without any cause I can make out: no processes hogging the CPU, no abnormal disk writes, and enough RAM (64 GB). The server then becomes unresponsive. I'm not able to force-stop the Docker process; libvirtd does still work. So in the end I force-stop the machine via the power button.

I already tried disabling C-states and adding the kernel parameter (adjusted to 24 threads, i.e. 0-23) as recommended by spaceinvaderone (starting around 9:50 in his video). I'll attach the diagnostics within the next hour.

I was gone for 4 days and the load was at 1300. oO This morning, about 30 minutes after starting up, it was already at 127; see the screenshot with htop, top, iotop and vmstat.

Running an X570 Taichi with the latest BIOS (2.10, 2019-09-17).

Edited October 8, 2019 by MrMister: spelling
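(For anyone landing here with the same symptom: a load average that climbs while no process shows CPU usage usually means tasks stuck in uninterruptible sleep, state D, because Linux counts those toward the load average. A quick sketch with standard Linux tools, nothing Unraid-specific:)

```shell
# Load average counts runnable AND uninterruptible (state D) tasks, so a
# climbing load with idle CPUs points at processes blocked in the kernel
# (e.g. a hung fstrim or stalled disk I/O) rather than at a CPU hog.
cat /proc/loadavg

# List any tasks currently stuck in state D (often empty on a healthy box):
ps -eo state=,pid=,comm= | awk '$1 == "D"'
```

If that second command keeps showing the same PIDs, those are the tasks driving the load up.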
SnickySnacks Posted October 8, 2019

Diagnostics would help, but I recall seeing this issue being related to certain Docker containers before:
MrMister Posted October 8, 2019 (edited)

1 hour ago, SnickySnacks said: Diagnostics would help, but I recall seeing this issue being related to certain Docker containers before:

Thanks! Just started reading that topic. BTW, currently running Unraid 6.7.2. Diagnostics attached. (Couldn't do it before, as the topic was under moderation or something like that.)

Also ran Fix Common Problems; the only thing it flags is that some shares are using the cache although they shouldn't. I'm not running any additional Docker containers compared to before. I just made the switch from Intel to AMD about a week ago and this issue popped up.

homebase-diagnostics-20191008-0826.zip

Edited October 8, 2019 by MrMister: additional info
MrMister Posted October 8, 2019

This should be in General Support, right? Did it get moved here? oO
SnickySnacks Posted October 8, 2019 (edited)

At the time those diagnostics were created, were any CPUs showing 100% load? If so, which ones?

Also, have you tried booting in safe mode and seeing if this occurs with no plugins/dockers/VMs loaded?

There does seem to be some corruption on one of your disks:

Oct 7 23:56:33 Homebase kernel: BTRFS critical (device sdj1): corrupt leaf: root=5 block=1953586397184 slot=84, bad key order, prev (288230376157862467 96 4) current (6150723 96 5)
### [PREVIOUS LINE REPEATED 4 TIMES] ###

You should probably run a check on that one, as it looks like it eventually causes a kernel fault:

Oct 8 02:02:49 Homebase kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
Oct 8 02:02:49 Homebase kernel: PGD 4ad0b1067 P4D 4ad0b1067 PUD 4ad0b0067 PMD 0
Oct 8 02:02:49 Homebase kernel: Oops: 0000 [#1] SMP NOPTI
Oct 8 02:02:49 Homebase kernel: CPU: 15 PID: 1848 Comm: fstrim Tainted: P O 4.19.56-Unraid #1
Oct 8 02:02:49 Homebase kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Taichi, BIOS P2.10 09/09/2019
Oct 8 02:02:49 Homebase kernel: RIP: 0010:btrfs_trim_fs+0x166/0x369
Oct 8 02:02:49 Homebase kernel: Code: 00 00 48 c7 44 24 38 00 00 00 00 49 8b 45 10 48 c7 44 24 40 00 00 00 00 48 c7 44 24 30 00 00 00 00 48 89 44 24 20 48 8b 43 68 <48> 8b 80 80 00 00 00 48 8b 80 f8 03 00 00 48 8b 80 a8 01 00 00 0f
Oct 8 02:02:49 Homebase kernel: RSP: 0018:ffffc9001294fc90 EFLAGS: 00010297
Oct 8 02:02:49 Homebase kernel: RAX: 0000000000000000 RBX: ffff888f5db68200 RCX: ffff888fbf604878
Oct 8 02:02:49 Homebase kernel: RDX: ffff888cac98de80 RSI: ffff888f5d718c00 RDI: ffff888fbf604858
Oct 8 02:02:49 Homebase kernel: RBP: 0000000000000000 R08: ffff888f5911fa70 R09: ffff888f5911fa68
Oct 8 02:02:49 Homebase kernel: R10: ffffea0022918ec0 R11: ffff888ffe9e0b80 R12: ffff888fbfafe000
Oct 8 02:02:49 Homebase kernel: R13: ffffc9001294fd20 R14: 0000000000000000 R15: 0000000000000000
Oct 8 02:02:49 Homebase kernel: FS: 000014b7fa3ac780(0000) GS:ffff888ffe9c0000(0000) knlGS:0000000000000000
Oct 8 02:02:49 Homebase kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 8 02:02:49 Homebase kernel: CR2: 0000000000000080 CR3: 00000001a6aa2000 CR4: 0000000000340ee0
Oct 8 02:02:49 Homebase kernel: Call Trace:
Oct 8 02:02:49 Homebase kernel: ? dput.part.6+0x24/0xf6
Oct 8 02:02:49 Homebase kernel: btrfs_ioctl_fitrim.isra.7+0xfe/0x135
Oct 8 02:02:49 Homebase kernel: btrfs_ioctl+0x4f6/0x28ad
Oct 8 02:02:49 Homebase kernel: ? queue_var_show+0x12/0x15
Oct 8 02:02:49 Homebase kernel: ? _copy_to_user+0x22/0x28
Oct 8 02:02:49 Homebase kernel: ? cp_new_stat+0x14b/0x17a
Oct 8 02:02:49 Homebase kernel: ? vfs_ioctl+0x19/0x26
Oct 8 02:02:49 Homebase kernel: vfs_ioctl+0x19/0x26
Oct 8 02:02:49 Homebase kernel: do_vfs_ioctl+0x526/0x54e
Oct 8 02:02:49 Homebase kernel: ? __se_sys_newfstat+0x3c/0x5f
Oct 8 02:02:49 Homebase kernel: ksys_ioctl+0x39/0x58
Oct 8 02:02:49 Homebase kernel: __x64_sys_ioctl+0x11/0x14
Oct 8 02:02:49 Homebase kernel: do_syscall_64+0x57/0xf2
Oct 8 02:02:49 Homebase kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 8 02:02:49 Homebase kernel: RIP: 0033:0x14b7fa4de397
Oct 8 02:02:49 Homebase kernel: Code: 00 00 90 48 8b 05 f9 2a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 2a 0d 00 f7 d8 64 89 01 48
Oct 8 02:02:49 Homebase kernel: RSP: 002b:00007ffc52c9f358 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 8 02:02:49 Homebase kernel: RAX: ffffffffffffffda RBX: 00007ffc52c9f4b0 RCX: 000014b7fa4de397
Oct 8 02:02:49 Homebase kernel: RDX: 00007ffc52c9f360 RSI: 00000000c0185879 RDI: 0000000000000003
Oct 8 02:02:49 Homebase kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000415fd0
Oct 8 02:02:49 Homebase kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000415740
Oct 8 02:02:49 Homebase kernel: R13: 00000000004156c0 R14: 0000000000415740 R15: 000014b7fa3ac6b0
Oct 8 02:02:49 Homebase kernel: Modules linked in: veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap macvlan xt_nat ipt_MASQUERADE iptable_nat nf_nat_ipv4 iptable_filter ip_tables nf_nat xfs dm_crypt algif_skcipher af_alg dm_mod dax md_mod bonding edac_mce_amd kvm_amd nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) drm_kms_helper btusb btrtl btbcm drm kvm btintel igb bluetooth agpgart syscopyarea sysfillrect crct10dif_pclmul sysimgblt fb_sys_fops crc32_pclmul crc32c_intel ghash_clmulni_intel i2c_piix4 i2c_algo_bit pcbc i2c_core aesni_intel aes_x86_64 crypto_simd wmi_bmof mxm_wmi ahci ecdh_generic cryptd ccp libahci glue_helper wmi button pcc_cpufreq acpi_cpufreq
Oct 8 02:02:49 Homebase kernel: CR2: 0000000000000080
Oct 8 02:02:49 Homebase kernel: ---[ end trace 9bdd9e618dc0d9c2 ]---
Oct 8 02:02:49 Homebase kernel: RIP: 0010:btrfs_trim_fs+0x166/0x369
Oct 8 02:02:49 Homebase kernel: Code: 00 00 48 c7 44 24 38 00 00 00 00 49 8b 45 10 48 c7 44 24 40 00 00 00 00 48 c7 44 24 30 00 00 00 00 48 89 44 24 20 48 8b 43 68 <48> 8b 80 80 00 00 00 48 8b 80 f8 03 00 00 48 8b 80 a8 01 00 00 0f
Oct 8 02:02:49 Homebase kernel: RSP: 0018:ffffc9001294fc90 EFLAGS: 00010297
Oct 8 02:02:49 Homebase kernel: RAX: 0000000000000000 RBX: ffff888f5db68200 RCX: ffff888fbf604878
Oct 8 02:02:49 Homebase kernel: RDX: ffff888cac98de80 RSI: ffff888f5d718c00 RDI: ffff888fbf604858
Oct 8 02:02:49 Homebase kernel: RBP: 0000000000000000 R08: ffff888f5911fa70 R09: ffff888f5911fa68
Oct 8 02:02:49 Homebase kernel: R10: ffffea0022918ec0 R11: ffff888ffe9e0b80 R12: ffff888fbfafe000
Oct 8 02:02:49 Homebase kernel: R13: ffffc9001294fd20 R14: 0000000000000000 R15: 0000000000000000
Oct 8 02:02:49 Homebase kernel: FS: 000014b7fa3ac780(0000) GS:ffff888ffe9c0000(0000) knlGS:0000000000000000
Oct 8 02:02:49 Homebase kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 8 02:02:49 Homebase kernel: CR2: 0000000000000080 CR3: 00000001a6aa2000 CR4: 0000000000340ee0

Edited October 8, 2019 by SnickySnacks
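(The affected device can be pulled straight out of the syslog. A small sketch; the sample line is taken from the log above, and on the server you would feed it /var/log/syslog instead:)

```shell
# Pull the device names out of BTRFS critical messages. Reads stdin, so it
# works on /var/log/syslog or a pasted snippet alike.
extract_btrfs_devs() {
  grep 'BTRFS critical' | grep -o 'device [a-z0-9]*' | awk '{print $2}' | sort -u
}

# Demo against the line from the diagnostics:
echo 'Oct 7 23:56:33 Homebase kernel: BTRFS critical (device sdj1): corrupt leaf' \
  | extract_btrfs_devs    # -> sdj1
```

Once you know the device, `btrfs check --readonly /dev/sdj1` (with the filesystem unmounted, i.e. array stopped) or `btrfs scrub start -B /mnt/cache` (on the mounted pool) will verify it.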
MrMister Posted October 9, 2019

Quote: At the time those diagnostics were created, were any CPUs showing 100% load? If so, which ones?

As you can see in the bad photo I took of the web UI, 21 of 24 threads were at 100%...

Quote: Also, have you tried booting in safe mode and seeing if this occurs with no plugins/dockers/vms loaded?

I did, and the issue does not occur there!

Quote: There does seem to be some corruption on one of your disks:

The plot thickens! When I started clearing the cache disk (downloads that weren't moved yet because the mover didn't get to run, etc.), the load started to rise again! sdj is one of the cache disks (2 x 1 TB Crucial SSDs in total). They were RAID 1 before the upgrade!

I remember that on the first boot of the new build I forgot to plug in one of the PCIe SATA cards and started the array anyway (one array disk and one cache drive were missing). Since the array disk was emulated and the cache drives were RAID 1, I thought, let's give it a spin! I then installed the card; before, the cache showed roughly 1 TB, and now it's showing 2 TB total capacity!

So I need to check if I can grab everything from the cache drive ASAP (all my appdata and vdisks), maybe with it connected externally to another machine. Then remove the failing cache drive from the config and add it again?

Thank you so far!
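(A rescue-copy sketch for that "grab everything ASAP" step. The paths here are examples: it is demonstrated on temp directories so it runs anywhere, but on the server the source would be /mnt/cache and the destination a mount of the spare SSD; `rsync -avh --ignore-errors` is the usual tool for the real thing.)

```shell
# Rescue copy: pull everything readable off the suspect pool before
# repairing it, preserving attributes. Demo paths only.
SRC=$(mktemp -d)
DST=$(mktemp -d)
mkdir -p "$SRC/appdata" && echo config > "$SRC/appdata/app.cfg"

cp -a "$SRC/." "$DST/"          # on the server: rsync -avh --ignore-errors /mnt/cache/ "$DST/"
cat "$DST/appdata/app.cfg"      # -> config
```

The `--ignore-errors` flag matters on a corrupt filesystem: it keeps the copy going past unreadable files instead of aborting.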
SnickySnacks Posted October 9, 2019

I don't know much about cache disks. The obvious suggestion is to fix the error on the disk, then enable plugins/dockers/VMs one at a time until you can determine which one is causing the problem, I suppose.
MrMister Posted October 10, 2019

14 hours ago, SnickySnacks said: I don't know much about cache disks. The obvious suggestion is to fix the error on the disk, then enable plugins/dockers/VMs one at a time until you can determine which one is causing the problem, I suppose.

Thank you so much for your support! As presumed, one of the cache drives had been corrupted. What I did to fix it:
- grabbed everything I could from the healthy drive onto another SSD
- hit New Config, keeping only parity and data and removing the cache drives
- disabled Docker and libvirt via Settings
- started the array
- stopped the array
- removed the partitions of the cache drive (Unassigned Devices)
- added both cache drives again
- started the array
- formatted the cache drives
- copied the data back from the other SSD
- enabled Docker and libvirt again

Et voilà, now it also shows the correct size (RAID 1, 9xx GB) again.
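(The "2 TB total" was the giveaway, incidentally: a two-device RAID 1 pool mirrors every block, so its usable capacity is the smaller device, not the sum. A trivial sketch of that arithmetic:)

```shell
# RAID 1 mirrors data across both devices, so a two-device pool's usable
# capacity is bounded by the smaller device, not the sum of both.
raid1_usable_gb() {
  # $1, $2: device sizes in GB
  if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

raid1_usable_gb 1000 1000   # -> 1000; a pool reporting ~2000 GB is not mirrored
```

After a rebuild like the one above, `btrfs filesystem df /mnt/cache` listing `Data, RAID1` and `Metadata, RAID1` confirms the mirror profile is back.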