Failed MB, replaced it, getting hard locks


Hardware:

X399, Threadripper 2920X

64 GB RAM

SAS 9211-8i

12x 8 TB drives

1 TB, 512 GB and 512 GB NVMe for btrfs cache

 

Background: I bought a new server case (Supermicro 846).

I tried the new server backplane / swappable bays and everything worked fine.

While moving everything over to the new case, I (I'm assuming) broke the MB; it would no longer POST.

I bought a new MB, installed it, and the system started booting, but with lots of strange issues. So far I've tried a few things:

  • Reinstalled stock Unraid 6.8.3 (instead of the Nvidia driver build)
  • Tried every combination of disabled C-states and PSU idle power states (some kernel dumps were referencing CPUIDLE in the error)
  • Disabled all Docker autostarts (it seemed like I got some Docker-related errors?)
  • Removed references to the old passed-through GPU from my Plex container (thinking the GUIDs might be different?)
  • Flashed the latest BIOS on the motherboard

 

Currently there are no autostart VMs or Docker containers. I autostart the array and it crashes after a few minutes. Occasionally it will not boot at all.

 

There is apparently a tower diagnostics zip in the logs folder from about 90 minutes prior to the log below. I'm not sure what I did to trigger it, though. I've attached it.

 

At one point it stayed up long enough for me to enable mirroring the syslog to flash, and I got this chunk before it died:

Apr  9 21:23:04 Tower kernel: traps: notify[7202] general protection ip:68d370 sp:7ffeee22bfb8 error:0 in php[433000+2b4000]
Apr  9 21:23:05 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.1908.0" x-pid="7171" x-info="https://www.rsyslog.com"] start
Apr  9 21:24:07 Tower kernel: notify[7605]: segfault at 502 ip 000000000065caae sp 00007ffebc8df200 error 4 in php[433000+2b4000]
Apr  9 21:24:07 Tower kernel: Code: 15 81 8c 24 b4 00 00 00 00 00 00 01 83 ea 01 89 94 24 d0 00 00 00 48 8b 40 08 48 83 f8 01 76 28 48 3d ff 01 00 00 76 15 31 ed <80> 38 3f 40 0f 94 c5 48 01 c5 4d 85 e4 0f 84 7f 05 00 00 81 8c 24
Apr  9 21:24:07 Tower kernel: mdcmd (49): nocheck Pause
Apr  9 21:24:50 Tower init: Switching to runlevel: 0
Apr  9 21:24:50 Tower init: Trying to re-exec init
Apr  9 21:25:34 Tower ntpd[2465]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Apr  9 21:25:48 Tower kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Apr  9 21:25:48 Tower kernel: rcu: 	21-...0: (67 ticks this GP) idle=836/1/0x4000000000000000 softirq=3068/3068 fqs=58815 
Apr  9 21:25:48 Tower kernel: rcu: 	(detected by 18, t=240007 jiffies, g=14481, q=73080)
Apr  9 21:25:48 Tower kernel: Sending NMI from CPU 18 to CPUs 21:
Apr  9 21:25:48 Tower kernel: NMI backtrace for cpu 21
Apr  9 21:25:48 Tower kernel: CPU: 21 PID: 5299 Comm: unraidd0 Tainted: G      D    O      4.19.107-Unraid #1
Apr  9 21:25:48 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.90 12/04/2019
Apr  9 21:25:48 Tower kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x171
Apr  9 21:25:48 Tower kernel: Code: 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02 00 65 48 03 15 80 6a f8
Apr  9 21:25:48 Tower kernel: RSP: 0018:ffffc9000730bd80 EFLAGS: 00000002
Apr  9 21:25:48 Tower kernel: RAX: 0000000000000101 RBX: ffff888ff8830b08 RCX: 0000000000000000
Apr  9 21:25:48 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff889031c3fd70
Apr  9 21:25:48 Tower kernel: RBP: ffff889031c3fd70 R08: 0000000000000000 R09: ffffc9000730bd48
Apr  9 21:25:48 Tower kernel: R10: 0000000000000fe0 R11: ffff888ff8830b88 R12: ffff888ff8830af8
Apr  9 21:25:48 Tower kernel: R13: ffff889031c3f800 R14: ffff888ff8831540 R15: ffff888ffc16e800
Apr  9 21:25:48 Tower kernel: FS:  0000000000000000(0000) GS:ffff88903d340000(0000) knlGS:0000000000000000
Apr  9 21:25:48 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  9 21:25:48 Tower kernel: CR2: 0000000000514e54 CR3: 0000000fec62c000 CR4: 00000000003406e0
Apr  9 21:25:48 Tower kernel: Call Trace:
Apr  9 21:25:48 Tower kernel: _raw_spin_lock_irq+0x1d/0x20
Apr  9 21:25:48 Tower kernel: release_stripe+0x1b/0x3d [md_mod]
Apr  9 21:25:48 Tower kernel: unraidd+0x12d7/0x136e [md_mod]
Apr  9 21:25:48 Tower kernel: ? __switch_to_asm+0x35/0x70
Apr  9 21:25:48 Tower kernel: ? __schedule+0x4f7/0x548
Apr  9 21:25:48 Tower kernel: ? md_thread+0xee/0x115 [md_mod]
Apr  9 21:25:48 Tower kernel: md_thread+0xee/0x115 [md_mod]
Apr  9 21:25:48 Tower kernel: ? wait_woken+0x6a/0x6a
Apr  9 21:25:48 Tower kernel: ? md_open+0x2c/0x2c [md_mod]
Apr  9 21:25:48 Tower kernel: kthread+0x10c/0x114
Apr  9 21:25:48 Tower kernel: ? kthread_park+0x89/0x89
Apr  9 21:25:48 Tower kernel: ret_from_fork+0x22/0x40
Apr  9 21:26:33 Tower root: Status of all loop devices
Apr  9 21:26:33 Tower root: /dev/loop1: [2049]:4 (/boot/bzfirmware)
Apr  9 21:26:33 Tower root: /dev/loop2: [0037]:260 (/mnt/cache/system/docker/docker.img)
Apr  9 21:26:33 Tower root: /dev/loop0: [2049]:3 (/boot/bzmodules)
Apr  9 21:26:33 Tower root: Active pids left on /mnt/*
Apr  9 21:26:33 Tower root:                      USER        PID ACCESS COMMAND
Apr  9 21:26:33 Tower root: /mnt/cache:          root     kernel mount /mnt/cache
Apr  9 21:26:33 Tower root: /mnt/disk1:          root     kernel mount /mnt/disk1
Apr  9 21:26:33 Tower root: /mnt/disk10:         root     kernel mount /mnt/disk10
Apr  9 21:26:33 Tower root: /mnt/disk2:          root     kernel mount /mnt/disk2
Apr  9 21:26:33 Tower root: /mnt/disk3:          root     kernel mount /mnt/disk3
Apr  9 21:26:33 Tower root: /mnt/disk4:          root     kernel mount /mnt/disk4
Apr  9 21:26:33 Tower root: /mnt/disk5:          root     kernel mount /mnt/disk5
Apr  9 21:26:33 Tower root: /mnt/disk6:          root     kernel mount /mnt/disk6
Apr  9 21:26:33 Tower root: /mnt/disk7:          root     kernel mount /mnt/disk7
Apr  9 21:26:33 Tower root: /mnt/disk8:          root     kernel mount /mnt/disk8
Apr  9 21:26:33 Tower root: /mnt/disk9:          root     kernel mount /mnt/disk9
Apr  9 21:26:33 Tower root: /mnt/user:           root     kernel mount /mnt/user
Apr  9 21:26:33 Tower root: /mnt/user0:          root     kernel mount /mnt/user0
Apr  9 21:26:33 Tower root: Active pids left on /dev/md*
Apr  9 21:26:33 Tower root:                      USER        PID ACCESS COMMAND
Apr  9 21:26:33 Tower root: /dev/md1:            root     kernel mount /mnt/disk1
Apr  9 21:26:33 Tower root: /dev/md10:           root     kernel mount /mnt/disk10
Apr  9 21:26:33 Tower root: /dev/md2:            root     kernel mount /mnt/disk2
Apr  9 21:26:33 Tower root: /dev/md3:            root     kernel mount /mnt/disk3
Apr  9 21:26:33 Tower root: /dev/md4:            root     kernel mount /mnt/disk4
Apr  9 21:26:33 Tower root: /dev/md5:            root     kernel mount /mnt/disk5
Apr  9 21:26:33 Tower root: /dev/md6:            root     kernel mount /mnt/disk6
Apr  9 21:26:33 Tower root: /dev/md7:            root     kernel mount /mnt/disk7
Apr  9 21:26:33 Tower root: /dev/md8:            root     kernel mount /mnt/disk8
Apr  9 21:26:33 Tower root: /dev/md9:            root     kernel mount /mnt/disk9
Apr  9 21:26:33 Tower root: Generating diagnostics...
Apr  9 21:26:39 Tower kernel: BUG: unable to handle kernel paging request at 00000000000096d1
Apr  9 21:26:39 Tower kernel: PGD fdf792067 P4D fdf792067 PUD fdf178067 PMD 0 
Apr  9 21:26:39 Tower kernel: Oops: 0000 [#2] SMP NOPTI
Apr  9 21:26:39 Tower kernel: CPU: 5 PID: 207 Comm: kworker/u256:6 Tainted: G      D    O      4.19.107-Unraid #1
Apr  9 21:26:39 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Taichi, BIOS P3.90 12/04/2019
Apr  9 21:26:39 Tower kernel: Workqueue: events_power_efficient gc_worker
Apr  9 21:26:39 Tower kernel: RIP: 0010:gc_worker+0x8c/0x270
Apr  9 21:26:39 Tower kernel: Code: 93 00 48 8b 15 e4 9a 93 00 3b 05 c2 9a 93 00 75 dd 39 cd 72 02 31 ed 89 e8 48 8d 04 c2 4c 8b 30 41 f6 c6 01 0f 85 4a 01 00 00 <41> 0f b6 46 37 49 c7 c0 f0 ff ff ff 41 ff c5 48 6b c0 38 49 29 c0
Apr  9 21:26:39 Tower kernel: RSP: 0018:ffffc90006ecbe60 EFLAGS: 00010246
Apr  9 21:26:39 Tower kernel: RAX: ffff889031125bb0 RBX: 0000000000000000 RCX: 0000000000010000
Apr  9 21:26:39 Tower kernel: RDX: ffff889031100000 RSI: 0000000000000175 RDI: ffffffff822aa760
Apr  9 21:26:39 Tower kernel: RBP: 0000000000004b76 R08: ffffffffffffffb8 R09: 0000746e65696369
Apr  9 21:26:39 Tower kernel: R10: 8080808080808080 R11: fefefefefefefeff R12: ffffffff822aa760
Apr  9 21:26:39 Tower kernel: R13: 0000000000000001 R14: 000000000000969a R15: ffff888fe4180000
Apr  9 21:26:39 Tower kernel: FS:  0000000000000000(0000) GS:ffff88903cf40000(0000) knlGS:0000000000000000
Apr  9 21:26:39 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  9 21:26:39 Tower kernel: CR2: 00000000000096d1 CR3: 0000001035c16000 CR4: 00000000003406e0
Apr  9 21:26:39 Tower kernel: Call Trace:
Apr  9 21:26:39 Tower kernel: process_one_work+0x16e/0x24f
Apr  9 21:26:39 Tower kernel: worker_thread+0x1e2/0x2b8
Apr  9 21:26:39 Tower kernel: ? rescuer_thread+0x2a7/0x2a7
Apr  9 21:26:39 Tower kernel: kthread+0x10c/0x114
Apr  9 21:26:39 Tower kernel: ? kthread_park+0x89/0x89
Apr  9 21:26:39 Tower kernel: ret_from_fork+0x22/0x40
Apr  9 21:26:39 Tower kernel: Modules linked in: ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding igb(O) edac_mce_amd kvm_amd kvm btusb btrtl btbcm btintel bluetooth crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 i2c_piix4 crypto_simd wmi_bmof mxm_wmi i2c_core k10temp mpt3sas ecdh_generic cryptd glue_helper raid_class ccp scsi_transport_sas nvme ahci nvme_core libahci wmi pcc_cpufreq button acpi_cpufreq [last unloaded: igb]
Apr  9 21:26:39 Tower kernel: CR2: 00000000000096d1
Apr  9 21:26:39 Tower kernel: ---[ end trace 0d847ac0fcfecec6 ]---
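
For anyone who wants to skim a longer mirrored syslog for the same kinds of events as in the chunk above, here is a rough Python sketch. It only counts a handful of crash markers and pulls the kernel symbol out of each RIP line. The marker list and the `summarize` helper are my own illustrative choices based on this excerpt, not anything Unraid ships, and the list is certainly not exhaustive.

```python
import re
from collections import Counter

# Markers copied from the log excerpt above.
# Illustrative only: other crashes will use other wording.
MARKERS = {
    "segfault": re.compile(r"segfault at"),
    "general protection": re.compile(r"general protection"),
    "rcu stall": re.compile(r"rcu_sched detected stalls"),
    "paging request": re.compile(r"unable to handle kernel paging request"),
    "oops": re.compile(r"Oops: "),
}

# Matches lines such as "RIP: 0010:gc_worker+0x8c/0x270"
# and captures the symbol name ("gc_worker").
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:(\w+)\+0x")


def summarize(lines):
    """Count crash markers and collect faulting kernel symbols, in log order."""
    counts = Counter()
    symbols = []
    for line in lines:
        for name, pattern in MARKERS.items():
            if pattern.search(line):
                counts[name] += 1
        match = RIP_RE.search(line)
        if match:
            symbols.append(match.group(1))
    return counts, symbols
```

To use it, pass any iterable of log lines, for example an open file handle on the mirrored syslog (the path depends on where your copy lives). If the faulting symbols come from unrelated subsystems, as they do in the excerpt above (md_mod in one trace, gc_worker in the other), that is at least consistent with a hardware-level problem rather than one misbehaving driver, though it does not prove it.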

 

tower-diagnostics-20200409-1959.zip
