Sudden unclean reboot caused Machine Check Events alert


augot


At some point this afternoon my server appears to have rebooted itself. When I realised and logged back in, Fix Common Problems alerted me: "Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged." So... Here I am! Diagnostics zip attached. I'm on 6.9.0-rc2.

 

Specs:

CPU: Ryzen 9 5950X

RAM: 4x16GB G.Skill Ripjaws V 3400 DDR4 C16

Motherboard: Asus X570-P Prime

Storage: 2x1TB SSD cache pool for appdata, downloads and the like; 2x2TB NVMe drives holding the vdisk for a Windows 10 VM; 8x8TB HDD array

GPU: GTX 1660 (used for Plex), RTX 2070 Super (passed through to Windows VM)

PSU: 1000W Corsair HX1000i

 

Looking through the log, there's an error message that repeated a bunch of times before Unraid booted up normally:

 

Quote

Feb  6 15:40:15 Tower kernel: BUG: Bad page state in process swapper  pfn:668412
Feb  6 15:40:15 Tower kernel: page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x668412
Feb  6 15:40:15 Tower kernel: flags: 0x2ffff0000000000()
Feb  6 15:40:15 Tower kernel: raw: 02ffff0000000000 ffffea0019a10488 ffffea0019a10488 0000000000000000
Feb  6 15:40:15 Tower kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000008000000
Feb  6 15:40:15 Tower kernel: page dumped because: page still charged to cgroup
Feb  6 15:40:15 Tower kernel: page->mem_cgroup:0000000008000000
Feb  6 15:40:15 Tower kernel: Modules linked in:
Feb  6 15:40:15 Tower kernel: CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.1-Unraid #1
Feb  6 15:40:15 Tower kernel: Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 3001 12/04/2020
Feb  6 15:40:15 Tower kernel: Call Trace:
Feb  6 15:40:15 Tower kernel: dump_stack+0x6b/0x83
Feb  6 15:40:15 Tower kernel: bad_page+0xcb/0xe3
Feb  6 15:40:15 Tower kernel: check_free_page+0x70/0x76
Feb  6 15:40:15 Tower kernel: __free_pages_ok+0x83/0x1ad
Feb  6 15:40:15 Tower kernel: memblock_free_all+0x12f/0x19e
Feb  6 15:40:15 Tower kernel: mem_init+0x18/0x138
Feb  6 15:40:15 Tower kernel: start_kernel+0x26b/0x4e3
Feb  6 15:40:15 Tower kernel: secondary_startup_64_no_verify+0xb0/0xbb
Feb  6 15:40:15 Tower kernel: Disabling lock debugging due to kernel taint
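Since the message repeats, it can help to count how often it shows up and whether it always hits the same page frame. Below is a minimal sketch, assuming the syslog from the diagnostics zip has been extracted to a text file; the function name `summarize_bad_pages` and the file path passed on the command line are illustrative, not part of any Unraid tool. It only matches the "Bad page state" line format quoted above.

```python
import re
import sys
from collections import Counter

# Matches lines in the format quoted above, e.g.
# "... kernel: BUG: Bad page state in process swapper  pfn:668412"
BAD_PAGE = re.compile(
    r"kernel: BUG: Bad page state in process (\S+)\s+pfn:([0-9a-fA-F]+)"
)


def summarize_bad_pages(lines):
    """Return (total reports, Counter of pfns, Counter of process names)."""
    pfns = Counter()
    procs = Counter()
    for line in lines:
        match = BAD_PAGE.search(line)
        if match:
            procs[match.group(1)] += 1
            pfns[match.group(2)] += 1
    return sum(pfns.values()), pfns, procs


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the path to an extracted syslog (hypothetical location).
    with open(sys.argv[1], errors="replace") as handle:
        total, pfns, procs = summarize_bad_pages(handle)
    print(f"{total} bad-page reports across {len(pfns)} distinct pfns")
    for pfn, count in pfns.most_common(10):
        print(f"  pfn {pfn}: {count}")
```

Whether the reports cluster on a few page frames or are spread widely is only a hint for further testing; it does not by itself say whether the RAM is at fault.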

 

Not sure whether this is the specific problem that caused the reboot and the Machine Check Events warning. I searched "kernel taint" and it seems it's not the kind of thing that would usually indicate a hardware failure, but maybe I'm mistaken there? My other guess is the RAM: it's new as of about six weeks ago and might be faulty. Once the parity check finishes I'm gonna reboot properly and run a memtest. Any help appreciated!

 

tower-diagnostics-20210206-1558.zip

Edited by augot

Not sure how I missed that there was a ceiling on the supported RAM speeds for 3rd gen Ryzen. Whoops. Somewhat of a shame, since I went for a 3400MHz/C16 combo for gaming performance in my Windows VM rather than just for the sake of it, upgrading from 2667MHz sticks which, it turns out, would actually have been more appropriate all along... Hmm. The parity check has about an hour left to run. I'll still run memtest overnight just to be sure, but if nothing bad comes up I'll see whether it happens again and tune down the RAM in the BIOS.
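To confirm what speed the DIMMs are actually running at before and after changing the BIOS setting, the output of `dmidecode -t memory` can be checked. A rough sketch follows, assuming that output has been saved as text and that each memory device lists a `Locator:` line followed by a `Configured Memory Speed:` line (older dmidecode versions call it `Configured Clock Speed:`). The helper names and the limit value are illustrative; the limit is whatever speed you decide is appropriate for your configuration, not a figure confirmed here.

```python
import re

# Picks out the DIMM slot name and its configured speed; "Bank Locator:"
# and the plain "Speed:" (rated speed) lines are deliberately not matched.
FIELD = re.compile(
    r"^\s*(Locator|Configured (?:Memory|Clock) Speed):\s*(.+?)\s*$"
)


def configured_speeds(dmidecode_text):
    """Map each DIMM locator to its configured speed (None if not numeric)."""
    speeds = {}
    locator = None
    for line in dmidecode_text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "Locator":
            locator = value
        elif locator is not None:
            number = re.match(r"(\d+)", value)
            speeds[locator] = int(number.group(1)) if number else None
    return speeds


def over_limit(speeds, limit):
    """Return the locators whose configured speed exceeds the chosen limit."""
    return sorted(
        slot for slot, speed in speeds.items()
        if speed is not None and speed > limit
    )
```

For example, `over_limit(configured_speeds(text), 2667)` would list any slots still configured above 2667, the speed of my old sticks, after the BIOS change.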

 

I had also been having some weird issues with USB devices in my Windows VM: Bluetooth devices would have connection dropouts if connected to the same passed-through USB 3.0 hub (which never happened on bare metal). Would incorrect RAM speed play a part in that, perhaps?
