augot Posted February 6, 2021

At some point this afternoon my server appears to have rebooted itself. When I realised and logged back in, Fix Common Problems alerted me:

"Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged."

So... Here I am! Diagnostics zip attached. I'm on 6.9.0-rc2.

Specs:
CPU: Ryzen 9 5950X
RAM: 4x16GB G.Skill Ripjaws V 3400 DDR4 C16
Motherboard: Asus X570-P Prime
Storage: 2x1TB SSD cache pool for appdata, downloads and similar; 2x2TB NVMe drives for the vdisk of a Windows 10 VM; 8x8TB HDD array
GPU: GTX1660 (used for Plex), GTX2070 Super (passed through to Windows VM)
PSU: 1000W Corsair HX1000i

Looking through the log, there's an error message that repeated a number of times before Unraid booted up normally:

Quote
Feb 6 15:40:15 Tower kernel: BUG: Bad page state in process swapper pfn:668412
Feb 6 15:40:15 Tower kernel: page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x668412
Feb 6 15:40:15 Tower kernel: flags: 0x2ffff0000000000()
Feb 6 15:40:15 Tower kernel: raw: 02ffff0000000000 ffffea0019a10488 ffffea0019a10488 0000000000000000
Feb 6 15:40:15 Tower kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000008000000
Feb 6 15:40:15 Tower kernel: page dumped because: page still charged to cgroup
Feb 6 15:40:15 Tower kernel: page->mem_cgroup:0000000008000000
Feb 6 15:40:15 Tower kernel: Modules linked in:
Feb 6 15:40:15 Tower kernel: CPU: 0 PID: 0 Comm: swapper Not tainted 5.10.1-Unraid #1
Feb 6 15:40:15 Tower kernel: Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 3001 12/04/2020
Feb 6 15:40:15 Tower kernel: Call Trace:
Feb 6 15:40:15 Tower kernel: dump_stack+0x6b/0x83
Feb 6 15:40:15 Tower kernel: bad_page+0xcb/0xe3
Feb 6 15:40:15 Tower kernel: check_free_page+0x70/0x76
Feb 6 15:40:15 Tower kernel: __free_pages_ok+0x83/0x1ad
Feb 6 15:40:15 Tower kernel: memblock_free_all+0x12f/0x19e
Feb 6 15:40:15 Tower kernel: mem_init+0x18/0x138
Feb 6 15:40:15 Tower kernel: start_kernel+0x26b/0x4e3
Feb 6 15:40:15 Tower kernel: secondary_startup_64_no_verify+0xb0/0xbb
Feb 6 15:40:15 Tower kernel: Disabling lock debugging due to kernel taint

I'm not sure whether this is the specific problem that caused the reboot and the Machine Check Events warning. I searched "kernel taint" and it seems it's not the kind of thing that would usually indicate a hardware failure, but maybe I'm mistaken there? My other guess is the RAM, which is new as of about six weeks ago and might be faulty. Once the parity check finishes I'm going to reboot properly and run a memtest. Any help appreciated!

tower-diagnostics-20210206-1558.zip

Edited February 7, 2021 by augot
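For anyone wanting to see where the reported page sits in memory, here is a minimal Python sketch that pulls the page frame number (pfn) out of "Bad page state" lines and converts it to a physical address. It assumes 4 KiB pages (the usual x86-64 default) and that the pfn is printed in hexadecimal, as in the log above. The helper names `bad_page_pfns` and `pfn_to_phys` are made up for illustration, and a physical address does not map directly onto a particular DIMM, so treat the result as a rough pointer only.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages, the x86-64 default

# Matches both "pfn:668412" and "pfn:0x668412"; the value is hexadecimal.
PFN_RE = re.compile(r"pfn:(?:0x)?([0-9a-fA-F]+)")


def bad_page_pfns(lines):
    """Return the sorted, de-duplicated pfns found on 'Bad page state' lines."""
    pfns = set()
    for line in lines:
        if "Bad page state" in line:
            match = PFN_RE.search(line)
            if match:
                pfns.add(int(match.group(1), 16))
    return sorted(pfns)


def pfn_to_phys(pfn):
    """Convert a page frame number to the physical address of the page start."""
    return pfn << PAGE_SHIFT


# Two lines copied from the log in the post above.
sample = [
    "Feb 6 15:40:15 Tower kernel: BUG: Bad page state in process swapper pfn:668412",
    "Feb 6 15:40:15 Tower kernel: page dumped because: page still charged to cgroup",
]

for pfn in bad_page_pfns(sample):
    addr = pfn_to_phys(pfn)
    print(f"pfn 0x{pfn:x} -> physical address 0x{addr:x} ({addr / 2**30:.2f} GiB)")
```

Running the same function over the full syslog would show whether the error always hits the same page or moves around between boots, which may help when comparing against memtest results.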
JorgeB Posted February 7, 2021

Make sure you stick to the officially supported RAM speeds.
augot (Author) Posted February 7, 2021

Not sure how I missed that there was a ceiling to the supported RAM speeds for 3rd gen Ryzen. Whoops. It's somewhat of a shame, since I went for a 3400MHz/C16 combo for gaming performance in my Windows VM rather than just for the sake of it, upgrading from 2667MHz sticks which, it turns out, would actually have been more appropriate all along... Hmm.

The parity check has about an hour left to run. I'll still run memtest overnight just to be sure, but if nothing bad comes up I'll lower the RAM speed in the BIOS and see whether it happens again.

I had been having some weird issues with USB devices in my Windows VM, with Bluetooth devices suffering connection dropouts when connected to the same passed-through USB 3.0 hub (which never happened on bare metal). Would incorrect RAM speed play a part in that, perhaps?
JorgeB Posted February 7, 2021

47 minutes ago, augot said:
would incorrect RAM speed play a part in that, perhaps?

Possibly, but unlikely.