ConqueRhor

  1. Haha, what a coincidence that you found my post! I had found your response to a similar question from a while back about a suspiciously familiar MCE event; thank you for responding here too. Totally agree. Once my wife and I manage to get out of this apartment and into a house, I have plans for a purpose-built rackmounted server; for now I'm just working with what we have space for. I've gone ahead and ordered that NH-D15. It's among the best air coolers for the FCLGA1151 socket, and I've got some unused Arctic MX-4 in a drawer somewhere. I checked to make sure it will clear the mobo/RAM and fit in the case. Unfortunately, the GPU in this system has an AIO built onto it, and I haven't yet found a replacement cooler that I'm confident is compatible with the PCB. The purpose-built one will most likely have an air-cooled Quadro.
  2. Thank you for your time in taking a look at my post and responding! Respectfully, I'm probably just going to order a new cooler before bothering with the thermal compound and any gunk that may have accumulated on the radiator. Ultimately I think that servers should be air-cooled and I'd rather phase out the liquid cooling of both the CPU and GPU of this system - this is a good excuse to take care of the CPU part.
  3. tl;dr: I think my CPU may be overheating, and I would really appreciate a second set of eyes on the issue before I spend money on a new cooler.

My unRAID server is built from a last-gen gaming computer with some additional hard disks and is plugged into a more-than-adequate uninterruptible power supply. I run a few shares and a few containers: Plex (using the GPU for hardware-accelerated transcoding), mineOS (currently disabled) for hosting a Minecraft server, and UniFi Controller.

After 140 days of continuous uptime hosting these workloads, the server has begun crashing occasionally, and the frequency seems to be increasing. Symptoms: hosted services and the unRAID GUI go down, the case fans spin continuously for a bit, and when the GUI becomes responsive again, the array and Docker containers are in a stopped state. Starting the array auto-starts a parity check, which has always found 0 errors. There have also been some segmentation fault errors in the Plex server logs. (Note: I later changed the array and containers to auto-start. Yes, I know this is not the best idea.)

My first thought (esp. with segfaults) was that one or more sticks of RAM were on their way out, so I ran 4 passes of memtest86 off a bootable flash drive overnight - no errors. I installed the "Fix Common Problems" extension. After one of the reboots, there was a message about Machine Check Events, so I got nerdpack and mcelog installed and, as a troubleshooting step, began mirroring the syslog to my boot flash drive. I have now managed to capture the MCE output that is logged when the machine starts up and FCP presents the error in the GUI, but I'm not sure how to decipher it - from some googling, it seems like it could be a nothing error.
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
kernel: mce: [Hardware Error]: TSC 0 ADDR 14e2fdc88d58 MISC 14e2fdc88d58
kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1613594950 SOCKET 0 APIC 0 microcode d6
kernel: Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
kernel: ... version: 4
kernel: ... bit width: 48
kernel: ... generic registers: 4
kernel: ... value mask: 0000ffffffffffff
kernel: ... max period: 00007fffffffffff
kernel: ... fixed-purpose events: 3
kernel: ... event mask: 000000070000000f
kernel: rcu: Hierarchical SRCU implementation.
kernel: smp: Bringing up secondary CPUs ...
kernel: x86: Booting SMP configuration:
kernel: .... node #0, CPUs: #1 #2
kernel: mce: [Hardware Error]: Machine check events logged
kernel: #3
kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: be00000000800400
kernel: mce: [Hardware Error]: TSC 0 ADDR 14e2fdc88d58 MISC 14e2fdc88d58
kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1613594950 SOCKET 0 APIC 4 microcode d6
kernel: #4
kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/ details.
kernel: TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/ ml for more details.
kernel: #5 #6 #7
kernel: smp: Brought up 1 node, 8 CPUs
kernel: smpboot: Max logical packages: 1

I also ran the extended SMART self-test on the array drives with no errors. One thing which did come up was a warning about the CPU being throttled. The system uses a 240/280mm AIO liquid cooler that's about 4 years old, but it's possible it could be going.
kernel: traps: Plex Transcoder[31499] general protection ip:14a0be75db3f sp:14a0b292aad0 error:0 in libh264_decoder.so[14a0be644000+1d3000]
kernel: CPU4: Core temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
kernel: CPU4: Core temperature/speed normal
kernel: CPU0: Core temperature/speed normal
kernel: CPU0: Package temperature/speed normal

I have been able to sporadically replicate this failure by playing back a particular file in Plex and forcing software transcoding - the output of `signals` shows the CPU core temps shooting up past 90 C. To compare, I ran the file through HandBrake on my desktop, which has a much beefier processor (Ryzen 9 3900X) and an approximately equivalent cooler; that hit 96 C and throttled, but didn't cause the system to shut down. My thought here is that, depending on my luck spamming the `signals` command to get the core temps, I could be missing the event that causes the system to shut down to protect the CPU from overheating.
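Rather than spamming the command by hand, one option would be to sample the temperatures on a timer and write each reading to disk right away, so the last value before a shutdown survives. Below is a rough sketch of that idea, not something I have run on this server. It assumes the kernel exposes the sensors through the standard hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, in millidegrees Celsius); the function names and the log format are made up for illustration.

```python
import glob
import os
import time


def read_temps_c(hwmon_root="/sys/class/hwmon"):
    """Return {sensor file path: degrees C} for every temp*_input file found."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # hwmon reports temperatures in millidegrees Celsius
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor disappeared or returned garbage; skip it
    return temps


def log_temps(log_path, interval_s=1.0, samples=None,
              hwmon_root="/sys/class/hwmon"):
    """Append one timestamped line per sample; samples=None means run forever.

    Each line is flushed and fsynced so the tail of the log survives a crash.
    Returns the number of samples written.
    """
    count = 0
    with open(log_path, "a") as log:
        while samples is None or count < samples:
            temps = read_temps_c(hwmon_root)
            hottest = max(temps.values()) if temps else float("nan")
            log.write(f"{int(time.time())} max={hottest:.1f}C n={len(temps)}\n")
            log.flush()
            os.fsync(log.fileno())
            count += 1
            if samples is None or count < samples:
                time.sleep(interval_s)
    return count
```

Calling something like `log_temps("/boot/logs/cpu_temps.log")` (hypothetical path) while forcing the software transcode would leave a timestamped trail to line up against the syslog. Syncing every second to a flash drive adds write wear, so this is probably only worth running while reproducing the crash.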
I think the next step is probably to replace the cooler with something like an NH-D15 air cooler and see if that fixes the issue, but I would really appreciate it if someone who knows more about mcelog output could take a look at the diagnostics and confirm whether I'm likely on the right track, or whether the message about the CPU being throttled has led me down the wrong path.

leviathan-diagnostics-20210217-1947.zip
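For anyone trying to read the `be00000000800400` value in the MCE lines above: it is the raw status word for the bank, and it can be split into flag bits and error codes by hand. Here is a minimal sketch, assuming the IA32_MCi_STATUS bit layout from Intel's Software Developer's Manual (flags in bits 63 to 57, MCA error code in bits 15:0, model-specific code in bits 31:16). The helper names are made up, and this is not a substitute for what mcelog itself reports.

```python
from datetime import datetime, timezone

# Flag bits in the upper part of IA32_MCi_STATUS (positions assumed from
# Intel's SDM machine-check architecture chapter).
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds valid data
    "PCC": 57,    # processor context may be corrupted
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into flags and error codes."""
    fields = {name: bool((status >> bit) & 1)
              for name, bit in STATUS_FLAGS.items()}
    fields["mca_error_code"] = status & 0xFFFF
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    return fields


def mce_time_utc(epoch_seconds):
    """Convert the TIME field (seconds since the Unix epoch) to ISO 8601 UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


decoded = decode_mci_status(0xBE00000000800400)
print(decoded)
print(mce_time_utc(1613594950))
```

For this value the sketch reports VAL, UC, EN, MISCV, ADDRV and PCC set, OVER clear, and an MCA error code of 0x0400, which, as far as I can tell from the SDM's simple error code table, is the one listed as an internal timer error. The TIME field works out to 2021-02-17T20:49:10 UTC. That says the error was uncorrected, but by itself it does not say whether heat was the cause.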