Help To Diagnose, Is My CPU Partially Defective? (T.M.I. Inside, Read Bold)



So........... It's been a "slice" lately in the world of my server..  :-[

 

The short and skinny (read bold):

About a week ago one of my rear USB ports no longer worked at all, just all of a sudden dead.

Back story: This port was being used by my UPS, so I got a notification it went offline. Checked cables, reseated, no dice.

Plugged into another open USB port, all is well.

 

A week later my server is unresponsive, frozen, all VMs, etc. offline.

Server was on (fans spinning) when the wife originally let me know. Hours pass, I get home.

NIC LED's are flashing, but the computer is now powered off.

Turns out my primary GPU is dead, DEAD... No response, and the MB hates it when it's installed (acts odd, won't boot, no POST beeps).

Tested it in another computer: no display, and I get a POST beep likely indicating no video.

Buy a new primary GPU, all is well, back in business.

 

A couple of days later.

My motherboard died, DEAD, and needed to be RMA'd.

Back story: I came home to a server off, with no NIC lights, and a power supply making a soft tick, tick, tick noise.

Did all the testing one would do (swap power supplies (same tick), eliminate extra components), it was dead.

I purchase the exact make/model motherboard and replace it (Gigabyte X99-SLI) to ensure it is not something else, and it boots right up, on and working (or so I thought).

 

Hindsight (20/20 = a lot of broken shit! >:( ): thinking about it all now, the dead USB port, the dead video card, and the now-dead motherboard, I realize it's all related, and the MB, while slowly dying, was taking things out with it..  :-\

 

So anyway, the new MB had some quirks when changing BIOS settings: sometimes it wouldn't save them, or it would cycle on/off as if it couldn't boot up/POST properly (far more than normal).

At times this would lead to the settings being set to default.

At first I hadn't loaded the newest BIOS (the one on it was super old), so I updated to the newest version (the same one my previous MB ran fine on prior to its demise).

This really didn't fix anything, similar situation.

I finally found the reason for what seemed to be a lack of proper power-up/resetting of settings: "Profile 1" being set for the XMP memory settings. If I leave it off/disabled, the settings retain as they should, and the power on/off at boot is better (but still not what I recall as normal for this board).

My original motherboard (identical in every way) always had XMP Profile 1 set, and I had previously run Memtest for 24+ hours verifying it was working properly. I also used it that way for the last ~3 months without any of the issues I'm describing to you.

 

Anyhow, whatever; boot into UnRAID and sometimes it gets stuck loading at the screen in the attached screenshot. I let it sit there for a good 5 minutes and it stays stuck. Reboot the server (button press) and it will likely load up the next time.

 

Continuing on, I have an error in the syslog about the clock or HPET being off (all BIOS settings are correct, the time is correct; I'm not sure of the exact message). I then start to get a couple of these over a short period of time:

Mar 29 05:20:52 Server kernel: pcieport 0000:00:03.0: AER: Corrected error received: id=0018
Mar 29 05:20:52 Server kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018(Transmitter ID)
Mar 29 05:20:52 Server kernel: pcieport 0000:00:03.0:   device [8086:2f08] error status/mask=00001000/00002000
Mar 29 05:20:52 Server kernel: pcieport 0000:00:03.0:    [12] Replay Timer Timeout  
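If anyone wants to gauge how often these corrected errors are actually landing, here's a minimal sketch that tallies AER "Corrected error received" lines per PCIe port. It assumes the stock syslog format shown above; the sample lines are copied from my log.

```python
import re
from collections import Counter

# Matches lines like:
# "Mar 29 05:20:52 Server kernel: pcieport 0000:00:03.0: AER: Corrected error received: id=0018"
AER_RE = re.compile(r"pcieport (\S+): AER: Corrected error received")

def count_aer_corrected(syslog_lines):
    """Tally corrected AER errors per PCIe port from syslog lines."""
    counts = Counter()
    for line in syslog_lines:
        m = AER_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

lines = [
    "Mar 29 05:20:52 Server kernel: pcieport 0000:00:03.0: AER: Corrected error received: id=0018",
    "Mar 29 05:21:10 Server kernel: pcieport 0000:00:03.0: AER: Corrected error received: id=0018",
]
print(count_aer_corrected(lines))  # Counter({'0000:00:03.0': 2})
```

Run it against the full syslog to see whether the errors cluster on one port (pointing at one slot/device) or are spread across several (pointing at something shared, like the board or PSU).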

Never seen those before... I researched the issue; it's mentioned in some bug reports as far back as 2011, nothing current, and not something you'd expect.

So I figured out which GPU it is and swapped it out (same model), and the error is much less frequent, but I still see it once in a while (I thought that GPU might have been bad since it was in the slot next to the one I had to RMA, but I don't think that's the case).

 

So... As of now I am typing this in a VM (my primary computer), with 3 others going (2 of those with GPUs).

My server is "working" but it seems flaky at best. No hard stability issues to report; I've even hammered two of the VMs with multiple performance tests to stress them.

 

I have a Corsair TX650W that has served me well, and I really don't think it is bad in any way.

However it's one of my next things to look into further.

 

Now my question:

Do you think it is possible my CPU was damaged "lightly" when the MB or video card died?

I have never seen this model of MB lose settings, fail to take XMP Profile 1, or power on/off at boot, until it finally goes.

I think (even though it is unlikely) that the MB died from the concrete dust that built up in my server while I was removing part of a concrete floor in my basement (yes, I'm an idiot; should have been more careful).

 

A link to diagnostics from the other day (with the PCIe error in the syslog), and one from just now.

https://app.box.com/s/r1kg2zr4jud3vkiclq0f7go36q5g4erc

 

----

Also: Yes, I know about all the USB-related "rounding interval to" and "reset low-speed USB" messages in the syslog. They've been there forever, and only changed when I attempted to disable the XHCI controller, which led to the "reset low-speed" message now appearing instead of the rounding one. May need some powered hubs/better USB extensions.


Link to comment

Thanks!

I appreciate the input and thoughts with this.

 

Being that I had some pretty nasty failures, to me everything is suspect.

I cannot think of anything more likely than the CPU for the TSC clock sync error and the occasional freeze at that one point (maybe this new MB I'm using temporarily is partially defective; unlikely, but at this point I don't know).

I have completed parity syncs after both hardware failures/lockups without any issues, while all other VMs, etc. kept operating. This is why I don't feel the PSU is the culprit.

I ran Memtest overnight (~8 hours); it completed 3 passes without any issues (this is with XMP off).

 

Here's my current plan as I see it.

 

Wait for my MB to come back from RMA, set it all back up, and see if the problem persists.

Run Memtest for a couple of passes with XMP "Profile 1" on and verify no issues (this should work as it always did previously).

At that point I'll decide whether I want to replace the PSU (or test with a smaller unit I have), or RMA the CPU.

 

I have since scrubbed my Cache drive and found no errors present.

I'd like to run a verification on my array's XFS disks for any corruption. What would be the best way to accomplish this?

Parity sync finished both times without errors.
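For anyone finding this later: the usual approach is a no-modify check with `xfs_repair -n` against each array device, with the array in maintenance mode so the filesystems are unmounted. The device names below are illustrative (check your own array layout); this sketch just builds the command lines rather than executing anything.

```python
# Build no-modify xfs_repair commands for each array device.
# /dev/md1, /dev/md2 are hypothetical examples; verify your own devices,
# and only run xfs_repair with the array in maintenance mode
# (filesystems unmounted).
def xfs_check_commands(devices):
    # -n = no-modify: report problems without writing to the disk
    return [["xfs_repair", "-n", dev] for dev in devices]

for cmd in xfs_check_commands(["/dev/md1", "/dev/md2"]):
    print(" ".join(cmd))
# xfs_repair -n /dev/md1
# xfs_repair -n /dev/md2
```

Only after reviewing the `-n` output would you consider running `xfs_repair` without `-n` to actually fix anything.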

Link to comment

xfs_check is supported on unRAID, but I don't think it is as comprehensive as the Reiser utilities.

The reiserfs recovery utilities are the best I have come across for any file system type. It is a shame that reiserfs is no longer being developed or even fully supported anymore.

Morbidly appropriate that the file system is starting to feel like a cult: it's talked about in reverent tones of miraculous deeds. Kind of like C64 or Amiga junkies.

 

I personally have experienced the miraculous recovery capabilities of ReiserFS, so I'm not knocking the FS, it's just interesting how the story of ReiserFS is playing out.

Link to comment

+1 for looking at the PSU.

 

Especially since you say there was concrete dust in the server. The PSU may have been damaged as well... In my experience, random booting issues and hardware failures are almost always caused by power problems. Since you have a UPS, chances are good that it is not power spikes from your line, so I would definitely look at the PSU. Especially if the replacement mobo is also acting flaky...

 

As far as your question about a slightly damaged CPU... I guess it's possible, but I've never experienced it or heard of it, so I can't really comment on that part...

Link to comment

When running these checks (never had to before), can I do it with the array active, or does it need to be in maintenance mode?

 

(No comment on the unfortunate circumstances for his wife... It's also hard to make it laughable in any way (I did try; it seemed wrong).)

 

Does anyone know of a CPU utility to properly test for defective conditions?

Kind of like a hard drive long test or S.M.A.R.T. check?

 

Considering all the I/O it does, VT-d, various extensions, memory controllers, it's hard to know for sure if something is just a little off.

A dying MB that was left on with a clicking supply (likely its own internal protection) can't be good for a couple billion transistors..
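There's no real S.M.A.R.T. equivalent for CPUs; the closest thing is a torture test (Prime95/mprime style) that checks whether the same heavy computation keeps producing identical results. A crude determinism probe along those lines, as a minimal sketch and not a substitute for a proper stress tool:

```python
import hashlib

def cpu_consistency_probe(rounds=5, size=1_000_000):
    """Hash the same buffer repeatedly; any mismatch hints at
    unstable compute (CPU, cache, or memory), since SHA-256 of
    fixed input must be deterministic."""
    data = bytes(range(256)) * (size // 256)
    baseline = hashlib.sha256(data).hexdigest()
    for _ in range(rounds):
        if hashlib.sha256(data).hexdigest() != baseline:
            return False  # non-deterministic result: hardware suspect
    return True

print("consistent" if cpu_consistency_probe() else "MISMATCH - suspect hardware")
```

A healthy system will pass this forever; the point is that a marginal CPU under load/heat can occasionally flip a bit and fail. Real torture tests do the same thing with much heavier, cache-and-FPU-punishing workloads over hours.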

Link to comment

A little more investigation arguing against the CPU being bad.

 

This guy had the same issue with XMP and his X99-SLI: one board worked great, the other did not (odd).

https://hardforum.com/threads/ga-x99-sli-xmp-fail.1892493/

 

Also, from my research, the TSC timer seems to be broken on some Gigabyte and also some Asus boards.

This may have always been the case (minus the lockup), however I never gave it much thought.

Since I had all of these issues I'm much more critical of things looking out of place in the syslog.

 

I think the PCIe errors I received are also related to the XMP setting that I had always used on the other (exact same) MB, not knowing it didn't work quite right with this one.

Since disabling it, running Memtest, and then booting UnRAID (going on 12 hours here), that message has not returned!

 

So I may be out of the woods soon, and I may be able to relax my trigger finger on the HX850i I have my eye on (currently on a pretty good sale) here: http://www.newegg.com/Product/Product.aspx?Item=N82E16817139083

 

Edit: Found an old syslog from my previously good working board.

Exact same TSC message; I just never noticed. So with that, I think the CPU is fine.

Thanks for listening!

Feb 15 09:42:55 Server kernel: Switched APIC routing to physical flat.
Feb 15 09:42:55 Server kernel: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Feb 15 09:42:55 Server kernel: TSC deadline timer enabled
Feb 15 09:42:55 Server kernel: smpboot: CPU0: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz (fam: 06, model: 3f, stepping: 02)
Feb 15 09:42:55 Server kernel: Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Intel PMU driver.
Feb 15 09:42:55 Server kernel: ... version:                3
Feb 15 09:42:55 Server kernel: ... bit width:              48
Feb 15 09:42:55 Server kernel: ... generic registers:      4
Feb 15 09:42:55 Server kernel: ... value mask:             0000ffffffffffff
Feb 15 09:42:55 Server kernel: ... max period:             0000ffffffffffff
Feb 15 09:42:55 Server kernel: ... fixed-purpose events:   3
Feb 15 09:42:55 Server kernel: ... event mask:             000000070000000f
Feb 15 09:42:55 Server kernel: x86: Booting SMP configuration:
Feb 15 09:42:55 Server kernel: .... node  #0, CPUs:        #1
Feb 15 09:42:55 Server kernel: TSC synchronization [CPU#0 -> CPU#1]:
Feb 15 09:42:55 Server kernel: Measured 228446458923 cycles TSC warp between CPUs, turning off TSC clock.
Feb 15 09:42:55 Server kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed
Feb 15 09:42:55 Server kernel:  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11
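For scale, the warp figure in that log is worth converting: 228446458923 cycles at this CPU's 3.5 GHz base clock (the i7-5930K line above) is over a minute of apparent drift between cores, which is why the kernel gives up and marks the TSC unstable rather than trying to compensate.

```python
# Convert the TSC warp from the log into seconds.
# 3.5 GHz base clock taken from the i7-5930K line in the syslog above.
warp_cycles = 228_446_458_923
base_hz = 3.5e9
print(round(warp_cycles / base_hz, 1))  # 65.3 (seconds of warp)
```

A genuinely broken CPU would be expected to misbehave far beyond one counter; a firmware bug that fails to reset or sync the TSC across sockets/cores at boot produces exactly this kind of huge, consistent offset.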

Link to comment
