
[6.9.2] Spontaneous reboots help needed


Jahf


A couple of weeks ago I started migrating from my old system to a new case with more hardware so I can start using VFIO VMs. 

 

The new system is seemingly stable for hours but will often reboot spontaneously in the middle of the night, causing an unclean shutdown. 

 

I've done some searching and found this can be an issue with PCIe ASPM, so I disabled that in the BIOS. 

 

It's also possible that the issue is due to my PCIe 2.0 LSI HBA moving from the x8 middle PCIe slot to the bottom x4 slot (which may be controlled by the PCH). I put a GTX 1060 in the middle slot for use as graphics for non-gaming VMs (top slot is a 3080 Ti for games). Unfortunately the 1060 is a 2-slot card and I can't put it in the bottom x4 slot because it crushes the USB headers and front panel connectors. 

 

Questions

 

* Is there any chance ASPM is still an issue (I haven't disabled it in any config files) if it is off in the BIOS?

* How likely is it that this would be caused by being on the probably-PCH x4 slot?

* Is there anything I can proactively do to force the various possible causes while actively using the system? (see next)
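
On the first question, one way to see what the running kernel thinks about ASPM is to read its policy file in sysfs. The sketch below is only an illustration: it assumes Python is available on the server (not a given on a stock Unraid install) and that the kernel exposes `/sys/module/pcie_aspm/parameters/policy`, where the active policy is the entry in square brackets. The function names are made up for this example. A policy of `default` means the kernel follows whatever the firmware configured, so it does not by itself prove ASPM is off on a given link.

```python
from pathlib import Path
from typing import Optional

# Standard sysfs location of the kernel ASPM policy on Linux.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text: str) -> Optional[str]:
    """Return the bracketed (active) entry of the policy line.

    The file content looks like:
        [default] performance powersave powersupersave
    """
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_active_policy(path: Path = ASPM_POLICY_PATH) -> Optional[str]:
    """Read the policy file if it exists; return None when it is absent."""
    if not path.exists():
        return None
    return active_aspm_policy(path.read_text())
```

Calling `read_active_policy()` on the server returns the active policy name, or `None` if the file is missing. For a per-device view, `lspci -vv` prints an `ASPM` field on each `LnkCtl:` line, and the kernel parameter `pcie_aspm=off` turns off the kernel's ASPM handling regardless of the BIOS setting.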

 

Note: I thought I had enabled syslog to capture previous errors, but simply enabling syslog on the server didn't save any files to the syslog share I created. So all I have are the (anonymized) Diagnostics attached. I also saved non-anonymized diags that match the anonymized ones here, in case Limetech requests them. 

 

...

 

The biggest problem I'm having tracking this down is these reboots never happen while I'm active on the system. In fact most don't even occur when I'm awake. I just wake up hoping the system survived after changing something and usually find it has restarted overnight. 
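
Since the reboots happen unattended and no syslog survived, a simple heartbeat file can at least narrow down when each one occurs. This is a rough sketch under assumptions, not an Unraid feature: the function names are invented, and the path you pass in must be on storage that survives a reboot (the Unraid root filesystem lives in RAM). `write_heartbeat` appends a timestamp and forces it to disk; `find_gaps` reports where the heartbeat stopped for longer than expected.

```python
import os
from datetime import datetime


def write_heartbeat(path: str) -> None:
    """Append the current time and fsync so the line survives a hard reset."""
    with open(path, "a") as f:
        f.write(datetime.now().isoformat(timespec="seconds") + "\n")
        f.flush()
        os.fsync(f.fileno())


def find_gaps(lines, max_gap_seconds=120):
    """Return (last_seen, next_seen) pairs where the heartbeat paused too long.

    `lines` is any iterable of ISO timestamps, e.g. an open heartbeat file.
    """
    stamps = [datetime.fromisoformat(s.strip()) for s in lines if s.strip()]
    return [
        (a, b)
        for a, b in zip(stamps, stamps[1:])
        if (b - a).total_seconds() > max_gap_seconds
    ]
```

Run `write_heartbeat` once a minute (from cron or a small loop), then pass the open file to `find_gaps` the next morning. The first timestamp of each gap is the last moment the system was known to be alive, which can be lined up against scheduled jobs such as a mover run or SMART polling.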

 

Sometimes I'm doing literally nothing on the system when it happens; other times I have active torrents to keep it busy through the night (trying to force an active condition).

 

I thought I had it stable, so last night I let a couple of long disk activities run; they had not completed before the system restarted. The things running last night were:

 

* Copying of 1.75TB from a 4TB unassigned drive ... it was going great guns when I went to sleep with about 1TB completed. It made it to about 1.45TB when the system rebooted about 5 hours later. 

* Preclear of a different 4TB unassigned drive via binhex's Docker. I was doing this both to verify the disk won't throw errors on me (it's old but has seen light use) and to stress the system. I'm unsure if that process succeeded. Today I decided to just nuke the old empty partitions and have Unraid format it (in process atm).

* Fired off an extended SMART test on the 4TB drive that was being precleared ... unsure if this was an issue but ... I do note that the UI reports the test was aborted by host. I'd be surprised if that SMART test was still running 4-5 hours later but ... reporting for completeness. I checked -after- downloading diagnostics, which IIRC fires off SMART on all drives, so this may be a non-issue, as later on the page I see "Self-test execution status: The previous self-test routine completed without error or no self-test has ever been run."

 

I'm actually a bit surprised the system rebooted under those conditions, as most times it seemed to happen at idle, or right after starting the array when the server had been online for hours without the array started. (Squid or whoever: I "hid" the issue I opened when this started, which shows that happening, but I'm assuming LT can still see the hidden issue if it helps with diagnosing anything. That issue may or may not be related, as I opened it due to an MCE error I was told was safe to ignore, but it did show a reboot as described below.) Prior to this I felt disabling ASPM had been successful for a couple of days. 

 

I'm now sorta thinking I might have had both an issue with ASPM as -well- as an issue running the card in the bottom slot, but I'm asking for a second opinion along with ideas on what I can do to test. 

 

For now I'm planning to temporarily remove the 1060, put the HBA in the second slot (once my 4TB format completes, as I already fired it up) and run for a few days before getting hopeful I've fixed the issue. But that's a lot of blind-faith trial and error, so help figuring out testing procedures would be very appreciated. 

 

If that works I'll get to go on a hunt for a decent 1-slot GPU for the bottom slot ... something I'd hoped to avoid :)

 

(Updated: deleted the anonymized diags from the post out of internet paranoia as I'm hopeful the issue is handled ... if LT wants the non-anonymized ones to look into it for me just buzz me)

 

 

 

Edited by Jahf

@Constuctor ... thanks, going through it now, will reply back

 

EDIT: will have to wait for my disk activity to be done in an hour or so to reboot and verify the BIOS stuff. 

 

C-states were enabled. Going to have to run through the night with them disabled and see. At this point long-term I'd actually rather it turns out to be the PCH slot issue given I was hoping to be able to have the power savings. 
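
To double-check a BIOS C-state change from the OS side, the idle states the kernel exposes can be listed from sysfs. This is a hedged sketch: it assumes Python is available and that the standard Linux `cpuidle` directory exists for `cpu0`, and `list_cpuidle_states` is a name made up for this example. Whether a BIOS toggle actually removes the deeper states from this list depends on the platform and the idle driver.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(state_name, disabled)] for each idle state the kernel exposes.

    `disabled` reflects the per-state `disable` file (non-zero means the
    kernel has been told not to use that state).
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        return []
    states = []
    # Directories are named state0, state1, ...; sort numerically.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() != "0"
        states.append((name, disabled))
    return states
```

Comparing the output before and after the BIOS change shows whether the deeper states (anything beyond `POLL` and `C1`) are still offered to the kernel.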

Edited by Jahf

Disabling C-States may have addressed the problem (possibly in addition to disabling ASPM before this). 

At some point I'll be getting a single slot GPU and moving the HBA up to the x8 CPU slot and seeing if I can safely reenable C-States. But ... if this continues to be stable ... that will be when my wallet won't scream at me for buying even more parts :)

 

...

At this point, after 3 days of fairly heavy stress on the system (a mix of parity check while running benchmarks, docker torrents, and gamestreaming from the Windows VM), it has been more solid than at any point since building the new system, with no more lockups/reboots. 

 

I've of course just jinxed it *knock on wood*. 

 

I'll be leaving the system in a low use state for a couple of days to see if I run across the problem again. 

 

...


As for RAM, I know I'm out of spec. Just going over 3200MHz on these sticks is out of spec. I spent over a week trying to overload it with various things like OCCT, memtest, etc. and haven't had any issues during actual use (no WHEA errors in Windows with ECC on, no errors with ECC off in extended memtests, never a crash/freeze/lockup while the system is in use). So since this isn't a mission-critical machine, I'm willing to accept being out of spec. I did find a couple of tables confirming the 2667MHz specification on 3950x CPUs and I know the 5950x shares the same IMC, so yeah ... agreed. It's out of spec. But specs tend to be conservative and I'm not currently seeing an issue that makes me think it's the RAM. 
 

