
[Solved] Frequent spontaneous reboots since 6.4 upgrade


FreeMan

Recommended Posts

Posted

TL;DR: AMD held their hand up and said it's a CPU failure. I got a link to their warranty return page and a recommendation to send the CPU back. Read on for all the gory details...

 

---

 

I've been having regular crashes since installing 6.4. I installed the update to 6.4.1 and the issue has continued. It had been happening at slightly less than 24-hour intervals.

 

It was suggested that the problem was probably somewhere in all the plugins I have installed, so I uninstalled most of them. After the system had been stable for about 24 hours, I added another plugin. It ran fine for nearly 2 weeks while I added plugins back at roughly 24-hour intervals; then the crashes started again.

 

I am currently running:

* CA Auto Turbo Write Mode

* CA Auto Update Applications

* CA Backup / Restore Appdata

* CA Cleanup Appdata

* CA Docker Autostart Manager

* Community Applications

* ControlR

* Dynamix Active Streams

* Dynamix Cache Directories

* Dynamix File Integrity

* Dynamix Local Master

* Dynamix SSD TRIM

* Fix Common Problems

* Nerd Tools 

        (Installed: mcelog-153-x86_64-1.txz, perl-5.26.1-x86_64-4.txz (needing an update), python-2.7.14a-x86_64-1.txz, screen-4.6.1s-x86_64-1.txz, utempter-1.1.6-x86_64-2.txz)

* Tips and Tweaks

 

I had installed each of those, one at a time, until I got to Dynamix System Buttons - that's when the first crash (still running 6.4 at the time) happened, about 10-12 days in. This was a day or so after the 6.4.1 release had come out, but I hadn't yet turned on notifications, so I was unaware it was available. I uninstalled it and the system was stable for a couple of days, so I upgraded to 6.4.1. Dynamix SSD Trim was the next most recent plugin installed, and the system has crashed at least 2 days in a row with this set of plugins.

 

This morning the server crashed while I was working in lidarr from my Win10 machine - literally right before my eyes. Usually, the crashes seem to happen around 0800-0930 and I've been at work and have been unable to grab logs, etc.

 

I have the following dockers installed:

binhex-emby
binhex-libresonic
binhex-lidarr
cadvisor
CrashPlanPRO

Dolphin
FileBot
Headphones
letsencrypt

mariadb
nzbget
openvpn-as

Piwigo
radarr
sonarr

but only some of those are actually running; the rest don't auto start, thanks to Squid's auto start manager. (Yes, I'm aware Headphones & lidarr are duplicates - I just realized I hadn't yet updated auto start manager.) Some will be deleted as duplicate/unnecessary; the others (like Dolphin & FileBot) sit quietly until needed, then I start/stop them.

 

I had FCP in troubleshooting mode, and am attaching the latest diagnostics it captured, plus the FCPsyslog tail file, and the controlr.log from @jbrodriguez's great ControlR app & plugin, since it was fresh at the time. I am re-enabling troubleshooting mode in case it happens again.

controlr.log

FCPsyslog_tail.txt

nas-diagnostics-20180210-0847.zip

Posted

In the syslog there's an MCE logged very early in the boot sequence:

 

Quote

Feb  9 14:59:53 NAS kernel: mce: [Hardware Error]: Machine check events logged

 

It might be worth updating your BIOS. You're on F1 and I'm pretty sure there's a newer version.

 

I notice you have Marvell disk controllers installed and you also have IOMMU enabled, which is not a good combination. Has the Fix Common Problems plugin warned you about this? As you are not using it I would disable IOMMU in the BIOS.

Posted
16 minutes ago, John_M said:

In the syslog there's an MCE logged very early in the boot sequence:

It might be worth updating your BIOS. You're on F1 and I'm pretty sure there's a newer version.

I notice you have Marvell disk controllers installed and you also have IOMMU enabled, which is not a good combination. Has the Fix Common Problems plugin warned you about this? As you are not using it I would disable IOMMU in the BIOS.

 

Thanks for the feedback. The MCEs did start with the 6.4 upgrade - I'd never had one prior. I'll look into a BIOS update.

 

Not sure what IOMMU is (off to Google), so I'm prolly not using it. :) I'll try disabling it.

Posted
2 minutes ago, FreeMan said:

Not sure what IOMMU is (off to Google), so I'm prolly not using it. :) I'll try disabling it.

 

It's used to allow hardware to be passed through to a virtual machine. If you don't use VMs or you use VMs but don't pass through hardware then you don't need it. If you're unsure about it you're almost certainly not using it. It isn't used by dockers or plugins.
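A quick way to check from the console whether an IOMMU is actually active - a sketch only; the `/sys/class/iommu` path is standard sysfs on stock Linux kernels, and an empty or missing directory means no IOMMU is in use:

```shell
# Hedged sketch: judge whether an IOMMU is active by checking whether the
# sysfs class directory has any entries. The default path is an assumption
# that holds on stock Linux kernels.
iommu_active() {
    dir="${1:-/sys/class/iommu}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if iommu_active; then
    echo "IOMMU active - safe to disable in the BIOS if you don't pass hardware to VMs"
else
    echo "IOMMU not active"
fi
```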

Posted

Thanks.

 

Just trying to decide if I should let the parity check finish before shutting down for the BIOS change. I've not had any errors from any of the previous crashes, so things have been going very well in that department... ;)

Posted

If you've only just started your parity check I'd stop it. If it's been running for a while I'd let it run. If the BIOS update doesn't fix the MCEs a different kernel might. I'm sure there will be a new one coming out before very long as more Spectre mitigations are included.

Posted

IOMMU disabled, we'll see what happens.

 

Really need to get a nice new HBA and replace the hodgepodge of 2 port controllers I stuck in there on the cheap...

 

I've also got the BIOS downloaded for an update. I grabbed F9 because F10 is listed as "beta" but that was about 3 years ago. I'm thinking that if there were serious issues, they'd have pulled it and issued a replacement by now. ;) If I continue to have issues, I'll do the BIOS update. I'll update after things have been stable for a week or so.

 

Thanks for your help!

  • 2 weeks later...
Posted

Sadly, the BIOS upgrade doesn't seem to have helped - the server just rebooted.

I'll post a fresh set of FCP syslogs when I get home, but I'm going to have to downgrade to 6.3.5 to get the machine to run stably.
@limetech - any ideas what might be going on? This machine has been running without issue from late v5.x through v6.3.5. It's only been since the 6.4 update that this has started happening.

Sent from Tapatalk

Posted

What do you mean by "crash"?  Some choices:

a) kernel exception and stack dump spits out on console but console still alive

b) same as a) but can't use console

c) everything just hangs (freezes)

d) machine spontaneously reboots

e) machine catches on fire

f) other...

Posted
Option D - spontaneous reboot

Sent from Tapatalk

Posted
5 minutes ago, FreeMan said:

Option D - spontaneous reboot

 

99% chance something broken with your hardware.  Have you run memtest lately?

Why would this only show up with 6.4?  Probably because the placement of code in memory is different.  Pretty hard to track down these sorts of problems unfortunately.  Maybe leave the Log window open or open terminal/telnet/ssh session with this command:

tail -f /var/log/syslog

This might catch something happening before the reboot.
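If /var/log lives in RAM (as it does on a stock Unraid boot), whatever that tail shows is lost when the machine reboots. A sketch of the same idea that also keeps a copy on persistent storage - the /boot flash path and the cron line are assumptions to adjust for your setup:

```shell
# Hedged sketch: append the newest syslog lines to a file on persistent
# storage so the final messages survive a spontaneous reboot. On Unraid
# /var/log is RAM-backed and /boot is the flash drive (both assumptions).
capture_tail() {
    src="$1"
    dst="$2"
    tail -n 200 "$src" >> "$dst"    # keep appending the last 200 lines each run
}

# Example: run once a minute from cron so the flash copy stays current
# (hypothetical script path):
# * * * * * /boot/capture_tail.sh /var/log/syslog /boot/syslog-tail.txt
```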

Posted
 

I was thinking about running memtest. I'll kick that off this evening when I get home.

I've attached 2 logs from Fix Common Problems to earlier posts. I didn't see anything obvious in them, but I'm far from an expert.

I'll let memtest run overnight and/or a couple of test cycles, whichever comes first and report back.

Sent from Tapatalk

Posted
35 minutes ago, FreeMan said:

Memtest is running. Would it run faster and/or have another advantage in SMP mode?

 

Might not work on your h/w, but you can try it.  Even if memtest passes multiple times without error it doesn't necessarily mean your RAM is good.  This is because, while it does test the cells themselves, it only tests one data path: CPU to/from RAM.  Does not test other data paths that may exist, eg, PCI to/from RAM using DMA.

Posted

Is there any way to test any of these other data paths?

 

Is it a matter of throwing new components at the box until it stops rebooting?

Posted
3 minutes ago, FreeMan said:

Is there any way to test any of these other data paths?

 

Is it a matter of throwing new components at the box until it stops rebooting?

 

If you suspect memory, do something like this.  If you have 2 sticks, remove one of them.  If it fails, replace it with the other one.  If it now passes: you have a bad stick.  If both fail, or both pass, you probably have something else.  A marginal PSU can also cause very strange symptoms.  But don't forget also to have an open window with that 'tail' command running.

Posted
9 hours ago, FreeMan said:

Option D - spontaneous reboot

Sent from Tapatalk
 

 

I had this happen and it was a bad PSU.  It's an easy (and usually cheap) thing to test, as many times there is a spare PSU that you can use for testing.  (If you are on decent terms with some of the larger suppliers, they will 'loan' a unit for 'free' for thirty days.  9_9 )

Posted

It doesn't hurt to run mprime or similar to stress-test the CPU + memory while consuming a huge amount of power.

 

Combine with some disk test program - or possibly just concurrently copy from every disk to /dev/null - to give the PSU a real work-out and verify that it can handle a high load for an extended time without overheating or starting to produce unclean power.
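The concurrent copy-to-/dev/null idea can be sketched as a small helper; the device names in the example are placeholders, so substitute your own array disks:

```shell
# Hedged sketch: stream every disk to /dev/null at once so the CPU, the
# disk controllers and the PSU are all loaded together.
stress_disks() {
    for dev in "$@"; do
        dd if="$dev" of=/dev/null bs=1M &    # one background reader per disk
    done
    wait    # block until every reader finishes (or interrupt with Ctrl-C)
}

# Example, run alongside mprime (device names are placeholders):
# stress_disks /dev/sdb /dev/sdc /dev/sdd
```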

Posted

24 hours of memtest, 6+ full passes, no errors found. No guarantee that the memory is good, I understand that, but it's a good sign, right?

 

Interesting. I kept some fairly detailed notes on the hardware I bought when I upgraded the server, but not a single thing about when I did it. :/ I know the current memory is just over a year old. I think I upgraded everything about 3-4 years ago. I put a Corsair TX series PSU in it - not the highest quality one out there, but reasonable. I guess even the highest quality is no guarantee of life expectancy.

 

@Frank1940 - it took me a couple of minutes to figure your post out, but I got it! ;)

 

@pwm - I'm sure a parity check doesn't max out the CPU like mprime would, but I've made it through quite a number of parity checks over the last several weeks involving spinning all the drives. If memory serves, I've had at most one reboot during a parity check, possibly none. Would a parity check be reasonably comparable?

Posted

Also, I was going to post the FCPsyslog_tail.txt from the most recent reboot, but I foolishly used TeamViewer to access the server and put FCP back into troubleshooting mode, which overwrote the file from the previous reboot. There are two up above from earlier reboots, though, if they might provide some clues.

 

I did notice this in the syslog:

Feb 28 10:04:01 NAS root: Fix Common Problems Version 2018.02.18
Feb 28 10:04:11 NAS root: Fix Common Problems: Error: unclean shutdown detected of your server
Feb 28 10:04:17 NAS root: Fix Common Problems: Error: Machine Check Events detected on your server
Feb 28 10:04:17 NAS root: mcelog: ERROR: AMD Processor family 21: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
Feb 28 10:04:17 NAS root: CPU is unsupported

I've asked on the FCP thread, but I'll ask here (because everyone loves a cross post :)) - is there something I can do to enable edac_mce_amd, or does Squid have to bake that into FCP at his end?
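For what it's worth, a sketch of loading the decoder by hand - the module name comes straight from the mcelog error above, but whether it is available as a loadable module on a given kernel build is an assumption to verify (on some kernels it is built in):

```shell
# Hedged sketch: check whether the edac_mce_amd decoder is loaded, and
# try to load it if not. Module name taken verbatim from the mcelog error.
mce_decoder_loaded() {
    modules_file="${1:-/proc/modules}"
    grep -q '^edac_mce_amd' "$modules_file" 2>/dev/null
}

if mce_decoder_loaded; then
    echo "edac_mce_amd already loaded"
elif modprobe edac_mce_amd 2>/dev/null; then
    echo "edac_mce_amd loaded - decoded MCEs should show up in dmesg"
else
    echo "could not load edac_mce_amd (built in, or needs root)"
fi
```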

Posted

If nothing else, at least you now know it's a hardware issue

Sent from my SM-T560NU using Tapatalk

Posted
21 hours ago, Frank1940 said:

I had this happen and it was a bad PSU.  It's an easy (and usually cheap) thing to test, as many times there is a spare PSU that you can use for testing.  (If you are on decent terms with some of the larger suppliers, they will 'loan' a unit for 'free' for thirty days.  9_9 )

Before I go getting a 'loaner', are PSU testers actually good for detecting these flaky types of issues?

Posted
2 hours ago, FreeMan said:

Before I go getting a 'loaner', are PSU testers actually good for detecting these flaky types of issues?

You can read the reviews.  But I suspect that the best use for these 'testers' is to make sure that the various voltages are close to the spec limits and that the unit powers on and off.  Of course, the more expensive a unit you buy, the better the test results will be and the more things that can be tested.  I would imagine that the best ones will cost more than a replacement PSU.

Archived

This topic is now archived and is closed to further replies.
