FreeMan Posted February 10, 2018

TL;DR: AMD held their hand up and said it's a CPU failure. I got a link to their warranty return page and a recommendation to send the CPU back. Read on for all the gory details...

---

I've been having regular crashes since installing 6.4. I installed the update to 6.4.1 and the issue has continued. It had been happening at slightly less than 24-hour intervals. It was suggested that the problem was probably somewhere in all the plugins I have installed, so I uninstalled most of them. After the system had been stable for about 24 hours, I added another plugin. It ran fine for nearly 2 weeks as I added plugins back at roughly 24-hour intervals, then the crashes started again. I am currently running:

* CA Auto Turbo Write Mode
* CA Auto Update Applications
* CA Backup / Restore Appdata
* CA Cleanup Appdata
* CA Docker Autostart Manager
* Community Applications
* ControlR
* Dynamix Active Streams
* Dynamix Cache Directories
* Dynamix File Integrity
* Dynamix Local Master
* Dynamix SSD TRIM
* Fix Common Problems
* Nerd Tools (Installed: mcelog-153-x86_64-1.txz, perl-5.26.1-x86_64-4.txz (needing an update), python-2.7.14a-x86_64-1.txz, screen-4.6.1s-x86_64-1.txz, utempter-1.1.6-x86_64-2.txz)
* Tips and Tweaks

I had installed each of those, one at a time, until I got to Dynamix System Buttons - that's where the first crash happened (still running 6.4 at the time), about 10-12 days in. This was a day or so after the 6.4.1 release had come out, but I hadn't yet turned on notifications, so I was unaware it was available. I uninstalled it and the system was stable for a couple of days, so I upgraded to 6.4.1. Dynamix SSD TRIM was the next most recently installed plugin, and the system has crashed at least 2 days in a row with this set of plugins. This morning the server crashed while I was working in lidarr from my Win10 machine - literally right before my eyes. Usually the crashes seem to happen around 0800-0930, while I'm at work, so I've been unable to grab logs, etc.

I have the following dockers installed: binhex-emby, binhex-libresonic, binhex-lidarr, cadvisor, CrashPlanPRO, Dolphin, FileBot, Headphones, letsencrypt, mariadb, nzbget, openvpn-as, Piwigo, radarr, sonarr - but only some of them are actually running; the rest don't auto-start, thanks to Squid's auto start manager. (Yes, I'm aware Headphones & lidarr are duplicates - I just realized I hadn't yet updated auto start manager.) Some will be deleted as duplicate/unnecessary; the others (like Dolphin & FileBot) sit quietly until needed, then I start/stop them.

I had FCP in troubleshooting mode, and am attaching the latest diagnostics it captured, plus the FCPsyslog tail file, and the controlr.log from @jbrodriguez's great ControlR app & plugin, since it was fresh at the time. I am re-enabling troubleshooting mode in case it happens again.

controlr.log FCPsyslog_tail.txt nas-diagnostics-20180210-0847.zip
John_M Posted February 10, 2018

In the syslog there's an MCE logged very early in the boot sequence:

Quote

Feb 9 14:59:53 NAS kernel: mce: [Hardware Error]: Machine check events logged

It might be worth updating your BIOS. You're on F1 and I'm pretty sure there's a newer version. I notice you have Marvell disk controllers installed and you also have IOMMU enabled, which is not a good combination. Has the Fix Common Problems plugin warned you about this? As you are not using IOMMU, I would disable it in the BIOS.
FreeMan (Author) Posted February 10, 2018

16 minutes ago, John_M said:

In the syslog there's an MCE logged very early in the boot sequence: ... As you are not using IOMMU, I would disable it in the BIOS.

Thanks for the feedback. The MCEs did start with the 6.4 upgrade - I'd never had one prior. I'll look into a BIOS update. Not sure what IOMMU is (off to Google), so I'm prolly not using it. I'll try disabling it.
John_M Posted February 10, 2018

2 minutes ago, FreeMan said:

Not sure what IOMMU is (off to Google), so I'm prolly not using it. I'll try disabling it.

It's used to allow hardware to be passed through to a virtual machine. If you don't use VMs, or you use VMs but don't pass through hardware, then you don't need it. If you're unsure about it you're almost certainly not using it. It isn't used by dockers or plugins.
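(For anyone following along: a quick way to confirm from the console whether an IOMMU is actually active is to check the kernel log. This is just a sketch - the exact strings vary by platform; AMD boards report AMD-Vi, Intel boards DMAR:)

# Look for IOMMU initialisation messages in the kernel log
dmesg | grep -i -e iommu -e amd-vi -e dmar

# If this directory is non-empty, an IOMMU has been registered with the kernel
ls /sys/class/iommu/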
FreeMan (Author) Posted February 10, 2018 Thanks. Just trying to decide if I should let the parity check finish before shutting down for the BIOS change. I've not had any errors from any of the previous crashes, so things have been going very well in that department...
John_M Posted February 10, 2018

If you've only just started your parity check I'd stop it. If it's been running for a while I'd let it run. If the BIOS update doesn't fix the MCEs, a different kernel might. I'm sure there will be a new one coming out before very long as more Spectre mitigations are included.
FreeMan (Author) Posted February 11, 2018

IOMMU disabled; we'll see what happens. I really need to get a nice new HBA and replace the hodgepodge of 2-port controllers I stuck in there on the cheap... I've also got the BIOS downloaded for an update. I grabbed F9 because F10 is listed as "beta", but that was about 3 years ago - I'm thinking that if there were serious issues, they'd have pulled it and issued a replacement by now. If I continue to have issues, I'll do the BIOS update. I'll update after things have been stable for a week or so. Thanks for your help!
FreeMan (Author) Posted February 22, 2018

Well, it crashed again yesterday. Attaching the syslog_tail and latest diagnostics, courtesy of FCP. If anyone sees anything in there that might indicate anything else, please let me know; otherwise I'll be trying the BIOS update soon.

FCPsyslog_tail.txt nas-diagnostics-20180221-1234.zip
FreeMan (Author) Posted February 25, 2018 BIOS updated (that was an adventure). We'll see how it goes.
FreeMan (Author) Posted February 28, 2018

Sadly, the BIOS upgrade doesn't seem to have helped - the server just rebooted. I'll post a fresh set of FCP syslogs when I get home, but I'm going to have to downgrade to 6.3.5 to get the machine to run stably. @limetech - any ideas what might be going on? This machine has been running without issue from late v5.x to v6.3.5. It's only since the 6.4 update that this has started happening.

Sent from Tapatalk
limetech Posted February 28, 2018

What do you mean by "crash"? Some choices:

a) kernel exception and stack dump spits out on console but console still alive
b) same as a) but can't use console
c) everything just hangs (freezes)
d) machine spontaneously reboots
e) machine catches on fire
f) other...
FreeMan (Author) Posted February 28, 2018

Quote

What do you mean by "crash"? Some choices: a) ... f) other...

Option D - spontaneous reboot

Sent from Tapatalk
limetech Posted February 28, 2018

5 minutes ago, FreeMan said:

Option D - spontaneous reboot

99% chance something is broken with your hardware. Have you run memtest lately? Why would this only show up with 6.4? Probably because the placement of code in memory is different. Pretty hard to track down these sorts of problems, unfortunately. Maybe leave the Log window open or open a terminal/telnet/ssh session with this command:

tail -f /var/log/syslog

This might catch something happening before the reboot.
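(A hedged refinement of that idea: because the box reboots spontaneously, a tail running on the server itself dies with it. Running the tail from a second machine and teeing it to a local file means the final lines survive the reboot. This sketch assumes SSH access is enabled and that "NAS" resolves to the server, per the hostname in the logs above:)

# Run from another machine so the captured log outlives the crash
ssh root@NAS tail -f /var/log/syslog | tee nas-syslog-live.txt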
FreeMan (Author) Posted February 28, 2018

Quote

99% chance something is broken with your hardware. Have you run memtest lately? ... This might catch something happening before the reboot.

I was thinking about running memtest. I'll kick that off this evening when I get home. I've attached 2 logs from Fix Common Problems to earlier posts. I didn't see anything obvious in them, but I'm far from the expert. I'll let memtest run overnight and/or a couple of test cycles, whichever comes first, and report back.

Sent from Tapatalk
FreeMan (Author) Posted February 28, 2018 Memtest is running. Would it run faster and/or have another advantage in SMP mode?
limetech Posted March 1, 2018

35 minutes ago, FreeMan said:

Memtest is running. Would it run faster and/or have another advantage in SMP mode?

It might not work on your h/w, but you can try it. Even if memtest passes multiple times without error, it doesn't necessarily mean your RAM is good. This is because, while it does test the cells themselves, it only tests one data path: CPU to/from RAM. It does not test other data paths that may exist, e.g., PCI to/from RAM using DMA.
FreeMan (Author) Posted March 1, 2018 Is there any way to test any of these other data paths? Or is it a matter of throwing new components at the box until it stops rebooting?
limetech Posted March 1, 2018

3 minutes ago, FreeMan said:

Is there any way to test any of these other data paths? Or is it a matter of throwing new components at the box until it stops rebooting?

If you suspect memory, do something like this: if you have 2 sticks, remove one of them. If it fails, replace with the other one. If it now passes, you have a bad stick. If both fail, or both pass, you probably have something else. A marginal PSU can also cause very strange symptoms. But don't forget also to have an open window with that 'tail' command running.
Frank1940 Posted March 1, 2018

9 hours ago, FreeMan said:

Option D - spontaneous reboot

Sent from Tapatalk

I had this happen and it was a bad PSU. It's an easy (and usually cheap) thing to test, as many times there is a spare PSU that you can use for testing. (If you are on decent terms with some of the larger suppliers, they will 'loan' a unit for 'free' for thirty days.)
pwm Posted March 1, 2018

It doesn't hurt to run mprime or similar to stress-test the CPU + memory while consuming a huge amount of power. Combine with some disk test program - or possibly just concurrently copy from every disk to /dev/null - to give the PSU a real work-out and verify that it can handle a high load for an extended time without overheating or starting to produce unclean power.
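(As a rough sketch of the disk half of that suggestion - the device names here are examples only; substitute the disks actually in the array - something like this reads every drive flat-out in parallel while mprime loads the CPU:)

# Stream every data disk to /dev/null concurrently to load the PSU
for d in /dev/sd[b-e]; do
    dd if="$d" of=/dev/null bs=1M iflag=direct &
done
wait    # let all background reads finish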
FreeMan (Author) Posted March 2, 2018

24 hours of memtest, 6+ full passes, no errors found. No guarantee that the memory is good, I understand that, but it's a good sign, right?

Interesting: I kept some fairly detailed notes on the hardware I bought when I upgraded the server, but not a single thing about when I did it. I know the current memory is just over a year old. I think I upgraded everything about 3-4 years ago. I put a Corsair TX series PSU in it - not the highest quality one out there, but reasonable. I guess even the highest quality is no guarantee of life expectancy.

@Frank1940 - it took me a couple of minutes to figure your post out, but I got it!

@pwm - I'm sure a parity check doesn't max out the CPU like mprime would, but I've made it through quite a number of parity checks over the last several weeks involving spinning all the drives. If memory serves, I've had at most one reboot during a parity check, possibly none. Would a parity check be reasonably comparable?
FreeMan (Author) Posted March 2, 2018

Also, I was going to post the FCPsyslog_tail.txt from the most recent reboot, but I foolishly used TeamViewer to access the server and put FCP back into troubleshooting mode, which overwrote the file from the previous reboot. There are two up above from earlier reboots, though, if they might provide some clues. I did notice this in the syslog:

Feb 28 10:04:01 NAS root: Fix Common Problems Version 2018.02.18
Feb 28 10:04:11 NAS root: Fix Common Problems: Error: unclean shutdown detected of your server
Feb 28 10:04:17 NAS root: Fix Common Problems: Error: Machine Check Events detected on your server
Feb 28 10:04:17 NAS root: mcelog: ERROR: AMD Processor family 21: mcelog does not support this processor. Please use the edac_mce_amd module instead.
Feb 28 10:04:17 NAS root: CPU is unsupported

I've asked on the FCP thread, but I'll ask here (because everyone loves a cross-post) - is there something I can do to enable the edac_mce_amd, or does Squid have to bake that into FCP at his end?
Squid Posted March 2, 2018

Quote

...is there something I can do to enable the edac_mce_amd, or does Squid have to bake that into FCP at his end?

If nothing else, at least you now know it's a hardware issue.

Sent from my SM-T560NU using Tapatalk
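(For reference: edac_mce_amd is a standard kernel module, so if it was built for the running Unraid kernel it can usually be loaded by hand - a sketch, not a confirmed fix. Once loaded, EDAC decodes machine check events into the kernel log rather than relying on mcelog:)

# Load the AMD EDAC MCE decoder, then confirm it registered
modprobe edac_mce_amd
dmesg | grep -i edac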
FreeMan (Author) Posted March 2, 2018

21 hours ago, Frank1940 said:

I had this happen and it was a bad PSU. It's an easy (and usually cheap) thing to test...

Before getting a 'loaner', are the PSU testers actually any good for detecting these flaky types of issues?
Frank1940 Posted March 2, 2018

2 hours ago, FreeMan said:

Before getting a 'loaner', are the PSU testers actually any good for detecting these flaky types of issues?

You can read the reviews, but I suspect that the best use for these 'testers' is to make sure that the various voltages are close to the spec limits and that the unit powers on and off. Of course, the more expensive a unit you buy, the better the test results will be and the more things that can be tested. I would imagine that the best ones will cost more than a replacement PSU.