Jessie Posted July 12, 2017
Sorry, not sure where I should place this question.
I built an Unraid server with a couple of VMs and a couple of Docker containers. It runs well for maybe a week or two, then the system locks up and becomes totally unresponsive. I can't telnet to it, and all VMs and containers are unresponsive. The information is still present on the Unraid boot screen after the lockup. The only way back is a forced shutdown and power-on; it then runs for another week or so and crashes again.
Hardware is an e3-245 processor on a Gigabyte x170-WS-ecc motherboard, with 32 GB ECC RAM, 2 x Samsung 850 EVO 500 GB SSDs for the cache and 3 x Western Digital Red 3 TB drives for the array. There is an ASUS Radeon R5 230 card for monitoring one of the VMs; the other VM is accessed via RDP. A USB controller is mapped to one of the VMs so that USB backup drives can be hot-swapped within the VM.
The VMs are Windows 10 Pro and Small Business Server 2008. The Small Business Server is a domain controller, so the Unraid server has a static address to be accessible, then the SBS software takes over the domain after it boots. The USB controller is mapped to the SBS VM and the graphics card is mapped to the Windows 10 VM. There is a Privoxy docker loaded for popup blocking, and Nextcloud/MariaDB/Letsencrypt is working in the system but not being utilised yet.
The system is protected by a UPS. I originally had the UPS USB cord plugged in and the Unraid UPS monitoring enabled. When it started crashing, the Unraid screen displayed a message about (I think) "int 32 don't care" or something along those lines. I researched int32 and found it was related to the UPS monitoring, so I unconfigured it and removed the USB connector. This improved the reliability from a crash every couple of days to one every couple of weeks (realistically 1.5 to 2 weeks). So now I'm not sure where to look.
I've built quite a few Unraid machines, including another 3 using the same hardware as above but in different configurations, and they are rock solid. They just work. The others use multiple graphics cards as well as mapped USB controllers and multiple operating systems without any issues. My only doubt is whether I have pushed the envelope too far with Small Business Server. Alternatively it might be a hardware issue. I don't know. HELP!!!
Frank1940 Posted July 13, 2017
I would try installing the 'Fix Common Problems' plugin and turning on its "Troubleshooting Mode". The log files that it produces might provide some clues...
Jessie Posted July 26, 2017
Thanks Frank, I was running the Fix Common Problems plugin. The problem was that the system locked up before it could report. I took a punt and replaced the memory: I used 2 x 16 GB modules in lieu of 4 x 8 GB, and it hasn't missed a beat since. I've heard rumours of 4 modules vs 2 not interworking well sometimes. It's hard to test because it passed memory tests every time.
JonathanM Posted July 26, 2017
Jessie said: "I've heard rumours of 4 modules vs 2 not interworking well sometimes."
Usually it's a result of marginal power, but not typically from the PSU; it's the motherboard's power supply to the CPU and RAM sticks that's not up to the challenge. More efficient sticks can help in some cases, but many times it's just better to run fewer sticks. Registered memory solves the issue completely.
Frank1940 Posted July 26, 2017
Jessie said: "I used 2 x 16 GB modules in lieu of 4 x 8 GB, and it hasn't missed a beat since. I've heard rumours of 4 modules vs 2 not interworking well sometimes. It's hard to test because it passed memory tests every time."
I have heard the same rumor(s). I seem to recall something about the memory line drivers not being able to maintain the proper waveform when the load increases (i.e. 4 modules vs 2 modules). Most motherboard manufacturers have lists of 'approved' RAM modules because of this. A couple of memory manufacturers also have tables recommending which of their products should be used in various motherboards. It is probably a bigger issue with larger-capacity modules than with smaller ones (i.e., 4 x 2 GB modules will work when 4 x 8 GB modules won't).
Jessie Posted July 26, 2017
jonathanm said: "Usually it's a result of marginal power, but not typically from the PSU; it's the motherboard's power supply to the CPU and RAM sticks that's not up to the challenge. More efficient sticks can help in some cases, but many times it's just better to run fewer sticks. Registered memory solves the issue completely."
I would have thought this board would be OK. It's not bottom shelf. Nonetheless, I have deployed the 8 GB modules to 2 other machines. I will be interested to see if the problem moves or goes away.
Jessie Posted August 1, 2017
The plot thickens. It looks like the memory is not the issue. The machine ran reliably for a long period of time. I logged into it to check backups, then between 12 and 24 hours later... CRASH...
It now seems that the trigger might be me logging in remotely. I am accessing the system via an IPsec tunnel. Access is to the Unraid web interface via the internal IP address. I also RDP'd to the SBS server and to a Windows 10 VM using internal IP addresses. Everything works normally, then 12 - 15 hours later... CRASH...
If I stay away from it, it works without issues... (At least that is my theory at this stage.) Any ideas??
Jessie Posted August 22, 2017
More on this. I have read about a hyperthreading bug on Skylake processors which causes crashes such as the one I am experiencing. The motherboard I am using is a Gigabyte x170-ws and uses a Xeon E3. I had a look on the Gigabyte site and there is a firmware update to address the CPU issue. Fingers crossed.
Jessie Posted August 27, 2017
On 22/08/2017 at 11:33 PM, Jessie said: "More on this. I have read about a hyperthreading bug on Skylake processors which causes crashes such as the one I am experiencing. The motherboard I am using is a Gigabyte x170-ws and uses a Xeon E3. I had a look on the Gigabyte site and there is a firmware update to address the CPU issue. Fingers crossed."
Upgraded the firmware on the weekend. So far so good. One thing happened though: the top PCIe slot ceased to function. I had a USB3 card plugged into that slot, passed through to the Small Business Server VM. I moved it to the bottom PCIe slot, where it seemed to interfere with the graphics card in the top PCIe x16 slot, so I settled for the middle PCIe slot. I'm not happy about that though, because I have other servers using that slot for a second graphics card.