Intermittent hangs.. HELP

July 12, 2017

Sorry, not sure where I should place this question.

I built an unraid server with a couple of VM's and a couple of dockers.

Essentially it runs well for maybe a week or two. Then the system locks up.

Totally unresponsive. Can't telnet to it. All vm's and dockers are unresponsive. The information is still present on the unraid boot screen after lockup.

The only way back is a forced shutdown and turn it on again, then it will go for another week or so and crash again.

Hardware is an e3-245 processor on a gigabyte x170-WS-ecc motherboard.

32gb ecc ram, 2 x Samsung 500gb evo850 ssd's for the cache and 3 x western digital 3tb reds for the array.

There is an ASUS Radeon R5-230 card for monitoring one of the VMs. The other VM is accessed via RDP.

There is a usb controller which is mapped to one of the vm's so that usb backup drives can be hot swapped within the vm.

The VM's are Windows 10 pro and small business server 2008. The small business server is a domain controller, so the unraid server has a static address to be accessible, then the sbs software takes over the domain after it boots. The USB controller is mapped to the SBS vm and the graphics card is mapped to the windows 10 vm.

There is a privoxy docker loaded for popup blocking and Nextcloud/Mariadb/Letsencrypt is working in the system but not being utilised yet.

The system is protected by a UPS. I originally had the UPS usb cord plugged in and the unraid ups monitoring enabled.

When it started crashing the unraid screen displayed a message about (I think) int 32 don't care or something along those lines.

I researched int32 and found it was related to the ups monitoring so I unconfigured it and removed the usb connector.

This improved the reliability from a crash every couple of days to every couple of weeks. (Realistically 1.5 to 2 weeks)

So now I'm not sure where to look. I've built quite a few unraid machines and also another 3 using the same hardware but different configurations as above and they are rock solid. They just work.

The others are using multiple graphics cards as well as mapped usb controllers, multiple operating systems without any issues.

My only doubt is whether I have pushed the envelope too far with Small business server. Alternatively it might be a hardware issue. I don't know.

HELP!!!

July 13, 2017

Community Expert

I would try installing the 'Fix Common Problems' plugin and turn on the "Troubleshooting Mode". The log files that it produces might provide some clues...

July 26, 2017

Author

Thanks Frank,

I was running the fix common problems plugin. The problem was that it locked up before it could report.
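When a box locks up before any diagnostics can be written, one possible workaround is to mirror the syslog to persistent storage as it grows, so the last lines before the hang survive the forced reboot. The following is only a minimal sketch in Python, not part of the plugin or of Unraid itself. The function names `mirror_new_lines` and `watch` are made up for illustration, and the paths in the usage note are assumptions that would need checking on the actual system.

```python
import os
import time


def mirror_new_lines(src_path, dst_path, offset=0):
    """Append whatever was written to src_path after `offset` to dst_path.

    Returns the new offset to pass in on the next call.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        # The source shrank (rotated or truncated), so start from the top.
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Force the data to disk so it survives a hard lockup.
            os.fsync(dst.fileno())
    return offset + len(chunk)


def watch(src_path, dst_path, interval=5.0):
    """Poll src_path forever, mirroring new data every `interval` seconds."""
    offset = 0
    while True:
        offset = mirror_new_lines(src_path, dst_path, offset)
        time.sleep(interval)
```

Usage would be something like `watch("/var/log/syslog", "/boot/logs/syslog-mirror.txt")`, where both paths are assumed rather than confirmed for this setup. Frequent writes to a flash drive add wear, so a longer interval or a different destination may be the better trade-off.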

I took a punt and replaced the memory.

Used 2 x 16gb modules in lieu of 4 x 8gb and it hasn't missed a beat since.

I've heard rumours of 4 modules vs 2 in relation to not interworking well sometimes.

Hard to test because it passed memory tests every time.

July 26, 2017

9 minutes ago, Jessie said:

I've heard rumours of 4 modules vs 2 in relation to not interworking well sometimes.

Usually it's a result of marginal power, though not typically from the PSU; it's the motherboard's power supply to the CPU and RAM sticks that's not up to the challenge. More efficient sticks can help in some cases, but many times it's just better to run fewer sticks.

Registered memory solves the issue completely.

July 26, 2017

Community Expert

4 minutes ago, Jessie said:

Used 2 x 16gb modules in lieu of 4 x 8gb and it hasn't missed a beat since.

I've heard rumours of 4 modules vs 2 in relation to not interworking well sometimes.

Hard to test because it passed memory tests every time.

I have heard the same rumors. I seem to recall something about the memory line drivers not being able to maintain the proper waveform when the load increases (i.e., 4 modules vs 2 modules). Most MB manufacturers have lists of 'approved' RAM modules because of this. A couple of memory manufacturers also have tables that recommend which of their products should be used in various MBs. It is probably a bigger issue with larger capacity modules than with smaller ones (i.e., 4 x 2GB modules will work when 4 x 8GB won't).

July 26, 2017

Author

1 hour ago, jonathanm said:

Usually it's a result of marginal power, though not typically from the PSU; it's the motherboard's power supply to the CPU and RAM sticks that's not up to the challenge. More efficient sticks can help in some cases, but many times it's just better to run fewer sticks.

Registered memory solves the issue completely.

I would have thought this board would be OK. It's not bottom shelf. Nonetheless, I have deployed the 8gb modules to 2 other machines. I will be interested to see if the problem moves or goes away.

August 1, 2017

Author

The plot thickens,

It looks like the memory is not the issue. The machine ran reliably for a long period of time. I logged into it to check backups, then between 12 and 24 hours later... CRASH...

It now seems that the trigger might be me logging in remotely. I am accessing the system via an IPSEC tunnel. Access is to the unraid web interface via the internal ip address. I also RDP'd to the SBS server and to a Windows 10 VM using internal ip addresses.

Everything works normally, then 12 - 15 hours later... CRASH...

If I stay away from it, it works without issues... (At least that is my theory at this stage)

Any ideas??

August 22, 2017

Author

More on this. I have read about a hyperthreading bug on Skylake processors which causes crashes such as the one I am experiencing. The motherboard I am using is a Gigabyte x170-ws and uses a Xeon E3. I had a look on the Gigabyte site and there is a firmware update to address the CPU issue. Fingers crossed.

August 27, 2017

Author

On 22/08/2017 at 11:33 PM, Jessie said:

More on this. I have read about a hyperthreading bug on skylake processors which cause crashes such as the one I am experiencing. The motherboard I am using is a gigabyte x170-ws and uses a xeon e3. I had a look on the gigabyte site and there is a firmware update to address the CPU issue. Fingers Crossed.

Upgraded the firmware on the weekend. So far so good. One thing happened though: the top PCIe slot ceased to function. I had a USB3 card plugged into that, passed through to the Small Business Server VM. I moved it to the bottom PCIe slot and it seemed to interfere with the graphics card in the top PCIe x16 slot.

I settled for the middle PCIe slot. Not happy about that though, because I have other servers using that slot for a second graphics card.
