Still locking up

PR. · September 2, 2018

Another couple of weeks, another lockup...

It locked up just over two weeks ago. I ran memtest86+ from the unRAID boot menu using the SMP option, and after a few minutes it errored out and rebooted the machine. Looking at the memtest86+ information, the reported RAM timings were wrong and the memory was reported as DDR3 rather than DDR4. So I downloaded the latest memtest86+ version from their website, ran it from a separate USB stick, and it ran for 24 hours without issue.

I updated the BIOS for my ASRock Z370 Taichi and checked the BIOS settings again. I then switched unRAID to UEFI boot rather than BIOS, and it had been up and running for 16 days (running VMs, Docker containers, maxing the CPU cores on transcoding etc.) until the early hours of this morning. Once again, when I woke up the server was locked up, with no errors on the monitor (just the 'login' line) and nothing in the logs.

Looking at the logs this time, there aren't even any call traces, as I'd stopped using Pi-hole.

I've attached the latest log file, and screenshots of my BIOS config just in case I'm missing something.

I really don't know what to do...

Server Specs:

ASRock Z370 Taichi (Intel Z370)
Intel Core i7-8700K 3.7GHz
Corsair Vengeance LPX 16GB RAM (2x8GB)
Samsung 500GB 850 EVO SSD (2x for Cache Drive)
SanDisk SSDs (2x as Unassigned Devices for VM Storage)
WD Red Assorted for Array
Seasonic P Series 760w '80 Plus Platinum' Modular Power Supply

syslog-1534553021.txt

BIOS Photos.zip

John_M · September 2, 2018

Diagnostics might reveal something (Tools -> Diagnostics). The Fix Common Problems plugin running in troubleshooting mode might reveal something. The wording of your post suggests that this has happened before so a link to your previous thread would be useful - though it would have been even more useful if you had continued there.

PR. · September 2, 2018

I've not used Troubleshooting Mode in the Fix Common Problems plugin in the past because the machine can run for 40+ days before locking up and I'm concerned it will fill up the USB stick. I am using a custom script that writes the syslog to the USB stick as entries are generated, which is what I posted earlier. I've attached my full diagnostics to this post, though obviously they don't include the logs from before the crash.
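For anyone wanting to do something similar, here is a minimal sketch of the idea in Python. It is not the custom script mentioned above. The function names are made up for illustration, and the paths in the usage note are assumptions about a typical unRAID layout. It appends whatever has been added to the live syslog since the last check to a copy on the flash drive, and forces each write out to the stick so that as little as possible is lost in a hard lockup.

```python
import os
import time


def mirror_new_bytes(src_path, dst_path, offset):
    """Append bytes added to src_path since `offset` to dst_path.

    Returns the new offset. If the source has shrunk (for example after
    log rotation), copying restarts from the beginning of the source.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        offset = 0  # source was truncated or rotated
    if size == offset:
        return offset  # nothing new to copy
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        data = src.read(size - offset)
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to the stick immediately
    return offset + len(data)


def follow(src_path, dst_path, interval_s=5.0):
    """Poll the source forever, mirroring new data every interval_s seconds."""
    offset = 0
    while True:
        offset = mirror_new_bytes(src_path, dst_path, offset)
        time.sleep(interval_s)
```

Something like follow("/var/log/syslog", "/boot/logs/syslog-mirror.txt") would start it, assuming those are the right paths on your system. Plain text syslog grows slowly, so the copy should take a long time to make a dent in the free space on a typical stick.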

Here are my previous posts:

aventine-diagnostics-20180902-1542.zip

Edited September 2, 2018 by PR.

John_M · September 3, 2018

Why do you keep starting new threads that are about the same unresolved issue? If you kept to one long thread it would be easier to see what others have suggested and what you've done. The threads seem to follow a pattern: you ask for help; someone offers advice; you presumably act on it but I'm not really sure because there's a gap of a few days and then the thread ends with you saying something along the lines of "another day, another crash". No progress, just frustration.

So you've run memtest for a good long time and you've stopped using what was believed to be causing call traces. You're running the latest stable release of unRAID and your disks look ok. There's nothing of any concern in your syslog and the rest of your diagnostics look normal. You're on a very recent BIOS. Nothing much to go on.

OK. A few things to think about. How is your power distribution? And how is your cooling? Is your CPU merely idling most of the time or do you work it hard? You're not tempted to overclock it, are you (seeing as how you bought the more expensive K version)? Have you tried running a basic NAS for a while with dockers and VMs disabled? When you have confidence in that then start enabling things a little at a time. I wouldn't worry about Fix Common Problems filling up your USB flash - the 14 GB of free space will hold an awful lot of text. So either turn on troubleshooting mode or run your own script if you prefer and see how long it goes before it crashes again.

PR. · September 4, 2018

Problem is, sometimes it locks up with an error on the display and sometimes it just locks up with nothing. When it first started I assumed it was something to do with the new release of 6.4.x.

I don't think it will be a power issue. I’ve had a Dell server plugged in to that outlet for 8+ years (running Windows Home Server) and then an old home made server plugged in there for 4+ years (running Windows Server 2012). I’ve also got an old 2010 i7 iMac and an old i7 gaming PC plugged in nearby that have run for years without issues.

As of right now the server spends most of the day idling with the array drives spun down. It’s got Plex on it so it occasionally does some DVR transcoding, but I’m pretty much the sole user of Plex so it’s never stressed. None of the other Docker containers are very demanding. I stopped running Pi-hole because its need for a separate IP address seemed to be causing macvlan call trace errors, and also because when the server crashed it was obviously unable to respond to DNS queries from other devices.

CPU temps seem reasonable. I’m not overclocking, and I have a large aftermarket Noctua cooler fitted that keeps idle temps in the low 30s and load temps in the high 40s. That’s what the Dynamix plugin reports, and the BIOS and memtest report similar values, so I have no reason to doubt them. The room it’s in doesn’t get overly warm either.

The vast majority of crashes have happened overnight between 4am and 7am, which made me think it was an SSD trim issue, as this is when the daily trim runs. Sometimes the trim is in the logs pre-crash, sometimes it is not.
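A quick way to check a hunch like this is to tally the lockups by hour and see what share falls in the suspect window. The sketch below does that in Python. The timestamps in it are made-up placeholders, not the real crash times from this server, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def crash_hours(last_seen):
    """Count lockups by the hour of the last log entry seen before each one."""
    return Counter(
        datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour for stamp in last_seen
    )


def share_in_window(hours, start_hour, end_hour):
    """Fraction of lockups whose hour h satisfies start_hour <= h < end_hour."""
    total = sum(hours.values())
    if total == 0:
        return 0.0
    inside = sum(n for h, n in hours.items() if start_hour <= h < end_hour)
    return inside / total


# Placeholder data only: replace with the last syslog timestamp before each lockup.
PLACEHOLDER_TIMES = [
    "2000-01-01 04:10",
    "2000-01-02 06:55",
    "2000-01-03 17:45",
    "2000-01-04 05:30",
]

hours = crash_hours(PLACEHOLDER_TIMES)
overnight_share = share_in_window(hours, 4, 7)
```

If most lockups land in one window, the jobs scheduled in that window (trim, mover, backups) are the first things to reschedule one at a time.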

I could disable Docker/VMs but I have a work VM on it that I really need access to, and I don’t really want to have the server unavailable for anywhere up to a month.

I’ll turn on troubleshooting mode and see what happens.

Thanks

John_M · September 5, 2018

I meant the DC power distribution within the server: the power supply, cables, any splitters or backplanes, any nasty Molex connectors, the high current 12 volt supply to the CPU. Lots of weird faults end up being power supply related.

Try changing the schedule for trim if it's on your list of suspects. What else happens around that time? The mover?

If you really want to fix the problem you need to break it down and start eliminating things. It's your choice whether you live without some services for a while and maybe narrow down the problem or live with a permanently unreliable server.

PR. · September 6, 2018

I can't say for certain it's not the PSU, but it's new and there are no dodgy connectors, just the included cables going straight to the components. Voltages in the BIOS look OK, though that's not exactly proof.

The server crashed again today. I came home just after 17:00 and checked it was still up and running. Diagnostic Mode ran at 17:09 and 17:39 and never ran again; when I checked after 19:00 the server was not responding. I had moved the trim schedule to run at 19:00 and thought this might be the smoking gun, but no, the server looks to have crashed before the trim had a chance to run.
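The reasoning here can be written down as a small calculation. The sketch below takes the two diagnostics times from this post and the 19:00 trim time, infers the logging interval from the gap between runs, and bounds when the lockup must have happened. It assumes the diagnostics would have kept running on schedule had the server been alive; the function name is hypothetical.

```python
from datetime import datetime


def crash_window(run_times):
    """Bound the lockup time from periodic diagnostics runs.

    The server was alive at the last run and, assuming the runs are
    evenly spaced, was dead by the time the next run was due.
    """
    runs = sorted(run_times)
    if len(runs) < 2:
        raise ValueError("need at least two runs to infer the interval")
    interval = min(later - earlier for earlier, later in zip(runs, runs[1:]))
    return runs[-1], runs[-1] + interval


# Times taken from the post: diagnostics at 17:09 and 17:39, trim moved to 19:00.
runs = [datetime(2018, 9, 6, 17, 9), datetime(2018, 9, 6, 17, 39)]
trim_time = datetime(2018, 9, 6, 19, 0)

earliest, latest = crash_window(runs)
trim_after_window = trim_time > latest  # True means the trim never got to run
```

On these figures the lockup falls between 17:39 and 18:09, well before the rescheduled trim, which is why the trim can be crossed off for this particular crash.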

I've disabled VMs (not that I'd run any since the last crash) and am only running 3 Docker containers. We'll see how it goes...

Last Diagnostics are attached.

aventine-diagnostics-20180906-1739.zip

lionelhutz · September 7, 2018

I had similar lockups. Finally dumped the ASRock H170 Pro 4 motherboard for a Gigabyte Z270 board and no issues since. Probably not what you want to hear, but it's not likely you'll find anything to narrow it down. My money would be on the ASRock motherboard.

PR. · September 30, 2018

Since 6.6.0 the lockups seem to have increased; I don't think I've seen more than a couple of days of uptime before it locks up. After another lockup I started getting errors on one of the cache drives:

BTRFS critical (device sdh1): corrupt leaf: root=5 block=1935130624 slot=25 ino=625408 file_offset=0, invalid disk_num_bytes for file extent, have 18412971642897364992
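As a sanity check on that message, the sketch below pulls the number off the end of the log line and compares it with the nominal size of one cache SSD. The 500 GB figure is taken from the spec list at the top of the thread; treating it as the device size for sdh1 is an assumption. The function name is hypothetical.

```python
import re

LOG_LINE = (
    "BTRFS critical (device sdh1): corrupt leaf: root=5 block=1935130624 "
    "slot=25 ino=625408 file_offset=0, invalid disk_num_bytes for file "
    "extent, have 18412971642897364992"
)

# Assumption: sdh1 is one of the nominal 500 GB cache SSDs.
DEVICE_BYTES = 500 * 10**9


def reported_value(line):
    """Return the integer after 'have' at the end of the line, or None."""
    match = re.search(r"have (\d+)\s*$", line)
    return int(match.group(1)) if match else None


value = reported_value(LOG_LINE)
value_eib = value / 2**60              # size expressed in EiB
times_device = value / DEVICE_BYTES    # how many times larger than the SSD
```

The reported value works out at just under 16 EiB, so it cannot be a real extent length on a 500 GB device and the metadata really was corrupt. This does not say what corrupted it; RAM, the SSD, or an unclean shutdown during a lockup are all still possible.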

I tried running a repair but it wasn't having it. In the end I accidentally-on-purpose wiped the cache pool, recreated it, and restored from a backup: no more errors. Self tests on both SSDs reported no issues. It locked up again a couple of days later.

I opened it up and ripped out the old SanDisk SSDs that were being used in Unassigned Devices, then removed, re-connected, and checked all the remaining SATA cables. I did the 6.6.1 update yesterday morning and ran a Duplicati backup to the array from my gaming PC last night. I turned the PC off and got into bed, and unRAID had locked up, no errors.

I updated the BIOS this morning and kicked off an MPrime/Prime95 blend test this afternoon. It's been running nearly 10 hours so far without issue; I will leave it running for 28 hours and see what happens...

Edited September 30, 2018 by PR.

lionelhutz · October 2, 2018

You still won't want to hear it, but you'll likely have to start swapping major components to figure this out. Something in there is a piece of crap and the only way to fix it is to replace it.

PR. · October 2, 2018

So after 29 hours of 100% load across all the cores without issue I stopped the process yesterday afternoon.

Ran a parity check after starting the array which finished this morning, didn't start any docker containers or VMs.

Checked it this evening at about 9pm and it was still running OK; just checked it again at 11pm and it's locked...

Guess I'll try 29 hours of memtest next...

Vr2Io · October 3, 2018

I would suggest not using memtest86+. I use MemTest (from HCI Design) for Windows. Open several sessions, each testing 2000MB of memory (so maybe 6 for 16GB).

https://hcidesign.com/memtest/
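One way to arrive at a session count like the 6 suggested above is to leave some memory aside for Windows and divide the rest into 2000MB chunks. The sketch below does that arithmetic. The 4096MB reserve is an assumption chosen for illustration, not a figure from the HCI MemTest documentation, and the function name is hypothetical.

```python
def memtest_sessions(total_mb, per_session_mb=2000, reserve_mb=4096):
    """Number of MemTest sessions that fit after reserving memory for the OS."""
    usable_mb = total_mb - reserve_mb
    return max(usable_mb // per_session_mb, 0)


# 16GB of RAM (2x8GB) with the assumed 4096MB reserve.
sessions_for_16gb = memtest_sessions(16 * 1024)
```

With these assumptions 16GB gives 6 sessions. A smaller reserve would allow more sessions and cover more of the RAM, at the risk of Windows starting to page.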

For a server, I would test the memory at a stepped-up speed for only 2 or 3 passes. If the test passes without error, I then step the speed back down for long-term running.

E.g. I tested my Ryzen at 2400MHz but run it long-term at 2133MHz (4 sticks), and I haven't noticed any slowdown. The test time was short and the setting gives me confidence.

Agree with @lionelhutz: if all the stress tests pass and the problem persists, the mainboard should be replaced.

And lastly, if the Prime test used something like the Blend profile, then further testing is really not necessary.

"The stress-test feature in Prime95 can be configured to better test various components of the computer by changing the fast fourier transform (FFT) size. Three pre-set configurations are available: Small FFTs and In-place FFTs, and Blend. Small and In-place modes primarily test the FPU and the caches of the CPU, whereas the Blend mode tests everything, including the memory. "

Edited October 3, 2018 by Benson

PR. · October 8, 2018

Well, it locked up again on Friday and then again this afternoon, again with nothing running. On Friday I pulled out the internal USB header connector (which lets the USB stick stay inside the server rather than using an external port), and that had me wondering about the RAM again. Again, memtest said all was OK.

So this afternoon, after it crashed again, I popped the server open, replaced the internal USB header connector, looked again at the RAM, and checked the manual. It's not fitted in the right slots...

I had the two DIMMs plugged in to A1/A2; apparently for dual channel they need to be in A2/B2. I was sure I'd have noticed this before, but I checked my photos of the BIOS from February and it's there, clear as day...

So I swapped them to the correct slots and now it reports Dual-Channel.

Maybe that will make the difference...

PR. · October 10, 2018

No it didn’t, less than 2 days uptime and I woke up this morning to it not responding.

Now a new development: the webui won’t load. I logged on to the server locally, checked the syslog and, surprise surprise, it says everything’s fine...

PR. · December 1, 2018

I thought I'd post a little update on this. As I was having lockups even when Dockers and VMs were disabled, and being fairly sure the hardware was sound, I decided to look at the plugins, so I removed the following:

Dynamix Directory Caching
Dynamix System Info
Dynamix System Stats
Dynamix System Temp
Nerd Pack
Tips and Tweaks

Leaving:

CA Backup
CA Cleanup Appdata
CA Update Applications
Community Applications
Dynamix SSD Trim
Fix Common Problems
Unassigned Devices
unBalance
User Scripts

I also decided to boot in GUI mode so that when it locked up I could get an exact time (I guessed the clock would freeze too).

Now, 45 days later (the longest uptime since I built the server), I've not had any lockups despite heavy Docker and VM usage. The syslog has recorded no errors and (touch wood) everything has worked as it should. Assuming it's not a GUI/CLI display issue, my theory is that the constant reading of the temp sensors was causing a problem with the motherboard(?). That seems unlikely, but it's all I could come up with.

My next planned step is to update from 6.6.2 to 6.6.5, boot back in to normal CLI mode and see how it goes. If I have problems I'll try going back to GUI mode, and if there are still problems I'll go back to 6.6.2 in CLI mode.

I'll update again once I have further info!

Edited December 1, 2018 by PR.

PR. · January 20, 2019

Final update on this...

Having installed 6.6.6 and booted in to normal CLI mode as mentioned previously, I've not had any further issues and I'm currently on 45 days of uptime.

If you're reading this and having problems with instability and are sure your hardware is sound I'd suggest disabling plugins as a first step!

trurl · January 20, 2019

On 9/2/2018 at 9:07 AM, John_M said:

The wording of your post suggests that this has happened before so a link to your previous thread would be useful - though it would have been even more useful if you had continued there.

On 9/3/2018 at 12:45 AM, John_M said:

Why do you keep starting new threads that are about the same unresolved issue? If you kept to one long thread it would be easier to see what others have suggested and what you've done. The threads seem to follow a pattern: you ask for help; someone offers advice; you presumably act on it but I'm not really sure because there's a gap of a few days and then the thread ends with you saying something along the lines of "another day, another crash". No progress, just frustration.

I would try to clean this up but it is all just too old and too much trouble. I'm sorry I didn't notice any of this way back then, I would have cleaned it up before it got this bad and tried to get the user to quit doing it.

@PR.

Please, please don't do this again. Stick to one thread so we can all keep all of the information together. I really don't see why you would have even considered starting another and then another thread without any context for anybody. It would take a lot of trouble for us to even realize these were all the same person talking about the same problem.

PR. · January 20, 2019

At the time the issues weren't entirely clear. Yes, the machine was locking up, but the errors were different: initially I was running an RC version of unRAID so put it down to that (first post); then the issue went away; then it came back and I was getting screen dumps (second post); then I was getting drive errors and the cache pool failed (third post); then I had a HDD failure. It's only since August/September that the issue has been identical, with it locking up at the cursor with 0 errors (this post).

When I was asked to keep everything to one post I did so and have been journaling my problems here ever since.

trurl · January 21, 2019

OK. Sorry for being a bad cop. There was a lot of time between some of these. And you did pull it all back together by linking to the other threads.

Thanks for the wrapup. Hope it continues to work for you.
