July 26, 2017

See last post for diagnostics.

Ever since I started using unRAID about a week ago, my server locks up hard after 4-12 hours of heavy downloading with NZBget. I have a single VM running NZBget, SickRage and Plex. I have allocated anywhere from 4-12 cores and 16-50 GB of RAM to it (I have a Xeon D-1540 with 8 cores/16 threads at 2 GHz and 64 GB of DDR4 ECC RAM), and the damn thing still crashes.

Usually the VM crashes first, then at some point it locks up the entire system: the web GUI dies, SSH dies, and even a local console locks up after I enter "root", with no errors printed to the console. Neither a clean "software shutdown" via IPMI nor an ACPI shutdown (button press) does anything, so I have to do a hard power off to get it working again. This happens every day or two and I can't narrow down what the actual problem is.

I know my CPU doesn't seem to like virtualization, since the VM would regularly crash and lock up under bhyve on FreeNAS, while FreeNAS itself stayed fine. The odd thing is that I've used KVM for months under Arch Linux, CentOS and Ubuntu with no issues at all, with 3-4 VMs running at a time on a "base" OS such as Arch.

Since unRAID lives in RAM, I can't get any crash logs or dmesg output after a reboot because they're all gone. I typed "root" into the login prompt at least 15 minutes ago and it's still waiting for the password prompt to appear!

I'm going to order a new motherboard (Asus X99-W/IPMI) and a beefier CPU (Xeon E5-22xx 3.6 GHz, 6 cores) with an AIO water cooler, but that's going to cost me about a grand and take at least 2-3 weeks to arrive, and I'd like to figure this out in the meantime so I can actually use my system for more than a few hours at a time.

Edited July 27, 2017 by brando56894
July 26, 2017

I can't offer much help with the VM issue, but have you considered running NZBget, SickRage and Plex as Dockers?
July 26, 2017 · Community Expert

brando56894 said: "Since unRAID lives in RAM I can't seem to get any crash logs or anything from dmesg upon reboot since they're all gone."

The Fix Common Problems plugin's troubleshooting mode will save logs to the flash drive periodically.
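For anyone who can't install the plugin, the same idea (copying the in-RAM syslog to the USB flash drive on a timer so it survives a hard power-off) can be sketched in Python. This is not how Fix Common Problems implements it; it assumes the syslog lives at /var/log/syslog, the flash drive is mounted at /boot, and a Python 3 interpreter is available on the host (stock unRAID may not ship one). The /boot/logs directory and the function names are hypothetical.

```python
import shutil
import time
from datetime import datetime
from pathlib import Path

# Assumed locations (hypothetical for this thread): in-RAM syslog and a
# directory on the USB flash drive, which persists across reboots.
SYSLOG = Path("/var/log/syslog")
FLASH_DIR = Path("/boot/logs")


def snapshot_log(src: Path, dest_dir: Path, now: datetime) -> Path:
    """Copy src into dest_dir under a timestamped name and return the new path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.stem}-{now:%Y%m%d-%H%M}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves the file's modification time
    return dest


def run(interval_s: int = 1800) -> None:
    """Snapshot the syslog every interval_s seconds until interrupted."""
    while True:
        snapshot_log(SYSLOG, FLASH_DIR, datetime.now())
        time.sleep(interval_s)
```

Calling `run()` from a background session writes one snapshot per interval. Each copy is a write to the USB stick, so a coarse interval such as the default 30 minutes trades some log freshness for less flash wear.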
July 26, 2017 · Community Expert

One thing to try is to install the 'Fix Common Problems' plugin and turn on troubleshooting mode. (Turn on "Help" to see what it will be doing.)

Second, connect a monitor to the console and see what messages are displayed after the lockup. This does not always work, but it is worth a try. If you do get something, take a sharp, in-focus picture, preferably without flash glare in it, to post. The gurus will need to be able to read the text on the screen.

Third, if the server build has any new components in it, run Memtest (a boot option on the menu when unRAID boots) for twenty-four hours minimum. No errors are allowed!

Fourth, post the specs of your PSU. Replacing flaky or undersized PSUs has solved a number of these types of problems.

Fifth, check what your CPU temperatures are after two to three hours of heavy usage, and see how high the disk drive temperatures are going. You may have to increase the airflow through your server to get things under control.

Edited July 26, 2017 by Frank1940
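For the fifth suggestion, CPU temperatures can also be read directly from the Linux hwmon interface in sysfs, where each `tempN_input` file holds a reading in millidegrees Celsius and an optional `tempN_label` names it. A minimal sketch, assuming a hwmon driver (for example coretemp) is loaded for this CPU; the chip and label names you actually see depend on which drivers are present.

```python
from pathlib import Path

HWMON_ROOT = Path("/sys/class/hwmon")


def read_temps(root: Path = HWMON_ROOT) -> dict[str, float]:
    """Return {'<chip>/<label>': degrees_C} for every tempN_input under root."""
    temps: dict[str, float] = {}
    for chip in sorted(root.glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for inp in sorted(chip.glob("temp*_input")):
            # tempN_label is optional; fall back to the input file's own name.
            label_file = chip / inp.name.replace("_input", "_label")
            label = label_file.read_text().strip() if label_file.exists() else inp.name
            try:
                millideg = int(inp.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors return read errors; skip them
            temps[f"{chip_name}/{label}"] = millideg / 1000.0
    return temps
```

Calling `read_temps()` every few minutes during a two-to-three-hour download run and keeping the highest value per sensor would give the numbers asked for above.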
July 26, 2017 · Author

tdallen said: "I can't offer much help with the VM issue, but have you considered running NZBget, SickRage and Plex as Dockers?"

Yeah, I have them set up, but my problem is that I use the nzbToMedia post-processing scripts, which require extra binaries not found in most NZBget containers. I've found two containers that might work and managed to get one mostly working; I haven't tried the other yet.
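Whether a candidate NZBget container has the extra binaries the post-processing scripts call can be checked with Python's `shutil.which`, which looks a command up on PATH. A minimal sketch; the `REQUIRED` list is a hypothetical example, since the thread doesn't say which binaries nzbToMedia needs, so take the real list from nzbToMedia's own documentation.

```python
import shutil
from typing import Callable, Iterable, Optional

# Hypothetical example list; replace with what your nzbToMedia setup actually calls.
REQUIRED = ["python3", "unrar", "7z", "ffmpeg"]


def missing_binaries(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]
```

Running `print(missing_binaries(REQUIRED))` inside each container (for example from a shell opened with `docker exec`) shows what would still need to be added.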
July 26, 2017 · Author

Replying to Frank1940's five suggestions above:

Forgive the double post; I'm on my phone.

2: Not necessary, since I have IPMI, and as stated above nothing is output to the console during lockups.

3: The components are at least six months old, but I was going to run a memtest just for the hell of it.

4: It's a 1200 W PSU. I forget the brand, but it was highly rated on Newegg.

5: The drives are always under 100 °F. I'm pretty sure my issue is, and has been, my CPU, since it's soldered to the board (FCBGA) and the board is meant to be in a rack-mount case. There aren't any good aftermarket fans for it and I can't replace the heatsink, so I have a 120 mm fan blowing on it.
It idles around 120 °F and usually sits around 140-150 °F under moderate load. Extremely heavy load (running 64 make threads while compiling the kernel, or transcoding a bunch of 1080p or 4K streams in Plex) makes it want to roast itself and shoots it up to 190+ °F, but it only stays that high for the few minutes before I notice and kill whatever is eating all the cycles. There's an audible alarm, so I know when it gets that high. Most of the time while downloading and post-processing, it's around 150 °F.
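Most sensor tools and Intel's CPU specifications quote temperatures in Celsius, so converting the Fahrenheit readings above with C = (F − 32) × 5/9 makes them easier to compare. A small sketch using the figures from the post:

```python
def f_to_c(f: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Readings quoted in the post above, in degrees Fahrenheit.
READINGS_F = {
    "drives (max)": 100,
    "CPU idle": 120,
    "CPU moderate load": 150,
    "CPU alarm": 190,
}

for name, f in READINGS_F.items():
    print(f"{name}: {f} °F = {f_to_c(f):.1f} °C")
```

That works out to roughly 49 °C at idle, 66 °C under moderate load and 88 °C at the alarm point, with the drives under about 38 °C.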
July 27, 2017 · Author

OK, so the whole system didn't crash (yet), but the VM did crash about 4 hours after I last posted. It initially crashed around 3 am this morning; I restarted it at 6, enabled troubleshooting mode around 8:30, and went to sleep. It looks to have crashed right before 12:30. I've attached the relevant log files:

FCPsyslog_tail.rar
tower-diagnostics-20170726-1141.zip
tower-diagnostics-20170726-1211.zip
tower-diagnostics-20170726-1241.zip
tower-diagnostics-20170726-1312.zip
tower-diagnostics-20170726-1342.zip

Edit: About an hour later I set up a Plex Docker container and scanned in my library, and while that was running I created a Sonarr container. Doing both seems to have frozen the web UI, which has been unresponsive for 15+ minutes, though the terminal works fine.

tower-diagnostics-20170726-2146.zip
tower-diagnostics-20170726-2216.zip

Edited July 27, 2017 by brando56894
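To see what happened just before the 12:30 crash, the syslog tail saved by troubleshooting mode can be searched for kernel messages that commonly accompany hangs (watchdog lockup reports, hung-task warnings, machine-check errors, OOM kills, oopses). A rough sketch; the pattern list is a heuristic assumption, not an exhaustive set, and an empty result does not rule out a hardware fault that left nothing in the log.

```python
import re
from typing import Iterable

# Heuristic patterns (an assumption, not an authoritative list) for kernel
# messages that often precede or accompany a lockup.
SUSPECT = re.compile(
    r"soft lockup|hard LOCKUP|blocked for more than \d+ seconds|"
    r"machine check|mce:|out of memory|call trace|general protection fault",
    re.IGNORECASE,
)


def suspicious_lines(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (1-based line number, text) for every line matching SUSPECT."""
    return [
        (n, line.rstrip("\n"))
        for n, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]
```

After extracting FCPsyslog_tail.rar, `with open(path) as fh: for n, line in suspicious_lines(fh): print(n, line)` lists the matches, where `path` points at the extracted file (its name inside the archive isn't given in the thread).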
July 27, 2017 · Author

I've been running memtest86 for 9.5 hours with no errors, so it doesn't look like a RAM issue.
July 27, 2017 · Community Expert

Twenty-four hours is usually recommended. You have to make sure that you don't have any issues with memory.
July 27, 2017 · Author

I decided to cut to the chase and, against all good decision making, drop $1300 on some new hardware and save myself a few weeks of headaches, haha. The following should be here in a few days:

Asus X99-W/IPMI
Corsair Hydro Series H115i Extreme Performance Liquid CPU Cooler (280 mm radiator)
Intel Xeon E5-1650 v3 3.5 GHz 6 x 256 KB

Edited July 27, 2017 by brando56894