May 2, 2017 Hey all, got a really interesting case here. For the past few weeks my server has been going completely offline without any warning. Sometimes it happens when I'm browsing the SMB shares, sometimes when I'm playing back a movie via Plex, sometimes when I'm invoking the File Integrity script. So I'm thinking it has something to do with high CPU load?

When it goes offline:
- The server does not respond to WebUI calls (using either the NetBIOS name or the IP address)
- It does not respond to SSH logins (either NetBIOS or IP)
- Pressing the power button does nothing (expectation: beep twice and then start powerdown)
- The monitor goes blank (plugging in a monitor via HDMI gives no output)

The only solution is to power down completely (by holding down the power button for ten seconds) and then start it again.

I'm tracking down the problem in this order:
1. Bad CPU. The errors occurred mostly when there was a high amount of CPU activity.
2. Bad flash. The drive might be dying.
3. Bad RAM. This is really, really unlikely because I've tested the sticks multiple times and they came from the same batch, same manufacturer, same product shelf.

Currently I'm running Memtest86+ on the server (with the flash drive pulled out; I'm going to mirror it to a new device in case the old one dies). I've also ordered an A8-3870K from eBay for about $30 (my socket is FM1, and it's better than the old A4-3400 I've been using anyway), and I'm going to see if that makes things better, but I don't want to start swapping out components at this stage because it may make the problem worse.

What do you guys recommend I do now? I'll leave Memtest86+ running until you give me some options. Thanks in advance. Edited May 5, 2017 by Guest
May 2, 2017 Do you have another PSU to run with for a while? A dodgy PSU can cause all sorts of havoc.
May 2, 2017 19 minutes ago, HellDiverUK said: Do you have another PSU to run with for a while? A dodgy PSU can cause all sorts of havoc.

I can transplant the PSU from my workstation to the server (it's an Antec PSU that was recommended to me because of its stability, so no worries there), but that means I'll be left without a working computer. I'm not sure the PSU is at fault here, but it certainly can't be ruled out. I'm only willing to test that as a last resort if all else fails, because I still have to work.

BTW, the PSU in the server is a Zalman 500W PSU (or was it 400W?). It should easily cover what the server draws, but I'm thinking the recently added graphics card might be the problem too. It's a very low-power card though (the R7 240), so I'm not too sure it's a power problem. Also, I thought that if the PSU were the point of failure, the server would shut down completely? The fans still run and the lights are on when it freezes, if that helps.

I'll try to get my hands on a replacement PSU if I can, but good deals are hard to come by right now. In the meantime, what should I do?
May 2, 2017 A PSU doesn't have to totally switch off. Its output can fluctuate enough to freak out a motherboard or CPU while other components continue to run happily. Can you not pick up something like a Corsair CX430? They're dirt cheap and pretty decent for the price. Like $20 after rebates.
May 2, 2017 5 minutes ago, HellDiverUK said: A PSU doesn't have to totally switch off. Its output can fluctuate enough to freak out a motherboard or CPU while other components continue to run happily. Can you not pick up something like a Corsair CX430? They're dirt cheap and pretty decent for the price. Like $20 after rebates.

Like I said, I'm still looking for an alternative PSU on the market. I don't want to buy something cheap and used just to blow up my server, right? I'm looking at good products that won't crap out on me, but that takes time and I can't buy them right now (it's a national holiday, so all the markets are closed).

At the moment, the best I can do is check the memory, swap the flash and hope it survives a few more days. Once the holidays are over I'll go and buy something like the Micronics 500W PSU that's rated well by the local tech community.

Any more ideas besides the PSU? I know you're probably screaming "Well duh, the PSU is probably the problem!" and cursing how stupid I am, but seriously, I know, I'm not forgetting the PSU. It's just not something I can test right now because I can't get a replacement at the moment. Hope you understand.

EDIT: Here's something I have in mind once the holiday is over: http://prod.danawa.com/info/?pcode=1928673&cate=112777#bookmark_product_information Edited May 2, 2017 by Guest
May 2, 2017 Have you tried running the Fix Common Problems plugin in troubleshooting mode to capture the logs? I agree that this feels more like a hardware problem but it can't hurt to have a look at the logs... Have you made any recent changes?
May 2, 2017 1 hour ago, tdallen said: Have you tried running the Fix Common Problems plugin in troubleshooting mode to capture the logs? I agree that this feels more like a hardware problem but it can't hurt to have a look at the logs... Have you made any recent changes?

I will certainly try that once I get at least 3 passes on my RAM sticks. I'm not sure the logs will pick up the problem, because it seems more like a hardware failure than software messing up, but yeah, will try that.

Also, no recent changes except for the additional graphics card, an R7 240, which is the only graphics card in the system besides the integrated Radeon inside the CPU. That card may or may not be the problem: I had similar issues before installing it, so I'm not really sure it can be blamed here. It might be drawing too much power at high load, although Linux technically doesn't use the GPU for workloads, so maybe the PSU just isn't strong enough to feed both the CPU and the GPU at once?

Will post logs back once I get that puppy running again. Right now it's still testing RAM sticks.
May 2, 2017 A complete lockup makes me wonder about an interrupt conflict, so don't rule the graphics card out yet.
May 2, 2017 4 hours ago, ideaman924 said: I can transplant my PSU from my workstation to the server (it's an Antec PSU that was recommended to me because of its stability, so no worries there), but that means I'll be left without a working computer.

+1 for PSU. Why not do a direct swap? You'll still have a working computer, but if it does start crashing, you'll have proven for sure that it was the PSU causing issues.
May 3, 2017 8 hours ago, tdallen said: A complete lockup makes me wonder about an interrupt conflict, so don't rule the graphics card out yet.

Yeah, I'll be pulling that graphics card out if things start locking up again, but at this stage I'm hesitant to pull it because it has an HDMI port, and that's the only port compatible with my monitor. (My monitor has no VGA port.)

6 hours ago, Squid said: Why are you running an add-on Video Card? Are you utilizing a VM on the A4-3400?

Same reason as above. If I don't have that card in there then I have no way to connect a monitor.

5 hours ago, DoeBoye said: +1 for PSU. Why not do a direct swap? you'll still have a working computer, but if it does start crashing, you'll have proven for sure that it was the psu causing issues.

That's a good idea, and I'll try that and see if it helps. Thanks!

UPDATE: I've completely mirrored the USB drive to my laptop in case it fails, and I ordered two 8GB USB sticks from eBay. They're going to take a while to get here, so I'll start troubleshooting. Memtest86+ passed 4 complete cycles, so a RAM issue looks very unlikely.

UPDATE2: I've put the USB back and started the server up without any changes to locate the issue. The idea here is to get it to crash and then remove the components one by one: the graphics card, the USB, and finally the PSU. (I'm leaving that for last because replacing the PSU is a difficult and time-consuming process and I'm super lazy.) I've started troubleshooting mode, but where should I look if it crashes? BTW, I'm going to run some File Integrity scripts to simulate high CPU load on the server. Edited May 3, 2017 by Guest
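For anyone trying to reproduce a load-related freeze like the one described above, here is a minimal Python sketch of a CPU load generator. It is not the File Integrity plugin and does not imitate its disk I/O; the function names `burn` and `run_load` are made up for illustration, and it assumes Python 3 is available on the machine under test.

```python
import time
from multiprocessing import Pool, cpu_count


def burn(seconds):
    """Spin on integer arithmetic for `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    count = 0
    value = 1
    while time.monotonic() < deadline:
        value = (value * 31 + 7) % 1000003  # cheap busy work, result unused
        count += 1
    return count


def run_load(seconds, workers=None):
    """Run one burn() per worker process so that every core is kept busy.

    `workers` defaults to the number of CPUs reported by the OS.
    """
    workers = workers or cpu_count()
    with Pool(workers) as pool:
        return pool.map(burn, [seconds] * workers)
```

Calling something like `run_load(600)` from under an `if __name__ == "__main__":` guard would keep all cores busy for ten minutes. If the box locks up under this but not at idle, that points toward heat or power rather than any particular plugin.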
May 3, 2017 I've found some weird error messages while browsing the syslog, anybody make heads or tails out of it? Just found something more... hopefully relevant: Edited May 3, 2017 by Guest
May 3, 2017 OK, so my server just blinked out again. My family was out in the park and we were running Plex on three devices simultaneously. As soon as the last device logged off (we shut down the stream because we got tired of watching it), one of our family members noticed that Plex was showing "Direct connection to the server was unavailable." I tried logging in via SSH and the WebUI: same lockup as before, no progress. I just pulled the power and grabbed the flash, so feel free to look at the diagnostics I've posted.

Strangely, the server stopped responding at around 2PM (that's the last entry in the log), but I came back home at around 5PM. If the server software had still been alive, it should have kept logging with timestamps up to 5PM, right? I'm thinking this is definitely a hardware issue now?

FCPsyslog_tail.txt derrickserver-diagnostics-20170503-1219.zip derrickserver-diagnostics-20170503-1249.zip derrickserver-diagnostics-20170503-1319.zip derrickserver-diagnostics-20170503-1349.zip derrickserver-diagnostics-20170503-0948.zip derrickserver-diagnostics-20170503-1019.zip derrickserver-diagnostics-20170503-1049.zip derrickserver-diagnostics-20170503-1119.zip derrickserver-diagnostics-20170503-1149.zip

I will start by pulling out the graphics card and see if that helps any. Maybe the PSU is struggling with the power load. Edited May 3, 2017 by Guest
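The reasoning above (last log entry around 2PM, outage noticed around 5PM) can be checked mechanically. The following is a rough Python sketch that finds the last timestamp in a saved syslog tail and measures the silent gap. It assumes the classic syslog prefix of the form `May  3 13:49:01`, which carries no year, so the year has to be supplied by the caller. The helper names are hypothetical and not part of any unRAID tool.

```python
from datetime import datetime


def last_log_time(lines, year):
    """Return the timestamp of the last parseable syslog line, or None."""
    last = None
    for line in lines:
        stamp = line[:15]  # e.g. "May  3 13:49:01"
        try:
            last = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line; ignore it
    return last


def silent_gap(lines, year, noticed_at):
    """Time between the last log entry and the moment the outage was noticed."""
    last = last_log_time(lines, year)
    return None if last is None else noticed_at - last
```

With the lines of a file such as `FCPsyslog_tail.txt` read into a list, `silent_gap(lines, 2017, datetime(2017, 5, 3, 17, 0))` would return the length of the silence. A gap of hours with no shutdown messages at the end of the log is consistent with a hard freeze rather than a software crash that still had time to write something.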
May 3, 2017 Plex does a regular maintenance routine by default between 2am and 5am. Too much of a coincidence with your observed outage window. As a test you can disable the Plex maintenance, but it might indicate some deeper issues in Plex itself, e.g. database corruption.
May 3, 2017 36 minutes ago, bonienl said: Plex does a regular maintenance routine by default between 2am and 5am. Too much of a coincidence with your observed outage window. As a test you can disable the Plex maintenance, but it might indicate some deeper issues in Plex itself, e.g. database corruption.

I'm sorry, I said 2PM. Also, if Plex had an issue then in the worst-case scenario it would probably crash the Docker service, but it wouldn't grind my whole system to a complete halt just because of database corruption, right? I mean, Plex malfunctioning shouldn't cause the console to just blink out.

I ended up solving the mystery when I opened up the server on my workbench. The room I keep the server in is dark, and when I open the case it's usually for last-minute checkups and additions, like making sure the SATA cables are properly seated or reseating the RAM. So when I opened the server up on my workbench with better lighting, I was met with this situation:

I know, I know, I was completely stupid and the CPU was overheating. This is what it looked like AFTER cleanup:

But it's still weird that unRAID does not warn the user if the CPU is overheating... can this be addressed in a future release?
May 3, 2017 Somehow I read your PM as AM, sorry. Looks like you didn't touch that system for a while. unRAID itself doesn't monitor CPU temperatures; this is only available as an optional plugin. Adding CPU temperature support is a tricky thing because there is a lot of hardware dependency. You'll see hit-and-miss stories from people who have installed the Dynamix System Temp plugin. I guess LT is not much in favor of carrying all the extra burden to do this. PS. Most motherboards have the option to shut down the processor in case of overheating and prevent real damage. Edited May 3, 2017 by bonienl
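To illustrate the hardware dependency mentioned above, here is a minimal Python sketch that reads whatever temperature sensors the Linux kernel exposes under `/sys/class/hwmon`, where `temp*_input` files hold millidegrees Celsius. This is not how the Dynamix System Temp plugin works; it only shows the raw interface. Whether any sensors appear at all depends on the motherboard and on which sensor drivers are loaded, and the 80 °C limit is an arbitrary assumption, not a recommendation for any particular CPU.

```python
from pathlib import Path


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN/tempM_input": degrees C} for every readable sensor."""
    temps = {}
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or reporting junk; skip it
        temps[f"{path.parent.name}/{path.name}"] = millideg / 1000.0
    return temps


def too_hot(temps, limit_c=80.0):
    """Return only the sensors at or above `limit_c` (assumed threshold)."""
    return {name: t for name, t in temps.items() if t >= limit_c}
```

On a board with no supported sensor chip, `read_temps()` simply returns an empty dict, which is the "miss" half of the hit-and-miss experience. Readings that do appear may still be mislabeled or offset, so they are worth sanity-checking against the BIOS hardware monitor.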
May 5, 2017 Well, I think I've solved the problem. The server hasn't frozen up since. In the future I'll look at more adequate cooling solutions. Thanks for all the help! And @bonienl, thanks for the Dynamix System Temp plugin. Maybe @Squid can add it to his Fix Common Problems plugin?
May 5, 2017 2 hours ago, ideaman924 said: Maybe @Squid can add it to his Fix Common Problems plugin?

Depending upon the CPU installed, a Machine Check Event may be generated in the case of an overheat situation (assuming the system throttles down and stays running), and FCP will catch that. But, personally, while I do consider having the system temp plugin installed to be mandatory, I don't think that not having it installed is either a warning or an error situation, mainly because, as @bonienl stated, it is a real hit-and-miss thing, and even when it does retrieve values, there are many instances where the values are incorrect (implementation issues on the motherboard etc.). Edited May 5, 2017 by Squid
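As a rough illustration of looking for Machine Check Events by hand, the Python sketch below scans syslog lines for a few phrases the Linux kernel commonly uses for hardware errors. This is not how Fix Common Problems detects them, and the phrase list is an assumption: the exact wording differs between kernel versions and CPU vendors, so an empty result does not prove that nothing happened.

```python
import re

# Assumed phrases; the kernel's wording varies by version and CPU vendor.
PATTERN = re.compile(
    r"machine check|hardware error|temperature above threshold",
    re.IGNORECASE,
)


def find_hardware_hints(lines):
    """Return the syslog lines that mention a possible hardware error."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

Passing the lines of a saved syslog to `find_hardware_hints` returns only the matching entries. In a hard lockup like the one in this thread the kernel may never get the chance to log anything, which is why the direct temperature readings remain the more useful check.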