Random Webui Lockup and can't "powerdown"



I built a mass storage server for a customer of ours for file shares and backup images. Recently we have been having issues where unRAID will just lock up. Sometimes it takes a week or two; other times it happens daily. We tried a bunch of things, but it would even lock up in "safe mode". We checked the RAM, replaced the unRAID USB drive, and even replaced the onboard 10G NIC with a 1G NIC. When I read up on how to diagnose the system, everyone suggested looking at the system log. Unfortunately, when the system locks up, you cannot even PuTTY into it. I recently remembered that I had set up IPMI on it. I was able to remote into a secondary computer, control the system through IPMI from there, and finally get the system log, but it leaves me more confused. Here are today's events. The IT "guy" at the company said he noticed the lockup around 2pm, when he emailed me, but it was working fine when he came in this morning. Here is the system log since midnight:

 

Jul 10 00:20:07 AscotBackup sSMTP[4676]: Creating SSL connection to host
Jul 10 00:20:07 AscotBackup sSMTP[4676]: SSL connection using (removed)
Jul 10 00:20:10 AscotBackup sSMTP[4676]: Sent mail for unRAID@(removed)
Jul 10 01:01:50 AscotBackup apcupsd[30323]: Power failure.
Jul 10 01:01:56 AscotBackup apcupsd[30323]: Power is back. UPS running on mains.
Jul 10 14:39:36 AscotBackup login[8711]: ROOT LOGIN  on '/dev/tty1'
 

I am wondering if the lockup is due to the UPS daemon. I have since shut off the daemon and had the IT guy unplug the USB cable. We are running unRAID 6.3.5, and I would say we have had the issue since around 6.2. The system was last restarted on the 6th (4 days ago) because of the same thing. I did notice that the logs show a power failure on the 8th, but it did not lock up then. Any other thoughts on troubleshooting this issue?
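
One thing that stands out in the excerpt above is the long silence between the last apcupsd entry and the root login. A minimal sketch for finding the longest quiet stretch in a saved syslog, which may help narrow down when a lockup began, could look like the following. The function name `largest_gap` and its `year` argument are made up for illustration; syslog lines in this style carry no year, so one has to be supplied, and the 2017 in the example call is only an assumption. A quiet log does not prove the system was hung, so treat the result as a hint.

```python
from datetime import datetime


def largest_gap(lines, year):
    """Return (gap, earlier, later) for the longest silence between entries.

    `year` is supplied by the caller because these syslog lines omit it.
    Returns None if fewer than two timestamped lines are found.
    """
    stamps = []
    for line in lines:
        try:
            # The first 15 characters hold the timestamp, e.g. "Jul 10 01:01:56".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip blank lines or lines without a leading timestamp
        stamps.append(stamp)

    best = None
    for earlier, later in zip(stamps, stamps[1:]):
        gap = later - earlier
        if best is None or gap > best[0]:
            best = (gap, earlier, later)
    return best


# Example using lines from the log posted above (the year is an assumption).
sample = [
    "Jul 10 00:20:10 AscotBackup sSMTP[4676]: Sent mail for unRAID@(removed)",
    "Jul 10 01:01:50 AscotBackup apcupsd[30323]: Power failure.",
    "Jul 10 01:01:56 AscotBackup apcupsd[30323]: Power is back. UPS running on mains.",
    "Jul 10 14:39:36 AscotBackup login[8711]: ROOT LOGIN  on '/dev/tty1'",
]
gap, earlier, later = largest_gap(sample, 2017)
print(gap)  # prints 13:37:40
```

Here the longest silence runs from 01:01:56 to 14:39:36, so the log alone cannot say whether the box hung right after the power event or hours later.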

Edited by DDock
Link to comment

Not sure if this is related, or even an issue. I did another quick search but found no results. I am seeing this in the system log:

 

Jul  6 14:45:20 AscotBackup kernel: perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jul  6 15:02:46 AscotBackup kernel: perf: interrupt took too long (3157 > 3130), lowering kernel.perf_event_max_sample_rate to 63000

Jul  9 04:09:45 AscotBackup kernel: perf: interrupt took too long (3951 > 3946), lowering kernel.perf_event_max_sample_rate to 50000

Link to comment
  • 2 weeks later...

No idea whether IPMI can show you this or not, but you really need to see if anything is on the local monitor when this happens, which may give clues.  Barring that, it could be anything: overheating, bad RAM, bad CPU, bad mobo, bad power supply, etc.

 

Also, if the UPS didn't properly supply the correct voltage all the time (i.e. it dipped too far before it switched to battery), then the drop could also have caused the problem.

Link to comment

Yeah, the IPMI interface has a live view that is just like having a local monitor attached. No errors are showing up on the CLI. This time there was no power outage, and we run two power supplies on the system. I am not sure whether ECC memory would show errors in a test, but we ran a memory test for a while with no errors. Temperature-wise, everything is good. As for the CPUs, we run a dual-CPU motherboard. The chassis, PSUs (dual 800W), and motherboard are Supermicro, with two Intel Xeon E5-2620s and Crucial ECC memory. Any thoughts on how to start testing the system for a bad motherboard or CPU? Normally we have spares of our consumer gear for testing, but nothing server grade.

Link to comment

I guess a bigger question is there a way to make a cronjob or something similar to keep an up to date log written somewhere on the array? Before we have to start pulling hardware, I would like to make sure it isn't something in software causing issues. I run two other unRAID servers and neither of them has issues, so I doubt it is software.
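
As a rough idea of what such a job could look like, here is a minimal sketch that appends whatever is new in the live syslog to a file on persistent storage and forces it to disk each time. Everything named here is an assumption for illustration: the helper names `append_new_log_data` and `follow`, the source path `/var/log/syslog`, the destination under `/mnt/user/`, and the availability of Python on the server, which a stock unRAID install may not have.

```python
import os
import time


def append_new_log_data(src_path, dst_path, offset):
    """Append bytes of src_path past `offset` to dst_path; return the new offset."""
    if os.path.getsize(src_path) < offset:
        # The log shrank (rotated or truncated), so start again from the top.
        offset = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        data = src.read()
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to disk before a possible lockup
    return offset + len(data)


def follow(src_path, dst_path, interval_seconds=10):
    """Copy new log data every `interval_seconds` until interrupted."""
    offset = 0
    while True:
        offset = append_new_log_data(src_path, dst_path, offset)
        time.sleep(interval_seconds)


# Hypothetical invocation; both paths are assumptions, adjust to the system:
# follow("/var/log/syslog", "/mnt/user/logs/syslog-copy.txt")
```

Tracking a byte offset rather than recopying the whole file keeps each pass cheap, and the `fsync` call is there so the last lines before a hang have a better chance of surviving the reboot.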

Link to comment
24 minutes ago, DDock said:

I guess a bigger question is there a way to make a cronjob or something similar to keep an up to date log written somewhere on the array?

The Fix Common Problems plugin in troubleshooting mode will keep a live tail of the syslog stored on the flash drive up to the crash.  It also logs some other debugging info.  Empirical evidence shows that more often than not with this type of crash the syslog shows nothing, but it's better than nothing and might offer a clue.

 

A biggie, though, is to stop any VMs from running and see if that makes a difference (especially if they utilize passthrough).  If it does, then you're going to need to look at:

  • BIOS updates
  • Incompatible / flakey hardware being passed through
  • etc

I.e., just because IOMMU is present and an option in the BIOS doesn't mean that it works correctly, as it is not something the manufacturers concentrate on at this point in time.

Link to comment

We have Fix Common Problems installed, so I will try that in the morning. We have a VM that runs some backup software (UrBackup, no IOMMU) and a Docker container (DokuWiki). Maybe I can plan some time to go out there and update the BIOS to see if that helps. Thanks for the help; I will keep you updated if I find more.

Link to comment
