[Solved] Unraid 6.01 - webgui crashes - Unraid 6.1 Also



I upgraded to unraid 6.01 about 10 days ago.  To date I have had two webgui crashes and would like some assistance in troubleshooting. 

 

System Specs:  Supermicro C2SEE, Celeron E1200 CPU, 2 GB RAM, parity drive, 8 data disks, one cache drive

 

I have two Dockers running...Logitech Media Server and BTSync.  I ran both of these as plugins in unraid 5.  From my reading about unraid 6 I realize my RAM may be low, so between the first and second crashes discussed below I implemented a swapfile and set it up for an additional 2 GB.

 

I have some plugins running, including cache_dirs, preclear, the new snap-type plugin for mounting drives, and community docker/apps.

 

First webgui crash...not responsive in browser, could not telnet in to console...major mistake-forgot to grab syslog before reboot.

 

Second webgui crash...not responsive in browser...but I could telnet in.  I did grab a syslog...see attached.  I was unable to unmount /dev/md2, /dev/md5 and /dev/md8.  I tried to see what was running on these disks, but fuser would not return any results.

 

I ran ps -A and noticed that there were a lot of smbd entries listed...I mean a lot, 50 to 60.  This seemed unusual.
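Rather than eyeballing the list, the smbd entries can be counted directly. The following is only a sketch: count_smbd is a hypothetical helper name (not part of unraid), and it assumes the default ps -A layout where the command name is the last column.

```shell
# count_smbd: hypothetical helper. Reads `ps -A` output on stdin and prints
# how many lines have "smbd" as the command name (assumed to be the last column).
count_smbd() {
    awk '$NF == "smbd" { n++ } END { print n + 0 }'
}

# Usage on the server:
#   ps -A | count_smbd
```

Running it once while the system is healthy and again when the webgui hangs gives two numbers that can be compared directly.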

 

Ultimately, I did issue the poweroff command...got the "system is going down" message but the Tower never powered off.  After an hour I powered down manually with the switch.  I have rebooted and a parity check is in progress.  Since this occurred I have installed the powerdown script.  I also stopped both Dockers (is stopping them enough...or do I need to remove them?).

 

If my hardware can't support what I am doing...please let me know as having a stable NAS is most important.  In terms of hardware I do have a spare Q9300 CPU that will fit my motherboard.  Unfortunately my motherboard is limited to 4GB ram....would updating to 4 GB be beneficial? 

 

Any help is greatly appreciated!

 

Dan

 

Modified the subject title now that unraid is updated to 6.1 stable.

syslog.txt

Link to comment

For what it's worth, I'm having the same issue right now and I had just upgraded from 4GB to 8GB of RAM yesterday.

 

# tail /var/log/syslog
Jul 30 07:53:44 tower emhttp: /usr/bin/tail -n 42 -f /var/log/syslog 2>&1
Jul 30 08:39:25 tower kernel: mdcmd (66): spindown 8
Jul 30 08:45:14 tower kernel: mdcmd (67): spindown 2
Jul 30 08:45:49 tower kernel: mdcmd (68): spindown 6
Jul 30 08:59:28 tower kernel: mdcmd (69): spindown 9
Jul 30 10:11:06 tower kernel: mdcmd (70): spindown 0
Jul 30 10:11:06 tower kernel: mdcmd (71): spindown 3
Jul 30 13:38:16 tower sshd[14987]: Accepted password for root from 192.168.1.219 port 49594 ssh2
Jul 30 13:50:12 tower sshd[19546]: Accepted password for root from 192.168.1.219 port 49677 ssh2
Jul 30 13:51:06 tower sshd[19885]: Accepted password for root from 192.168.1.219 port 49679 ssh2

# free -m
             total       used       free     shared    buffers     cached
Mem:          7736       6556       1180          0        443       5317
-/+ buffers/cache:        795       6941
Swap:            0          0          0

# ps aux | grep emhttp
root      8678  3.0  0.0  89500  3764 ?        Rl   Jul29  40:24 /usr/local/sbin/emhttp
root     31107  0.0  0.0   5104  1728 pts/0    S+   14:18   0:00 grep emhttp

 

In the past when I was running v4.5.6 and I ran out of memory, I remember it showed a line in syslog about killing emhttp (or something to that effect).  Now, it doesn't seem to show any type of strange activity in the syslog and emhttp is still alive though inaccessible.  It was working earlier today though.  I'm looking forward to figuring this problem out.

Link to comment

So I am still having an issue. 

 

Since my last post, I have ordered more RAM.  It will arrive today or tomorrow.  After the crash documented in my posts, I increased the swap file to 4096 MB, installed the powerdown script, and stopped my Logitech Media Server and BTSync Dockers, as I am just looking to get a stable NAS at this point.

 

After 3 days and 21 hours of uptime (as reported by the webgui, which is unresponsive, so this time may be off) the webgui crashed; however, I am still able to telnet in to the console.  Please note that after capturing the attached outputs, I initiated the powerdown script and it is not shutting my server down...it says powerdown v2.17 initiated and I get the "system going down for a halt" message in the telnet window...but the server has not gone down and it's been 15 minutes as of this writing.

 

Any suggestions for how to get the system to shut down safely?  Last time I had to power down manually...but I know that is not good for the data.  What would keep the powerdown script from completing?

 

Attached are two files...one is my syslog and the other is my telnet window with the results of issuing ps -A, free -m, ps -ef, and lsof -Pni, as referenced in the thread trurl mentioned.  Please note all the smbd processes running and that the memory is almost completely used.  What is making all the smbd processes run?

 

Also, here is a link to a screenshot of htop; CPU is at 100% with one particular process:  https://www.dropbox.com/s/wuyb33vm4zgr1kt/unraidhtop.jpg?dl=0

 

 

Any help would be greatly appreciated!

 

Dan

unraid_crash_ps-A.txt

syslog_08_3_2015.txt

Link to comment

Additional Info:

 

I tried to follow limetech's instructions in this post to shutdown unraid from the console:  http://lime-technology.com/forum/index.php?topic=41938.msg398562#msg398562

 

When I run: umount /mnt/disk*  I am told that a couple of disks are in use or busy, and to run lsof or fuser to see the list of files in use.

 

fuser locks up the telnet session.  However I captured the attached output from lsof....in its unzipped form it is a 12 MB file!  The smbd process has a lot of open files.
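A 12 MB lsof capture is hard to read by eye, so it may help to boil it down to a count of open files per process. This is a sketch under stated assumptions: top_openers is a hypothetical helper name, lsof.txt stands in for wherever the capture was saved, and the input is assumed to use the default lsof layout (COMMAND in column 1, PID in column 2, one header line).

```shell
# top_openers: hypothetical helper. Reads saved `lsof` output on stdin and
# prints "<open file count> <command> <pid>", busiest process first.
# Assumes the default lsof layout: COMMAND in column 1, PID in column 2.
top_openers() {
    awk 'NR > 1 { c[$1 " " $2]++ } END { for (k in c) print c[k], k }' |
        sort -k1,1nr -k3,3n |
        head -n "${1:-10}"
}

# Usage, with lsof.txt being the captured output (file name is an assumption):
#   top_openers 10 < lsof.txt
```

If the top entries are all smbd, filtering the capture for one of those PIDs together with /mnt/disk should show which files on the busy disks are being held.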

 

Here is a link to the zipped putty session log that contains the output of the lsof. 

 

https://www.dropbox.com/s/2xdofs47vuhlbj3/Unraid_putty.zip?dl=0

 

I am hoping that by posting this, someone will have an idea of what to try next.

 

Thanks,

Dan

Link to comment

I was able to restart by killing the smbd process, but it looks like you would have many different pids to kill if you were to do that. In the end my restart was not graceful anyway and a parity check was initiated (although it completed with no errors thankfully).

 

There's a bit more info here: http://lime-technology.com/wiki/index.php/Console#To_cleanly_Stop_the_array_from_the_command_line

 

That wiki may be old though, so I'd go with the procedure you linked to. You might be able to use some of the commands shown on the wiki to help get the shares unmounted.

 

Unfortunately, I just don't really know enough to be of much help.

Link to comment

Shooga,

 

Thanks for the link...I worked through that posting this morning.  When I type the fuser commands the telnet session locks up and I can't even kill the command with control-c.  I tried killing a few smbd processes...but they don't seem to kill.

 

It seems like there are a couple of other users who are having a similar situation.  Hopefully someone will have some additional suggestions.  Thanks again!

 

Dan

 

Link to comment

Here is where I am at:  Powercycled Tower...parity check in progress.  Deleted my two dockers (which were previously stopped anyway) and I uninstalled all plugins with the exception of powerdown.  So I am completely stock with the exception of powerdown. 

 

I should be getting additional ram, probably tomorrow.  It seems like there are several posts in the support forum about freezing/crashing webgui.

 

Dan

Link to comment

Update...WebGUI keeps crashing.  It seems to take 4 days or so for it to crash.  Additional RAM came, however it was not compatible with my motherboard so I am RMAing it.  I have reinstalled my original 2 GB of RAM in my Supermicro C2SEE motherboard.

 

I am completely stock with the exception of the powerdown plugin.

 

This is really frustrating.  It seems like there are several posts with similar crashes.  I did grab diagnostics from the console...checked syslog and I don't see anything.

 

Dan

 

 

Link to comment

DGASCHK,

 

I have not left the log window open.  I will give that a try for the next crash.  During the last crash (8/10/2015), I telneted in to the console and ran diagnostics from the command line.  Attached is that diagnostic file. 

 

I know I have 3 drives with SMART errors.  I have all critical data from these three drives backed up on an external drive...the SMART errors/properties don't seem to be changing.

 

Thanks for taking a look!

 

Dan

tower-diagnostics-20150810-1553.zip

Link to comment

Not so concerned with the reallocations, but I would be worried about running with multiple drives with pending sectors. Even if you have backups for the files on those disks, what about the other disks? If you have 2 drives fail at once you won't be able to rebuild any.

Link to comment

Excellent point...clearly, I wasn't thinking that through.  Please note that the critical data (digital photos, home videos) is backed up across all disks, not just the ones with pending sectors, but I will address this ASAP.  I have a spare 2 TB drive.  I was contemplating using this drive to convert from reiserfs to xfs, but was holding off on that process, as I was uncomfortable completing it with the webgui crashes.  It seems like I should deal with these pending sectors first.  Any suggestions as to the safest way to do this operation?  Again, I am concerned as I never know when the webgui will crash.

So the two disks with pending sectors are 500 GB.  Should I do a disk replace immediately, then once that is complete preclear the first 500 GB disk with pending sectors and swap it in for the second 500 GB disk that has pending sectors?  Or follow the safer method for replacing multiple smaller drives with a single larger disk as outlined in the wiki:  https://lime-technology.com/wiki/index.php/Replacing_Multiple_Data_Drives_with_a_Single_Larger_Drive

 

Thanks,

Dan

Link to comment

I (re)wrote that wiki a few months ago.

 

Either approach involves 2 rebuilds. You can rebuild each data drive, or use the wiki method, which rebuilds one drive, copies the other drive, then rebuilds parity.

 

The wiki method will take the most time, but it will eliminate both suspect drives, which you can test later if you think you might want to re-use them. It may be that they won't pass. It will also free up a port for later installation of another, possibly larger drive, and that one can be added as a new precleared drive without affecting parity.

 

Whichever approach you take, I think I would recommend using SAFE mode, with no running dockers, VMs, or plugins, while you do this, and nothing reading or especially writing to the array. Maybe that would also make it less likely for you to have a webgui crash.

 

Also, maybe go ahead and install the new RAM and test it first. And always double-check your drive connections when poking around inside.

Link to comment

Update

 

Following the safe method in the wiki for replacing smaller drives with one larger drive, I completed the data rebuild on disk3 and I completed the command for rsyncing the data from the second disk, disk4.  Here is the output of the rsync....the sent and the file size on disk aren't the same....do you think I am OK?

 

sent 493,105,344,795 bytes  received 211,862 bytes  21,350,720.13 bytes/sec

total size is 492,984,352,575  speedup is 1.00
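A small gap of this kind is generally expected: as far as rsync's summary goes, "sent" counts everything written to the receiving side (file data plus file-list and protocol overhead), while "total size" is the sum of the file sizes, so "sent" comes out slightly larger on a full copy. The arithmetic below is a sketch that uses only the two figures quoted above; the variable names are just for illustration.

```shell
# Figures copied from the rsync summary above.
sent=493105344795
total=492984352575

# Extra bytes sent beyond the file data itself.
overhead=$(( sent - total ))

# Overhead as parts per million of the total (integer arithmetic only).
ppm=$(( overhead * 1000000 / total ))

echo "overhead: ${overhead} bytes (${ppm} ppm of total size)"
```

That works out to roughly 121 MB of overhead on about 493 GB of data, or about 0.025%. It says nothing about file contents, though; only a comparison of the two disks would confirm the copy itself.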

 

I am unable to continue with the safe method (disconnecting disk4, rebuilding parity, etc.) as I will not be home until Wednesday of next week.  Sometime after the rsync completed, the unraid webgui crashed.  I kept the log window open as dgachk recommended, but I see nothing at the end of it; the last several lines were disk spin-downs.  I captured diagnostics but cannot attach them, as I don't know a console/command-line way to get them from the flash drive now that the webgui and samba have crashed.  I have attached a txt file with my putty session showing the rsync output above and the output of ps -A...again there are a ton of smbd processes running.  Please note unraid was running in safe mode.

 

Since I am not home to power cycle unraid but I can still putty in...are there any other logs or commands I can run to help figure out what is going on?

 

unraid_putty08142015.txt

Link to comment
  • 2 weeks later...

Update:

 

I have replaced the two drives with pending sectors with one larger drive.  This process completed with no issues.  My Supermicro C2SEE motherboard will not POST with the new RAM, which needs more investigation, so I have reinstalled my original 2 GB of RAM that has worked fine for the past 6 years.

 

I have had two more crashes of the WebGUI since the drive replacement.  Attached are the diagnostics for each of those crashes.

 

One of these crashes occurred while browsing data on the unraid server from a Windows 7 machine via a samba connection.  Every time I run ps -A in a normal, non-crashed state, I get what I would expect for processes.  When the webgui crashes and I run ps -A, there are a ton of smbd processes.
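To tell whether the smbd count creeps up over days or jumps at the moment of the hang, it could be recorded periodically to a file that survives a power cycle. This is only a sketch: log_smbd_count is a hypothetical helper, the log path on the flash drive is an assumption, and the count relies on ps -A listing the command name in the last column.

```shell
# log_smbd_count: hypothetical helper. Appends one timestamped line with the
# current number of smbd processes to the file given as $1.
log_smbd_count() {
    local count
    # If ps is unavailable the count is simply recorded as 0.
    count=$( { ps -A 2>/dev/null || true; } |
        awk '$NF == "smbd" { n++ } END { print n + 0 }')
    printf '%s smbd=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$count" >> "$1"
}

# Usage: sample every 5 minutes from a telnet session or the go script.
# The path below is an assumption; any location on the flash drive would do.
#   while true; do log_smbd_count /boot/logs/smbd_count.log; sleep 300; done
```

A five-minute interval keeps writes to the flash drive small; the last few lines of the log after a hang would show how quickly the processes piled up.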

 

Since each of these crashes results in an unclean shutdown, I am concerned about the potential for data loss.  Any help is greatly appreciated!

tower-diagnostics-20150820-0703.zip

tower-diagnostics-20150823-0933.zip

Link to comment

I upgraded to unraid 6.01 about 10 days ago.  To date I have had two webgui crashes and would like some assistance in troubleshooting. 

 

I don't see any evidence of webGui 'crashing', meaning the process itself is not terminating.

 

Anyway, which particular Docker containers are you running?

Link to comment

Limetech,

 

No Dockers or plugins are running...completely stock unraid 6.01.

I am not sure how to define the "Crash"  other than the webgui becomes unresponsive...If I go to tower my web browser wait icon just spins.

 

Look at the attached htop screenshot (you may need to download and zoom in to read)  https://www.dropbox.com/s/wuyb33vm4zgr1kt/unraidhtop.jpg?dl=0

 

Memory and CPU have a high utilization...maybe that is keeping the webgui from running properly???

 

Check out the attached unraid_crash_ps-A.txt file...it shows the output of ps -A and ps -ef.  For the first time I counted...there are 255 instances of smbd running.  Why would this many processes be running?

 

Thanks again for your assistance!

 

Dan

unraid_crash_ps-A.txt

Link to comment