goinsnoopin Posted July 30, 2015

I upgraded to unRAID 6.0.1 about 10 days ago. To date I have had two webgui crashes and would like some assistance in troubleshooting.

System specs: Supermicro C2SEE, Celeron E1200 CPU, 2 GB RAM, parity drive, 8 data disks, one cache drive.

I have two dockers running: Logitech Media Server and BTSync. I ran both of these as plugins in unRAID 5. From my reading about unRAID 6 I realize my RAM may be low, so between the first and second crashes discussed below I implemented a swapfile and set it up for an additional 2 GB. I have some plugins running, including cache_dirs, preclear, the new SNAP-type plugin to mount drives, and community docker/apps.

First webgui crash: not responsive in browser, and I could not telnet in to the console. Major mistake: I forgot to grab the syslog before rebooting.

Second webgui crash: not responsive in browser, but I could telnet in. I did grab a syslog (see attached). I was unable to unmount /dev/md2, /dev/md5 and /dev/md8. I tried to see what was running on these disks, but fuser would not return any results. I ran ps -A and noticed that there were a lot of smbd entries listed, I mean a lot, 50 to 60. This seemed unusual. Ultimately I issued the poweroff command and got the "system is going down" message, but the Tower never powered off. After an hour I powered down manually with the switch.

I have rebooted and a parity check is in progress. Since this occurred I have installed the powerdown script. I also stopped both Dockers (is stopping them enough, or do I need to remove them?).

If my hardware can't support what I am doing, please let me know, as having a stable NAS is most important. In terms of hardware, I do have a spare Q9300 CPU that will fit my motherboard. Unfortunately my motherboard is limited to 4 GB of RAM. Would upgrading to 4 GB be beneficial?

Any help is greatly appreciated!

Dan

Modified subject title now that unRAID is updated to 6.1 stable.

Attached: syslog.txt
trurl Posted July 30, 2015

Don't really see it in the syslog, but it does sound like the classic out-of-memory problem. I would definitely max out your RAM.
goinsnoopin Posted July 30, 2015

Trurl,

Thanks for the reply. I found this post; the original poster never got it to work, but I know my board is good as it is my current machine. I wonder if 8 GB is worth testing. I'm trying not to throw money away. Or do you think I should just stick with purchasing 4 GB and play it safe?

http://lime-technology.com/forum/index.php?topic=30700.0

Dan
trurl Posted July 30, 2015

Maybe get it somewhere with a liberal return policy.
funbubba Posted July 30, 2015

For what it's worth, I'm having the same issue right now and I had just upgraded from 4GB to 8GB of RAM yesterday.

# tail /var/log/syslog
Jul 30 07:53:44 tower emhttp: /usr/bin/tail -n 42 -f /var/log/syslog 2>&1
Jul 30 08:39:25 tower kernel: mdcmd (66): spindown 8
Jul 30 08:45:14 tower kernel: mdcmd (67): spindown 2
Jul 30 08:45:49 tower kernel: mdcmd (68): spindown 6
Jul 30 08:59:28 tower kernel: mdcmd (69): spindown 9
Jul 30 10:11:06 tower kernel: mdcmd (70): spindown 0
Jul 30 10:11:06 tower kernel: mdcmd (71): spindown 3
Jul 30 13:38:16 tower sshd[14987]: Accepted password for root from 192.168.1.219 port 49594 ssh2
Jul 30 13:50:12 tower sshd[19546]: Accepted password for root from 192.168.1.219 port 49677 ssh2
Jul 30 13:51:06 tower sshd[19885]: Accepted password for root from 192.168.1.219 port 49679 ssh2

# free -m
             total       used       free     shared    buffers     cached
Mem:          7736       6556       1180          0        443       5317
-/+ buffers/cache:        795       6941
Swap:            0          0          0

# ps aux | grep emhttp
root      8678  3.0  0.0  89500  3764 ?        Rl   Jul29  40:24 /usr/local/sbin/emhttp
root     31107  0.0  0.0   5104  1728 pts/0    S+   14:18   0:00 grep emhttp

In the past when I was running v4.5.6 and I ran out of memory, I remember it showed a line in syslog about killing emhttp (or something to that effect). Now, it doesn't seem to show any type of strange activity in the syslog and emhttp is still alive though inaccessible. It was working earlier today though.

I'm looking forward to figuring this problem out.
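A note on reading the free -m output above: most of the "used" figure is buffers and page cache, which the kernel can give back on demand. The sketch below only redoes the arithmetic that free prints on its "-/+ buffers/cache" row; the function name app_used_mb is made up for illustration, and it assumes the older free layout shown in the post (total, used, free, shared, buffers, cached).

```shell
#!/bin/sh
# app_used_mb (hypothetical helper): read old-style `free -m` output on
# stdin and print used - buffers - cached from the "Mem:" row, i.e.
# roughly what programs themselves are holding, in MB.
# Assumed column order: total used free shared buffers cached.
app_used_mb() {
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}

# Demo using the "Mem:" row quoted in the post above.
echo "Mem: 7736 6556 1180 0 443 5317" | app_used_mb   # prints 796
```

On the live server this would be run as `free -m | app_used_mb`. The result is within 1 MB of the 795 that free itself reports, which is presumably rounding, and it suggests that this particular box was not short of memory when emhttp stopped responding.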
funbubba Posted July 31, 2015

Here's mine. I don't have those CLOSE_WAIT lines on my lsof though. Thanks for looking. I'm going to power down. I didn't realize my CPU has probably been running since emhttp has gone on a rampage.

Attached: output.txt
goinsnoopin Posted August 3, 2015

So I am still having an issue. Since my last post, I have ordered more RAM. It will arrive today or tomorrow.

After the crash documented in my posts, I increased the swap file to 4096 MB, installed the powerdown script, and stopped my Logitech Media Server and BTSync Dockers, as I am just looking to get a stable NAS at this point.

After 3 days and 21 hours of uptime (as reported by the webgui, which is unresponsive, so this time may be off) the webgui crashed. However, I am still able to telnet in to the console. Please note that after capturing the attached outputs, I initiated the powerdown script and it is not shutting my server down. It says "powerdown v2.17 initiated" and I get the "system going down for a halt" message in the telnet window, but the server has not gone down and it has been 15 minutes as of this writing. Any suggestions for how to get the system to shut down safely? Last time I had to power down manually, but I know that is not good for the data. What would keep the powerdown script from completing?

Attached are two files: one is my syslog and the other is my telnet window with the results of issuing ps -A, free -m, ps -ef, and lsof -Pni, as referenced in the thread trurl mentioned. Please note all the smbd processes running and that the memory is almost completely used. What is making all the smbd processes run?

Also, here is a link to a screenshot of htop; the CPU is at 100% with one particular process:

https://www.dropbox.com/s/wuyb33vm4zgr1kt/unraidhtop.jpg?dl=0

Any help would be greatly appreciated!

Dan

Attached: unraid_crash_ps-A.txt, syslog_08_3_2015.txt
shooga Posted August 3, 2015

I had a webgui/emhttp crash after about 3 days too (my first three days with 6.0.1). Seems like maybe there are a few of us experiencing the same bug. Here's my thread, with a recommended shutdown procedure from limetech:

http://lime-technology.com/forum/index.php?topic=41938.0

It worked, but then unRAID did do a parity check on reboot, so I guess it wasn't really a graceful shutdown.
goinsnoopin Posted August 3, 2015

Additional info: I tried to follow limetech's instructions in this post to shut down unRAID from the console:

http://lime-technology.com/forum/index.php?topic=41938.msg398562#msg398562

When I run:

umount /mnt/disk*

I am told that a couple of disks are in use or busy, and to run lsof or fuser to see the list of files in use. fuser locks up the telnet session. However, I captured the attached output from lsof. In its unzipped form it is a 12 MB file! The smbd process has a lot of open files.

Here is a link to the zipped PuTTY session log that contains the output of lsof:

https://www.dropbox.com/s/2xdofs47vuhlbj3/Unraid_putty.zip?dl=0

I am hoping that by posting this, someone will have an idea of what to try next.

Thanks,
Dan
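A 12 MB lsof dump is hard to read by eye. One way to see which program holds the most open files is to tally the rows by the first column (COMMAND). This is only a sketch: the function name top_openers is made up for illustration, and it assumes the default lsof column layout with a single header line.

```shell
#!/bin/sh
# top_openers (hypothetical helper): read `lsof` output on stdin, skip
# the header line, and print "<count> <command>" with the busiest
# command first.
top_openers() {
    awk 'NR > 1 { count[$1]++ }
         END { for (c in count) print count[c], c }' | sort -rn
}

# Demo on a few made-up rows in lsof's default layout.
top_openers <<'EOF'
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
smbd 1201 root 10r REG 9,2 1024 11 /mnt/disk2/a
smbd 1202 root 11r REG 9,2 2048 12 /mnt/disk2/b
smbd 1203 root 12r REG 9,5 4096 13 /mnt/disk5/c
emhttp 8678 root 3u REG 0,2 512 14 /var/log/syslog
EOF
```

On the server itself the equivalent would be `lsof -n | top_openers | head`, or feeding it the saved PuTTY log instead of live output.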
shooga Posted August 3, 2015

I was able to restart by killing the smbd process, but it looks like you would have many different PIDs to kill if you were to do that. In the end my restart was not graceful anyway and a parity check was initiated (although it completed with no errors, thankfully).

There's a bit more info here:

http://lime-technology.com/wiki/index.php/Console#To_cleanly_Stop_the_array_from_the_command_line

That wiki may be old though, so I'd go with the procedure you linked to. You might be able to use some of the commands shown on the wiki to help get the shares unmounted.

Unfortunately, I just don't really know enough to be of much help.
goinsnoopin Posted August 3, 2015

Shooga,

Thanks for the link. I worked through that posting this morning. When I type the fuser commands the telnet session locks up and I can't even kill the command with Ctrl-C. I tried killing a few smbd processes, but they don't seem to die.

It seems like there are a couple of other users who are having a similar situation. Hopefully someone will have some additional suggestions.

Thanks again!
Dan
goinsnoopin Posted August 3, 2015

Here is where I am at:

Power-cycled the Tower; parity check in progress. Deleted my two dockers (which were previously stopped anyway) and uninstalled all plugins with the exception of powerdown. So I am completely stock with the exception of powerdown.

I should be getting additional RAM, probably tomorrow.

It seems like there are several posts in the support forum about a freezing/crashing webgui.

Dan
goinsnoopin Posted August 10, 2015

Update: the webGUI keeps crashing. It seems to take 4 days or so for it to crash.

Additional RAM came, however it was not compatible with my motherboard so I am RMAing it. I have reinstalled my original 2 GB of RAM in my Supermicro C2SEE motherboard.

I am completely stock with the exception of the powerdown plugin. This is really frustrating. It seems like there are several posts with similar crashes.

I did grab diagnostics from the console. I checked the syslog and I don't see anything.

Dan
dgaschk Posted August 11, 2015

Click the Log button on the top right of the unRAID GUI. Leave the Log window open until the server halts. Save a Diagnostics file after opening the Log window. Attach both the diagnostics file and the entire contents of the Log window.
goinsnoopin Posted August 11, 2015

DGASCHK,

I have not left the log window open. I will give that a try for the next crash.

During the last crash (8/10/2015), I telnetted in to the console and ran diagnostics from the command line. Attached is that diagnostics file.

I know I have 3 drives with SMART errors. I have all critical data from these three drives backed up on an external drive. The SMART errors/properties don't seem to be changing.

Thanks for taking a look!

Dan

Attached: tower-diagnostics-20150810-1553.zip
trurl Posted August 11, 2015

Not so concerned with the reallocations, but I would be worried about running with multiple drives with pending sectors. Even if you have backups for the files on those disks, what about the other disks? If you have 2 drives fail at once you won't be able to rebuild any.
trurl Posted August 11, 2015

Wiki article:

Pending sectors occur as a result of read failures. An unreadable sector will interfere with the reconstruction of a failed drive. Pending sectors need to be cleared as soon as possible because 2 drives with unreadable sectors will most likely be unrecoverable within unRAID.
goinsnoopin Posted August 11, 2015

Excellent point. Clearly, I wasn't thinking that through. Please note that the critical data (digital photos, home videos) is backed up across all disks, not just the ones with pending sectors, but I will address this ASAP.

I have a spare 2 TB drive. I was contemplating using this drive to convert from reiserfs to xfs, but was holding off on this process, as I was uncomfortable completing it with the webgui crashes. It seems like I should deal with these pending sectors first. Any suggestions as to the safest way to do this operation? Again, I am concerned as I never know when the webgui will crash.

The two disks with pending sectors are 500 GB. Should I do a disk replacement immediately, then once that is complete, preclear the first 500 GB drive with pending sectors and use it to replace the second 500 GB drive that has pending sectors? Or should I follow the safer method for replacing multiple smaller drives with a single larger disk as outlined in the wiki:

https://lime-technology.com/wiki/index.php/Replacing_Multiple_Data_Drives_with_a_Single_Larger_Drive

Thanks,
Dan
trurl Posted August 11, 2015

I (re)wrote that wiki a few months ago.

Either approach involves 2 rebuilds. You can rebuild each data drive, or use the wiki method, which rebuilds one drive, copies the other drive, then rebuilds parity.

The wiki method will take the most time, but it will eliminate both suspect drives, which you can test later if you think you might want to re-use them. It may be that they won't pass. It will also free up a port for later installation of another, possibly larger drive, and that one can be added as a new precleared drive without affecting parity.

Whichever approach you take, I think I would recommend using SAFE mode, with no running dockers, VMs, or plugins, while you do this, and nothing reading or especially writing to the array. Maybe that would also make it less likely for you to have a webgui crash.

Also, maybe go ahead and install the new RAM and test it first. And always double-check your drive connections when poking around inside.
goinsnoopin Posted August 14, 2015

Update: following the safe method in the wiki for replacing smaller drives with one larger drive, I completed the data rebuild on disk3 and I completed the command for rsyncing the data from the second disk, disk4. Here is the output of the rsync. The sent size and the file size on disk aren't the same. Do you think I am OK?

sent 493,105,344,795 bytes  received 211,862 bytes  21,350,720.13 bytes/sec
total size is 492,984,352,575  speedup is 1.00

I am unable to continue with the safe method (disconnecting disk4, rebuilding parity, etc.) as I will not be home until Wednesday of next week.

Sometime after the rsync completed, the unRAID webgui crashed. I kept the log window open as dgaschk recommended, but I see nothing unusual at the end; the last several lines were disk spin-downs. I captured diagnostics but cannot attach them, as I don't know a console/command-line way to get them from the flash drive since the webgui and samba have crashed. I have attached a txt file with my PuTTY session showing the rsync output above and the output of ps -A. Again, there are a ton of smbd processes running. Please note unRAID was running in safe mode.

Since I am not home to power cycle unRAID but I can still PuTTY in, are there any other logs or commands I can run to help figure out what is going on?

Attached: unraid_putty08142015.txt
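On the rsync numbers above: "sent" counts everything rsync pushed through the connection, including the file list and per-file metadata, while "total size" is just the sum of the file sizes, so "sent" being a little larger is expected after a full copy. The sketch below only works out the gap between the two figures from the post; byte_diff is a made-up helper name, and the directory names in the trailing comment are placeholders, not the paths actually used.

```shell
#!/bin/sh
# byte_diff (hypothetical helper): print the difference between two
# byte counts, e.g. rsync's "sent" figure minus its "total size".
byte_diff() {
    echo $(( $1 - $2 ))
}

# Figures from the rsync summary in the post above.
sent=493105344795
total=492984352575
byte_diff "$sent" "$total"   # prints 120992220

# For a stronger check than comparing sizes, a checksum dry run lists
# any file whose content differs without copying anything:
#   rsync -avnc /path/to/source/ /path/to/destination/
# (-n = dry run, -c = compare by checksum; both paths are placeholders)
```

The gap is about 121 MB on roughly 493 GB, a small fraction of a percent, which is in line with protocol overhead rather than missing data. The checksum dry run is slow on a disk this size because it has to read both copies in full.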
FredG89 Posted August 15, 2015

I'm in the same situation. My Emby plugin is very unstable, and my server's emhttp and smb lock up pretty much every day.
goinsnoopin Posted August 24, 2015

Update:

I have replaced the two drives with pending sectors with one larger drive. This process completed with no issues.

My Supermicro C2SEE motherboard will not POST with the new RAM, which needs more investigation, so I have reinstalled my original 2 GB of RAM that has worked fine for the past 6 years.

I have had two more crashes of the webGUI since the drive replacement. Attached are the diagnostics for each of those crashes. One of these crashes occurred while I was browsing data on the unRAID server from a Windows 7 machine via a samba connection.

Every time I run ps -A in a normal, non-crashed state, I get what I would expect for processes. When the webgui crashes and I run ps -A, there are a ton of smbd processes.

Since each of these crashes results in an unclean shutdown, I am concerned about the potential for data loss. Any help is greatly appreciated!

Attached: tower-diagnostics-20150820-0703.zip, tower-diagnostics-20150823-0933.zip
limetech Posted August 24, 2015

Quoting goinsnoopin: "I upgraded to unraid 6.01 about 10 days ago. To date I have had two webgui crashes and would like some assistance in troubleshooting."

I don't see any evidence of webGui 'crashing', meaning the process itself is not terminating. Anyway, which particular Docker containers are you running?
goinsnoopin Posted August 24, 2015

Limetech,

No Dockers or plugins are running; this is completely stock unRAID 6.0.1. I am not sure how to define the "crash" other than that the webgui becomes unresponsive. If I go to tower, my web browser's wait icon just spins.

Look at the attached htop screenshot (you may need to download it and zoom in to read it):

https://www.dropbox.com/s/wuyb33vm4zgr1kt/unraidhtop.jpg?dl=0

Memory and CPU have a high utilization. Maybe that is keeping the webgui from running properly?

Check out the attached unraid_crash_ps-A.txt file. It shows the output of ps -A and ps -ef. For the first time I counted: there are 255 instances of smbd running. Why would this many processes be running?

Thanks again for your assistance!

Dan

Attached: unraid_crash_ps-A.txt
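Counting the smbd entries by hand gets tedious when there are hundreds. A small filter over ps -A output does the same job and makes it easy to compare a healthy state with a hung one. This is a sketch only: count_named is a made-up helper name, and it assumes the default ps -A layout (PID, TTY, TIME, CMD), with the command name in the fourth column.

```shell
#!/bin/sh
# count_named (hypothetical helper): read `ps -A` output on stdin and
# print how many rows have the given command name in the CMD column.
# Assumed layout: PID TTY TIME CMD.
count_named() {
    awk -v name="$1" '$4 == name { n++ } END { print n + 0 }'
}

# Demo on a made-up process listing.
count_named smbd <<'EOF'
  PID TTY          TIME CMD
    1 ?        00:00:04 init
 1201 ?        00:00:00 smbd
 1202 ?        00:00:00 smbd
 8678 ?        00:40:24 emhttp
EOF
```

On the server this would be `ps -A | count_named smbd`. Running it periodically and logging the number alongside a timestamp could show whether the smbd count grows steadily or jumps right before the webgui stops responding.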
This topic is now archived and is closed to further replies.