Unraid stops executing commands after certain period

kil0gram · December 23, 2017

Not sure what is going on with my setup, but lately I have noticed that my Unraid server stops responding to new connections such as SSH/SMB/FTP/etc. The weird thing is that if I already have an established SSH session, I can ping out and also do simple stuff like ifconfig, so that tells me that network-wise we are OK. However, when I executed the command "diagnostics" via my existing SSH session, it just halted.

Next thing I tried was plugging a monitor into the box, and it would not take my keyboard input. FYI, this is not my first time trying to troubleshoot this; sometimes it does take keyboard input, such as when I enter the login name, and once I hit enter, it stalls. Other times I tried executing "reboot" manually on the box and it would just be stuck at "System going down" (or whatever the message is when you reboot).

Anyways, hoping I can get some help on this. I ended up running the diagnostics command after physically rebooting, so I'm hoping it has everything needed to troubleshoot this further.

Other things to note, SMB stops working

Portquery returns result of LISTENING on tcp/22 but I am unable to start a SSH session

~~Is it safe to post diagnostics here? Not sure if it contains any secrets...please advise. Thanks!~~

kil0gram-tower-diagnostics-20171223-1446.zip

limetech · December 24, 2017

The most important thing to capture is the syslog. If, in this locked-up state, you can manage to type this command, it will copy the system log to the root of the flash device, where you will be able to get at it after a reboot:

cp /var/log/syslog /boot/syslog.txt

Also, if you can make this happen (lock up), it's sometimes useful to have a telnet/ssh window open, tailing the log:

tail -f /var/log/syslog

You can also click on the webGui 'Log' button on the right of the menu bar and leave that window up.
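Since typing a command may not be possible once the server is locked up, one option is to start a periodic copy while it is still healthy, so a recent syslog is already on the flash device after a hard reboot. This is only a minimal sketch: the helper name `snapshot_log` and the 60-second interval are assumptions for illustration; the paths are the ones from the command above.

```shell
# Sketch only: snapshot_log and the 60-second interval below are
# illustrative assumptions, not an official Unraid tool or setting.

# Copy SRC to DEST through a temporary file so DEST is never left half-written.
snapshot_log() {
    local src="$1" dest="$2"
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}

# Example loop, started while the server is still responsive:
#   while true; do snapshot_log /var/log/syslog /boot/syslog.txt; sync; sleep 60; done &
```

Every pass is a write to the flash device, so it is probably best to keep the interval modest and stop the loop once the problem has been captured.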

kil0gram · December 25, 2017

Thanks for the reply.

Ok I have tail -f running and will try to copy the logs off if I can. The problem that I am running into is when it gets to this bad state, I cannot execute any commands....hard to explain but here is a snippet of what it looks like when I tried to get a diagnostic when the issue was happening -

root@HULK:~# diagnostics

^C^Cq
^C
^C

That's all it does: no output besides me trying to cancel the command. At this point I was unable to break out of it using Ctrl+C, and also unable to do a Ctrl+Alt+Del in hopes that it would issue a reboot command.
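A command that ignores Ctrl+C is sometimes a process stuck in uninterruptible sleep (state D in `ps`). Whether that is what happens here is an assumption; nothing in the logs so far confirms it. If another session still responds, a sketch like the following could show which processes are blocked and what they are waiting on. The helper name `blocked_procs` is hypothetical.

```shell
# Sketch only: blocked_procs is a hypothetical helper name.
# It reads the output of `ps -eo pid,stat,wchan:32,comm` on stdin and keeps
# the header line plus every process whose state starts with D
# (uninterruptible sleep).
blocked_procs() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example, run from a session that still accepts commands:
#   ps -eo pid,stat,wchan:32,comm | blocked_procs
```

If the same process names show up in state D every time the lockup occurs, that would be worth including when reporting back.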

kil0gram · December 26, 2017

Ok, it repro'd again, and here are the last bits of the syslog that I was able to capture using tail -f:

Dec 26 02:48:20 HULK in.telnetd[15651]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 02:55:06 HULK rpc.mountd[26874]: authenticated mount request from 10.0.0.34:301 for /mnt/user/vms (/mnt/user/vms)
Dec 26 03:00:01 HULK root: mover started
Dec 26 03:00:01 HULK root: moving "vms" to array
Dec 26 03:00:01 HULK root: .d..t...... ./
Dec 26 03:00:01 HULK root: .d..t...... vms/
Dec 26 03:00:01 HULK root: .d..t...... vms/.iorm.sf/
Dec 26 03:00:11 HULK root: .d..t...... vms/
Dec 26 03:00:11 HULK root: .d..t...... vms/DC3/
Dec 26 03:00:11 HULK root: >f+++++++++ vms/DC3/DC3-ctk.vmdk
Dec 26 03:00:22 HULK root: .d..t...... vms/DC3/
Dec 26 03:00:22 HULK root: >f+++++++++ vms/DC3/DC3.vmx
Dec 26 03:00:32 HULK root: .d..t...... vms/DC3/
Dec 26 03:00:32 HULK root: >f+++++++++ vms/DC3/.lck-3700000000000000
Dec 26 03:00:44 HULK root: .d..t...... vms/DC3/
Dec 26 03:00:44 HULK root: >f+++++++++ vms/DC3/DC3.vmsd
Dec 26 03:00:46 HULK root: .d..t...... vms/DC3/
Dec 26 03:00:46 HULK root: .d..t...... vms/
Dec 26 03:00:46 HULK move: rmdir: /mnt/cache/./vms Directory not empty
Dec 26 03:00:46 HULK root: moving "appdata" to cache
Dec 26 03:00:46 HULK move: rmdir: /mnt/disk1/./appdata/hydra/hydra Directory not empty
Dec 26 03:00:46 HULK move: rmdir: /mnt/disk1/./appdata/hydra Directory not empty
Dec 26 03:00:47 HULK move: rmdir: /mnt/disk1/./appdata Directory not empty
Dec 26 03:00:47 HULK root: mover finished
Dec 26 03:55:50 HULK vsftpd[4890]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 03:55:52 HULK sshd[5002]: Connection closed by 10.0.0.45 port 58280 [preauth]
Dec 26 03:55:52 HULK in.telnetd[5005]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 03:56:01 HULK vsftpd[5569]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 03:56:03 HULK sshd[5683]: Connection closed by 10.0.0.45 port 40634 [preauth]
Dec 26 03:56:03 HULK in.telnetd[5687]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 04:00:01 HULK root: mover started
Dec 26 04:00:01 HULK root: moving "vms" to array
Dec 26 04:00:01 HULK root: .d..t...... vms/.iorm.sf/
Dec 26 04:00:02 HULK Docker Auto Update: Community Applications Docker Autoupdate running
Dec 26 04:00:02 HULK root: .d..t...... vms/
Dec 26 04:00:02 HULK root: moving "appdata" to cache
Dec 26 04:00:02 HULK root: .d..t...... ./
Dec 26 04:00:02 HULK move: rmdir: /mnt/disk1/./appdata/hydra/hydra Directory not empty
Dec 26 04:00:02 HULK Docker Auto Update: Checking for available updates
Dec 26 04:00:02 HULK move: rmdir: /mnt/disk1/./appdata/hydra Directory not empty
Dec 26 04:00:02 HULK move: rmdir: /mnt/disk1/./appdata Directory not empty
Dec 26 04:00:02 HULK root: mover finished
Dec 26 04:00:06 HULK vsftpd[14271]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 04:00:08 HULK sshd[14391]: Connection closed by 10.0.0.45 port 53390 [preauth]
Dec 26 04:00:08 HULK in.telnetd[14394]: connect from 10.0.0.45 (10.0.0.45)
Dec 26 04:00:39 HULK Docker Auto Update: No updates will be installed
Dec 26 05:38:18 HULK kernel: md: sync done. time=57860sec

Forgot to mention that this is happening on a daily basis now, and the only fix is to hard reboot it.

kil0gram · December 27, 2017

Ok, so far it has not frozen up like last time, and the only difference that I can think of is that this time my UNRAID instance is not joined to a domain. I will keep an eye on this, still need a solution.

kil0gram · January 4, 2018

Bummer, I bought this product thinking it had a great community around it but neither the community nor the developers are helping here....

FYI when I type "diagnostics" it just does what you see below....at this point I won't be able to cancel out of this

image.png.3ee31c4e07563a8f7e6cd4b4ea21cfdd.png

JorgeB · January 4, 2018

It's difficult to help without any diagnostics. Does the server run any other OS without issues? Did you try running memtest for at least 24 hours?

kil0gram · January 4, 2018

Well, that's the problem: once it is in this state, I cannot generate diagnostics. I have noticed that this happens after I join it to the domain, which I need for easy permission and share access.

limetech · January 4, 2018

The way you left it was:

Quote

Ok, so far it has not frozen up like last time, and the only difference that I can think of is that this time my UNRAID instance is not joined to a domain. I will keep an eye on this, still need a solution.

No more info from you after that. For example, did you toggle Active Directory on/off a couple times and definitively determine issue is related to this?

No, your next post is:

Quote

Bummer, I bought this product thinking it had a great community around it but neither the community nor the developers are helping here....

And then 10 min later you got a response from someone.

Look man, no one can read your mind, and people here are extremely helpful but it's a 2-way street. Please consider editing your post and then I'll remove this one.

kil0gram · January 4, 2018

Ok great, now I have a next step to try. Me figuring out it is related to AD was just by chance since I got no response from you (I also emailed you). I am not expecting you to read my mind, I am expecting you to read my lines so here it is again -

Quote

Ok, so far it has not frozen up like last time, and the only difference that I can think of is that this time my UNRAID instance is not joined to a domain. I will keep an eye on this, still need a solution.

So what is the solution offered so far? Toggle AD? Well I have already determined it is related to AD so what is the next step to troubleshoot that? Per my responses above, when it gets to this state, I can no longer generate a diagnostics file nor can I run any additional commands that are helpful. I gave you output of the /var/log/syslog as well but no response on that either. Perhaps the person who responded saw my previous replies and noticed no one is answering my questions?

Regardless, I have purchased a product that no longer works as advertised and is causing outages every day. We can bicker over he said, she said, but that still does not solve the issue.

What other information would you need to troubleshoot this further?

Thanks

jonp · January 4, 2018

Hi Kil0gram,

What we need is something in your logs or diagnostics that points to a problem. Right now we are not seeing one. The tailed log you captured doesn't show any errors of any kind. The AFPD messages in your tailed log screenshot are not relevant (they have to do with AFP protocol, nothing to do with SMB or Active Directory and wouldn't cause system hangs).

Have you tried a memtest on the system? (asked previously in this thread with no response)
Have you tried another OS on the system? (asked previously in this thread with no response)
How have you ruled out faulty hardware, cables, etc.?

The unRAID community is incredibly helpful and willing to jump in to lend a hand, but when there is not a thread of evidence to support a software problem (such as a kernel panic, call trace, etc.), there is really nothing for us to do. We have plenty of customers using unRAID with an AD domain and I'm sure we'd have a lot more e-mails about the problem if we had a software defect on our hands.

My biggest concern with the way this thread is going is that the first time you heard from someone in our community (Johnnie.Black), he posed two questions to you that you completely disregarded. If you want us to help you, you have to respond to what folks say in here, otherwise it feels like a lost cause.
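For anyone checking a saved log themselves, the kind of evidence mentioned above (a kernel panic, a call trace) can be searched for in a copy such as /boot/syslog.txt. A minimal sketch: the helper name `scan_syslog` and the exact pattern list are assumptions chosen for illustration, not an exhaustive list of what indicates a software fault.

```shell
# Sketch only: scan_syslog and its pattern list are illustrative assumptions.
# Prints matching lines with their line numbers; the exit status is non-zero
# when nothing matches, as with plain grep.
scan_syslog() {
    grep -n -i -E 'kernel panic|call trace|oops|blocked for more than|out of memory' "$1"
}

# Example, after copying the log to the flash device:
#   scan_syslog /boot/syslog.txt || echo "no obvious kernel errors found"
```

An empty result does not prove the hardware or software is fine; it only means none of these particular messages reached the log before the hang.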

bonienl · January 4, 2018

A good starting point for troubleshooting is to start your server in safemode, without active plugins, docker and VM.

limetech · January 4, 2018

1 hour ago, kil0gram said:

Bummer, I bought this product thinking it had a great community around it but neither the community nor the developers are helping here....

I took this to mean, "You guys need to drop everything you're doing and help me solve my problem right now."

Sorry if that's not what you meant. If it wasn't, then please edit your post and I'll investigate further. If it was, I'd suggest not using AD for now until we can get to this issue.

kil0gram · January 14, 2018

Dear @limetech,

You are getting me wrong again. Let me start over so that we are on the same page, I am experiencing an issue with the product that I purchased. Per your technical advice, you would like me to run a command to generate a diagnostics file and would also like me to produce a copy of the running log file so that it could give you some insight as to what is happening so that you could provide me some support. I am with you so far, hope you are as well.

The problem that I am facing is that I do have a running log file (tail -f /var/log/syslog on both the console session and SSH, same result); however, it does not provide any good logging to help diagnose the issue. The next request was for me to generate a diagnostics file (something like show tech) so that you could get even more details in hopes of finding some sort of clue. However, the problem is that when the system is locked up I cannot log in via a new SSH session, BUT if I already have a logged-in session, I am able to run additional commands such as "ifconfig" and "ping".

So next step for me is to physically connect a monitor and see what happens but again, when the system is in such a state, all I am able to do is enter the command and then the system halts....not sure if that is the correct word but it no longer takes any commands at that point. Here is an example of what that looks like -

image.png.3ee31c4e07563a8f7e6cd4b4ea21cfdd.png

As you can see, after I entered "diagnostics", in theory it should do something but in reality it stays like the above. So here we are, what would you like me to try next? I could definitely try a new OS on this hardware BUT are you serious?????????????????????????????????????????????????

One fact I can provide is that my system has been running for over a week just fine, no issues whatsoever, and the only difference at this point in time is that I did not join my UNRAID instance to my Windows AD.

Please let me know if I am not making my issue clear and we can work out the wording, as for me editing my previous posts, just let that one go man, you are making both yourself as well as your product not look so appealing.

EDIT. I am not asking anyone to drop everything and solve this issue; what I am asking for is some reassurance and/or direction. I do not mean to be a dick, but you are getting me wrong on every reply, so much that it hurts, and you are forcing me to spell it out for you. My apologies on that man, I am not a douche and I do work with plenty of vendors and would love to help resolve an issue that might not affect a huge user base for you now but possibly could later.

kil0gram · January 14, 2018

On 1/4/2018 at 10:16 AM, johnnie.black said:

It's difficult to help without any diagnostics. Does the server run any other OS without issues? Did you try running memtest for at least 24 hours?

Yes, memtest passed with no issues, and I do understand that without logging, things can be difficult. It's like when you are trying to debug software without a debugger, shit gets tricky.

Thanks for the suggestion

kil0gram · January 14, 2018

On 1/4/2018 at 11:44 AM, bonienl said:

A good starting point for troubleshooting is to start your server in safemode, without active plugins, docker and VM.

I have not tried safemode and it is definitely worth trying. One thing I did do was turn off all docker apps and join it to the domain, and eventually it had the same results. Perhaps in safemode it might last longer, which could point to a faulty plugin or something. Worth a shot and I will try that as well, but I am waiting on a response from LIMETECH to see what he would like me to try.

Thanks man!

kil0gram · January 14, 2018

On 1/4/2018 at 11:34 AM, jonp said:

Hi Kil0gram,

What we need is something in your logs or diagnostics that points to a problem. Right now we are not seeing one. The tailed log you captured doesn't show any errors of any kind. The AFPD messages in your tailed log screenshot are not relevant (they have to do with AFP protocol, nothing to do with SMB or Active Directory and wouldn't cause system hangs).

Have you tried a memtest on the system? (asked previously in this thread with no response)

Have you tried another OS on the system? (asked previously in this thread with no response)

How have you ruled out faulty hardware, cables, etc.?

The unRAID community is incredibly helpful and willing to jump in to lend a hand, but when there is not a thread of evidence to support a software problem (such as a kernel panic, call trace, etc.), there is really nothing for us to do. We have plenty of customers using unRAID with an AD domain and I'm sure we'd have a lot more e-mails about the problem if we had a software defect on our hands.

My biggest concern with the way this thread is going is that the first time you heard from someone in our community (Johnnie.Black), he posed two questions to you that you completely disregarded. If you want us to help you, you have to respond to what folks say in here, otherwise it feels like a lost cause.

Sorry, but I did not disregard them. I was a bit frustrated with support at that point and had already tried his suggestions but failed to report back (I have now). Please read my issue in my latest post; I wrote it out in hopes that we can all be on the same page.

FYI -

1. Memtest passed

2. No new OS tried, UNRAID runs just fine without joining to domain, what would running a new OS prove?

3. Yes

bonienl · January 14, 2018

Have you tried to run in safemode?

kil0gram · January 18, 2018

Update -

After applying update 6.4.0, things have been stable thus far. No idea what was updated, but I do know that my UI, which was using a custom port, has been reset, so I am assuming some other binaries have been replaced/updated as well. No root cause found, but I am glad it's working.

trurl · January 18, 2018

10 minutes ago, kil0gram said:

Update -

After applying update 6.4.0, things have been stable thus far. No idea what was updated, but I do know that my UI, which was using a custom port, has been reset, so I am assuming some other binaries have been replaced/updated as well. No root cause found, but I am glad it's working.

The method for running webUI on another port changed in 6.4. Go to Settings - Identification - SSL Certificate Settings. You might also want to read the release notes if you haven't already.

Archived