The builtin powerdown script can easily hang preventing a shutdown



From a general use perspective, the problem is that most of us know how to troubleshoot these issues -- i.e. if there's an issue with hanging, we can repeat the process and look at open files and mount points to see what might be causing a problem ... but the general user isn't likely to do that => they'll just be frustrated that when they try to power off the server it doesn't work.

 

The PowerDown plugin pretty much resolved this issue over the past few years; but it is indeed VERY nice to see that the built-in function in UnRAID is now being updated so that plugin is no longer needed.  I'm confident that between Tom & dlandon's dialogue in this thread the need for the plugin will be completely gone by 6.2.2  :)

Indeed, 6.2.1 resolves it for most folks ... a few more "tweaks" and it should be "bullet proof"  :)  [Yes, I know there will always be someone who attacks it with a Kevlar-piercing round  :) ]

AFAIK it already is "bullet proof".

Moving forward,  will unraid be able to get past any unmounting hangs during the stopping of the array?

 

The array Stop operation, commonly invoked by clicking 'Stop' on the Main page, is a distinct operation that is also invoked by the shutdown process.  The Stop operation itself does not have a time-out.  The main reason a Stop might hang is an open file on a mounted file system.  You correctly point out that the most common way this happens is someone holding a command shell open, e.g. 'mc'.  In this case it would probably be ok to just kill the process.  However, there are other reasons.  For example, maybe someone has a loopback-mounted file which is located on the array/cache.  In this case it will not show up in 'fuser', but if we force a Stop it could result in data loss inside that loopback.  Hence when invoked manually (by clicking the Stop button) unRAID takes the safe approach: if it can't unmount the array/cache file systems, it sits in a loop until you, the user, correct what's wrong.
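To make the "sits in a loop" behaviour concrete, here is a minimal sketch of a polling wait on a mount point, with an optional give-up time. This is not emhttp's code. The function names, the one-second interval and the optional mount-table argument are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helpers, not taken from unRAID.

# is_mounted PATH [MOUNTS_FILE]
# True if PATH appears as a mount point (second column) in the mount table.
# Note: /proc/mounts writes spaces in paths as \040.
is_mounted() {
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# wait_unmounted PATH TIMEOUT [MOUNTS_FILE]
# Poll once per second until PATH is no longer mounted.
# Returns 0 once it is gone, 1 if TIMEOUT seconds pass first.
wait_unmounted() {
    local path="$1" timeout="$2" table="${3:-/proc/mounts}" waited=0
    while is_mounted "$path" "$table"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A manual Stop as described above behaves like this loop with no timeout at all, which is why something has to clear the open file before it can finish.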

 

The problem is that the gui is unusable while it's waiting for the array to stop, so you can't switch over to the Open Files plugin to see what is causing the problem.

 

I feel like basic functionality of the Open Files plugin should be built into the Array Operation tab, so it can warn you about things like an open bash shell before you try stopping the array.

 

Ok but if you know enough to open a bash shell, you probably know enough to open another bash shell and see what the problem is?  If not, maybe shouldn't be opening bash shells...  ;)


For example, maybe someone has a loopback-mounted file which is located on the array/cache.  In this case it will not show up in 'fuser', but if we force a Stop it could result in data loss inside that loopback.

 

I've certainly done this in the past and been scratching my head trying to work out what the problem was as fuser showed no open file.

 

Running 'mount' to list the mounted file systems made it all clear and I was able to unmount the loopback and the system proceeded to shutdown correctly.
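For anyone who wants to check for this case directly, here is a small sketch that filters loop devices by where their backing file lives. It assumes `losetup -a` prints lines in the usual `device: [dev]:inode (file)` form. The function name is made up, and file names containing parentheses would confuse it.

```shell
#!/bin/bash
# Hypothetical helper: print "device backing-file" for every loop device
# whose backing file lives under PREFIX (default /mnt/).
# Reads `losetup -a` style lines on stdin, for example:
#   /dev/loop0: [2049]:12 (/mnt/cache/docker.img)
loops_backed_by() {
    local prefix="${1:-/mnt/}"
    local line dev file
    while IFS= read -r line; do
        dev="${line%%:*}"        # text before the first colon
        file="${line##*\(}"      # text after the last "("
        file="${file%\)*}"       # drop the closing ")" and anything after it
        case "$file" in
            "$prefix"*) printf '%s %s\n' "$dev" "$file" ;;
        esac
    done
}

# On a live server (as root):
#   losetup -a | loops_backed_by /mnt/
```

Anything this prints is a loop device that keeps an array or cache file system busy even though 'fuser' shows no open file for it.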

 

haha, if we made 'Stop' forcibly shut things down, instead of some users complaining their array won't stop, we'd have other users complaining we forced a stop and clobbered their data  ::)


I feel like basic functionality of the Open Files plugin should be built into the Array Operation tab, so it can warn you about things like an open bash shell before you try stopping the array.

 

Ok but if you know enough to open a bash shell, you probably know enough to open another bash shell and see what the problem is?  If not, maybe shouldn't be opening bash shells...  ;)

 

I'd respectfully disagree... there is a huge difference between knowing how to move files with mc and knowing how to use fuser (is that right? I always have to google it) to figure out why the array won't stop. 

 

Also, that was just an example of something that always prevents the array from stopping. There are probably other things.

 

But... elsewhere you've talked about moving to nginx. If that solves the "emhttp can only do one thing at a time" problem, then users could run the Open Files plugin in a second tab to figure out why the array won't stop.  And that sort of solves the problem too.

 

haha, if we made 'Stop' forcibly shut things down, instead of some users complaining their array won't stop, we'd have other users complaining we forced a stop and clobbered their data  ::)

 

100% agree!  If it isn't an "emergency stop" we don't want to risk data loss by forcibly closing files.


I can't believe it, but I did get it to hang on the second try.  Ok, so here is what I have:

 

This command:

/usr/bin/fuser -mv /mnt/user/* /mnt/disk*[!s] /mnt/cache /dev/md* 2>&1 | logger

 

produced the following in the log:

Oct  5 13:00:25 MediaServer root:                      USER        PID ACCESS COMMAND
Oct  5 13:00:25 MediaServer root: /mnt/user/AV Media:  root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Computer Backups:
Oct  5 13:00:25 MediaServer root:                      root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/DVD Movies:
Oct  5 13:00:25 MediaServer root:                      root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Movies:    root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Music:     root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/NetDrive:  root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Pictures:  root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Public:    root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Software Development:
Oct  5 13:00:25 MediaServer root:                      root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Users:     root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/Videos:    root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/appdata:   root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/docker:    root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/domains:   root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/iTunes:    root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/isos:      root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/user/system:    root     kernel mount /mnt/user
Oct  5 13:00:25 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:25 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:25 MediaServer root: /mnt/disk1:          root     kernel mount /mnt/disk1
Oct  5 13:00:25 MediaServer root: /mnt/disk2:          root     kernel mount /mnt/disk2
Oct  5 13:00:25 MediaServer root: /mnt/disk3:          root     kernel mount /mnt/disk3
Oct  5 13:00:25 MediaServer root: /mnt/disk4:          root     kernel mount /mnt/disk4
Oct  5 13:00:25 MediaServer root: /mnt/disk5:          root     kernel mount /mnt/disk5
Oct  5 13:00:25 MediaServer root: /mnt/disk6:          root     kernel mount /mnt/disk6
Oct  5 13:00:25 MediaServer root: /mnt/cache:          root     kernel mount /mnt/cache
Oct  5 13:00:25 MediaServer root: /dev/md1:            root     kernel mount /mnt/disk1
Oct  5 13:00:25 MediaServer root: /dev/md2:            root     kernel mount /mnt/disk2
Oct  5 13:00:25 MediaServer root: /dev/md3:            root     kernel mount /mnt/disk3
Oct  5 13:00:25 MediaServer root: /dev/md4:            root     kernel mount /mnt/disk4
Oct  5 13:00:25 MediaServer root: /dev/md5:            root     kernel mount /mnt/disk5
Oct  5 13:00:25 MediaServer root: /dev/md6:            root     kernel mount /mnt/disk6
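The listing is long, but the only user-space holders in it are the two smbd processes (PIDs 27376 and 27725); every other row is the kernel's own mount entry. A rough sketch for boiling such a log down, assuming the column layout shown above (PID third from the end, command last); the function name is made up:

```shell
#!/bin/bash
# Hypothetical helper: reduce `fuser -mv` output (raw or as logged above)
# to unique "PID COMMAND" pairs.
# Kernel rows ("root kernel mount ...") have no numeric PID and are skipped.
# This is a heuristic: a share name with a number in the wrong place could fool it.
holders_from_fuser() {
    awk 'NF >= 3 && $(NF-2) ~ /^[0-9]+$/ { print $(NF-2), $NF }' | sort -u
}

# Example, against the system log:
#   grep 'MediaServer root:' /var/log/syslog | holders_from_fuser
```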

 

This command let the shutdown continue:

/usr/bin/fuser -mvk /mnt/user/* /mnt/disk*[!s] /mnt/cache /dev/md* 2>&1 | logger

 

and produced the following in the log:

Oct  5 13:00:32 MediaServer root:                      USER        PID ACCESS COMMAND
Oct  5 13:00:32 MediaServer root: /mnt/user/AV Media:  root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Computer Backups:
Oct  5 13:00:32 MediaServer root:                      root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/DVD Movies:
Oct  5 13:00:32 MediaServer root:                      root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Movies:    root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Music:     root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/NetDrive:  root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Pictures:  root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Public:    root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Software Development:
Oct  5 13:00:32 MediaServer root:                      root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Users:     root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/Videos:    root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/appdata:   root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/docker:    root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/domains:   root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/iTunes:    root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/isos:      root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/user/system:    root     kernel mount /mnt/user
Oct  5 13:00:32 MediaServer root:                      root      27376 f.c.. smbd
Oct  5 13:00:32 MediaServer root:                      root      27725 f.c.. smbd
Oct  5 13:00:32 MediaServer root: /mnt/disk1:          root     kernel mount /mnt/disk1
Oct  5 13:00:32 MediaServer root: /mnt/disk2:          root     kernel mount /mnt/disk2
Oct  5 13:00:32 MediaServer root: /mnt/disk3:          root     kernel mount /mnt/disk3
Oct  5 13:00:32 MediaServer root: /mnt/disk4:          root     kernel mount /mnt/disk4
Oct  5 13:00:32 MediaServer root: /mnt/disk5:          root     kernel mount /mnt/disk5
Oct  5 13:00:32 MediaServer root: /mnt/disk6:          root     kernel mount /mnt/disk6
Oct  5 13:00:32 MediaServer root: /mnt/cache:          root     kernel mount /mnt/cache
Oct  5 13:00:32 MediaServer root: /dev/md1:            root     kernel mount /mnt/disk1
Oct  5 13:00:32 MediaServer root: /dev/md2:            root     kernel mount /mnt/disk2
Oct  5 13:00:32 MediaServer root: /dev/md3:            root     kernel mount /mnt/disk3
Oct  5 13:00:32 MediaServer root: /dev/md4:            root     kernel mount /mnt/disk4
Oct  5 13:00:32 MediaServer root: /dev/md5:            root     kernel mount /mnt/disk5
Oct  5 13:00:32 MediaServer root: /dev/md6:            root     kernel mount /mnt/disk6
Oct  5 13:00:33 MediaServer emhttp: shcmd (9134): set -o pipefail ; umount /mnt/user |& logger
Oct  5 13:00:33 MediaServer emhttp: shcmd (9135): rmdir /mnt/user |& logger
Oct  5 13:00:33 MediaServer emhttp: shcmd (9136): rm -f /boot/config/plugins/dynamix/mover.cron
Oct  5 13:00:33 MediaServer emhttp: shcmd (9137): /usr/local/sbin/update_cron &> /dev/null
Oct  5 13:00:33 MediaServer emhttp: Unmounting disks...
Oct  5 13:00:33 MediaServer emhttp: shcmd (9138): umount /mnt/disk1 |& logger
Oct  5 13:00:33 MediaServer kernel: XFS (md1): Unmounting Filesystem
Oct  5 13:00:33 MediaServer emhttp: shcmd (9139): rmdir /mnt/disk1 |& logger
Oct  5 13:00:33 MediaServer emhttp: shcmd (9140): umount /mnt/disk2 |& logger
Oct  5 13:00:34 MediaServer kernel: XFS (md2): Unmounting Filesystem
Oct  5 13:00:34 MediaServer emhttp: shcmd (9141): rmdir /mnt/disk2 |& logger
Oct  5 13:00:34 MediaServer emhttp: shcmd (9142): umount /mnt/disk3 |& logger
Oct  5 13:00:34 MediaServer kernel: XFS (md3): Unmounting Filesystem
Oct  5 13:00:34 MediaServer emhttp: shcmd (9143): rmdir /mnt/disk3 |& logger
Oct  5 13:00:34 MediaServer emhttp: shcmd (9144): umount /mnt/disk4 |& logger
Oct  5 13:00:34 MediaServer kernel: XFS (md4): Unmounting Filesystem
Oct  5 13:00:34 MediaServer emhttp: shcmd (9145): rmdir /mnt/disk4 |& logger
Oct  5 13:00:34 MediaServer emhttp: shcmd (9146): umount /mnt/disk5 |& logger
Oct  5 13:00:34 MediaServer kernel: XFS (md5): Unmounting Filesystem
Oct  5 13:00:34 MediaServer emhttp: shcmd (9147): rmdir /mnt/disk5 |& logger
Oct  5 13:00:34 MediaServer emhttp: shcmd (9148): umount /mnt/disk6 |& logger
Oct  5 13:00:34 MediaServer kernel: XFS (md6): Unmounting Filesystem
Oct  5 13:00:35 MediaServer emhttp: shcmd (9149): rmdir /mnt/disk6 |& logger
Oct  5 13:00:35 MediaServer emhttp: shcmd (9150): umount /mnt/cache |& logger
Oct  5 13:00:35 MediaServer kernel: XFS (sdi1): Unmounting Filesystem
Oct  5 13:00:35 MediaServer emhttp: shcmd (9151): rmdir /mnt/cache |& logger
Oct  5 13:00:35 MediaServer kernel: mdcmd (53): stop 
Oct  5 13:00:35 MediaServer kernel: md1: stopping
Oct  5 13:00:35 MediaServer kernel: md2: stopping
Oct  5 13:00:35 MediaServer kernel: md3: stopping
Oct  5 13:00:35 MediaServer kernel: md4: stopping
Oct  5 13:00:35 MediaServer kernel: md5: stopping
Oct  5 13:00:35 MediaServer kernel: md6: stopping
Oct  5 13:00:35 MediaServer emhttp: shcmd (9152): rmmod md-mod |& logger
Oct  5 13:00:35 MediaServer kernel: md: unRAID driver removed

 

Hope this is helpful.

 

EDIT: All shares were still available.  I attached the complete log in case you want to see it.

 

Got some time to look at this issue in more detail... From the syslog:

 

This is where it looks like shutdown was started:

Oct  4 21:58:43 MediaServer emhttp: Stopping services...

 

Everything looks ok so far, then after initiating docker container shutdown we see:

Oct  4 21:58:58 MediaServer kernel: zmdc.pl[10613]: segfault at 18 ip 00002ab0edd304a9 sp 00007ffc406e9950 error 4 in libc-2.19.so[2ab0edcb0000+1bb000]

 

I don't think that has anything to do with array Stop timeout, I just thought I'd point that out.  Moving on...

 

Here is where Samba is stopped:

Oct  4 21:59:10 MediaServer emhttp: shcmd (1693): /etc/rc.d/rc.samba stop |& logger

 

But a few lines later we see:

Oct  4 21:59:12 MediaServer emhttp: Recycle Bin stopped...
Oct  4 21:59:12 MediaServer Recycle Bin: Recycle Bin stopped

 

I'm assuming this is a plugin using the samba recycle bin feature?  If so, something is not right because when

/etc/rc.d/rc.samba stop

is executed, it should have completely killed nmbd and all smbd processes.  But the reason unmount is hanging later is that there is still an smbd process holding the /mnt/user mount point.  Hence either rc.samba stop did not actually stop samba (because of the plugin), or later trying to stop the recycle bin plugin actually restarted an smbd process.

 

My suggestion is to remove plugins until you get a clean shutdown and then figure out what's wrong with the last removed plugin  :P

 

It was the Recycle Bin plugin.  I've made a fix so it will not restart samba when the server is going down.  I don't think it came up before because in the powerdown plugin I stopped the plugins first and then stopped samba.


It was the Recycle Bin plugin.  I've made a fix so it will not restart samba when the server is going down.  I don't think it came up before because in the powerdown plugin I stopped the plugins first and then stopped samba.

How did you stop the plugins?  Plugins should tie into 'stopping_svs' or 'unmounting_disks' emhttp event in order to gracefully exit on array Stop.


Yes.  That's how I did it.
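For plugin authors following along, here is a bare-bones sketch of what such a stop hook could look like. The two event names are the ones mentioned above. How emhttp hands the event to a plugin is an assumption here (first argument), and the function name and messages are made up.

```shell
#!/bin/bash
# Hypothetical plugin stop hook, for illustration only.
on_event() {
    case "$1" in
        stopping_svs|unmounting_disks)
            # Release anything the plugin holds open on the array here.
            # Do NOT restart samba at this point: a fresh smbd would hold
            # /mnt/user again and block the unmount.
            echo "plugin: stopping for event $1"
            ;;
        *)
            echo "plugin: ignoring event $1"
            ;;
    esac
}
```

The point of the sketch is the comment in the middle: whatever the hook does, it must leave nothing running that touches /mnt/user.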


For example, maybe someone has a loopback-mounted file which is located on the array/cache.  In this case it will not show up in 'fuser', but if we force a Stop it could result in data loss inside that loopback.

 

I've certainly done this in the past and been scratching my head trying to work out what the problem was as fuser showed no open file.

 

Running 'mount' to list the mounted file systems made it all clear and I was able to unmount the loopback and the system proceeded to shutdown correctly.

 

haha, if we made 'Stop' forcibly shut things down, instead of some users complaining their array won't stop, we'd have other users complaining we forced a stop and clobbered their data  ::)

 

Over all the years that I have been working with the powerdown plugin I never once heard that a user lost data.  Virtually 100% of the time I heard "I want the server to shutdown cleanly".  This is why the powerdown plugin was created in the first place.  In the early days it was pretty brute force in clobbering processes so the server could complete a shutdown.  I revised it to make it more graceful in the way it stopped any processes keeping the array drives from unmounting.  It also unmounted any loop mounts on the array.

 

Let me point out a few things for you to consider:

- It is unrealistic to think someone would go out and look for something holding the shutdown and clean it up.  I'm hearing of a lot of servers that run remote and there is no way that can be done.

- Some users experience a lot of power outages and insist that the server shutdown and not hang so their UPS battery does not run down.  They don't want the battery to be exhausted and not have enough power to shutdown the server if another outage occurs quickly after coming back on.

- Any time the server performs a shutdown, normally or from power outage, a user expects to potentially lose some work or a file or two.  It just happens.

- A lot of the powerdown hangs come from plugin issues.  For example my Recycle Bin.  As part of its stop process it made changes to the samba configuration and restarted samba.  This would then hang the unmounting.  Killing open processes on the array devices would get around this.

 

If the server shutdown is forced as you have currently implemented, you have created exactly the situation you seem to be concerned with - clobbering processes and forcing a shutdown.  Why not just do my suggestion and kill active pids on the array drives to clean things up and then unmount the drives instead of letting the 60 second time out take the server down?
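A rough sketch of that suggestion, in a gentler form than plain `fuser -k` (which sends SIGKILL by default): ask the holders to exit with SIGTERM first and only force what is left. This is not what unRAID or the powerdown plugin actually does. The function name, the RUN switch and the grace period are assumptions, and loop mounts would still need their own unmount.

```shell
#!/bin/bash
# Hypothetical sketch. RUN defaults to "echo" so the commands are only printed;
# set RUN="" (as root) to really execute them.
RUN="${RUN-echo}"

# release_mounts GRACE_SECONDS MOUNTPOINT...
release_mounts() {
    local grace="$1"; shift
    $RUN fuser -k -TERM -m "$@"    # polite request first
    $RUN sleep "$grace"            # give processes time to close their files
    $RUN fuser -k -KILL -m "$@"    # then force whatever is still there
}
```

Called as `release_mounts 5 /mnt/user /mnt/cache`, it prints the three commands it would run, which makes it easy to check before trusting it with real data.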

 

I also expect that you will be getting a lot of feedback about the archiving of logs that was a feature of the powerdown plugin.  It was done very early on by WeeboTech and was a pretty cool feature.  You seem to be concerned about excessive flash writes, but I think the quality of flash drives has improved and this is much less of a concern these days.

 

In summary, I think you are concerned about a non-issue.

 

EDIT: If you feel that log archiving is not something you are interested in doing, I'll add it as an option to my Tips and Tweaks plugin.  A lot of people really like the log archiving.


Thanks for the response Tom. That takes care of my primary concern (guaranteed shutdown on power loss).

 

I do have to agree with the points that dlandon, ljm42 and garycase have brought up.

 

Just because a user can open a telnet session to move files with mc doesn't mean they know anything past that. I would venture to guess that most of the users here come from a Windows background. Some of the advanced users of course have a much better handle on Linux.

 

I would give most users the benefit of the doubt to know that they shouldn't be actively writing to the tower during shutdown. But that doesn't mean something wasn't just left open, like an mc window.

 

For example, I have a friend I got onto unRAID and I frequently help him out. I've been there when he hits Stop and the process hangs. He sits and waits, nothing happens, and he figures the only way to get out of it is to hold the power button and cause a hard shutdown. I think this is much worse than killing pids or what have you.

 

As far as writing logs to flash, I'm not sure why no one's pointed out that this would only happen on shutdown/restart. I think most users in general want as much uptime as possible, so shutdowns/reboots will be very infrequent and therefore writes to flash will also be infrequent. Hell, I've probably caused more writes to flash by installing plugins than by rebooting.


- Any time the server performs a shutdown, normally or from power outage, a user expects to potentially lose some work or a file or two.  It just happens.

Really?  :o  I wouldn't have thought that it is expected to lose a file or two when I shutdown my server, or that it is acceptable for that to happen, especially under a normal shutdown procedure.

 

Sure if I've got a file open, forget to save it and then shutdown I'd accept that as my own fault.  But not under any other circumstances in a normal shutdown.  A forced shutdown due to power failure is a different story.  emhttp has to take every possible precaution to guard against any data loss, and when those precautions fail, then force the system down to protect all the files, and hope for the best on the ones that may have been open for write at the time.


- Any time the server performs a shutdown, normally or from power outage, a user expects to potentially lose some work or a file or two.  It just happens.

Really?  :o  I wouldn't have thought that it is expected to lose a file or two when I shutdown my server, or that it is acceptable for that to happen, especially under a normal shutdown procedure.

 

Sure if I've got a file open, forget to save it and then shutdown I'd accept that as my own fault.  But not under any other circumstances in a normal shutdown.  A forced shutdown due to power failure is a different story.  emhttp has to take every possible precaution to guard against any data loss, and when those precautions fail, then force the system down to protect all the files, and hope for the best on the ones that may have been open for write at the time.

 

You and I are saying the same thing.  I just didn't say it very well.


When I said "bullet proof" earlier, I meant that in the sense that if you tell the system to shut down it WILL shut down ... period.    Not hang because the Recycle Bin is active (as dlandon's system did) ... or for any other reason.

 

As dlandon noted r.e. "Powerdown" shutting down:

... Virtually 100% of the time I heard "I want the server to shutdown cleanly"

 

Absolutely agree with a slight change:  instead of "... to shutdown cleanly", I'd change it to "... to shutdown -- cleanly if at all possible."    i.e. the user wants the system to absolutely shutdown!!

 

If it's easy to distinguish which case is underway, you could have two different ways to handle hangs:

 

(a)  If the shutdown was initiated outside of the GUI -- i.e. by the UPS code due to a power outage; via a remote PowerOff command via ssh (e.g. thru PLink); or via the power button or Ctrl-Alt-Del; then everything should be forced and it should absolutely shut down;

 

(b)  If the shutdown was initiated by simply clicking Shutdown (or reboot) from within the GUI, then if something is hung you could perhaps send a note to the GUI indicating that there is an issue with the shutdown and offer options to  "Cancel Shutdown" (to allow the user to deal with it) and "Force Shutdown" [which would treat it like case (a)].    I'd automatically select the "Force" option if there's no response within a minute or so.

 

If it's too much of a hassle to segregate the actions based on whether the GUI was the source of the shutdown request, then I'd just go with (a) all the time => i.e. be certain the system shuts down, regardless of what needs to be forced.  The PowerDown plugin has done that for years, and as dlandon noted:

 

... Over all the years that I have been working with the powerdown plugin I never once heard that a user lost data.

 

 


When I said "bullet proof" earlier, I meant that in the sense that if you tell the system to shut down it WILL shut down ... period.    Not hang because the Recycle Bin is active (as dlandon's system did) ... or for any other reason.

His shutdown did not hang, it just got delayed by 60 sec because the Recycle Bin plugin bug prevented clean array Stop.

 

(a)  If the shutdown was initiated outside of the GUI -- i.e. by the UPS code due to a power outage; via a remote PowerOff command via ssh (e.g. thru PLink); or via the power button or Ctrl-Alt-Del; then everything should be forced and it should absolutely shut down;

Yes that is what it does now.

 

(b)  If the shutdown was initiated by simply clicking Shutdown (or reboot) from within the GUI, then if something is hung you could perhaps send a note to the GUI indicating that there is an issue with the shutdown and offer options to  "Cancel Shutdown" (to allow the user to deal with it) and "Force Shutdown" [which would treat it like case (a)].    I'd automatically select the "Force" option if there's no response within a minute or so.

The way 'stock' webGui works you cannot initiate a shutdown/reboot without having first clicked Stop.  It's "Stop" that can sit in a polling loop.  Even if you have 'System Buttons' dynamix plugin, it can still hang because the emhttp cmdStop API call is still used.

 

The way we are going to fix this is by tying the Reboot/Power Off webGui buttons directly to poweroff/reboot commands (with 'are you sure' dialogs), and also making these buttons available whether array is stopped or not, whether emhttp is stuck in cmdStop loop or not.

 

If user gets into a situation where array won't stop, perhaps because they have a shell open somewhere, or buggy plugin, they can click the 'Reboot' button which will shutdown server and then reboot it.  (Or as they can do now, they could use the 3-finger-salute, or type 'reboot' at the console or open a new bash shell and type 'reboot').

 

Note: at the end of the shutdown process the md/unraid driver is cleanly stopped so that upon reboot a parity check will not be necessary - this is really the crux of the matter.

Link to comment

Let me point out a few things for you to consider:

- It is unrealistic to think someone would go out and look for something holding the shutdown and clean it up.  I'm hearing of a lot of servers that run remote and there is no way that can be done.

I think we are not on the same page.  Refer to my reply above to garycase.  A poweroff/reboot will always succeed eventually.  It's explicit array Stop that can hang (by clicking Stop button on webGui).  My proposal is that we change the action of the Reboot/Poweroff buttons so that they can be clicked at any time.  If a user gets stuck in such a loop and doesn't know how to clear it, they can click Reboot.  This is the cleanest way to ensure there are no lingering processes which could interact in strange ways if array is simply re-Started following a forced Stop.

 

- Some users experience a lot of power outages and insist that the server shutdown and not hang so their UPS battery does not run down.  They don't want the battery to be exhausted and not have enough power to shutdown the server if another outage occurs quickly after coming back on.

- Any time the server performs a shutdown, normally or from power outage, a user expects to potentially lose some work or a file or two.  It just happens.

Agreed, this is how it works now.

 

- A lot of the powerdown hangs come from plugin issues.  For example my Recycle Bin.  As part of its stop process it made changes to the samba configuration and restarted samba.  This would then hang the unmounting.  Killing open processes on the array devices would get around this.

Killing open processes also masks bugs in plugins.  I am against that.

 

If the server shutdown is forced as you have currently implemented, you have created exactly the situation you seem to be concerned with - clobbering processes and forcing a shutdown.  Why not just do my suggestion and kill active pids on the array drives to clean things up and then unmount the drives instead of letting the 60 second time out take the server down?

Forcing a poweroff is different because as you pointed out above, it could be because we're running on battery power.  There is no other choice than to kill processes and unmount devices because, A) unmount will sync metadata, and B) at the end of rc.6 we cleanly stop md/unraid driver so that parity check won't be triggered upon reboot.

 

I also expect that you will be getting a lot of feedback about the archiving of logs that was a feature of the powerdown plugin.  It was done very early on by WeeboTech and was a pretty cool feature.  You seem to be concerned about excessive flash writes, but I think the quality of flash drives has improved and is much less a concern these days.

I think I already mentioned that we would add this.  But in your solution:

 

# capture the system diagnostics
logger "Capture diagnostics to /boot/logs"
echo "Capture diagnostics to /boot/logs"
rm -f /boot/logs/*diagnostics*.zip
/usr/local/sbin/diagnostics

 

Do we really want to execute that

rm -f /boot/logs/*diagnostics*.zip

command?

Link to comment
His shutdown did not hang, it just got delayed by 60 sec because the Recycle Bin plugin bug prevented clean array Stop.

 

It wasn't a bug.  It was because you stop samba before plugins are stopped.  I think this is not the right time.  I would shut down Dockers, VMs, and all plugins and then stop samba.  I am expecting some issues with UD because of the way you do this because UD shares the mounted devices and when it stops it does some modifications to the smb configuration to un-share the devices.  I also expect there are other plugins using samba to have problems.

Link to comment

The code does not have any callout that explicitly "stops" plugins.  Instead plugins must tie into the proper event to execute their "stop" code.

 

Here is the cmdStop sequence:

 

Print out message "Stopping services..."
generate "stopping_svcs" event
Shutdown libvirt (VM's)
Shutdown docker (containers)
Stop samba (smb services)
Stop NFS
Stop AFP
Stop avahi
execute "sync"
generate "unmounting_disks" event
unmount user share file system (/mnt/user and /mnt/user0)
unmount array disk devices
unmount cache disk/pool
generate "stopping_array" event
stop the md/unraid driver
restart samba (to bring usb share back online)
generate "stopped" event

 

Plugins should use the appropriate event to shutdown.  Which event depends on the plugin.  Sounds like the Recycle Bin plugin should use "stopping_svcs" event.
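To illustrate the idea of tying plugin "stop" code to an event, here is a minimal sketch of a handler that dispatches on the event name. The event names are the ones in the sequence above; the function name `handle_event`, the way the event name gets passed in, and the echoed actions are all assumptions made for the example, not the actual emhttp plugin interface.

```shell
#!/bin/sh
# Sketch only: dispatch plugin cleanup on the cmdStop event name.
# handle_event and its actions are hypothetical placeholders.
handle_event() {
  case "$1" in
    stopping_svcs)
      # In the sequence above samba/NFS are still running at this point,
      # so share configuration changes belong here.
      echo "remove plugin shares"
      ;;
    unmounting_disks)
      # Services are already stopped; only release files on array/cache.
      echo "release files on array/cache"
      ;;
    stopped)
      echo "final cleanup"
      ;;
    *)
      # Events this plugin does not care about are ignored.
      ;;
  esac
}
```

A plugin that edits the smb configuration, as Recycle Bin does, would put that work in the `stopping_svcs` branch.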

Link to comment

How about this way:

 

Print out message "Stopping services..."
generate "stopping_svcs" event
Shutdown libvirt (VM's)
Shutdown docker (containers)
generate "unmounting_disks" event  <-- This is what I mean by stopping plugins
Stop samba (smb services)
Stop NFS
Stop AFP
Stop avahi
execute "sync"
unmount user share file system (/mnt/user and /mnt/user0)
unmount array disk devices
unmount cache disk/pool
generate "stopping_array" event
stop the md/unraid driver
restart samba (to bring usb share back online)
generate "stopped" event

 

I'm concerned you are pulling the rug out from under some plugins.

Link to comment

I don't understand the question/issue  :P

 

What do you mean "pulling the rug out"?

Link to comment

First let me say that I appreciate you finally documenting the shutdown scheme.  I never really understood the details.  I'm sure this will help plugin authors understand better where to install their plugin start and stop events.

 

For example, UD can't unmount devices at the "stopping_svcs" event because Dockers and VMs on a UD device are still running.  I am doing it at the "unmounting_disks" event and it does some samba work because of sharing the devices, but samba is stopped.  UD also can share with NFS and has to un-share NFS devices on shutdown, but NFS is stopped.  This is what I mean by pulling the rug out from under a plugin.

 

Maybe this would work and be more in line with the "events" you've defined:

Print out message "Stopping services..."
Shutdown libvirt (VM's)
Shutdown docker (containers)
generate "stopping_svcs" event   <- move stopping services event to here.
Stop samba (smb services)
Stop NFS
Stop AFP
Stop avahi
execute "sync"
generate "unmounting_disks" event
unmount user share file system (/mnt/user and /mnt/user0)
unmount array disk devices
unmount cache disk/pool
generate "stopping_array" event
stop the md/unraid driver
restart samba (to bring usb share back online)
generate "stopped" event

 

and would let me move the UD and Recycle Bin stop events there to be more consistent with your event scheme and not cause a plugin problem.

Link to comment

Ok I see what you mean.  Actually this is what I originally considered doing:

 

Print out message "Stopping services..."
generate "stopping_libvirt" event (if libvirt started)
Shutdown libvirt (VM's)
generate "stopping_docker" event (if docker started)
Shutdown docker (containers)
generate "stopping_svcs" event
Stop samba (smb services)
Stop NFS
Stop AFP
Stop avahi
execute "sync"
generate "unmounting_disks" event
unmount user share file system (/mnt/user and /mnt/user0)
unmount array disk devices
unmount cache disk/pool
generate "stopping_array" event
stop the md/unraid driver
restart samba (to bring usb share back online)
generate "stopped" event

 

But then we started grouping libvirt/docker with the other "services" (smb/nfs/afp) in our thinking and so didn't give it much more thought  ;)

 

Probably this would be better?

Link to comment

Yes.  I like that a lot because it is right in line with the events and their meaning and adding the other events for VMs and Dockers makes sense.

 

Thanks for listening and working this out.  Once you settle on this, would it make sense to sticky this in a post for developers?  I think it is pretty important for all to understand this scheme.  I never have really understood the events in the scheme of things.

 

I'm going to go ahead and make the change from the unmounting_disks event to stopping_svcs for Recycle Bin.  It makes a lot more sense that way.

 

I'll work something out with UD once you implement these changes and I'll have to work on backward compatibility for earlier unRAID versions.

Link to comment

When I said "bullet proof" earlier, I meant that in the sense that if you tell the system to shut down it WILL shut down ... period.    Not hang because the Recycle Bin is active (as dlandon's system did) ... or for any other reason.

His shutdown did not hang, it just got delayed by 60 sec because the Recycle Bin plugin bug prevented clean array Stop.

 

Ahh -- I didn't understand that.  I have no issue with a 60-sec delay, as long as the ultimate result is a shutdown.

 

 

(a)  If the shutdown was initiated outside of the GUI -- i.e. by the UPS code due to a power outage; via a remote PowerOff command via ssh (e.g. thru PLink); or via the power button or Ctrl-Alt-Del; then everything should be forced and it should absolutely shut down;

Yes that is what it does now.

 

Very good -- while I think the sequencing you're discussing with dlandon can indeed improve the process, the fundamental result needs to be that a shutdown command absolutely results in a shutdown -- and it sounds like the built-in PowerOff command now achieves that.    Thanks for this EXCELLENT improvement.

 

 

... and the improvements you detailed for the GUI are excellent:

The way we are going to fix this is by tying the Reboot/Power Off webGui buttons directly to poweroff/reboot commands (with 'are you sure' dialogs), and also making these buttons available whether array is stopped or not, whether emhttp is stuck in cmdStop loop or not.

 

If user gets into a situation where array won't stop, perhaps because they have a shell open somewhere, or buggy plugin, they can click the 'Reboot' button which will shutdown server and then reboot it.  (Or as they can do now, they could use the 3-finger-salute, or type 'reboot' at the console or open a new bash shell and type 'reboot').

 

 

... and of course your final comment:

Note: at the end of the shutdown process the md/unraid driver is cleanly stopped so that upon reboot a parity check will not be necessary - this is really the crux of the matter.

 

... is right on.  A user wants two things to be true when he clicks on ShutDown (or ReBoot):  (a) the system absolutely WILL shutdown/reboot; and (b) it will be a "clean" process so no parity check is initiated on the next startup.

 

Link to comment

Do we really want to execute that

rm -f /boot/logs/*diagnostics*.zip

command?

 

That was done to limit the number of diagnostics files to the latest one only because each is unique with a date and time stamp.

 

Maybe this is a good opportunity to easily archive stuff.  Maybe keep 10 or so copies and delete any older than the latest 10.  All of the logs are in the diagnostics and the diagnostics files are date and time stamped.
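Something along these lines could replace the blanket `rm -f` and keep the latest 10. It is a minimal sketch, assuming the archives live in /boot/logs and that their modification times reflect when they were created; `prune_diagnostics` is a made-up name, not an existing unRAID command.

```shell
#!/bin/sh
# Sketch only: keep the newest N diagnostics archives and delete the rest.
# prune_diagnostics is a hypothetical helper.
prune_diagnostics() {
  local dir="${1:-/boot/logs}"   # directory holding the archives
  local keep="${2:-10}"          # how many of the newest archives to keep
  # List newest first, skip the first $keep entries, remove what is left.
  # Assumes file names contain no newlines (true for date-stamped names).
  ls -1t -- "$dir"/*diagnostics*.zip 2>/dev/null \
    | tail -n +"$((keep + 1))" \
    | while IFS= read -r old; do
        rm -f -- "$old"
      done
}
```

Calling `prune_diagnostics /boot/logs 10` just before running /usr/local/sbin/diagnostics would leave room for the new archive while keeping recent history.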

 

BTW, windows gives me an error when I try to extract any diagnostics zip file.

Link to comment

By the way, I'd really like to thank Tom & crew for the features we've now got in UnRAID.

 

A  L .. O .. N .. G  time ago there was a lot of discussion re adding what I (and others -- Joe, WeeboTech, etc.) considered mandatory NAS features to UnRAID => notably UPS support and Notifications ... and of course a rock-solid shutdown process was a necessary feature for good UPS support.  Dual parity has also been a frequently requested feature.

 

And we now have them ALL !!  :)

 

From an e-mail Tom sent me over 7 years ago:

------------------------------------------------------------------------------------------

... We are adding code to make unRAID appeal more to small business, ie:

- Active Directory integration [done]

- Email alerts [in process]

- UPS control [in process]

------------------------------------------------------------------------------------------

 

... and of course it's gone far beyond that in its capabilities => Dockers, VM's, etc. along with a MUCH improved GUI

 

 
