Stopping server sometimes fails

bfeist · November 8, 2009

Hi all,

I've noticed that, seemingly intermittently, unRAID refuses to stop. The main console gets stuck unmounting one of the drives, and it's not always the same drive.

I've taken a tail of the syslog below (this time it was my cache drive that wouldn't unmount). It just keeps repeating the same thing. If I issue a shutdown -h now via telnet then it does reboot. This has happened with 4.5beta7 and 4.5beta8. Anyone know what's happening? My motherboard is a GA-MA74GM-S2, which has previously been given a full compatibility green light here.

One interesting note: if I unplug the Ethernet cable from the unRAID server and then plug it back in, the unmount succeeds. Perhaps Samba is stuck with a file left open?

Thanks for any suggestions,

Ben

syslog snippet:

Nov  8 08:24:58 Tower emhttp: Retry unmounting disk share(s)...
Nov  8 08:24:59 Tower emhttp: shcmd (234): umount /mnt/cache >/dev/null 2>&1
Nov  8 08:24:59 Tower emhttp: _shcmd: shcmd (234): exit status: 1
Nov  8 08:24:59 Tower emhttp: shcmd (235): rmdir /mnt/cache >/dev/null 2>&1
Nov  8 08:24:59 Tower emhttp: _shcmd: shcmd (235): exit status: 1
Nov  8 08:25:00 Tower emhttp: Retry unmounting disk share(s)...
Nov  8 08:25:01 Tower emhttp: shcmd (236): umount /mnt/cache >/dev/null 2>&1
Nov  8 08:25:01 Tower emhttp: _shcmd: shcmd (236): exit status: 1
Nov  8 08:25:01 Tower emhttp: shcmd (237): rmdir /mnt/cache >/dev/null 2>&1
Nov  8 08:25:01 Tower emhttp: _shcmd: shcmd (237): exit status: 1
Nov  8 08:25:02 Tower emhttp: Retry unmounting disk share(s)...
Nov  8 08:25:03 Tower emhttp: shcmd (238): umount /mnt/cache >/dev/null 2>&1
Nov  8 08:25:03 Tower emhttp: _shcmd: shcmd (238): exit status: 1
Nov  8 08:25:03 Tower emhttp: shcmd (239): rmdir /mnt/cache >/dev/null 2>&1
Nov  8 08:25:03 Tower emhttp: _shcmd: shcmd (239): exit status: 1
Nov  8 08:25:04 Tower emhttp: Retry unmounting disk share(s)...
Nov  8 08:25:05 Tower emhttp: shcmd (240): umount /mnt/cache >/dev/null 2>&1
Nov  8 08:25:05 Tower emhttp: _shcmd: shcmd (240): exit status: 1
Nov  8 08:25:05 Tower emhttp: shcmd (241): rmdir /mnt/cache >/dev/null 2>&1
Nov  8 08:25:05 Tower emhttp: _shcmd: shcmd (241): exit status: 1
Nov  8 08:25:06 Tower emhttp: Retry unmounting disk share(s)...
Nov  8 08:25:07 Tower emhttp: shcmd (242): umount /mnt/cache >/dev/null 2>&1
Nov  8 08:25:07 Tower emhttp: _shcmd: shcmd (242): exit status: 1
Nov  8 08:25:07 Tower emhttp: shcmd (243): rmdir /mnt/cache >/dev/null 2>&1
Nov  8 08:25:07 Tower emhttp: _shcmd: shcmd (243): exit status: 1
Nov  8 08:25:08 Tower emhttp: Retry unmounting disk share(s)...
Nov  8 08:25:09 Tower emhttp: shcmd (244): umount /mnt/cache >/dev/null 2>&1
Nov  8 08:25:09 Tower emhttp: _shcmd: shcmd (244): exit status: 1
Nov  8 08:25:09 Tower emhttp: shcmd (245): rmdir /mnt/cache >/dev/null 2>&1
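
Since the stuck drive varies, it can help to tally which mount points emhttp keeps retrying. The following is only a sketch: stuck_mounts is a hypothetical helper name, it assumes the emhttp log format shown above, and the syslog path in the usage comment is an assumption.

```shell
# stuck_mounts: hypothetical helper; reads syslog text on stdin and prints
# how many times a umount was attempted for each mount point.
stuck_mounts() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "umount") count[$(i + 1)]++
    }
    END { for (m in count) print count[m], m }' | sort -rn
}

# Usage (assumes the syslog lives at /var/log/syslog):
#   stuck_mounts < /var/log/syslog
```

It counts attempts rather than failures, so a mount point with a count of 1 probably unmounted on the first try.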

wholly · November 8, 2009

Use the lsof tool to see if you have files open on a share. The latest versions will block waiting for the file to be closed (for safety reasons) before shutting down.

There are many more threads here about this situation that a search will turn up.

Rob

Joe L. · November 16, 2009

In that syslog extract it appears as if you have either an open file on the cache drive being read or written... OR you have a process that you started when your current directory was /mnt/cache OR you have changed directory to /mnt/cache and it is your current directory.

To find the open files and/or processes keeping the cache drive "busy" and unable to be un-mounted, type:

lsof /mnt/cache

or

lsof /dev/sdX

(where sdX is your cache drive device)
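
The lsof listing can get long when one process has many files open. As a rough sketch, the output can be reduced to one line per process; busy_procs is a hypothetical helper name, and it assumes the usual lsof column layout (COMMAND first, PID second) and command names without spaces.

```shell
# busy_procs: hypothetical helper; reads lsof output on stdin and prints
# each distinct "COMMAND PID" pair, skipping the header line.
busy_procs() {
    awk 'NR > 1 { print $1, $2 }' | sort -u
}

# Usage, with the cache mount point from the post above:
#   lsof /mnt/cache | busy_procs
```

This only lists the processes; it does not close files or unmount anything.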

BW · December 2, 2009

Hi,

I have zero knowledge of Linux and have some questions regarding "Can't stop the server".

Will this script only list the open files, or will it also unmount the server?

I tried typing the commands while some files on the server were open, but it says "no such file or directory".

Once I could not stop the server and found that my PC running Windows 7 was still on. Is this the problem?

Is there a way to stop the server manually from the telnet console in case the one in the menu is not working?

Thanks!

EdgarWallace · April 30, 2010

Hi,

Same here: it's always the same drive that is stuck at "UNMOUNT".

I'm running 4.5.3. What do I need to do if I discover (using lsof) that a file is open?

It's always the same story (most probably since I activated the cache drive): I try to stop and shut down via the web interface and have to hard reset the server after the above message shows up. If I switch off the server, it runs a parity check on every reboot.

Last time my users no longer existed, so I wasn't able to get into the system via AFP, and with that my Time Machine isn't working.

Where should I start? It's kind of frustrating.

Joe L. · April 30, 2010

Basically, the array will not stop if a disk is busy.

1. A disk is busy if a file on it is in use

or

2. A disk is busy if it is the "current directory" for any process.

or

3. A disk is busy if another disk has been mounted on a mount-point (a directory) on it.

The lsof command will not detect the third situation, since no open files exist.
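
For that third situation, the kernel's mount table can be checked instead. This is a minimal sketch, assuming a Linux system where /proc/mounts lists the mount point in the second column; nested_mounts is a hypothetical helper name.

```shell
# nested_mounts: hypothetical helper; prints mount points located beneath
# the given directory. Reads /proc/mounts unless a second argument names
# another file in the same format.
nested_mounts() {
    awk -v mp="${1%/}" 'index($2, mp "/") == 1 { print $2 }' "${2:-/proc/mounts}"
}

# Usage:
#   nested_mounts /mnt/disk2
```

Anything it prints would need to be un-mounted first, before the disk underneath can be.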

Situation #2 could even be your login. If you type "cd /mnt/disk2" you will then have disk2 as your current directory and it can not be un-mounted.

You can type

fuser -cu /dev/md1

fuser -cu /dev/md2

fuser -cu /dev/md3

fuser -cu /dev/md4

fuser -cu /dev/md5

fuser -cu /dev/md6

for each of your disks in turn to identify the process holding a disk busy. If you have any add-on-processes, you'll want to stop them. They might be keeping your disks busy.
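
Typing one fuser command per disk can be wrapped in a loop. A minimal sketch, assuming the array disks appear as /dev/md1, /dev/md2 and so on as in the commands above; check_disks is a hypothetical helper name.

```shell
# check_disks: hypothetical helper; runs "fuser -cu" on every device path
# given as an argument and skips paths that do not exist.
check_disks() {
    for dev in "$@"; do
        [ -e "$dev" ] || continue
        echo "== $dev"
        # fuser exits non-zero when nothing is using the filesystem
        fuser -cu "$dev" || echo "   no processes found"
    done
}

# Usage (the shell expands the pattern to the md devices that exist):
#   check_disks /dev/md[0-9]*
```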

The basic method is to stop any add-on processes you might have running that might be keeping a disk busy.

If you are trying to shut down the array you might install WeeboTech's "powerdown" add-on. It will check for and terminate processes holding disks busy prior to cleanly stopping the array and powering down. You would invoke it as

/sbin/powerdown

Joe L.

EdgarWallace · April 30, 2010

Joe,

thank you very much. Actually I used WeeboTech's "powerdown" add-on in my go script and everything went fine. But it switched off my server at 11pm and that wasn't always a good choice. So I removed it from the go script and since that time I've had the issues....

Btw. some music files were opened and here is the log:

root@Tower:~# /sbin/powerdown
Capturing information to syslog. Please wait...
version[4065]: Linux version 2.6.32.9-unRAID (root@Develop) (gcc version 4.2.3) #1 SMP Fri Feb 26 19:35:20 MST 2010
ls: cannot access /dev/hd[a-z]: No such file or directory
ls: cannot access /dev/hd[a-z]: No such file or directory
/etc/rc.d/rc.unRAID: line 84: ${FILE}: ambiguous redirect
/etc/rc.d/rc.unRAID: line 84: ${FILE}: ambiguous redirect
status[4170]: State: STARTED
status[4170]: D#           Model / Serial          Status         Device    
status[4170]: 0  WDC WD15EARS-00 / WD-WCAVY2530562 DISK_OK        sda       
status[4170]: 1  WDC WD15EARS-00 / WD-WCAVY2657059 DISK_OK        sdb       
status[4170]: 2  WDC WD2500JS-40 / WD-WCANY1940426 DISK_OK        sdc       
status[4170]: SMART overall health assessment
ls: cannot access /dev/hd[a-z]: No such file or directory
status[4170]: /dev/sda: Device is in STANDBY mode, exit(2)
status[4170]: /dev/sdb: Device is in STANDBY mode, exit(2)
status[4170]: /dev/sdc: Device is in STANDBY mode, exit(2)
status[4170]: /dev/sdd: SMART Health Status: OK
status[4170]: /dev/sde: SMART overall-health self-assessment test result: PASSED
status[4170]: ACTIVE PIDS on the array
status[4170]: root      3644  3641  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3652  3644  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3653  3652  0 07:16 ?        00:00:07 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3658  3652  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3660  3652  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3662  3652  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3664  3652  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3669  3652  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3670  3652  0 07:16 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
status[4170]: root      3891  3652  0 13:01 ?        00:00:00 /boot/mediatomb/usr/bin/mediatomb -m /boot/mediatomb -f config
Removing old syslog: /boot/logs/syslog-20100425-030435.txt
Saving current syslog: /boot/logs/syslog-20100430-152901.txt
-rwxrwxrwx 1 root root 149266 Apr 30 15:29 /boot/logs/syslog-20100430-152901.txt
 adding: syslog.txt (deflated 84%)

Broadcast message from root (pts/0) (Fri Apr 30 15:29:01 2010):

The system is going down for system halt NOW!
root@Tower:~# Connection to 192.168.0.1 closed by remote host.
Connection to 192.168.0.1 closed.

The only issue now is that I can't see any user (except root) in the web console but this might be another subject and off topic for this thread.

Thanks again.

[Edit]Guide to include the execution of the above command into a menu entry of unMENU: http://lime-technology.com/forum/index.php?topic=5475.15 Reply #27
