umount crashes the server when stopping the array



Since beta6a, and continuing through rc3, I have had an intermittent issue while stopping the array.

 


 

I have a Supermicro X8SIL-V motherboard with 6 onboard SATA ports. If I only use these ports, I don't see the issue.

 

If there is more than one SATA controller, the issue appears.

 

I did several cycles of mounting/unmounting disk6 (the disk attached to the PCI-e controller) in maintenance mode as well as in normal mode (with SMB Export set to off), and none of them failed.

 

I think this bug is triggered because the umount of disk5, the last disk on the onboard SATA controller, is not completely finished when the umount of disk6, attached to the PCI-e controller, is started next in the sequence.

 

What led me to this conclusion is that in my syslog, two disks remain busy after umount crashes the kernel. It looks like the unmount of the second-to-last disk and the unmount of the last disk conflict with each other, perhaps from a race condition between them.

 

I think that if a small delay were added between umounts, this issue could be solved.
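Just to illustrate the idea (this is only a sketch, not the actual unRAID stop script; the /mnt/diskN mount points are my array disks):

#!/bin/bash
# unmount each array disk in turn, pausing briefly so the previous
# umount can fully settle before the next one starts
for n in 1 2 3 4 5 6; do
    umount /mnt/disk$n && echo "disk$n unmounted"
    sleep 2    # small delay between umounts
done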

 

Attached is the syslog with the kernel BUG info.

syslog.zip

  • 2 weeks later...

An update:

 

Since I started unmounting disk6 (connected to the PCI-e SATA controller) manually, everything is running smoothly. No more crashes.
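In practice, before pressing Stop in the web GUI I just run this from a telnet session (disk6 being the drive on the PCI-e controller in my case):

umount /mnt/disk6    # unmount the PCI-e attached disk by hand first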

 

Then I can press Stop, unRAID will unmount the disks connected to the onboard SATA, and the array stops.

 

 

I have a fix for this in -rc4 that may solve the problem.  I say "may" because, though I can make it happen (infrequently) without the fix, with the fix I can't make it happen at all.  But maybe after the 543rd stop/start sequence it will happen.  The problem is a race condition in reiserfs that I think was introduced when they removed the "BKL" (big kernel lock) from the code.  I don't understand all of that code well enough to isolate and solve it properly, so my solution is a workaround, though I think it works.  Maybe I should go and try to visit Hans in San Quentin  :o

Oh, that last part was too funny. I'll drink to that.


I'm getting the same issue on rc4.

 

I try to stop the array and the unRAID server locks up.  It doesn't even respond to a ping.

 

Can you provide a syslog that is saved as late as possible, and/or run a console or Telnet/PuTTY 'tail -f' and show us the very last messages?

Here's what I think you're after.

 

Jun  6 18:25:57 tdm status[23414]: No active PIDS on the array
Jun  6 18:25:58 tdm rc.unRAID[23452]: Killing active pids on the array drives
Jun  6 18:25:58 tdm rc.unRAID[23480]: Umounting the drives
Jun  6 18:25:58 tdm rc.unRAID[23484]: /dev/md1 umounted
Jun  6 18:25:58 tdm rc.unRAID[23484]: /dev/md2 umounted
Jun  6 18:25:58 tdm rc.unRAID[23484]: /dev/md3 umounted
Jun  6 18:25:59 tdm rc.unRAID[23494]: Stopping the Array
Jun  6 18:25:59 tdm kernel: mdcmd (20): stop 
Jun  6 18:25:59 tdm kernel: md1: stopping
Jun  6 18:25:59 tdm kernel: md2: stopping
Jun  6 18:25:59 tdm kernel: md3: stopping
Jun  6 18:26:02 tdm mdstatusdiff[23506]: --- /tmp/mdcmd.23346.1^I2012-06-06 18:25:59.481577266 +0800
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +++ /tmp/mdcmd.23346.2^I2012-06-06 18:26:02.631635271 +0800
Jun  6 18:26:02 tdm mdstatusdiff[23506]: @@ -1,14 +1,14 @@
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbName=/boot/config/super.dat
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbVersion=2.1.3
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbCreated=1314335405
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -sbUpdated=1338978072
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -sbEvents=260
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -sbState=0
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +sbUpdated=1338978359
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +sbEvents=261
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +sbState=1
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbNumDisks=4
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbSynced=1338945606
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbSyncErrs=0
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdVersion=2.1.3
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -mdState=STARTED
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +mdState=STOPPED
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdNumProtected=4
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdNumDisabled=0
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdDisabledDisk=0

 

Looking at the time, that *could* be a console-generated shutdown command.  The lockup happens more on stopping the array from the web GUI.

 

I'm not sure what tail -f is.  How do I set it up?  I opened a PuTTY window, typed tail -f, and got the following.  Now what?

 

Linux 3.0.33-unRAID.
root@tdm:~# tail -f
tail: warning: following standard input indefinitely is ineffective


You want:

tail -f /var/log/syslog

 

But I don't recognize "rc.unRAID" in your syslog.  What is this?  Please disable it for now.
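If you're not sure where it comes from, something like this from the console should show which file on the flash references it (assuming the usual /boot/config layout; adjust the paths if yours differ):

grep -rl "rc.unRAID" /boot/config/    # list flash files that mention rc.unRAID
ls /boot/config/plugins/              # see which plugins are installed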


Thanks for the command.

 

In true demonstration mode, it didn't hang this time.  It stopped the array fine from the web GUI (SimpleFeatures) and rebooted from the web GUI.  All good.

 

Between the last hang and now I stripped everything back down to bare unRAID (removed all plugins and packages) and started rebuilding the server from scratch with the add-ons I need.

 

I can only assume that some mixture of plugins/packages was causing the issues.

 

The only lingering issue I have is that the cache won't unmount via the web GUI, so I need to force it with umount -f.
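For reference, something like this from the console shows what is still holding the mount before forcing it (/mnt/cache being the standard cache mount point):

fuser -vm /mnt/cache     # show any processes still using the cache mount
umount /mnt/cache        # try a normal unmount first
umount -f /mnt/cache     # force it only if the normal unmount fails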

 

WRT rc.unRAID: (a) I don't know what it is, and (b) I wouldn't know how to disable it.


Thanks Tom! Updating to rc4 to try this fix. Will report back.

 

When you update to rc4, make sure you do a clean install on your flash to remove any traces of old plugins etc. Test with no plugins and report the results. This includes SimpleFeatures, unMENU, or anything that isn't supplied by Lime Tech's default install.

 

If the issue is still happening, please repost a syslog with rc4.


I've had this happen to me again today with rc4.

 

I do have a few plugins installed, but none of them are any different from the plugins that worked with rc3 and all previous RC and beta releases.

 

I found my web GUI (SimpleFeatures) was unresponsive.  unMENU worked for a while, then that hung too.  From a console window I restarted unMENU fine.  I clicked the "Stop array" button, and after about half a minute the server became unresponsive, not even responding to a ping.

 

I've attached the tail -f syslog, but the only syslog entries that appeared after I clicked "Stop Array" were the following:

 

Jun 13 20:21:35 tdm kernel: mdcmd (56): spinup 0
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:35 tdm kernel: mdcmd (57): spinup 1
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:35 tdm kernel: mdcmd (58): spinup 2
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:35 tdm kernel: mdcmd (59): spinup 3
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:43 tdm mountd[4117]: Caught signal 15, un-registering and exiting.
Jun 13 20:21:44 tdm kernel: nfsd: last server has exited, flushing export cache

 

Throughout the course of the day I had been cleaning up all my plugins and packages and ensuring that only the latest working ones were installed.  I know it was all fine earlier today, as I restarted the server a few times to make sure that the SAB/SB/CP/TransM/FlexGet combo all started without any errors.

 

I could try stripping everything back to core unRAID rc4 again, but that sort of defeats the purpose, as I only have a live server and I really don't want to lose the SAB/SB/CP/TransM/FlexGet automation I have set up.

tdm_hang_syslog.txt


Well, look at it this way: Tom is changing things, right? So you can't go by a third-party plugin that (may have) worked even one version prior. It's up to the plugin creator to adjust and update their plugin. For your sanity and others', first remove all plugins and test the various aspects. If everything checks out, add one plugin at a time and test again. If you find a plugin issue, post in the plugin forum for help (it will most likely help others as well).

 

It's not very hard to back up all the files (.plg, go script, etc.), start fresh, and add them back one by one once you know the RC checks out.
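Something like this from the console covers the backup part (just a sketch; /boot/config/plugins and /boot/config/go are the usual locations on the flash, and the backup folder name is arbitrary):

# copy the plugin files and go script to a dated backup folder on the flash
BACKUP=/boot/plugin-backup-$(date +%Y%m%d)
mkdir -p "$BACKUP"
cp -r /boot/config/plugins "$BACKUP"/
cp /boot/config/go "$BACKUP"/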

 

Yeah, look, I'm not looking for someone's undivided attention to fix my issue if I'm not prepared to spend the time doing proper debugging.  I'm just reporting what I saw and threw up a syslog, and if that helps Tom or others find a bug in unRAID or a package/plugin, great.

 

I had already removed all plugins/packages and slowly re-added what I needed one by one, stopping and restarting the array each time.  All worked well, with the final steps being full server restarts.  The problem appeared a day later when I wanted to stop the array.


It's not so much that; it's about helping to get to the root cause so Tom can reproduce it and say whether it is a bug or not.

 

The plugin owner(s) mostly don't know exactly what you're running from this post, so you won't get help here if it is a plugin problem.

 

You should try running the RC bare: grab a spare PC, copy a ton of data to wherever you normally place data (cache, etc.), run anything else you normally do (understood that it won't be SAB/SB/CP/TransM/FlexGet off of unRAID), and wait a day. If this tests out and the array stops cleanly, you know it's a plugin (as you weren't running any).

 

Then add one plugin, say SAB, run your downloads for a day, and try to stop the array. It's your choice whether to keep adding additional plugins (keeping the prior ones on) or to remove the prior plugin and add a different one.  Rinse and repeat after a full day for each. You WILL find which one is causing the issue.

 

