umount crashes the server when stopping the array



Since beta6a, and continuing through rc3, I have had an intermittent issue while stopping the array.

 


 

I have a Supermicro X8SIL-V motherboard with 6 onboard SATA ports. If I only use these ports, I don't see the issue.

 

If there is more than one SATA controller, the issue appears.

 

I did several cycles of mounting/unmounting disk6 (the disk attached to the PCI-e controller) in maintenance mode as well as in normal mode (with SMB Export set to off), and none of them failed.

 

I think this bug is triggered because the umount of disk5, the last disk on the onboard SATA controller, is not completely finished when the umount of disk6, attached to the PCI-e controller, is started next in the sequence.

 

What led me to this conclusion is that in my syslog, two disks remain busy after umount crashes the kernel. It looks like the unmount of the second-to-last disk and the unmount of the last disk conflict with each other, perhaps from a race condition between them.

 

I think that if a small delay were added between umounts, this issue could be solved.
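Just to illustrate the idea (this is only a sketch, not the actual unRAID stop script; the /mnt/diskN mount points are my array disks):

#!/bin/bash
# unmount each array disk in turn, pausing briefly so the previous
# umount can fully settle before the next one starts
for n in 1 2 3 4 5 6; do
    umount /mnt/disk$n && echo "disk$n unmounted"
    sleep 2    # small delay between umounts
done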

 

Attached is the syslog with the kernel BUG info.

syslog.zip

  • 2 weeks later...

An update:

 

Since I started unmounting disk6 (connected to the PCI-e SATA controller) manually, everything is running smoothly. No more crashes.
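In practice, before pressing Stop in the web GUI I just run this from a telnet session (disk6 being the drive on the PCI-e controller in my case):

umount /mnt/disk6    # unmount the PCI-e attached disk by hand first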

 

Then I can press Stop, unRAID will unmount the disks connected to the onboard SATA, and the array stops.

 

 

I have a fix for this in -rc4 that may solve the problem.  I say "may" because, though I can make it happen (infrequently) without the fix, with the fix I can't make it happen at all.  But maybe after the 543rd stop/start sequence it will happen.  The problem is a race condition in reiserfs that I think was introduced when they removed the "BKL" (big kernel lock) from the code.  I don't understand all of that code well enough to isolate and solve it properly, so my solution is a workaround, though I think it works.  Maybe I should go and try to visit Hans in San Quentin  :o

Oh, that last part was too funny. I'll drink to that.


I'm getting the same issue on rc4.

 

I try to stop the array and the unRAID server locks up.  It doesn't even respond to a ping.

 

Can you provide a syslog that is saved as late as possible, and/or run a console or Telnet/PuTTY 'tail -f' and show us the very last messages?

Here's what I think you're after.

 

Jun  6 18:25:57 tdm status[23414]: No active PIDS on the array
Jun  6 18:25:58 tdm rc.unRAID[23452]: Killing active pids on the array drives
Jun  6 18:25:58 tdm rc.unRAID[23480]: Umounting the drives
Jun  6 18:25:58 tdm rc.unRAID[23484]: /dev/md1 umounted
Jun  6 18:25:58 tdm rc.unRAID[23484]: /dev/md2 umounted
Jun  6 18:25:58 tdm rc.unRAID[23484]: /dev/md3 umounted
Jun  6 18:25:59 tdm rc.unRAID[23494]: Stopping the Array
Jun  6 18:25:59 tdm kernel: mdcmd (20): stop 
Jun  6 18:25:59 tdm kernel: md1: stopping
Jun  6 18:25:59 tdm kernel: md2: stopping
Jun  6 18:25:59 tdm kernel: md3: stopping
Jun  6 18:26:02 tdm mdstatusdiff[23506]: --- /tmp/mdcmd.23346.1^I2012-06-06 18:25:59.481577266 +0800
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +++ /tmp/mdcmd.23346.2^I2012-06-06 18:26:02.631635271 +0800
Jun  6 18:26:02 tdm mdstatusdiff[23506]: @@ -1,14 +1,14 @@
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbName=/boot/config/super.dat
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbVersion=2.1.3
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbCreated=1314335405
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -sbUpdated=1338978072
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -sbEvents=260
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -sbState=0
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +sbUpdated=1338978359
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +sbEvents=261
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +sbState=1
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbNumDisks=4
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbSynced=1338945606
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  sbSyncErrs=0
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdVersion=2.1.3
Jun  6 18:26:02 tdm mdstatusdiff[23506]: -mdState=STARTED
Jun  6 18:26:02 tdm mdstatusdiff[23506]: +mdState=STOPPED
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdNumProtected=4
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdNumDisabled=0
Jun  6 18:26:02 tdm mdstatusdiff[23506]:  mdDisabledDisk=0

 

Looking at the time, that *could* be a console-generated shutdown command.  The lockup happens more on stopping the array from the web GUI.

 

I'm not sure what tail -f is.  How do I set it up?  I opened a PuTTY window, typed tail -f, and got the following.  Now what?

 

Linux 3.0.33-unRAID.
root@tdm:~# tail -f
tail: warning: following standard input indefinitely is ineffective


You want:

tail -f /var/log/syslog

 

But I don't recognize "rc.unRAID" in your syslog.  What is this?  Please disable it for now.
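If you're not sure where it comes from, something like this from the console should show which file on the flash references it (assuming the usual /boot/config layout; adjust the paths if yours differ):

grep -rl "rc.unRAID" /boot/config/    # list flash files that mention rc.unRAID
ls /boot/config/plugins/              # see which plugins are installed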


Thanks for the command.

 

In true demonstration mode, it didn't hang this time.  It stopped the array fine from the web GUI (SimpleFeatures) and rebooted from the web GUI.  All good.

 

Between the last hang and now I stripped everything back down to bare unRAID (removed all plugins and packages) and started rebuilding the server from scratch with the add-ons I need.

 

I can only assume that some mixture of plugins/packages was causing the issues.

 

The only lingering issue I have is that the cache won't unmount via the web GUI, so I need to force it with umount -f.
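For reference, something like this from the console shows what is still holding the mount before forcing it (/mnt/cache being the standard cache mount point):

fuser -vm /mnt/cache     # show any processes still using the cache mount
umount /mnt/cache        # try a normal unmount first
umount -f /mnt/cache     # force it only if the normal unmount fails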

 

WRT rc.unRAID: (a) I don't know what it is, and (b) I wouldn't know how to disable it.


Thanks Tom! Updating to rc4 to try this fix. Will report back.

 

When you update to rc4, make sure you do a clean install on your flash to remove any traces of old plugins etc. Test with no plugins and report the results. This includes SimpleFeatures, unMENU, or anything that isn't supplied by Lime Tech's default install.

 

If the issue is still happening, please repost a syslog with rc4.


I've had this happen to me again today with rc4.

 

I do have a few plugins installed, but none of them are any different from the plugins that worked with rc3 and all previous RC and beta releases.

 

I found my web GUI (SimpleFeatures) was unresponsive.  unMENU worked for a while, then that hung too.  From a console window I restarted unMENU fine.  I clicked the "Stop array" button, and after about half a minute the server became unresponsive, not even responding to a ping.

 

I've attached the tail -f syslog, but the only syslog entries that appeared after I clicked "Stop Array" were the following:

 

Jun 13 20:21:35 tdm kernel: mdcmd (56): spinup 0
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:35 tdm kernel: mdcmd (57): spinup 1
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:35 tdm kernel: mdcmd (58): spinup 2
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:35 tdm kernel: mdcmd (59): spinup 3
Jun 13 20:21:35 tdm kernel:
Jun 13 20:21:43 tdm mountd[4117]: Caught signal 15, un-registering and exiting.
Jun 13 20:21:44 tdm kernel: nfsd: last server has exited, flushing export cache

 

Throughout the course of the day I had been cleaning up all my plugins and packages and ensuring that only the latest working ones were installed.  I know it was all fine earlier today, as I restarted the server a few times to make sure that the SAB/SB/CP/TransM/FlexGet combo all started without any errors.

 

I could try stripping everything back to core unRAID rc4 again, but that sort of defeats the purpose, as I only have a live server and I really don't want to lose the SAB/SB/CP/TransM/FlexGet automation I have set up.

tdm_hang_syslog.txt


Well, look at it this way: Tom is changing things, right? So you can't go by a third-party plugin that (may have) worked even one version prior. It's up to the plugin creator to adjust and update their plugin. For your sanity and others', first remove all plugins and test the various aspects. If everything checks out, add one plugin at a time and test again. If you find a plugin issue, post in the plugin forum for help (it will most likely help others as well).

 

It's not very hard to back up all the files (.plg, go script, etc.), start fresh, and add them back one by one once you know the RC checks out.
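Something like this from the console covers the backup part (just a sketch; /boot/config/plugins and /boot/config/go are the usual locations on the flash, and the backup folder name is arbitrary):

# copy the plugin files and go script to a dated backup folder on the flash
BACKUP=/boot/plugin-backup-$(date +%Y%m%d)
mkdir -p "$BACKUP"
cp -r /boot/config/plugins "$BACKUP"/
cp /boot/config/go "$BACKUP"/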

 

Yeah, look, I'm not looking for someone's undivided attention to fix my issue if I'm not prepared to spend the time doing proper debugging.  I'm just reporting what I saw and threw up a syslog, and if that helps Tom or others find a bug in unRAID or a package/plugin, great.

 

I had already removed all plugins/packages and slowly re-added what I needed one by one, stopping and restarting the array each time.  All worked well, with the final steps being full server restarts.  The problem appeared a day later when I wanted to stop the array.


It's not so much that; it's about helping to get to the root cause so Tom can reproduce it and say whether it is a bug or not.

 

The plugin owner(s) mostly don't know exactly what you're running from this post, so you won't get help here if it is a plugin problem.

 

You should try running the RC bare: grab a spare PC, copy a ton of data to wherever you normally place data (cache, etc.), run anything else you normally do (understood that it won't be SAB/SB/CP/TransM/FlexGet off of unRAID), and wait a day. If this tests out and the array stops cleanly, you know it's a plugin (as you weren't running any).

 

Then add one plugin, say SAB, run your downloads for a day, and try to stop the array. It's your choice whether to keep adding additional plugins (keeping the prior ones on) or to remove the prior plugin and add a different one.  Rinse and repeat after a full day for each. You WILL find which one is causing the issue.

 

