Crash/hang when trying to Stop array

June 7, 2013

Clicking 'Stop' in the webGui sometimes causes entire server to crash.

Refer to:

http://lime-technology.com/forum/index.php?topic=27720.msg244890#msg244890

June 7, 2013

Got this problem today.

It doesn't always happen, it seems.

Only real difference between this restart and others is the mem=4095M addition.

I did capture some errors tho:


Message from syslogd@Tower at Fri Jun  7 17:34:01 2013 ...
Tower kernel: Stack:

Message from syslogd@Tower at Fri Jun  7 17:34:01 2013 ...
Tower kernel: Process kworker/0:4 (pid: 2266, ti=f332a000 task=f31e8d80 task.ti=f332a000)

Message from syslogd@Tower at Fri Jun  7 17:34:01 2013 ...
Tower kernel: Call Trace:

Message from syslogd@Tower at Fri Jun  7 17:34:01 2013 ...
Tower kernel: Code: 14 0c 00 00 eb 1a 85 db 79 0c 81 e6 ff 00 00 00 8d 4c f0 14 eb 0a c1 e9 1a 8d 8c c8 14 0e 00 00 8b 59 04 89 51 04 89 0a 89 5a 04 <89> 13 f6 42 0c 01 75 0e 8b 52 08 3b 50 0c 79 03 89 50 0c ff 40

Message from syslogd@Tower at Fri Jun  7 17:34:01 2013 ...
Tower kernel: EIP: [<c10327e5>] internal_add_timer+0x8d/0xa7 SS:ESP 0068:f332be88

June 7, 2013

Hi

I also had this issue with one of my servers, which normally goes down every night with a /sbin/powerdown command in a cron job.

For the last 3 nights it has released the array but never shut down,

so this evening I put a monitor on the server and caught this picture.

I have been looking this up, and there seems to be a bug in kernel 3.9 that looks related:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1178720

This is from Ubuntu...

Not sure if this is the issue with our unRAIDs, but thought I would add it... not a Linux guru here... just proficient with Google searching.

June 7, 2013

Author

Yes, this is the problem I can reproduce and am currently working on.

June 8, 2013

I tried to duplicate the crash/hang bug today without luck. I'm back to running rc13 again with my standard set of plugins. Additionally I have Barzija's keeplogs script running. I probably only stopped the array about 4 times followed by a reboot and they all went off without a hitch. I thought there might be some sort of time element to this bug, and besides, the system was needed to play back some shows. I'll keep testing some more later tonight and tomorrow. But for now it's working great!!

June 9, 2013

Author

This is an issue where the Community can really help to verify what I think the problem is.

I'm pretty sure I've been chasing a "phantom" problem for the last several days. If you see this problem where your server crashes as a result of Stopping the array, please reboot your server, Stop the array (hopefully it won't crash this time), and then Start in "maintenance" mode (if a parity check starts, you can Cancel it). Next, open a telnet window and run "reiserfsck" on all your data disks:

reiserfsck /dev/md1

reiserfsck /dev/md2

:

etc. for each "md" device (that is for each data disk).

It might take a long time for these checks to run, up to an hour per disk, and you can open multiple telnet windows and run them in parallel. If "reiserfsck" indicates there is a problem it will tell you to re-run "reiserfsck" with a particular option (usually "--fix-fixable"). Go ahead and do whatever it says.
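For anyone with a lot of data disks, the per-device commands can be generated rather than typed by hand. This is only a rough sketch: the function name print_check_commands and its optional directory argument are made up for illustration, and it assumes the data disks appear as /dev/md1, /dev/md2, and so on. It prints the commands instead of running them, because reiserfsck asks for confirmation before it starts.

```shell
#!/bin/sh
# Sketch only: print one read-only reiserfsck command per array data device.
# print_check_commands is a hypothetical helper name, not part of unRAID.
# The optional argument (default /dev) exists only so the loop can be tried
# against a scratch directory first.
print_check_commands() {
    dev_dir="${1:-/dev}"
    for dev in "$dev_dir"/md[0-9]*; do
        # When the glob matches nothing it is left unexpanded; skip it.
        [ -e "$dev" ] || continue
        echo "reiserfsck --check $dev"
    done
}

print_check_commands "$@"
```

Paste each printed line into its own telnet window if you want the checks to run in parallel.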

Here's what I'm looking for:

a) if you see this "crash/hang when trying to stop array" do you also have one or more disks with problems reported by "reiserfsck"?

b) if above is "yes", does the "crash/hang" seem to not happen now that all file systems have been corrected?

June 9, 2013

I am about to start the series of tests that you mention and will report back on completion (with 19 data drives of 2 or 3TB each, this will take some time!). However, the thing that puzzles me is why I do not see this problem with rc12a, while rc13 crashes my system every time. Is there a difference between the releases that can explain this behaviour? This difference suggests there might be an issue even if it takes some sort of file system corruption to trigger it.

On a slightly related issue, is there any way to run reiserfsck in check mode on a mounted file system, so that it could be made part of normal periodic checks without taking the array offline? That way one could get warning of potential problems.

June 9, 2013

I upgraded to rc13 several days ago with no probs.. went to stop the array tonight in order to reboot the server and experienced the 'hang'. After a reboot, the web ui wouldn't come up. Being very time poor at the moment, I reverted to rc12a and no problems whatsoever. Probably should have grabbed logs etc, but as mentioned, very pressed for time at the moment. Will look at upgrading and following Tom's instructions when I have more time.

June 9, 2013

Currently running on all 6 disks.

Disks 1 and 2 (which are the two disks I expected something to come up with) have zero errors.

I will attach the results when the other 4 disks are done.

June 9, 2013

I ran reiserfsck on my array this morning following Limetech's recommendations. I only have two data drives, so I ran the checks in "parallel" from two separate console sessions. The process finished without errors reported on either drive. I copied the report information from each screen for later reference. The 2TB drive with the most data, at 57% full, took a little over an hour. The second drive, at 3%, took less time, obviously!

Again, I only had one incident of the crash/hang bug since updating to rc13. I hope this helps!

June 9, 2013

Sorry to prove your theory wrong but.. no errors here on any data disk or the cache!!

cheese@cheddar:~# reiserfsck /dev/sdc1
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/sdc1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 12:48:05 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/sdc1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished                                
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                                       
No corruptions found
There are on the filesystem:
Leaves 42662
Internal nodes 283
Directories 293533
Other files 157877
Data block pointers 9351608 (0 of them are zero)
Safe links 0
###########
reiserfsck finished at Sun Jun  9 12:53:15 2013
###########
cheese@cheddar:~# 





cheese@cheddar:~# reiserfsck /dev/md1
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 11:14:26 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 477269
        Internal nodes 3041
        Directories 274640
        Other files 915654
        Data block pointers 306269549 (1 of them are zero)
        Safe links 0
###########
reiserfsck finished at Sun Jun  9 12:29:30 2013
###########
cheese@cheddar:~# 





cheese@cheddar:~# reiserfsck /dev/md2
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md2
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 11:14:37 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md2' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 298025
        Internal nodes 2002
        Directories 2474
        Other files 254755
        Data block pointers 269099081 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Sun Jun  9 12:15:22 2013
###########
cheese@cheddar:~# 





cheese@cheddar:~# reiserfsck /dev/md3
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md3
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 11:14:46 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md3' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                                       
No corruptions found
There are on the filesystem:
        Leaves 401464
        Internal nodes 2392
        Directories 72
        Other files 347
        Data block pointers 406246371 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Sun Jun  9 13:01:56 2013
###########
cheese@cheddar:~# 





cheese@cheddar:~# reiserfsck /dev/md4
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md4
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 11:15:39 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md4' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
/finished                                                        
No corruptions found
There are on the filesystem:
        Leaves 392724
        Internal nodes 2338
        Directories 7
        Other files 108
        Data block pointers 397418694 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Sun Jun  9 12:51:22 2013
###########
cheese@cheddar:~# 





cheese@cheddar:~# reiserfsck /dev/md5
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md5
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 11:15:05 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md5' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                          
No corruptions found
There are on the filesystem:
        Leaves 392572
        Internal nodes 2336
        Directories 7
        Other files 100
        Data block pointers 397268811 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Sun Jun  9 12:52:32 2013
###########
cheese@cheddar:~# 





cheese@cheddar:~# reiserfsck /dev/md6
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list@namesys.com, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/md6
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sun Jun  9 11:16:00 2013
###########
Replaying journal: Done.
Reiserfs journal '/dev/md6' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                                       
No corruptions found
There are on the filesystem:
        Leaves 402560
        Internal nodes 2411
        Directories 95
        Other files 960
        Data block pointers 407295054 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Sun Jun  9 12:59:05 2013
###########
cheese@cheddar:~#
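With this many disks it is easy to miss a result in the scrollback. If each run is saved to a file, the two lines that matter can be pulled out automatically. A rough sketch, assuming each saved log contains the same "Will read-only check consistency of the filesystem on ..." and "No corruptions found" lines shown above; summarize_fsck_log is a made-up name, and the log file paths are whatever you saved them as.

```shell
#!/bin/sh
# Sketch only: summarize saved reiserfsck --check logs, one line per log.
# summarize_fsck_log is a hypothetical helper; the log files are assumed to
# be plain-text captures of the console output, one file per device.
summarize_fsck_log() {
    log="$1"
    # The device name follows this fixed phrase in the reiserfsck output.
    dev=$(sed -n 's/^Will read-only check consistency of the filesystem on //p' "$log" | head -n 1)
    if grep -q '^No corruptions found' "$log"; then
        echo "${dev:-unknown device}: clean"
    else
        # Anything else (errors, or a run that never finished) needs a human look.
        echo "${dev:-unknown device}: needs review"
    fi
}

for f in "$@"; do
    summarize_fsck_log "$f"
done
```

A log that is missing the "No corruptions found" line is flagged for review rather than guessed at, so an interrupted run does not get reported as clean.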

June 9, 2013

Is the cache drive to be included in the test? I suppose it's optional since it's not part of the array.

June 9, 2013

I did it just for completeness.

June 9, 2013

I have now run the reiserfsck check against all my drives, and they all check out as having no problems, so that is not the cause of the crash when stopping the array under rc13.

June 9, 2013

Sorry Tom

Just checked out my 16 drives, they all come up with - No corruptions found.

June 9, 2013

I re-installed RC13 on one unRAID machine just to do this test. After the Stop, the machine hung. I followed the suggested steps right after the reboot. All 14 drives reported "No corruptions found".

Regards

June 9, 2013

Here are some syslogs that are 100% repeatable on the stop array problem....

http://lime-technology.com/forum/index.php?topic=27720.msg246113#msg246113

June 9, 2013

I now have had my second crash/hang AND this time around I had Barzija's keeplog going at the time of the crash. So it's attached. After the reboot I got another keeplog session going and then got unRAID into maintenance mode after stopping the parity check and successfully stopping the array. I'm running Reiserfsck on both my drives. I have a picture of the screen dump I'll put in an accompanying post.

I hope this helps!

syslog_2013-06-09_07.48.21.txt

June 9, 2013

Here's the picture..........

June 9, 2013

I had this occur several times after first installing rc13. I used Barzija's script to produce a crash log that I posted in the announcement thread. Because of the unclean shutdown that occurs when rebooting, the messages displayed in the console during boot suggested that I run a chkdsk on the flash drive. I did this and re-installed rc13 by re-copying over bzimage and bzroot. It crashed 2 times again.

So I took out the flash drive, backed it up and copied over everything from the rc13 zip file except for the config file, not just the bzimage and bzroot files. Then I ran "make_bootable.bat". The server has been running for 3 days now without a crash. I have stopped and restarted the array at least 10 times without incident. I am still using unmenu but no other add-ons besides Barzija's script.

I spent 5 minutes stressing the system by starting and stopping the array, spinning up, spinning down, starting a parity check to 1 or 2%, then cancelling and then stopping the array. No crashes. Maybe I'm just lucky. Previously I could induce the crash by doing exactly what I did during the 5 minute test period. Whenever I think about it, I stop the array to see what will happen. So far, it stops cleanly. I have stopped and restarted the array twice today, before and after a partial parity check. Now I'm doing a pre-clear on a new 3TB Toshiba (54% at 3:50!). As soon as it is done, I will stop and restart the array a couple of times. If it crashes I will post my syslog. For what it's worth....

June 10, 2013

On a quick glance I didn't see anything wrong in your syslog. Maybe others will find something wrong upon closer inspection. Two observations though...

The last thing in your syslog is emhttp trying to unmount disks. That's apparently when the crash happened. About 120 seconds earlier, cache_dirs logged the following entry:
Jun  9 07:50:36 Tower cache_dirs: Suspending cache_dirs for 120 seconds to allow for clean shutdown of array
Now, I am not familiar with the inner workings of cache_dirs, and this could be purely a coincidence. But could it be that cache_dirs resumed its activities at the exact moment of the crash and caused it? If you can't reproduce the crash without cache_dirs, then you will probably have to contact the authors of cache_dirs and bring this to their attention.

Second, I notice you were using sickbeard at some time. Was sickbeard running at the time of the crash or not? The reason I'm asking is because sickbeard has been known to crash people's unRAID servers. Can you reproduce the crash without installing sickbeard?
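The cache_dirs timing question in the first observation comes down to simple timestamp arithmetic: does the gap between the "Suspending cache_dirs" entry and the last syslog entry match the 120 second suspension? A small sketch, assuming both entries fall on the same day. The second timestamp below is a placeholder, not a value taken from the posted syslog, and the helper names are invented.

```shell
#!/bin/sh
# Sketch only: seconds elapsed between two same-day syslog times (HH:MM:SS).
# to_seconds and seconds_between are hypothetical helper names.
to_seconds() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    # Drop one leading zero so values like 08 are not parsed as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

seconds_between() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# 07:50:36 is the cache_dirs entry quoted above; 07:52:36 is a placeholder
# standing in for the time of the last syslog entry before the crash.
seconds_between "07:50:36" "07:52:36"   # prints 120
```

A result close to 120 would be consistent with cache_dirs resuming right around the crash, though it would not prove that cache_dirs caused it.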

I will remove cache_dirs as I'm a little suspicious of some of the messages getting echoed to screen (lately I've been paying attention to unRAID..as you might guess). I'm seeing something like this:

/usr/sbin/cache_dirs: line 1: kill : (11266) - No Such Process
rm: cannot remove '/var/lock/cache_dirs.LCK' : No Such File or Directory

I was using Sickbeard at the time of the crash. I will look into running without SB....but it might take a day or two. Thanks for the suggestions!!

June 10, 2013

Here is a syslog from when I get the crash. I cannot see anything in it that springs out to me as being the cause. I do notice that the log at the end does not show all disks being unmounted. I assume this is because the relevant log lines did not get written due to the crash. This is running a vanilla installation with no extras or plugins. If it might help I can take a photo of what shows up on the console at the time, giving the end of the Linux crash information, but it looks like the console screenshots posted earlier in the thread.

syslog_2013-06-10_03.11.37.txt.zip

June 10, 2013

Interesting you say this, because when upgrading from rc12a to rc13 I did something similar. I replaced all the files in the root of the flash drive (but not the folders) and reran make_bootable.bat. I haven't had a problem stopping the array and powering down, provided I stop any running plugins (which is the same as in previous versions).

Sent from my Nexus 4 using Tapatalk 2

June 10, 2013

Jun 10 03:04:18 DJW-UNRAID kernel: FAT-fs (sda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
The filesystem on your USB flash disk is corrupted. That is a serious reason for concern, although I can't imagine why that would be able to crash the OS. Or maybe that's enough to crash emhttp or the md-mod driver if they are unable to write the "super.dat" file back to the USB flash disk? Who knows. But you should definitely take your USB disk to a Windows machine and run chkdsk on it to fix the filesystem corruption.

I reformatted the USB stick and then put back the files and ran make_bootable.bat. The crash still happens, and after it has happened Windows reports again that there are errors that need correcting. Interestingly enough there are no files ending up in lost+found which may give a clue to what type of corruption is occurring.

My guess is that this 'corruption' is caused by the system crashing with the USB device still mounted. Since the unmounts appear not to be finishing, it is also possible that emhttp is writing something at the moment of the crash.

I also tried doing a completely fresh install and then set up the array again to eliminate the chance of it being a side effect of the contents of one of the unRAID configuration files being corrupted, and I still get the crash, and still end up with the USB stick being reported as having errors when checking it on Windows.

I wonder if those having this issue have something in common at the hardware level?

June 10, 2013

That screen shot seems to correspond to what I see on the console of my system when the problem occurs.

I really hope we can get some clue as to what is causing this issue, as it makes RC13 unusable in practice for those affected. One does not want to stop the array very frequently, but when it is required it is normally for an important reason.

I have edited my syslinux.cfg so I can boot any of the last few RC releases. At the moment I have reset the default to RC12a, although I can still boot RC13 if required by selecting it at boot time.
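For anyone wanting to try the same thing, a multi-release boot menu might look roughly like the output of the sketch below. This is not my actual file: the rc12a/ and rc13/ folder names are assumptions (each would hold the bzimage and bzroot from that release), the menu header lines are assumed to match a stock install, and print_multiboot_cfg is just an illustrative name. The script only prints the config, so nothing on the flash drive is touched until you copy it over yourself.

```shell
#!/bin/sh
# Sketch only: print a syslinux.cfg with one menu entry per unRAID release.
# Folder names rc12a/ and rc13/ are hypothetical; adjust to wherever the
# bzimage and bzroot files for each release were copied on the flash drive.
print_multiboot_cfg() {
    cat <<'EOF'
default menu.c32
menu title Lime Technology
prompt 0
timeout 50
label unRAID rc12a
  menu default
  kernel rc12a/bzimage
  append initrd=rc12a/bzroot
label unRAID rc13
  kernel rc13/bzimage
  append initrd=rc13/bzroot
EOF
}

print_multiboot_cfg
```

The entry marked "menu default" is the one that boots when the timeout expires, so moving that line is all it takes to switch the default release.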
