Mover crashes unRAID 5.0.5 (completely inaccessible)


Is there a way to identify which files will cause mover to crash? A directory containing a series of tarballs was being added to the Cache disk (ultimately destined for a share on the array) when I invoked the mover script -- yes, stupid. I suspect these files are the reason mover now crashes unRAID whenever I invoke it. I deleted the whole directory, but mover still crashes unRAID.

 

I ran a SMART extended test and reiserfsck on the cache disk; neither reported anything that suggested a faulty disk to me.

 

 

 

~~~~~~~~~~

  Background

~~~~~~~~~~

For a long time I've used my Cache disk exclusively for running apps, but I recently changed most of my shares to "Use cache disk: Yes".

 

I have ~1.3 TB of data on my Cache disk that needs to be transferred to different shares. Given this size, my guess is that the mover has never run properly in the past month or so.

 

With Dynamix WebGui v2.2.6 (latest version) running, I selected "Move now" under Main -> Array Operation to invoke the mover script. The WebGui then shows "moving now..." but unRAID ends up freezing. By freezing I mean the WebGui is non-responsive, I cannot telnet in, and when I plug in a keyboard and monitor I cannot even see text to copy the syslog onto the flash drive. The keyboard numlock LED doesn't turn green, so it seems the whole unRAID OS has crashed.

 

I waited 5 days thinking the mover might somehow still complete and then had to do a hard shutoff by holding the power button. Immediately after turning back on and booting up, the parity check had 0 errors. I again tried to invoke the mover and unRAID crashed again. Finally, I watched the syslog as I invoked mover on the webGUI one more time with:

tail -f --lines=100 /var/log/syslog 

 

 

 

This is the syslog output:

Aug 15 18:34:01 Tower crond[1228]: exit status 1 from user root /usr/lib/sa/sa1 1 1 & 1>/dev/null 2>&1
Aug 15 18:34:01 Tower kernel: crond[14017]: segfault at 4001e51c ip 4001e51c sp bfccc064 error 15 in ld-2.11.1.so[4001e000+1000]
Aug 15 18:35:01 Tower crond[1228]: exit status 1 from user root /usr/lib/sa/sa1 1 1 & 1>/dev/null 2>&1
Aug 15 18:35:01 Tower kernel: crond[15264]: segfault at 4001e51c ip 4001e51c sp bfccc064 error 15 in ld-2.11.1.so[4001e000+1000]
Aug 15 18:35:17 Tower emhttp: shcmd (605): /usr/local/sbin/mover |& logger &
Aug 15 18:35:17 Tower logger: mover started
Aug 15 18:35:17 Tower logger: skipping CompletedDownloads/
Aug 15 18:35:17 Tower logger: moving DC/
Aug 15 18:35:17 Tower logger: ./DC/._Frosty the Snowman (1969)
Aug 15 18:35:17 Tower logger: .d..t...... ./
Aug 15 18:35:17 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: .d..t.....x DC/
Aug 15 18:35:18 Tower logger: >f......... DC/._Frosty the Snowman (1969)
Aug 15 18:35:18 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
Aug 15 18:35:18 Tower logger: ./DC/The Roots of the Matrix/VIDEO_TS/VTS_01_0.BUP
Aug 15 18:35:18 Tower logger: .d..t...... ./
Aug 15 18:35:18 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: .d..t.....x DC/
Aug 15 18:35:18 Tower logger: .d..t...... DC/The Roots of the Matrix/VIDEO_TS/
Aug 15 18:35:18 Tower logger: >f......... DC/The Roots of the Matrix/VIDEO_TS/VTS_01_0.BUP
Aug 15 18:35:18 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
Aug 15 18:35:18 Tower logger: ./DC/The Roots of the Matrix/VIDEO_TS/VTS_01_0.IFO
Aug 15 18:35:18 Tower logger: .d..t...... ./
Aug 15 18:35:18 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: .d..t.....x DC/
Aug 15 18:35:18 Tower logger: >f......... DC/The Roots of the Matrix/VIDEO_TS/VTS_01_0.IFO
Aug 15 18:35:18 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
Aug 15 18:35:18 Tower logger: ./DC/The Roots of the Matrix/VIDEO_TS/VTS_01_0.VOB
Aug 15 18:35:18 Tower logger: .d..t...... ./
Aug 15 18:35:18 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:18 Tower logger: .d..t.....x DC/
Aug 15 18:35:18 Tower logger: >f......... DC/The Roots of the Matrix/VIDEO_TS/VTS_01_0.VOB
Aug 15 18:35:21 Tower logger: rsync: get_xattr_names: llistxattr("DC",1024) failed: Input/output error (5)
Aug 15 18:35:21 Tower logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
Aug 15 18:35:21 Tower logger: ./DC/The Roots of the Matrix/VIDEO_TS/VTS_01_1.VOB
Aug 15 18:35:21 Tower logger: .d..t...... ./
Aug 15 18:35:21 Tower kernel: ------------[ cut here ]------------
Aug 15 18:35:21 Tower kernel: kernel BUG at mm/slub.c:3409!
Aug 15 18:35:21 Tower kernel: invalid opcode: 0000 [#1] SMP
Aug 15 18:35:21 Tower kernel: Modules linked in: md_mod sg acpi_cpufreq mperf r8168(O) fam15h_power atiixp i2c_piix4 i2c_core k10temp hwmon sata_sil24 mvsas libsas scsi_transport_sas ahci libahc
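
The log above repeats the same rsync failure for every file, which makes it hard to see how many distinct problems there are. A minimal sketch for tallying the distinct rsync messages in a saved syslog, assuming the lines keep the "logger: rsync" prefix seen here; summarize_rsync_errors is a hypothetical helper name, not an unRAID tool:

```shell
# Sketch: count the distinct rsync error messages in a saved syslog.
# summarize_rsync_errors is a hypothetical helper, not part of unRAID.
summarize_rsync_errors() {
    # $1 = path to a saved copy of the syslog
    # Keep only the rsync error lines, drop the "date host logger: " prefix,
    # then count how often each distinct message occurs (most frequent first).
    grep -E 'logger: rsync( error)?:' "$1" \
        | sed -E 's/^.*logger: //' \
        | sort | uniq -c | sort -rn
}
```

For example, summarize_rsync_errors /var/log/syslog would print one line per distinct message with its count in front. On the excerpt above, every xattr failure names the top-level "DC" directory rather than an individual file.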

 

 

 

reiserfsck --check /dev/sde1
reiserfsck 3.6.24

Will read-only check consistency of the filesystem on /dev/sde1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Fri Aug 15 17:53:09 2014
###########
Replaying journal: Done.
Reiserfs journal '/dev/sde1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 388567
        Internal nodes 2572
        Directories 763024
        Other files 397427
        Data block pointers 303639590 (3027182 of them are zero)
        Safe links 0
###########
reiserfsck finished at Fri Aug 15 18:19:37 2014
###########

 

 

 

Link to comment

Thanks for the link. I had extended attributes on many shares across many disks, and I removed them with the rmattrs script. I then verified that they were gone by using

getfattr -d

 

I then invoked the mover script from the webGUI and unRAID crashed again. After restarting, the parity check indicates 0 errors. Running getfattr -d on the "DC" share for each disk again shows extended attributes on the same disks (2, 3, 5, 6, 7, and 10), even though I never mounted the shares after rmattrs removed them. I've tried this a few times now with the same result -- unRAID crashes within just a few minutes of starting mover -- and I've collected syslogs of the crashes.
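
Checking the same directory on every disk by hand is easy to get wrong. A rough sketch of a loop that reports which paths fail an extended-attribute read, assuming getfattr exits non-zero when the read fails (as with the llistxattr I/O error in the syslog); probe_xattrs and the XATTR_CMD override are hypothetical names introduced here for illustration:

```shell
# Sketch: report which paths fail an extended-attribute read.
# probe_xattrs and XATTR_CMD are hypothetical names; XATTR_CMD defaults to
# getfattr and is assumed to exit non-zero when the attributes cannot be read.
probe_xattrs() {
    for xattr_path in "$@"; do
        # Only the exit status matters here, so discard the attribute dump.
        if "${XATTR_CMD:-getfattr}" -d "$xattr_path" >/dev/null 2>&1; then
            echo "ok    $xattr_path"
        else
            echo "FAIL  $xattr_path"
        fi
    done
}
```

A possible invocation, assuming the usual unRAID mount points, is probe_xattrs /mnt/cache/DC /mnt/disk*/DC. A path that still has attributes but reads cleanly shows as "ok"; only a failed read shows as "FAIL".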

 

More than just the "DC" and "Movies" directories contain files to be moved, but only on the 3rd crash did mover even get as far as trying some of the Movies.

 

I have attached 3 syslogs.

crash1.txt

crash2.txt

crash3.txt.zip

Link to comment

 

This is possibly a memory problem; I would double-check with a memtest FIRST.

Then run the mover in safe mode.
Something that was added to the system may be causing grief with a shared library.

 

 

Segfaults are attempts to access memory that the application does not have permission to access.

Then there is the kernel bug. All of this points to a problem with a library, the memory, or a failing CPU (the least likely).
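
The "error 15" field in the crond segfault lines can be decoded further. On x86 the kernel prints the page-fault error code in hexadecimal, so 15 means 0x15. A small sketch of a decoder for the low five bits follows; decode_pf_error is a hypothetical helper name, and the bit meanings are the standard x86 page-fault flags:

```shell
# Sketch: decode the "error N" field of an x86 kernel segfault line.
# decode_pf_error is a hypothetical helper; N is read as hexadecimal,
# which is how the kernel prints it.
decode_pf_error() {
    pf_code=$(( 0x$1 ))
    # bit 0: 1 = protection violation, 0 = page not present
    if [ $(( pf_code & 1 )) -ne 0 ]; then pf_out="protection-violation"; else pf_out="page-not-present"; fi
    # bit 1: 1 = write access, 0 = read access
    if [ $(( pf_code & 2 )) -ne 0 ]; then pf_out="$pf_out write"; else pf_out="$pf_out read"; fi
    # bit 2: 1 = fault raised in user mode, 0 = kernel mode
    if [ $(( pf_code & 4 )) -ne 0 ]; then pf_out="$pf_out user-mode"; else pf_out="$pf_out kernel-mode"; fi
    # bit 3: a reserved bit was set in a page-table entry
    if [ $(( pf_code & 8 )) -ne 0 ]; then pf_out="$pf_out reserved-bit"; fi
    # bit 4: the fault was caused by an instruction fetch
    if [ $(( pf_code & 16 )) -ne 0 ]; then pf_out="$pf_out instruction-fetch"; fi
    echo "$pf_out"
}
```

Running decode_pf_error 15 prints "protection-violation read user-mode instruction-fetch", which fits the log line where the faulting address and the instruction pointer are the same value: crond tried to execute from a page it was not allowed to execute. That alone does not say whether a library or the hardware is at fault.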

 

 

 

 


Aug 15 18:34:01 Tower crond[1228]: exit status 1 from user root /usr/lib/sa/sa1 1 1 & 1>/dev/null 2>&1 
Aug 15 18:34:01 Tower kernel: crond[14017]: segfault at 4001e51c ip 4001e51c sp bfccc064 error 15 in ld-2.11.1.so[4001e000+1000]
Aug 15 18:35:01 Tower crond[1228]: exit status 1 from user root /usr/lib/sa/sa1 1 1 & 1>/dev/null 2>&1 
Aug 15 18:35:01 Tower kernel: crond[15264]: segfault at 4001e51c ip 4001e51c sp bfccc064 error 15 in ld-2.11.1.so[4001e000+1000]


...


Aug 15 18:35:21 Tower kernel: ------------[ cut here ]------------
Aug 15 18:35:21 Tower kernel: kernel BUG at mm/slub.c:3409!
Aug 15 18:35:21 Tower kernel: invalid opcode: 0000 [#1] SMP 
Aug 15 18:35:21 Tower kernel: Modules linked in: md_mod sg acpi_cpufreq mperf r8168(O) fam15h_power atiixp i2c_piix4 i2c_core k10temp hwmon sata_sil24 mvsas libsas scsi_transport_sas ahci libahc

Link to comment

Thank you for your response! Your explanations of the errors in the syslogs are really helpful for my understanding.

 

I'll run memtest for 3 cycles, then boot into safe mode and try to invoke mover by telnetting in and just typing mover. I hope this is the correct way to do it? The unRAID Wiki indicates /usr/share/sbin/mover uses find and mv, but I noticed only rsync in my syslogs: http://lime-technology.com/wiki/index.php/Cache_disk#The_Mover
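
Since the box freezes before the syslog can be copied off, it may help to write the mover output to persistent storage line by line while it runs. A best-effort sketch, assuming the flash drive is mounted at /boot and using the mover path shown in the syslog; run_logged is a hypothetical helper name, and the per-line sync is slow and flushes every mounted filesystem:

```shell
# Sketch: run a command and append each output line to a file as it arrives.
# run_logged is a hypothetical helper, not an unRAID tool.
run_logged() {
    logfile="$1"
    shift
    # Merge stderr into stdout so rsync and glibc errors are captured too.
    "$@" 2>&1 | while IFS= read -r line; do
        printf '%s\n' "$line"               # still show the line on the terminal
        printf '%s\n' "$line" >> "$logfile" # and append it to the log file
        sync                                # flush now, in case the system freezes next
    done
}
```

A possible invocation from the telnet session is run_logged /boot/mover_errors.txt /usr/local/sbin/mover. After a hard power-off, the file should then hold everything printed up to shortly before the freeze.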

 

I hope to report back and mark this as solved.

Link to comment

Thanks for the clarification that mover uses rsync.

 

Memtest was helpful! I am in the process of RMAing one of my 2 RAM sticks; the other reached its 4th pass with 0 errors.

 

I booted into safe mode and invoked mover, but unRAID still freezes. When I typed mover in the telnet session, additional info was printed that I had not seen before, because previously I had used the webGUI mover (attached as mover_errors.txt). It looks like there's a problem with rsync, but I don't understand it.

 

Syslog is also attached separately.

 

*edit: mover_errors.txt file added

syslog.txt

mover_errors.txt

Link to comment

These are not normal circumstances.

 

rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [Receiver=3.0.7]
./DC/Dalí (1991)/VIDEO_TS/VTS_01_0.IFO
*** glibc detected *** rsync: free(): invalid next size (normal): 0x080d2420 ***
======= Backtrace: =========
/lib/libc.so.6(+0x705aa)[0x400a45aa]

 

Memory is still suspect, and/or there are issues with memory timing.

Perhaps the timings need to be relaxed by some method in the BIOS.

I used to have an Abit AB9 Pro board that required me to relax the timings for a reliable system.

What motherboard and RAM set are you using with it?

Link to comment

Mobo:  ASRock 880GM-LE FX AM3+ AMD 880G Micro ATX

RAM:  Patriot Signature 4 GB PC3-10600 (1333 MHz) DDR3 CL9

 

I believe the CL9 indicates a 9-9-9-24 latency timing. By relaxing (underclocking?) are you suggesting to increase the latency numbers, decrease the frequency/Hz, decrease voltage, or a combination of these?
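
For reference, the first timing number (CAS latency) is counted in memory clock cycles, so its real duration depends on the memory speed. A small sketch of the conversion to nanoseconds, assuming PC3-10600 corresponds to a data rate of 1333 MT/s; cas_latency_ns is a hypothetical helper name:

```shell
# Sketch: convert a CAS latency in clock cycles to nanoseconds.
# cas_latency_ns is a hypothetical helper; arguments: CL, data rate in MT/s.
cas_latency_ns() {
    # DDR memory transfers twice per clock, so one cycle lasts 2000 / (MT/s) ns.
    awk -v cl="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", cl * 2000 / rate }'
}
```

With these assumptions, cas_latency_ns 9 1333 prints 13.5 and cas_latency_ns 10 1333 prints 15.0. Raising the latency numbers, or lowering the frequency at the same numbers, both give the memory more time per access, which is the sense in which timings are "relaxed".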

 

Link to comment

Mobo:  ASRock 880GM-LE FX AM3+ AMD 880G Micro ATX

RAM:  Patriot Signature 4 GB PC3-10600 (1333 MHz) DDR3 CL9

 

I believe the CL9 indicates a 9-9-9-24 latency timing. By relaxing (underclocking?) are you suggesting to increase the latency numbers, decrease the frequency/Hz, decrease voltage, or a combination of these?

 

 

I looked at the motherboard manual. Nothing really jumped out at me.

 

 

You can try increasing the latency if it is set manually, or set it to auto and see what happens.

I know one board used to require me to slow it down.

I think I had to adjust the frequency too. It slightly overclocked the CPU and that caused all sorts of grief with reliability.

So in effect, I would suggest reviewing the BIOS and using the most relaxed or safest settings possible.

Link to comment
