Mover causes complete crash (5b14)

ajburnet · February 25, 2012

Hi all,

I first had this problem in December and have made no progress in resolving it. I tried rebuilding the server using the latest plugins etc. in the hope that it might be resolved. Sadly, the issue still exists. I'd really like to have a cache drive that I can use!

So, I now have:

5b14 (Hardware: 4GB RAM, X2 2.6GHZ)

Plugins: APCUPSDaemon, Plex, SABnzbd, Couchpotato, SickBeard, rSYNC daemon (standalone) for rsync backups, powerdown script, unMENU

I have tried disabling these one by one and no change. The output of a tail run is included below.

Any ideas??! Thank you all!!

Alex.

tailoutput260212.txt.zip

BRiT · February 25, 2012

Have you ensured your system is stable enough to complete a couple hours run of Memtest?

WeeboTech · February 25, 2012

I think you are running out of RAM. Even though you have 4GB, you are running out of low RAM.

Are you running cache_dirs?

You could try dropping the caches before the mover starts.

Put this at the start of the mover script and duplicate it at the end of the script.

echo 3 > /proc/sys/vm/drop_caches

When does your rsync backup occur?

You may want to run this command to drop the cache before and after it.

I know you have 4gb, but I have the same issue because I have so many files.

If I were to run cache_dirs and one of my system backups without dropping the cache I would see these types of crashes also.

There are other tunings you can try. Download my rsync_linked_backup script from my google code page.

There's an example of what I do before the rsync and after it.

Here's a snippet.

# Save Kernel Options.
swappiness=$(</proc/sys/vm/swappiness)
cachepressure=$(</proc/sys/vm/vfs_cache_pressure)
echo 3              > /proc/sys/vm/drop_caches
echo 100            > /proc/sys/vm/swappiness
echo 200            > /proc/sys/vm/vfs_cache_pressure
nice ionice -c3 rsync -aW ${RSYNCOPTS} ${BACKUPSRC[*]} ${BACKUPDIRD}
echo 3              > /proc/sys/vm/drop_caches
echo $swappiness    > /proc/sys/vm/swappiness
echo $cachepressure > /proc/sys/vm/vfs_cache_pressure
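
The snippet above can also be wrapped around any command, for example the mover. What follows is only a minimal sketch of the same save, tune, restore pattern: `with_vm_tuning` and `VM_DIR` are hypothetical names that do not come from the rsync_linked_backup script, and it needs root when pointed at the real /proc/sys/vm. It restores the settings when the wrapped command fails, but not when the script itself is killed by a signal.

```shell
#!/bin/sh
# Hypothetical helper: tune the VM, run a command, then restore the settings.
# VM_DIR defaults to the real kernel interface; override it only for testing.
VM_DIR="${VM_DIR:-/proc/sys/vm}"

with_vm_tuning() {
    # Save the current kernel settings.
    old_swappiness=$(cat "$VM_DIR/swappiness")
    old_pressure=$(cat "$VM_DIR/vfs_cache_pressure")

    # Drop caches and bias the kernel toward reclaiming them.
    echo 3   > "$VM_DIR/drop_caches"
    echo 100 > "$VM_DIR/swappiness"
    echo 200 > "$VM_DIR/vfs_cache_pressure"

    # Run the wrapped command and remember its exit status.
    "$@"
    status=$?

    # Restore the saved settings even if the command failed.
    echo 3                 > "$VM_DIR/drop_caches"
    echo "$old_swappiness" > "$VM_DIR/swappiness"
    echo "$old_pressure"   > "$VM_DIR/vfs_cache_pressure"
    return "$status"
}
```

Usage as root would look like `with_vm_tuning /usr/local/sbin/mover`; the function returns the exit status of the wrapped command.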

ajburnet · February 26, 2012

Thanks - yes, I've run an overnight memtest and all was okay.

I'll try your script and report back! Thanks.

WeeboTech · February 26, 2012

You need to modify your mover script manually.

Then save the changes somewhere on your boot drive.

FWIW, this only band-aids the problem.

If the root filesystem was on tmpfs and you had a swap file you might not encounter this.

Do your extra programs use /tmp for anything?

If so you may want to reconfigure them to use /mnt/cache/.tmp or something like that.

Do you have a swap partition or swap file on the cache drive anywhere?

ajburnet · February 26, 2012

Just noticed that the folder caching menu in settings (running SimpleFeatures GUI) has the 'suspend during mover process' enabled so I assume it is being disabled before the mover runs, i.e. same outcome as your script?

I've ordered more memory, but I assume it just uses whatever amount is available for caching, so I won't see any benefit. Or is there a set maximum that it uses, limited by the levels/files it caches?

My apps are all running out of user/local - isn't that in RAM? Might be worth running out of a temp file on the cache drive I guess and setting up a swap-file. I'll see if that helps. If it doesn't I'll disable cache dirs to test.

I noticed that Plex was using /tmp so will change that to the cache drive.

Fingers crossed & thanks!!

WeeboTech · February 26, 2012

If all your apps are on /usr/local, with Plex using /tmp and cache_dirs running, I can totally understand why you are having out-of-memory issues.

You can add more ram, but you will still have the same issues if you are running cache_dirs and massive rsync at the same time.

Some things you can do.

Make a swap partition on your cache drive. Make it the full size of RAM if possible.

Mount a tmpfs on /tmp.

This allows /tmp to be used and also swapped out safely.

You mention user/local.

Did you mean /mnt/user/local or /usr/local?

If you are talking about /usr/local, then yes, you are putting everything in RAM.

If this is how you are using it, then I would:

rename /usr/local to /usr/local.d

mkdir /usr/local

mount a tmpfs on /usr/local

do a rsync -a --remove-sent-files /usr/local.d/ /usr/local

This has the effect of creating a contained ramdisk on /usr/local that can be swapped out if you have memory pressure.

The root filesystem (rootfs) cannot be swapped out, but tmpfs can.
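
The suggestions in this post (a swap file or partition, plus a tmpfs over /usr/local) can be sketched as a dry-run script. This is a minimal sketch under stated assumptions, not the exact commands: `run`, `setup_swapfile`, `tmpfs_overlay` and `DRY_RUN` are hypothetical names, the paths and sizes are illustrative, and `--remove-source-files` is used here as the current rsync spelling of `--remove-sent-files`. The trailing slash on the rsync source means "copy the contents of this directory" rather than the directory itself. By default the script only prints the commands; set `DRY_RUN=0` and run as root to execute them.

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 to actually run them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create and enable a swap file (path and size in MB are illustrative).
setup_swapfile() {
    swapfile=$1
    size_mb=$2
    run dd if=/dev/zero of="$swapfile" bs=1M count="$size_mb" &&
    run chmod 600 "$swapfile" &&
    run mkswap "$swapfile" &&
    run swapon "$swapfile"
}

# Move a directory aside, mount a size-limited tmpfs in its place,
# then move the original contents into the tmpfs.
tmpfs_overlay() {
    dir=$1
    size=$2
    run mv "$dir" "$dir.d" &&
    run mkdir "$dir" &&
    run mount -t tmpfs -o "size=$size" tmpfs "$dir" &&
    run rsync -a --remove-source-files "$dir.d/" "$dir/"
}
```

For example, `setup_swapfile /mnt/cache/.unraid.swapfile 2048` followed by `tmpfs_overlay /usr/local 2000m` prints the eight commands in the order they would run.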

ajburnet · February 26, 2012

Thanks for that - really appreciate it. I'm about to throw the damn thing away if it wasn't so good when it's working!!

I created a swap file (2.5GB; I tried a 4GB one, which it didn't like...) and have moved both Plex and SABnzbd so they are running from the cache drive. CouchPotato and SickBeard are running from /usr/local as they would not install elsewhere happily (this seems to be a common setup for those apps). They do have data drives set on the cache, however.

I invoked the mover once (with nothing to sync) and it was fine. Second time and it's crashed again!

It still seems to be using a lot of RAM, or is this normal - I wasn't expecting it to use the swap when nothing much is happening!

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 1 2580 124760 106676 3336028 0 4 336 4001 1499 1904 5 8 31 56

total used free shared buffers cached

Mem: 3887512 3762884 124628 0 106676 3336028

-/+ buffers/cache: 320180 3567332

Swap: 3583996 2580 3581416
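
When reading the `free` output above, note that the "-/+ buffers/cache" row is simply the "used" figure with buffers and page cache subtracted, so most of the "used" memory here is cache rather than application memory. A small sketch of that arithmetic, assuming the old procps column layout shown in this post (`app_used_kb` is a hypothetical name):

```shell
#!/bin/sh
# Print memory used by applications in kB (used - buffers - cached),
# reading old-style `free` output on stdin.
# Columns on the Mem: row: total used free shared buffers cached.
app_used_kb() {
    awk '$1 == "Mem:" { print $3 - $6 - $7 }'
}

# The Mem: row from the post: 3762884 - 106676 - 3336028
printf '%s\n' 'Mem: 3887512 3762884 124628 0 106676 3336028' | app_used_kb
# prints 320180, matching the "-/+ buffers/cache" row
```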

Also, as soon as cache_dirs is enabled and the mover script runs, it crashes (see below). With cache_dirs disabled it appears to be okay.

Message from syslogd@unRAID at Sun Feb 26 23:23:18 2012 ...

unRAID kernel: Call Trace:

Message from syslogd@unRAID at Sun Feb 26 23:23:18 2012 ...

unRAID kernel: Code: 8b 08 85 c9 89 4d f0 75 16 8b 7d dc 8b 4d ec 50 89 f0 89 fa e8 5a f6 ff ff 89 45 f0 58 eb 30 8b 46 14 8b 5d f0 8b 7d e8 8b 55 e8 <8b> 04 03 47 89 7d e0 89 f9 89 45 e4 89 c3 8b 3e 8b 45 f0 64 0f

Message from syslogd@unRAID at Sun Feb 26 23:23:18 2012 ...

unRAID kernel: Process cache_dirs (pid: 23741, ti=d3062000 task=ef0af600 task.ti=d3062000)

Message from syslogd@unRAID at Sun Feb 26 23:23:18 2012 ...

unRAID kernel: EIP: [<c107d9a0>] __kmalloc+0xbb/0xff SS:ESP 0068:d3063ea8

Message from syslogd@unRAID at Sun Feb 26 23:23:18 2012 ...

unRAID kernel: Stack:

Message from syslogd@unRAID at Sun Feb 26 23:23:18 2012 ...

unRAID kernel: Oops: 0000 [#3] SMP

Appreciate your help. I've spent many days trying to sort this so a BIG THANK YOU!

Thanks,

Alex.

tail_output.txt

ajburnet · February 26, 2012

I can't reboot now - it says out of memory, no killable processes!

Halp!

WeeboTech · February 26, 2012

Shut down the other processes and see if you can reboot after that.

Do a du -hs /usr/local (how much space is being used in this directory?)

You may want to try mounting a tmpfs on this directory before you install your apps.

You can also try adding more ram to the system.

I have 8GB.

But if there's nothing to move via mover, it should not crash the system.

The amount of swap being used is small, but then again, you must have rebooted very recently to enable swap and move the apps, so that means right away the kernel sees the need to move pages to swap.

It's a small amount but there's pressure for memory.

ajburnet · February 26, 2012

Thanks. Will try that tonight. I notice that quite often the swap file doesn't seem to start. It shows:

/etc/rc.d/rc.local: line 33: 9760 killed

Why would that be the case?

I'm not fluent in Linux. To mount a tmpfs, is it something like this in the go script?

mv /usr/local /usr/local.d

mkdir /usr/local

mount -t tmpfs -o size=2000m tmpfs /usr/local

rsync -a --remove-sent-files /usr/local.d/ /usr/local

Thanks - I owe you!

Alex

WeeboTech · February 27, 2012

I don't know that those are the exact commands, but that is close to what they are.

The rsync may have to change with the / ending character or not. I cannot remember.

What was the output of your du -hs /usr/local?

ajburnet · February 27, 2012

26M, not that big surely!?

Am trying to enable the tmpfs to test impact...will report back

ajburnet · February 27, 2012

swap and tmpfs now running and multiple manual invokes of the mover script have not crashed the array - looking good at this point.

The only cause for a raised eyebrow is that the swap file usage seems to be slowly increasing. I can only assume that something in the tmpfs (CouchPotato and SickBeard, since they are the only things still running in /usr/local) has a memory leak or something. Or would that be a bad assumption?

details below:

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 1 48644 116400 173872 2979136 9 27 964 1677 1239 1702 7 7 63 23

total used free shared buffers cached

Mem: 3887512 3771112 116400 0 173872 2979136

-/+ buffers/cache: 618104 3269408

Swap: 2047996 48644 1999352

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 60132 113744 179196 2936360 9 28 909 1651 1221 1660 7 7 64 22

total used free shared buffers cached

Mem: 3887512 3773768 113744 0 179196 2936360

-/+ buffers/cache: 658212 3229300

Swap: 2047996 60132 1987864

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 75252 116924 171500 2959432 9 29 859 1616 1204 1608 9 8 63 20

total used free shared buffers cached

Mem: 3887512 3770712 116800 0 171500 2959432

-/+ buffers/cache: 639780 3247732

Swap: 2047996 75252 1972744

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 80576 115856 164368 2972868 9 30 838 1601 1197 1592 9 8 63 20

total used free shared buffers cached

Mem: 3887512 3771904 115608 0 164368 2972868

-/+ buffers/cache: 634668 3252844

Swap: 2047996 80576 1967420

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 82988 117896 163428 2974148 9 30 808 1585 1187 1568 9 8 64 19

total used free shared buffers cached

Mem: 3887512 3769616 117896 0 163428 2974148

-/+ buffers/cache: 632040 3255472

Swap: 2047996 82988 1965008

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 83728 113500 162300 2979940 9 30 798 1583 1184 1562 10 8 64 19

total used free shared buffers cached

Mem: 3887512 3774012 113500 0 162300 2979940

-/+ buffers/cache: 631772 3255740

Swap: 2047996 83728 1964268

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 85104 115832 163920 2977168 10 31 776 1568 1177 1545 10 8 64 18

total used free shared buffers cached

Mem: 3887512 3771680 115832 0 163920 2977168

-/+ buffers/cache: 630592 3256920

Swap: 2047996 85104 1962892

root@unRAID:/usr/local# vmstat && free

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----

r b swpd free buff cache si so bi bo in cs us sy id wa

0 0 90264 115004 165824 2979940 11 32 737 1554 1164 1514 10 8 64 18

total used free shared buffers cached

Mem: 3887512 3772640 114872 0 165824 2979940

-/+ buffers/cache: 626876 3260636

Swap: 2047996 90264 1957732

I will test overnight but assume it may crash once it reaches the limit of the swapfile?
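
To put a number on how fast swap use is climbing, the "Swap:" rows of repeated `free` runs can be compared. A minimal sketch, assuming the `free` output has been collected into a file or pipe in the old procps layout used above (`swap_growth_kb` is a hypothetical name):

```shell
#!/bin/sh
# Print the change in swap used (kB) between the first and last
# "Swap:" rows seen on stdin. Columns: total used free.
swap_growth_kb() {
    awk '$1 == "Swap:" { if (!seen) { first = $3; seen = 1 } last = $3 }
         END { if (seen) print last - first }'
}

# First and last samples from the post: 90264 - 48644
printf '%s\n' \
    'Swap: 2047996 48644 1999352' \
    'Swap: 2047996 90264 1957732' | swap_growth_kb
# prints 41620
```

Across the eight samples above that is about 41 MB of growth out of a 2 GB swap file, with smaller steps in the later samples than in the first few.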

Thanks - making some real progress I think!

Alex.

WeeboTech · February 27, 2012

There are some ps command-line arguments that will show a process's memory size. You may want to explore that.
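
One possible way to do that, as a minimal sketch: list every process with its resident (RSS) and virtual (VSZ) size in kB and sort by RSS. The `top_mem` name is hypothetical; the `ps -e -o` options used here are the standard POSIX ones, so no procps-specific sort flag is assumed.

```shell
#!/bin/sh
# Show the N processes with the largest resident memory (default 5).
# Columns: PID, RSS in kB, VSZ in kB, command name.
top_mem() {
    n=${1:-5}
    echo "PID RSS_KB VSZ_KB COMMAND"
    # The trailing "=" on each column suppresses the ps header line,
    # so sort only sees data rows.
    ps -e -o pid= -o rss= -o vsz= -o comm= | sort -k2,2nr | head -n "$n"
}

top_mem 5
```

Running it a few times over an hour shows whether any one process keeps growing.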

ajburnet · February 27, 2012

Thanks.

I tried that and the top 5 are sabnzbd, plex, python-sickbeard which I guess is what I'd expect. Also the memory usage for these processes (top 5) isn't really changing.

Anyway - I'll leave overnight and see if it crashes!

Thanks.

Alex.

ajburnet · March 1, 2012

Hi,

After a few days it's crashed again. I tried running the mover script and noticed the output below on the tail I'm running.

Something strange is still going on. I am now running with 8GB, 2GB swap and 2GB tmpfs, so surely not a memory shortage anymore?! There is clearly an issue with rsync in general, I think, but what, I don't know!

I then ran my nightly backup (rsync based, synology to unRAID) which started and then killed the array completely (tail attached). A memory issue again! I can increase the swap from 2GB but surely it's enough as it is!! Still a mystery!

Mar 1 20:58:47 unRAID login[1212]: ROOT LOGIN on '/dev/pts/1' from 'HackPro.home.seedling.net.au'

Mar 1 20:59:07 unRAID kernel: Adding 2047996k swap on /mnt/cache/.unraid.swapfile. Priority:-1 extents:104 across:10094760k

Mar 1 21:03:37 unRAID emhttp: shcmd (59): /usr/local/sbin/mover |& logger &

Mar 1 21:03:37 unRAID logger: mover started

Mar 1 21:03:37 unRAID logger: moving Shared Drive/

Mar 1 21:03:37 unRAID logger: ./Shared Drive/Software Library/Windows/Video Editing and Media/CyberLink.PowerDirector.Ultra.v10.0.0.1129b.Multilingual.Incl-Simkey.zip/Key/Simkey

Mar 1 21:03:37 unRAID logger: .d..t...... ./

Mar 1 21:03:37 unRAID logger: rsync: get_xattr_names: llistxattr("Shared Drive",1024) failed: Input/output error (5)

Mar 1 21:03:37 unRAID logger: .d..t.....x Shared Drive/

Mar 1 21:03:37 unRAID logger: .d..t...... Shared Drive/Software Library/

Mar 1 21:03:37 unRAID logger: .d..t...... Shared Drive/Software Library/Windows/

Mar 1 21:03:37 unRAID logger: cd+++++++++ Shared Drive/Software Library/Windows/Video Editing and Media/

Mar 1 21:03:37 unRAID logger: cd+++++++++ Shared Drive/Software Library/Windows/Video Editing and Media/CyberLink.PowerDirector.Ultra.v10.0.0.1129b.Multilingual.Incl-Simkey.zip/

Mar 1 21:03:37 unRAID logger: cd+++++++++ Shared Drive/Software Library/Windows/Video Editing and Media/CyberLink.PowerDirector.Ultra.v10.0.0.1129b.Multilingual.Incl-Simkey.zip/Key/

Mar 1 21:03:37 unRAID logger: >f+++++++++ Shared Drive/Software Library/Windows/Video Editing and Media/CyberLink.PowerDirector.Ultra.v10.0.0.1129b.Multilingual.Incl-Simkey.zip/Key/Simkey

Mar 1 21:03:37 unRAID logger: rsync: get_xattr_names: llistxattr("Shared Drive",1024) failed: Input/output error (5)

Mar 1 21:03:37 unRAID logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1040) [sender=3.0.4]

Mar 1 21:03:37 unRAID kernel: BUG: unable to handle kernel paging request at 0060eea3

Mar 1 21:03:37 unRAID kernel: IP: [<c1092787>] mntget+0x7/0xf

Mar 1 21:03:37 unRAID kernel: *pdpt = 00000000135ee001 *pde = 0000000000000000

Mar 1 21:03:37 unRAID kernel: Oops: 0000 [#1] SMP

Mar 1 21:03:37 unRAID kernel: Modules linked in: md_mod xor mvsas libsas scsi_transport_sas asus_atk0110 hwmon atiixp r8169 ahci libahci

Mar 1 21:03:37 unRAID kernel:

Mar 1 21:03:37 unRAID kernel: Pid: 5346, comm: fuser Not tainted 3.1.1-unRAID #1 System manufacturer System Product Name/M5A78L-M LX

Mar 1 21:03:37 unRAID kernel: EIP: 0060:[<c1092787>] EFLAGS: 00210206 CPU: 0

Mar 1 21:03:37 unRAID kernel: EIP is at mntget+0x7/0xf

Mar 1 21:03:37 unRAID kernel: EAX: 0060ee8b EBX: ee8c004a ECX: 00000000 EDX: ee8c0042

Mar 1 21:03:37 unRAID kernel: ESI: 0060ee8b EDI: 0001ee8c EBP: d34e3df8 ESP: d34e3df8

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: Oops: 0000 [#1] SMP

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: Process fuser (pid: 5346, ti=d34e2000 task=e20a9e60 task.ti=d34e2000)

Mar 1 21:03:37 unRAID logger: find: `fuser' terminated by signal 9

Mar 1 21:03:37 unRAID kernel: DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

Mar 1 21:03:37 unRAID kernel: Process fuser (pid: 5346, ti=d34e2000 task=e20a9e60 task.ti=d34e2000)

Mar 1 21:03:37 unRAID kernel: Stack:

Mar 1 21:03:37 unRAID kernel: d34e3e04 c1086db6 eea9b100 d34e3e28 c10b2e8f 00000000 ee8c0042 00000000

Mar 1 21:03:37 unRAID kernel: d34e3eac d13a62f0 e3710280 e20a9e60 d34e3e30 c10b2f24 d34e3e40 c10b286d

Mar 1 21:03:37 unRAID kernel: d34e3eac d34e3eac d34e3e80 c108891d 000280da d34e3eb4 d34e3ec0 ee269400

Mar 1 21:03:37 unRAID kernel: Call Trace:

Mar 1 21:03:37 unRAID kernel: [<c1086db6>] path_get+0xd/0x25

Mar 1 21:03:37 unRAID kernel: [<c10b2e8f>] proc_fd_info+0xb3/0xfc

Mar 1 21:03:37 unRAID kernel: [<c10b2f24>] proc_fd_link+0xa/0xc

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: Stack:

Mar 1 21:03:37 unRAID kernel: [<c10b286d>] proc_pid_follow_link+0x2e/0x32

Mar 1 21:03:37 unRAID kernel: [<c108891d>] path_lookupat+0x235/0x4f9

Mar 1 21:03:37 unRAID kernel: [<c1088bfd>] do_path_lookup+0x1c/0x4e

Mar 1 21:03:37 unRAID kernel: [<c1088cc3>] user_path_at_empty+0x3e/0x69

Mar 1 21:03:37 unRAID kernel: [<c10b3d5e>] ? proc_readfd_common+0x118/0x15b

Mar 1 21:03:37 unRAID kernel: [<c1088cfb>] user_path_at+0xd/0xf

Mar 1 21:03:37 unRAID kernel: [<c10831bd>] vfs_fstatat+0x40/0x67

Mar 1 21:03:37 unRAID kernel: [<c10832b5>] vfs_stat+0x13/0x15

Mar 1 21:03:37 unRAID kernel: [<c10832cb>] sys_stat64+0x14/0x28

Mar 1 21:03:37 unRAID kernel: [<c108c0cd>] ? vfs_readdir+0x6d/0x7e

Mar 1 21:03:37 unRAID kernel: [<c108be24>] ? generic_block_fiemap+0x43/0x43

Mar 1 21:03:37 unRAID kernel: [<c108c176>] ? sys_getdents64+0x98/0xa5

Mar 1 21:03:37 unRAID kernel: [<c130b92d>] syscall_call+0x7/0xb

Mar 1 21:03:37 unRAID kernel: Code: 00 00 00 8b 00 e8 81 ff ff ff 85 c0 74 06 8b 50 18 64 ff 02 ba 80 49 4a c1 64 03 15 14 40 4a c1 fe 02 5d c3 55 85 c0 89 e5 74 06 <8b> 50 18 64 ff 02 5d c3 55 ba 80 49 4a c1 89 e5 53 8b 58 48 64

Mar 1 21:03:37 unRAID kernel: EIP: [<c1092787>] mntget+0x7/0xf SS:ESP 0068:d34e3df8

Mar 1 21:03:37 unRAID kernel: CR2: 000000000060eea3

Mar 1 21:03:37 unRAID kernel: ---[ end trace c8138580b227af99 ]---

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: EIP: [<c1092787>] mntget+0x7/0xf SS:ESP 0068:d34e3df8

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: CR2: 000000000060eea3

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: Code: 00 00 00 8b 00 e8 81 ff ff ff 85 c0 74 06 8b 50 18 64 ff 02 ba 80 49 4a c1 64 03 15 14 40 4a c1 fe 02 5d c3 55 85 c0 89 e5 74 06 <8b> 50 18 64 ff 02 5d c3 55 ba 80 49 4a c1 89 e5 53 8b 58 48 64

Message from syslogd@unRAID at Thu Mar 1 21:03:37 2012 ...

unRAID kernel: Call Trace:

Mar 1 21:04:37 unRAID kernel: INFO: rcu_sched_state detected stall on CPU 0 (t=6000 jiffies)
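
When triaging a trace like the one above, it helps to pull out just the faulting process and kernel function. In this log they are fuser and mntget, not rsync, and the first kernel message is a bad paging request rather than an out-of-memory report. A minimal sketch that extracts those two fields from syslog text on stdin; `summarize_oops` is a hypothetical name and it only understands the "EIP is at" and "Process" line formats shown in this log.

```shell
#!/bin/sh
# Summarize a 32-bit kernel oops from syslog lines on stdin:
# prints the faulting process name and the function the EIP was in.
summarize_oops() {
    awk '
        /kernel: EIP is at / { fn = $NF }
        /kernel: Process /   { sub(/.*kernel: Process /, ""); proc = $1 }
        END { if (fn != "") printf "process=%s function=%s\n", proc, fn }
    '
}

# Two lines taken from the log above:
printf '%s\n' \
    'Mar 1 21:03:37 unRAID kernel: EIP is at mntget+0x7/0xf' \
    'Mar 1 21:03:37 unRAID kernel: Process fuser (pid: 5346, ti=d34e2000 task=e20a9e60 task.ti=d34e2000)' |
    summarize_oops
# prints: process=fuser function=mntget+0x7/0xf
```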