October 23, 2017
Hi all,

My unRAID system has been randomly hanging for the past few weeks; the only resolution is a hard reset. I've captured the logs from one of the events via remote syslog. What other information would be helpful to get some assistance identifying the issue?

The hang isn't a complete hang. It seems more like resource starvation preventing any type of connectivity (I'm still able to ping).

Any assistance would be greatly appreciated.

Thanks

Edit: Seeing a lot of this in the syslog I've captured, if that's helpful at all:

Quote
Oct 19 14:31:50 Tower kernel: Call Trace:
Oct 19 14:31:50 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Oct 19 14:31:50 Tower kernel: [<ffffffff810cb5b1>] warn_alloc+0x102/0x116
Oct 19 14:31:50 Tower kernel: [<ffffffff810d7980>] ? try_to_free_pages+0x9e/0xa5
Oct 19 14:31:50 Tower kernel: [<ffffffff810cbb67>] __alloc_pages_nodemask+0x541/0xc71
Oct 19 14:31:50 Tower kernel: [<ffffffff810d133c>] ? __page_cache_release+0x1d0/0x1df
Oct 19 14:31:50 Tower kernel: [<ffffffff810e95c2>] ? wp_page_copy+0x560/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff81103997>] alloc_pages_vma+0x183/0x1f5
Oct 19 14:31:50 Tower kernel: [<ffffffff810e90f7>] wp_page_copy+0x95/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff810ebdc0>] ? alloc_set_pte+0x322/0x490
Oct 19 14:31:50 Tower kernel: [<ffffffff810ea3e3>] do_wp_page+0x17a/0x5c8
Oct 19 14:31:50 Tower kernel: [<ffffffff810ee516>] handle_mm_fault+0xc72/0xf96
Oct 19 14:31:50 Tower kernel: [<ffffffff81042252>] __do_page_fault+0x24a/0x3ed
Oct 19 14:31:50 Tower kernel: [<ffffffff81042438>] do_page_fault+0x22/0x27
Oct 19 14:31:50 Tower kernel: [<ffffffff81680f18>] page_fault+0x28/0x30

Oct 19 15:13:59 Tower kernel: [<ffffffff81117aef>] ? get_mem_cgroup_from_mm+0x9c/0xa4
Oct 19 15:13:59 Tower kernel: [<ffffffff81102d82>] alloc_pages_current+0xbe/0xe8
Oct 19 15:13:59 Tower kernel: [<ffffffff810c92d4>] __get_free_pages+0x9/0x37
Oct 19 15:13:59 Tower kernel: [<ffffffff81046693>] pgd_alloc+0x16/0xf8
Oct 19 15:13:59 Tower kernel: [<ffffffff8104a40b>] mm_init+0x15f/0x1bc
Oct 19 15:13:59 Tower kernel: [<ffffffff8104b98f>] copy_process.part.4+0xc1d/0x1822
Oct 19 15:13:59 Tower kernel: [<ffffffff81122f1b>] ? get_empty_filp+0x4e/0x162
Oct 19 15:13:59 Tower kernel: [<ffffffff8110b189>] ? __slab_alloc.isra.15+0x26/0x39
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c72f>] _do_fork+0xb7/0x2af
Oct 19 15:13:59 Tower kernel: [<ffffffff81123047>] ? alloc_file+0x18/0x95
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c999>] SyS_clone+0x14/0x16
Oct 19 15:13:59 Tower kernel: [<ffffffff81002dbb>] do_syscall_64+0x157/0x1c7
Oct 19 15:13:59 Tower kernel: [<ffffffff8113984e>] ? fd_install+0x20/0x22
Oct 19 15:13:59 Tower kernel: [<ffffffff81573404>] ? SyS_socketpair+0x148/0x1a0
Oct 19 15:13:59 Tower kernel: [<ffffffff8167f5eb>] entry_SYSCALL64_slow_path+0x25/0x25

Edited October 23, 2017 by Soup
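To see how often these allocation warnings occur and when, the captured syslog can be summarized per timestamp. The following is a minimal Python sketch, not an unRAID tool: the function name `count_frames` and the regular expression are assumptions based only on the line format quoted above, and other kernel versions may log frames differently.

```python
import re
from collections import Counter

# Matches kernel stack-frame lines in the format quoted above, e.g.
#   Oct 19 14:31:50 Tower kernel: [<ffffffff810cb5b1>] warn_alloc+0x102/0x116
# The optional "? " marks frames the kernel considers unreliable.
FRAME_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x"
)


def count_frames(lines, symbol="warn_alloc"):
    """Count, per timestamp, how often `symbol` appears as a stack frame."""
    hits = Counter()
    for line in lines:
        m = FRAME_RE.match(line)
        if m and m.group("sym") == symbol:
            hits[m.group("ts")] += 1
    return hits
```

Passing an open file object for the captured syslog (any path you use is your own) to `count_frames` returns a `Counter` keyed by timestamp. Comparing those timestamps with the mover schedule is one way to check whether the warnings line up with mover runs.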
October 23, 2017 Author
So, it looks like this happens when the mover kicks in to migrate data from the cache drive to the array. I noticed that rsync was using a large percentage of CPU and found three processes attempting to move the same file to the array. I killed them and the system returned to normal.

Any thoughts?
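For anyone wanting to check for the same symptom, here is a minimal Python sketch that groups rsync processes by identical command line. It assumes the text comes from `ps -eo pid,args`; the function name `duplicate_rsyncs` is made up for illustration. Keep in mind that a single local rsync transfer normally forks into more than one process with the same command line, so a group of duplicates is a hint worth investigating rather than proof of a fault.

```python
from collections import defaultdict


def duplicate_rsyncs(ps_output):
    """Group rsync PIDs by identical command line.

    `ps_output` is the text printed by `ps -eo pid,args` (header optional).
    Returns only the command lines shared by more than one PID.
    """
    groups = defaultdict(list)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header and malformed lines
        pid, args = int(parts[0]), parts[1].strip()
        argv = args.split()
        # Accept both "rsync" and a full path such as "/usr/bin/rsync".
        if argv and argv[0].rsplit("/", 1)[-1] == "rsync":
            groups[args].append(pid)
    return {cmd: pids for cmd, pids in groups.items() if len(pids) > 1}
```

The output of `ps` can be captured with `subprocess.run(["ps", "-eo", "pid,args"], capture_output=True, text=True).stdout` and passed straight in.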
October 23, 2017 Author
It doesn't look like the right syslog was included in there. Attached: syslog-2017-10-23.tgz
October 24, 2017
The same thing, with the same errors, has been happening to me. When the mover triggers or I run it manually, rsync goes haywire and uses all (or nearly all) of the CPU on the box. Then the OOM thread reaper starts killing off threads, and eventually all shares disappear, the WebUI slows to a crawl, and everything connected to the array gets I/O errors.

I caught it in the act today and attempted to run diagnostics while it was happening, but the diagnostics script kept getting killed by the system. I was able to run 'killall rsync', after which everything became responsive again, and I was able to complete a diagnostics run, which I have attached for inspection.

I am just getting started with unRAID, having moved from FreeNAS for the flexibility, but this is going to make me move back in short order. I have disabled the cache completely, so hopefully the mover won't get me again.

newnas-diagnostics-20171024-0938.zip
October 24, 2017 Community Expert
This should help with OOM errors when running the mover with v6.3.5:
October 29, 2017 Author
I disabled the cache drive on the shares where I was experiencing the issue, and it hasn't happened since. I'll re-enable the cache drive on one of them, adjust the settings as described in the linked post, and see if it happens again. (The system has 32G of RAM; the other post mentions the issue is common on systems with more than 8G.)