System Hanging - Need assistance identifying cause - General Support

October 23, 2017

Hi All,

My unRAID system has been hanging at random for the past few weeks, and the only fix is a hard reset. I've captured the logs from one of the events via remote syslog. What other information would help in identifying the issue?

It isn't a complete hang; it looks more like resource starvation that prevents any kind of connectivity (I'm still able to ping the server).

Any assistance would be greatly appreciated.

Thanks

Edit: I'm seeing a lot of this in the syslog I captured, if that's helpful at all:

Quote

Oct 19 14:31:50 Tower kernel: Call Trace:
Oct 19 14:31:50 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Oct 19 14:31:50 Tower kernel: [<ffffffff810cb5b1>] warn_alloc+0x102/0x116
Oct 19 14:31:50 Tower kernel: [<ffffffff810d7980>] ? try_to_free_pages+0x9e/0xa5
Oct 19 14:31:50 Tower kernel: [<ffffffff810cbb67>] __alloc_pages_nodemask+0x541/0xc71
Oct 19 14:31:50 Tower kernel: [<ffffffff810d133c>] ? __page_cache_release+0x1d0/0x1df
Oct 19 14:31:50 Tower kernel: [<ffffffff810e95c2>] ? wp_page_copy+0x560/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff81103997>] alloc_pages_vma+0x183/0x1f5
Oct 19 14:31:50 Tower kernel: [<ffffffff810e90f7>] wp_page_copy+0x95/0x586
Oct 19 14:31:50 Tower kernel: [<ffffffff810ebdc0>] ? alloc_set_pte+0x322/0x490
Oct 19 14:31:50 Tower kernel: [<ffffffff810ea3e3>] do_wp_page+0x17a/0x5c8
Oct 19 14:31:50 Tower kernel: [<ffffffff810ee516>] handle_mm_fault+0xc72/0xf96
Oct 19 14:31:50 Tower kernel: [<ffffffff81042252>] __do_page_fault+0x24a/0x3ed
Oct 19 14:31:50 Tower kernel: [<ffffffff81042438>] do_page_fault+0x22/0x27
Oct 19 14:31:50 Tower kernel: [<ffffffff81680f18>] page_fault+0x28/0x30
Oct 19 15:13:59 Tower kernel: [<ffffffff81117aef>] ? get_mem_cgroup_from_mm+0x9c/0xa4
Oct 19 15:13:59 Tower kernel: [<ffffffff81102d82>] alloc_pages_current+0xbe/0xe8
Oct 19 15:13:59 Tower kernel: [<ffffffff810c92d4>] __get_free_pages+0x9/0x37
Oct 19 15:13:59 Tower kernel: [<ffffffff81046693>] pgd_alloc+0x16/0xf8
Oct 19 15:13:59 Tower kernel: [<ffffffff8104a40b>] mm_init+0x15f/0x1bc
Oct 19 15:13:59 Tower kernel: [<ffffffff8104b98f>] copy_process.part.4+0xc1d/0x1822
Oct 19 15:13:59 Tower kernel: [<ffffffff81122f1b>] ? get_empty_filp+0x4e/0x162
Oct 19 15:13:59 Tower kernel: [<ffffffff8110b189>] ? __slab_alloc.isra.15+0x26/0x39
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c72f>] _do_fork+0xb7/0x2af
Oct 19 15:13:59 Tower kernel: [<ffffffff81123047>] ? alloc_file+0x18/0x95
Oct 19 15:13:59 Tower kernel: [<ffffffff8104c999>] SyS_clone+0x14/0x16
Oct 19 15:13:59 Tower kernel: [<ffffffff81002dbb>] do_syscall_64+0x157/0x1c7
Oct 19 15:13:59 Tower kernel: [<ffffffff8113984e>] ? fd_install+0x20/0x22
Oct 19 15:13:59 Tower kernel: [<ffffffff81573404>] ? SyS_socketpair+0x148/0x1a0
Oct 19 15:13:59 Tower kernel: [<ffffffff8167f5eb>] entry_SYSCALL64_slow_path+0x25/0x25

Edited October 23, 2017 by Soup
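
For anyone sifting through a long syslog like the one quoted above, here is a minimal Python sketch that counts which kernel functions appear in the call-trace lines. The `warn_alloc` frame in the quoted trace suggests a page-allocation warning, so the sketch also flags that. The function names `summarize_frames` and `has_alloc_warning` are made up for illustration, and the regular expression only assumes the line format visible in the excerpt.

```python
import re
from collections import Counter

# Matches one stack-frame line in the format seen in the quoted syslog, e.g.
#   "... kernel: [<ffffffff810cb5b1>] warn_alloc+0x102/0x116"
# An optional "? " before the symbol marks a frame the kernel flags as unreliable.
FRAME_RE = re.compile(
    r"kernel: \[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_frames(lines):
    """Count function names from stack-frame lines, skipping '?' frames."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match and match.group(1) is None:
            counts[match.group(2)] += 1
    return counts


def has_alloc_warning(lines):
    """Return True if any trace passes through warn_alloc."""
    return summarize_frames(lines)["warn_alloc"] > 0
```

Feeding it the lines of a captured syslog file (for example `summarize_frames(open(path))`, where `path` is wherever the remote syslog was saved) gives a quick view of which code paths show up most often.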

October 23, 2017

Author

So, it looks like this happens when the mover kicks in to migrate files from the cache drive to the array.

I saw that rsync was using a large share of the CPU and found 3 processes attempting to move the same file to the array. I killed them and the system returned to normal.

Any thoughts?
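
The duplicate-process check described above can be sketched in Python as follows. It groups processes by identical command line, given the text output of `ps -eo pid,args`. The function name `find_duplicate_commands` is hypothetical, and matching on the full argument string is an assumption, since the thread does not show the mover's actual rsync command line.

```python
from collections import defaultdict


def find_duplicate_commands(ps_output, program="rsync"):
    """Group PIDs of `program` by identical command line.

    ps_output: text produced by `ps -eo pid,args`, header line included.
    Returns {command_line: [pids]} for command lines shared by 2+ processes.
    """
    groups = defaultdict(list)
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = parts
        executable = args.split()[0]
        # Compare only the basename so "/usr/bin/rsync" also matches "rsync".
        if executable.rsplit("/", 1)[-1] == program:
            groups[args.strip()].append(int(pid))
    return {cmd: pids for cmd, pids in groups.items() if len(pids) > 1}
```

The input can be captured with `subprocess.check_output(["ps", "-eo", "pid,args"], text=True)`. One caveat: a single local rsync transfer normally runs as several processes that share one command line, so a group of identical entries is a hint to look closer, not proof that the mover started the same job several times.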

October 23, 2017

Community Expert

Tools - Diagnostics. Post complete zip

October 23, 2017

Author

Attached

tower-diagnostics-20171023-1104.zip

October 23, 2017

Author

Doesn't look like the right syslog was included in there...

syslog-2017-10-23.tgz

October 24, 2017

The same thing, with the same errors, has been happening to me. When the mover triggers (or I run it manually), rsync goes haywire and uses all, or nearly all, of the CPU on the box. Then the OOM reaper starts killing things off, and eventually all shares disappear, the WebUI slows to a crawl, and everything connected to the array gets I/O errors.

I caught it in the act today and tried to run diagnostics while it was happening, but the system kept killing the diagnostics script. After I ran 'killall rsync', everything became responsive again and I was able to complete the diagnostics that I have attached for inspection.

I am just getting started with unRAID, having moved from FreeNAS for the flexibility, but this is going to make me move back in short order. I have disabled the cache completely, so hopefully the mover won't get me again.

newnas-diagnostics-20171024-0938.zip

October 24, 2017

Community Expert

This should help with OOM errors when running the mover with v6.3.5:

October 24, 2017

I will give that a try and see if I can cause it to happen again.

October 29, 2017

Author

I disabled the cache drive on the shares where I was experiencing the issue, and it hasn't happened since.

I'll re-enable the cache drive on one of them, adjust the settings mentioned in the linked post, and see if it happens again.

(The system has 32 GB of RAM; the other post mentions that the issue is common on systems with more than 8 GB.)
