Random unresponsiveness after large data movements

TheSkaz · September 7, 2021

My server freezes and requires a hard reset at random times. It seems that creating/moving large data sets triggers this. I'll try to reproduce it now that I have syslog writing to flash.

tower-diagnostics-20210907-0907.zip

TheSkaz · September 8, 2021

Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 3559. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436728 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/var?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /var
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 2892. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436730 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devs?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /devs
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 2289. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436731 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/shares?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /shares
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 3233. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436732 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/cpuload?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /cpuload
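The excerpt above repeats the same failure for several nchan channels (/var, /devs, /shares, /cpuload). As a rough sketch for seeing which channels are affected over a longer log, the helper below counts the "can't create shared message" lines per channel. The function name is made up for illustration, and the syslog path in the usage comment is an assumption.

```shell
# Count "can't create shared message" errors per nchan channel.
# Reads syslog text on stdin; prints "<count> <channel>" lines, busiest first.
nchan_failures_by_channel() {
    grep -o "can't create shared message for channel [^ ]*" \
        | awk '{ print $NF }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage (path is an assumption; point it at wherever your syslog is written):
# nchan_failures_by_channel < /var/log/syslog
```

This only summarizes the symptom. It does not say why nginx ran out of shared memory.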

The system froze again. I think the log excerpt above may be of use.

ChatNoir · September 8, 2021

On 9/7/2021 at 5:17 PM, TheSkaz said:

It seems that creating/moving large data sets triggers this.

How are you moving/creating this data?

Is it through the network or within Unraid?

trurl · September 8, 2021

df in your diagnostics shows a few mounts that aren't the usual that Unraid would create:

Filesystem      Size  Used Avail Use% Mounted on
datastore       3.6T  2.3G  3.6T   1% /datastore
vmstorage       1.1T  916G  159G  86% /vmstorage
fast            7.1T  492G  6.6T   7% /fast

so I assume you must have done that yourself.

Are these involved in your problem?

TheSkaz · September 8, 2021

Those are ZFS pools that I created myself. It's usually writing to those pools at 2TB/s or greater that seems to cause this, though I could be wrong. Upon reboot, the vmstorage and fast pools have corrupt files on them. I delete the files, scrub and clean the pools, and we are good to go. I don't know whether the system crashing is causing the corruption or the corruption is causing the system to crash.

fast is a raidz array of 2TB NVMe drives.

vmstorage is a raidz array of 240GB SSDs.

TheSkaz · September 8, 2021

28 minutes ago, ChatNoir said:

How are you moving/creating this data?

Is it through the network or within Unraid?

Within a Docker container or VM in Unraid.

ChatNoir · September 8, 2021

3 hours ago, TheSkaz said:

Within a Docker container or VM in Unraid.

Since your log extract mentions out-of-memory issues, are you sure that you are not somehow using a wrong path and writing your data to memory instead, thus crashing the server?
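One way to check this on the host is to ask which filesystem type backs the destination path. The sketch below uses GNU `stat -f -c %T` and treats tmpfs and ramfs as RAM-backed. The function names are made up for illustration, and the example path is hypothetical. On Unraid, paths outside /mnt are commonly held in RAM, so a host-side mapping that points there is worth a look.

```shell
# Classify a filesystem type name as RAM-backed or not.
classify_fstype() {
    case "$1" in
        tmpfs|ramfs) echo "ram-backed" ;;
        *)           echo "not ram-backed" ;;
    esac
}

# Report whether the filesystem holding a given path is RAM-backed.
# Relies on GNU coreutils stat; %T prints the filesystem type name.
path_backing() {
    classify_fstype "$(stat -f -c %T -- "$1")"
}

# Usage (hypothetical path; use the host side of your container mapping):
# path_backing /mnt/user/downloads
```

A result of "ram-backed" for a path that is supposed to land on a pool or the array would support the wrong-path theory.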

TheSkaz · September 8, 2021

That may be possible... let me see.

On a different front, it crashed again. What happens after a crash is that I reboot, get a kernel panic, reboot again, and then it boots up.

This happens consistently.

TheSkaz · September 8, 2021

It did it again. This time I have 0 Docker containers running and the VM service is stopped. It is showing that I am using 85ish GB of RAM...

tower-diagnostics-20210908-1451.zip

TheSkaz · September 8, 2021

Looking at the process list, it shows 0% memory usage across all 2367 processes.

TheSkaz · September 9, 2021

After a ton of googling, I ran

echo 3 > /proc/sys/vm/drop_caches

this morning, and that cleared up a lot of the RAM. The system still crashed overnight.
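One possible explanation for a large amount of RAM in use with no process owning it is the ZFS ARC, which is kernel memory and is not attributed to any process. As a sketch, assuming OpenZFS on Linux exposes its statistics at /proc/spl/kstat/zfs/arcstats, the helper below prints a single arcstats field in GiB. The function name is made up for illustration.

```shell
# Print one arcstats field (for example "size" or "c_max") in GiB.
# The second argument overrides the stats file, which is handy for testing.
arcstat_gib() {
    local field="$1" stats="${2:-/proc/spl/kstat/zfs/arcstats}"
    # Rows are "name type data"; match the name exactly and convert bytes to GiB.
    awk -v f="$field" '$1 == f { printf "%.1f\n", $3 / 1073741824 }' "$stats"
}

# Usage:
# arcstat_gib size    # current ARC size
# arcstat_gib c_max   # upper limit the ARC may grow to
```

If `size` comes out close to the unexplained usage, the ARC is the likely holder of that memory.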

trurl · September 9, 2021

On 9/8/2021 at 11:52 AM, TheSkaz said:

ZFS pool

Is that using the ZFS plugin? Do you have any problems without those?

You can go directly to the correct support thread for any of your plugins by selecting its Support Link on the Plugins page.

TheSkaz · September 10, 2021

20 hours ago, trurl said:

Is that using the ZFS plugin? Do you have any problems without those?

You can go directly to the correct support thread for any of your plugins by selecting its Support Link on the Plugins page.

I have been testing that. I have been doing writes only to the cache drive (a 1TB NVMe drive), and it's "more" stable but will still crash.

JorgeB · September 10, 2021

You should run memtest.

TheSkaz · September 16, 2021

On 9/10/2021 at 9:56 AM, JorgeB said:

You should run memtest.

Memtest resulted in 0 errors.

What I observed was that if RAM usage shot up (like filling up a ramdrive or similar), the "cached RAM" would not release fast enough, causing a system crash. Also, ZFS seems to be a culprit. Effectively, having 4 Sabrent Gen4 2TB NVMe drives in a RAID0-equivalent array seems to cause issues at high write speeds. At the moment, I think I have the RAM under control, and I am researching ZFS.

TheSkaz · September 22, 2021

On 9/10/2021 at 9:56 AM, JorgeB said:

You should run memtest.

I was still having issues, so I am rerunning memtest.

It takes about 24 hours for one pass. Is one pass enough, or do I need to let all the passes complete?

Also, I finally found the ZFS settings that are best for NVMe drives (as proposed by LTT) and have implemented them. I still get weird memory errors.

primarycache=metadata

autotrim=on

atime=off
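For reference, these settings are applied with two different tools: primarycache and atime are dataset properties (set with `zfs set`), while autotrim is a pool property (set with `zpool set`). The sketch below is an assumption about how they might be applied, not the exact commands used in this thread. It prints the commands by default and only runs them when DRY_RUN=0 is set. The helper names are made up for illustration.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Apply the three settings to the pool named in the first argument.
apply_nvme_settings() {
    local pool="$1"
    run zfs set primarycache=metadata "$pool"   # dataset property, inherited by child datasets
    run zpool set autotrim=on "$pool"           # pool property, so zpool rather than zfs
    run zfs set atime=off "$pool"               # dataset property
}

# Usage:
# apply_nvme_settings fast             # dry run, prints the commands
# DRY_RUN=0 apply_nvme_settings fast   # actually applies them
```

Note that primarycache=metadata stops the ARC from caching file data for that dataset, which trades read performance for lower memory use.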

TheSkaz · September 22, 2021

Side note: it's absolutely killing me that it's using only 1 core...
