Random unresponsiveness after large data movements

September 7, 2021

My server freezes and requires a hard reset at random times. It seems that creating/moving large data sets triggers this. I'll attempt to see if I can recreate it now that I have syslog writing to flash.

tower-diagnostics-20210907-0907.zip

September 8, 2021

Author

Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 3559. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436728 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/var?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /var
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 2892. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436730 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devs?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /devs
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 2289. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436731 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/shares?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /shares
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [crit] 26160#26160: ngx_slab_alloc() failed: no memory
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: shpool alloc failed
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: nchan: Out of shared memory while allocating message of size 3233. Increase nchan_max_reserved_memory.
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: *436732 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/cpuload?buffer_length=1 HTTP/1.1", host: "localhost"
Sep  8 02:28:40 Tower nginx: 2021/09/08 02:28:40 [error] 26160#26160: MEMSTORE:00: can't create shared message for channel /cpuload

System froze again. Here is something of use, I think.
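For anyone reading a longer syslog, a small script can tally these failures instead of scanning them by eye. The following is a minimal sketch, not part of the original post: the function name `summarize_nchan_errors` is hypothetical, and the patterns assume the exact message wording shown in the extract above.

```python
import re
from collections import Counter

# Patterns assume the nchan wording seen in the syslog extract above.
PUBLISH_ERROR = re.compile(
    r'nchan: error publishing message .*request: "POST /pub/(?P<channel>[^?" ]+)'
)
OUT_OF_SHM = re.compile(
    r"nchan: Out of shared memory while allocating message of size (?P<size>\d+)"
)


def summarize_nchan_errors(lines):
    """Count failed publishes per channel and collect rejected message sizes."""
    channels = Counter()
    sizes = []
    for line in lines:
        match = PUBLISH_ERROR.search(line)
        if match:
            channels[match.group("channel")] += 1
        match = OUT_OF_SHM.search(line)
        if match:
            sizes.append(int(match.group("size")))
    return channels, sizes
```

Passing the lines of a saved syslog file to `summarize_nchan_errors` returns a per-channel counter (for the extract above: `var`, `devs`, `shares`, `cpuload`) and the list of message sizes that could not be allocated.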

September 8, 2021

On 9/7/2021 at 5:17 PM, TheSkaz said:

It seems that creating/moving large data sets triggers this.

How are you moving / creating this data?

Is it through the network or within Unraid?

September 8, 2021

Community Expert

df in your diagnostics shows a few mounts that aren't the usual ones Unraid would create:

Filesystem      Size  Used Avail Use% Mounted on
datastore       3.6T  2.3G  3.6T   1% /datastore
vmstorage       1.1T  916G  159G  86% /vmstorage
fast            7.1T  492G  6.6T   7% /fast

so I assume you must have done that yourself.

Are these involved in your problem?

September 8, 2021

Author

Those are ZFS pools that I created myself. Usually it's writing to those pools at 2TB/s or greater that seems to cause this, but I could be wrong. Upon reboot, the vmstorage and fast pools have corrupt files on them. I delete the files, scrub, and clean the pools, and we are good to go. I don't know if the system crashing is causing the corruption, or the corruption is causing the system to crash.

fast is a raidz array of 2TB NVMe drives.

vmstorage is a raidz array of 240GB SSD drives.

September 8, 2021

Author

28 minutes ago, ChatNoir said:

How are you moving / creating this data?

Is it through the network or within Unraid?

within a docker or vm in unraid

September 8, 2021

3 hours ago, TheSkaz said:

within a docker or vm in unraid

Since your log extract mentions out-of-memory issues, are you sure that you are not using a wrong path somehow, writing your data to memory instead and thus crashing the server?
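One way to check for a wrong path is to look up which mounted filesystem a target path actually lands on. The sketch below is illustrative only: the helper names and the `RAM_BACKED` set are assumptions, and it takes the text of `/proc/mounts` as input. Since Unraid runs its root filesystem from RAM, a mistyped path that falls through to `/` would be written to memory.

```python
import os

# Filesystem types that live in RAM (an assumed, non-exhaustive list).
RAM_BACKED = {"rootfs", "tmpfs", "ramfs", "devtmpfs"}


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.normpath(path)
    best = None
    for mount_point, fs_type in mounts:
        mp = mount_point.rstrip("/") or "/"
        if path == mp or mp == "/" or path.startswith(mp + "/"):
            # ">=" lets a later entry for the same mount point win.
            if best is None or len(mp) >= len(best[0]):
                best = (mp, fs_type)
    return best[1] if best else None


def is_ram_backed(path, mounts):
    """True if path resolves to a RAM-resident filesystem type."""
    return fs_type_for(path, mounts) in RAM_BACKED
```

For example, with `mounts = parse_mounts(open("/proc/mounts").read())`, calling `is_ram_backed` on each container or VM target path shows whether any of them misses the intended pool. Mount points containing spaces are escaped in `/proc/mounts` and are not handled by this sketch.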

September 8, 2021

Author

That may be possible... let me see.

On a different front, it crashed again. What happens after a crash is that I reboot, get a kernel panic, reboot again, and it boots up.

This happens consistently.

September 8, 2021

Author

Did it again. This time I have 0 dockers running and the VM service stopped. It is showing that I am using 85ish GB of RAM...

tower-diagnostics-20210908-1451.zip

September 8, 2021

Author

Looking at processes shows 0% memory usage across all 2367 processes.
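Memory that no process owns can still be in use by the kernel, so a per-process view may show almost nothing while the dashboard reports tens of GB. The sketch below splits `/proc/meminfo` into free memory, page cache, and everything else; the function names are hypothetical. On Linux the ZFS ARC is not counted under `Cached`, so on a ZFS system it would fall into the "other" bucket, though whether that explains the 85ish GB here is an assumption, not something the thread confirms.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_breakdown(info):
    """Split total memory into free, reclaimable cache, and everything else."""
    total = info["MemTotal"]
    free = info["MemFree"]
    cache = (
        info.get("Buffers", 0)
        + info.get("Cached", 0)
        + info.get("SReclaimable", 0)
    )
    return {
        "total_kb": total,
        "free_kb": free,
        "page_cache_kb": cache,
        # Process memory, kernel allocations and, on ZFS systems, the ARC.
        "other_kb": total - free - cache,
        "available_kb": info.get("MemAvailable"),
    }
```

Running `memory_breakdown(parse_meminfo(open("/proc/meminfo").read()))` before and after a large copy gives a rough picture of which bucket is growing.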

September 9, 2021

Author

After a ton of googling, I ran

echo 3 > /proc/sys/vm/drop_caches

this morning, and that cleared up a lot of the RAM. The system still crashed overnight.

September 9, 2021

Community Expert

On 9/8/2021 at 11:52 AM, TheSkaz said:

ZFS pool

Is that using the ZFS plugin? Do you have any problems without those?

You can go directly to the correct support thread for any of your plugins by selecting its Support Link on the Plugins page.

September 10, 2021

Author

20 hours ago, trurl said:

Is that using the ZFS plugin? Do you have any problems without those?

You can go directly to the correct support thread for any of your plugins by selecting its Support Link on the Plugins page.

I have been testing that. I have been doing writes only to the cache drive (1TB NVMe drive) and it's "more" stable, but it will still crash.

September 10, 2021

Community Expert

You should run memtest.

September 16, 2021

Author

On 9/10/2021 at 9:56 AM, JorgeB said:

You should run memtest.

Memtest resulted in 0 errors.

What I observed was that if RAM usage shot up (like filling up a ramdrive or similar), the "cached RAM" would not release fast enough, causing a system crash. Also, ZFS seems to be a culprit. Effectively having 4 Sabrent Gen4 2TB NVMe drives in a RAID0-equivalent array seems to cause issues at high write speeds. At the moment, I think I have the RAM under control, and I am researching ZFS.
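If the slow-to-release "cached RAM" is the ZFS ARC (an assumption; the thread does not establish this), its current size and ceiling can be read from the kstat file that OpenZFS on Linux exposes at `/proc/spl/kstat/zfs/arcstats`. The following is a rough sketch with hypothetical function names that parses that file's `name type data` rows.

```python
def parse_arcstats(text):
    """Parse arcstats-style text ("name type data" rows) into a dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields with a numeric value last;
        # the two header lines do not match and are skipped.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def arc_usage_gib(stats):
    """Return (current ARC size, ARC maximum) in GiB."""
    gib = 1024 ** 3
    return stats["size"] / gib, stats["c_max"] / gib
```

Calling `arc_usage_gib(parse_arcstats(open("/proc/spl/kstat/zfs/arcstats").read()))` during a heavy write shows whether the ARC is near its ceiling. If it is, the `zfs_arc_max` module parameter is the usual knob for lowering that ceiling, but a suitable value for this system is not something the thread provides.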

September 22, 2021

Author

On 9/10/2021 at 9:56 AM, JorgeB said:

You should run memtest.

I was still having issues, so I am rerunning memtest.

It takes about 24 hours for 1 pass. Is one pass good enough, or do I need to let all the passes complete?

Also, I finally found the ZFS settings that are best for NVMe drives (as proposed by LTT), and have them implemented. I still get weird memory errors.

primarycache=metadata

autotrim=on

atime=off
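These three settings are not all applied the same way: `primarycache` and `atime` are dataset properties set with `zfs set`, while `autotrim` is a pool property set with `zpool set`. The sketch below only builds the command lines and does not run them; the function name is hypothetical, and using the pool `fast` from earlier in the thread as both pool and dataset is an assumption made for illustration.

```python
# Properties that belong to the pool rather than to a dataset.
POOL_PROPERTIES = {"autotrim"}


def build_commands(pool, dataset, settings):
    """Return one argument list per property, using zpool or zfs as needed."""
    commands = []
    for name, value in settings.items():
        if name in POOL_PROPERTIES:
            commands.append(["zpool", "set", f"{name}={value}", pool])
        else:
            commands.append(["zfs", "set", f"{name}={value}", dataset])
    return commands


settings = {"primarycache": "metadata", "autotrim": "on", "atime": "off"}
for command in build_commands("fast", "fast", settings):
    print(" ".join(command))
```

Note that `primarycache=metadata` stops ZFS from caching file data in the ARC, which trades read performance for a smaller memory footprint.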

September 22, 2021

Author

Side note: it's absolutely killing me that it's using 1 core...
