(Bypassed, not 'solved') Frequent stalls requiring hard power down

Frank1940 · August 16, 2018

This problem could be caused by the way this massive list of files for deletion is generated or compiled. Remember that unRAID runs completely in memory and does not have a swap file on disk to fall back on. Your system could simply be running out of memory to store the list of files. That will cause problems, as the OS will then begin dumping 'stale' processes to free up memory.
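One rough way to test the out-of-memory theory is to log available memory while the rm is running. The following is only a minimal Python sketch, assuming a Linux `/proc/meminfo` that reports a `MemAvailable` field; the function names are made up for illustration and are not part of unRAID.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields):
    """Return MemAvailable / MemTotal, or None if either field is missing."""
    total = fields.get("MemTotal")
    available = fields.get("MemAvailable")
    if not total or available is None:
        return None
    return available / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (the path is an assumption)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling `available_fraction(read_meminfo())` every few seconds during the deletion and printing the result would show whether memory is actually shrinking before a stall.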

MarkUK · August 17, 2018

Hey Frank, appreciate the reply - I had considered this, but memory usage is unaffected, and the rm can run for many hours (happily deleting files the whole time, not just enumerating the list of files to delete) before it crashes! The files are also nested 5 - 8 folders deep, so each folder holds a couple of thousand files at most.

pwm · August 17, 2018

It's possible that shfs/FUSE gets into trouble from the large number of files accessed in quite a short time - but it doesn't feel like deleting a huge number of files should be any worse than running programs that compute hashes for the files.

MarkUK · August 18, 2018

My guess is that some resource is being exhausted within either XFS or, indeed, one of the supporting techs interfacing with either the RAID array or the Docker containers/VMs (although I'm pretty certain I've had a crash with a regular Unraid-only rm - no virtual mount / 9p / etc). I had hoped to find the culprit by examining the XFS stats during a crash and seeing an exhaustion of inodes or something, but nothing looked obviously wrong there.

Honestly, I'm at a loss as to what's causing this. My next step, though, is to try to make this reproducible (and consistent) and remove all the extra factors such as the software (ZM), the VM, the Docker container, etc. Just try to boil it down to the minimum steps that cause the problem...

You're right that the level of file access doesn't seem to be abnormal - although the level of deletions may be. The hang is definitely total - even if the system is left for most of the day to recover, it never does. The hang also appears to be purely IO (from what I can see), as the system is still trying to work (e.g., I can run simple SSH commands until one of them tries to do any IO, and then that SSH session hangs too).
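A hang that only bites when something touches IO usually shows up as processes stuck in uninterruptible sleep (state `D` in `/proc/<pid>/stat`). As a minimal sketch, assuming a standard Linux `/proc` layout, something like the following could list which processes are blocked at the moment of the stall. The function names are hypothetical, chosen just for this example.

```python
import os


def parse_stat_state(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the first '(' and the last ')'.
    """
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Reading `/proc` should not touch the array, but starting a new Python interpreter might, so a loop calling `blocked_processes()` would probably need to be running already before the hang begins.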

MarkUK · August 18, 2018

The discussion linked below looks very similar in nature to what I've seen, except their problem only lasts 10 - 15 minutes. It could just be a matter of magnitude, though; perhaps mine would free up after 10 - 20 hours or more?! I've never left it that long so far...

https://www.spinics.net/lists/linux-xfs/msg06058.html

Towards the end, the discussion turns to the mass deletion of file structures, similar to what I've seen. Their solution was, effectively, to slow down deletions by reducing how parallel they were...
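For reference, the "slow it down" idea could be tried with a deliberately serial, throttled delete in place of a plain rm. This is only a sketch under assumptions: the batch size and pause length are guesses rather than values from the linked thread, the pauses are an addition on top of the thread's "less parallel" fix, and `throttled_delete` is a hypothetical helper name.

```python
import os
import time


def throttled_delete(root, batch_size=500, pause_seconds=1.0, sleep=time.sleep):
    """Delete everything under root one file at a time, pausing after each
    batch of files. Returns the number of files removed.

    batch_size and pause_seconds are illustrative defaults, not tuned values.
    """
    removed = 0
    # topdown=False visits the deepest directories first, so every
    # subdirectory has been emptied by the time its parent is processed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.unlink(os.path.join(dirpath, name))
            removed += 1
            if removed % batch_size == 0:
                sleep(pause_seconds)  # give the filesystem time to catch up
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                os.unlink(path)  # symlink to a directory: remove the link only
            else:
                os.rmdir(path)
    os.rmdir(root)
    return removed
```

Running it against a small throwaway directory first would be sensible, since it removes the whole tree including `root` itself.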
