
(Bypassed, not 'solved') Frequent stalls requiring hard power down



Latest update: see the last post. The problem has been bypassed but not resolved - I'll try to resolve it properly if it comes back in the future!



Earlier update: This problem has come and gone over the past few weeks - the upshot is that Unraid is stalling (with xfsaild in D state) and, within a few minutes, the entire machine is unresponsive. A hard power-off is required. New diagnostics will be added to the last post. We also run Zoneminder on this machine - it periodically deletes large volumes of files, and I'm now fairly certain the crashes occur during these large-scale deletes (we could be talking 10,000 - 50,000 files deleted per run; not obscene, but high).


I have tried the following (plus plenty more I've forgotten about):

- Changing entire motherboard, CPU and RAM

- Removing the LSI 2008 controller (when changing the MB/CPU)

- Extended SMART tests; nothing bad reported

- Moving the Zoneminder installation from a Docker to a full VM; same problem




Hi all, have been lurking for a couple of months since getting Unraid but sadly have a little issue!


Recently (within the last couple of weeks - notably, not since I first started using Unraid) I've been getting system hangs that require a hard power-off. I can occasionally SSH in and sometimes run some commands, but everything backs up and stalls; nothing unlocks, and the web UI is generally inaccessible.


This may well be a more generic Linux issue, but I think I've narrowed some things down. A number of processes end up in the D (uninterruptible sleep) state indefinitely, so I began continuously logging just the D-state processes (so that, when it crashes, the last output would show what was involved). Three processes are consistent across the last three crashes I was able to directly observe:

kworker/u32:0 (identifier varies)

kswapd0 (no swap is actually in use, by the way, but I've since installed the swap plugin in case it was trying to allocate more than the relatively ample memory available)

xfsaild/md4 - this is the one I think could be the real culprit.
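A minimal sketch of the kind of D-state snapshot described above (not the exact command from the post - the filter logic is the point):

```shell
# Snapshot of processes currently in D (uninterruptible sleep) state.
# Wrap in `watch -n 5` or a while/sleep loop to log continuously, so the
# last output before a stall shows which processes were blocked.
ps -eo state,pid,comm,args | awk '$1 == "D"'
```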


One of our Dockers is Zoneminder - it has relatively high CPU and disk usage, but most importantly it creates a large volume of small JPEG files. Periodically it has to delete these JPEGs. I've seen a single reference on the web to xfsaild and large volumes of deletes causing a crash like this, but obviously I can't easily change the kernel, XFS version, and so on. I can (and have) slightly reduced the rate at which Zoneminder deletes by limiting the events deleted per job, but sadly this hasn't fixed it.


CPU usage while a crash is occurring is virtually zero. IO wait is also virtually zero, memory usage is minimal, and so on. Below is one of the 'fuller' crash scenarios - one where multiple processes appeared to back up:

D    [kswapd0]
D    [mdrecoveryd]
D    [nfsd]
D    [nfsd]
D    avahi-daemon: running [unraid.local]
D    /bin/bash /usr/local/emhttp/webGui/scripts/diskload
D    [xfsaild/md4]
D    [kworker/2:0]
D    [kworker/5:0]
D    /usr/bin/perl -wT /usr/bin/zmfilter.pl
D    /usr/bin/perl -wT /usr/bin/zmwatch.pl
D    /usr/bin/php -f /usr/local/emhttp/webGui/include/UpdateDNS.php
D    /usr/bin/php -q /usr/local/emhttp/plugins/dynamix/scripts/monitor


(Yes, it was running a parity rebuild at the time due to a previous crash!).


I've tried stopping Docker, stopping individual containers, stopping the webserver, and so on (to try to free up the block). I've tried manually kill -9'ing various processes that were parents or children of processes stalled in D state. Absolutely nothing unlocks the stall.


Edited to add: Zoneminder is set to delete events once 80% disk usage is reached, and the disks have only recently hit that percentage - which would explain why this started recently. So I'm fairly confident it's Zoneminder's delete cycle that triggers this: lots of rm calls, presumably, causing the issue. The other (non-Unraid) thread I found simply reduced the speed and parallelisation of their rm calls and that fixed their problem. That isn't doable here - so I'm left with: "how can I stop a Docker container, or any virtual machine, from killing the entire server by doing something relatively normal"?!
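For reference, the batched/throttled delete approach that other thread used could look roughly like this - a sketch only, with a hypothetical event path and arbitrary batch size:

```shell
# Hypothetical event path - adjust to the real Zoneminder storage location.
EVENTS=/mnt/user/zoneminder/events

# Delete in batches of 500 at idle IO priority, pausing between batches,
# instead of one enormous rm run that hands xfsaild a huge backlog at once.
find "$EVENTS" -type f -name '*.jpg' -print0 \
  | xargs -0 -n 500 sh -c 'ionice -c 3 rm -f "$@"; sleep 2' _
```

This only spreads the IO out; it doesn't address whatever is deadlocking underneath, but it may keep the array responsive during delete cycles.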


Any help would be hugely appreciated! I absolutely LOVE Unraid and am about to consolidate my other server into it by way of a VM - just need to fix this niggling crash!




Link to comment

The D state simply means your processes are in contention for disk IO.

Processes shown in brackets, like [xfsaild], are kernel threads - they show up in the process table, but you can't really do anything about them. They're a good indicator of the disk contention issue, as you have kernel processes blocked and waiting for disk IO.


I guess ZM is generating loads of disk IO by deleting files, and probably by crawling the drives for stuff to delete.

The type of file system matters too - ReiserFS, for example, can slow to a crawl once it's relatively full (deletes become very slow).


Link to comment

The drives are all XFS (with btrfs for the cache drive), and the server never regains responsiveness - today it was left (whilst we were out) for about 5 hours in that state and never recovered.


I appreciate that xfsaild is a kernel process; the drives aren't even close to full (about 75% on all data drives); and although ZM is generating a lot of IO (deleting thousands of files), it's still within non-extreme limits. The same level of IO also ran for a long time on a non-Docker/non-virtualised machine (with weaker drives) without problems. It all seems to point towards a high number of deleted files causing some sort of problem within XFS that never resolves. The load also doesn't 'back up' after this point - simply because it cannot - so it's not a case of struggling to keep up with newer events piling on top of older ones.


I've been logging iotop but didn't have it running today (with the long 5+ hour outage) - I've restarted logging on this in case it shows any IO activity after the point of apparent failure!


Thanks for any suggestions you can think of!

Link to comment
5 hours ago, MarkUK said:

I've tried stopping Docker, stopping individual containers, stopping the webserver, and so on (to try to free up the block). I've tried manually kill -9'ing various processes that were parents or children of processes stalled in D state. Absolutely nothing unlocks the stall.


This is often caused by a bug somewhere or a damaged file system or similar. So one thread waits in uninterruptible sleep for a condition to end. But that condition never does end, meaning the thread waits indefinitely. And any other thread that needs the same resources waits for the previous thread and so also ends up waiting forever.


Do you have any remote mounts (NFS, CIFS) that might fail because of network issues or the remote machine rebooting? That, or the use of USB disks, are the only causes I've seen myself for threads ending up permanently in the D state where only a reboot has been able to solve the issue. In those situations, any "ls", "df", etc. accessing the stuck share has resulted in one more process becoming permanently stuck.


In high-load situations, a number of threads can show up in the D state but will then normally unroll one by one as the load goes down and each thread in turn gets the required resource lock, finishes its task, and exits its wait. A couple of days ago, someone on this forum had a huge number of threads hanging in the D state while waiting to compute hashes for files - but once the parent process was ended, so no more hashing requests were started, the backlog of hanging threads cleared up in maybe 10 minutes.

Link to comment

Thanks, pwm. I should clarify that I didn't mean to imply that D-state conditions are themselves bad, or that they can't unroll once the wait condition is satisfied. Perhaps I'm looking in the wrong place in believing an XFS issue is at the root of it all. I also don't necessarily think that xfsaild is itself causing the problem - but I believe it's at the centre of it, e.g. struggling to clear an IO backlog, buffers, etc.


Today, however, one of my disks has presented an error - I have no idea why (all drives had extended SMART tests run just the day before and all came back fine), but there it is. Is this the cause? I honestly don't know - I hope so! I don't think it necessarily is, but I'll update this in the next few days once I've dealt with the problematic disk! Thanks again

Link to comment

Right... before I start pulling drives / etc (whilst I'm taking a fresh backup of the important stuff, too!)...


I have two controllers in this system - one built in (with 4 SATA ports) and one PCI-e controller, an LSI SAS 9201-8i (SAS2008). The latter has the 'failed' drive on it, plus another drive that I suspect may be having errors (syslog has today started reporting "attempting task abort", "device reset", and a few other messages relating to that other drive on this controller). It seems fair to presume there may be an issue with the controller...


Any advice as to where to look / things to try? I don't have any other SAS cables, sadly, nor other SAS controllers, so replacing either will be the final task if that's the only option! This board also doesn't have the capacity to run additional hard drives (currently 6 in use, the board only supports 4) so I can't even run the array without some kind of controller.

Link to comment

Last update for now:

I'm not 100% sure but I feel that the LSI SAS2008 card may have, between a few moves, become slightly loose in the socket. I've re-seated it, have asked Unraid to rebuild that drive, and will post an update in a few days if no crashes happen before then! Thanks, all!

Link to comment

I've now moved the entire Unraid setup back to my previous mobo & CPU (one with 8 SATA ports on board) - the processor is about half the overall speed, but no additional controller is needed. Despite some issues it's back up and working - the faster motherboard (which used the LSI card) is no longer in use. I may well build a temporary LSI-only setup with a few spare/smaller drives to see if I can replicate and/or fix the issue - but the card cost about £70 and I don't know how much energy I want to put into solving this. So, the problem is 'fixed'/side-stepped rather than fully solved, but that will have to do for now! I'll revisit this if the problem starts again on the test setup. Thanks for all the help


Link to comment

New update:


The problem is BACK.


Problem is, I've changed motherboard (and thus CPU and RAM) and I'm no longer using the PCI-e LSI controller, which I had started to suspect as the problem.


Attached are my new diagnostics. Drive 3 has a single SMART error that, I believe, was the result of a bad cable (I changed cabling when I moved machines again) - it appeared immediately after changing systems and hasn't recurred since fitting a new cable.


I'm at a complete loss here - I still think this is likely caused by Zoneminder deleting large volumes of files, but now the only commonalities between the two setups are the hard drives and the software installation (and, I suppose, the USB stick... ).


Any thoughts?!


Link to comment

How old is your power supply?


I recently had a similar problem of unexplainable freezing/crashing, usually requiring hard-boots.


Long story short, after trying many possible fixes without success, replacing the PSU solved the problem.


Some good threads when looking to purchase a PSU:






Best of luck




Link to comment

Hey III_D, thanks for your reply. I have been keeping an eye on eBay for some well-priced PSUs, fearing the same! As of yesterday I've replaced the Docker version of Zoneminder with a VM and installed it directly on there - perhaps some odd behaviour was happening in Docker. If the problem comes back I may have to follow your advice and replace the PSU! As it happens it's a few years old - it was reasonably good when new, but it could definitely be the culprit! Cheers!

Link to comment

Sadly, this is back AGAIN.


So... before I do something like replace further hardware (namely, PSU, possibly USB stick - I don't think it's this as it's a new stick, but nonetheless)...


What can I do to see what's happening here? I've checked /proc/*/stack, which didn't exist (I'm not really familiar with process debugging, so any thoughts would help). iostat shows very little activity - no disk activity, only tiny (<2%) iowait, mostly idle (>90%).
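One standard way to see where D-state tasks are stuck, assuming the kernel has SysRq support enabled (it's not something from this thread, just a general technique): the SysRq 'w' trigger dumps the kernel stacks of all blocked tasks into the kernel log. A sketch:

```shell
# Needs root. SysRq 'w' dumps kernel stack traces of all blocked
# (uninterruptible, D-state) tasks into the kernel log - often enough
# to see exactly which XFS/md function xfsaild is sleeping in.
echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
echo w > /proc/sysrq-trigger      # dump blocked tasks
dmesg | tail -n 80                # read the stack traces back

# Cheaper per-process view: the wchan column names the kernel function
# each D-state task is currently waiting in.
ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'
```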


 cat /proc/meminfo (mid-crash)

MemTotal:       16376368 kB
MemFree:          212004 kB
MemAvailable:   10024248 kB
Buffers:           10472 kB
Cached:         10778832 kB
SwapCached:            0 kB
Active:          6876176 kB
Inactive:        8830668 kB
Active(anon):    5383916 kB
Inactive(anon):   135492 kB
Active(file):    1492260 kB
Inactive(file):  8695176 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                20 kB
Writeback:             0 kB
AnonPages:       4917608 kB
Mapped:           132476 kB
Shmem:            602156 kB
Slab:             243832 kB
SReclaimable:     125428 kB
SUnreclaim:       118404 kB
KernelStack:       10144 kB
PageTables:        26628 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8188184 kB
Committed_AS:    6774596 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePages:   4577280 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
DirectMap4k:      343504 kB
DirectMap2M:    16312320 kB

If I'm SSH'ed in at the time I'll usually get a few minutes where I can run a few commands (if I happen to get the outage text in time - I usually do, but it can be in the middle of the night). I will have to wait between 6 and 48 hours for another failure, based on typical failure rates.


I'm left with these options: move Zoneminder to another server (I very much don't want to do this - this single server can EASILY cope with the load; it sits at ~10% CPU usage most of the time), try 'randomly' replacing hardware, or fix this damn issue. Any ideas at all that would steer this towards a fix are appreciated - thank you!

Link to comment
Aug 11 18:20:28 unraid emhttpd: shcmd (57): mount -t btrfs -o noatime,nodiratime /dev/sdb1 /mnt/cache
Aug 11 18:20:36 unraid rc.swapfile[6009]: Starting swap file during array mount ...
SwapTotal:             0 kB
SwapFree:              0 kB

FYI, as per the comment within the Apps tab where you (presumably) installed the swap file plugin, 


Note that btrfs formatted drives do NOT support having a swap file placed onto them


Beyond that, without particularly following the thread, you've got 212MB free memory on your system (mid-crash), which might as well be 0MB

MemFree:          212004 kB

If this is Zoneminder causing the issues for you, why not limit its memory footprint?  https://lime-technology.com/forums/topic/57181-real-docker-faq/?page=2#comment-566088

Link to comment

Hi Squid, thanks for your reply. It was my understanding that the total free memory on Linux was (roughly) Available + Buffers + Cache (minus some of that which can't be freed immediately) in which case there's 10GB free?


Good spot on the swapfile plugin; I'll disable that right away! The problem was definitely happening beforehand, but I doubt it can be doing it any good if it's not supported (nor, I now see, updated in 3 years!). Cheers!

Link to comment
33 minutes ago, MarkUK said:

Hi Squid, thanks for your reply. It was my understanding that the total free memory on Linux was (roughly) Available + Buffers + Cache (minus some of that which can't be freed immediately) in which case there's 10GB free?


Good spot on the swapfile plugin; I'll disable that right away! The problem was definitely happening beforehand, but I doubt it can be doing it any good if it's not supported (nor, I now see, updated in 3 years!). Cheers!


Yes, it's the MemAvailable figure and not the MemFree figure that counts:

5 hours ago, MarkUK said:

MemFree:          212004 kB

MemAvailable:   10024248 kB
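A quick way to pull that figure directly from /proc/meminfo (a minimal sketch):

```shell
# MemAvailable already accounts for reclaimable page cache, so it is the
# figure to watch rather than MemFree; convert it to GiB for readability.
awk '/^MemAvailable:/ { printf "available: %.1f GiB\n", $2 / 1048576 }' /proc/meminfo
```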


Link to comment

Right, yet another drastic change to try and resolve this.


Firstly, I wrote a test case that repeatedly generated tens of thousands of images and deleted them continuously, to try to provoke 'the problem' above. No joy - even after the load peaked above 20, it still didn't trigger the same problem.
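That churn test was along these lines - a sketch only, with hypothetical paths and counts rather than the exact script:

```shell
# Churn test: repeatedly create and delete batches of small files on a
# scratch path. Point TARGET at an array disk (e.g. /mnt/disk4/stress)
# to exercise the parity-protected XFS array rather than /tmp.
TARGET=${1:-/tmp/stress}
ROUNDS=10
COUNT=10000

mkdir -p "$TARGET"
for r in $(seq 1 "$ROUNDS"); do
  for i in $(seq 1 "$COUNT"); do
    head -c 100000 /dev/urandom > "$TARGET/img_$i.jpg"   # ~100 KB, like a ZM frame
  done
  rm -f "$TARGET"/img_*.jpg
done
```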


So, for now, we're going to use Motion for our CCTV. I don't like it as much (partly because we're so familiar with ZM) but it stores events as video files rather than images, so far fewer deletes required (ZM supports this only in the test branch; when it comes to mainstream I'll try this whole thing again!).


Thanks all for your input - I'll update this, again, if the problem comes back even without Zoneminder! Cheers

Link to comment

I can't believe I'm back on this - again...


I changed to Motion/Motioneye (which, by the way, I really didn't get on with - another story, though!). So NO Zoneminder in the mix whatsoever.


And it happened again. But - I was mass-deleting files at the time (ironically, the old Zoneminder files - a couple of million ~100 KB files). I am - nearly - certain that this error is caused by the mass deletion of millions of files on XFS (or on a parity-protected XFS array, or some other factor of the setup). To 'check' this I've disabled all Dockers and VMs and am just mass-deleting files. I'll update this if that crashes - and, if so, I'll then run my regular setup (including CCTV, but NOT ZM) with no mass deleting; in theory, the server should then stay up indefinitely.........

Link to comment

Hey pwm - I've got various logging happening (namely, I run a continuous "ps" to only show D-state blocking; plus a separate server monitors the xfs_stat output; the CPU load and CPU usage; and memory (free/total/etc)).


Before last night's crash the last recorded values were:

Memory free: 85%

Load: 13 (1 minute), 12 (5 minute), 9 (15 minute)

CPU Usage: 2%


I didn't get the last ps output (it's captured via PuTTY, which wasn't running at the time). I've also got the xfs_stat output if it's useful, but its metrics seem to be ever-growing counters rather than a decent snapshot of state. So, short answer: no, nothing is consuming any memory (85% free is roughly the normal amount with no deletes running whatsoever!).

Link to comment

Final update for a while.


I've reverted back to Zoneminder. I couldn't get on with Motion/Motioneye at all - it was not going to work for us long-term.


I've deleted all the old Zoneminder image files - there were about 50 million files and the server crashed once whilst deleting them.


I've reduced the storage of ZM images to just events, plus added extra filters to periodically delete older events to remove the 'bulk deleting' that happens when it starts running out of space.


Lastly, I'm going to create one (or two) test Unraid boxes running a parity-protected XFS array and try creating tens of millions of tiny files and subsequently deleting them; I'm fairly sure this is where the problem is happening and (for my own sanity, if nothing else) want to prove/disprove this and possibly even create something reproducible from it. Until then, this machine shouldn't fill up for 10 - 20 weeks so it could be a while before I naturally experience this issue again - thanks for your input!

Link to comment


This topic is now archived and is closed to further replies.
