System Hangups

mathomas3 · December 19, 2023

Hello all... I have been having an issue with my system for a while now. The system hangs randomly, and I suspect it has something to do with my btrfs disk pool. Recently I deployed a VM to this share, thinking the issue might be related to my dockers and planning to move them into a VM, but in doing so I increased the IO on the disks, which made the issue happen more often.

I have tried shutting down all of the dockers (limiting IO to the btrfs disks), and the issue persisted.

Is this normal? I'm running all of this on 2x 20-core CPUs. I know they don't have the fastest clock speed, but they should be more than enough to handle what little load I have on this system...

Any suggestions are very welcome...

Dec 18 20:20:37 Tower kernel: traps: lsof[26886] general protection fault ip:147908d89c6e sp:242fdb11bfcc0dd5 error:0 in libc-2.37.so[147908d71000+169000]

image.png.6a21e7943cef95a21ef5993de0226d72.png

tower-diagnostics-20231219-0822.zip

Edited December 19, 2023 by mathomas3

mathomas3 · December 19, 2023

I shut down the server and ran a memory test, and everything came back good. Oddly, on boot I was missing a disk. I replaced it and it's rebuilding now.

mathomas3 · December 19, 2023

Any input would be very welcome... my system is getting worse by the hour. I just updated the system and rebooted, and now it is nearly unresponsive even at the physical console. Logging in took 5 minutes, and every command I give it takes another five. Eeks!

JorgeB · December 20, 2023

Post new diags.

mathomas3 · January 9

Hello,

Sorry for the long delay. After reading a number of comments about dockers being at fault, I decided to play around with them to try to narrow down which one might be the cause. I never did find one. As time went on, I began to suspect that my BTRFS share was at fault. Seemingly any time there was heavy IO to/from that share, the CPU would spike. I also had errors when trying to update different dockers: the update would fail to remove the docker, and then couldn't install because a docker with the same name already existed. After this the docker would be orphaned and require a reinstall.

So last night I moved as much as I could to the array with mover, after disabling docker and VMs. I have 500 MB of files that refuse to move. Any input on how to move them over completely would be welcome.

My thought is to move everything off and completely rebuild this disk pool. I recall a comment years ago that, due to how I upgraded this pool, I would have problems rebuilding a disk if I had a failure...

After moving things around, I have been able to update a number of dockers that previously were failing, and I haven't seen CPU spikes (too early to judge, perhaps).

Anyways, I look forward to your wise input

tower-diagnostics-20240109-0723.zip

JorgeB · January 9

20 minutes ago, mathomas3 said:

I have 500 MB of files that refuse to move. Any input on how to move them over completely would be welcome.

Enable the mover logging, run the mover, post new diags.

mathomas3 · January 9

7 minutes ago, JorgeB said:

Enable the mover logging, run the mover, post new diags.

Will do that tonight

mathomas3 · January 9

I made a little progress... After updating a few dockers, I noticed that new files were being written to the 'Services' disk pool, so I changed the share setup from

Services>Array

to

Cache>Array

This cleared out around 500 MB, and what is left is 20 MB of data. I might be willing to let this go, though there is a DB file that I would like to save.

Anyway, logs are attached, including the Mover logs.

tower-diagnostics-20240109-0829.zip

JorgeB · January 9

Jan  9 08:28:56 Tower move: move_object: //..g/... File exists
Jan  9 08:28:56 Tower move: file: //..b/...
Jan  9 08:28:56 Tower move: move_object: //..b/... File exists

Mover won't move files that already exist at the destination; see which copies are older and delete them.
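A rough way to list the files that exist on both the pool and the array, sketched in Python. This assumes the pool is mounted at /mnt/services and the array disks at /mnt/disk1, /mnt/disk2, and so on; adjust the paths if your system differs. `find_duplicates` is a hypothetical helper name, not an Unraid tool. It only reports which copy has the newer modification time and deletes nothing.

```python
import glob
import os
from pathlib import Path


def find_duplicates(pool_root, array_roots):
    """Return (relative_path, pool_copy, array_copy, newer) for files on both sides."""
    pool_root = Path(pool_root)
    results = []
    for dirpath, _dirs, files in os.walk(pool_root):
        for name in files:
            pool_file = Path(dirpath) / name
            rel = pool_file.relative_to(pool_root)
            for root in array_roots:
                array_file = Path(root) / rel
                if not array_file.is_file():
                    continue
                # Compare modification times; lstat avoids following symlinks.
                pool_mtime = pool_file.lstat().st_mtime
                array_mtime = array_file.lstat().st_mtime
                newer = "pool" if pool_mtime >= array_mtime else "array"
                results.append((str(rel), str(pool_file), str(array_file), newer))
    return results


if __name__ == "__main__":
    # Assumed mount points, not taken from the diagnostics.
    disks = sorted(glob.glob("/mnt/disk[0-9]*"))
    for rel, pool_file, array_file, newer in find_duplicates("/mnt/services", disks):
        print(f"{rel}: newer copy is on the {newer} ({pool_file} vs {array_file})")
```

Review the output by hand before deleting anything, and keep Docker and the VM service stopped while you do, so no file changes underneath you.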

trurl · January 9

None of this should cause hangups, but it is a less than ideal configuration.

From your docker.cfg

DOCKER_IMAGE_FILE="/mnt/user/system/docker/docker.img"
DOCKER_APP_CONFIG_PATH="/mnt/user/Plex/"

so your container executable code (docker.img) is in the system share, which is standard, and the default location for your container working storage (usually referred to as "appdata") is the Plex share, which is not standard.

Plex                              shareUseCache="yes"     # Share exists on disk1, disk2, disk3, disk4, disk17, services

Possibly you have other appdata on this anonymized share:

a-----a                           shareUseCache="yes"     # Share exists on disk4, disk8, disk17, services

From your domain.cfg

IMAGE_FILE="/mnt/user/domains/libvirt/libvirt.img"
DOMAINDIR="/mnt/user/domains/"

So your libvirt.img (VM definitions) is in the domains share rather than the standard system share, and your vdisks will be stored in the domains share, which is standard.

Your domains and system shares

domains                           shareUseCache="yes"     # Share exists on disk8
system                            shareUseCache="yes"     # Share exists on disk17

Ideally, the appdata (in your case, Plex), domains, and system shares would have all of their files on a fast pool such as cache (or services), with none of their files on the array, and they would be configured to stay on that pool.

If these are on the array, Docker/VM performance will be impacted by the slower array, and array disks can't spin down since these files are always open.

Mover ignores shares with no Secondary storage, which is not the case for any of these, so that's fine. But Mover also ignores files not on the designated pool for the share, so the settings you changed will not get these off the services pool, if that's what you had in mind.

Do you want Plex and a-----a to be on cache or services?

Nothing can move open files so leave Docker and VM Manager disabled until you get things moved where they need to be.

And, Mover won't replace files, so there may be some manual cleanup required if any files exist on both pool and array.

mathomas3 · January 9

14 minutes ago, trurl said:

Do you want Plex and a-----a to be on cache or services?

Yes I would like to have everything on the Services disk pool

The problem is that I have been receiving a number of errors when trying to update dockers: can't stop docker, docker with the same name already exists, which oftentimes orphaned the docker. I have some dockers that I haven't been able to update due to this. After moving everything off of the Services DP, I have been able to update everything.

I would also randomly get system hangups, which would cause everything to hang for around 5 min and would happen 5-15 times a day.

I highly suspect that these errors were due to the services DP. I think it was you who mentioned that how I upgraded my DP would cause problems later.

23 minutes ago, JorgeB said:
Jan  9 08:28:56 Tower move: move_object: //..g/... File exists
Jan  9 08:28:56 Tower move: file: //..b/...
Jan  9 08:28:56 Tower move: move_object: //..b/... File exists
Mover won't move files that already exist at the destination; see which copies are older and delete them.

I'll clean up those files now. Thanks

trurl · January 9

You can see how much of each disk or pool is used by each user share by clicking Compute... for the share on the User Shares page. Or use the Compute All button, but you may have to wait a while and refresh the browser to get them all done.

Install Dynamix File Manager plugin. It will let you work with folders and files directly on the server. Might be simpler than trying to get things configured so Mover can do that work for you, and Mover, as we have seen, won't do it all anyway.
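For anyone who prefers the console, the per-disk breakdown that Compute... shows can be approximated with a short Python sketch. It assumes each array disk and pool is mounted under /mnt (for example /mnt/disk17 and /mnt/services) with the share as a top-level folder; `share_usage` is a hypothetical helper name. It sums apparent file sizes, so the numbers may not match the GUI exactly.

```python
import glob
import os


def share_usage(share, roots):
    """Return {root: total_bytes} for every root that holds part of the share."""
    usage = {}
    for root in roots:
        share_dir = os.path.join(root, share)
        if not os.path.isdir(share_dir):
            continue  # this disk or pool has no files for the share
        total = 0
        for dirpath, _dirs, files in os.walk(share_dir):
            for name in files:
                # lstat so that symlinks are counted as links, not their targets
                total += os.lstat(os.path.join(dirpath, name)).st_size
        usage[root] = total
    return usage


if __name__ == "__main__":
    # Assumed mount points; the share name "system" is only an example.
    roots = sorted(glob.glob("/mnt/disk[0-9]*")) + ["/mnt/cache", "/mnt/services"]
    for root, size in share_usage("system", roots).items():
        print(f"{root}: {size / 1e6:.1f} MB")
```

Any root that appears in the output still holds files for that share, which is a quick way to spot leftovers after a Mover run.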

mathomas3 · January 9

Just for grins I recreated the Services DP and invoked mover to move things back

image.png.a9075a738578a561cfd7a1a275fd245c.png

Edited January 9 by mathomas3

mathomas3 · January 9

Could this be a bug of sorts?

mathomas3 · January 10

11 hours ago, trurl said:

And, Mover won't replace files, so there may be some manual cleanup required if any files exist on both pool and array.

I feel as if you have ignored my concerns on this thread... I have posted about this CPU issue a number of times in a number of different topics... Here I am trying to figure out why the services DP is causing CPU spikes... I even removed this DP and recreated it... but when mover was writing data to the DP, the CPU again spiked for over an hour...

Until there is a hint/suggestion as to why this is happening, I will be avoiding a DP in the future.

JorgeB · January 10

It's normal for the CPU usage to spike during the mover or any other i/o, since the dashboard graph includes i/o wait.
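To see how much of a spike is i/o wait rather than real CPU work, the counters in /proc/stat can be sampled twice and compared. Below is a minimal Python sketch; `parse_cpu_line` and `breakdown` are hypothetical helper names, and this is not necessarily how the dashboard computes its graph.

```python
import os
import time

# First eight counters on the aggregate "cpu" line of /proc/stat.
# Guest time is already included in "user", so the guest fields are skipped.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # pad if the kernel reports fewer
    return dict(zip(FIELDS, values))


def breakdown(before, after):
    """Return busy, iowait and idle as percentages of the elapsed ticks."""
    deltas = {key: after[key] - before[key] for key in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {"busy": 0.0, "iowait": 0.0, "idle": 0.0}
    busy = total - deltas["idle"] - deltas["iowait"]
    return {
        "busy": 100.0 * busy / total,
        "iowait": 100.0 * deltas["iowait"] / total,
        "idle": 100.0 * deltas["idle"] / total,
    }


def read_cpu_line(path="/proc/stat"):
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = parse_cpu_line(read_cpu_line())
    time.sleep(1)
    second = parse_cpu_line(read_cpu_line())
    print(breakdown(first, second))
```

If most of the load shows up under iowait while busy stays low, the CPUs are waiting on the disks rather than doing work.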

mathomas3 · January 10

3 hours ago, JorgeB said:

It's normal for the CPU usage to spike during the mover or any other i/o, since the dashboard graph includes i/o wait.

So it's normal for the system to hang for 2h while mover is running, and whenever a docker wants to write data to a DP? If that's the case, I would rather have everything running off of the array, because these issues don't exist there.

Though that wouldn't explain why I am unable to update around 50% of the dockers while they are hosted on the DP. Since moving them off of it, I have been able to update them. There is something wrong with a btrfs DP. I don't know what's going on, but I don't think it is normal or acceptable for a CPU to spike and hang a system while writing to a DP.

JorgeB · January 10

Depending on the hardware and configuration, for some users heavy i/o can make the server, or some services, look unresponsive. Since this only happens to some users, it's not always easy to see what's causing it. Some users see better results after changing pool devices, for example. Using exclusive shares whenever possible can also help. Failing that, you can try scheduling the mover to run during the night, when the server is not being used.

trurl · January 10

4 minutes ago, JorgeB said:

schedule the mover to run during the night, when the server is not being used.

This is the default schedule anyway. Cache should be large enough to hold as much data as you would write until the server has some idle time for Mover.

mathomas3 · January 10

2 minutes ago, trurl said:

This is the default schedule anyway. Cache should be large enough to hold as much data as you would write until the server has some idle time for Mover.

Maybe I am being a stick in the mud, but if the only thing that is causing the server to hang is a btrfs DP, why would anyone use it? Yes, reading data from it is quick, but writing data to it is slow, not to mention the hangups/docker issues. When mover was writing data to the DP, it would only write about 25-50 gigs of data... In my 30 years in IT, I have never seen anything like this

It's unacceptable

JorgeB · January 10

20 minutes ago, mathomas3 said:

if the only thing that is causing the server to hang is a btrfs DP, why would anyone use it?

Because most users are not affected. As mentioned, some see a difference using different devices; you can also try zfs to see if it performs better for you.

Vr2Io · January 10

I don't agree that the problem is caused by the disk pool. I use several btrfs/zfs pools and they work really well even under heavy load.

But I don't like cache (mover). I run docker in a RAM disk (small total docker size), not because the storage can't handle it, but because of spinning-disk (noise and heat) or SSD (endurance) concerns.
