Load Average Climbs - Can't Stop Array



Hello,

 

I'm having an issue where the load average/CPU utilization climbs after a day or so and renders the server unable to shut down cleanly.  At the time of writing (2:41 CDT-Apr.7), the load average is at 110.  This morning it was at 4.  I've had to power it off twice over the past few days since it can't shut down on its own.

 

The highest I saw it was over 200.  I could still SSH in, but nothing would work.  VMs and Dockers wouldn't respond.  Luckily, I managed to get the diagnostics downloaded.

 

Thanks in advance!

unbeast-diagnostics-20200407-1354.zip


I realize there's not a lot of info there, but I'm scratching my head.   I should have mentioned that this happened on 6.8.3 first, then I downgraded back to 6.8.2.  I tried to install the NVIDIA kernel in the past few days, too.

 

I'm just hoping the diagnostics help shed some light on the issue.

2 hours ago, 83749020 said:

I've had to power it off twice over the past few days since it can't shut down on its own.

I'm no expert, but the only things that jump out at me are below. In short, I suspect your cache, but this is outside my ability; hopefully someone else can decipher it better.

Quote

MemTotal:       65909084 kB
MemFree:         3774828 kB
MemAvailable:   54814708 kB
Buffers:             636 kB
Cached:         50836172 kB

I'm not sure, but is it normal to have that much memory cached?
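
For what it's worth, a large `Cached` figure on its own is usually not a problem on Linux: the kernel uses otherwise idle RAM as page cache and hands it back under memory pressure, and `MemAvailable` is its estimate of what applications could still claim. A minimal Python sketch, using only the numbers quoted above, puts them in proportion. `parse_meminfo` and `SAMPLE` are names made up for illustration, not part of any Unraid tool.

```python
# Sketch: put the quoted /proc/meminfo figures in proportion.
# parse_meminfo and SAMPLE are illustrative names, not part of any Unraid tool.

SAMPLE = """\
MemTotal:       65909084 kB
MemFree:         3774828 kB
MemAvailable:   54814708 kB
Buffers:             636 kB
Cached:         50836172 kB
"""


def parse_meminfo(text):
    """Return a dict mapping each meminfo field name to its value in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


info = parse_meminfo(SAMPLE)
cached_share = info["Cached"] / info["MemTotal"]
available_share = info["MemAvailable"] / info["MemTotal"]
print(f"cached: {cached_share:.1%}, available: {available_share:.1%}")
```

By this reading roughly three quarters of the RAM is page cache and most of it still counts as available, so memory alone probably does not explain the load.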

Quote

Apr  6 08:49:54 unBEAST apcupsd[8458]: Communications with UPS lost.

This is probably not related, but this communication error repeats throughout the syslog; likely a loose cable.

 

Below is where I think your troubles lie....

Quote

Apr  6 08:49:38 unBEAST emhttpd: cache TotDevices: 3
Apr  6 08:49:38 unBEAST emhttpd: cache NumDevices: 2
Apr  6 08:49:38 unBEAST emhttpd: cache NumFound: 2
Apr  6 08:49:38 unBEAST emhttpd: cache NumMissing: 1

Apr  6 08:49:38 unBEAST kernel: BTRFS warning (device sdd1): devid 2 uuid be0f0270-cc3b-4ed2-802c-1a6f5dd5ac11 is missing

Quote

Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955629568 csum 0xd67cf70e expected csum 0xdbab11f0 mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955633664 csum 0x10fd1c19 expected csum 0xffff888f mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955629568 csum 0xd67cf70e expected csum 0xdbab11f0 mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955633664 csum 0x10fd1c19 expected csum 0xffff888f mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955629568 csum 0xd67cf70e expected csum 0xdbab11f0 mirror 1

I suspect there is something wrong with your cache pool: the log reports one pool member (devid 2) as missing from the pool that includes sdd1. You also have a lot of shares set to use the cache, and the floor was pretty high, if I understand that right. Also, it looks like you may have your VMs on the array and not on the cache, which is unusual.
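
To see how many distinct locations the repeated checksum warnings actually refer to, they can be grouped by device, inode and offset. This is only a sketch: the regular expression is fitted to the exact lines quoted above (other kernel versions may format them differently), and `summarize` and `CSUM_RE` are hypothetical names.

```python
import re
from collections import Counter

# The five warnings quoted above, copied verbatim.
LOG = """\
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955629568 csum 0xd67cf70e expected csum 0xdbab11f0 mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955633664 csum 0x10fd1c19 expected csum 0xffff888f mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955629568 csum 0xd67cf70e expected csum 0xdbab11f0 mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955633664 csum 0x10fd1c19 expected csum 0xffff888f mirror 1
Apr  6 08:49:45 BTRFS warning (device sdd1): csum failed root -9 ino 257 off 955629568 csum 0xd67cf70e expected csum 0xdbab11f0 mirror 1
"""

# ASSUMPTION: this pattern only matches the message layout seen in this syslog.
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>\w+)\): csum failed root (?P<root>-?\d+) "
    r"ino (?P<ino>\d+) off (?P<off>\d+) csum (?P<got>0x[0-9a-f]+) "
    r"expected csum (?P<want>0x[0-9a-f]+) mirror (?P<mirror>\d+)"
)


def summarize(log_text):
    """Count csum failures per (device, inode, byte offset)."""
    hits = Counter()
    for match in CSUM_RE.finditer(log_text):
        key = (match["dev"], int(match["ino"]), int(match["off"]))
        hits[key] += 1
    return hits


summary = summarize(LOG)
for (dev, ino, off), count in sorted(summary.items()):
    print(f"{dev} ino={ino} off={off}: {count} warning(s)")
```

On the excerpt above this collapses the five lines into two offsets 4096 bytes apart, both in inode 257, so the kernel appears to be retrying the same two blocks rather than finding new damage each time.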

Quote

[V..s] => Array
        (
            [name] => V..s
            [nameOrig] => V..s
            [comment] =>

To reiterate, I suspect that between your cache and possibly the mover (see the warning about the deprecated plugin below) is where you should start troubleshooting. Maybe someone else will see something more definitive, but hopefully this will give you an aha moment.

Quote

Apr  6 08:59:01 unBEAST root: Fix Common Problems: Warning: Share datastore set to cache-only, but files / folders exist on the array
Apr  6 08:59:01 unBEAST root: Fix Common Problems: Warning: Share Backups set to not use the cache, but files / folders exist on the cache drive
Apr  6 08:59:01 unBEAST root: Fix Common Problems: Warning: Deprecated plugin ca.mover.tuning.plg

 

Edited by Dissones4U

Thanks a lot for the insight.

 

I recently had to remove a failing cache drive, so I removed two 500GB drives and added a 1TB drive.  How can I get the cache pool to look normal?  I went from 2x500GB+1TB to 2x1TB cache drives.  I was afraid to switch the numbering around on the cache pool, so I left it as-is.

 

I'll see about replacing the UPS cable, too.  That's been annoying me.  Once I get the cache pool drive numbering sorted, I'll move the files around.

48 minutes ago, 83749020 said:

Ok, so do you mean it will automatically change from “drive 1 missing, drives 2 and 3 present” to “drives 1 and 2 present”?

Yes, the missing drive needs to be removed.

 

49 minutes ago, 83749020 said:

Can I run some kind of repair in maintenance mode?

You can try running a scrub with the pool mounted, but it might not work while the balance is running.
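
As a rough sketch of what that looks like from the command line, the Python below only builds the standard `btrfs` commands for a foreground scrub and the follow-up checks. The `/mnt/cache` mount point is an assumption about this server, and `scrub_commands` and `run_all` are hypothetical helper names.

```python
import subprocess

# ASSUMPTION: the cache pool is mounted at /mnt/cache; adjust if yours differs.
POOL_MOUNT = "/mnt/cache"


def scrub_commands(mount=POOL_MOUNT):
    """Build the btrfs commands for a foreground scrub plus follow-up checks."""
    return [
        # -B keeps the scrub in the foreground until it finishes.
        ["btrfs", "scrub", "start", "-B", mount],
        # Summary of the most recent scrub on this filesystem.
        ["btrfs", "scrub", "status", mount],
        # Per-device error counters (read, write, corruption, ...).
        ["btrfs", "device", "stats", mount],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure; needs root."""
    for cmd in commands:
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_all(scrub_commands()) to execute them.
    for cmd in scrub_commands():
        print(" ".join(cmd))
```

Nothing is executed unless `run_all(scrub_commands())` is called explicitly, which seems the safer default while a balance may still be running.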


 

5 hours ago, johnnie.black said:

if it still doesn't work best bet is to backup, re-format and restore the data.

I don't fully understand the checksum error even after googling it. I think that because it's a mirrored pool that files on one side are different from the files on the other. How would you know what to backup? Could he set the shares to cache "Yes", run the mover, reinstall the cache, and then set the shares to cache "Prefer", or would this lead to the same bad checksums?

 

If he can do that, then here is a pretty good video for the process

 

15 minutes ago, Dissones4U said:

I don't fully understand the checksum error even after googling it. I think that because it's a mirrored pool that files on one side are different from the files on the other.

In this case both mirrors are corrupt, or it would be automatically fixed.

 

16 minutes ago, Dissones4U said:

How would you know what to backup?

That looks more like corrupt metadata, but any corrupt data can be identified by a scrub. Reading or copying a corrupt file will also result in an I/O error.
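
If you want a list of affected files before backing up, one option is to read every file once and note which ones fail with an I/O error. This is a minimal sketch, not a replacement for a scrub: `find_unreadable` is a hypothetical helper, and the `/mnt/cache` path in the usage part is an assumption.

```python
import errno
import os


def find_unreadable(root, chunk_size=1024 * 1024):
    """Read every regular file under root and return paths that raise EIO."""
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read to the end so every data block gets checked.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                # Only I/O errors are recorded; other errors (for example
                # permission problems) are ignored in this sketch.
                if exc.errno == errno.EIO:
                    failures.append(path)
    return failures


if __name__ == "__main__":
    # ASSUMPTION: the pool is mounted at /mnt/cache.
    for bad in find_unreadable("/mnt/cache"):
        print("I/O error:", bad)
```

Files that read cleanly can be copied off as usual; anything printed here would need to come from an older backup.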
