Load Average Rocketed

dellorianes · November 10, 2022

Hi,

For the past few days I have been suffering from a high load average.

Some minutes or hours after booting, the load starts to increase until the system is overloaded and unreachable (no SSH, no GUI...), which forces me to do an unclean shutdown.

Today I managed to take a photo when it started to increase (on another day I saw kworker spike too).

Any idea how to deal with this?

Thank you

nasdiego-syslog-20221110-2223.zip

Edited November 18, 2022 by dellorianes

JorgeB · November 11, 2022

Please post the complete diagnostics when the load starts increasing.

dellorianes · November 11, 2022

OK, I will try, but it is difficult to catch. The load starts increasing at random times, and from that moment I have a window of only a few minutes before I can no longer access the GUI or the command line (and therefore cannot download the diagnostics file).
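One way to catch such a short window is to let a small watcher react the moment the load climbs. The following is only a minimal sketch, assuming Python is available on the server: the function names, the 1.5x threshold and the 30-second interval are hypothetical choices, and what `on_overload` actually does (for example, saving diagnostics to persistent storage) is left to the reader.

```python
import time
from pathlib import Path


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def is_overloaded(load_1min, cores, factor=1.5):
    """True when the 1-minute load exceeds `factor` times the core count.

    The factor of 1.5 is an arbitrary, hypothetical threshold.
    """
    return load_1min > cores * factor


def watch(cores, on_overload, interval=30, path="/proc/loadavg"):
    """Poll the load average and call `on_overload` once when it spikes."""
    while True:
        load_1min, _, _ = parse_loadavg(Path(path).read_text())
        if is_overloaded(load_1min, cores):
            on_overload(load_1min)
            return
        time.sleep(interval)
```

On an 8-core machine this could be started with something like `watch(8, lambda load: print("load spiked:", load))`, replacing the `print` with whatever capture step is wanted.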

Since yesterday I have been running in safe mode (with a parity check in progress) and everything is running OK. Could it be a Docker issue?

Edited November 11, 2022 by dellorianes

JorgeB · November 11, 2022

In that case the best way to find the culprit might be to shut down all services and containers, then enable them one by one until the problem reappears.
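The one-by-one approach can be sketched as a simple loop. This is an illustration only: `find_culprit`, `start` and `is_overloaded` are hypothetical names, and in practice `start` might wrap `docker start <name>` while `is_overloaded` would wait a while and then check the load average.

```python
def find_culprit(names, start, is_overloaded):
    """Start services one at a time and return the first one after which
    the overload check fails, or None if the system stays healthy."""
    for name in names:
        start(name)
        if is_overloaded():
            return name
    return None


# Simulated run: the overload appears as soon as "b" has been started.
started = []
culprit = find_culprit(["a", "b", "c"], started.append, lambda: "b" in started)
print(culprit)  # b
```

This only isolates a culprit when the problem reproduces reliably after the same service starts; it says nothing if the failures are random.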

dellorianes · November 14, 2022

Since it was impossible to tell which container was the culprit (they failed randomly on each reboot), I removed the Docker image and built it again.

The system has been working fine since then.

dellorianes · November 18, 2022

I am returning to this topic because this morning the server went back to high CPU load values.

Core 4 is continuously at 100% load. I have an 8-core i7-9700 at 3000 MHz and 4x16 GB of DDR4 3200 MHz RAM.

I tried to get the diagnostics, but the file doesn't download, and neither do the logs.

I have attached the error and warning messages from the logs (copied and pasted from the GUI).

Output of the command ~# btrfs dev stats /mnt/cache:
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 48
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 32
[/dev/nvme0n1p1].generation_errs 0
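In the output above, both pool devices report non-zero `corruption_errs` while every other counter is 0. As a minimal sketch (the function name and regular expression are my own, not part of btrfs), output in this format could be scanned for non-zero counters like this:

```python
import re

# Matches lines such as "[/dev/nvme1n1p1].corruption_errs 48"
STAT_LINE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def nonzero_stats(output):
    """Return {(device, counter): value} for every non-zero counter."""
    found = {}
    for line in output.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match["value"]) > 0:
            found[(match["device"], match["counter"])] = int(match["value"])
    return found


# Sample lines copied from the output posted above.
SAMPLE = """\
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].corruption_errs 48
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].corruption_errs 32
"""

print(nonzero_stats(SAMPLE))
```

For the sample, only the two `corruption_errs` entries (48 and 32) are returned.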

warning+errors_18-11-22.txt

Edited November 18, 2022 by dellorianes

dellorianes · November 20, 2022

After running memtest86, thousands of errors appeared.

Repeating the test with 2 of the 4 RAM modules, there were no errors during a complete test.

Repeating it with the other 2 modules gave the same result.

Repeating the test once more with all 4 modules, errors appeared in less than an hour.

My motherboard's memory specification is:

4 x DIMM memory slots, max. 64GB, DDR4 4266(O.C.)/4133(O.C.)/4000(O.C.)/3866(O.C.)/3733(O.C.)/3600(O.C.)/3466(O.C.)/3400(O.C.)/3333(O.C.)/3300(O.C.)/3200(O.C.)/3000(O.C.)/2800(O.C.)/2666/2400/2133 MHz Non-ECC, Un-buffered

I was running the RAM at 3200 MHz. After changing it to 2666 MHz, I could complete 1 test without errors.

This is the diagnostics file after the change.

What should I do to stabilize the system??

Thank you again

nasdiego-diagnostics-20221120-1017.zip

Edited November 20, 2022 by dellorianes

JorgeB · November 20, 2022

1 hour ago, dellorianes said:

What should I do to stabilize the system??

Not overclocking the RAM is a good place to start.

Squid · November 20, 2022

It should be noted that 2666 is also an overclock.

The actual speed of the memory is 2133 (https://www.corsair.com/ca/en/Categories/Products/Memory/VENGEANCE-LPX/p/CMK32GX4M2E3200C16#tab-tech-specs) and that is the speed it should be run at. Anything over that speed will always introduce instability into any system.

dellorianes · November 20, 2022

OK, thank you for the advice.

Once it is changed to 2133, how do I stabilize the system?

Are the diagnostics OK?

Edited November 20, 2022 by dellorianes

dellorianes · November 21, 2022

Once the parity check that was running had finished, I set the RAM to 2133 MHz.

After that, the logs show the following warnings:

Nov 21 07:17:15 NASDiego kernel: Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
Nov 21 07:17:15 NASDiego kernel: ACPI: Early table checksum verification disabled
Nov 21 07:17:15 NASDiego kernel: floppy0: no floppy controllers found
Nov 21 07:17:15 NASDiego kernel: i915 0000:00:02.0: [drm] failed to retrieve link info, disabling eDP
Nov 21 07:17:21 NASDiego  mcelog: failed to prefill DIMM database from DMI data
Nov 21 07:17:32 NASDiego kernel: ACPI Warning: SystemIO range 0x0000000000000295-0x0000000000000296 conflicts with OpRegion 0x0000000000000290-0x0000000000000299 (\AMW0.SHWM) (20220331/utaddress-204)
Nov 21 07:17:54 NASDiego  rpc.statd[8814]: Failed to read /var/lib/nfs/state: Success
Nov 21 07:28:01 NASDiego kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 182

Diagnostics included.

Do I need to do anything about those warnings?

Thank you

nasdiego-diagnostics-20221121-1018.zip

JorgeB · November 21, 2022

You can ignore those, they are harmless.
