dellorianes, posted November 10, 2022 (edited November 18, 2022)
Hi,
For the past few days I have been suffering from a high load average. Some minutes or hours after booting, the load starts to increase until the system is overloaded and unreachable (neither SSH nor the GUI responds), which forces me to do an unsafe shutdown.
Today I managed to take a photo when it started to increase (on another day I saw kworker spike too).
Any idea how to deal with this? Thank you.
Attachment: nasdiego-syslog-20221110-2223.zip
JorgeB, posted November 11, 2022
Please post the complete diagnostics captured when the load starts increasing.
dellorianes, posted November 11, 2022 (edited November 11, 2022)
OK, I will try, but it is difficult to catch. The increase starts at random, and I only have a window of a few minutes between the load starting to rise and losing access to both the GUI and the command line (so I cannot download the diagnostics file).
Since yesterday I have been running in safe mode (running parity) and everything is working fine. Could it be a Docker issue?
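One possible way to catch that short window is to let a small script watch the load average and save a snapshot as soon as it crosses a threshold. The following is only a sketch under stated assumptions: the threshold factor, the polling interval, the function names and the snapshot path are all hypothetical and not part of Unraid, and the snapshot is a plain process listing rather than the full diagnostics archive.

```python
import os
import subprocess
import time
from typing import Callable, Optional, Tuple


def load_is_high(load_1min: float, cpu_count: int, factor: float = 2.0) -> bool:
    """True when the 1-minute load average exceeds factor * number of cores."""
    return load_1min > factor * cpu_count


def save_snapshot(path: str) -> None:
    """Append the load averages and the busiest processes to a text file."""
    ps = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=False,
    )
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{time.ctime()} loadavg={os.getloadavg()}\n")
        # Keep the header plus the top entries only.
        fh.write("\n".join(ps.stdout.splitlines()[:15]) + "\n\n")


def watch_load(
    on_high: Callable[[], None],
    cpu_count: int,
    factor: float = 2.0,
    interval_s: float = 30.0,
    max_checks: Optional[int] = None,
    read_load: Callable[[], Tuple[float, float, float]] = os.getloadavg,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the load average; call on_high once and return True when it is high.

    Returns False if max_checks polls pass without crossing the threshold.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if load_is_high(read_load()[0], cpu_count, factor):
            on_high()
            return True
        checks += 1
        sleep(interval_s)
    return False


# Example call (hypothetical path; pick one that survives a forced reboot):
# watch_load(lambda: save_snapshot("/boot/logs/load-snapshot.txt"),
#            cpu_count=os.cpu_count() or 1)
```

The polling and sleeping functions are passed in as parameters so the logic can be tried out without waiting for a real overload.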
JorgeB, posted November 11, 2022
In that case the best way to find the culprit might be to shut down all services and containers, then enable them again one by one until the problem reappears.
dellorianes, posted November 14, 2022
Since it was impossible to tell which container was the culprit (they failed randomly on each reboot), I removed the Docker image and rebuilt it. The system has been working fine since then.
dellorianes, posted November 18, 2022 (edited November 18, 2022)
I am returning to this topic because this morning the server went back to high CPU load values. Core 4 is continuously at 100% load.
I have an 8-core i7-9700 at 3000 MHz and 4x16 GB of DDR4 at 3200 MHz.
I tried to get the diagnostics, but the file does not download, and neither do the logs. I have attached the error and warning messages from the logs (copied and pasted from the GUI).
Output of the command:
~# btrfs dev stats /mnt/cache
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 48
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 32
[/dev/nvme0n1p1].generation_errs 0
Attachment: warning+errors_18-11-22.txt
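The counters printed by `btrfs dev stats` follow a simple `[device].counter value` pattern, so they are easy to check from a script. The sketch below parses that text and reports only the non-zero counters per device; the function names are illustrative assumptions, and the parser only relies on the output format shown in the post above.

```python
import re
from typing import Dict

# Matches entries such as "[/dev/nvme1n1p1].corruption_errs 48".
STAT_LINE = re.compile(r"\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)")


def parse_dev_stats(text: str) -> Dict[str, Dict[str, int]]:
    """Turn `btrfs dev stats` output into {device: {counter: value}}."""
    stats: Dict[str, Dict[str, int]] = {}
    for match in STAT_LINE.finditer(text):
        stats.setdefault(match["dev"], {})[match["name"]] = int(match["value"])
    return stats


def devices_with_errors(stats: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Keep only devices that have at least one non-zero counter."""
    return {
        dev: {name: value for name, value in counters.items() if value}
        for dev, counters in stats.items()
        if any(counters.values())
    }
```

Feeding it the output from the post would flag `corruption_errs` on both NVMe devices and nothing else.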
dellorianes, posted November 20, 2022 (edited November 20, 2022)
Running memtest86 produced thousands of errors.
Repeating the test with 2 of the 4 RAM modules gave no errors during a complete pass. Repeating it with the other 2 modules gave the same result. Repeating the test again with all 4 modules, errors appeared in less than an hour.
My motherboard's memory specification is:
4 x DIMM, max. 64 GB, DDR4 4266(O.C.)/4133(O.C.)/4000(O.C.)/3866(O.C.)/3733(O.C.)/3600(O.C.)/3466(O.C.)/3400(O.C.)/3333(O.C.)/3300(O.C.)/3200(O.C.)/3000(O.C.)/2800(O.C.)/2666/2400/2133 MHz, non-ECC, unbuffered
I was running the RAM at 3200 MHz. After changing it to 2666 MHz, I could complete one test pass without errors.
This is the diagnostics file taken after the change. What should I do to stabilize the system??
Thank you again.
Attachment: nasdiego-diagnostics-20221120-1017.zip
JorgeB, posted November 20, 2022
dellorianes said: "What should I do to stabilize the system??"
Not overclocking the RAM is a good place to start.
Squid, posted November 20, 2022
It should be noted that 2666 is also an overclock. The actual speed of the memory is 2133 (https://www.corsair.com/ca/en/Categories/Products/Memory/VENGEANCE-LPX/p/CMK32GX4M2E3200C16#tab-tech-specs), and that is the speed it should be run at. Anything over that speed will always introduce instability into any system.
dellorianes, posted November 20, 2022 (edited November 20, 2022)
OK, thank you for the advice.
Once the RAM is changed to 2133, how do I stabilize the system? Are the diagnostics OK?
dellorianes, posted November 21, 2022
Once the parity operation that was running had finished, I set the RAM to 2133 MHz.
After that, the logs show the following warnings:
Nov 21 07:17:15 NASDiego kernel: Warning: PCIe ACS overrides enabled; This may allow non-IOMMU protected peer-to-peer DMA
Nov 21 07:17:15 NASDiego kernel: ACPI: Early table checksum verification disabled
Nov 21 07:17:15 NASDiego kernel: floppy0: no floppy controllers found
Nov 21 07:17:15 NASDiego kernel: i915 0000:00:02.0: [drm] failed to retrieve link info, disabling eDP
Nov 21 07:17:21 NASDiego mcelog: failed to prefill DIMM database from DMI data
Nov 21 07:17:32 NASDiego kernel: ACPI Warning: SystemIO range 0x0000000000000295-0x0000000000000296 conflicts with OpRegion 0x0000000000000290-0x0000000000000299 (\AMW0.SHWM) (20220331/utaddress-204)
Nov 21 07:17:54 NASDiego rpc.statd[8814]: Failed to read /var/lib/nfs/state: Success
Nov 21 07:28:01 NASDiego kernel: EDID block 0 (tag 0x00) checksum is invalid, remainder is 182
Diagnostics included. Do I need to do anything about those warnings?
Thank you.
Attachment: nasdiego-diagnostics-20221121-1018.zip
JorgeB, posted November 21, 2022
You can ignore those; they are harmless.