High CPU Usage/Lockup even with simple processes



Hi all,

 

My server is running extremely slowly (100% CPU usage) for reasons I cannot understand. It idles at 20% with no Docker containers active, and even updating a container causes CPU usage to spike to 100%. This is extremely uncharacteristic of my system (it can usually handle 5+ concurrent Plex streams).

This started when I wasn't monitoring my downloads and accidentally filled the cache array to 100% capacity (which obviously causes problems). Since then I have run a balance and a btrfs scrub, with multiple restarts. I have also run the docker new permissions tool. None of this has solved the problem and I am not sure what else to do. I assume it is filesystem-related, but could it be something else?

mastertower-diagnostics-20190126-1206.zip

 

EDIT: I forgot to mention that "top" does not show any process using 20% or more of the CPU, which is quite perplexing.

Edited by ZipsServer

I don't see anything unexpected in the process list or the syslog.

 

At the time of the diagnostics capture, you had two "rsync" processes running, started by the "unbalance" plugin, and they are what is causing the load.

 

nobody   15606  0.4  0.1 110180 14136 ?        Sl   11:04   0:18 /usr/local/emhttp/plugins/unbalance/unbalance -port 6237
nobody   27569 21.4  0.0  20164  3308 ?        D    12:03   0:39  \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
nobody   27570  0.0  0.0  19556  2556 ?        S    12:03   0:00      \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
nobody   27571 15.3  0.0  19996  2384 ?        S    12:03   0:28          \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
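Note the "D" in the STAT column of the first rsync line: that is uninterruptible sleep, which usually means the process is waiting on disk I/O. Processes in that state count toward the load average even when they use little CPU. As a rough sketch (assuming `ps aux`-style columns like the ones above; the function name and the sample list are just for illustration), this is one way to pick such processes out of a saved process listing:

```python
# Sketch: list processes in uninterruptible sleep (STAT starting with "D")
# from "ps aux"-style text. Column layout assumed:
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

def uninterruptible(ps_lines):
    """Return (pid, cpu_percent, command) for every D-state process."""
    hits = []
    for line in ps_lines:
        fields = line.split(None, 10)
        # Skip the header row and anything that is not a full process row.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        if fields[7].startswith("D"):
            hits.append((int(fields[1]), float(fields[2]), fields[10].strip()))
    return hits


# Rows copied from the diagnostics excerpt above.
PS_SAMPLE = [
    r"nobody   15606  0.4  0.1 110180 14136 ?        Sl   11:04   0:18 /usr/local/emhttp/plugins/unbalance/unbalance -port 6237",
    r"nobody   27569 21.4  0.0  20164  3308 ?        D    12:03   0:39  \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/",
    r"nobody   27570  0.0  0.0  19556  2556 ?        S    12:03   0:00      \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/",
    r"nobody   27571 15.3  0.0  19996  2384 ?        S    12:03   0:28          \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/",
]

if __name__ == "__main__":
    for pid, cpu, command in uninterruptible(PS_SAMPLE):
        print(f"PID {pid} ({cpu}% CPU) is blocked: {command}")
```

A process only appears here if it happened to be blocked at the moment the listing was taken, so it is worth repeating the check a few times.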

 

Looking at the SMART reports for your cache drive(s): I'm not sure whether the RAW value has any meaning for SSDs, but for most spinning drives it does for attribute 187 (Reported_Uncorrect). Do you have a lot of power outages? If not, perhaps the power feed to the cache drives needs to be corrected. The larger drives also report a number of "unexpected power loss" events.

 

Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDA240G
Serial Number:    154836404031
LU WWN Device Id: 5 001b44 f188d033f
Firmware Version: Z22000RL
User Capacity:    240,057,409,536 bytes [240 GB]

 

174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    24
187 Reported_Uncorrect      -O--CK   100   100   000    -    5872

 

0x04  0x008  4            5872  ---  Number of Reported Uncorrectable Errors

 

---

 

Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDA240G
Serial Number:    161337404732
LU WWN Device Id: 5 001b44 4a4758b33
Firmware Version: Z22000RL
User Capacity:    240,057,409,536 bytes [240 GB]

 

174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    16
187 Reported_Uncorrect      -O--CK   100   100   000    -    2

 

0x000a  2            4  Device-to-host register FISes sent due to a COMRESET

 

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 1
    CR     = Command Register
    FEATR  = Features Register
    COUNT  = Count (was: Sector Count) Register
    LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
    LH     = LBA High (was: Cylinder High) Register    ]   LBA
    LM     = LBA Mid (was: Cylinder Low) Register      ] Register
    LL     = LBA Low (was: Sector Number) Register     ]
    DV     = Device (was: Device/Head) Register
    DC     = Device Control Register
    ER     = Error register
    ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] log entry is empty
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

---

 

Model Family:     SandForce Driven SSDs
Device Model:     TS64GSSD320
Serial Number:    A2910611949538650037
LU WWN Device Id: 0 0232d0 000000000
Firmware Version: 5.0.2
User Capacity:    64,023,257,088 bytes [64.0 GB]

 

174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    122

 

0x0009  2            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
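For anyone who wants to check these counters across several drives, here is a minimal sketch that pulls the raw values out of attribute rows in the format quoted above (ID, name, flags, value, worst, threshold, fail, raw). The list of watched IDs and the function names are my own choices for illustration, and how to interpret a raw value is vendor-specific, so treat a hit as something to look into rather than proof of a fault.

```python
# Sketch: flag watched SMART attributes whose raw value is non-zero.
# Row layout assumed: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE

# Attribute IDs to watch; this selection is an assumption for illustration.
WATCHED_IDS = (174, 187)


def parse_attribute(line):
    """Parse one attribute row, or return None if the line is not one."""
    parts = line.split()
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    raw = parts[7]
    return {
        "id": int(parts[0]),
        "name": parts[1],
        # Some drives print non-numeric raw values; keep those as None.
        "raw": int(raw) if raw.isdigit() else None,
    }


def nonzero_watched(lines, watched_ids=WATCHED_IDS):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    found = []
    for line in lines:
        attr = parse_attribute(line)
        if attr and attr["id"] in watched_ids and attr["raw"]:
            found.append((attr["id"], attr["name"], attr["raw"]))
    return found


# Rows copied from the first drive's report above.
SMART_SAMPLE = [
    "174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    24",
    "187 Reported_Uncorrect      -O--CK   100   100   000    -    5872",
]

if __name__ == "__main__":
    for attr_id, name, raw in nonzero_watched(SMART_SAMPLE):
        print(f"{attr_id:>3} {name:<24} raw={raw}")
```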

 


Thanks BRiT,

 

I was running unBalance on an array drive, but that was unrelated to the current problem.

 

I hadn't noticed the power outage metric. My server runs on a UPS, so I doubt it is an issue with mains voltage. Could it be that I need a second or a larger PSU, though? Maybe a brown-out occurs when all the drives spin up?

 

Still doesn't explain why the system/cache drives are so slow though :/
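One thing that could be checked here is whether the CPU figure is really I/O wait rather than a busy process. Time spent waiting on disks is not attributed to any process in top's list, although top's summary line reports it as "wa". Below is a minimal sketch that computes the I/O wait share from two readings of the aggregate "cpu" line in /proc/stat; the function names and the five-second interval are assumptions for illustration, not anything taken from the diagnostics.

```python
# Sketch: estimate the share of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat (Linux only).
# Field order after the "cpu" label:
# user nice system idle iowait irq softirq steal [guest guest_nice]
import time


def parse_cpu_line(line):
    """Return the first eight counters (user..steal) as integers."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:9]]


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = read_cpu_line()
    print(f"iowait over the interval: {iowait_percent(first, second):.1f}%")
```

A consistently high figure would point at the disks or the filesystem rather than at any one container.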
