High CPU Usage/Lockup even with simple processes

ZipsServer · January 26, 2019

Hi all,

My server is running extremely slowly (100% CPU usage) for reasons I cannot understand. It idles at 20% with no Docker containers active, and even updating a container causes CPU usage to spike to 100%. This is extremely uncharacteristic of my system (it can usually handle 5+ concurrent Plex streams).

This started when I wasn't monitoring my downloads and accidentally filled the cache array to 100% capacity (which obviously causes problems). Since then I have run a balance and a btrfs scrub, with multiple restarts. I have also run the docker new permissions tool. None of this has solved the problem, and I am not sure what else to do. I assume it is filesystem related, but maybe it could be something else?

mastertower-diagnostics-20190126-1206.zip

EDIT: I forgot to mention that "top" does not show any process using 20% or more of the CPU, which is quite perplexing.

Edited January 26, 2019 by ZipsServer

ZipsServer · January 26, 2019

I am seeing some errors on the SMART report for the cache drives. I have a feeling there could be a bad SATA cable or a poor connection, since I recently put my cache drives in a backplane module. I'm not sure how to diagnose or correct this problem, if it even exists.

BRiT · January 26, 2019

I don't see anything unexpected in the process list or the syslog.

At the time of the diagnostics capture, you had 2 "rsync" processes running under the "unbalance" plugin, and those are what is causing the load.

nobody 15606 0.4 0.1 110180 14136 ? Sl 11:04 0:18 /usr/local/emhttp/plugins/unbalance/unbalance -port 6237
nobody 27569 21.4 0.0 20164 3308 ? D 12:03 0:39 \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
nobody 27570 0.0 0.0 19556 2556 ? S 12:03 0:00 \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
nobody 27571 15.3 0.0 19996 2384 ? S 12:03 0:28 \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
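
For anyone who wants to check a listing like this themselves, here is a rough Python sketch that parses those lines and picks out the processes in state "D" (uninterruptible sleep, which usually means waiting on disk I/O). It assumes the usual "ps aux" column order; the names Proc, parse_ps_line and uninterruptible are made up for this example and are not part of any Unraid tool.

```python
from typing import List, NamedTuple


class Proc(NamedTuple):
    pid: int
    cpu: float    # %CPU column
    stat: str     # process state column, e.g. "D", "S", "Sl"
    command: str


def parse_ps_line(line: str) -> Proc:
    # Assumed column order (ps aux):
    # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    return Proc(int(fields[1]), float(fields[2]), fields[7], fields[10])


def uninterruptible(procs: List[Proc]) -> List[Proc]:
    # A state starting with "D" is uninterruptible sleep, usually waiting on I/O.
    return [p for p in procs if p.stat.startswith("D")]


# Process lines copied from the listing above.
SAMPLE = r"""
nobody 15606 0.4 0.1 110180 14136 ? Sl 11:04 0:18 /usr/local/emhttp/plugins/unbalance/unbalance -port 6237
nobody 27569 21.4 0.0 20164 3308 ? D 12:03 0:39 \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
nobody 27570 0.0 0.0 19556 2556 ? S 12:03 0:00 \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
nobody 27571 15.3 0.0 19996 2384 ? S 12:03 0:28 \_ /usr/bin/rsync -avPR -X Video/Action Cam /mnt/disk2/
"""

procs = [parse_ps_line(line) for line in SAMPLE.splitlines() if line.strip()]
total_cpu = sum(p.cpu for p in procs)   # combined %CPU of unbalance and its rsync children
waiting = uninterruptible(procs)

for p in waiting:
    print(p.pid, p.stat, p.cpu, p.command)
```

A process stuck in "D" adds to the load average even when it is not burning CPU, so it is worth checking this column whenever the reported load is high and nothing obvious is at the top of the list.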

From the SMART reports for your cache drive(s): I'm not sure whether the RAW value has any meaning for SSDs, but for most spinners it does for attribute 187 (Reported_Uncorrect). Do you have a lot of power outages? If not, perhaps your power plane for the cache drives needs to be corrected. The larger ones also report a number of "unexpected power loss" events.

Model Family: SandForce Driven SSDs
Device Model: SanDisk SDSSDA240G
Serial Number: 154836404031
LU WWN Device Id: 5 001b44 f188d033f
Firmware Version: Z22000RL
User Capacity: 240,057,409,536 bytes [240 GB]

174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 24
187 Reported_Uncorrect -O--CK 100 100 000 - 5872

0x04 0x008 4 5872 --- Number of Reported Uncorrectable Errors

---

Model Family: SandForce Driven SSDs
Device Model: SanDisk SDSSDA240G
Serial Number: 161337404732
LU WWN Device Id: 5 001b44 4a4758b33
Firmware Version: Z22000RL
User Capacity: 240,057,409,536 bytes [240 GB]

174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 16
187 Reported_Uncorrect -O--CK 100 100 000 - 2

0x000a 2 4 Device-to-host register FISes sent due to a COMRESET

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
Device Error Count: 1
   CR = Command Register
   FEATR = Features Register
   COUNT = Count (was: Sector Count) Register
   LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
   LH = LBA High (was: Cylinder High) Register ] LBA
   LM = LBA Mid (was: Cylinder Low) Register ] Register
   LL = LBA Low (was: Sector Number) Register ]
   DV = Device (was: Device/Head) Register
   DC = Device Control Register
   ER = Error register
   ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] log entry is empty
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]

---

Model Family: SandForce Driven SSDs
Device Model: TS64GSSD320
Serial Number: A2910611949538650037
LU WWN Device Id: 0 0232d0 000000000
Firmware Version: 5.0.2
User Capacity: 64,023,257,088 bytes [64.0 GB]

174 Unexpect_Power_Loss_Ct ----CK 000 000 000 - 122

0x0009 2 1 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 1 Device-to-host register FISes sent due to a COMRESET
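
To compare these counters across drives without reading each report by hand, something like the following Python sketch could work. It assumes the attribute lines have the layout shown above (ID, name, flags, value, worst, threshold, fail, raw) and simply lists attributes 174 and 187 when their raw value is above zero. The names SmartAttr, parse_attr, WATCHED and nonzero_watched are made up for this example, and a nonzero raw value is only a prompt to look closer, not a verdict on drive health.

```python
from typing import Dict, List, NamedTuple, Optional


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    raw: int


def parse_attr(line: str) -> Optional[SmartAttr]:
    # Assumed layout: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
    parts = line.split()
    if len(parts) < 8 or not parts[0].isdigit() or not parts[7].isdigit():
        return None  # not an attribute line (e.g. "Device Model: ...")
    return SmartAttr(int(parts[0]), parts[1], int(parts[7]))


# Attribute IDs discussed in this thread; the choice is an assumption, adjust as needed.
WATCHED = {174, 187}


def nonzero_watched(lines: List[str]) -> List[SmartAttr]:
    attrs = (parse_attr(line) for line in lines)
    return [a for a in attrs if a is not None and a.attr_id in WATCHED and a.raw > 0]


# Attribute lines copied from the reports above, keyed by serial number.
REPORTS: Dict[str, List[str]] = {
    "154836404031": [
        "174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 24",
        "187 Reported_Uncorrect -O--CK 100 100 000 - 5872",
    ],
    "161337404732": [
        "174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 16",
        "187 Reported_Uncorrect -O--CK 100 100 000 - 2",
    ],
    "A2910611949538650037": [
        "174 Unexpect_Power_Loss_Ct ----CK 000 000 000 - 122",
    ],
}

flagged = {serial: nonzero_watched(lines) for serial, lines in REPORTS.items()}

for serial, attrs in flagged.items():
    for a in attrs:
        print(serial, a.attr_id, a.name, a.raw)
```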

ZipsServer · January 26, 2019

Thanks BRiT,

I was running unBalance on an array drive, but that was unrelated to the current problem.

I hadn't noticed the power loss metric. My server runs on a UPS, so I doubt it is an issue with mains voltage. However, could it be that I need a second PSU or a larger one? Maybe all the drives spinning up at once causes a brown-out?

That still doesn't explain why the system and cache drives are so slow, though.
