
CPU Spikes and Unresponsive Dockers/Web GUI - H2OKing



Hello everyone,

I'm experiencing some issues with my Unraid server and could use some help. Here are the details:

1. Intermittent CPU Spikes:
   - Certain CPU cores (individual threads) experience intermittent spikes.
   - The overall CPU usage monitor does not spike when the individual ones do.
   - This causes dockers to become unresponsive for approximately a minute.

 

2. Web GUI Unresponsiveness:
   - The Unraid web GUI sometimes becomes unresponsive.
   - It can remain unresponsive for up to 30 minutes before becoming accessible again.

 

3. Docker Unresponsiveness:
   - During the CPU spikes, dockers become unresponsive.
   - I have to wait around a minute before I can access them again.

 

4. Disk Missing from Array:
   - During a severe hang-up, the web GUI indicated that a disk was missing from the array.
   - This issue occurred after the web GUI became responsive again.

 

5. Mounting:
   - Disk 22 takes around 20 minutes to mount.

 

Logs and Errors Found:

- Hardware Compatibility and TPM Errors:
  - tsc: Fast TSC calibration failed
  - tpm_crb: probe of MSFT0101:00 failed with error -16

- ACPI Warnings:
  - ACPI Warning: SystemIO range conflicts with OpRegion

- Time Synchronization Issues:
  - kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

- Unclean Shutdowns:
  - Notification of unclean shutdowns and potential data corruption.

- nginx Alerts:
  - Open sockets and connection issues: nginx: alert open socket left in connection

- Unraid API Warnings and Critical Alerts:
  - Placeholder entries for [warning] and [critical] without detailed information.

- Potential Disk Issues:
  - SMART logs show that error logging is supported but no errors are logged, although some entries mention mechanical start failures and CRC errors.

 

- General Protection Faults:
  - Several instances of general protection faults in the libc-2.37.so library:
    Aug  6 17:02:51 UNRAID kernel: traps: lsof[22308] general protection fault ip:1509830bac6e sp:add40251cfe82e76 error:0 in libc-2.37.so[1509830a2000+169000]
    Aug  6 17:14:49 UNRAID kernel: traps: lsof[12988] general protection fault ip:1489b20cec6e sp:41f850543809bd50 error:0 in libc-2.37.so[1489b20b6000+169000]

 

- Network Interface Issues:
  - Several network interfaces entered disabled state and promiscuous mode:
    Aug  6 17:21:49 UNRAID kernel: br-f7d9c5297b05: port 36(vethe2acfd7) entered disabled state
    Aug  6 17:24:49 UNRAID kernel: br-f7d9c5297b05: port 25(veth106540b) entered disabled state
    Aug  6 18:17:43 UNRAID kernel: br-f7d9c5297b05: port 24(veth938038d) entered disabled state

 

- Possible SYN Flooding:
  - TCP warning about possible SYN flooding on a port:
    Aug  6 18:21:03 UNRAID kernel: TCP: request_sock_TCP: Possible SYN flooding on port 53852. Sending cookies.  Check SNMP counters.

 

- SSH Connection Resets:
  - Multiple SSH connection resets by peer:
    Aug  6 23:12:53 UNRAID sshd-session[86932]: Read error from remote host 10.1.60.22 port 55739: Connection reset by peer
    Aug  6 23:12:54 UNRAID sshd-session[127246]: Read error from remote host 10.1.60.22 port 54639: Connection reset by peer
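
To put numbers on the categories listed above, something like the following Python sketch could tally syslog lines by message type. The search strings are copied from the log lines quoted in this post; the category names and the `tally` / `tally_file` helpers are hypothetical names made up for illustration and are not part of Unraid.

```python
import re
from collections import Counter

# Search strings copied from the log excerpts quoted in this post.
# The category names on the left are invented for this sketch.
PATTERNS = {
    "general_protection_fault": re.compile(r"general protection fault"),
    "bridge_port_disabled": re.compile(r"entered disabled state"),
    "possible_syn_flood": re.compile(r"Possible SYN flooding"),
    "ssh_connection_reset": re.compile(r"Connection reset by peer"),
}


def tally(lines):
    """Count how many lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def tally_file(path):
    """Tally one syslog file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as fh:
        return tally(fh)
```

Calling `tally_file("SyslogCatchAll-2024-08-06.txt")` on one of the attached logs would return a `Counter` keyed by category, which makes it easier to see which message type dominates on a given day.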

 

I've checked the logs and found some hardware-related warnings and errors, but I'm not sure how to proceed. Any advice or suggestions on how to troubleshoot and resolve these issues would be greatly appreciated.

Thank you!

Screenshot 2024-08-06 171843.png

Screenshot 2024-08-06 182241.png

Screenshot 2024-08-07 180948.png

Screenshot 2024-08-07 181000.png

Screenshot 2024-08-07 181021.png

unraid-diagnostics-20240806-1535.zip unraid-diagnostics-20240806-1540.zip SyslogCatchAll-2024-08-07.txt SyslogCatchAll-2024-08-06.txt

Edited by H2OKing
Link to comment

Which disk is missing? I don't see any disabled disk in the diags.

 

For the Disk22 mount time, I suggest trying xfs; btrfs and zfs can be much slower to mount, though 22 minutes is a lot. What type of data is on that disk: large media files or small files?

Link to comment
1 hour ago, JorgeB said:

Which disk is missing? I don't see any disabled disk in the diags.

 

For the Disk22 mount time, I suggest trying xfs; btrfs and zfs can be much slower to mount, though 22 minutes is a lot. What type of data is on that disk: large media files or small files?

I do not have the diagnostics file from when that disk dropped out. I'll grab one if it happens again. The weird part was that it didn't show the disk as missing; it just wasn't reporting readings such as temps.

 

Disk 22 is set up as in the Spaceinvader One tutorial for snapshots of my appdata. There may be some other media stuff on it as well.

Link to comment
15 minutes ago, H2OKing said:

Disk 22 is set up as in the Spaceinvader One tutorial for snapshots of my appdata.

That's probably why it takes a long time to mount: lots of snapshots of small files. You can try deleting the older ones if they are no longer needed and see if that helps.

Link to comment
On 8/8/2024 at 9:01 AM, JorgeB said:

Does it still happen if you stop the docker and VM services?

I think so. I read something about ZFS issues, so I upgraded to 7.0.0-beta.2. Uptime is now 4 days 14 hours 16 minutes, and so far I'm not seeing the same issue. I will keep an eye on it and report any further findings.

Link to comment

I'm still not seeing any issues, but I have a question regarding ZFS. Is it expected that ZFS should use more than one thread? When I monitor the system using top, I notice that ZFS maxes out a single thread at 100% and does not appear to use any others. Is this normal behavior, or should I be seeing multi-threaded performance?
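
As a cross-check on what top shows, per-thread load can also be sampled straight from `/proc/stat`. The sketch below is a minimal example under the assumption of a standard Linux `/proc/stat` layout (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up for illustration. It may also explain why the overall CPU monitor barely moves: assuming 48 hardware threads on the 2970WX with SMT enabled, one thread pinned at 100% is only about 2% of the total.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-thread 'cpuN' line."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate "cpu" line and everything that is not a CPU line.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal
        fields = [int(value) for value in parts[1:9]]
        total = sum(fields)
        idle = fields[3] + fields[4]  # idle + iowait both count as not busy
        result[parts[0]] = (total - idle, total)
    return result


def per_core_usage(before, after):
    """Percent busy per CPU between two parse_proc_stat() snapshots."""
    usage = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        elapsed = total_after - total_before
        usage[cpu] = 100.0 * (busy_after - busy_before) / elapsed if elapsed > 0 else 0.0
    return usage


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return per-CPU usage."""
    with open(path) as fh:
        first = parse_proc_stat(fh.read())
    time.sleep(interval)
    with open(path) as fh:
        second = parse_proc_stat(fh.read())
    return per_core_usage(first, second)
```

Running `sample()` in a loop and printing only the entries above, say, 90% would show whether the same thread stays pinned or the load moves around between threads.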

Link to comment
Posted (edited)

**Update: Unclean Shutdowns and Read Lockups - Potential Issues Identified in Logs**

Hey everyone,

I’ve been continuing to investigate the unclean shutdowns on my Unraid 7.0.0 beta 2 system, and I wanted to share some potential issues we've identified from the logs.

**System Specs:**
- **Motherboard:** Gigabyte X399 DESIGNARE EX-CF
- **BIOS:** Version F13a, dated 11/30/2021
- **CPU:** AMD Ryzen Threadripper 2970WX 24-Core
- **Memory:** 128 GiB DDR4

**Potential Issues Identified:**

1. **RCU Stalls:**
   - The logs show multiple instances of `rcu_preempt detected stalls on CPUs/tasks`. These errors suggest that some tasks are blocking the RCU (Read-Copy-Update) mechanism, which is critical for kernel synchronization. This could be contributing to the unclean shutdowns.

2. **OOM Killer Invocations:**
   - The Out-Of-Memory (OOM) killer has been triggered multiple times. This indicates that the system is running out of memory, leading the kernel to forcibly terminate processes. This might be another factor causing the shutdowns to be unclean.

3. **High Memory Usage:**
   - There’s significant memory usage by various kernel structures and caches, indicating the system is under considerable memory pressure. This could be exacerbating the OOM events.

4. **NVIDIA Driver Issues:**
   - The logs also include warnings related to the NVIDIA GPU driver, including firmware loading issues. These might be contributing to system instability.
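
To check the memory-pressure points above against live numbers rather than log excerpts, a rough sketch like the following could summarise `/proc/meminfo`. It assumes the usual `Name:   value kB` layout; the helper names and the choice of fields are illustrative only, not an Unraid tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer value (kB for most fields)}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Express a few fields as a percentage of MemTotal."""
    total = info.get("MemTotal", 0)

    def pct(key):
        return 100.0 * info.get(key, 0) / total if total else 0.0

    return {
        "available_pct": pct("MemAvailable"),
        "page_cache_pct": pct("Cached"),
        "slab_pct": pct("Slab"),
        "unreclaimable_slab_pct": pct("SUnreclaim"),
    }


def read_summary(path="/proc/meminfo"):
    """Read the live file and return the summary dictionary."""
    with open(path) as fh:
        return memory_summary(parse_meminfo(fh.read()))
```

A low `available_pct` together with a high `unreclaimable_slab_pct` would point at kernel-side memory that cannot be freed on demand, which is the kind of pressure that tends to precede OOM killer activity.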

These are some of the issues we've possibly found while reviewing the logs. If anyone has experienced similar issues or has insights on how to address these, I’d appreciate any advice or suggestions.

Thanks for your help!


SyslogCatchAll-2024-08-28.txt

Edited by H2OKing
Link to comment

**Follow-Up: Observations on Unclean Shutdowns and GUI Issues**

 

**Additional Observations:**

After further monitoring and analysis, we’ve encountered some new issues that might be related to the unclean shutdowns:

1. **GUI Disk Issues:**
   - There was an instance where the Unraid GUI displayed all disks as gone or blank. This issue prevented a clean reboot via the GUI’s restart button. We had to use SSH to execute the shutdown command manually.

2. **High Activity in kworker Processes:**
   - We observed significant CPU activity related to `kworker` threads, particularly those tied to cryptographic tasks (`kcryptd`). This suggests heavy disk encryption or decryption operations, which might be straining system resources.

3. **OOM Reaper Activity:**
   - The `oom_reaper` process was active, aligning with earlier logs showing Out-Of-Memory (OOM) killer invocations. This confirms that the system is under considerable memory pressure.

---

**Sharing Our Findings:**

We’re sharing these observations to see if others in the community have encountered similar issues or have insights into what might be causing them. Specifically, we’re curious about:

- **GUI Issues:** Has anyone else experienced their disks appearing blank in the GUI, and if so, what was the root cause?
- **Cryptographic Tasks:** Are there known issues with `kcryptd` processes consuming excessive CPU, and could this be linked to the system's instability?
- **OOM and Memory Pressure:** Any tips on managing memory more effectively to avoid OOM killer invocations?

Any feedback or suggestions based on these findings would be greatly appreciated. We’re continuing to investigate and will share more as we find it.

Thanks again for the community's help!

processes.txt

Link to comment
