H2OKing Posted August 8 (edited)

Hello everyone,

I'm experiencing some issues with my Unraid server and could use some help. Here are the details:

1. Intermittent CPU Spikes:
   - This causes dockers to become unresponsive for approximately a minute.
   - Certain CPU cores (individual threads) experience intermittent spikes.
   - The overall CPU usage monitor does not spike up when the individual ones do.
2. Web GUI Unresponsiveness:
   - The Unraid web GUI sometimes becomes unresponsive.
   - It can remain unresponsive for up to 30 minutes before becoming accessible again.
3. Docker Unresponsiveness:
   - During the CPU spikes, dockers become unresponsive.
   - I have to wait around a minute before I can access them again.
4. Disk Missing from Array:
   - During a severe hang-up, the web GUI indicated that a disk was missing from the array.
   - This issue occurred after the web GUI became responsive again.
5. Mounting:
   - Disk 22 takes around 20 minutes to mount.

Logs and Errors Found:

- Hardware Compatibility and TPM Errors:
  - tsc: Fast TSC calibration failed
  - tpm_crb: probe of MSFT0101:00 failed with error -16
- ACPI Warnings:
  - ACPI Warning: SystemIO range conflicts with OpRegion
- Time Synchronization Issues:
  - kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
- Unclean Shutdowns:
  - Notification of unclean shutdowns and potential data corruption.
- nginx Alerts:
  - Open sockets and connection issues: nginx: alert open socket left in connection
- Unraid API Warnings and Critical Alerts:
  - Placeholder entries for [warning] and [critical] without detailed information.
- Potential Disk Issues:
  - SMART logs showing that error logging is supported but no errors logged. Some mention of mechanical start failures and CRC errors.
- General Protection Faults:
  - Several instances of general protection faults in the libc-2.37.so library:
    Aug 6 17:02:51 UNRAID kernel: traps: lsof[22308] general protection fault ip:1509830bac6e sp:add40251cfe82e76 error:0 in libc-2.37.so[1509830a2000+169000]
    Aug 6 17:14:49 UNRAID kernel: traps: lsof[12988] general protection fault ip:1489b20cec6e sp:41f850543809bd50 error:0 in libc-2.37.so[1489b20b6000+169000]
- Network Interface Issues:
  - Several network interfaces entered disabled state and promiscuous mode:
    Aug 6 17:21:49 UNRAID kernel: br-f7d9c5297b05: port 36(vethe2acfd7) entered disabled state
    Aug 6 17:24:49 UNRAID kernel: br-f7d9c5297b05: port 25(veth106540b) entered disabled state
    Aug 6 18:17:43 UNRAID kernel: br-f7d9c5297b05: port 24(veth938038d) entered disabled state
- Possible SYN Flooding:
  - TCP warning about possible SYN flooding on a port:
    Aug 6 18:21:03 UNRAID kernel: TCP: request_sock_TCP: Possible SYN flooding on port 53852. Sending cookies. Check SNMP counters.
- SSH Connection Resets:
  - Multiple SSH connection resets by peer:
    Aug 6 23:12:53 UNRAID sshd-session[86932]: Read error from remote host 10.1.60.22 port 55739: Connection reset by peer
    Aug 6 23:12:54 UNRAID sshd-session[127246]: Read error from remote host 10.1.60.22 port 54639: Connection reset by peer

I've checked the logs and found some hardware-related warnings and errors, but I'm not sure how to proceed. Any advice or suggestions on how to troubleshoot and resolve these issues would be greatly appreciated.

Thank you!

Attachments: unraid-diagnostics-20240806-1535.zip, unraid-diagnostics-20240806-1540.zip, SyslogCatchAll-2024-08-07.txt, SyslogCatchAll-2024-08-06.txt

Edited August 8 by H2OKing
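One way to confirm the symptom described above (individual threads spiking while the overall CPU graph stays flat) is to compare two snapshots of `/proc/stat`. The following is a minimal sketch, not something taken from the diagnostics: the function names are hypothetical, and it assumes iowait should be counted as idle time, which is a judgment call.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} from /proc/stat content."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Aggregate and per-core lines are named "cpu", "cpu0", "cpu1", ...
        if not fields or not fields[0].startswith("cpu"):
            continue
        # Columns: user nice system idle iowait irq softirq steal
        # (guest columns are skipped because they are already part of user).
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        stats[fields[0]] = (total - idle, total)
    return stats


def utilization(before, after):
    """Return {cpu_name: percent busy} between two parsed snapshots."""
    result = {}
    for name, (busy_after, total_after) in after.items():
        if name not in before:
            continue
        busy_before, total_before = before[name]
        delta_total = total_after - total_before
        if delta_total > 0:
            result[name] = 100.0 * (busy_after - busy_before) / delta_total
        else:
            result[name] = 0.0
    return result


def sample_live(interval=1.0):
    """Read /proc/stat twice on a Linux host and return per-core utilization."""
    with open("/proc/stat") as handle:
        before = parse_proc_stat(handle.read())
    time.sleep(interval)
    with open("/proc/stat") as handle:
        after = parse_proc_stat(handle.read())
    return utilization(before, after)
```

Calling `sample_live()` in a loop and printing only the entries above, say, 90% would show which threads are pinned at the moment the dockers stop responding. With 48 threads, one saturated thread moves the aggregate `cpu` figure by only about 2%, which would be consistent with the overall monitor not spiking.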
JorgeB Posted August 8

Which disk is missing? I don't see any disabled disk in the diags.

As for the Disk22 mounting time, I suggest trying xfs. btrfs and zfs can be much slower to mount, though 22 minutes is a lot. What type of data is on that disk, large media files or small files?
H2OKing (Author) Posted August 8

1 hour ago, JorgeB said: Which disk is missing? I don't see any disabled disk in the diags. Disk22 mounting time, suggest trying xfs, btrfs and zfs can be much slower to mount, though 22 minutes is a lot, what type of data is on that disk, large media files or small files?

I do not have the diagnostics file from when that disk dropped out. I'll grab one if it happens again. The weird part was that it didn't show as missing. It just wasn't reporting readings such as temps.

Disk 22 is set up following the SpaceInvaderOne tutorial, for snapshots of my appdata. There may be some other media on it as well.
JorgeB Posted August 8

15 minutes ago, H2OKing said: Disk 22 is set up like Space invader, tutorial for snapshots of my app.Data

That's probably why it takes a long time to mount: lots of snapshots of small files. You can try deleting the older ones if they are no longer needed and see if that helps.
H2OKing (Author) Posted August 8

Will look into it. My biggest issue is the never-ending hang-ups. It's nonstop. It happens around once or twice every 10 minutes.
JorgeB Posted August 8

49 minutes ago, H2OKing said: It happens around once or twice every 10 minutes

Does it still happen if you stop the docker and VM services?
H2OKing (Author) Posted August 16

On 8/8/2024 at 9:01 AM, JorgeB said: Does it still happen if you stop the docker and VM services?

I think so. I read something about ZFS issues, so I upgraded to 7.0.0-beta.2. Uptime is now 4 days 14 hours 16 minutes, and so far I'm not seeing the same issue. I'll keep an eye on it and report back with any further findings.
H2OKing (Author) Posted August 23

I'm still not seeing any issues, but I have a question regarding ZFS. Is it expected that ZFS should use more than one thread? When I monitor the system using top, I notice that ZFS is only utilizing 100% of a single thread, with just one thread being maxed out. Is this normal behavior, or should I be seeing multi-threaded performance?
JorgeB Posted August 24

If it's a single operation it's normal.
H2OKing (Author) Posted August 26

Well, I just noticed this. I don't think I have enough snapshots @SpaceInvaderOne
H2OKing (Author) Posted August 28 (edited)

**Update: Unclean Shutdowns and read lockups - Potential Issues Identified in Logs**

Hey everyone,

I've been continuing to investigate the unclean shutdowns on my Unraid 7.0.0 beta 2 system, and I wanted to share some potential issues we've identified from the logs.

**System Specs:**

- **Motherboard:** Gigabyte X399 DESIGNARE EX-CF
- **BIOS:** Version F13a, dated 11/30/2021
- **CPU:** AMD Ryzen Threadripper 2970WX 24-Core
- **Memory:** 128 GiB DDR4

**Potential Issues Identified:**

1. **RCU Stalls:**
   - The logs show multiple instances of `rcu_preempt detected stalls on CPUs/tasks`. These errors suggest that some tasks are blocking the RCU (Read-Copy-Update) mechanism, which is critical for kernel synchronization. This could be contributing to the unclean shutdowns.
2. **OOM Killer Invocations:**
   - The Out-Of-Memory (OOM) killer has been triggered multiple times. This indicates that the system is running out of memory, leading the kernel to forcibly terminate processes. This might be another factor causing the shutdowns to be unclean.
3. **High Memory Usage:**
   - There's significant memory usage by various kernel structures and caches, indicating the system is under considerable memory pressure. This could be exacerbating the OOM events.
4. **NVIDIA Driver Issues:**
   - The logs also include warnings related to the NVIDIA GPU driver, including firmware loading issues. These might be contributing to system instability.

These are some of the issues we've possibly found while reviewing the logs. If anyone has experienced similar issues or has insights on how to address these, I'd appreciate any advice or suggestions.

Thanks for your help!

Attachment: SyslogCatchAll-2024-08-28.txt

Edited August 28 by H2OKing
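To see whether the RCU stalls, OOM kills and general protection faults cluster around the hang-ups, the syslog can be tallied per hour. This is a rough sketch under stated assumptions: the `PATTERNS` table and function name are hypothetical, the OOM strings are the usual kernel wording rather than lines quoted in this thread, and the timestamp format is assumed to match the `Aug 6 17:02:51` style shown in the first post.

```python
import re
from collections import Counter

# Substrings to look for. The RCU, GPF and SYN strings appear in this thread;
# the OOM strings are assumed from typical kernel output.
PATTERNS = {
    "rcu_stall": re.compile(r"rcu_preempt detected stalls"),
    "oom_killer": re.compile(r"invoked oom-killer|Out of memory: Killed process"),
    "gp_fault": re.compile(r"general protection fault"),
    "syn_flood": re.compile(r"Possible SYN flooding"),
}

# Captures "Mon D HH" from a classic syslog timestamp such as "Aug 6 17:02:51".
HOUR_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}):")


def count_events_by_hour(lines):
    """Return a Counter keyed by (hour, event_name)."""
    counts = Counter()
    for line in lines:
        match = HOUR_RE.match(line)
        hour = match.group(1) if match else "unknown"
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(hour, name)] += 1
    return counts
```

Running it over a file is a matter of `count_events_by_hour(open("SyslogCatchAll-2024-08-28.txt", errors="replace"))` and printing the sorted items. Hours where several event types pile up are the ones worth reading line by line.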
JorgeB Posted August 28

Lots of out-of-memory issues. java was the process killed, but there are a lot of strange processes, so likely one or more containers are causing the issue. Try to identify which one is the problem by running them one or a few at a time.
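As a complement to running the containers one or a few at a time, a one-shot `docker stats` snapshot can be ranked by memory use to decide which containers to test first. The sketch below is an assumption-laden helper, not part of the advice above: the function names are hypothetical, and it assumes `docker stats` reports sizes with binary suffixes such as `MiB` and `GiB`.

```python
import subprocess

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(text):
    """Convert a size string such as '1.5GiB' to a number of bytes."""
    text = text.strip()
    # Check longer suffixes first so 'GiB' is not mistaken for plain 'B'.
    for unit in sorted(UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return float(text[:-len(unit)]) * UNITS[unit]
    raise ValueError(f"unrecognized size: {text!r}")


def rank_by_memory(stats_output):
    """Return [(container_name, bytes_used)] sorted from largest to smallest."""
    rows = []
    for line in stats_output.splitlines():
        if not line.strip():
            continue
        name, mem_usage = line.split("\t")
        used = mem_usage.split("/")[0]  # "used / limit" -> "used"
        rows.append((name, parse_size(used)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def live_stats():
    """Take one snapshot from the docker CLI as 'name<TAB>used / limit' lines."""
    cmd = ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.MemUsage}}"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

`rank_by_memory(live_stats())[:5]` gives the five largest consumers at that instant. A snapshot can miss a container that only balloons occasionally, so it narrows the search rather than replacing the one-at-a-time test.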
H2OKing (Author) Posted August 28

**Follow-Up: Observations on Unclean Shutdowns and GUI Issues**

**Additional Observations:**

After further monitoring and analysis, we've encountered some new issues that might be related to the unclean shutdowns:

1. **GUI Disk Issues:**
   - There was an instance where the Unraid GUI displayed all disks as gone or blank. This issue prevented a clean reboot via the GUI's restart button. We had to use SSH to execute the shutdown command manually.
2. **High Activity in kworker Processes:**
   - We observed significant CPU activity related to `kworker` threads, particularly those tied to cryptographic tasks (`kcryptd`). This suggests heavy disk encryption or decryption operations, which might be straining system resources.
3. **OOM Reaper Activity:**
   - The `oom_reaper` process was active, aligning with earlier logs showing Out-Of-Memory (OOM) killer invocations. This confirms that the system is under considerable memory pressure.

---

**Sharing Our Findings:**

We're sharing these observations to see if others in the community have encountered similar issues or have insights into what might be causing them. Specifically, we're curious about:

- **GUI Issues:** Has anyone else experienced their disks appearing blank in the GUI, and if so, what was the root cause?
- **Cryptographic Tasks:** Are there known issues with `kcryptd` processes consuming excessive CPU, and could this be linked to the system's instability?
- **OOM and Memory Pressure:** Any tips on managing memory more effectively to avoid OOM killer invocations?

Any feedback or suggestions based on these findings would be greatly appreciated. We're continuing to investigate and will share more as we find it.

Thanks again for the community's help!

Attachment: processes.txt
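For the memory-pressure question, it may help to log how much memory the kernel considers available in the run-up to an OOM kill. The following is a small sketch that parses `/proc/meminfo`; the function names are hypothetical, and the 10% threshold mentioned afterwards is an arbitrary example, not a recommended value.

```python
def parse_meminfo(text):
    """Return {field: integer value} from /proc/meminfo content.

    Most values are in kB; a few counters (e.g. HugePages_Total) have no unit.
    """
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total RAM the kernel estimates is available to new work."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_live():
    """Parse the current /proc/meminfo on a Linux host."""
    with open("/proc/meminfo") as handle:
        return parse_meminfo(handle.read())
```

Logging `available_fraction(read_live())` once a minute, and noting when it drops below something like 0.10, would show whether memory drains gradually (pointing at a leak or a growing cache) or collapses suddenly when a particular container starts.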