JeremiePlam Posted June 7, 2023 (edited)

Hello everyone,

I've been encountering some issues with my Unraid server recently and could use some assistance. Here's a brief summary of the problem: I experienced data corruption following a power loss, and numerous BTRFS errors have been occurring since then.

Initially, I managed to keep things running by restarting the system occasionally and recreating the docker.img file. However, I decided to address the problem this weekend after receiving my UPS to protect against future power losses. Here's what I did:

1. Backed up my data using the "CA Backup Appdata" tool, after deleting any corrupted temporary files to ensure a clean backup.
2. Completely wiped the BTRFS pool and formatted the two cache drives.
3. Started fresh by restoring the appdata and recreating the docker.img file.

Unfortunately, the BTRFS pool started showing errors again just two days after the cleanup. Additionally, the Docker service intermittently fails to start, although the containers themselves remain responsive in their respective WebUIs. However, communication between containers becomes impossible. I have temporarily increased docker.img to 40GB to rule out storage space as a cause, but the problems persist.

The only way to stop the containers now is to reboot the server, which results in an unclean shutdown because Unraid fails to stop everything properly. I suspect that the issue might be related to my RAM, but memtest did not detect any errors. I searched other forum threads but couldn't find a solution.

I would greatly appreciate any guidance or suggestions to help resolve these ongoing issues. Thank you in advance for your assistance!

This is what the Docker tab looks like after a few minutes or hours of working properly:
[Screenshot: Docker tab]

The log looks like this:
[Screenshot: log]

All of the drives still have plenty of space left:
[Screenshot: drive usage]

Attachment: unraid-diagnostics-20230607-1110.zip
JorgeB Posted June 7, 2023

That does look like a RAM issue; memtest doesn't always detect everything. Try using just one of the RAM sticks, re-format the cache, and see how it goes. If the errors return, try the other stick. If they still return, that would basically rule out RAM.
JeremiePlam Posted June 23, 2023

I changed the RAM sticks and didn't get any btrfs errors for a week. I then upgraded to Unraid 6.12 and switched my cache pool to zfs. I haven't had any issues since then, but I cannot confirm for sure that the RAM solved it, since too many factors changed. However, if anyone else has this problem, I'd recommend trying another RAM kit for a few weeks and returning it if it doesn't fix the issue.
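For anyone who wants to keep an eye on a btrfs pool during a trial period like the one described above, here is a minimal sketch that scans the per-device error counters. It assumes text in the form printed by `btrfs device stats <mountpoint>`; the function name `nonzero_counters`, the sample text, and the device names are made up for illustration and are not from this thread.

```python
import re

# One counter per line, e.g. "[/dev/sdb1].corruption_errs  12"
STAT_RE = re.compile(r"^\[(.+?)\]\.(\w+)\s+(\d+)$")


def nonzero_counters(stats_text):
    """Return {device: {counter: value}} for every counter above zero."""
    problems = {}
    for raw in stats_text.splitlines():
        m = STAT_RE.match(raw.strip())
        if m and int(m.group(3)) > 0:
            problems.setdefault(m.group(1), {})[m.group(2)] = int(m.group(3))
    return problems


# Hypothetical sample output; real output comes from running the command
# against the pool's mount point.
SAMPLE = """\
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  12
[/dev/sdb1].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  0
[/dev/sdc1].generation_errs  0
"""

print(nonzero_counters(SAMPLE))
```

An empty result means every counter is still at zero. Any non-zero counter, particularly `corruption_errs`, is worth investigating before trusting the pool again.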
JeremiePlam Posted July 10, 2023 (edited)

I don't know why, and I can't seem to find the info anywhere, but the Unraid server is now randomly shutting down (uncleanly) and restarting. This is the second time it has happened this week with brand new RAM (I ran memtest86+ for about 30 hours with no errors). If RAM is still the problem, I have until today to return the kit.

I have also started getting hardware errors for some reason:

Jul 10 04:06:41 Unraid kernel: mce: [Hardware Error]: Machine check events logged
Jul 10 04:06:41 Unraid kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000002080b
Jul 10 04:06:41 Unraid kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 5d020002 IPID 1002e00000500
Jul 10 04:06:41 Unraid kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1688976370 SOCKET 0 APIC 0 microcode 8701021

Here's the diag. Your help would be much appreciated, as I'm struggling to understand what is going on.

Attachment: unraid-diagnostics-20230710-0659.zip
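For readers trying to make sense of a line like the "Machine Check ... Bank 27" entry above, here is a minimal sketch that pulls out the CPU, bank, and status value and decodes only the architectural flag bits of the x86 machine-check status register (bits 63 down to 57). The function name `decode_mce_line` is hypothetical, and the sketch makes no attempt at the vendor-specific meaning of the bank number, the extended error code, or the MISC/SYND/IPID fields.

```python
import re

# Architectural flag bits of an x86 MCi_STATUS register (bit -> name).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

LINE_RE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})"
)


def decode_mce_line(line):
    """Decode CPU, bank, set flags and low 16-bit error code, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    status = int(m.group(3), 16)
    return {
        "cpu": int(m.group(1)),
        "bank": int(m.group(2)),
        "flags": [n for bit, n in STATUS_FLAGS.items() if (status >> bit) & 1],
        "error_code": status & 0xFFFF,  # low 16 bits: MCA error code
    }


line = ("Jul 10 04:06:41 Unraid kernel: mce: [Hardware Error]: "
        "CPU 0: Machine Check: 0 Bank 27: baa000000002080b")
print(decode_mce_line(line))
```

For the status value in the log above, the set flags are VAL, UC, EN, MISCV and PCC. In other words, the error was valid, was not corrected, and the processor context may have been corrupted, which would be consistent with the unclean resets described in the post.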
JorgeB Posted July 10, 2023

Those errors and the symptoms point more to a hardware problem, also make sure this is done.
JeremiePlam Posted July 10, 2023 (edited)

2 hours ago, JorgeB said:
Those errors and the symptoms point more to a hardware problem, also make sure this is done.

I've updated the BIOS, set Power Supply Idle Control to Typical Current Idle, and disabled Global C-States. I have also disabled any form of overclocking on the CPU and RAM. We'll see how it goes.

However, no kit seems to support 3200MT/s natively; they run at 2133MT/s and reach their rated speed only with an XMP profile (which is a form of OC). So should I apply the profile or not? I'm confused. I'm returning the 3600 (XMP) kit and buying a 3200 (XMP) kit instead.
JorgeB Posted July 10, 2023

3 minutes ago, JeremiePlam said:
no kit supports 3200MT/s

There is RAM that supports 3200MT/s without XMP, at slower timings but at the default voltage, unlike XMP kits, which usually overvolt the RAM, i.e. overclock it.