hades Posted March 20, 2022 Share Posted March 20, 2022 (edited) Hello Everyone! I'm struggling with some data corruption happening on multiple servers (glorified desktops). In total I have 8 UnRAID installations. 2 older ones which have been running fine for years, with multiple disks and VMs, no problems. This post is not about these two. More recently however, I've bought 6 almost-identical Ryzen-based servers (it's a business environment). All of them are: - Ryzen CPU, 12 or 16 core - 128GIG RAM - 3x 6TB drives for storage (xfs) in the array, one parity - 2x 2TB NVMe drives, Samsung 970EVO Plus (in a cache-pool as BTRFS) - weak GPU used for computer booting only All of them contain data on the arrays, and run ~10 VMs (mainly Windows) on which people do remote work. Recently on 4 of these Ryzen machines I've been plagued by BTRFS errors, usually this: Sometimes this results in the filesystem going into read-only mode, effectively taking down the dockers and VMs, forcing me to reboot, after which everything works for a day or so and then this repeats. Two of my Ryzen servers are installed in one location, they're running perfectly. In another location, I had 1 server running for 6 months without issues, then bought Server #2 due to capacity issues, which ran fine for a few months. Then Server #1 started having issues, and I urgently needed things to work, so I bought Server #3 to move the data/VMs onto it. Pretty quickly Server #3 started experiencing the same thing. Because I was planning on buying Server #4 anyway, I did, just installed it a few days ago, and it's already experiencing this problem. The motherboards and RAM are not completely identical. All NVMe-s were Samsung 970EVO Plus 2TB, and because this error is on the NVMe, I bought different NVMe-s for Server #4, it is using WD 2TB NVMe-s, but it's experiencing the same problem. Given the variation in hardware, and the chances of a component failing (which is somewhat rare) I highly doubt this is a hardware problem. But I'm now at a loss as to what is happening. It's eating up my hours and days, trying to stabilize things, make sure employees are able to do their jobs. Everyone works remotely, they need this to work on the data and execute long-running jobs. I'm in the process of clearing out one of the servers, so I can remove it from the network and do some tests on it without any data/VMs. Logs for all 4 servers are attached. Any help would be appreciated! Thank you. Server4-diagnostics-20220319-1738.zip Server3-diagnostics-20220319-1738.zip Server2-diagnostics-20220319-1738.zip Server1-diagnostics-20220319-1738.zip Edited March 20, 2022 by hades Quote Link to comment
trurl Posted March 20, 2022 Share Posted March 20, 2022 I suspect you're running your memory out-of-spec Quote Link to comment
hades Posted March 20, 2022 Author Share Posted March 20, 2022 43 minutes ago, trurl said: I suspect you're running your memory out-of-spec Thank you for the quick response. I already did this (from the FAQ): "find "Power Supply Idle Control" (or similar) and set it to "typical current idle" (or similar)." I just checked the server I'm concentrating on fixing, it's running at 2666Mhz, which is what the RAM is designed for. Should I slow the RAM down? I'm already getting these errors 12 minutes after bootup. Once these errors occur, would they go away if the underlying problem (let's say RAM out-of-spec) is fixed? Or would the pool need to be formatted for the errors to go away? Thank you! Quote Link to comment
JorgeB Posted March 21, 2022 Share Posted March 21, 2022 On 3/20/2022 at 1:36 AM, hades said: Should I slow the RAM down? You should set it to the max officially supported speed for that config, I only checked server4, that one has 4 dual ranked DIMMs @ 3200MT/s, max for that config is 2666MT/s. On 3/20/2022 at 1:36 AM, hades said: Once these errors occur, would they go away if the underlying problem (let's say RAM out-of-spec) is fixed? No, but you can run a scrub, delete the affected files then reset the pool stats. Quote Link to comment
hades Posted March 22, 2022 Author Share Posted March 22, 2022 On 3/21/2022 at 11:50 AM, JorgeB said: You should set it to the max officially supported speed for that config, I only checked server4, that one has 4 dual ranked DIMMs @ 3200MT/s, max for that config is 2666MT/s. No, but you can run a scrub, delete the affected files then reset the pool stats. Good catch on Server4. Thank you. I went through all 4 servers, set the RAM speed to 2666 (disabled the XMProfiles), and made sure that the Power Limit something is set to "Typical". I then ran scrub, which identified a few files as corrupted. Those files have been deleted. Just focusing on Server4 for now. I ran scrub a few times yesterday and no errors. It ran fine for 1.5 days, and is now showing the same types of errors again. The tutorial says to run RAM at 2666, check. Power Limit something = Typical, check. VMs which so far seem to have been unaffected are now showing up as corrupted. Is this a Ryzen thing ONLY? If so, I am good to go and buy an Intel-based MB & CPU as replacement. It's going to come out cheaper than the productivity I and the team are losing out on due to issues like this... Thank you. Server4-diagnostics-20220322-1517.zip Quote Link to comment
hades Posted March 22, 2022 Author Share Posted March 22, 2022 (edited) Also, the logs are now full of these types of errors. This is server4: Is this related? EDIT 20 minutes later: Awesome... The entire server died. Had to do a hard-reboot. This is the longest, most frustrating, and most expensive, experience with UnRAID/Linux/whatever... Edited March 22, 2022 by hades Quote Link to comment
JorgeB Posted March 23, 2022 Share Posted March 23, 2022 Run memtest, could just be bad RAM, or other hardware issue. Quote Link to comment
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.