B1scu1T Posted March 13, 2023

Hi all,

I have been wrestling with this server since I built it, with "random" reboots and a whole host of other annoying issues. Starting to get worried about my data now tbh.

Specs:
ASRock Rack X570-D4U
AMD Ryzen 5 PRO 4650G
2x 32GB DDR4 ECC Unbuffered RAM
NVIDIA T400

I had a power cut this morning and it of course coincided with me having my UPS offline due to a battery replacement... typical. Just to double up on my misfortune, I had also made a mistake in a .conf file whilst setting up an external rsyslog server to help me trace the root cause of my general server issues, so after all the effort I didn't even get any off-board log data... typical.

I have been suspicious that the problems could be related to the chipset on the motherboard, so as a troubleshooting measure I installed an LSI 9211-8i, which is connected via SAS cables to a SAS backplane.

The power cut of course meant I had to run a parity check. While it was running (it probably didn't make it to 10%) I got an email from the server saying there was a disk error. When I logged in, I noticed that the drive was disabled and all the other drives had read errors. Out of panic, I just stopped the parity check, as I noticed that I had left "write corrections" on and was concerned I would be writing garbage to my array.

I have no idea where I stand at the moment, so I'm open to suggestions on how to check my data and/or test my hardware. I have run a short SMART test, and am currently running an extended test on the drive that Unraid says has failed. I also ordered a replacement drive which will arrive tomorrow, although I don't actually think there is anything wrong with the current one.

Diagnostics attached.
tower-diagnostics-20230313-1415.zip
JorgeB Posted March 13, 2023
Diags are after rebooting so we can't see what happened, but the disk looks healthy.
B1scu1T Posted March 13, 2023
3 minutes ago, JorgeB said: "Diags are after rebooting so we can't see what happened, but the disk looks healthy."
I was concerned that might be a problem. A couple of questions for moving forward:
1. Is there anything other than an external syslog server that can help me trace my reboots?
2. Is there anything I can do to test the validity of the data on my array?
3. How do I see where the "fail" flag is and "acknowledge" it so the array can be started with this disk?
Due to the power failure and the subsequent failed parity check, I am genuinely not confident that my parity data is good enough to rebuild the array from.
JorgeB Posted March 13, 2023
Start the array to see if the emulated disk mounts and its contents look correct. If yes, you can rebuild on top. Another option, assuming nothing was written to that disk once it got disabled, would be to do a new config and re-sync parity instead.
35 minutes ago, B1scu1T said: "Is there anything I can do to test the validity of the data on my array?"
Only if you have pre-existing checksums for all data (or are using btrfs).
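To illustrate what "pre-existing checksums" means in practice, here is a minimal sketch in Python. It is not an Unraid feature or anything taken from the diagnostics in this thread, only one way such a manifest could be built ahead of time and checked later. The function names and the `/mnt/disk1` path in the comments are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return (relative_path, problem) pairs for missing or changed files."""
    root = Path(root)
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            problems.append((rel, "missing"))
        elif sha256_of(target) != expected:
            problems.append((rel, "changed"))
    return problems


# Hypothetical usage: build the manifest while the data is known to be good,
# store it somewhere off the array, and verify against it after an incident.
#   manifest = build_manifest("/mnt/disk1")
#   problems = verify_manifest("/mnt/disk1", manifest)
```

The catch is the one already stated above: the manifest has to exist before the incident. Checksums generated after the power cut only tell you whether files change from now on, not whether they were already damaged.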
B1scu1T Posted March 13, 2023
I should probably point out that the reason the log is wiped is that the server rebooted itself when I stopped the array. Guess I will have to have a think. About ready to chuck this POS motherboard in the bin; I never had as many issues as this with my Supermicro Intel systems.
B1scu1T Posted March 14, 2023
The extended SMART test on that disk came back OK. I decided that I have more trust in the disks' data than in the parity, given that I saw the disks writing data to the parity while one of the disks was disabled, so I created a new config and tried to build new parity, but it didn't make it. Sadly I don't think my remote syslog helps us, but I'm grateful for any feedback.
unraid_155.zip
tower-diagnostics-20230314-0942.zip
Edited March 14, 2023 by B1scu1T
JorgeB Posted March 14, 2023
There's nothing relevant logged. That, together with the symptoms, points to a hardware issue.
B1scu1T Posted March 14, 2023
40 minutes ago, JorgeB said: "There's nothing relevant logged, this and the symptoms to me point to a hardware issue."
Definitely. I am already pretty sure of this, but I was hoping there would be an indicator here as to what it is. I ran memtest for a few days when I built this system, so I have no clue where the hardware issues are coming from.
Edited March 14, 2023 by B1scu1T
JorgeB Posted March 14, 2023
Going by the symptoms, the board, RAM or PSU would be the main suspects. I would start by trying with just one of the RAM sticks. If it still crashes, try with just the other one; if it crashes with either stick on its own, that would basically rule out the RAM.
B1scu1T Posted March 14, 2023
6 minutes ago, JorgeB said: "By the symptoms Board, RAM or PSU with be the mains suspects, I would start by trying with just one of the RAM sticks, if it still crashes try with just the other one, that would basically rule out the RAM."

The PSU is pretty new, and it's a decent unit (Corsair TX850M). I could swap in my older CS450M though; that ran for years without fault, so it would at least remove one of the variables.

IPMI is also of absolutely no help and isn't reporting any issues other than clock sync, but it also isn't dropping out when the server drops out, which probably suggests there isn't a significant power issue. I have grabbed the debug logs from there. There are some errors around the same time as the reboot, but they don't, IMO, point to anything useful... a bit beyond my expertise at this point.

The issue with tracing the individual components is that I have fiddled with settings previously, thinking I had figured out the problem (e.g. RAM settings/voltage, CPU settings, GPU settings), and then had the system stable for literally months with no crashes or reboots, so I presume it is OK until it randomly happens again. At the moment it appears to be totally borked at exactly the wrong moment.

I'm at breaking point tbh; I need this server to run in some capacity. I think I'm going to pick up a v6 E3 Xeon and supporting board from eBay, then I can experiment with this X570 and get to the bottom of its issues as a separate project (yippee). Might even just sell the board on and/or discuss it with the retailer if it looks like there is a problem with it.

var.zip
trurl Posted March 14, 2023
21 hours ago, B1scu1T said: "2x 32GB DDR4 ECC Unbuffered RAM"
1 hour ago, B1scu1T said: "I have run memtest"
The memtest on the Unraid boot menu can't detect errors with ECC RAM. You have to get the version from memtest86.com.
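As a side note, not a replacement for the memtest86.com run suggested above: on Linux systems where the kernel's EDAC driver is loaded for the memory controller, corrected and uncorrected ECC error counters are exposed under `/sys/devices/system/edac/mc`. Whether EDAC is available on this board under Unraid is an assumption nobody in this thread has checked, so treat the following Python as a rough sketch only. The function name is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict if the EDAC directory does not exist,
    which usually means no EDAC driver is loaded.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing or unreadable on this system.
                entry[name] = None
        counts[mc.name] = entry
    return counts


# Hypothetical usage on the running server:
#   for controller, c in read_edac_counts().items():
#       print(controller, c["ce_count"], c["ue_count"])
```

An empty result does not mean the RAM is fine; it only means the counters are not exposed, in which case the standalone memory test remains the way to go.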
B1scu1T Posted March 14, 2023
6 minutes ago, trurl said: "memtest on the Unraid boot menu can't detect errors with ECC RAM. You have to get memtest86.com"
I honestly can't remember how I tested it, but good shout. No harm in running it again.
B1scu1T Posted March 14, 2023
Man, no matter what I do, this motherboard will not boot from an EFI-based USB image. I ended up having to clear my CMOS, as I had fiddled with so many things that I eventually couldn't get it to POST at all.
B1scu1T Posted March 14, 2023
I finally got it to boot the Memtest USB drive, but somehow in the process I have bricked my IPMI 😄 It's been like this since my last post, pretty much. Seriously, f**k this motherboard.
B1scu1T Posted March 16, 2023
So on the one hand, I ran the 4-pass memtest with ECC polling enabled twice and got no errors... so I don't think it's that. On the other hand, the ASRock Rack support team supplied me with some SOC flash tools to fix my IPMI, but they don't recognise the BMC image.
B1scu1T Posted April 25, 2023
I managed to get the board alive again through what seems to be nothing but sheer luck; however, I didn't get to the bottom of why there were constant crashes. I just bit the bullet and bought new hardware. Now running Intel and Supermicro and it's been flawless. Despite the lack of Memtest-reported failures, I'm inclined to lean towards it being a memory issue, with a slight possibility that there is something wrong with the 4650G CPU or the UEFI settings.
Edited April 25, 2023 by B1scu1T