vmax5000 Posted June 4 The server is a new system. I had bad RAM last time; I have since replaced it and it tested OK. The server ran great for several days and then locked up tonight. I could not access it locally with a monitor, keyboard and mouse, and a short press on the power button to trigger a shutdown did nothing. I have the syslog, and I see I/O errors on disk 3. When I started the server, disk 3 was disabled. Do you think that is what caused the lockup? I replaced disk 3 with a known good drive and it is doing the parity sync. If possible, can you look at the syslog and let me know whether disk 3 was the cause, or whether I am wrong and something else caused the lockup? Thanks. Attached: syslog-127.0.0.1.log
JorgeB Posted June 4 Nothing relevant is logged that I can see, which suggests a hardware issue. The disk errors should not make the server crash, but they need to be resolved, of course.
vmax5000 Posted June 4 I don't know where to look for a hardware issue; it's a completely new build. Any idea where to begin looking? Thanks
JorgeB Posted June 4 You can start by running memtest, but memtest is only definitive if it finds errors. If you have multiple sticks, try running the server with just one; if the problem is the same, try a different one. That will basically rule out bad RAM.
vmax5000 Posted June 4 I had a crash before, so I did take out one stick. I will wait for the parity sync to finish and then run another memtest on the single stick. I hope it's not memory, I just had them replaced! I wonder if the DDR5 memory is the cause. Maybe I should not use the XMP profile on the RAM. Anyway, thanks, I will carry on and hopefully find an answer. The server will run for a few weeks before it locks up, so it may take some time.
itimpi Posted June 4 11 minutes ago, vmax5000 said: "Maybe I should not use the xmp profile on the ram" I would definitely try without this, as an XMP profile is by definition an overclock.
vmax5000 Posted June 12 Just to update this issue: I tested the RAM extensively and it passed without issue. What I noticed was that every time the system locked up, drive 3 had continuous SATA resets; that was always the last message shown in the syslog. So I took drive 3 out of the server and rigged a temporary test setup, plugging the drive directly into the motherboard instead of using the drive tray in the server. Since then the server has run without issue and the drive has had no resets. So I am pretty sure I have a bad SATA connector in the server, and for whatever reason all of the SATA resets would halt Unraid. Just for laughs, here is my temp setup for testing! LOL I have since set up a better option with an external drive cage. Now I am able to add an additional 3 drives to my array. I added the drives and they show up in Unraid, but I can't find a way to add them to the array. When I look at Global Share Settings it only shows the current 10 drives, and I can't see any setting that lets me add the extra 3 drives to the array. Am I missing something? Thanks. Attached: ultimate-diagnostics-20240612-0956.zip
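For readers who want to check their own syslog for the kind of repeated SATA resets described in the post above, here is a minimal Python sketch. It assumes the resets appear as Linux libata kernel messages of the form `ataN: hard resetting link` or `ataN: soft resetting link`; the function names are made up for illustration, and the log in this thread may use different wording. Note that `ataN` is a controller port number, not the Unraid disk number, so matching a port to "disk 3" still needs the diagnostics.

```python
import re
from collections import Counter

# Assumption: resets are logged by libata as "ataN: hard resetting link"
# or "ataN: soft resetting link". Adjust the pattern if your log differs.
RESET_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (?:hard|soft) resetting link")


def count_link_resets(lines):
    """Return a Counter mapping each ATA port (e.g. 'ata3') to its reset count."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_link_resets_in_file(path):
    """Convenience wrapper (hypothetical helper): scan a saved syslog file."""
    with open(path, errors="replace") as handle:
        return count_link_resets(handle)
```

Calling `count_link_resets_in_file("syslog-127.0.0.1.log").most_common()` would list the ports with the most resets first. A port that dominates the list shortly before each lockup is a reasonable candidate for a cable, backplane or connector check.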
JorgeB Posted June 12 Do you mean you cannot assign them to the array, or cannot assign them to shares? For the former, make sure you increase the number of array slots. P.S. IMHO there is not much point in explicitly including all disks in global shares: if you then add new ones, they won't show up in shares unless you remember to add them.
vmax5000 Posted June 12 OK, I was trying to assign them to shares. How do I increase the number of array slots? Thanks
JorgeB Posted June 12 With the array stopped, click the slot number and increase it.
vmax5000 Posted June 12 I'm not quite sure I understand. When the array is stopped, what slot number do I click to increase it? I only see the 10 array slots when I stop the array. Thanks
vmax5000 Posted June 12 I don't know how I missed that! Thanks to all the help from everyone, I'm pretty sure my server is stable again!
vmax5000 Posted June 19 Well, I thought everything was good now, but it has locked up several times again, the last time after only 7 hours of uptime. I turned the syslog back on and this time I got some data pertaining to nginx. I don't understand the error, but maybe you can see something that points to an issue? Attached: syslog-127.0.0.1.log ultimate-diagnostics-20240618-2046.zip
vmax5000 Posted June 19 After looking into it a bit more closely, it looks like the worker error is a Gmail error: it could not do the two-factor authentication. But I am still stumped as to why the server keeps locking up. When it halts, all of the fans on the CPU, hard drives and GPU just ramp up and stay there. I tried pushing the power button once to do a proper shutdown, but it doesn't work; the system is completely halted. I have run the RAM test several times and it always passes. The only thing I haven't done yet is install the latest BIOS update, so tomorrow I will install it and see what happens. Thanks
JorgeB Posted June 19 There are constant call traces. Start by running memtest. If nothing is found, and because memtest is only definitive if it finds errors, try with just one stick of RAM; if the problem is the same, try a different one. That will basically rule out bad RAM. If issues persist, another thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
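As a rough aid for spotting the call traces mentioned above, the following Python sketch lists every line containing the kernel's `Call Trace:` marker together with the few lines logged just before it, which usually carry the warning or fault that triggered the trace. The function name and the `context` parameter are illustrative assumptions, not part of any Unraid tool.

```python
from collections import deque


def find_call_traces(lines, context=3):
    """Return a list of (line_number, preceding_lines) for each 'Call Trace:' line.

    `preceding_lines` holds up to `context` lines logged immediately before
    the trace, oldest first.
    """
    recent = deque(maxlen=context)  # rolling window of the latest lines
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if "Call Trace:" in text:
            hits.append((number, list(recent)))
        recent.append(text)
    return hits
```

Running this over the syslog before and after a change (one stick of RAM, safe mode, a BIOS update) gives a quick count of whether the traces are becoming rarer, which is easier to compare than reading the whole log each time.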
vmax5000 Posted June 19 I have run memtest with one stick over and over and it passes, so I put both sticks back and ran memtest again, and it passed again. It seems to be getting worse: it crashed again last night, even though I had disabled the email. Again I got constant worker errors just before it crashed. OK, I will run memtest again with one stick, and if it passes again I will try running in safe mode as you suggest. I'm going to update the BIOS today as well. Thanks. Just in case, here is another syslog from last night. Attached: syslog-127.0.0.1.log.1
JorgeB Posted June 19 1 hour ago, vmax5000 said: "I have run the memtest with one stick over and over" That's not what I meant: run Unraid with just one stick, and if the problem is the same, try a different one. There are still constant call traces.
vmax5000 Posted June 19 Sorry, that's what I am going to do. I just got a pass with both sticks; now I'm going to test with one stick. I updated the BIOS; it was 3 versions behind, so now it's up to date. After all of the testing is done I will post a new syslog when or if it crashes again. Thanks again
vmax5000 Posted June 19 I meant to say that I will run Unraid with one stick, as you suggested.
vmax5000 Posted June 21 New problem. When I started up the system on June 19 around 4 pm it went into a parity sync, so I let it finish, and it ran with no errors (both RAM sticks). I thought I would let it run with both RAM sticks since I had updated the BIOS to the current version. Then, for no reason that I can see in the syslog, the system just powered off: no lockup, it just powered down. The log shows nothing, so I removed one RAM stick and powered back up at 7:47. Everything came up, but I cancelled the parity sync and let the system run. The system powered down again at 7:56. I didn't notice until around 10 pm, so at 10:47 I powered up the system again and stopped the parity sync. At 2 pm on June 20 I powered up the system again, and the last entry for June 20 is at 10:30 pm, when I logged in to see if the system was running, and it was. When I got up in the morning the system was powered off again. So I powered up, went into the BIOS and loaded the defaults in case I had something configured wrong (I made one other change: I set the OS type from Windows to Other OS). I powered up again and now it's running; I stopped the parity sync again. I can't see anything in the logs that would make the system power down. Here are the logs. Thanks. Attached: syslog-127.0.0.1.log ultimate-diagnostics-20240621-0828.zip
JorgeB Posted June 21 55 minutes ago, vmax5000 said: "syslog the system just powered off" That can only be a hardware issue. Assuming the power didn't fail, the PSU or the board would be the main suspects.
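One way to confirm from a persistent syslog that the machine lost power without any shutdown being logged is to look at the last line written before each boot. The Python sketch below assumes that each boot is marked by the kernel's `Linux version` banner line; that marker, the function name and the parameter are assumptions for illustration, and other logging setups may need a different marker.

```python
def lines_before_boot(lines, boot_marker="Linux version"):
    """Return the line logged immediately before each boot banner.

    If those lines show ordinary activity rather than shutdown messages,
    the previous session most likely ended abruptly (power loss or a hard
    reset) rather than through an orderly shutdown.
    """
    previous = None
    results = []
    for line in lines:
        text = line.rstrip("\n")
        # A boot banner with something logged before it marks a session boundary.
        if boot_marker in text and previous is not None:
            results.append(previous)
        previous = text
    return results
```

The timestamps on the returned lines also give the approximate time of each power-off, which can be compared against UPS logs or other events on the same circuit.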
vmax5000 Posted June 21 Well, all the hardware is brand new, so I don't know what I can do. I can't really go back to the motherboard manufacturer with no way to show a problem, and the PSU is attached to a UPS. The only thing that has really changed since the lockup issue is the BIOS update. Maybe I can roll back the BIOS and try again. Anyway, there don't seem to be any issues with Unraid according to the logs. I will keep working on the issue, and if I come up with a solution I will post it. Thanks for your help.
vmax5000 Posted July 2 I found the issue! I didn't want to post until I had tested the fix. My server has been running without issue for over a week. The fix came in a new beta BIOS for my motherboard; see the attachment. I'm still having some hard drive issues, but those are caused by my server chassis. I am getting read errors on a drive and it keeps getting disabled. I took the drive out and tested it, and it has no issues, so the connection in the server is failing; this is the second drive this has happened with. I got an external drive cage, put the "bad drive" in the cage, and there are no more drive issues. So I am replacing this old server, which will fix the drive issues. Anyway, the solution for my original problem is the BIOS setting. Thanks for all the help
vmax5000 Posted August 15 I just thought I would give you an update on my system. After doing this I still had random reboots, so I went back into the BIOS and made all the changes shown below, and my server has now been completely stable for over 23 days. I then set my memory profile back to XMP 1 and the server has remained stable, so you can run with the XMP profile. I don't know which of these settings fixed the issue. I know there is a problem with Intel 13th and 14th gen CPUs, so I'm guessing that has something to do with it. I am currently running a beta version of the BIOS, F11b, but since my server is completely stable I will wait for the final version to come out before updating. I hope this helps if someone is having the same issues I was having. My motherboard is an Asus z790 UD AC with a Core i7-14700K.