me160 Posted March 17, 2022
Bit of a puzzler, this. My system has been plagued by random crashes and various problems since I built it 4 years ago. From my understanding and previous posts, it's due either to my lack of proper setup or simply to running on Ryzen. That aside, my system had been up for about a month with no issues until last week, when it crashed. Unfortunately I'm working out of town right now and can't access the hardware to check anything, other than having a buddy hit the restart button and send a picture of the console display on my attached monitor when he gets around to it. Since that one crash it usually crashes right after rebooting and letting Docker start a few containers, though sometimes it's up for a good hour. Each crash seems to have a different cause, but I've seen this one 3 times now (see attached picture). Can anyone help me with this, or at least let me know if this is a hardware failure?
Squid Posted March 17, 2022
First thing to check on any random crashing is to run Memtest from the boot menu for a minimum of a pass or two. (You'll have to temporarily switch to Legacy boot for it to work; UEFI booting will not let Memtest run.) After that, refer to the Ryzen FAQ: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#comment-819173
me160 (Author) Posted March 17, 2022
Alright, I'll give it a try this weekend when I get back home. I've already done everything I could from the Ryzen FAQ about setting it up, and it didn't help except to make the crashes not quite as frequent. Same goes for updating to 6.10.0-rc1, though I might also try updating to rc3 to get a newer kernel.
me160 (Author) Posted March 19, 2022
OK, so I got around to trying to do a memtest. For some reason it won't let me boot into legacy mode: my motherboard just says the media is unbootable, but it works just fine in UEFI mode. Did I forget to do something? I did however notice that when I upgraded my RAM (from 32 GB to 64 GB) I forgot to check whether the board set the speed correctly, and it was running 3200 MHz RAM at 2400 MHz. As far as I'm aware that shouldn't be a problem, as it wasn't overclocked and I'd just be losing a bit of performance, but please correct me if I'm wrong or if that could have been the problem.
Squid Posted March 19, 2022
Your alternative if you can't boot legacy is to set up a new flash drive with https://www.memtest86.com/
me160 (Author) Posted March 20, 2022
OK, I let the memtest run overnight and it passed all tests 4 times with 0 errors. What would my next steps in diagnosing the problem be, or could it just be the RAM speed being set low that's causing problems?
itimpi Posted March 20, 2022
RAM speed being set too low is unlikely to be a problem. What is more frequent is that users try to run the RAM at the max speed quoted for the RAM, not realising that the motherboard + CPU combination may have a much lower maximum speed at which it can run stably.
me160 (Author) Posted March 20, 2022
That's what I figured. As far as I can tell my CPU can handle the RAM I've got at max frequency, so I have set it to the max of the RAM just to see if it helps.
me160 (Author) Posted April 1, 2022
Hey, I haven't heard anything for a week. Just wondering if anyone has any further ideas on whether this is a hardware problem or a software problem, or how to narrow down which it is?
me160 (Author) Posted May 10, 2022
For anyone who is still interested in this, I figured out the problem. It turns out my power supply was either failing, or I was drawing too little power from it for it to do its job efficiently and supply stable power. I was using a Corsair sfx-750 on a system that draws a max of around 250 W. (I can't determine the exact usage, as my only number is from my UPS, which has my Unraid server, another system running pfSense, security cameras and 3 routers on it. In Unraid my UPS reports an average usage of around 255 W, and the highest I saw was about 300 W.) So I installed a smaller sfx-400 (again Corsair, as it was the only one I could get; it is also 80+ Platinum, and all the other SFX power supplies I could find were only Bronze or Silver). The system has been up for 3 weeks, only turning off when I manually reboot or shut it down to do other things, and I haven't had any problems since swapping the power supply.

If anyone else cares to know the science behind what I found out, or if it could be useful to someone, here it is. Power supplies have efficiency curves as well as stability curves. The power supply is *most* efficient and stable at about 40-55% load. Higher than that you only lose a few percent of efficiency (my particular one is 94% efficient at 45% load, its peak, and only drops to 92% efficient at 100% load). But efficiency isn't what I'm talking about here, it's the stability curve. Voltage regulators are a bit of silicon that take an unstable voltage and, wait for it, regulate it to a constant voltage that is generally very stable and has little or no fluctuation. A regulator does this by effectively working like a light switch, turning the power on and off very fast. They are very good at it when they are within their specified stability range, meaning there must be at least X amount of amperage being drawn for it to make stable, clean volts.
If it gets too high of a draw, it overheats and may fail, or it causes power "flickering" because the rate at which it turns on/off slows down to account for the current draw, so you end up with a burst of high current and voltage, then it turns off and repeats until the load goes down or the regulator burns up. However, if there isn't *enough* load on a regulator, it can also create voltage fluctuations. This will be far less harmful, as we're talking as low as 0.1-0.2 V or even lower depending on the rail voltage it's on. But to a CPU or memory module a 0.2 V difference matters: where its voltage should be, say, 3.3 V, one second it could get 3.5 V, then the next 3.1 V, and memory modules do not like that much voltage difference. They can accept a difference about one decimal place smaller (3.28-3.32 V, for example).

If your power supply is like mine was, where I had a 750 W power supply serving a 250 W peak load, that's about a 30% load, which in theory should be OK, and if the server ran at peak all the time it probably would be. But I suspect it often dipped down to around 200 W or lower, which would be in the 20s, and I believe most power supplies' stability curves start being stable enough for computers to be, well, stable at around 30-35%-ish. Now that I've lowered my power supply rating (by going to the 400 W), I've increased my max load to just over 60% of the power supply instead of 30%, which should mean that even at the lowest my system runs I should still be in about the 40% load range.

That's enough of that lecture on power supplies. If anyone wants more on this, LTT did a great video on it a while ago, link here.

tl;dr: power supplies cause problems too, so don't throw the biggest power supply available into your system.

If any info above is incorrect (please don't tell me I'm totally wrong, I'm not, I do have a background in building electronics; not sure why I didn't think about my power supply until now though), or if anyone has more data than I shared, please correct me. I like learning new things.
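For anyone who wants to redo the load arithmetic from the post above with their own numbers, here is a minimal Python sketch. The helper name `load_percent` is made up for illustration, and the 200 W and 250 W figures are the poster's own estimates from a shared UPS, not measured values. It only computes draw as a percentage of the supply's rating; it says nothing about where any particular power supply actually becomes stable.

```python
def load_percent(draw_watts: float, rated_watts: float) -> float:
    """Return the load on a power supply as a percentage of its rated output."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return 100.0 * draw_watts / rated_watts


# Draw figures taken from the post (estimates, not measurements).
draws = {"suspected low draw": 200, "estimated peak": 250}
# Rated outputs of the two supplies discussed in the post.
ratings = {"750 W supply": 750, "400 W supply": 400}

for psu_name, rated in ratings.items():
    for label, draw in draws.items():
        pct = load_percent(draw, rated)
        print(f"{psu_name}, {label} ({draw} W): {pct:.1f}% load")
```

With these inputs the 750 W supply sits at roughly 27-33% load and the 400 W supply at 50-62.5%, which matches the "about 30%" and "just over 60%" figures quoted in the post.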