Server keeps going offline requiring hard reset (button press) on average once a day. Need help


ANJ_
Solved by ANJ_

I'm new to Unraid and recently built my first server. I've spent the past 1-2 months working out the kinks, but there is one issue I can't get any grasp on: my server "goes offline" regularly, about once a day, sometimes more. If I'm lucky it stays online for around 2 days. To access the server again I have to hold in the power button and restart the tower. I hate doing this, and it's getting to a point that just feels unhealthy for the PC.

To be clear, the tower itself remains powered on. I can hear the fans blowing and the discs spinning, and the lights are on. Looking at it you wouldn't know anything is up, but the GUI is inaccessible, and on the Unraid website 'My Servers' says my server is offline. I have no idea why it's doing this and don't know how to diagnose the cause.

It almost always happens while I'm sleeping. I barely ever catch it happening, I simply wake up and my server is offline. It just happened right now, maybe even right when midnight hit. At first I thought the system was going to sleep and not waking on LAN, but if that's the case I don't really know how to test for it. I disabled global C states and thought that solved it at one point, which is when my server lasted about 2 days. I also don't think it's a sleep issue, because I had accessed my Docker containers about 10 minutes before it seemed to happen, and I've let the server sit much longer without activity and it's been fine. Oddly, when this happens and I reset the system, it often happens again a very short time after the reset.

 

So the process usually looks like: Notice server is offline --> Hard reset --> Log back into server --> 30-60 min later server offline again --> Hard reset --> 1-2 days later server offline --> (Loop).

 

Some users on Reddit said to consider the power supply, but I don't think that's the issue: the power supply is brand new, has plenty of wattage for the system, and the tower itself still appears stable and powered when this happens. I'd also rather not buy another power supply just to test it, since I really don't think it's the cause.

I could really use a hand here. If anyone knows how I can diagnose this, it would be appreciated. I'll post the diagnostics since I know that is typically requested. Since I have just reset my server, I expect it will go offline again in the next 30 min, and after I reset it again it should stay up for a day or two. Will edit this post to see if that holds true and is indeed a pattern... Edit: It did not go offline again shortly after the reset. However, 24 hours later it did go offline, I reset it, and within 30 min it went offline again.
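Since the freezes mostly happen unobserved overnight, one way to pin down when they occur is a small heartbeat logger. The following is only a sketch, not something from the thread: the file path is a hypothetical placeholder (pick any location that survives a reboot), and the function names are made up for illustration.

```python
import os
import time
from datetime import datetime

# Hypothetical path: replace with any location that persists across reboots.
HEARTBEAT_FILE = "/mnt/user/system/heartbeat.log"
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def write_heartbeat(path, now=None):
    """Append one timestamp line and force it to disk so a freeze cannot lose it."""
    now = now or datetime.now()
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(now.strftime(STAMP_FORMAT) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def last_heartbeat(path):
    """Return the most recent timestamp in the file, or None if it is empty or missing."""
    try:
        with open(path, encoding="utf-8") as fh:
            lines = [line.strip() for line in fh if line.strip()]
    except FileNotFoundError:
        return None
    return datetime.strptime(lines[-1], STAMP_FORMAT) if lines else None


def run(path=HEARTBEAT_FILE, interval_seconds=60):
    """Print the last beat recorded before this start, then keep beating until killed."""
    print("Last heartbeat before this start:", last_heartbeat(path))
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

Calling `run()` from a script started at boot would print the last timestamp written before the hang, which narrows the freeze to within one interval. Frequent writes keep a disk awake or add wear to a flash drive, so the interval and location are a trade-off.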

anjnas-diagnostics-20230121-0042.zip

Edited by ANJ_
Link to comment
22 hours ago, JorgeB said:

You can enable the simplest option: Mirror to flash drive.

 

Ok, so sometime in the last 4 hours my server went offline. I set up a syslog on the local server and chose a drive, but it didn't seem to create anything. I also enabled mirror to flash, and the only file I see is simply labeled syslog in the logs folder, so I'll post that in hopes that it helps.

 

I don't know what to look for in it, though. It looks like it only goes back about 10 minutes, which is probably when I hard reset the server, so it doesn't seem to have any information from when the server went offline.
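When a mirrored syslog does survive a reset, the longest silence between two consecutive entries usually brackets the freeze: the line before the gap is the last thing logged, and the line after is the first line of the next boot. A minimal sketch, assuming classic syslog lines that start with a stamp like `Jan 21 00:42:01` (the year is not in the log, so it is passed in), with hypothetical function names:

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Jan 21 00:42:01' stamp of a syslog line, or return None."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (seconds, line_before, line_after) for the longest silence in the log."""
    best = None
    prev_time, prev_line = None, None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines without a recognizable timestamp
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line.rstrip(), line.rstrip())
        prev_time, prev_line = stamp, line
    return best
```

For example, `largest_gap(open("syslog", encoding="utf-8", errors="replace"), 2023)` returns the gap length in seconds plus the two lines on either side of it. The lines just before the gap are the ones worth reading for errors.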

 

Also, as an update, the server did in fact do the thing where it goes offline again shortly after a reset. It's offline again right now, so I'll hard reset and it should be good for another 2 days before it starts again.

syslog

Edited by ANJ_
Link to comment
9 hours ago, JorgeB said:

There are several crashes logged, look more like a hardware problem, start by running memtest

Yeah, by now it has probably racked up quite a few crashes. I'll run a memtest, but attempting to do so from the boot menu just causes a reboot. I read that this has to do with booting in UEFI, and since the flash drive is set up as UEFI I see no real way around that, which seems a bit odd. I see that others are getting a newer memtest online, so I'll search for that now.

 

Is there any particular piece of hardware the crashes point to? I know this doesn't guarantee everything is working perfectly, but nearly all the hardware is brand new and all of it is compatible, and I've never received DOA or defective hardware in my maybe 15 years of building computers. So I'm just curious if there's a specific component that seems the most likely culprit.
 

Would ruling out BIOS settings be in the cards as well? I really could have used a personal advisor on optimal BIOS settings for an Unraid server. I typically don't change much in BIOS when building standard desktop PCs, but there seemed to be more suggested BIOS configuration for a server, and I'm still not confident that I changed every setting I should have, or that I didn't change settings unnecessarily...

 

Will update once I've got a memtest going for some 24-48 hours

 

*MEMTEST CURRENTLY RUNNING*

 

Edit: Also, while trying to get memtest working I ran into this video, and right off the bat it is humorous how identical my situation is. Even the box of RAM he is holding is the RAM I have, so I'm about to check it out.

 

Edited by ANJ_
Link to comment

So I ran memtest, but it's limiting me to only 4 passes and I don't know how to run it for longer than that, such as 24 hours. The 4 passes only took about 5 hours, and it passed with 0 errors. I'll run it again. Is running it multiple times equivalent to running it for an extended amount of time? Since I'm limited to 4 passes, that's all I can really see to do.

 

4/4 passes 0 errors

8/8 passes 0 errors

12/12 passes 0 errors

16/16 passes 0 errors

Edited by ANJ_
Link to comment

4 passes is probably enough, and if it is staying powered on it's probably not the power supply. I would check the CPU next, although in your case you may need another HDD or SSD and boot into Windows to do so. It could be a motherboard or CPU problem. Or, since you have just built it, check your temps to make sure they are good and reseat the CPU. Even if you are positive you did it right, I have found that taking the CPU out and reseating it fixes issues like that.

 

Link to comment
6 hours ago, philliphartmanjr said:

4 passes is probably enough and if it is staying powered on probably not the power supply....

 

 

If it were anything hardware related, I would put the odds on the CPU since I did pick it up used, so I will definitely check this next. What's the best way to put the CPU to the test? The most I did before dropping it into the rig was boot it into BIOS on a test bench. I can do a Windows install to test it if that's the best way, but is there no good way to test on Linux? I actually thought memtest also tested the CPU.

I haven't had any thermal warnings within Unraid, but I won't rule it out entirely. My HDD thermals have all been fine since some early reconfiguring after I got 1 temp warning. I would think the CPU is doing fine on thermals, since it's almost always running at pretty low usage and I did drop a Noctua cooler on the thing.

Some additional information in case someone finds it may be pertinent to the problem:

My mobo does not have USB 2.0, so the flash drive is plugged into USB 3.0.

Also, the flash drive is a cheapo 1 GB throwaway that came from a conference. I tried multiple other larger flash drives that I would have considered higher quality, but they kept not having unique identifiers and whatnot, so I just went with the one it accepted.

Edited by ANJ_
Link to comment
6 hours ago, ANJ_ said:

If it were anything hardware related I would have to put the odds on it being the cpu being that I did pick it up used....

I would really encourage reseating the CPU before testing whether it might be bad, because whatever the test results are, the next step would be to reseat the CPU and run the tests again. Prime95 and PassMark are the 2 that I use for CPU testing. Keep track of the CPU temp during the test.

 

As far as the USB goes, I would recommend a different one down the road. You can get USB 2.0 flash drives on Amazon that are not too expensive. After you check the CPU, if all looks good, trying a different USB drive would be my next step. I have seen older posts where users had issues with USB 3.0, and I am not sure whether that was fixed.

 

Keep in mind that your flash drive needs to be 32 GB or less, I believe, unless that has changed, and one with SLC storage is more reliable if you can find it. Also, Unraid does not run from the flash drive after boot. The contents of the flash drive are loaded into RAM at boot and the OS runs from there. Nothing else happens on the USB drive unless you change a setting, add a Docker container, or the like. Then the change is written to the USB drive and that is it.

 

On the other flash drives that did not work, did you reformat them before trying to put Unraid on them? If not, that may fix that issue.

 

Link to comment
5 hours ago, philliphartmanjr said:

I would really encourage reseating the CPU before testing to see if it might be bad....

 

 

Ok, I guess I'll reseat the CPU next. I never really thought that would be an issue, and I would still find it pretty surprising. I would think the only issue fixed by reseating a CPU would be a bent pin or something of the sort, but it's definitely worth a shot. I'll look into those two tests. I suppose I'll reseat and see if the server stays online, and if it goes offline I'll try a Windows install and run some tests.

 

The USB stick, by the way, I'm sure is 2.0. It's just that my motherboard only has 3.0 ports. Someone suggested buying a USB 2.0 cable/header that plugs into an open USB header on the mobo, so I could probably do that. I just wasn't sure how important it really was. And I did format the flash drives before trying to install Unraid. They fell within the specs, they just weren't being recognized by the USB boot image creator tool for Unraid.

 

If the RAM's good and the CPU's good, I feel like I'm running out of hardware to suspect. I have no GPU in the server, and the hard drives all passed SMART. I haven't seen anything more than a few random UDMA CRC errors, which seem to pop up and then instantly resolve when I run a test on the drives, so I've started to ignore those. (I believe the error points to loose SATA connections, but the SATA cables aren't loose.) I imagine these happen because the drives are plugged into a board within the case, which then has a SATA connector out for each drive, so it's sort of a third wheel at the party. These are few and far between, but periodically one will pop up in the notifier for a random drive.

Edited by ANJ_
Link to comment
49 minutes ago, ANJ_ said:

Ok I guess I'll reseat the cpu next....

Reseating the CPU is a newer thing, mostly for CPUs where the pins are on the motherboard. I didn't look at your diagnostics, so I will do that before I post again. I will say I have had Unraid do something similar to me, and it ended up being because it didn't like small errors or some minor setting not being quite right. Network settings especially have frozen my system.

 

I ran a cheap 3rd party PCIe card in my old system and it worked great until I upgraded. I don't know if yours is PCIe or not. Maybe try another slot, or if you have another PC, throw it in there and see if you have any issues with the card. I'm assuming it's not a RAID card, but if it is you will want any parity on the card turned off.

Link to comment
8 hours ago, philliphartmanjr said:

I ran a cheap 3rd party pcie card in my old system and it worked great until I upgraded....

 

Sorry, I wasn't super clear. The card the HDDs are plugged into isn't a PCI card or anything. It's built into the case, with slotted HDD connections that connect to the same number of SATA outputs on the other end. Its only purpose is cable management, letting you connect your SATA cables from a different position in the chassis.

Speaking of network errors though, I have now run into a whole new issue that I might need to make a new thread for. I'm not very well versed in network configuration. After doing some memtests and toggling a BIOS setting, which seemed to be a mistake since I could no longer boot to BIOS, I flashed BIOS, reconfigured, and relaunched into Unraid. However, my server now has no internet functionality. My mobo has two LAN ports. When I'm plugged into the same Ethernet port I was using up to today, I can't access the Unraid GUI via LAN and have no internet access on the server. When I switch the Ethernet cable to the other (previously unused) LAN port, I can access the Unraid GUI via LAN but I still don't have internet access.

I don't understand why this is. My network settings show eth1 as inactive/shutdown, but eth0 seems fine. I have no bridging or bonding between the ports, as I am not savvy enough to understand its function. On my first install of Unraid I disabled bridging and bonding because with them enabled, network speeds on all devices in the network were terrible. Once I turned them off, devices were usable again.

Anyway, once again I'm at a loss. Another oddity is that only when plugged into the second (previously unused) Ethernet port can I see the tower listed as connected to the router, and even then it doesn't show up as an attached device. It only shows up when I hit the Access Control button in my router, where it is on a list of connected devices.

 

To reiterate:

Scenario 1:

Connected to - LAN port 1 (previously used port)

Unraid results - eth0 Interface Down, eth1 Interface Down

(No internet access/No LAN GUI access)

Scenario 2:

Connected to - LAN port 2 (previously unused port)

Unraid results - eth0 (1000 Mbps out, # Mbps in), eth1 Interface Down

(No internet access/Yes LAN GUI access)

Before this my setup looked like:

Connected to - LAN port 1

Unraid results - eth0 (1000 Mbps out, # Mbps in), eth1 (I honestly don't remember if any information was displayed)

(Yes internet access/Yes LAN GUI access)
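To record what the kernel itself reports for each port in scenarios like these, the link state can be read from Linux sysfs instead of the GUI. This is a rough sketch with hypothetical function names. It relies only on the standard `/sys/class/net/<interface>/operstate` and `speed` files, and treats any file that cannot be read (for example `speed` on a port with no link) as unknown.

```python
import os


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read().strip()
    except OSError:
        return None


def interface_states(root="/sys/class/net"):
    """Map each interface name to the operstate and speed reported by the kernel."""
    states = {}
    for name in sorted(os.listdir(root)):
        base = os.path.join(root, name)
        states[name] = {
            "operstate": read_text(os.path.join(base, "operstate")),
            "speed_mbps": read_text(os.path.join(base, "speed")),
        }
    return states
```

Running `interface_states()` once per cable position and comparing the `eth0` and `eth1` entries shows whether the kernel sees a link on the port at all, which helps separate a cabling or driver problem from a routing or gateway problem.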

 

Edit: Updated to the most current BIOS and networking started working again. *shrugs*

I found a possible C state setting in BIOS that I hadn't disabled (I think it was C1E), so that is disabled now and I'm going to see if the server stays up. I'll update if it works, and if not I'll continue with testing and reseating the CPU.

Edited by ANJ_
Link to comment
9 hours ago, philliphartmanjr said:

Sounds like somehow your interfaces got switched in network settings. This is managed in Unraid under Settings, Network. Scroll down past eth0 and eth1 and you should see interface rules, where you can switch which Ethernet interface is assigned where. eth0 has to be assigned and functional because that is the one Unraid uses for itself.

 

Yeah, I'm not too sure what happened. I'm guessing flashing BIOS just messed up the LAN drivers, since a BIOS upgrade solved it. I tried switching the rules, but that just reversed the problem based on which port my cable was plugged into (scenario 1 became scenario 2 and vice versa). It's working as expected again, and currently my network settings read eth0 1000 Mbps, full duplex, mtu 1500, eth1 interface down / [ shutdown (inactive) ]

 

I have the server back up and am testing to see if it goes offline at or around 2 days. I did have one server crash this morning with the RAM log at 100%, so I rebooted. I'm going to consider it a fluke for now and see what happens over the coming days, as it's up and running at the moment. I think the logs were getting spammed with an error possibly related to my Intel iGPU being used as a transcoding device. The logs were spammed with: " kernel: get_swap_device: Bad swap file entry  ############# " where # = a seemingly random string of letters and numbers
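A log that fills to 100% is usually dominated by one repeating message, and counting messages after masking the variable parts makes that obvious. The sketch below is an illustration only: it assumes classic syslog lines of the form `Jan 26 09:15:01 Tower kernel: message`, and the function names are hypothetical. Any token containing a digit is collapsed to `#`, so a random value made up purely of letters would not be merged.

```python
import re
from collections import Counter

# Timestamp and hostname at the start of a classic syslog line.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+")
# Any whitespace-delimited token that contains at least one digit.
VARIABLE = re.compile(r"\S*\d\S*")


def normalize(line):
    """Strip the timestamp and host, then mask tokens that contain digits."""
    body = PREFIX.sub("", line.strip())
    return VARIABLE.sub("#", body)


def top_messages(lines, n=5):
    """Return the n most frequent normalized messages with their counts."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return counts.most_common(n)
```

Feeding the syslog through `top_messages` gives a short ranked list, so a flood such as the `Bad swap file entry` lines shows up as a single entry with a large count rather than thousands of near-identical lines.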

Edited by ANJ_
Link to comment

Alright, so still at the drawing board here. The server went unresponsive again today. I had it booted into GUI mode with peripherals plugged in at the tower, and the GUI was totally frozen: not just inaccessible through LAN, but frozen at the tower itself. I rebooted, went into BIOS, and noticed that none of my BIOS settings had saved. After flashing BIOS I'm sure I reconfigured, but that may have been before I then updated BIOS, and I'm not sure if the update essentially flashed it back to default values. So I'm once again making the BIOS changes and letting it run another 24-48 hours or until it crashes. I re-enabled syslog mirroring to flash, and if it fails this time I will try reseating the CPU, and the RAM while I'm at it.

 

I'm still so lost, though. I haven't the faintest clue what is causing this. Is it possible the mobo is bad? The BIOS changes being unsaved is suspicious to me, but because of the recent BIOS shenanigans I can't confirm whether it really failed to save the changes. I've never received a bad mobo, so I'm not sure what the telltale signs are of a faulty one that seemingly works fine for up to 48 hours before going ghost on you.

Edited by ANJ_
Link to comment
16 hours ago, ANJ_ said:

Alright, so still at the drawing board here. Server went unresponsive again today....

Usually with a bad motherboard you can look at it and see that it's bad: a blown capacitor or the like. Are you sure your network settings are correct? I have had a similar experience to yours because the network settings were not correct. I don't just mean the Unraid network settings either, I also mean the settings on your router and Docker containers. If you try troubleshooting this route, I would recommend not letting your Docker containers run until you know the rest is working properly, then check, test, and launch your containers individually.

Link to comment
  • Solution
On 1/26/2023 at 10:16 AM, philliphartmanjr said:

Usually on a bad motherboard you can visually look at it and see it's bad....

 

I couldn't say with 100% confidence that my networking is configured entirely professionally, but only because I'm not super savvy on networking. Everything does seem to be working fine currently, and maybe I'm speaking too soon, but I think my issue is resolved. I'm about to hit 3 days of uptime on the server without any issues. It's been getting used quite frequently this whole time, and typically by now, or within 24-48 hours, the server would have crashed and I would have needed to hard reboot.

 

I think what happened is this: Initially I missed a C state setting in BIOS (C1E) which needed to be disabled. The Gigabyte BIOS on my board unfortunately doesn't have a global C-state toggle, and I happened to miss the C1E setting. I toggled it off, but then thought it hadn't resolved anything, because that was the same time I had flashed BIOS and the LAN drivers seemed to fail. Then I updated BIOS, not realizing that essentially flashed BIOS again, and booted with all C states on, of course with no success, which made me think disabling C1E didn't change anything. Once I realized that BIOS was back at defaults and I had never actually booted with C1E disabled, I reconfigured BIOS once again, this time with ALL C states off (as far as I can tell), and now the server has been up for 3 days with no issues.
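Because a BIOS update can silently restore defaults, it can help to check from the running system which idle states the kernel exposes and whether they are actually being entered. The following is a hedged sketch using the Linux cpuidle sysfs files (`name`, `disable`, `usage` under `/sys/devices/system/cpu/cpu0/cpuidle/stateN`). The function names are hypothetical, and what the kernel lists depends on the idle driver, so it may not mirror the BIOS menu exactly.

```python
import glob
import os


def read_text(path):
    """Return the stripped contents of a sysfs file."""
    with open(path, encoding="utf-8") as fh:
        return fh.read().strip()


def list_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return one dict per idle state the kernel exposes for this CPU."""
    state_dirs = glob.glob(os.path.join(cpuidle_dir, "state[0-9]*"))
    # Sort numerically so that state10 comes after state2.
    state_dirs.sort(key=lambda p: int(os.path.basename(p)[len("state"):]))
    states = []
    for state_dir in state_dirs:
        states.append({
            "name": read_text(os.path.join(state_dir, "name")),
            "disabled": read_text(os.path.join(state_dir, "disable")) != "0",
            "usage": int(read_text(os.path.join(state_dir, "usage"))),
        })
    return states
```

Calling `list_cstates()` shortly after boot and again a few hours later shows which `usage` counters are climbing. A deep state whose counter keeps rising after it was supposedly turned off suggests the BIOS change did not stick.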

 

I'll give it a week before I call it good but I feel confident right now that this was the issue and everything is now configured as it should be.

Link to comment
On 1/28/2023 at 7:22 PM, ANJ_ said:

I couldn't say with 100% confidence that my networking is configured entirely professionally....

Good to hear. Having C states on could also cause the same problems.

Link to comment
