[Resolved] UnRaid Server randomly power cycling


Go to solution Solved by nexusjosh,

A few days ago I was tinkering with my server and upgraded it with an ASUS Hyper M.2 X16 PCIe 3.0 X4 Expansion Card V2, retiring the single NVMe riser that my cache was on. I also upgraded my cache drive from an old 500GB SK Hynix to a Samsung Evo Pro 1TB (retired from my gaming PC).

 

I originally left the SK Hynix SSD in the Hyper, but I have been having strange issues with it for a while. When it was connected as cache it ran with no problems, except that whenever I checked its S.M.A.R.T. data it would hang and cause a shutdown.

 

After I moved it to being my Docker storage device, my server started randomly power cycling, and it is still doing it. At this time, I have absolutely zero idea why. It's happening at random, after anywhere between 6 and about 12 hours of uptime.

 

Can one of you masters take a peek at my log and potentially suss out why the system is randomly doing an unclean power cycle?

 

Thanks guys!  Hardware specs are in my sig.  Running the latest stable release of UnRaid

syslog.txt

Edited by nexusjosh
Link to comment
43 minutes ago, JorgeB said:

If you mean it's rebooting on its own this is usually hardware related and unlikely to leave a trace in the log, still you can enable the syslog server and post that together with the full diagnostics after it does it.

Cool, I didn't know about Syslog Server, thanks.  I've enabled it.  My SuperO Motherboard also has IPMI, so I'm going to monitor the IPMI's event log, and perhaps that might shed some light?
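
Since a hard power cycle usually leaves nothing in the log, one complementary trick is a heartbeat file: a timestamp appended to persistent storage at a fixed interval, so the last entry brackets the time of the crash. The following is only a minimal sketch. It assumes Python is available on the box (a stock Unraid install may not include it), and both the function name and the example path are hypothetical.

```python
import os
import time
from datetime import datetime


def write_heartbeat(path):
    """Append the current local time to `path` and force it onto the device."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # do not let the entry sit in the page cache
    return stamp


def run_forever(path, interval_s=300):
    """Write one heartbeat every `interval_s` seconds until interrupted."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_s)


# Example call (hypothetical path on the flash drive):
# run_forever("/boot/logs/heartbeat.log", interval_s=300)
```

After an unclean restart, the last line of the file gives the latest time the system was known to be alive, to within one interval. A longer interval keeps wear on the flash drive low.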

Link to comment
12 hours ago, JorgeB said:

Maybe, it's worth looking at.

Yeah, I appear to now be getting a cascade of errors. It ran fine all day yesterday, but this morning I powered on my system, and Disk 6 had been taken offline. Apparently the CMC Error count had gone high for some reason, so the OS shut down the drive. I powered off my server, re-seated the SATA power and data cables, and turned the server back on. Now it loads the OS and freezes when I try to log in to the web interface. I'm continuing to troubleshoot.

Link to comment
21 hours ago, itimpi said:

According to your screen shot you have so far only set up the server to receive syslog entries from other machines.

 

You have two options:

  • set the IP address of your server into the remote syslog server field so your server logs to itself.
  • set the option to mirror the syslog to flash

I set it to localhost, and it appears to be working. The server was all good yesterday; I went to sleep and woke up to the OS hanging on boot. It must have powered down again last night. Ugh. I reset the system again and got it to boot. It appears that the OS crashed around 3AM and hung in the boot process around 8, which is when I reset the system.

 

Here are the lines that pop up in the log. 

 

Quote

Feb 26 00:00:04 SwiftServer root: /etc/libvirt: 23.7 MiB (24899584 bytes) trimmed on /dev/loop3
Feb 26 00:00:04 SwiftServer root: /var/lib/docker: 4.6 GiB (4939444224 bytes) trimmed on /dev/loop2
Feb 26 00:00:04 SwiftServer root: /mnt/disks/VM Drive: 432.7 GiB (464568934400 bytes) trimmed on /dev/sde1
Feb 26 00:00:04 SwiftServer root: /mnt/cache: 1 GiB (1084268544 bytes) trimmed on /dev/nvme0n1p1
Feb 26 00:20:01 SwiftServer sSMTP[16441]: Creating SSL connection to host
Feb 26 00:20:01 SwiftServer sSMTP[16441]: SSL connection using TLS_AES_256_GCM_SHA384
Feb 26 00:20:01 SwiftServer sSMTP[16441]: Authorization failed (535 5.7.8  https://support.google.com/mail/?p=BadCredentials l5-20020a056a0016c500b004f140564a00sm5997242pfc.203 - gsmtp)
Feb 26 03:00:07 SwiftServer crond[2546]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Feb 26 08:00:04 SwiftServer root: /etc/libvirt: 23.7 MiB (24899584 bytes) trimmed on /dev/loop3
Feb 26 08:00:04 SwiftServer root: /var/lib/docker: 4.6 GiB (4939444224 bytes) trimmed on /dev/loop2
Feb 26 08:00:04 SwiftServer root: /mnt/disks/VM Drive: 432.7 GiB (464568934400 bytes) trimmed on /dev/sde1
Feb 26 08:00:04 SwiftServer root: /mnt/cache: 1 GiB (1084571648 bytes) trimmed on /dev/nvme0n1p1
Feb 26 09:41:19 SwiftServer dhcpcd[2450]: br0: failed to renew DHCP, rebinding
Feb 26 10:21:25 SwiftServer kernel: md: sync done. time=77904sec
Feb 26 10:21:25 SwiftServer kernel: md: recovery thread: exit status: 0

 

 

Thank god the Parity Rebuild for the durrped Disk 6 completed before it crashed!
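
One rough way to narrow down when a crash happened is to scan the syslog for long silences between consecutive entries. The sketch below is an illustration only: the function names are made up, the year is passed in because syslog timestamps omit it (2022 here, matching the date in the diagnostics file name), and the sample lines are taken from the excerpt above. A quiet stretch is a hint rather than proof, since an idle server may simply log nothing between scheduled jobs.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, min_gap, year):
    """Return (before, after) timestamp pairs separated by at least min_gap."""
    gaps = []
    prev = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and stamp - prev >= min_gap:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps


sample = [
    "Feb 26 00:20:01 SwiftServer sSMTP[16441]: Creating SSL connection to host",
    "Feb 26 03:00:07 SwiftServer crond[2546]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null",
    "Feb 26 08:00:04 SwiftServer root: /etc/libvirt: 23.7 MiB (24899584 bytes) trimmed on /dev/loop3",
    "Feb 26 09:41:19 SwiftServer dhcpcd[2450]: br0: failed to renew DHCP, rebinding",
]

for before, after in find_gaps(sample, timedelta(hours=3), year=2022):
    print(f"silent from {before:%H:%M:%S} to {after:%H:%M:%S}")
```

With a three-hour threshold, the only window flagged in this sample is the one between the 03:00:07 and 08:00:04 entries, which lines up with the crash and boot-hang times described above.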

Edited by nexusjosh
Link to comment
1 hour ago, trurl said:

 

 

OK. Between my posting that and now, the server crashed AGAIN. SIGH. I'm attaching Syslog, which is the full log, and Syslog Crash, which is the log beginning right from the end of the last one. I'm also uploading the full diagnostics.

 

What may have brought on the instability may be enabling of bifurcation?  I just don't understand why that would cause this?! @.@

 

Unassigned is also throwing/spamming a new error. Ugh.
 

Feb 27 19:22:13 SwiftServer root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token

syslog.log syslog-Lastlog-Crash.log swiftserver-diagnostics-20220227-1921.zip

Link to comment

Not surprisingly I'm not seeing anything relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

Link to comment
52 minutes ago, JorgeB said:

Not surprisingly I'm not seeing anything relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

That's what I figured. The only things I've changed, as I said in my first post, are that I enabled bifurcation and installed the ASUS Hyper M.2 X16 PCIe 3.0 X4 Expansion Card V2. I'm going to boot the OS into safe mode and see if it stays on for 3 or 4 days. If it does not, I'll disable bifurcation again and see if it still happens. I only have 1 SSD (cache) installed in the Hyper atm.

 

Also, I had an SK Hynix M.2 I pulled out of a Dell a couple of years ago. I thought the drive was going bad, but come to find out, the SSD doesn't support SMART! What?! Crazy. So because it didn't support SMART, every time I ran a SMART check, it would freeze and crash the server. But after putting it in an NVMe enclosure and plugging it into my system, it's still at 96% health, so it's a perfectly good SSD still.

 

Edit:  If it crashes again, I'm going to set the IPMI viewer to record, and maybe get a recording of the damn thing crashing.  That ought to help.

Edited by nexusjosh
Link to comment

It seems to me like you're definitely having hardware issues here. Faulty PSU or some overheating maybe or something? No reason why it should shut anything down unless the system is being told to do so either due to user command or to prevent damage to hardware.

 

Do the basics:

 

1. Update BIOS

2. Check temperatures

3. RAM test

4. Replace any possibly faulty drive etc

5. Stress test the machine (read/write, CPU etc)

 

If I was you I'd basically put my machine through hell to see if I can get it to fail on a specific task etc. monitor temps etc.

This smells like a temperature issue to me, especially if everything was working fine before.
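
For the "check temperatures" step, one option on Linux is to read the kernel's hwmon sysfs files, where each `temp*_input` file reports millidegrees Celsius. The sketch below is an assumption-laden illustration: the function names and the 80 °C threshold are hypothetical, and which sensors appear depends on the drivers loaded for the board.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/sensor": degrees_C} for every temp*_input under root."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors return an error or nothing at all
            # Include the hwmonN directory so two chips with the same name
            # (for example one per CPU socket) do not overwrite each other.
            temps[f"{chip.name}:{chip_name}/{sensor.name}"] = millideg / 1000.0
    return temps


def over_limit(temps, limit_c):
    """Keep only the readings at or above limit_c degrees Celsius."""
    return {key: value for key, value in temps.items() if value >= limit_c}


if __name__ == "__main__":
    readings = read_hwmon_temps()
    for key, value in sorted(readings.items()):
        print(f"{key}: {value:.1f} C")
    print("hot:", over_limit(readings, limit_c=80.0))  # 80 C is an arbitrary example
```

Logging these readings periodically while running a stress test makes it easier to see whether a reset follows a temperature climb or happens while everything is cool.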

Link to comment
11 hours ago, plantsandbinary said:

It seems to me like you're definitely having hardware issues here. Faulty PSU or some overheating maybe or something? No reason why it should shut anything down unless the system is being told to do so either due to user command or to prevent damage to hardware.

 

Do the basics:

 

1. Update BIOS

2. Check temperatures

3. RAM test

4. Replace any possibly faulty drive etc

5. Stress test the machine (read/write, CPU etc)

 

If I was you I'd basically put my machine through hell to see if I can get it to fail on a specific task etc. monitor temps etc.

This smells like a temperature issue to me, especially if everything was working fine before.

I do suspect there is some hardware issue. It COULD be that one of the two CPU coolers needs a new application of thermal paste. 'Tis been about 4 or 5 years now, and one of the CPUs strangely runs hotter than the other, though not to the point of overheating.

image.png.4a434f715741d991bd9866be23868837.png

As of right now, the shut downs are random.  Sometimes the server stays on for 24-30 hours, before resetting.  Other times, it stays on for 2 or 3 hours, before resetting.

As it stands now, in safe mode, we're sitting at 12 hours and 57 minutes of uptime and counting.

 

 

9 hours ago, trurl said:

 

Yeah, I researched it, and found that. It's pretty much because the OS doesn't like old logins on a web browser. Though why doesn't Unraid just end the connections it doesn't like and prompt the user to log in again, instead of throwing errors? @.@

Edited by nexusjosh
Link to comment

So my server has been up for 2 days, 9 hours in safe mode without issue. I'm going to wait until tomorrow morning, when it will have been 3 days, then reboot it into regular mode and see what happens. It's worth noting that earlier, when I was having the problems, I wasn't turning on my VMs or Docker containers, just leaving it to run.

 

The Thick Plottens.

Link to comment
On 2/28/2022 at 6:34 PM, nexusjosh said:

the OS doesn't like old logins on a web browser. Though why doesn't Unraid just end the connections it doesn't like and prompt the user to log in again, instead of throwing errors?

The old login(s) can be on other computer(s). Because of the way web works, it can only respond to the browser currently asking for information.

 

also, from 6.10rc2 release notes:

Quote

Other improvements available in 6.10, which are maybe not so obvious to spot from the release notes and some of these improvements are internal and not really visible:

 

Event driven model to obtain server information and update the webGUI in real-time

  • The advantage of this model is its scalability. Multiple browsers can be opened simultaneously to the webGUI without much impact
  • In addition stale browser sessions won't create any CSRF errors anymore
  • People who keep their browser open 24/7 will find the webGUI stays responsive at all times

 

Link to comment
51 minutes ago, trurl said:

The old login(s) can be on other computer(s). Because of the way web works, it can only respond to the browser currently asking for information.

 

also, from 6.10rc2 release notes:

 

I'm guessing that's the beta version? Any idea when it'll be released as stable? If not, I'll probably move to the beta build.

Link to comment

So I decided to go ahead, update the OS, to the new beta build, and restart the server out of safe mode.  When I was setting up my IPMI recording the server crashed!  I didn't see the damn console, but upon restarting it hung, and I took a screenshot

1.jpg.c723af7ae0bd6d7d0ac769f559c04a2f.jpg

I saw it was throwing USB errors, and there are two front case USB ports.  I unplugged them, reset the server, and it came up as per usual.  I got the IPMI recording going, so we shall see if it crashes again... Or if something went bad with the front case USB ports.  I don't want to get my hopes up... so we shall see!

 

I'm going to leave the server running for another 2 or 3 days to check for stability, then do a parity check, then start my Docker containers and VMs.

 

Also, the sda1 on the screen grab is the OS flash drive?  Straaaange.

 

 

1 hour ago, ChatNoir said:

Release Candidate is more advanced than a beta.

 

Limetech do not provide release dates.

Ty

Edited by nexusjosh
Link to comment

Update:  The server stayed on for almost exactly an hour before just powering off.  I had the screen recording it.  Though after 10 minutes, the screen goes black?  I'm unsure if that is Unraid, or my SuperO Motherboard, instead of it just staying on the login.  So it went from black screen, to booting again.  Sigh.

 

I'm going to go ahead and turn off bifurcation, and see if that makes a difference.

 

But then I also question whether it's a hardware issue, because it stays booted in safe mode. And when I had it out of safe mode, I didn't have my VMs or Docker running. Maybe a plugin?

Edited by nexusjosh
Link to comment
