
(SOLVED) Random Restart After Chassis Swap


Go to solution Solved by Squeeth,


I recently upgraded my hardware and swapped the case I was using. Before the swap I'd have random GUI and web UI hangs where the server would become unresponsive, but since swapping the chassis I've been getting random restarts.

My current config is as follows:
ASUS WS X299 Sage
i9-7920X
32GB Crucial 2666(?)
Nvidia GT 1030
Nvidia GTX 980TI
Adaptec Raid Card (don't remember the model)
8x 2.5" SSDs in hotswap bays to Intel backplanes, mini SAS to SATA breakout cables to SATA ports on board for cache.
Then I have that system connected to a JBOD with an HBA in it by a mini SAS 8088 cable.

I have attached the most recent diagnostics zip and took a quick glance at the logs. After browsing the forums a bit I have a rough idea of what to look for, but nothing is jumping out at me... besides the odd date jump from December to July.

This is my first post, so I'm sorry if I'm missing any information that may be useful, or if my hardware config is confusing. I greatly appreciate any and all help that gets posted. Many thanks in advance! :)

ethan-unraid-diagnostics-20210730-0125.zip

Edited by Squeeth
Link to comment
15 hours ago, JorgeB said:

Enable syslog mirror to flash then post that log after a crash.

Just had my server reboot again. This time I was running mover to move my domain share for my VM from one pool to the array, in order to then move it to a different pool. I saw the array disks being written to, and then after ~25-30 seconds the web UI hung and I heard the POST beep.

I see a hardware error for the CPU at the top of the log; however, I'm not going to pretend I know whether that means anything.
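For anyone wanting to pull those lines out of a large syslog, here is a minimal sketch. It assumes the kernel tags machine-check reports with `[Hardware Error]`, `mce:` or "Machine check", which is the usual Linux convention; the attached log itself isn't reproduced here, so the patterns and the helper name `find_hardware_errors` are illustrative assumptions and may need adjusting.

```python
import re

# Assumed patterns: Linux machine-check reports are normally tagged like this.
# Adjust if the hardware error in your syslog is worded differently.
HW_PATTERN = re.compile(r"\[Hardware Error\]|\bmce:|Machine check", re.IGNORECASE)


def find_hardware_errors(lines):
    """Return (line_number, text) pairs for lines that look like hardware error reports."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if HW_PATTERN.search(line)
    ]
```

To use it, open the syslog file and pass the file object in, e.g. `find_hardware_errors(open("syslog", errors="replace"))`, then print the pairs it returns.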

Hopefully this log helps to provide some information. Thanks!

syslog (1)

Link to comment

Any idea what is spamming syslog with these? What is IP 192.168.2.117?

Jul 30 18:53:30 ETHAN-UNRAID nginx: 2021/07/30 18:53:30 [error] 4797#4797: *198 limiting requests, excess: 20.971 by zone "authlimit", client: 192.168.2.117, server: , request: "PROPFIND /login HTTP/1.1", host: "ethan-unraid"
Jul 30 18:53:30 ETHAN-UNRAID nginx: 2021/07/30 18:53:30 [error] 4797#4797: *200 limiting requests, excess: 20.940 by zone "authlimit", client: 192.168.2.117, server: , request: "PROPFIND /login HTTP/1.1", host: "ethan-unraid"
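A quick way to see which client is behind this kind of spam is to tally the "limiting requests" lines per client IP and zone. This is only a sketch: the regular expression is written against the two lines quoted above, and the function name `count_rate_limited` is made up for illustration.

```python
import re
from collections import Counter

# Written against the nginx "limiting requests" lines quoted above.
LIMIT_RE = re.compile(
    r'limiting requests, excess: [\d.]+ by zone "(?P<zone>[^"]+)", '
    r"client: (?P<client>[^,\s]+)"
)


def count_rate_limited(lines):
    """Count rate-limited requests per (client IP, zone) pair."""
    counts = Counter()
    for line in lines:
        match = LIMIT_RE.search(line)
        if match:
            counts[(match.group("client"), match.group("zone"))] += 1
    return counts
```

Feeding it the whole syslog and calling `.most_common()` on the result lists the noisiest clients first.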

 

Link to comment
6 minutes ago, trurl said:

Any idea what is spamming syslog with these? What is IP 192.168.2.117?



Jul 30 18:53:30 ETHAN-UNRAID nginx: 2021/07/30 18:53:30 [error] 4797#4797: *198 limiting requests, excess: 20.971 by zone "authlimit", client: 192.168.2.117, server: , request: "PROPFIND /login HTTP/1.1", host: "ethan-unraid"
Jul 30 18:53:30 ETHAN-UNRAID nginx: 2021/07/30 18:53:30 [error] 4797#4797: *200 limiting requests, excess: 20.940 by zone "authlimit", client: 192.168.2.117, server: , request: "PROPFIND /login HTTP/1.1", host: "ethan-unraid"

 

After looking at my firewall, it appears to be my roommate's desktop.

Edit: I might have a "lightbulb" moment here. I'm pretty sure he's working with music production projects that live on the array. I'm sure that probably isn't helping anything (also not sure if it's hurting besides writing to parity too) but would it be a wise idea to have him start moving things to his computer? I think we set up most of his stuff on the array due to him only having a 120GB drive at the time, but he now has a 1TB drive in his system.

Edited by Squeeth
Link to comment

Shouldn't really cause crash though. Shouldn't really be spamming syslog either so don't know what his computer is doing to make that happen.

 

He would probably get better performance if he were working with his own storage though, and maybe just using your server for archive or backup.

 

As for crashes, have you done memtest?

Link to comment
16 minutes ago, trurl said:

Shouldn't really cause crash though. Shouldn't really be spamming syslog either so don't know what his computer is doing to make that happen.

 

He would probably get better performance if he were working with his own storage though, and maybe just using your server for archive or backup.

 

As for crashes, have you done memtest?

I did run MemTest on the board and memory prior to installing into the chassis, two successful passes with XMP enabled. Didn't let it run all four passes as I was leaving work and had it on my bench there, but kind of assumed that two successful passes were good enough.

Also tried searching the syslog for any memory related errors and didn't come across any that jumped out at me.

Link to comment
7 hours ago, Squeeth said:

I did run MemTest on the board and memory prior to installing into the chassis, two successful passes with XMP enabled.

 

Have you tried running without XMP enabled to see if that improves stability?  XMP often seems to cause unpredictable issues despite the memtest passing.

Link to comment
7 hours ago, JorgeB said:

Nothing being logged before the crash suggests a hardware problem.

That was a hunch I had but I'm glad someone else has the same hunch.
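To check the "nothing logged before the crash" observation against a mirrored syslog that spans several reboots, a rough sketch like the one below can print the last few lines written before each boot. It assumes that each boot starts with a kernel line containing "Linux version" (the usual first kernel message); the marker, the function name and the default of 20 lines are illustrative choices, not anything taken from the attached logs.

```python
from collections import deque

# Assumption: the first kernel message of every boot contains this text.
BOOT_MARKER = "Linux version"


def lines_before_each_boot(lines, context=20):
    """Return, for every boot after the first, the last `context` lines logged before it."""
    recent = deque(maxlen=context)
    results = []
    for line in lines:
        if BOOT_MARKER in line and recent:
            # Everything still in the buffer was written just before this boot.
            results.append(list(recent))
            recent.clear()
        recent.append(line.rstrip("\n"))
    return results
```

If the lines before an unexpected boot are all routine (no call traces, no out-of-memory kills, no disk errors), that is consistent with a hardware or power problem rather than something the OS had time to log.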

 

6 hours ago, itimpi said:

 

Have you tried running without XMP enabled to see if that improves stability?  XMP often seems to cause unpredictable issues despite the memtest passing.

I haven't tried with XMP disabled. I'll give that a go today and report back. I'll also try running mover again as a test to see if it restarts, since that triggered a restart last time.

Link to comment

I didn't check and/or notice any issues with the server today, until my roommate messaged me while I was out for the day about it being down. I got home quite a few hours later and the GUI was unresponsive, and it was not accessible via the web UI.

I did disable XMP yesterday, and I can't tell from the syslog if it was doing anything during its time of becoming unresponsive. I also don't know if he messaged me as soon as it happened or not.

I looked through the syslog and nothing seems like a glaring issue besides the obvious errors on one of the hard drives in the array...however I don't think that would cause the entire system to become unresponsive. So at this point I can only assume that it is in fact some sort of hardware error.

Hopefully there's something in the syslog I'm overlooking given my very limited knowledge, however I'm sadly thinking I'm not and it's time to start eliminating hardware in the system one at a time to find the possible culprit.


Thanks again in advance for any help, as well as all the help that's been provided thus far!

syslog (2)

Link to comment
19 hours ago, Tristankin said:

I have had to roll back to 6.8.3 due to freezes on 6.9.x. Before you buy anything new give that a go too.

I'll have to give that a go at some point too if this next thing I try doesn't solve my issues.

I do remember a friend of mine once saying that I shouldn't be using the USB 3.0 port on my old board for my Unraid flash drive (that's a USB 3.0 drive), and that I should be using a USB 2.0 port. I don't remember what I plugged the flash drive into on this new config, but I'll be checking that when I get home and switching it from 3.0 to 2.0 if that's the case to see.

Link to comment

The web UI, accessed via a VPN configured through a separate firewall (Untangle installed on Sophos UPM320), was responsive today. However, the dashboard was blank, showing nothing besides the top tabs (Dashboard, Main, Shares, etc.), and I was unable to run commands through the terminal. htop returned no information, and powerdown -r was also unresponsive.

The Main tab showed errors on every disk (16 disks in the array), which has never happened before except for <200 errors on disk 1 at times. The array options at the bottom of the page were also missing, so there was no ability to stop, check, spin up or spin down as would normally be present.

Nobody was home today to even be putting any activity on the server either, so this appears to have been just a random occurrence.

The Unraid flash drive was in a USB 2.0 port. I also disabled Intel C-States in the BIOS and, in the Tips and Tweaks plugin, set CPU Scaling Governor to Power Save and Enable Intel Turbo/AMD Performance Boost to No.
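To confirm that a governor change actually took effect, the active governor can be read per core from sysfs. The sketch below assumes the standard Linux cpufreq layout under `/sys/devices/system/cpu`, and that the plugin's "Power Save" option corresponds to the kernel's `powersave` governor; the function name `read_governors` is just for illustration.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU directory name (cpu0, cpu1, ...) to its active scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors
```

Calling `set(read_governors().values())` on the server should give a single value if every core is using the same governor.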

I'm grasping at straws here before downgrading Unraid OS to a previous version as mentioned above, and have attached today's syslog file for good measure in case anyone feels like taking a peek.


As per usual, I will be reporting back with any good or (probably) bad news regarding stability. And as always, thanks again for all the help so far! :)

 

syslog

Edited by Squeeth
I'm scatter brained and missed some information
Link to comment

Just wanted to provide an update. Since making the changes from my previous post in the BIOS and the Tips and Tweaks plugin, the server has been stable for almost 48 hours and has had zero hiccups or issues along the way!

I am going to try reverting the changes I made in the BIOS and the plugin just to see what happens. One thing I should mention is that the case swap involved redoing the front panel connections, since the chassis had Intel's proprietary front panel connector for their server boards and its pinout didn't match consumer boards. When I first tried booting the system it wouldn't even POST, even after I unplugged and removed all components except for the CPU and one memory module... what solved it was removing the HDD LED connection and the Reset button connection. At that point it would POST, and I thought I had success until it turned out to be unstable. However, alongside this near-48-hour stretch of stability, I have also removed the Power LED connection to the front panel... not sure why that would cause instability (if any), but it is now very much a possible suspect.

On top of the stability over the past 48 hours, I should add that last night I copied ~650GB of data from the array to my roommate's local drive, so that his working on projects directly from the array could be ruled out as a suspect. During that process I was also able to run mover twice, install and configure multiple docker containers, and set up a Mac OSX and a Windows 11 VM while the file transfer was running. I'd say that if I still had any random instability issues... they'd have reared their ugly head last night when I was hammering the system 😂

I'll re-enable the settings and report back with updates. I'm sure you guys are probably tired of seeing me update this thread, but I just figured I'd keep this updated along the way until resolved :) 

Thanks once again for all the input everyone!

Edited by Squeeth
Link to comment
  • Solution

Update

Server has been stable for 30 hours with the following changes made:
XMP : Enabled
C-State : Enabled
Enable Intel Turbo/AMD Performance Boost? : Yes
CPU Scaling Governor : Power Save

I have since been able to install and configure Windows 98, 2000, 7, and Server 2019 Standard VMs. I also configured WDS in Server 2019 to provide PXE boot on my network for Macrium Reflect recovery and Windows 7, 10, and 11 installation. I confirmed it was working by PXE booting a new blank VM, which was able to boot to Macrium and to Windows Install from WDS, and then installing Windows from the Server VM to the new VM after injecting the VirtIO drivers into the appropriate WIMs.

I'd like to think all is stable now! :)

From here I know if it becomes unstable then I'll disable XMP first and then go from there....but I do believe for some strange reason the front panel connections I re-did were the culprits...

Anyways, thanks again for all the help everyone! I do believe we can call this thread closed! :D 

Link to comment
  • Squeeth changed the title to (SOLVED) Random Restart After Chassis Swap
