Unraid Crashed twice in 13 hours


Shu
Solved by JorgeB


Hello everyone,

 

I am once again asking for your help and wisdom. I woke up this morning to an unresponsive Unraid server (which needed an unclean reboot) and then came back over lunch to another crash.

 

Here are my diags (recovered post-crash, unfortunately) and a copy of my persistent logs (written to a share, not flash) starting from two days ago. 

 

520unraid-diagnostics-20220913-1601.zip

syslog-192.168.1.3.log
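
For anyone wanting to narrow down when a crash happened from a persistent syslog like the one above, here is a minimal Python sketch that looks for long silent gaps between consecutive log lines. It assumes the classic syslog prefix (`Sep 13 07:45:01 ...`), which may not match how your log is written; the function name `find_gaps`, the 10-minute threshold, and the sample lines are all made up for illustration. A quiet server can also go silent without crashing, so treat any gap as a candidate only.

```python
from datetime import datetime, timedelta

def find_gaps(lines, min_gap=timedelta(minutes=10), year=2022):
    """Return (last_seen, next_seen) timestamp pairs separated by more than min_gap."""
    gaps = []
    prev = None
    for line in lines:
        # Assumed prefix: "Sep 13 07:45:01" (15 characters, year not logged).
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps

# Made-up sample lines, not taken from the attached log.
sample = [
    "Sep 13 03:10:00 520unraid kernel: example line",
    "Sep 13 03:12:30 520unraid kernel: example line",
    "Sep 13 07:41:05 520unraid kernel: example boot line",
]
for last_seen, next_seen in find_gaps(sample):
    print(f"silent from {last_seen} until {next_seen}")
```

The last line before a gap is usually the most interesting one to read, since it is the closest thing to "what was happening right before the hang".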

 

I've dealt with seemingly random crashes since I started with unraid last month. Here is what I've done since then to narrow down the problem:

 

Updated the LSI HBA to the latest firmware

Changed from 1-to-4 SATA adapters to Y-splitters

Cut down the number of drives from 10 to 8

Ran two 4-hour memtests (both passed)

Upgraded fans to run the drives cooler (high 30s to low 30s now)

 

Any other thoughts? Maybe my GPU? It's an old FirePro V4800. I used to run Windows 24/7 with no crashes (but I did have file corruption a few times, hence my switch to Unraid).

 

One new thing about today's crashes is that my qbit settings seem to have been reset to default: the WebUI username and password as well as the network limits. Those were set in the WebUI settings, though, as opposed to being hardcoded in the config file.

 

Thanks!

 

Edit1: Not sure if this is a coincidence or not, but it seems that the crashes happened during the import of a newly-"acquired" 4-season 1080p show. Sonarr was importing from my downloads folder to my media folder but it seems to have stopped after season 1. I do have Plex configured to analyze new media as it's imported, so maybe my server was hit too hard at once? I would hope my Ryzen 5 3600 and 16GB of RAM could handle it, though....

 

Edit2: Actually, I think Sonarr was only monitoring the first season and didn't automatically import the other seasons. That's my fault.

 

Edited by Shu
13 minutes ago, trurl said:

 

 

I am running an XMP profile on my RAM; perhaps I'll change that (though my setup is 2 of 4 DIMM slots filled on a 3rd-gen Ryzen and my RAM is running at 3000MHz).

 

I'll also look for any C-state settings I can change, and I think I need to do a BIOS update anyway.

Edited by Shu

Okay, I just disabled XMP (now running at 2133MHz, down from 3000MHz), disabled C-states (from Auto), and set my Power Supply Idle Control to Typical Current Idle (from Auto).

 

I also swapped the PCIe location of my LSI HBA (9200-8i) with my GPU: the HBA went from the second GPU slot, which ran at x4, to the top GPU slot running at x16. No idea if that will change performance but I figured it couldn't hurt.

 

Hopefully this fixes things but I'll report back if it doesn't...


Let's make it "crashed 3 times in 24 hours"

 

Here are the diags and syslog again (syslog starting from the 12th)

syslog-192.168.1.3 (1).log

520unraid-diagnostics-20220914-0700.zip

 

This time, it seems to have crashed while my appdata backup plugin was running. I'm really not sure what else to try at this point, unfortunately. @trurl, any thoughts for this struggling newcomer? Appreciate it in advance.

 

Thanks

35 minutes ago, trurl said:

 

 

I have one container set with a static IP using br0 (speedtest-tracker, set to run twice a day at 6am and 8pm); my other containers are either host or bridge, with a number of them being run through my qbit-vpn container. I will stop the br0 container and see if the issue continues; if it does, I'll try the linked solution.

 

Thanks, @trurl


Hadn't crashed since!...until today :(

 

I initiated a clean shutdown around 1pm to install a UPS (I was worried that unclean power might be causing the crashes, as my server is in the same room/circuit as my laundry and AC/heating unit), though I guess now I know that wasn't the case...

 

 

But when I attempted to log in to my server around 11pm, I couldn't. I tried a ping and it came back saying the host was down. The computer was still on (fans spinning and UPS reporting 144W being used, slightly less than when the server was available). I used the reset button on my case and am running a parity check now.

 

Here are the diags: 

520unraid-diagnostics-20220921-2312.zip

 

Worth mentioning a couple of things:
I hadn't switched to ipvlan yet. Instead, I just haven't started/used the container that was in br0 mode (speedtest-tracker) since the last post (the 14th).

I did get a boot drive failure but I was able to swap the USB drive fairly easily. I don't think this is related but who knows. This was about a week ago, on the 15th.

Edited by Shu
13 minutes ago, JorgeB said:

Nothing relevant that I can see, which usually suggests a hardware problem. One more thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

I can do that. If it ends up being hardware, are there specific parts that are more likely to be at fault (CPU, mobo, RAM, etc.)? Almost everything was bought new except for the RAM and GPU.

 

Either way, I'll report back in a few days. 

  • 2 weeks later...
On 9/22/2022 at 9:07 AM, JorgeB said:

I would try running with just one DIMM at a time, that would basically rule that out if it keeps crashing with either one, then next suspect would be the board/CPU.

 

I swapped my two 8gb DIMMs (16gb total) with my main PC for 4 8gb DIMMS (32gb total). I know 4 DIMMs is more variables than 2 but I figured two kits being faulty is unlikely. I went almost 7 days strong until a crash today while watching a Plex movie. 

 

Attached diags and syslog (syslog from past few days).

 

I just swapped in the GPU from my PC too, to test that; it's an RX 580 8GB. We'll see...

520unraid-diagnostics-20221005-1932.zip

syslog-192.168.1.3 (4).log

4 hours ago, Shu said:

 

I swapped my two 8gb DIMMs (16gb total) with my main PC for 4 8gb DIMMS (32gb total).

Your DIMMs could be fine but your memory controller might still not like them when running in pairs.

The sure way to eliminate memory problems is to Memtest each DIMM separately, one by one.

Use a different slot for each consecutive test.

If finished without errors, Memtest them again but now in pairs.

4-5 passes of Memtest is usually enough to uncover problems.
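
The order of tests described above can be written down as a small checklist generator. This is only a sketch in Python: the function name `memtest_plan` and the DIMM/slot labels are hypothetical, and reading "in pairs" as every possible pair of DIMMs is an assumption (testing just the matched kit pairs may be what was meant).

```python
from itertools import combinations

def memtest_plan(dimms, slots):
    """List single-DIMM runs first (rotating through slots), then every DIMM pair."""
    singles = [
        f"{dimm} alone in slot {slots[i % len(slots)]}"
        for i, dimm in enumerate(dimms)
    ]
    # Assumption: "in pairs" means all two-DIMM combinations.
    pairs = [f"{a} + {b} together" for a, b in combinations(dimms, 2)]
    return singles + pairs

# Hypothetical labels for a 4-DIMM, 4-slot board.
plan = memtest_plan(["DIMM1", "DIMM2", "DIMM3", "DIMM4"], ["A1", "A2", "B1", "B2"])
for step, description in enumerate(plan, start=1):
    print(f"{step}. {description} (4-5 passes)")
```

With four DIMMs this comes to 4 single runs plus 6 pair runs, so it is a multi-day exercise at 4-5 passes each.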

Edited by Lolight
  • 2 weeks later...
On 10/5/2022 at 11:22 PM, Lolight said:

Your DIMMs could be fine but your memory controller might still not like them when running in pairs.

The sure way to eliminate memory problems is to Memtest each DIMM separately, one by one.

Use a different slot for each consecutive test.

If finished without errors, Memtest them again but now in pairs.

4-5 passes of Memtest is usually enough to uncover problems.

It's been a while but, folks, it's happened again. This time, though, I logged in during the crash and was able to get some good insights.

 

I logged in around 7:45am and noticed my CPU was almost pegged at 100%. I checked my dockers but nothing seemed to be using it. I checked htop, but in there htop itself had the highest CPU usage at 30% (I think; I'm not 100% sure how to read htop).
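
One possible reading of "CPU near 100% but no process using it" is that the time is being spent waiting on I/O rather than running anything; nothing in this thread confirms that, so treat it as a guess. A rough way to check is to compare two samples of the aggregate `cpu` line from `/proc/stat` and see where the time went. The sketch below is Python, the function name `cpu_breakdown` is made up, and the two sample lines are invented numbers rather than readings from this server.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def cpu_breakdown(before, after):
    """Percentage of CPU time per state between two 'cpu ...' lines."""
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas) or 1  # avoid dividing by zero on identical samples
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}

# Invented samples; on a live system, read the first line of /proc/stat twice,
# a few seconds apart.
before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
after = "cpu  1100 0 550 8050 1300 0 0 0 0 0"
result = cpu_breakdown(before, after)
for name, pct in result.items():
    print(f"{name:8s} {pct:5.1f}%")
```

If most of the time lands in `iowait`, the place to look is storage (disks, controller, cabling) rather than any one container.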

 

One odd thing I noticed was that my ram utilization was reporting 0gb. Here are a few screenshots.

 

647783273_ScreenShot2022-10-15at07_46_58.thumb.png.72a26a3522aa6babf1b38c41b384a989.png

886515162_ScreenShot2022-10-15at07_46_42.thumb.png.399d6128cdd7b55b8c7340900a902996.png

 

I checked the syslog and this is what I see: 

1283806282_ScreenShot2022-10-15at07_48_47.thumb.png.8359dfb88c6cb30507ff17610a5c5773.png

795249892_ScreenShot2022-10-15at07_48_53.thumb.png.a6b86a182196f447d36031d6e0edbb8e.png

At this point, I rushed to download the diags in hopes of capturing them before the server went kaput. However, the download seemed to hang at this command:

sed -ri 's/^(share(Comment|ReadList|WriteList)=")[^"]+/\1.../' '/520unraid-diagnostics-20221015-0927/shares/5--------d.cfg' 2>/dev/null

The dashes stand in for the name of the share (same name as the server, 520unraid).
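
For context on what that step does: the substitution appears to blank out the values of a few share settings so they are not included in the diagnostics. Here is a Python sketch of the same substitution, assuming the first capture group closes right after the opening quote. The function name `mask_share_fields` and the sample config text are made up; this is not the diagnostics script itself.

```python
import re

# Same pattern as the sed expression, with the group assumed to end after =".
MASK = re.compile(r'^(share(Comment|ReadList|WriteList)=")[^"]+', re.MULTILINE)

def mask_share_fields(cfg_text):
    """Replace the quoted value of the matched share settings with '...'."""
    return MASK.sub(r"\1...", cfg_text)

# Invented sample, not the contents of a real share .cfg file.
sample_cfg = 'shareComment="Family photos"\notherSetting="untouched"\n'
masked = mask_share_fields(sample_cfg)
print(masked)
```

The edit itself is trivial, so a hang at this point more likely means the file could not be read or written than that the command was slow.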

I figured the server probably crashed so I closed the tab and tried accessing the gui again but it never loaded. Same issue as always.

 

After a quick shower, I went downstairs to hard reset the computer as normal, but I checked my UPS and saw a slightly-lower-than-normal draw, nowhere close to that of a powered-off server (normal: 170W, current: 150W, off: 0W).

 

I check the gui on my phone and I can log in! I check on my laptop...but no login (both wifi). I tried both of my phones and they worked (Pixel 6a Chrome and iPhone X Brave).

I go to my phone and try to download the diags but it's stuck on the same command for the past 25 minutes or so (I don't think it's going to download).

To prevent an unclean shutdown, I tried to "Stop" the array from my phone. Unsuccessful

 

My next attempt was to try to login from my partner's laptop (have never logged in from there before) and it worked! Using chrome (I checked my laptop again but I couldn't access the webgui login page)

I tried to "Stop" the array from my partner's laptop but nothing happened. 

 

I plugged a monitor into my server and the monitor recognized the output but the screen was blank/black (I believe I last logged in with no gui, just CLI)

 

I tried a shutdown from my partner's laptop but no luck.

So I go nuclear: I push the power button on my server and wha'do'u'know, the monitor comes to life. I see the syslog, but a more current version (with the logs from when I tried stopping my dockers). I checked the syslog on my partner's laptop again but it didn't have those entries. I wonder if it's accessing an old "snapshot" of the gui?

 

On the monitor, I see the command from the power button ("90 seconds for graceful shutdown"...90 seconds later... "Forcing shutdown")

5 minutes in and it's still on forcing shutdown.

 

15 minutes later, I come back and hear the fans still on, but the monitor doesn't see any output (not the same as the previous blank output).

I try my laptop, no webgui access. I try my partner's laptop and I can still access the webgui. I notice the uptime is still counting up, so maybe not a snapshot? I try to click shutdown once more and, to my surprise, I hear the fans stop and my server has shut down (gracefully or not, I'm unsure).

 

Before I start it up again, I'm going to try going from 4 sticks to 2 and I'm going to put them in the "B" slots instead of the "A" ones. 

 


Just an update: the server was showing the same issues as last time again today. I tried to get diags but they got stuck at the same spot. Went nuclear and swapped the mobo/CPU with my main PC.

 

Old: MSI X570-A PRO & AMD Ryzen 5 3600

New: Asus B550-F & AMD Ryzen 5 5600X

Put back 4x8gb of ram. 

 

We'll see (posting this as more of a status update for future me)

  • 4 weeks later...

Providing a (hopefully) final update: since swapping the CPU/mobo, I haven't had issues with unexpected shutdowns. Perhaps two days after the swap, I did wake up to a whole bunch of nginx errors (my log was full of them, like 5 or 8 thousand entries). But no other issues since then, thankfully. I'd be at 24+ days of uptime if it weren't for needing to restart to update the OS.

The old CPU/mobo is currently in my main PC and has been running Windows with no problems (currently 2 days of uptime).

 

I appreciate all the help many of you have provided in getting this issue fixed.
