[Solved] unRAID 6.9.1 - System unresponsive and requires hard reset


Recommended Posts

I'm having some serious stability issues as of late. I started building an upgraded server at the beginning of the year and spent a couple of days running MemTest, testing the CPU, and checking read and write throughput of the drives before I started doing anything on the new build. I was running 6.9 rc-20 because I wanted to test out and use multiple cache pools. After my initial burn-in, I began transitioning some of my basic containers that were simple to migrate and started copying my data over from my older server, because I wanted to restructure my data storage rather than reuse the same share structures and settings (lessons learned over the previous couple of years) and give myself better management and a better experience. During this period I also upgraded to 6.9.0 when it was released, and to 6.9.1 a few days after its release.

I had parity enabled initially but decided to disable it during the file transfer to maximize direct-to-disk writes, as I was moving about 65TB of data. After the data transfer completed, I re-added the parity drives to the array and let parity rebuild. Everything seemed fine until I started getting system hangs: the GUI becomes unresponsive, and usually SSH and direct keyboard access do too (sometimes SSH and direct access will sort of work, although very slowly). The first hang occurred within a couple of hours after the rebuild completed. The forced restart triggered another parity check, and the second hang occurred before that check completed. I was getting about 1-1.5 days of continuous runtime before the system locked up.  

 

I set up my stable unRAID box (also on 6.9.1) as a remote syslog server so I could capture the logs from the crashing system (attached as crash_report.txt). I have also attached the diagnostics that I ran immediately after the hard reset.

 

I've read a ton of posts and forum topics about stability issues with Ryzen CPUs since I started having these issues. My first build was a Ryzen 7 2700X and I never had a single issue like this (all issues were user-caused, LOL). This new build is on a Ryzen 9 3950X. I have added "rcu_nocbs=0-31" to the append line in my syslinux.cfg file and thought that had resolved my issues, since I was up and stable for 3 days, but... it locked up again. With this forced restart I have now also disabled Global C-States in the BIOS, since that seems to give mixed results for Ryzen users.  
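
For reference, in case anyone else tries this tweak: the parameter just gets appended to the boot line of the default entry in /boot/syslinux/syslinux.cfg, so the entry ends up looking something like this (exact label and initrd details may differ on your flash drive):

label Unraid OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-31 initrd=/bzroot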

 

I'm hoping someone can take a look and offer some advice on any other issues they think might be causing me heartburn at the moment.

titan-diagnostics-20210331-1146.zip crash_report.txt

Edited by TexasDaddy
Marked solved
Link to comment
9 hours ago, TexasDaddy said:

If I'm not supposed to get support here, can someone point me in the right direction? There have been at least a dozen other topics opened since mine and nearly all have at least one response. 


It is worth pointing out that forum support is provided by members rather than Limetech, so you will normally only get feedback when somebody thinks they have something to contribute.

Link to comment
On 3/31/2021 at 6:22 PM, TexasDaddy said:

I've read a ton of posts and forum topics about stability issues with Ryzen CPUs since I started having these issues. My first build was a Ryzen 7 2700X and I never had a single issue like this (all issues were user-caused, LOL). This new build is on a Ryzen 9 3950X. I have added "rcu_nocbs=0-31" to the append line in my syslinux.cfg file and thought that had resolved my issues, since I was up and stable for 3 days, but... it locked up again. With this forced restart I have now also disabled Global C-States in the BIOS, since that seems to give mixed results for Ryzen users.

 

The only Ryzen-specific tweak that's relevant nowadays is the Power Supply Idle Control one, and even that really only applies to the early 1000-series, though it does no harm to apply it anyway. There's certainly no point in disabling Global C-States; C6 is the only C-state that has ever been problematic. Do be careful not to overclock the RAM or the memory controllers, though.

 

Link to comment
On 4/1/2021 at 3:52 AM, TexasDaddy said:

I was getting about 1-1.5 days of continuous runtime before the system locked up.  

 

These can be really tricky to diagnose... Do you have enough memory that you could remove half of it?  If still happening, swap for the other half?

 

On 4/1/2021 at 3:52 AM, TexasDaddy said:

I began transitioning some of my basic containers

 

Can you turn these off for a few days?

 

My general advice is to remove/turn things off that you can to try to isolate the problem.

Link to comment
1 hour ago, jortan said:

 

These can be really tricky to diagnose... Do you have enough memory that you could remove half of it?  If still happening, swap for the other half?

 

 

Can you turn these off for a few days?

 

My general advice is to remove/turn things off that you can to try to isolate the problem.

Yes, I've got 4 x 32GB Kingston ECC sticks from the QVL. I was going to do some more digging into it once the parity check completes, since I had to do the hard reset the other day, but that won't finish for another 4 days. The parity check had been taking about 1.5 days with dual 18TB drives, but it has been moving incredibly slowly since this reboot; I'm not sure whether that indicates another issue or is just being exacerbated by the current issue(s). If it locks up before the parity check completes, I will revert the Global C-States change, remove the syslinux.cfg mod, and pull a pair of memory sticks to see if things stabilize. Perhaps MemTest didn't detect the issue.  

 

I'm considering buying an inexpensive motherboard and CPU that support this memory so I can build an open bench rig for additional testing.  

Link to comment
12 hours ago, TexasDaddy said:

won't finish for another 4 days.

 

Grab diagnostics while the parity check is underway. 18TB of parity is going to take a while to check, but 4 days is excessive. This is not a C states problem - that only affects an idle processor. This is a disk/controller/cable kind of problem.
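
If the web GUI is too sluggish to respond, you should also be able to generate them from a local console or SSH session - if I remember correctly, the CLI tool drops the zip onto the flash drive:

diagnostics
ls /boot/logs/     # the generated zip should appear here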

 

Link to comment
2 hours ago, John_M said:

 

Grab diagnostics while the parity check is underway. 18TB of parity is going to take a while to check, but 4 days is excessive. This is not a C states problem - that only affects an idle processor. This is a disk/controller/cable kind of problem.

 

 

Here are my current diags. Someone's previous post mentioned disabling LACP, so I was looking through my system to see if everything was configured correctly and noticed that my 1Gbps dynamic trunk was only showing 1 member. Looking at my switch and the syslogs, I noticed the second member was auto-negotiating down to 100Mbps. I disconnected and reconnected the network cable at the switch and the NIC port, then changed both of my trunks on the switch to static LACP trunks, and everything looks good in unRAID now, although bond0 reports as "bond0: IEEE 802.3ad Dynamic link aggregation, mtu 1500" and bond2 reports as "bond2: 2000Mb/s, full duplex, mtu 1500". The only config difference between them is that bond0 has an IP address and I'm using it for share access, the web UI, etc., while bond2 is the network interface for my Docker containers.  
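
For what it's worth, I've also been sanity-checking the bonds from the console using the standard Linux bonding and ethtool interfaces (nothing unRAID-specific, so interface names may differ on other systems):

cat /proc/net/bonding/bond0    # 802.3ad status plus per-member link speed and duplex
ethtool eth0                   # negotiated speed of an individual member NIC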

 

As for disk/controller/cable...

Controller: LSI SAS 9300-8I

Chassis: Supermicro CSE-847E16-R1K28LPB 36-bay (24-bay expander backplane on front, 12-bay expander backplane on back)

Cables: CableCreation Internal Mini SAS HD Cable, 3.3FT Mini SAS SFF-8643 to Mini SAS 36Pin SFF-8087 Cable (1 to each expander backplane)

Disks: mix of SAS and SATA III 10/12/18TB drives

 

All array disks are on the front and the 2 parity disks are on the back. 

 

Maybe I just need to shut down and reseat the memory modules and SAS cables.  

 

titan-diagnostics-20210402-1433.zip

Link to comment

I'm an idiot... the slow parity check was because of my Resilio Sync docker indexing my files to sync to my backup server.  As soon as I stopped the docker, my parity speed shot up to about 155 MB/s.

 

I did cancel the slow parity check, removed the rcu_nocbs line from my syslinux.cfg, re-enabled Global C-States, and set the Power Supply Idle Control to Typical. The parity check started over after the reboot, but I'll keep my sync offline for the time being until this check completes successfully. I'm really hoping this resolves my issues, as I was looking at some long downtimes and additional purchases to get to the bottom of them. Thanks for everyone's suggestions, ideas, and advice. I'll post back as things progress, whatever the outcome. 

Link to comment
1 hour ago, TexasDaddy said:

 Resilio Sync docker indexing my files to sync to my backup server.

 

I had problems with Resilio continuously rescanning a large dataset - by default it will attempt to rescan every 10 minutes.  You can set how often Resilio will rescan here:

https://help.resilio.com/hc/en-us/articles/205458185-Setting-how-often-Sync-should-check-for-file-changes-

 

I ended up changing this to a week (604800 seconds), since in general Resilio should get notified of any new/changed files anyway.
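
If I remember right, the power-user preference in question is folder_rescan_interval (value in seconds), so the change ends up looking roughly like this in the advanced settings:

folder_rescan_interval = 604800    (one week; the default of 600 is the 10-minute rescan)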

Link to comment

So everything just crapped out again, and I was able to get diagnostics before giving a shutdown command. Time to pull a pair of RAM sticks and see if things get any better. I might pull 3 and just run one stick at a time until I find out whether one of them is bad.

 

Is it possible to get @limetech (or someone from support) to take a look and give some ideas on getting this resolved?

titan-diagnostics-20210402-2155.zip

Link to comment

I had a look at your latest syslog and it's really nasty. You have a very complicated start-up and there are so many issues that it's difficult to untangle them. May I make the suggestion that you simplify things temporarily? I'm a simple soul and need to work methodically by breaking a difficult problem into smaller pieces. Try turning off all auto-starting docker containers and boot in safe mode and let the inevitable parity check complete, then use the server as a plain NAS without any bells and whistles for a few days, to prove its stability. If the diagnostics look good, then start building on that solid foundation. Because, at the moment you have a lot of things trying to happen and, really, none of it is working.

 

Incidentally, how fast are you clocking the RAM? It's one of the things I can't tell from the diagnostics. You have four dual-rank DIMMs on a Zen2 CPU, so the specification is 2666 MT/s, or 1333 MHz. Any higher and you're overclocking the memory controllers on the IO die. Also, please confirm that the CPU cores are not being overclocked - your diagnostics suggest they are not but then again the figures don't look entirely believable:

 

CPU MHz:                         2200.182
CPU max MHz:                     6767.5781
CPU min MHz:                     2200.0000
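
If it's easier than rebooting into the BIOS, the configured memory speed and DIMM ranks should also be visible from the console (dmidecode ships with Unraid, as far as I know):

dmidecode -t memory | grep -iE "speed|rank"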

 

What I'd like to see is a nice clean start-up and then we can tackle things like the network bonding problems, then move on to the other issues.

 

Link to comment
22 hours ago, John_M said:

I had a look at your latest syslog and it's really nasty. You have a very complicated start-up and there are so many issues that it's difficult to untangle them. May I make the suggestion that you simplify things temporarily? I'm a simple soul and need to work methodically by breaking a difficult problem into smaller pieces. Try turning off all auto-starting docker containers and boot in safe mode and let the inevitable parity check complete, then use the server as a plain NAS without any bells and whistles for a few days, to prove its stability. If the diagnostics look good, then start building on that solid foundation. Because, at the moment you have a lot of things trying to happen and, really, none of it is working.

 

Incidentally, how fast are you clocking the RAM? It's one of the things I can't tell from the diagnostics. You have four dual-rank DIMMs on a Zen2 CPU, so the specification is 2666 MT/s, or 1333 MHz. Any higher and you're overclocking the memory controllers on the IO die. Also, please confirm that the CPU cores are not being overclocked - your diagnostics suggest they are not but then again the figures don't look entirely believable:

 


CPU MHz:                         2200.182
CPU max MHz:                     6767.5781
CPU min MHz:                     2200.0000

 

What I'd like to see is a nice clean start-up and then we can tackle things like the network bonding problems, then move on to the other issues.

 

Just for some info/background: the CPU max settings you're seeing are accurate, as I am using the Power Save CPU governor setting from the Tips and Tweaks plugin to keep this CPU from running so hot during HandBrake transcodes. I'm going to get a 65W CPU to swap into this server, as I only have 2U worth of space to work with in this chassis; to keep the fan noise low I'm throttling for now and will swap to a lower-TDP CPU in the near future. I'll be putting this CPU in my backup unRAID server, where I can fit a much larger cooler to better handle the heat, and will offload the transcoding to that machine and just let it run full throttle. 
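
For reference, this is roughly how the governor can be confirmed from the console (standard Linux cpufreq sysfs paths, nothing specific to the Tips and Tweaks plugin):

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # should report powersave
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq    # current frequency cap, in kHz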

 

As for the "network errors", they are all UDP send failures for syslog: I was sending my logs to a remote server during this instability, and those are log messages failing to send while the LACP trunk is still being negotiated and initialized. As soon as that completes, the errors stop and all events are written to the remote log server. I've since configured it to write only to the local syslog and will verify after this parity check that those errors are no longer present during boot.

 

So, after running without issue, and without any errors or process-kill messages in the logs since pulling the pair of memory sticks, I'm really thinking my issues are all related to bad memory. I'm not going to call this solved just yet, as I'd like the parity check to finish and then see a significant idle period afterward before saying things are stable. Fingers crossed all goes well over the next day or two and I can start working on figuring out which stick of memory I'll need to RMA. 
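
(In case anyone wonders how I'm checking for those kill messages: just a quick grep of the local syslog, assuming the standard log location on unRAID:)

grep -iE "out of memory|oom-killer|killed process" /var/log/syslog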

 

I completely agree that I too would like to see a clean log, so after the parity check completes I'll perform a normal restart and grab a fresh set of diags for everyone to look at, just to make sure I'm not missing anything. More to come...

Link to comment
On 4/2/2021 at 11:23 PM, John_M said:

I had a look at your latest syslog and it's really nasty. You have a very complicated start-up and there are so many issues that it's difficult to untangle them. May I make the suggestion that you simplify things temporarily? I'm a simple soul and need to work methodically by breaking a difficult problem into smaller pieces. Try turning off all auto-starting docker containers and boot in safe mode and let the inevitable parity check complete, then use the server as a plain NAS without any bells and whistles for a few days, to prove its stability. If the diagnostics look good, then start building on that solid foundation. Because, at the moment you have a lot of things trying to happen and, really, none of it is working.

 

Incidentally, how fast are you clocking the RAM? It's one of the things I can't tell from the diagnostics. You have four dual-rank DIMMs on a Zen2 CPU, so the specification is 2666 MT/s, or 1333 MHz. Any higher and you're overclocking the memory controllers on the IO die. Also, please confirm that the CPU cores are not being overclocked - your diagnostics suggest they are not but then again the figures don't look entirely believable:

 


CPU MHz:                         2200.182
CPU max MHz:                     6767.5781
CPU min MHz:                     2200.0000

 

What I'd like to see is a nice clean start-up and then we can tackle things like the network bonding problems, then move on to the other issues.

 

 

So here are my latest diags. The parity check finished successfully yesterday and I let the server continue running through the night to see if I had another hang during an idle period; all appears to be good at this point. I performed a normal reboot today and let the system run for a while before collecting these logs. No errors that I've noticed, but I wouldn't mind a second set of eyes on them if anyone is up for it.  

 

I've got a friend with a bench rig testing the 2 sticks I pulled the other day, using the latest version of MemTest86. He's testing them individually and so far there are no errors. If they both pass all the testing, then I guess I'll drop them back in and see if things continue to run without issues.  

 

As for the memory speed question earlier, I'm running a Gen3 Ryzen so it supports 3200 MHz, which is the speed of the memory I purchased from the QVL (4 x KSM32ED8/32ME). 

titan-diagnostics-20210405-1229.zip

Link to comment
8 minutes ago, TexasDaddy said:

As for the memory speed question earlier, I'm running a Gen3 Ryzen so it supports 3200 MHz,

 

Not with four sticks of dual-rank memory. It's too heavy a load for the memory controllers to drive at that speed. The spec is 2666 MT/s.

 

I'll take a look at your diags.

 

Link to comment
1 minute ago, John_M said:

 

Not with four sticks of dual-rank memory. It's too heavy a load for the memory controllers to drive at that speed. The spec is 2666 MT/s.

 

I'll take a look at your diags.

 

You are correct sir.  I was looking at what I'm running now, not what I was running before.  

Link to comment
6 minutes ago, John_M said:

That's a lovely motherboard, BTW. I think I'll spend the rest of the day just browsing the manual!

LOL

 

Yes, I do really love this motherboard. I'm thinking of buying a second one for my other server, moving this Ryzen 9 over, and downgrading my primary to a 65W-TDP CPU like the Ryzen 7 3700X. This server only has 2U of workable space because of the 12 bays in the back, while my other server has a full 4U, so I can put a much better cooler in that one and push the CPU without feeling uncomfortable about the fan noise or the temps. 

 

Thanks for looking at the logs; my stress level has come down quite a bit since things appear to be very stable at this point.  

 

If all the memory modules pass the rigorous testing, then I'll just put all 4 sticks back in. If I start having problems again, then it will be time to engage ASRock about RMA'ing the board and/or a possible BIOS update from them. I know they are a bit slow to release stable BIOS updates, but their support seems to be pretty good about providing beta BIOS fixes. 

Link to comment
  • TexasDaddy changed the title to [Solved] unRAID 6.9.1 - System unresponsive and requires hard reset

Alright, marking this solved as I've been completely stable since removing 2 sticks of memory. Multiple memory tests were run with no issues found, both in my buddy's bench rig and in my server. I'm thinking the issue is with the system controller trying to run the memory at lower speeds when all 4 DIMM slots are populated. Hopefully this will be fixed with a BIOS update, but for now I'm more than happy to run the 2 sticks at rated stock speeds, and I don't really need the other 64 GB of memory at this time, as my system is currently only using 14%. I'm going to use these 2 sticks to upgrade my backup server, so it's just one (technically two) less thing to buy. I GREATLY APPRECIATE all the feedback and advice.  

Link to comment

 

4 hours ago, TexasDaddy said:

Alright, marking this solved as I've been completely stable since removing 2 sticks of memory.

 

Have you also swapped those 2 sticks for the other 2 sticks to make sure they are stable as well?

 

4 hours ago, TexasDaddy said:

I'm thinking the issue is with the system controller trying to run the memory at lower speeds when all 4 DIMM slots are populated.


Have you definitely confirmed in the BIOS that with 4 DIMMs populated they are running at 2666 MHz? Maybe I've misunderstood, but you might have this the wrong way around. The platform/memory will run below its rated maximum speed with no issues - the issue arises when you try to run your DIMMs at too high a speed. In this case the limiting factor is the platform (not the memory) due to the following factors:

  • Number of DIMMs
  • Single or dual rank DIMMs
  • CPU architecture


 

Even though 2666 MHz should be supported, you might find that you achieve stability by dropping this to 2400 MHz or lower. While not ideal, if this is stable it would confirm that your validated configuration isn't running as it should - you could then follow this up with ASRock re: possible BIOS fixes / RMAs.

 

PS: memtests are worth running, but they aren't necessarily going to show up every possible memory stability issue in a running system - particularly in edge cases like this, where the memory sticks themselves aren't the problem but the platform is.

 

Edit: I guess you can ignore the above if you're sticking with two DIMMs - though it may be worth working out what the issue is anyway if you might want the extra memory in future (VMs, Docker containers, etc.).

Edited by jortan
Link to comment
