Recent troubles shutting down/rebooting.


3 hours ago, hansolo77 said:

Unless when I did the Unbalance plugin/scatter.  Maybe that screwed me all up.

Unlikely. Any write to any data disk in the parity array updates parity at the same time. It doesn't matter whether you write to a disk directly or through user shares, or which application is doing the writing. Move, copy, delete, even format, are all write operations that update parity. This all happens at a lower level than any application or even the command line.

 

Unclean shutdowns can cause parity errors because everything is powered off before the parity update can be completed.
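
To make that concrete, here is a minimal sketch, assuming plain XOR parity over a few tiny in-memory "disks". The function names (`compute_parity`, `write_byte`, `count_sync_errors`) are made up for illustration and this is not Unraid's actual driver code. It only shows why every write touches parity, and why a write that lands on a data disk without the matching parity update shows up later as a sync error.

```python
from functools import reduce
from operator import xor


def compute_parity(disks):
    """XOR the bytes at each offset across all data disks."""
    return bytearray(reduce(xor, column) for column in zip(*disks))


def write_byte(disks, parity, disk_index, offset, value, update_parity=True):
    """Write one byte to a data disk and (normally) fold the change into parity."""
    old = disks[disk_index][offset]
    disks[disk_index][offset] = value
    if update_parity:
        # XOR out the old value and XOR in the new one.
        parity[offset] ^= old ^ value


def count_sync_errors(disks, parity):
    """Count offsets where stored parity differs from freshly computed parity."""
    expected = compute_parity(disks)
    return sum(1 for a, b in zip(expected, parity) if a != b)


# Three toy data disks of four bytes each (illustrative values only).
disks = [
    bytearray(b"\x01\x02\x03\x04"),
    bytearray(b"\x10\x20\x30\x40"),
    bytearray(b"\xaa\xbb\xcc\xdd"),
]
parity = compute_parity(disks)

# A normal write: data and parity change together, so parity stays in sync.
write_byte(disks, parity, disk_index=1, offset=2, value=0x99)
clean_errors = count_sync_errors(disks, parity)

# Simulated unclean shutdown: the data write lands, the parity update does not.
write_byte(disks, parity, disk_index=0, offset=0, value=0x7F, update_parity=False)
errors_after_interrupted_write = count_sync_errors(disks, parity)

print(clean_errors, errors_after_interrupted_write)
```

Nothing in the sketch cares what kind of operation caused the write, which is the point: the parity update sits underneath all of them.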

Sounds good to me, that's what I thought.  I'm just trying to come up with an explanation for why I would have so many sync errors after 3 runs.  I'm doing a non-correcting scan now and it has already found 6 in the 9 hours it has been running.  Still got 22 hours to go.

 

I looked at the extended scan log from FCP. It looks like I have a problem there too: my `/system` path, which holds the docker and libvirt folders, is actually spread across multiple locations (disk 1 and cache).  That might be why I'm having issues with the nzbget docker...  I wonder how that happened, as I never touch those things.  The share is configured with 'Prefer Cache'.  But I would think it's an either/or situation, where it would either be on the array or on the cache, not both.  How would I fix that?

I think we need to be a little more methodical to get to the bottom of this.

 

You ran a correcting parity check, which should have corrected all parity sync errors.

 

You followed this with another (non-correcting?) parity check, and it still gets parity sync errors.

 

Is this correct?

 

What we need to see is the syslog of both parity checks so we can compare them. If you haven't rebooted at all during this then no problem: just give us Diagnostics after this parity check completes and the syslog within will contain both parity checks.

 

If you did reboot during any of this, and you didn't capture the syslogs of both parity checks, either with Diagnostics or the Syslog Server, then we would have to start over with a correcting check, then a non-correcting check, all without rebooting, and then get the Diagnostics.
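
For anyone wanting to do that comparison themselves, here is a rough sketch of the idea. It assumes each reported error appears on a log line containing `sector=<number>`; that pattern, the sample lines, and the helper names (`sectors_in`, `compare_checks`) are assumptions for illustration, so adjust the regex to whatever your syslog actually contains.

```python
import re

# Assumption: each parity error is logged with "sector=<number>" somewhere on the line.
SECTOR_RE = re.compile(r"sector=(\d+)")


def sectors_in(log_text):
    """Collect every sector number mentioned in one parity check's log lines."""
    return {int(s) for s in SECTOR_RE.findall(log_text)}


def compare_checks(first_log, second_log):
    """Split the reported sectors into repeats and one-offs."""
    first, second = sectors_in(first_log), sectors_in(second_log)
    return {
        "in_both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }


# Made-up excerpts standing in for the two checks; real syslog lines will look different.
correcting_run = """
parity check: corrected, sector=1024
parity check: corrected, sector=88888
"""
non_correcting_run = """
parity check: incorrect, sector=88888
parity check: incorrect, sector=424242
"""

report = compare_checks(correcting_run, non_correcting_run)
print(report)
```

The output only tells you whether the second check reported the same sectors as the first or different ones. What that means for the hardware still needs someone to read the full Diagnostics.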

 

 

15 minutes ago, hansolo77 said:

I looked at the extended scan log from FCP. It looks like I have a problem there too: my `/system` path, which holds the docker and libvirt folders, is actually spread across multiple locations (disk 1 and cache).  That might be why I'm having issues with the nzbget docker...  I wonder how that happened, as I never touch those things.  The share is configured with 'Prefer Cache'.  But I would think it's an either/or situation, where it would either be on the array or on the cache, not both.  How would I fix that?

We can deal with that later. It shouldn't really break anything, but it will impact performance to have these on the parity array and it will keep disks spun up since these files are always open. The usual cause of this is starting Docker or VM services when there is no cache disk present, and so these get (re)created on the array.

1 minute ago, trurl said:

We can deal with that later. It shouldn't really break anything, but it will impact performance to have these on the parity array and it will keep disks spun up since these files are always open. The usual cause of this is starting Docker or VM services when there is no cache disk present, and so these get (re)created on the array.

I think that might be the case. When I looked at the modification date for those files on the array, they were back in August of last year when I was building my server.  I just used Krusader and erased them.

1 minute ago, hansolo77 said:

I think that might be the case. When I looked at the modification date for those files on the array, they were back in August of last year when I was building my server.  I just used Krusader and erased them.

That should be OK. Have you done another FCP scan since then?

 

You can see how much of each disk is used by each user share by clicking on Compute... for the share.
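
If you would rather check from a script, something along these lines gives a similar per-disk breakdown. It is only a sketch and not how the Compute... button is implemented. It assumes the array disks are mounted as `disk1`, `disk2`, ... and the pool as `cache` under `/mnt`; the function name `share_usage_by_disk` and the file names in the demo are placeholders.

```python
import re
import tempfile
from pathlib import Path


def share_usage_by_disk(share_name, mount_root="/mnt"):
    """Return {disk_name: bytes_used} for every array disk or cache holding the share."""
    usage = {}
    for entry in sorted(Path(mount_root).iterdir()):
        # Assumption: array disks are named disk<N> and the pool is named cache.
        if not (entry.name == "cache" or re.fullmatch(r"disk\d+", entry.name)):
            continue
        share_dir = entry / share_name
        if share_dir.is_dir():
            usage[entry.name] = sum(
                f.stat().st_size for f in share_dir.rglob("*") if f.is_file()
            )
    return usage


# Demo on a throwaway directory tree standing in for /mnt.
with tempfile.TemporaryDirectory() as fake_mnt:
    root = Path(fake_mnt)
    (root / "cache" / "system" / "docker").mkdir(parents=True)
    (root / "disk1" / "system" / "libvirt").mkdir(parents=True)
    (root / "disk2").mkdir()
    (root / "user" / "system").mkdir(parents=True)  # merged view, skipped on purpose
    (root / "cache" / "system" / "docker" / "docker.img").write_bytes(b"x" * 2048)
    (root / "disk1" / "system" / "libvirt" / "libvirt.img").write_bytes(b"x" * 512)
    demo_usage = share_usage_by_disk("system", mount_root=fake_mnt)

print(demo_usage)
```

On the real server you would call `share_usage_by_disk("system")`. A share that is meant to live only on cache should then come back with a single `cache` entry.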

1 hour ago, JorgeB said:

Ryzen with overclocked RAM

Not just that, my streamer roomie had a Ryzen 3700X rig with RAM (I cannot remember which brand right now but if I ask he'll definitely remember because he's never buying from them again) rated for 3200 MHz (not XMP, just the normal profile) that won't run above 2400 MHz in any stable manner even in Windows. Something about the revision of RAM he has using differently branded ICs, and Ryzen doesn't get along with them. Honestly, I don't care how many success stories there are, I don't trust Ryzen with Linux.

Edited by codefaux

I’m pretty sure I’m not overclocking the ram but I can double check. If it’s true though that the problem is related to that, why would it only now be showing issues and not before? I’ve had this running for a long time without issues. 

 

Also, I’m at work right now and My Servers shows my server is offline. 😬 not a good sign...

17 minutes ago, hansolo77 said:

I’m pretty sure I’m not overclocking the ram but I can double check.

I am pretty sure you are:

  • 2 DIMMs
  • Dual rank
  • RAM running at 3600 MT/s
  • on a 3900X

The maximum supported speed is 3200 MT/s.

 

You may be within the RAM specs, but you are way out of the CPU specs. It might not be the cause, but probably something to address nonetheless.
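
As a quick worked check of those numbers (a sketch only: the 3600 and 3200 figures are the ones from this post, the helper names are made up, and the halving rule is simply how DDR data rates relate to the memory clock):

```python
def data_rate_to_clock_mhz(data_rate_mts):
    """DDR memory transfers twice per clock, so the clock is half the data rate."""
    return data_rate_mts / 2


def percent_over_limit(configured_mts, supported_mts):
    """How far the configured data rate sits above the supported one, in percent."""
    return (configured_mts - supported_mts) / supported_mts * 100


configured = 3600  # what the RAM is currently running at, per the list above
supported = 3200   # limit quoted in this post for 2 dual-rank DIMMs on a 3900X

is_overclocked = configured > supported
overshoot = percent_over_limit(configured, supported)
memory_clock = data_rate_to_clock_mhz(configured)

print(is_overclocked, overshoot, memory_clock)
```

So from the CPU's point of view the memory is running 12.5% past the supported rate, even though the DIMMs themselves are rated for it.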

Just got home.  Looks like it's just the remote access that was having trouble... Dad said the power went out and the modem hadn't come back on yet.  I had to power cycle the modem and router.  The server is on a UPS so it was able to continue the non-correcting scan.

 

Good news I guess.... it only found 10 errors.  So I'm going to reboot the server, access the BIOS and drop the RAM speed down.  Once back up, what should I do next?  Also.. the last extended Fix Common Problems scan came back error free!

 

Thanks again for all your help!

 

kyber-diagnostics-20210501-1605.zip

Seems like the "safe" shutdown didn't work.  I had all Dockers off, and disabled them in the Settings tab.  The only VM running was Windows 10, which I actually had "sleeping".  Nothing was accessing anything on the array.  When I did the shutdown, my web interface showed it was shutting down but the server itself never powered off.  I attached a keyboard to the machine (the screen was blank) and hit enter a few times. It indicated it was waiting 420 seconds for a graceful shutdown.  I remembered I saw a thread on here, I think @Squid posted it, about increasing the timeouts for various things to allow them more time to shut down beforehand.  Well anyway, the time expired and the terminal screen indicated it was forcing the shutdown.  It saved a log:

 

kyber-diagnostics-20210501-1615.zip
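
For anyone following along, the wait-then-force behaviour I saw on the console is the usual "ask nicely, wait for a timeout, then kill" pattern. Below is a minimal sketch of that pattern using ordinary child processes. It is not Unraid's shutdown code, the helper names are made up, the 2 second timeout stands in for the 420 seconds, and it assumes a Linux-like system where `terminate()` sends SIGTERM.

```python
import subprocess
import sys


def stop_with_timeout(proc, timeout_s):
    """Ask a process to exit, wait up to timeout_s seconds, then force it."""
    proc.terminate()  # polite request (SIGTERM on Linux)
    try:
        proc.wait(timeout=timeout_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced stop (SIGKILL on Linux)
        proc.wait()
        return "forced"


def start_child(code):
    """Start a Python child process and wait until it reports that it is ready."""
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # blocks until the child prints "ready"
    proc.stdout.close()
    return proc


# A child that exits when asked, and one that ignores the polite request.
POLITE = "import time; print('ready', flush=True); time.sleep(30)"
STUBBORN = (
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); time.sleep(30)"
)

polite_result = stop_with_timeout(start_child(POLITE), timeout_s=2)
stubborn_result = stop_with_timeout(start_child(STUBBORN), timeout_s=2)
print(polite_result, stubborn_result)
```

Raising the timeout only helps a process that is slow to stop. One that never lets go, like the stubborn child here, will run out any timeout and get forced anyway.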

 

After it fully shut down, I rebooted and accessed the BIOS.  It indicated that the RAM was running at 3600 MHz.  A feature called A-XMP was enabled, and the speed was set to 3600 MHz.  I think I must have done that when I first put the parts together.  I changed the frequency to "AUTO", then disabled A-XMP and rebooted.  It came back up and the BIOS then showed it was at 2600 MHz.  That sounded really low, so I turned A-XMP back on, set the frequency to 3200 MHz and rebooted.  Now the BIOS shows it's running at 3200 MHz.  Unraid loaded up, but (as expected) it's now running another parity check.

 

Awaiting further instructions...

Quick little update.  I decided to remove the 10GbE NIC from the server.  Since it's always giving me trouble with my other PC (drivers, Windows always "detects a problem" and disables it), it's just time to take it out.  I got it to move the nearly 100 TB of data between my old server and this one.  Now that it's done, I put the NIC in my regular PC and have had troubles since.  Figured since I was inside the case reseating the RAM, I might as well.  So logs might show it's missing now.

 

12 hours ago, hansolo77 said:

why would it only now be showing issues

I mean - it worked until it broke? I dunno what to tell you, that's just what happens when things stop working, lol. It ALL worked until it stopped working. Parts degrade over time and, as they age, lose the capacity to run faster than they were meant to. Eventually they'll degrade until they stop working entirely.

 

I would suggest running the RAM at the Auto speed instead of clocking the RAM over what it defaults to when autodetected. I think they call that overclocking, right? (Trying to be jovial, not sarcastic.)

 

I'd say drop the RAM to its default, and see where that gets you. Beyond that, nothing immediate comes to mind.

The thing that bugged me about that 2600 vs 3200 is that the RAM is 3600 to begin with.  Kinda like the old CPU days where you had a clock speed of 100 MHz with a 4x multiplier to get 400 MHz.  My BIOS has all kinds of settings for tweaking the RAM overclock.  I never mess with them, because I know overclocking leads to instability.  I try to keep everything at its base level.  So when I installed this system, I knew the RAM was rated at 3600 MHz, and the motherboard supported it, so I set it to that speed.  I didn't attempt to overclock it, just set it at the RAM's preset speed.  I also didn't know the Ryzen processor had an upper safe limit of 3200 MHz.  So now I have the RAM clocked down to that.  This is mostly a media/Plex server, so I guess RAM speed isn't all that important, except for when I'm transcoding.

 

I've still got the Docker and VM modules disabled.  I've been able to complete another extended Fix Common Problems scan with no issues.  I've also rebooted the server 3 times now and it was able to do so with no problem.  I'm going to try slowly turning things back on and see if the reboots still work.

 

I've had a few ideas as to what caused the problem in the first place.  The first major thing I added to my server since the last time I was able to properly shut down was the "My Servers" addon.  I wonder if that is causing some kind of background lockup with the network.  The last time I had a forced shutdown, the logs showed some networking issues.  They also indicated it was unable to unmount the cache drive.  With no array activity, and no dockers or VMs, it makes me wonder what else could be holding the cache.

The other thing I've done recently was the UnBalance Scatter.  The system has been fine for months.  I actually reached something like 48 days uptime before rebooting to install new nVidia drivers.  But when I noticed I had a drive 100% full and started to use the UnBalance plugin, that's when I started having trouble.  I remember the system becoming unresponsive while it was running the scatter and needing a hardware power button reset to come back up.  I think that also explains why I had so many duplicated files (literally complete TV shows, not just a few episodes, were duplicated on 2 drives): the process of moving was interrupted.  And with the forced shutdown, the parity WOULD be out of sync.  I've gone back through and erased all the duplicates, and that last non-correcting parity check only came back with 10 errors.  I think I probably need to just do one more correcting scan and that should be it.
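
On the question of what could still be holding the cache: one way I could check next time is to look for processes with files open under the mount point. The sketch below walks `/proc/<pid>/fd` on Linux. The function name is made up, it only sees open file descriptors (not working directories or memory maps), and without root it can only inspect your own processes, so tools like `lsof` or `fuser` will be more thorough.

```python
import os
import tempfile


def pids_with_open_files_under(mount_point, proc_root="/proc"):
    """Return {pid: [open paths]} for processes holding files under mount_point."""
    target = os.path.realpath(mount_point)
    holders = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, name, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                path = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if path == target or path.startswith(target + os.sep):
                holders.setdefault(int(name), []).append(path)
    return holders


# Demo: hold a file open in a temporary directory and confirm this process is found.
with tempfile.TemporaryDirectory() as fake_cache:
    with open(os.path.join(fake_cache, "busy.log"), "w") as handle:
        holders = pids_with_open_files_under(fake_cache)
        this_process_listed = os.getpid() in holders

print(this_process_listed)
```

On the server the call would be `pids_with_open_files_under("/mnt/cache")`, run as root just before attempting the shutdown.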

 

I think for the time being, I'm going to try enabling various dockers, one at a time, and reboot.  See if they are stable.  NZBGET in particular since that's the one that was logging segfaults.  I think it might have been struggling because I remember it attempting to unpack a really large file that was incomplete and it was rebuilding from parity files, all while the scatter was running, and then the lockup happened with a forced reboot.  So I've gone in and removed all the files in the NZBGET download folder to eliminate any pending work.  I might just uninstall that docker and reinstall it to be safe.

 

I am still interested in what I'm reading about the Ryzen C-states.  Since coming home, while waiting for a reply, I've been reading about it.  The forum was abuzz with posts back when 6.7 was coming out.  We're now on 6.9, so I'd think the code would have matured by now.  The same goes for my specific components; I need to see if there's a new BIOS update, since I believe I'm still running the initial release.

 

All things to look into tomorrow.  Good thing I'm off work.  :)  As for now, I'm really tired and my brain (and eyes) are starting to get foggy.  I'll need to look at this from a fresh perspective in the morning.  Still would like feedback and help if anybody has any further ideas or suggestions.

OMG Somebody please help me!

 

I went in and updated the BIOS.  It finished 100% and rebooted.  Now, the PC comes on, but there is no video display.  I'm completely screwed now.  I can't access anything.  I can't go into the BIOS, and I can't get Unraid to load.  It just sits there, with the fans spinning and the LED on.  I've already tried using the reset CMOS button on the back of the case, with power plugged in and unplugged.  Nothing works.  I can't even access anything to try to reflash the BIOS again.  It's just completely bricked except for power, fans, and LED.  It's Sunday too, so their Live Chat support is not available.  I don't know what to do.  I'm breaking down mentally here, about to start crying.

 

OMG WHY ME?

 

------------------------

 

Edit:

I think I got it.  I discovered that the latest BIOS I had downloaded was a BETA.  So I grabbed the latest "stable" BIOS instead and used that.  Now everything is back up.  I noticed the settings were all reset, and a lot of the tweaks I had set, like enabling or disabling things, had changed where they were.  I think I got it for the most part, but now it's displaying my HBA card discovering all the drives.  It didn't do that before.  Something else I set: I read in that Ryzen thread @JorgeB linked to set "Power Supply Idle Control" (or similar) to "typical current idle" (or similar), so I did that.  I also found where the C-States setting is, but I left it at Auto.  Should that be disabled?

 

Thanks again guys, I really appreciate it!

Edited by hansolo77
2 hours ago, hansolo77 said:

but now it's displaying my HBA card discovering all the drives.  It didn't do that before.

Disable the storage option ROM (OPROM) in the motherboard BIOS.

 

2 hours ago, hansolo77 said:

it at Auto.  Should that be disabled?

Yes.

 

Then, as previously mentioned, run the UEFI version of MemTest86.
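
As for the C-states setting, one way to see which idle states the kernel still exposes after changing it is to read the cpuidle entries in sysfs. This is only a sketch: the function name is made up, it assumes a Linux kernel with a cpuidle driver loaded, and the demo uses a fake directory tree so it runs anywhere.

```python
import tempfile
from pathlib import Path


def list_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(state_dir, name, disabled)] for the idle states exposed for one CPU."""
    cpuidle = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    states = []
    if not cpuidle.is_dir():
        return states  # no cpuidle driver loaded, or not a Linux sysfs tree
    for state_dir in sorted(cpuidle.glob("state[0-9]*")):
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((state_dir.name, name, disabled))
    return states


# Demo against a fake sysfs tree; the state names here are illustrative only.
with tempfile.TemporaryDirectory() as fake_sysfs:
    for index, (name, disable) in enumerate([("POLL", "0"), ("C1", "0"), ("C2", "1")]):
        state = Path(fake_sysfs) / "cpu0" / "cpuidle" / f"state{index}"
        state.mkdir(parents=True)
        (state / "name").write_text(name + "\n")
        (state / "disable").write_text(disable + "\n")
    demo_states = list_idle_states(cpu=0, sysfs_root=fake_sysfs)
    missing_cpu = list_idle_states(cpu=7, sysfs_root=fake_sysfs)

print(demo_states)
```

On the server itself, `list_idle_states()` with the default path can be compared before and after the BIOS change.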
