Recent troubles shutting down/rebooting.



On 5/2/2021 at 8:39 AM, hansolo77 said:

the settings were all reset

This is typical, and any time you flash a BIOS it is considered best practice to do a Load Defaults immediately afterward -- as is usually indicated in the flashing instructions, and widely ignored.

 

The reasoning is obscured to the average user, but important; when storing your values, historically, they don't save "Processor 1 clock multiplier was 24x" or whatever. It's a block of numbers, with no context. The next BIOS version might use that same data space for "CPU Northbridge Overvolt" and cause damage. Typically vendors have important things like that reset during the update process, but things can be missed and it can cause some truly bizarre bugs. Honestly, at this point I almost suspect the advice may not have as much merit as it did, say, ten years ago -- but it still seems prudent to be cautious.
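To make the "block of numbers, with no context" point concrete, here is a minimal sketch in Python. The two layouts and the field names are made up for illustration and do not describe any real BIOS; the only point is that the same saved bytes decode to different settings once the layout changes.

```python
import struct

# Hypothetical layouts: both firmware versions read the same 4 bytes
# (two unsigned bytes plus one unsigned 16-bit value, little-endian),
# but they disagree about what the first byte means.
OLD_LAYOUT = ("cpu_multiplier", "fan_profile", "reserved")
NEW_LAYOUT = ("nb_overvolt_step", "fan_profile", "reserved")


def decode(block: bytes, layout):
    """Interpret a raw settings block according to a given layout."""
    values = struct.unpack("<BBH", block)
    return dict(zip(layout, values))


# The old firmware saved "multiplier = 24" as a bare number.
saved = struct.pack("<BBH", 24, 2, 0)

old_view = decode(saved, OLD_LAYOUT)
new_view = decode(saved, NEW_LAYOUT)

print(old_view)  # the 24 is read as a CPU multiplier
print(new_view)  # the same 24 is now read as an overvolt step
```

Loading defaults after the flash rewrites the whole block under the new layout, so no stale value gets reinterpreted.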


How do I go about running a memtest?  I see there is an option in the pre-boot options of Unraid, but it doesn't do anything when I select it.  Do I need to have a copy of it on a flash drive or something first?

 

Also... I have rebooted my server probably a dozen times in the last few days.  Looks like I've fixed THAT problem.  I ran another correcting parity check and it found 24 errors.  I feel like I'm getting closer to a solved matter, but I still want to be sure.  A memtest sounds like something I can do.


I'm actually trying to get the memtest started..  I have it written to a USB flash drive.. plugged it in, my BIOS reads it, but if I have it boot via UEFI it just sits at a blank screen.  If I boot it via regular, I get an error that it's not bootable and to replace it or select a different device.  So I'm going to try an older version.  My motherboard should be compatible with the latest version (released around September 2019), so I don't know what I'm doing wrong.

 

====

EDIT

====

 

Got the latest v4 release, got a new option in my BIOS for "USB Floppy".  Using that, I've now been able to get it to boot the memtest, and it's running.  If I remember right, and the program hasn't changed, I can just leave it running overnight.  The longer it runs the more accurate the test.  I'll report back tomorrow unless somebody says otherwise. 

 

Also of note.. I was able to successfully shut down Unraid without any delay issues again before loading Memtest.  I truly feel I've resolved the original issue.  It had to have been a combination of using UnBalance Scatter while the NZBGet docker was attempting to repair a massive download that was missing parts.  It was something like 54 GB.  Mover was set to go every hour, so it was probably trying to move files onto the array while trying to repair the download, while trying to Scatter.  Everything all at once.  I didn't even realize it was doing that until I had rebooted (unsafely) and saw it trying to resume the recovery after the fact.  I'll know better next time.

 

If my RAM checks out, I might have a new problem to address.  I just completed another self-correcting parity check and it found 24 errors.  That's a LOT better than the 138k I was getting before.  I'm just concerned because it IS still 24 errors.  I wish I could verify integrity of the data, but it's all media and I don't have anything to compare it against.  No way to know if anything is corrupted until I try to access it and have issues.  Oh well.  Thanks again for the help though everybody.  Still learning!

Edited by hansolo77
Update

I know this is late, but the Unraid boot media should contain memtest and a boot entry to reach it unless the end user manually removed it. I can understand not noticing it, it's in a text menu that only shows up for five seconds or so when you first boot.

 

And yes, basically run memtest for a really long time (it has a full pass counter and an error counter) and if there are errors either a) you're overclocking, so stop it, or b) your RAM is bad, so replace it.

 

There are some rare edge cases (BIOS updates, cleaning contacts, etc) but generally it's more common to be bad RAM or RAM compatibility with the motherboard.


I saw the Memtest86 entry in the Unraid boot menu but when I select it, nothing happens.  It might be that it's using the same version I was unable to load.  But yeah, I think the memory is fine.

 

I'm running another parity check, and it's still finding errors.  :(  I don't know what's causing it.  I disabled all my dockers and vm, rebooted (perfectly) and started it.  Nothing is running that could be accessing the array.  Yet still it's finding errors after all these corrections.


Completed another correcting parity check.. down to 20 errors.  Still pretty concerned about it.  Not sure what else to try.

 

Is it possible my parity drives are just corrupted, and I could "rebuild" them by removing them and re-adding them?  I've only been playing with Unraid about a year.  Hard to believe I've got such a massive problem with my parity after all this.

kyber-diagnostics-20210506-2242.zip


To be honest, I feel you're just wasting your time.

 

(1)  I highly recommend you use memtest86 v9 (UEFI boot only) and set it to use all CPUs. I know you had a booting problem with v9; please try to troubleshoot why it fails. Also, once you create the USB, you should back up the 3 folders/files (about 7 MB in total). Then you don't need to create the USB every time. Just copying them to a FAT32 USB stick is enough to run memtest86 v9.

 

image.png.65158a8047e3f3eb46e307f8c7d24611.png

 

For the UEFI boot problem, please check those BIOS settings and choose the boot device labelled "UEFI XXXXXXXX"; if you can't boot it, please reset the BIOS settings. Please note the screen can stay blank for 1 min+, and initialization then takes another 3 min+.

 

(2)  In the Unraid parity check, you don't need to wait for the whole 12TB of each disk to be tested. The number of errors is meaningless; you just need it to be error free.

You should note the time and the capacity point at which the 1st error occurs, i.e. in your log it took just 15 min to get the 1st error .... then stop and restart the correcting parity check, and repeat: error, stop, repeat ....... If errors still happen randomly, that means you still haven't fixed the root problem.

 

May  5 20:35:48 Kyber kernel: Linux version 5.10.28-Unraid (root@Develop) (gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021

May  5 20:50:19 Kyber kernel: md: recovery thread: Q corrected, sector=142011384

 

(3) Rule out other hardware until it is error free. I would suggest the next hardware to check is the HBA, or try a test using two spare disks with a new config.
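The "note the time and capacity point of the 1st error" step in point (2) can be sketched in a few lines of Python. This is only a rough illustration: the helper name first_error is made up, the year is supplied by hand because syslog timestamps omit it, the start time is passed in by the caller (the boot line from the excerpt is used below as a stand-in), and the 512-byte sector unit is an assumption that should be checked before trusting the offset.

```python
import re
from datetime import datetime

SECTOR_BYTES = 512  # assumption: the md driver reports 512-byte sectors

# Matches lines such as:
#   May  5 20:50:19 Kyber kernel: md: recovery thread: Q corrected, sector=142011384
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*md: recovery thread: \w+ corrected, sector=(\d+)"
)


def parse_stamp(stamp: str, year: int) -> datetime:
    """Parse a syslog timestamp, which carries no year of its own."""
    normalized = " ".join(stamp.split())  # collapse the padded day ("May  5")
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def first_error(lines, start_stamp: str, year: int = 2021):
    """Return (elapsed, sector, offset_gb) for the first corrected sector, or None."""
    start = parse_stamp(start_stamp, year)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            when = parse_stamp(match.group(1), year)
            sector = int(match.group(2))
            return when - start, sector, sector * SECTOR_BYTES / 1e9
    return None


log = [
    "May  5 20:35:48 Kyber kernel: Linux version 5.10.28-Unraid (line shortened)",
    "May  5 20:50:19 Kyber kernel: md: recovery thread: Q corrected, sector=142011384",
]
elapsed, sector, offset_gb = first_error(log, "May  5 20:35:48")
print(elapsed, sector, round(offset_gb, 1))
```

If repeated runs report the first error at a different sector each time, that points at a random hardware fault rather than one bad spot in parity.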

 

 

1 hour ago, hansolo77 said:

Is it possible my parity drives are just corrupted, and I could "rebuild" them by removing them and re-adding them?  I've only been playing with Unraid about a year.  Hard to believe I've got such a massive problem with my parity after all this.

 

No use.

 

Edited by Vr2Io

Thanks for the advice. 

 

10 hours ago, Vr2Io said:

Please note the screen can stay blank for 1 min+, and initialization then takes another 3 min+.

 

 

I didn't know the UEFI version of memtest would cause a blank screen for so long.  I just assumed it wasn't working.  I never got an error message or anything, so maybe it was working the whole time and I was just wrong.  I'll try it again.

 

10 hours ago, Vr2Io said:

You should note the time and the capacity point at which the 1st error occurs, i.e. in your log it took just 15 min to get the 1st error .... then stop and restart the correcting parity check, and repeat: error, stop, repeat .......

 

That makes too much sense!  I never thought about that.  I will do that next.  When I do that test, should I be doing it as a repair or just a scan?  Also.. what if the problems don't crop up until like the last hour?  Bummer lol.

 

10 hours ago, Vr2Io said:

(3) Rule out other hardware until it is error free. I would suggest the next hardware to check is the HBA, or try a test using two spare disks with a new config.

 

I hope it's not the controller.  I mean yeah, anything is possible.  I bought this one from the eBay seller "The Art of Server".  Highly recommended seller.  Is there a way to test the controller directly?

 

8 hours ago, JorgeB said:

Note also that after the problem is fixed the first check may still find sync errors, but the 2nd one should find 0.

 

After I posted last night with the diagnostic logs, I started another non-correcting check.  It is almost 12 hours in and has found 0 errors. [fingers crossed]

1 minute ago, jonathanm said:

Do you have active cooling on the controller, or at least constant air movement? Server grade controllers assume they will be in rack mount or other flow through designs, they must not be in stagnant air.

 

I don't think the controller has a fan on it, but I do have a fan blowing air across the entirety of my expansion cards.  I also have all my fans set to be on full power rather than ramp up based on temps.

Edited by hansolo77
2 minutes ago, Vr2Io said:

Repair (correcting), because a scan can't rule out whether errors still occur randomly or unpredictably.

 

Should I stop my current non-correcting scan then?  It's still found 0 errors and I'm pretty confident it's passed the mark where the first error occurred yesterday.

3 minutes ago, hansolo77 said:

 

Should I stop my current non-correcting scan then?  It's still found 0 errors and I'm pretty confident it's passed the mark where the first error occurred yesterday.

Let it continue.

 

I notice you said a parity check needs 1.5 days. Could you tell me how much total bandwidth it reaches when you start a parity check/correct? It is shown at the bottom of the array statistics.

 

Before buying another HBA, I would use two spare disks with a new config to verify whether the test is clean on the onboard ports and the problem appears only on the HBA.


Total bandwidth is averaging 3.2 GB/s.  Each disk is hitting around 155.5 MB/s.  When you say use 2 spare disks.. you mean a parity and a data right?  If it comes to that, I would actually have to BUY some drives.  I've been upgrading my drives to 12 TB and have been giving my lower capacity drives to my brother, so I have no spares.
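As a rough sanity check on those figures, the arithmetic can be written out in Python. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB) and a per-disk rate that stays constant across the whole disk, which real drives do not manage because they slow down toward the inner tracks.

```python
total_bw_mb_s = 3.2 * 1000   # reported total bandwidth, in MB/s
per_disk_mb_s = 155.5        # reported per-disk rate, in MB/s
disk_tb = 12                 # size of the largest disks, in TB

# How many disks must be read in parallel to add up to the total.
disks_reading = total_bw_mb_s / per_disk_mb_s

# Hours for one full pass over a 12 TB disk if the rate never dropped.
hours_at_rate = disk_tb * 1_000_000 / per_disk_mb_s / 3600

print(f"disks in parallel: about {disks_reading:.1f}")
print(f"best-case full pass: about {hours_at_rate:.1f} hours")
```

The best-case figure is a lower bound, so a real check taking noticeably longer than that is not surprising in itself.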

2 minutes ago, Vr2Io said:

Sorry, I changed my mind: if it needs too much time to complete, then I would stop it and run UEFI memtest86.

 

Ok sounds good. I'm off work today so I can spend time working on this.  Next days off are Sunday and Monday, then a long stretch of working.  Whatever I need to do I'm on it.  :)

7 minutes ago, hansolo77 said:

Total bandwidth is averaging 3.2 GB/s.

That's great. To be honest, my arrays have just 10 disks, or 16 disks at most, so I never reach such high bandwidth usage.


Most people wouldn't reach this level.

Edited by Vr2Io
