codefaux, posted May 4, 2021
On 5/2/2021 at 8:39 AM, hansolo77 said: "the settings were all reset"
This is typical, and any time you flash a BIOS it is considered best practice to do a Load Defaults immediately afterward, as is usually indicated in the flashing instructions and widely ignored. The reasoning is obscure to the average user, but important: historically, when the BIOS stores your settings, it doesn't save "Processor 1 clock multiplier was 24x" or whatever. It's a block of numbers with no context. The next BIOS version might use that same data space for "CPU Northbridge Overvolt" and cause damage. Typically vendors have important things like that reset during the update process, but things can be missed, and that can cause some truly bizarre bugs. Honestly, at this point I suspect the advice may not have as much merit as it did, say, ten years ago, but it still seems prudent to be cautious.
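To make the "block of numbers with no context" point concrete, here is a minimal Python sketch. The two field layouts are hypothetical and do not correspond to any real BIOS or vendor; the only point is that the same saved bytes can mean something different once the layout changes between versions.

```python
import struct

# Hypothetical layouts, invented for illustration only.
# Both "BIOS versions" read the same four raw bytes of saved settings.
LAYOUT_V1 = ("cpu_multiplier", "fan_profile", "boot_order", "reserved")
LAYOUT_V2 = ("northbridge_overvolt_step", "fan_profile", "boot_order", "cpu_multiplier")


def decode(raw: bytes, layout: tuple) -> dict:
    """Interpret four unsigned bytes according to the given field layout."""
    values = struct.unpack("<4B", raw)
    return dict(zip(layout, values))


# Settings as saved by "version 1": multiplier 24, fan profile 2, boot order 1.
saved = struct.pack("<4B", 24, 2, 1, 0)

old_view = decode(saved, LAYOUT_V1)  # what the old firmware meant
new_view = decode(saved, LAYOUT_V2)  # what the new firmware would read

print(old_view)
print(new_view)
```

In this toy case the value 24 that used to be a clock multiplier would be read as an overvolt step, which is the kind of mismatch a Load Defaults after flashing is meant to avoid.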
hansolo77 (Author), posted May 4, 2021
How do I go about running a memtest? I see there is an option in the pre-boot options of Unraid, but it doesn't do anything when I select it. Do I need to have a copy of it on a flash drive or something first?
Also, I have rebooted my server probably a dozen times in the last few days. Looks like I've fixed THAT problem. I ran another correcting parity check and it found 24 errors. I feel like I'm getting closer to a solved matter, but I still want to be sure. A memtest sounds like something I can do.
ChatNoir, posted May 4, 2021
You can get memtest here: https://www.memtest86.com/ and create a USB boot drive.
hansolo77 (Author), posted May 4, 2021
Cool, thanks. So I should let it run its tests, and if it finds faults I should replace the RAM? Or is there something else to do? It's been over a decade since I've run a memtest. Kinda interested to see how it does.
soerenderfor, posted May 4, 2021
@hansolo77 - I just read the whole thread. How did the memtest go?
I'm no expert at all, but in plugins you can find one called 'open files'. It shows whether any files are open. If files are open, won't that cause an unclean shutdown? Correct me if I'm wrong.
hansolo77 (Author), posted May 4, 2021
I'm actually trying to get the memtest started. I have it written to a USB flash drive and plugged in, and my BIOS reads it, but if I boot it via UEFI it just sits at a blank screen. If I boot it the regular (non-UEFI) way, I get an error that it's not bootable and to replace it or select a different device. So I'm going to try an older version. My motherboard should be compatible with the latest version (released around September 2019), so I don't know what I'm doing wrong.
==== EDIT ====
Got the latest v4 release, and got a new option in my BIOS for "USB Floppy". Using that, I've now been able to boot the memtest, and it's running. If I remember right, and the program hasn't changed, I can just leave it running overnight. The longer it runs, the more accurate the test. I'll report back tomorrow unless somebody says otherwise.
Also of note: I was able to shut down Unraid successfully, without any delay issues, before loading Memtest. I truly feel I've resolved the original issue. It had to have been a combination of using UnBalance Scatter while the NZBGet docker was attempting to repair a massive download that was missing parts. It was something like 54 GB. Mover was set to run every hour, so it was probably trying to move files onto the array while NZBGet was repairing the download and UnBalance was trying to Scatter. Everything all at once. I didn't even realize it was doing that until I had rebooted (unsafely) and saw it trying to resume the recovery after the fact. I'll know better next time.
If my RAM checks out, I might have a new problem to address. I just completed another self-correcting parity check and it found 24 errors. That's a LOT better than the 138k I was getting before. I'm just concerned because it IS still 24 errors. I wish I could verify the integrity of the data, but it's all media and I don't have anything to compare it against. There's no way to know if anything is corrupted until I try to access it and have issues. Oh well. Thanks again for the help though, everybody. Still learning!
ChatNoir, posted May 5, 2021
The only acceptable number of parity errors is 0.
hansolo77 (Author), posted May 5, 2021
It's been about 24 hours. I've got 0 errors on the memtest with 2 passes and another 3/4 of a third pass. I believe the RAM is not the issue.
codefaux, posted May 6, 2021
I know this is late, but the Unraid boot media should contain memtest and a boot entry to reach it, unless the end user manually removed it. I can understand not noticing it; it's in a text menu that only shows up for five seconds or so when you first boot.
And yes, basically run memtest for a really long time (it has a full pass counter and an error counter), and if there are errors, either a) you're overclocking, so stop it, or b) your RAM is bad, so replace it. There are some rare edge cases (BIOS updates, cleaning contacts, etc.), but generally the cause is more likely to be bad RAM or RAM compatibility with the motherboard.
hansolo77 (Author), posted May 6, 2021
I saw the Memtest86 entry in the Unraid boot menu, but when I select it, nothing happens. It might be that it's using the same version I was unable to load. But yeah, I think the memory is fine.
I'm running another parity check, and it's still finding errors. I don't know what's causing it. I disabled all my dockers and VM, rebooted (perfectly) and started it. Nothing is running that could be accessing the array. Yet it's still finding errors after all these corrections.
JorgeB, posted May 6, 2021
29 minutes ago, hansolo77 said: "I saw the Memtest86 entry in the Unraid boot menu, but when I select it, nothing happens."
It won't work if you're booting UEFI, only with CSM/Legacy BIOS.
hansolo77 (Author), posted May 7, 2021
Completed another correcting parity check. Down to 20 errors. Still pretty concerned about it. Not sure what else to try. Is it possible my parity drives are just corrupted, and I could "rebuild" them by removing them and re-adding them? I've only been playing with Unraid about a year. Hard to believe I've got such a massive problem with my parity after all this.
Attached: kyber-diagnostics-20210506-2242.zip
Vr2Io, posted May 7, 2021
To be honest, I feel you are just wasting your time.
(1) I highly recommend you use memtest86 v9 (UEFI boot only) and set it to use all CPUs. I know you had a problem booting v9; please try to troubleshoot why it fails. Once you have created the USB, you should back up its three folders/files, which take about 7 MB. Then you don't need to create the USB every time; just copying them to a FAT32 USB stick is enough to run memtest86 v9. For the UEFI boot problem, please check the BIOS settings and choose the boot device labeled "UEFI XXXXXXXX". If you can't boot it, please reset the BIOS settings. Please note the screen can stay blank for over a minute, and initialization can take another three minutes or more.
(2) For the Unraid parity check, you don't need to wait for it to go through the whole 12 TB disks. The error count itself is meaningless; what you need is zero errors. Note the time and the position on the disk at which the first error occurs (in your log the first error came only about 15 minutes in), then stop and restart the correcting parity check. Repeat: run until an error, stop, start again. If errors still show up at random, the root problem has not been fixed.
May 5 20:35:48 Kyber kernel: Linux version 5.10.28-Unraid (root@Develop) (gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021
May 5 20:50:19 Kyber kernel: md: recovery thread: Q corrected, sector=142011384
(3) Keep ruling out other hardware until the check is error free. I would suggest the HBA as the next candidate, or testing with two spare disks in a new config.
1 hour ago, hansolo77 said: "Is it possible my parity drives are just corrupted, and I could "rebuild" them by removing them and re-adding them? I've only been playing with Unraid about a year. Hard to believe I've got such a massive problem with my parity after all this."
No, that won't help.
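As a rough illustration of "note the time and position of the first error", the Python sketch below works only from the two log lines quoted in the post above. Two things are assumptions rather than facts from the thread: that the reported sector is in 512-byte units, and that the year is 2021 (syslog timestamps omit it). The first line is the kernel boot message, not the start of the parity check, so the elapsed time is only an upper bound on how far into the check the first correction happened.

```python
import re
from datetime import datetime

# The two syslog lines quoted in the thread.
LOG_LINES = [
    "May 5 20:35:48 Kyber kernel: Linux version 5.10.28-Unraid (root@Develop) "
    "(gcc (GCC) 9.3.0, GNU ld version 2.33.1-slack15) #1 SMP Wed Apr 7 08:23:18 PDT 2021",
    "May 5 20:50:19 Kyber kernel: md: recovery thread: Q corrected, sector=142011384",
]

SECTOR_BYTES = 512  # assumption: sector numbers are in 512-byte units
YEAR = 2021         # assumption: syslog lines carry no year

CORRECTED = re.compile(r"md: recovery thread: (\w+) corrected, sector=(\d+)")


def parse_time(line: str) -> datetime:
    """Parse the leading 'May 5 20:35:48' style timestamp of a syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def first_correction(lines):
    """Return (time, label, sector) of the first 'corrected' entry, or None."""
    for line in lines:
        match = CORRECTED.search(line)
        if match:
            return parse_time(line), match.group(1), int(match.group(2))
    return None


boot_time = parse_time(LOG_LINES[0])
when, label, sector = first_correction(LOG_LINES)
elapsed = when - boot_time
offset_gb = sector * SECTOR_BYTES / 1e9

print(f"first correction ({label}) {elapsed} after boot, "
      f"roughly {offset_gb:.1f} GB into the disks")
```

Running the same extraction after each restart of the correcting check would show whether the first error keeps landing at the same position or moves around, which is the distinction the stop-and-repeat procedure is after.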
JorgeB, posted May 7, 2021
Note also that after the problem is fixed, the first check may still find sync errors, but the second one should find 0.
hansolo77 (Author), posted May 7, 2021
Thanks for the advice.
10 hours ago, Vr2Io said: "Please note the screen can stay blank for over a minute, and initialization can take another three minutes or more."
I didn't know the UEFI version of memtest would show a blank screen for so long. I just assumed it wasn't working. I never got an error message or anything, so maybe it was working the whole time and I was just wrong. I'll try it again.
10 hours ago, Vr2Io said: "Note the time and the position on the disk at which the first error occurs (in your log the first error came only about 15 minutes in), then stop and restart the correcting parity check. Repeat: run until an error, stop, start again."
That makes too much sense! I never thought about that. I will do that next. When I do that test, should I be doing it as a repair or just a scan? Also, what if the problems don't crop up until like the last hour? Bummer lol.
10 hours ago, Vr2Io said: "(3) Keep ruling out other hardware until the check is error free. I would suggest the HBA as the next candidate, or testing with two spare disks in a new config."
I hope it's not the controller. I mean yeah, anything is possible. I bought this one from the eBay seller "The Art of Server". Highly recommended seller. Is there a way to test the controller directly?
8 hours ago, JorgeB said: "Note also that after the problem is fixed, the first check may still find sync errors, but the second one should find 0."
After I posted last night with the diagnostic logs, I started another non-correcting check. It is almost 12 hours in and has found 0 errors. [fingers crossed]
JonathanM, posted May 7, 2021
30 minutes ago, hansolo77 said: "I hope it's not the controller."
Do you have active cooling on the controller, or at least constant air movement? Server-grade controllers assume they will be in rack-mount or other flow-through designs; they must not sit in stagnant air.
hansolo77 (Author), posted May 7, 2021
1 minute ago, jonathanm said: "Do you have active cooling on the controller, or at least constant air movement? Server-grade controllers assume they will be in rack-mount or other flow-through designs; they must not sit in stagnant air."
I don't think the controller has a fan on it, but I do have a fan blowing air across the entirety of my expansion cards. I also have all my fans set to full power rather than ramping up based on temps.
Vr2Io, posted May 7, 2021
40 minutes ago, hansolo77 said: "When I do that test, should I be doing it as a repair or just a scan?"
Repair (correcting), because a scan can't rule out whether errors are still occurring randomly or unpredictably.
hansolo77 (Author), posted May 7, 2021
2 minutes ago, Vr2Io said: "Repair (correcting), because a scan can't rule out whether errors are still occurring randomly or unpredictably."
Should I stop my current non-correcting scan then? It's still found 0 errors, and I'm pretty confident it's passed the mark where the first error occurred yesterday.
Vr2Io, posted May 7, 2021
3 minutes ago, hansolo77 said: "Should I stop my current non-correcting scan then? It's still found 0 errors, and I'm pretty confident it's passed the mark where the first error occurred yesterday."
Let it continue.
I noticed you said a parity check takes 1.5 days. Could you tell me how much total bandwidth it reaches when a parity check or correction starts? It is shown at the bottom of the array statistics.
Before buying another HBA, I would use two spare disks in a new config to verify whether they test clean on the onboard ports and the problem appears only on the HBA.
hansolo77 (Author), posted May 7, 2021
Total bandwidth is averaging 3.2 GB/s. Each disk is hitting around 155.5 MB/s.
When you say use 2 spare disks, you mean a parity and a data disk, right? If it comes to that, I would actually have to BUY some drives. I've been upgrading my drives to 12 TB and have been giving my lower-capacity drives to my brother, so I have no spares.
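For a sense of scale, the figures in the post above can be put through some back-of-the-envelope arithmetic. This Python sketch assumes decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB) and, unrealistically, that every disk holds 155.5 MB/s for the entire check; the variable names are only for illustration.

```python
# Figures reported in the thread.
total_bandwidth_mb_s = 3.2 * 1000   # 3.2 GB/s total, assuming decimal units
per_disk_mb_s = 155.5               # reported per-disk rate
disk_size_mb = 12 * 1_000_000       # 12 TB disk, assuming decimal units

# How many disks reading in parallel the total would correspond to.
implied_disks = total_bandwidth_mb_s / per_disk_mb_s

# Time to read one full 12 TB disk if the rate never dropped (an assumption).
hours_at_constant_speed = disk_size_mb / per_disk_mb_s / 3600

print(f"implied disks reading in parallel: {implied_disks:.1f}")
print(f"full pass at constant speed: {hours_at_constant_speed:.1f} hours")
```

Under those assumptions the total corresponds to roughly 20 disks reading at once. The constant-speed estimate is a lower bound on the duration of a full pass, since drives generally slow down toward the end of the platter and other activity can reduce throughput.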
Vr2Io, posted May 7, 2021
2 minutes ago, Vr2Io said: "Let it continue."
Sorry, I changed my mind. If it needs too much time to complete, I would stop it and run the UEFI memtest86.
hansolo77 (Author), posted May 7, 2021
2 minutes ago, Vr2Io said: "Sorry, I changed my mind. If it needs too much time to complete, I would stop it and run the UEFI memtest86."
Ok, sounds good. I'm off work today so I can spend time working on this. Next days off are Sunday and Monday, then a long stretch of working. Whatever I need to do, I'm on it.
Vr2Io, posted May 7, 2021
7 minutes ago, hansolo77 said: "Total bandwidth is averaging 3.2 GB/s."
That's great. To be honest, my arrays have just 10 disks, or 16 disks at most, so I never reach such high bandwidth usage. Most people wouldn't reach this level.
hansolo77 (Author), posted May 7, 2021
I actually thought that was a little slow, considering the controller is supposed to do 6 GB/s. I dunno, it might be a limitation of the backplane used in the case, or the drives themselves.