Joseph

[Resolved] 5 Errors After Every Parity Check

166 posts in this topic Last Reply


So did you flash the card on a Windows machine or from the command line? Considering the relatively low cost, I am wondering if I should swap out my cards for these. I am not sure my cards are the issue, but I am still running with a single parity drive following two failed attempts at creating the 1st parity. I am waiting for a new 8TB drive to finish preclear and will try again.

One parity drive failure was a dying drive but I am not sure what happened with the first one.

 

 

5 hours ago, crowdx42 said:

So did you flash the card on a Windows machine or from the command line? Considering the relatively low cost, I am wondering if I should swap out my cards for these. I am not sure my cards are the issue, but I am still running with a single parity drive following two failed attempts at creating the 1st parity. I am waiting for a new 8TB drive to finish preclear and will try again.

One parity drive failure was a dying drive but I am not sure what happened with the first one.

 

 

It probably wouldn't hurt to swap them out if it offers better compatibility. FWIW, I didn't try the Windows route; I flashed from a command line with a USB stick. I had to cover 2 pins with electrical tape to get it to boot initially. Kudos to johnnie.black for steering me in the right direction!

 

Like I said earlier, I experienced a similar problem. I was convinced it was a bad PS; it turned out to be a combo of faulty Y-splitter power cables and poor connections made by me. After re-seating everything and replacing the el cheap-o cabling, that problem went away. It's a PITA, but give it a shot; it might work. I'm running parity check number 2 and will let you know if the errors went away.

Edited by Joseph

21 hours ago, Joseph said:

Well, I am currently starting a parity build for the first parity drive; the second parity drive was fine. For the new parity drive I got an 8TB WD, which passed a preclear, and I also upgraded from a Corsair HX850 to an HX1200. I also moved the parity drive so that the two parity drives are on separate controllers.
All the drive power cabling goes directly into the backplane of the Norco 4220, and the cables are all modular from the Corsair PSU.
Fingers crossed everything runs without any errors. I have a second 8TB which I will install if the first parity builds with no issues :P
 

 


Unless you're just getting a batch of bum hard drives (possible, but highly unlikely), or the drives just can't handle the unRAID platform (I wound up getting rid of most of my old drives that didn't meet the criteria), there are only a few things left that I can think of: HBA, PS, data/power cabling and backplane. I doubt it's the PS, since you just replaced it. (Side note: as I'm sure you know, make sure you're in single rail mode.)

 

I guess you could try swapping 2 HDD cables around on the backplane and see whether the problem follows the cable or stays put, to narrow down where the problem is or isn't. Then do the same for the HBA A/B port cables. Seems like that would narrow it down to one of the 3 remaining things. FYI, I know just enough to get myself in trouble, so consider the source. :D

 

Afterthought: Are you using ECC memory?

Edited by Joseph
after thought


So I have done pretty much everything you mentioned. I moved the drive to a different controller by moving it to a different drive bay. The only thing that I have not swapped is the backplane, BUT everything was working a little over a week ago with no parity errors.
The weird thing is that the second parity disk checked parity without any errors; it is only the first parity drive that had errors. BUT I did get unlucky with drives: the drive I swapped after the first parity issue had bad sectors and then also some SMART errors, and I am now RMAing that drive.

I am not using ECC memory, just two sticks of DDR3.

28 minutes ago, crowdx42 said:

So I have done pretty much everything you mentioned. I moved the drive to a different controller by moving it to a different drive bay. The only thing that I have not swapped is the backplane, BUT everything was working a little over a week ago with no parity errors.
The weird thing is that the second parity disk checked parity without any errors; it is only the first parity drive that had errors. BUT I did get unlucky with drives: the drive I swapped after the first parity issue had bad sectors and then also some SMART errors, and I am now RMAing that drive.

I am not using ECC memory, just two sticks of DDR3.

Yeah, I gotcha... then probably the 'easiest' thing to try is replacing the HBAs (look, we're back where we started! lol). Relatively speaking, it's inexpensive. It just doesn't make sense that everything was working and now it's not. When you said you moved stuff to a new case, that's what got me focused on the cabling route. (Again, I also based that on my personal experience. :S)

 

But if that doesn't fix it, maybe try running a robust memory test overnight since it's non-ECC; it sucks being down that long though. Also, you could recheck SMART for some of the drives in question on a known good box and see what happens. Grab another USB stick, put the unRAID trial on it and just test SMART one drive at a time. Let me know how it goes.

 

Afterthought: do you have the latest mobo & controller BIOSes? If they pass SMART, you could also 'stress test' some of the drives in question via preclear on the known good box.

Edited by Joseph
after thought


I have not messed with the BIOS on MB or controllers BUT it would be weird that they worked and then did not work. Also everything was working in the new setup and it had done two consecutive weekend parity checks with no errors, then it gave me around 200 errors twice in a row and then freaked out with 17k errors which I assumed was a drive failing (which I did have a drive failing but not one of the parity drives.)

I have my fingers and toes crossed that the new parity drive will fix the issue. The frustration with issues like this is that it makes the full system not trustworthy and defeats the purpose of backing up to it etc :(

On 4/3/2017 at 10:04 AM, EdgarWallace said:

I have installed a SAS2LP in my backup server and disabled VT-d, and I am still having parity check errors.
I ordered a Dell PERC H200 today and will report back on whether that card resolves my issues.
Btw, is it safe to run the parity check with the "Write corrections to parity" option once the new controller is installed?

Yes, as long as your data drives don't have any corruption on them, writing corrections to parity will 'freshen' the parity data so a data drive can properly be rebuilt.
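
To illustrate why that caveat matters, here is a minimal sketch assuming plain XOR single parity (the P drive only; the Q calculation in a dual-parity array is different and not shown). The function names and toy byte values are made up for illustration and are not unRAID's actual code.

```python
from functools import reduce

def xor_parity(data_blocks):
    """Return the byte-wise XOR of equally sized blocks (single 'P' parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks))

def rebuild_missing(surviving_blocks, parity):
    """Rebuild the one missing block from the surviving blocks plus parity."""
    return xor_parity(surviving_blocks + [parity])

# Toy array: three "data drives" holding one 2-byte block each (illustrative values).
disk1, disk2, disk3 = b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"
parity = xor_parity([disk1, disk2, disk3])

# Healthy case: disk2 dies and is rebuilt exactly from the others plus parity.
rebuilt_ok = rebuild_missing([disk1, disk3], parity)

# Bad case: disk3 is silently corrupted, then a correcting check rewrites
# parity so that it matches the corrupted data.
disk3_bad = b"\xaa\x54"
parity_after_correction = xor_parity([disk1, disk2, disk3_bad])

# Parity now agrees with the bad byte, so later checks would report no errors,
# and a rebuild of disk3 would faithfully reproduce the corruption.
rebuilt_after = rebuild_missing([disk1, disk2], parity_after_correction)
```

In other words, a correcting check makes parity agree with whatever is on the data drives; it cannot tell which side was wrong.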

 

Currently, I'm doing 2 parity checks: one to 'fix' the 5 errors I was receiving with the old controller, and a second one to see if the errors went away. I will post my findings when it's through. If you (or I) are still receiving parity errors after replacing the HBA, then there's something else going on that needs to be addressed.

 

Has your card come in yet?

9 minutes ago, crowdx42 said:

The frustration with issues like this is that it makes the full system not trustworthy and defeats the purpose of backing up to it etc :(

 

Indeed!! I felt like that too... it's been a long, interesting year trying to get it to work the way I want it to. I had to rethink some things I had on my old box and I went way over budget with hardware. But if I can resolve these last parity errors, I will be extremely happy with the scalability and functionality of unRAID.

 

The 2 remaining concerns are a little more esoteric: combating data rot and ransomware. I hope I can soon spend some time understanding them and then implementing solutions.

 

 

 


NEW CARD: DID NOT FIX PROBLEM!!

FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!!  FAIL!! 

 

Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069768
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069800

bum hard drive(s)?

I'm open to any ideas/suggestions.

 

Is there a 1-click SMART report that can be run on all drives in the array? Or does each one have to be done manually?
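
One possible way to script this, offered as a rough sketch rather than a built-in unRAID feature: it assumes the smartmontools `smartctl` tool is installed and run as root, and uses its `--scan`, `-H` (health) and `-A` (attributes) options. The helper names are hypothetical, and drives behind some controllers may need an extra `-d` device-type option that is not handled here.

```python
import subprocess

def parse_scan(scan_output):
    """Extract device paths (e.g. /dev/sda) from `smartctl --scan` output."""
    devices = []
    for line in scan_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/"):
            devices.append(fields[0])
    return devices

def smart_report(device):
    """Run `smartctl -H -A` on one device and return its text output."""
    # smartctl uses non-zero exit codes as status bits, so do not raise on them.
    result = subprocess.run(
        ["smartctl", "-H", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

def report_all():
    """Scan for drives and collect one report per device."""
    scan = subprocess.run(
        ["smartctl", "--scan"], capture_output=True, text=True, check=False
    )
    return {dev: smart_report(dev) for dev in parse_scan(scan.stdout)}
```

Calling `report_all()` would return a dictionary keyed by device path, which could then be printed or saved to a file per drive.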

Edited by Joseph
update


Thanks man, I'm looking at the previous log to see if the sector is the same.


If you didn't reboot since both checks, post the diagnostics.


Btw, are you on the latest unRAID version? I am on the latest and I am wondering if it could be related. If I recall correctly, it was just before I started having issues that I updated to the latest revision. It's pretty unlikely the parity algorithm changed, but it is making me wonder lol


from Apr 2 (old hba):

Apr  2 04:59:10 Tower kernel: md: recovery thread: PQ corrected, sector=3519069768
Apr  2 04:59:10 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr  2 04:59:10 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr  2 04:59:10 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr  2 04:59:10 Tower kernel: md: recovery thread: PQ corrected, sector=3519069800

 

from Today (replacement hba; 1st chk)

Searched the log; can't find them because unRAID was rebooted.

 

from Today (replacement hba; 2nd chk, still running)

Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069768
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069800

full syslog from today (2nd chk, still running) attached.

This is happening somewhere between 0% and 48.7% of the way through the check.
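
For anyone wanting to compare runs without eyeballing the syslog, here is a small sketch that parses the "recovery thread" lines quoted above and checks whether the same sectors show up on both dates. The regular expression and function name are my own assumptions about the line layout, based only on the excerpts in this post.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Apr  2 04:59:10 Tower kernel: md: recovery thread: PQ corrected, sector=3519069768"
LINE_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d+) \S+ \S+ kernel: md: recovery thread: "
    r"(?P<parity>PQ|P|Q) corrected, sector=(?P<sector>\d+)"
)

def corrections_by_date(log_text):
    """Map each log date to {sector: parity label} for corrected sectors."""
    runs = defaultdict(dict)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            date = " ".join(match["date"].split())  # "Apr  2" -> "Apr 2"
            runs[date][int(match["sector"])] = match["parity"]
    return dict(runs)

# The ten lines quoted in this post.
SYSLOG = """\
Apr  2 04:59:10 Tower kernel: md: recovery thread: PQ corrected, sector=3519069768
Apr  2 04:59:10 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr  2 04:59:10 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr  2 04:59:10 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr  2 04:59:10 Tower kernel: md: recovery thread: PQ corrected, sector=3519069800
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069768
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069776
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069784
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069792
Apr  8 15:54:27 Tower kernel: md: recovery thread: Q corrected, sector=3519069800
"""

runs = corrections_by_date(SYSLOG)
old, new = runs["Apr 2"], runs["Apr 8"]

# Sectors corrected in both runs, and those whose P/Q label differs between runs.
same_sectors = sorted(set(old) & set(new))
changed_label = {s: (old[s], new[s]) for s in same_sectors if old[s] != new[s]}
```

With these excerpts, all five sectors appear in both runs, and the first and last went from "PQ corrected" on Apr 2 to "Q corrected" on Apr 8.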

 

Edited by Joseph
update

19 minutes ago, johnnie.black said:

If you didn't reboot since both checks, post the diagnostics.

I just posted diags from about 5 or so minutes ago; that should have all the data from the reboot yesterday going forward.... correct? Or, should I post both logs?

 

Turns out I rebooted first thing this morning :S Would you suggest cancelling the parity check and re-running it to see if the sector or parity drive changes? It looks like the sectors are the same as in the Apr 2 report; only the parity drive affected changed.

Edited by Joseph
corrected

10 minutes ago, crowdx42 said:

Btw, are you on the latest unRAID version? I am on the latest and I am wondering if it could be related. If I recall correctly, it was just before I started having issues that I updated to the latest revision. It's pretty unlikely the parity algorithm changed, but it is making me wonder lol

I'm running 6.3.3, but it's been going on for a while, I think.

15 minutes ago, Joseph said:

I just posted diags from about 5 or so minutes ago; that should have all the data from the reboot yesterday going forward.... correct? Or, should I post both logs?

 

Turns out I rebooted first thing this morning :S Would you suggest cancelling the parity check and re-running it to see if the sector or parity drive changes? It looks like the sectors are the same as in the Apr 2 report; only the parity drive affected changed.

 

Yes, cancel and start again, to see if the errors repeat and also since we can't see if the first check was correcting.

1 minute ago, johnnie.black said:

 

Yes, cancel and start again, to see if the errors repeat and also since we can't see if the first check was correcting.

thanks for your help. Will report back.


100% complete and no errors... now I'm thoroughly confused! :S :( I probably won't check it any more tonight, but will post results in the AM.

 

UPDATE:

Logs attached. I'm gonna start another parity check without rebooting. If all goes well I will reboot and run it again to see if the errors return on reboot.

 

Last check completed on Sun 09 Apr 2017 03:32:02 AM CDT (today), finding 0 errors.
Duration: 10 hours, 35 minutes, 47 seconds. Average speed: 104.9 MB/sec
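
As a sanity check on those numbers, here is a back-of-the-envelope sketch that estimates how much data the check covered and roughly where the earlier corrected sector (3519069768) sits in that range. It assumes "MB" means 10^6 bytes and that the logged sector numbers are 512-byte units; neither is confirmed by the report itself.

```python
# Figures taken from the parity check report above.
duration_s = 10 * 3600 + 35 * 60 + 47      # 10 h 35 min 47 s
avg_speed_mb_s = 104.9

# Assumption: MB = 10**6 bytes.
bytes_checked = duration_s * avg_speed_mb_s * 1_000_000

# Assumption: sector numbers in the md log lines are 512-byte units.
SECTOR_BYTES = 512
first_corrected_sector = 3519069768          # from the earlier syslog excerpts
offset_bytes = first_corrected_sector * SECTOR_BYTES

# Position of that sector as a fraction of the whole check.
fraction = offset_bytes / bytes_checked

print(f"checked about {bytes_checked / 1e12:.2f} TB")
print(f"corrected sectors sit about {offset_bytes / 1e12:.2f} TB in ({fraction:.1%})")
```

Under those assumptions the check covered roughly 4 TB and the corrected sectors sit about 45% of the way in, which lines up with the errors showing up before the 48.7% mark.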

 

Edited by Joseph
update


Well, my parity build succeeded. So now my dilemma: should I do a parity check now and, once that is successful, add the second 8TB parity drive, or add it now and do the parity check once it finishes rebuilding?

The only error I see popping up is the one below, which does not seem to have affected the parity rebuild. I also attached the full logs below.

 

sas: sas_eh_handle_sas_errors: task 0xffff880211e5e200 is aborted

 

unraid_6-diagnostics-20170409-0744.zip

44 minutes ago, crowdx42 said:

Well, my parity build succeeded. So now my dilemma: should I do a parity check now and, once that is successful, add the second 8TB parity drive, or add it now and do the parity check once it finishes rebuilding?

The only error I see popping up is the one below, which does not seem to have affected the parity rebuild. I also attached the full logs below.

 

sas: sas_eh_handle_sas_errors: task 0xffff880211e5e200 is aborted

 

unraid_6-diagnostics-20170409-0744.zip

 

I looked at your logs, but I'm afraid they're over my head. o.O Not to muddy the waters, but I found this:

https://forums.lime-technology.com/topic/56232-parity-check-found-errors/

 

My new issue notwithstanding, it sounds like the SAS2LPs just won't cut it for some unRAID users, so I'd encourage you to replace them first, then do parity checks.

If that isn't an option at this time, I'd run a check first just to see, or you could add the 8TB and rebuild parity, then run a check afterwards to be sure... but again, if it were me, I'd get the cards replaced first.


Did your parity check succeed? I think I will wait to replace my SAS2LPs (I have 3 in total, 2 on the main and 1 on the backup). I am also wondering if I should check what firmware they are on; I am not sure how to do that without pulling them and putting them into a Windows machine :(

7 minutes ago, crowdx42 said:

Did your parity check succeed? I think I will wait to replace my SAS2LPs (I have 3 in total, 2 on the main and 1 on the backup). I am also wondering if I should check what firmware they are on; I am not sure how to do that without pulling them and putting them into a Windows machine :(

With the new card installed, the first parity check corrected 5 errors (this was to be expected). However, the second parity check found the same 5 errors. Parity check 3 came back OK. I'm running check 4 and will post what I find. If it goes well, the plan is to reboot afterward and try it again.

 

Re the SAS2LP flash: if memory serves, the one I bought new needed to be flashed, but there hasn't been a new firmware update for a while. I downloaded .1812 back in May of 2016.

https://www.supermicro.com/products/accessories/addon/aoc-sas2lp-mv8.cfm

 


Sounds like your issues may finally be resolved. I am not sure if you posted this: what motherboard, CPU, memory etc. are you running? My main setup is all Intel, so I am wondering if there is any connection.

Also, for the SAS card update, did you have to check via command line?
