Joseph

[Resolved] 5 Errors After Every Parity Check


15 minutes ago, johnnie.black said:

 

That means it's past the 4TB mark; only your parity drives are larger, so they're the only ones still being checked.

Slightly off-topic, but I suspect that if unRAID is writing data to the array, then only the drives that are changed need to be read, along with the old parity bit(s), to recalculate the new parity bit(s) and write them back to parity; so the rest of the array can remain spun down. Is this correct?

7 minutes ago, Joseph said:

I'm fairly certain if there is a problem reading a drive in the array (whether it's a physical problem or a controller problem), then unRAID will knock it out of the array and let you know it needs to be rebuilt.

Yes, but the issue is in how it's handled. When a read error occurs, the first thing unRAID does is spin up all the drives, compute from parity what SHOULD be in that spot, and issue a write command to put the correct data back. A write failure at that point causes a red ball, and the problem drive is no longer accessed. If everything is working properly and the write succeeds, this has the beneficial end result of allowing a problematic logical sector to be rewritten seamlessly, and if the drive is working properly, the SMART subsystem will remap that physical sector if necessary.

 

In this case, who knows what data will be returned when unRAID spins up all the drives? Will it be correct? Will it be corrupt? It's a gamble.
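
To make the "gamble" concrete, here is a minimal sketch of the reconstruction step, assuming single XOR parity only (a second parity drive uses a different calculation that is not modeled here). The function names and sample values are hypothetical, not unRAID internals.

```python
from functools import reduce
from operator import xor


def compute_parity(blocks):
    """XOR all data blocks at one stripe position into one parity value."""
    return reduce(xor, blocks, 0)


def reconstruct(blocks, parity, bad_index):
    """Rebuild the block at bad_index from parity plus every other disk."""
    others = [b for i, b in enumerate(blocks) if i != bad_index]
    return reduce(xor, others, parity)


# Hypothetical stripe: one small integer per data disk.
stripe = [0b1011, 0b0110, 0b1100]
parity = compute_parity(stripe)

# Disk 1 reports a read error; its content is rebuilt from the rest.
rebuilt = reconstruct(stripe, parity, bad_index=1)

# If another disk silently returns wrong data (for example through a
# flaky controller), the rebuilt value is wrong too, and that wrong
# value is what would be written back.
flaky_read = [0b1011, 0, 0b1101]  # disk 2 returned 0b1101 instead of 0b1100
rebuilt_from_bad_read = reconstruct(flaky_read, parity, bad_index=1)
```

The rebuilt value is only as trustworthy as the reads from every other disk, which is why an unreliable controller makes this recovery path risky.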

1 minute ago, Joseph said:

Is this correct?

 

Yes, with the normal write mode. With turbo write enabled, all disks need to be spun up (except when the array has more than one disk size and the user is writing to the end of a larger disk; in that case only disks of that size or larger need to be spun up).
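
The difference between the two write modes can be sketched as follows, assuming single XOR parity. This is an illustration of the arithmetic, not unRAID's actual code; the function names and values are made up for the example.

```python
from functools import reduce
from operator import xor


def read_modify_write(old_data, new_data, old_parity):
    """Normal mode: read only the target disk and parity, then update parity.

    XOR-ing out the old data and XOR-ing in the new data gives the new
    parity without touching any other disk.
    """
    return old_parity ^ old_data ^ new_data


def reconstruct_write(other_blocks, new_data):
    """Turbo (reconstruct) mode: read every other data disk instead.

    Parity is recomputed from scratch, so the old data and old parity
    never have to be read first.
    """
    return reduce(xor, other_blocks, new_data)


# Hypothetical stripe: one byte per data disk.
stripe = [0x3C, 0xA5, 0x0F]
parity = reduce(xor, stripe, 0)

new_value = 0x77  # new content for disk index 1
parity_normal = read_modify_write(stripe[1], new_value, parity)
parity_turbo = reconstruct_write([stripe[0], stripe[2]], new_value)
```

Both paths arrive at the same parity; they differ in which disks must be read, which is why normal mode can leave the other data disks spun down and turbo write cannot.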

4 minutes ago, johnnie.black said:

except when the array has more than one disk size and the user is writing to the end of a larger disk; in that case only disks of that size or larger need to be spun up

Is reconstruct write that intelligent? Or does it always keep everything spun up anyways?

2 minutes ago, jonathanm said:

Is reconstruct write that intelligent? Or does it always keep everything spun up anyways?

 

No intelligence required :), e.g., you have an array with 4TB and 8TB disks and you're writing to the last half of one of the 8TB disks: all the 4TB disks will be idle, and they will spin down when they reach the set idle time.
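
A small sketch of which disks take part in the parity calculation at a given offset, assuming parity at a position only involves disks that actually extend that far. The disk names and sizes are hypothetical, and this models the parity arithmetic only: other reads, such as filesystem metadata, may still touch a disk and keep it spinning.

```python
def disks_covering(offset_tb, disk_sizes_tb):
    """Return the disks that hold data at the given offset (in TB)."""
    return [name for name, size in disk_sizes_tb.items() if size > offset_tb]


# Hypothetical mixed-size array.
array = {"disk1": 4, "disk2": 4, "disk3": 8, "disk4": 8}

# Writing at the 6TB mark: the 4TB disks have no sector at this offset.
involved_late = disks_covering(6, array)

# Writing at the 2TB mark: every disk has a sector at this offset.
involved_early = disks_covering(2, array)
```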

8 minutes ago, johnnie.black said:

 

No intelligence required :), e.g., you have an array with 4TB and 8TB disks and you're writing to the last half of one of the 8TB disks: all the 4TB disks will be idle, and they will spin down when they reach the set idle time.

Hmm. I guess I just assumed the table of contents was within the first few sectors, thus any writes would need all drives spun up anyway.

4 minutes ago, jonathanm said:

Hmm. I guess I just assumed the table of contents was within the first few sectors, thus any writes would need all drives spun up anyway.

 

No, they stay spun down; I've noticed this behavior on my mixed-size arrays.

10 hours ago, johnnie.black said:

unRAID tunable settings also play a role.

ahhh.... I forgot about tunable! I ran it quite a while back when I had the drive configuration all wrong (fast drives on the slow controller and vice versa); that was around 6.0 or 6.1. So after I move everything off the Marvell controller and recheck parity, I'll run tunable again. Thanks for the tip. 8)

26 minutes ago, Joseph said:

ahhh.... I forgot about tunable! I ran it quite a while back when I had the drive configuration all wrong (fast drives on the slow controller and vice versa); that was around 6.0 or 6.1. So after I move everything off the Marvell controller and recheck parity, I'll run tunable again. Thanks for the tip. 8)

 

The utility to find the best tunables doesn't work optimally on v6.2/6.3, but looking at your current settings, they are no good:

 

md_num_stripes="6008"
md_sync_window="2704"
md_sync_thresh="192"

 

When using an LSI controller, md_sync_thresh needs to be closer to md_sync_window; change to these:

 

md_num_stripes="4096"
md_sync_window="2048"
md_sync_thresh="2000"
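
To see how far apart the two sets of values are, the following sketch parses both and compares md_sync_thresh to md_sync_window as a ratio. The parsing helper and the ratio itself are illustrative assumptions; the thread only says the threshold should be "closer" to the window on an LSI controller and gives no numeric cutoff.

```python
def parse_tunables(text):
    """Parse lines of the form key="value" into a dict of integers."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = int(value.strip().strip('"'))
    return settings


def thresh_ratio(settings):
    """Hypothetical helper: md_sync_thresh as a fraction of md_sync_window."""
    return settings["md_sync_thresh"] / settings["md_sync_window"]


old = parse_tunables('''
md_num_stripes="6008"
md_sync_window="2704"
md_sync_thresh="192"
''')

new = parse_tunables('''
md_num_stripes="4096"
md_sync_window="2048"
md_sync_thresh="2000"
''')

old_ratio = thresh_ratio(old)  # threshold is a small fraction of the window
new_ratio = thresh_ratio(new)  # threshold sits just under the window
```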

 

 

Edited by johnnie.black

1 minute ago, johnnie.black said:

When using an LSI controller, md_sync_thresh needs to be closer to md_sync_window; change to these:

 

md_num_stripes="4096"
md_sync_window="2048"
md_sync_thresh="2000"

 

 

Thank you so much...will do!

21 hours ago, EdgarWallace said:

@Joseph, this was at the start of the parity check, the average was about 90MB/s, which is about the same speed I was getting with my AOC-SAS2LP-MV8.

 

@johnnie.black thanks a lot, see the outcome below....looks pretty good right?

Parity Check ended with 24 errors. I am going to run a Parity Check once again with "Write corrections to parity" option

 

SUCCESS

 

A second parity check, run after xfs_repair /dev/md5, showed 14 sync errors, which were corrected. The final parity check reported 0 errors.

 

I consider my issues fixed by removing the AOC-SAS2LP-MV8 and installing a Dell PERC H200. Good to be able to rely on my backup server (again).

 

Thanks to all who were helping, mainly @johnnie.black and @Fireball3


So after my last parity check I got 4 errors. I got the Dell HV52W PERC H310 off eBay and was looking at flashing it. From what I am seeing on the links here, I should be using the LSI SAS2008 Controllers(P10).zip for this card? Obviously I don't want to brick the card, so I would really appreciate confirmation. I have made a bootable USB stick with Rufus, copied the files from the zip to the USB drive, and renamed 1_sas2flash_x86_P10.exe to sas2flash_x86_P10.exe so that the bat file executes that flash. Is this all correct, or am I missing anything?
Thanks for all the help

41 minutes ago, Fireball3 said:

See my sig, there is a toolset for the H310 with a more up-to-date IT firmware.

So can I flash directly to the new firmware or do I have to flash to an interim firmware first? This card I believe should be running stock firmware.

1 hour ago, Fireball3 said:

See my sig, there is a toolset for the H310 with a more up-to-date IT firmware.

@Fireball, can you confirm if step 5.2 of P20 puts the interim firmware on the card, in this case P7, and then step 5.3 puts P20 on the card? Somehow I got all turned around and may be the source of confusion on this thread.

@Fireball, can you confirm if step 5.2 of P20 puts the interim firmware on the card, in this case P7, and then step 5.3 puts P20 on the card?

Yes, correct.
But the P7 step is mandatory on the way to P20!

On 13/04/2017 at 6:29 PM, jonathanm said:

Hmm. I guess I just assumed the table of contents was within the first few sectors, thus any writes would need all drives spun up anyway.

 

This made sense and made me doubt my earlier post. I was pretty sure I'd seen that behavior, but I just checked on one of my servers and you are right: after each file is written there's a small number of reads from each disk to update the TOC, so no disks will spin down during writes with turbo write enabled. My mistake :$


 

On 4/10/2017 at 1:36 PM, johnnie.black said:

Follow up, this is what I would recommend:

 

Parity - currently on onboard Intel port1 (SATA3) - move to free Intel port3 (SATA2)

cache2 - currently on onboard Intel port2 (SATA3) - leave as is

cache - currently on Marvell - move to Intel port1 (SATA3)

parity2 - currently on Marvell - move to free Intel port4 or free LSI port

 

 

 

 

The above steps are complete....the array is online, but the syslog shows about 1000 errors on ATA4 and btrfs errors. Any thoughts?

Parity check on hold until I can get this sorted.  Thanks!!

 

Edited by Joseph

2 minutes ago, johnnie.black said:

Replace SATA cable on SSD serial ... 245F.

working on it.... you're gonna have to show me how you figured that out!

On 4/16/2017 at 5:31 PM, johnnie.black said:

Replace SATA cable on SSD serial ... 245F.

OK, so the cable has been replaced; however, Cache & Cache 2 are online but no longer mountable.

Also, the error count is down to 34, doesn't appear to be ATA related, but do you think any of them need to be addressed?

 

FYI, I disabled the onboard Marvell controller in the bios.


 

Edited by Joseph
additional info


So I have just finished the 3rd parity check after several tweaks and finally got zero errors. Is it normal that it should take multiple parity checks to finally get zero errors?

1 hour ago, crowdx42 said:

So I have just finished the 3rd parity check after several tweaks and finally got zero errors. Is it normal that it should take multiple parity checks to finally get zero errors?

My guess is there would be two parity checks: one to correct errors that were generated by the old card that's been replaced, and a second that should come up clean to 'prove' there are no more errors.
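
The two-pass idea can be sketched as a toy model, assuming single XOR parity, data disks that read back correctly, and hardware that no longer introduces new errors. The function and the sample data are hypothetical; if reads are still unreliable, later passes can keep reporting errors.

```python
from functools import reduce
from operator import xor


def parity_check(data_disks, parity, correct):
    """Count positions where stored parity disagrees with the data.

    When correct is True, each mismatch is also rewritten in place,
    similar in spirit to a check run with corrections written to parity.
    """
    errors = 0
    for pos in range(len(parity)):
        expected = reduce(xor, (disk[pos] for disk in data_disks), 0)
        if parity[pos] != expected:
            errors += 1
            if correct:
                parity[pos] = expected
    return errors


# Hypothetical two-disk array with four positions each.
data = [[1, 2, 3, 4], [5, 6, 7, 8]]
parity = [a ^ b for a, b in zip(*data)]

# Simulate bad parity left behind by a faulty controller.
parity[0] ^= 0xFF
parity[2] ^= 0x01

first_pass = parity_check(data, parity, correct=True)    # finds and fixes
second_pass = parity_check(data, parity, correct=False)  # verifies only
```

In this model the first pass reports the stale positions and repairs them, and the second pass comes back clean, which is the 'proof' described above.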

 

But then again, what do I know. I'm stuck with an unmountable cache and just recently learned about a table of contents on unraid....

 

hoping someone can provide answers for both crowdx42 and myself

Edited by Joseph

8 hours ago, Joseph said:

Also, the error count is down to 34, doesn't appear to be ATA related, but do you think any of them need to be addressed?

 

ATA errors are gone. ACPI errors can usually be ignored; a BIOS update may help with some.

 

The problem is that the bad cable corrupted your cache pool. Assuming the pool was the default raid1, try powering down, disconnecting the cable to SSD ...245F, powering back on, and seeing if it mounts. If not, use this FAQ to try and recover your cache data:

 

 

11 hours ago, Joseph said:

My guess is there would be two parity checks: one to correct errors that were generated by the old card that's been replaced, and a second that should come up clean to 'prove' there are no more errors.

 

But then again, what do I know. I'm stuck with an unmountable cache and just recently learned about a table of contents on unraid....

 

hoping someone can provide answers for both crowdx42 and myself

This makes sense to me. The change that seemed to make the difference for me was turning off VT-d for virtual machine support. I still have the two Dell 310 cards, which I have not installed yet; in that process I will be switching out the motherboard for one with a newer chipset and 3 PCIe x16 slots instead of the 2 on the current board, which allows me to add my 4-port NIC.

 

