Sync Error Detected


TODDLT

Last month I did a major upgrade involving both parities and the replacement or relocation of 4 out of 10 array drives. All of this happened without incident or error.  This included going from 3 to 6 TB parity.   Only one array drive is 6 TB.

 

Today is the first parity check since then, and it's now past the 3 TB mark, i.e. there is likely no new data in the mix.  I am showing 5 detected errors with 3.5 hrs to go.

 

I forgot to turn off Plex last night and as it turns out the DVR was running for 3-4 hours of the parity check time.  In theory this shouldn't hurt it, but I have to wonder with errors and data being written at the same time over that duration.

 

This is only the 2nd time in 5 or 6 years of unRaid that I've had errors on a parity check.  What is the next step in troubleshooting, or would you just correct the errors and move on?

 

My signature is up to date with configuration and drive types.

 

Thanks for any help.

Link to comment
8 hours ago, johnnie.black said:

There's a known issue with the SAS2LP where some disks will always return 5 sync errors in every check done after a reboot. Correct the current errors, then run another check without rebooting, and if no errors are found, run another one after rebooting.

If this is a known issue, then are the errors in fact errors?  I.e. should they be corrected?  Is the SAS2LP causing an error in the parity, or is it falsely reporting that an error exists?

 

There used to be an issue (previously fixed) with the SAS2LP causing a huge drag during parity checks.  I think that was a driver issue.  This is the first time I've had it cause parity check errors.  Maybe I need to start looking for another LSI controller because I don't want to have this issue every month.  It's not unusual to reboot at least once during the month for some reason or another.

Link to comment

Thanks for this.  I see my original plea for help on the speed issue showed up in your thread and hopefully has been helpful to the community....   

 

There were two drive types introduced in this last upgrade.  Primarily 3 ea. HGST HDN726060ALE614 - 6 TB drives and 1 ea. 500 GB Samsung 860 EVO SSD.  The SSD was a replacement at the last minute as I appeared to be one of the few with a failing SSD.  Also of note, I do/did have several ST3000DM001 previously, but I have no idea which controller they were connected to at the time and may never know.

 

I doubt there has been any checking on that combination, but if you have any thoughts please do chime in.   

 

When I get home I will check which controller these drives are connected to.   I THINK the SSD and two of the new HGST drives are on the MB, and one of the HGST's is on the Marvell.  There may also be a warm spare HGST on the Marvell, but I'm assuming this is not playing into the issue.  I'll also see where my ST3000DM001's are connected and see if I can figure out which controller.

 

I have not run a correcting parity sync yet (I think it defaults to a non-correcting parity check each month), so when I get home (later tonight) I'll make sure all the HGST's are moved to my LSI controller and re-run the parity check.   

 

Any thoughts please let me know and I'll check here before proceeding.

Edited by TODDLT
Link to comment

I have long since moved on to a LSI 16 port controller, so not in a position to offer any real advice beyond what's in the thread.  And TBH, the sample size of drives tested is really small.  All I can say with certainty is that those particular drives as noted DID cause the 5 errors after a reboot.  The common denominator was the ATA version.

 

EDIT:  And I can't remember if I tested with non-correcting, and I'm not sure if they are "real" errors or not, so my suggestion is to swap stuff around, run a correcting check.  If / when you hit 5 errors, stop the check, shutdown and try something different.

Edited by Squid
Link to comment
4 hours ago, Squid said:

I have long since moved on to a LSI 16 port controller, so not in a position to offer any real advice beyond what's in the thread.  And TBH, the sample size of drives tested is really small.  All I can say with certainty is that those particular drives as noted DID cause the 5 errors after a reboot.  The common denominator was the ATA version.

 

EDIT:  And I can't remember if I tested with non-correcting, and I'm not sure if they are "real" errors or not, so my suggestion is to swap stuff around, run a correcting check.  If / when you hit 5 errors, stop the check, shutdown and try something different.

Running my first test now, and also looking for another LSI controller.  

Link to comment
4 hours ago, Squid said:

I have long since moved on to a LSI 16 port controller, so not in a position to offer any real advice beyond what's in the thread.  And TBH, the sample size of drives tested is really small.  All I can say with certainty is that those particular drives as noted DID cause the 5 errors after a reboot.  The common denominator was the ATA version.

 

EDIT:  And I can't remember if I tested with non-correcting, and I'm not sure if they are "real" errors or not, so my suggestion is to swap stuff around, run a correcting check.  If / when you hit 5 errors, stop the check, shutdown and try something different.

Well one of the two remaining ST3000DM001's is an ATA8-ACS.

All of the new drives were on the LSI Controller.  The ST3000DM001 with the ATA8-ACS was on the Marvell.

AND...  my 5 errors were at the exact same sector locations as your errors were in the thread above.  

 

So having said all that, about to start the parity check.  It will probably run across that location in the middle of the night so I'll check in the morning.

Link to comment
8 hours ago, johnnie.black said:

Those errors are real and can result in data corruption; another user compared checksums after it happened and some files were corrupt.

OK so this may not be going the way I want things to but here it is.   

 

Last night I moved drives off the Marvell controller as recommended and noted above.  This morning there were 5 errors, presumably the same 5.  This was a correcting check.

 

Prior to my upgrade I had never seen the 5 errors occur.

During my "upgrade" a few weeks ago, this was the process I followed:

1. A single old 2 TB array drive was replaced with a new ST3000VN001.  No Marvell controller changes made. 

2. The two parity drives were replaced simultaneously as a single operation with the HGST 6 TB drives.  No Marvell controller changes made.

3. I opened the case and reorganized drives.  This is the first time any actual controller connections were changed, and the 2 - ST3000DM001's that were originally in the array would have been placed into new positions.  As noted above only one of them is an ATA8-ACS drive.  

4. 4 more drives were individually replaced and rebuilt from parity. I believe I recall this order correctly.

 - 1st The HGST 6 TB drive replaced a 3 TB drive.  I am 98% sure the ATA8-ACS drive was not in the array at all at this time. 

 - 2nd The non ATA8-ACS drive was used to replace an older 2 TB drive

 - 3rd The ATA8-ACS drive was used to replace an older 2 TB drive

 - 4th a 500 GB SSD was replaced.  

 

The first actual parity check was done 11/1 when this thread started and I saw the 5 errors.

 

So my question is, would you think the 4 drives replaced AFTER step 3 above, probably/possibly now have 5 corrupt sectors?  It seems the errors are occurring around the 1.8TB mark so the 500GB SSD should be safe.  It also seems if I'm correct about the above order I am probably OK, but anything larger than 1.8TB replaced after the ATA8-ACS drive was put back into the array, is now suspect?
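To make the size reasoning above concrete, here is a minimal Python sketch. It only answers one part of the question: which drives are large enough to hold data at the offset where the errors occur. It relies on the fact that a drive smaller than the error offset has nothing stored at that position. The drive names and the rounded 1.8 TB offset are illustrative assumptions, not values read from the server, and the sketch says nothing about whether a given drive was actually rebuilt from bad parity.

```python
# Sketch only: drive names are hypothetical, sizes loosely follow this thread.
TB = 10**12
GB = 10**9


def drives_spanning_offset(drive_sizes, error_offset):
    """Return the names of drives big enough to hold data at error_offset (bytes)."""
    return sorted(name for name, size in drive_sizes.items() if size > error_offset)


example_drives = {
    "hgst_6tb": 6 * TB,
    "st3000dm001_a": 3 * TB,
    "st3000dm001_b": 3 * TB,
    "ssd_500gb": 500 * GB,
}

# Errors reported around the 1.8 TB mark (approximate, assumed for illustration).
suspects = drives_spanning_offset(example_drives, 1800 * GB)
print(suspects)
```

With these assumed sizes the 500 GB SSD drops out, which matches the reasoning above; the remaining drives would still need to be narrowed down by when each one was rebuilt.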

 

Thoughts?

 

OH, I DID order an LSI card last night too!!  

Link to comment
On 11/2/2018 at 2:14 PM, johnnie.black said:

Unless you have checksums it's impossible to tell if/which disks might have some corruption. Just correct the sync errors; if the data is, for example, mostly video/media files, 5 errors should go unnoticed.

LSI card due Monday or Tuesday.  I assume a correcting parity check should be done after install.
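For anyone who wants the checksums mentioned in the quote for next time, here is a minimal Python sketch of the idea: record a SHA-256 hash per file while the data is known to be good, then re-hash later and list the files whose hash changed. The function names are hypothetical and this is not any particular unRaid plugin; it only uses Python's standard `hashlib` and `pathlib`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large media files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(root, manifest):
    """Return relative paths that are missing or whose hash no longer matches."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

The manifest would be built once on a share and stored somewhere off the array; after a parity check reports errors, `find_mismatches` on the same share points at the affected files. A file that was edited on purpose shows up as a mismatch too, so this works best on data that does not change, such as media.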

Link to comment
On 11/4/2018 at 3:45 AM, johnnie.black said:

Yep

Installed, up and running and will start a parity check in about 30 mins when Plex DVR turns off for the night.  

 

2 questions. 

 

1.  Should I expect to see 5 errors again?  The box was restarted a couple times since the last parity check with the old card in it and that seems to have been the trigger from what I was seeing.  

2.  The LSI has its own splash screen at startup, and you see the card ID itself, "spin up" and it shows connected drives.  I used to see this happen once for the LSI and once for the Supermicro.  Now I have 2 LSI cards installed.  4 drives on one, 8 on the other.  I only see one LSI card splash screen with 4 drives showing (this is the old card).  The new card has 8 drives connected and you never see it run the same startup.  Is this a self test that is not occurring? Is there a setting in the card's BIOS I should change?

Link to comment
4 hours ago, TODDLT said:

1.  Should I expect to see 5 errors again?

Probably.

 

4 hours ago, TODDLT said:

2.  The LSI has its own splash screen at startup, and you see the card ID itself,

Disks for both HBAs should appear on the same splash screen together, but it can depend on the LSI BIOS used/version, board used, etc. Either way it's nothing to worry about; the BIOS is optional.

Link to comment
