Millions of sync errors 3 hours into parity check

PsionStorm · September 15, 2023

About three weeks ago, I added some new drives to my system, and upgraded my parity drive to a 14TB drive. Since then everything has been functioning without error, but this morning my monthly parity check fired up. I'm three hours in, and I've got a scary number of sync errors that I've never seen before. I'm hoping this is just a result of missing a step with the drive additions.

At this point, should I just cancel the scheduled parity check and force a correcting parity check? Or should I see this one through first?

excelsior-diagnostics-20230915-0259.zip

JorgeB · September 15, 2023

1 hour ago, PsionStorm said:

and upgraded my parity drive to a 14TB drive

Did you do a parity swap at this time?

itimpi · September 15, 2023

Has the parity check got beyond the size of the biggest data drive?

Did you use the parity swap procedure to upgrade the parity drive?

If so, it seems that the space beyond the largest data drive is not always correctly zeroed on the new parity drive, and you need to run a correcting check once to fix this. After that, future checks should be error free. However, I do not think that is your problem here.
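To illustrate why the region beyond the largest data drive is expected to be zero, here is a minimal sketch, assuming plain byte-wise XOR parity and toy disk sizes. The function names and the sample bytes are made up for illustration and are not taken from Unraid itself.

```python
def xor_parity(disks, parity_size):
    """Byte-wise XOR of all data disks; bytes past a disk's end count as zero."""
    parity = bytearray(parity_size)
    for disk in disks:
        for offset, byte in enumerate(disk):
            parity[offset] ^= byte
    return bytes(parity)


def count_sync_errors(stored, expected):
    """Number of positions where the stored parity differs from the computed one."""
    return sum(a != b for a, b in zip(stored, expected))


# Toy array: a 4-byte and a 2-byte data disk protected by a 6-byte parity disk.
data_disks = [bytes([0x12, 0x34, 0x56, 0x78]), bytes([0xAB, 0xCD])]
expected = xor_parity(data_disks, parity_size=6)
largest = max(len(d) for d in data_disks)

# Past the largest data disk nothing contributes, so parity must be zero there.
tail = expected[largest:]

# Hypothetical new parity disk whose tail still holds stale, non-zero bytes.
stale = expected[:largest] + b"\xff\xff"
tail_errors = count_sync_errors(stale, expected)
print(tail, tail_errors)
```

In this toy case the only mismatches are in the tail, which is the pattern a single correcting check would clean up.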

It looks as if there may be some other problem that needs looking at first, as your syslog contains continual messages of this type:

Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: attempting task abort!scmd(0x000000001e0cffe4), outstanding for 30528 ms & timeout 30000 ms
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: [sdi] tag#2226 CDB: opcode=0x88 88 00 00 00 00 00 9c b9 b4 40 00 00 00 08 00 00
Sep 12 14:46:15 Excelsior kernel: scsi target1:0:2: handle(0x000b), sas_address(0x5000cca23c0d2c39), phy(2)
Sep 12 14:46:15 Excelsior kernel: scsi target1:0:2: enclosure logical id(0x500605b0098b8100), slot(3)
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: task abort: SUCCESS scmd(0x000000001e0cffe4)

sdi appears to be disk5. These seem to have started on Sep 8 - did you do anything to the system then? Is there any indication in the GUI of possible problems with disk5?
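If it helps to see which devices are affected and how often, a rough sketch like the following could tally the task aborts per device. It assumes the message layout matches the lines quoted above. The regular expressions and the summarize_aborts name are my own, not part of any Unraid tool.

```python
import re
from collections import Counter

# Patterns based only on the syslog lines quoted in this thread.
ABORT_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): attempting task abort")
DEV_RE = re.compile(r"kernel: sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def summarize_aborts(lines):
    """Count 'attempting task abort' events per device name (or SCSI address)."""
    aborts = Counter()
    names = {}
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts[match.group(1)] += 1
        match = DEV_RE.search(line)
        if match:
            # Remember which sdX name belongs to this SCSI address.
            names[match.group(1)] = match.group(2)
    return {names.get(addr, addr): count for addr, count in aborts.items()}


sample = [
    "Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: attempting task abort!scmd(0x000000001e0cffe4), outstanding for 30528 ms & timeout 30000 ms",
    "Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: [sdi] tag#2226 CDB: opcode=0x88 88 00 00 00 00 00 9c b9 b4 40 00 00 00 08 00 00",
    "Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: task abort: SUCCESS scmd(0x000000001e0cffe4)",
]
result = summarize_aborts(sample)
print(result)
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of the sample list.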

I would suggest you:

Cancel the current check, as there is no point proceeding if there are underlying hardware issues.
Do NOT attempt to run a correcting check at this point: if you have a hardware issue, you are more likely to end up corrupting parity.
Carefully check all connections (power and SATA) to disk5. Perhaps when changing drives you slightly disturbed an existing connection or did not quite seat one perfectly.

I am not sure whether at that point you should retry the non-correcting check or do something else, such as an extended test on disk5.

You may want to wait to see if anyone else (in particular @JorgeB) has any other suggestions.

PsionStorm · September 15, 2023

1 hour ago, JorgeB said:

Did you do a parity swap at this time?

I thought I did, but I followed a different process. I added the new drive as a 2nd parity, then pulled the old one and set up a new config. I saw that recommended in a few places and didn't see the article you linked. Looks like that may have been my mistake?

1 hour ago, itimpi said:
It looks as if there may be some other problem that needs looking at first as in your syslog there are continual
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: attempting task abort!scmd(0x000000001e0cffe4), outstanding for 30528 ms & timeout 30000 ms
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: [sdi] tag#2226 CDB: opcode=0x88 88 00 00 00 00 00 9c b9 b4 40 00 00 00 08 00 00
Sep 12 14:46:15 Excelsior kernel: scsi target1:0:2: handle(0x000b), sas_address(0x5000cca23c0d2c39), phy(2)
Sep 12 14:46:15 Excelsior kernel: scsi target1:0:2: enclosure logical id(0x500605b0098b8100), slot(3) 
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: task abort: SUCCESS scmd(0x000000001e0cffe4)
type messages in the syslog (sdi appears to be disk5). These seem to have started on Sep 8 - did you do anything to the system then? Is there any indication in the GUI of possible problems with disk5?

Regarding those errors, I have done extensive troubleshooting trying to resolve them and have been unable to. They occur on all of my SAS drives, but only my SAS drives. See thread here:

Any advice would be appreciated. Since this post I've replaced the SAS card and tried three different sets of cables, with no change in the result.

Edited September 15, 2023 by PsionStorm

JorgeB · September 15, 2023

17 minutes ago, PsionStorm said:

then pulled the old one and set up a new config

And did you leave it as parity2 or assign it to parity1? They are not interchangeable.

PsionStorm · September 15, 2023

4 minutes ago, JorgeB said:

And did you leave it as parity2 or assign it to parity1? They are not interchangeable.

I moved it to parity1.

JorgeB · September 15, 2023

Then the sync errors are normal, just let it sync.

p.s. it will probably be much faster doing a new sync vs a correcting check.
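For anyone wondering why a disk built as parity2 produces errors nearly everywhere when assigned to parity1, here is a toy sketch. It assumes parity1 is a plain XOR and that parity2 behaves like a RAID-6 style Q syndrome over GF(2^8), where each data slot is weighted by a different power of 2. That second assumption is only an illustration of "a different, slot-dependent calculation" and is not taken from Unraid's source. All names and sample bytes are made up.

```python
def gf_mul2(x):
    """Multiply by 2 in GF(2^8) using the polynomial 0x11D."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def p_parity(disks):
    """Plain XOR parity across equally sized data disks."""
    out = bytearray(len(disks[0]))
    for disk in disks:
        for i, byte in enumerate(disk):
            out[i] ^= byte
    return bytes(out)


def q_parity(disks):
    """Q-style parity: the disk in slot n is multiplied by 2**n before XOR."""
    out = bytearray(len(disks[0]))
    for slot, disk in enumerate(disks):
        for i, byte in enumerate(disk):
            for _ in range(slot):
                byte = gf_mul2(byte)
            out[i] ^= byte
    return bytes(out)


disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
p = p_parity(disks)
q = q_parity(disks)

# A disk holding Q data, checked as if it held P data, mismatches on every byte here.
mismatches = sum(a != b for a, b in zip(p, q))

# XOR parity ignores slot order; the Q-style parity does not.
reordered = list(reversed(disks))
print(mismatches, p_parity(reordered) == p, q_parity(reordered) == q)
```

Under those assumptions the two parity disks hold unrelated data, so the check reports errors across the whole disk rather than in one region.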

PsionStorm · September 15, 2023

I believe the current sync is non-correcting - it was my scheduled monthly parity sync and I *think* I turned off correcting. Does it tell you in the logs?

JorgeB · September 15, 2023

4 minutes ago, PsionStorm said:

I believe the current sync is non-correcting

It is.

PsionStorm · September 15, 2023

So, I'll just let this one complete then. Anything you recommend doing after it's completed?

JorgeB · September 15, 2023

There is not much point in letting that one complete. I would recommend doing a new sync, since it will be faster.

PsionStorm · September 15, 2023

Sounds good. I'll start that when I get home. Thanks!
