Millions of sync errors 3 hours into parity check


PsionStorm
Go to solution Solved by JorgeB,

About three weeks ago, I added some new drives to my system, and upgraded my parity drive to a 14TB drive. Since then everything has been functioning without error, but this morning my monthly parity check fired up. I'm three hours in, and I've got a scary number of sync errors that I've never seen before. I'm hoping this is just a result of missing a step with the drive additions.

 

At this point, should I just cancel the scheduled parity check and force a correcting parity check? Or should I see this one through first?

excelsior-diagnostics-20230915-0259.zip

Link to comment

Has the parity check progressed beyond the size of the largest data drive?

Did you use the parity swap procedure to upgrade the parity drive?

If so, it seems that the space beyond the largest data drive is not always correctly zeroed on the new parity drive, and you need to run a correcting check once to fix this. After that, future checks should be error free. However, I do not think that is your problem here :(

 

It looks as if there may be some other problem that needs looking at first, as your syslog contains a continual stream of messages like these:

Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: attempting task abort!scmd(0x000000001e0cffe4), outstanding for 30528 ms & timeout 30000 ms
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: [sdi] tag#2226 CDB: opcode=0x88 88 00 00 00 00 00 9c b9 b4 40 00 00 00 08 00 00
Sep 12 14:46:15 Excelsior kernel: scsi target1:0:2: handle(0x000b), sas_address(0x5000cca23c0d2c39), phy(2)
Sep 12 14:46:15 Excelsior kernel: scsi target1:0:2: enclosure logical id(0x500605b0098b8100), slot(3) 
Sep 12 14:46:15 Excelsior kernel: sd 1:0:2:0: task abort: SUCCESS scmd(0x000000001e0cffe4)

(sdi appears to be disk5.) These messages seem to have started on Sep 8. Did you do anything to the system then? Is there any indication in the GUI of possible problems with disk5?
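If it helps to see how widespread these aborts are, here is a rough Python sketch that tallies the "attempting task abort" lines per day and SCSI address, and decodes the READ(16) CDB (opcode 0x88) into a starting LBA and sector count. The function name `summarize` and both regular expressions are my own, written against the five lines quoted above; they are assumptions about the log layout, not anything Unraid provides, so treat this as a starting point only.

```python
import re
from collections import Counter

# Both patterns are assumptions based on the syslog lines quoted in this post.
ABORT_RE = re.compile(
    r"(\w{3}\s+\d+) [\d:]+ \S+ kernel: sd (\S+): attempting task abort"
)
CDB_RE = re.compile(
    r"\[(sd[a-z]+)\] tag#\d+ CDB: opcode=0x([0-9a-f]{2}) ((?:[0-9a-f]{2} ?)+)"
)


def summarize(lines):
    """Return (aborts, reads) for an iterable of syslog lines.

    aborts: Counter keyed by (date, scsi_address), e.g. ("Sep 12", "1:0:2:0")
    reads:  list of (device, lba, sector_count) for READ(16) commands
    """
    aborts = Counter()
    reads = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            aborts[(m.group(1), m.group(2))] += 1
            continue
        m = CDB_RE.search(line)
        if m and m.group(2) == "88":
            cdb = bytes.fromhex(m.group(3))
            if len(cdb) == 16:
                # READ(16): bytes 2-9 hold the LBA, bytes 10-13 the transfer length
                lba = int.from_bytes(cdb[2:10], "big")
                count = int.from_bytes(cdb[10:14], "big")
                reads.append((m.group(1), lba, count))
    return aborts, reads
```

To use it, extract the syslog from the diagnostics zip and pass the open file to `summarize` (the path inside the zip varies, so I have not guessed it here). The earliest date in `aborts` tells you when the messages began, and the set of SCSI addresses tells you whether they are confined to one disk.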

 

I would suggest you:

  • Cancel the current check, as there is no point proceeding if there are underlying hardware issues.
  • Do NOT attempt to run a correcting check at this point. If you have a hardware issue, you are more likely to end up corrupting parity.
  • Carefully check all connections (power and SATA) to disk5. Perhaps when changing drives you slightly disturbed an existing connection or did not quite perfectly seat one.

I am not sure whether, at that point, you should retry the non-correcting check or do something else, such as an extended test on disk5.

 

You may want to wait to see if anyone else (in particular @JorgeB) has any other suggestions.

Link to comment
1 hour ago, JorgeB said:

Did you do a parity swap at this time?

I thought I did, but I followed a different process. I added the new drive as a 2nd parity, then pulled the old one and set up a new config. I saw that recommended in a few places and didn't see the article you linked. Looks like that may have been my mistake?

 


Regarding those task abort errors in the syslog: I have done extensive troubleshooting and have been unable to resolve them. They occur on all of my SAS drives, and only on my SAS drives. See thread here: 

 

 

Any advice would be appreciated. Since this post I've replaced the SAS card and have tried three different sets of cables, with no change in the result.

Edited by PsionStorm