All drives with errors during parity check



Hey guys. Just finished swapping most of my hardware into a new case. I am upgrading from my trusty Define R5 to a Supermicro 846 for greater expandability. Previously, all my HDDs were attached to an LSI 8i card (can't remember exactly which one) with SAS to 4x SATA breakout cables. Now it is SAS to SAS for the backplane. I accidentally ordered the 1m cables, which are too long for my purposes, so I may just order the 0.5m cables and swap them out anyway. I am wondering if the cable length could be playing a part in this issue. I am currently running a parity check (I hadn't run one in the 3 months prior). I did recycle known working hardware, but all the cabling was replaced with brand new cables. I am about 1/3 of the way through the parity check and am kicking up UDMA CRC errors on all of my drives. I know there is no way all the drives went bad at the same time. And I have to imagine that if the issue was the LSI card, then even without a parity check running I would have had at least 1 of these errors pop up in the last 3 months, right?

 

From reading through old threads on this site I recognize that this has to do with data sent through the cable getting corrupted somewhere between one end and the other. I will of course reseat the SAS cables and cut the zip ties that are loosely binding them together for cable management once the parity check has completed, but from there it gets a little murky for me.

 

Should I run another parity check after I have made these changes to confirm that the errors have stopped? I can't imagine it is an issue with a bad cable since they are all brand new. But I suppose, since there are only 2 cables from the HBA to the backplane, it would only take 1 defect in one cable to cause the issue. If I reseat the cables, rerun the parity check, and still have these issues popping up, I will first order new SAS cables and try them out. But if that ends up failing too, that leaves me with 2 options, right? Either the HBA, which was fine, went bad, or the new-to-me backplane has an issue? I would like to rule out the backplane, as I just got this case off of eBay (from a reputable seller) and would like to be able to report to them if there is an issue with the backplane.

Edited by jebusfreek666
Link to comment

You really need to post your Diagnostics file: Tools >>> Diagnostics

 

UDMA CRC errors are normally not disk errors. They are errors in the transmission of the data between the drive and the MB. The prime suspects are the SATA data cables.

 

The disk controller, at a distant second, would be the next candidate. Hard drives would barely be a blip on the probability chart, and more than one drive at a time is not even remotely possible!

 

What did you do with the excess length of cable from those 1m cables? If you tied them all up in a bundle to make things really, really neat, that would be my first suspect. Cut the ties and allow them to spray EVERYWHERE. Then double/triple check the SATA data connectors on the drive ends to make sure that you have 'friction' and that they are firmly seated. (Another problem with tying the cables up neatly is that it can actually cause a SATA data connector to work loose when the drive vibrates.) If you find any without friction, post back and I will tell you what to look for! Do the same for the SATA power connectors. (I consider the entire SATA connector design to be a poster child for how not to design a connector system!)
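If you want to keep an eye on the counter outside the GUI, here is a rough Python sketch that pulls SMART attribute 199 (usually labelled UDMA_CRC_Error_Count) out of the attribute table printed by `smartctl -A`. It assumes the usual ten-column ATA table layout from smartmontools; the sample text and the helper name `crc_error_count` are made up for illustration, and some drives label the attribute differently, which is why the sketch matches on the ID rather than the name.

```python
# Made-up sample in the layout `smartctl -A /dev/sdX` prints for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       12
"""


def crc_error_count(smartctl_text):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


print(crc_error_count(SAMPLE))
```

On a real system you would feed it the text captured from `smartctl -A` for each drive instead of `SAMPLE`.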

Edited by Frank1940
Link to comment

Are you rebuilding/correcting parity or simply doing a non-correcting parity check?  If it is a non-correcting parity check, you don't have to wait.  Nothing bad will happen if you stop it.  With these errors occurring, you will (probably) be running the check again to see if what you did fixed the problem.

Link to comment
50 minutes ago, Frank1940 said:

Are you rebuilding/correcting parity or simply doing a non-correcting parity check?  If it is a non-correcting parity check, you don't have to wait.  Nothing bad will happen if you stop it.  With these errors occurring, you will (probably) be running the check again to see if what you did fixed the problem.

I am doing an error-correcting parity check, so I will be waiting until tomorrow morning.

Link to comment
3 hours ago, Vr2Io said:

You mean the parity check also got errors? Single or dual parity? If dual, are all the disks in the correct order?

 

For UDMA errors with an expander backplane, it is likely a backplane issue rather than a cable issue.

 

During the parity check is the only time I have gotten these errors. It is dual parity, and both parity drives also show the same errors. Not sure what you mean by the correct order, but it was the same order as in the old server. I really hope it is not the backplane. That would be disastrous. I will report back after the parity check finishes and I have adjusted all the cabling.

Link to comment
4 hours ago, Frank1940 said:

If it is a non-correcting parity check, you don't have to wait.  Nothing bad will happen if you stop it. 

 

3 hours ago, jebusfreek666 said:

I am doing an error-correcting parity check, so I will be waiting until tomorrow morning.

 

I would say if you are getting errors while correcting parity there is also no reason to not stop it since you probably will not have valid parity after. Fix your hardware problems first.

Link to comment
1 hour ago, trurl said:

 

 

I would say if you are getting errors while correcting parity there is also no reason to not stop it since you probably will not have valid parity after. Fix your hardware problems first.

 

Wish I had known that before I left for work! Oh well, I guess it won't hurt to have it wait until the morning. Still, I hate all this downtime.

Link to comment

I just had a thought and wanted to check if this might be causing the issue. Previously, these drives were in my tower server with SAS to 4x SATA breakout cables. These are shucked drives and I could not get them to work without taping the first few pins (3.3v issue). When I swapped the drives over to my new 846 case, I did not remove the tape. I am pretty sure that the backplane does not have this issue, so the tape is not needed. Is there any possibility that the tape is causing these errors? I assume not, as that is not on the data transfer side, but I just want to make sure it is not something stupid that I may overlook before I start ripping out hardware. 

Edited by jebusfreek666
Link to comment
2 hours ago, jebusfreek666 said:

I am pretty sure that the backplane does not have this issue, so the tape is not needed. Is there any possibility that the tape is causing these errors?

No! The 3.3v issue comes from a change in the SATA spec regarding the power plug. Someone decided that a 'drive reset signal' pin (the Power Disable feature) was needed and 'repurposed' one of the pins previously used for the 3.3v supply to add it. (This meant that all PSUs built before the change now 'send' a 'permanent' reset signal to drives that include provision for it!)

Link to comment

Well, the parity check completed. I had 0 sync errors and parity is valid. Every one of my drives got CRC errors, but only between 3 and 12 on each. I shut down the server, removed the cables entirely, cut the zip ties holding them together, and reinstalled them, making sure they were seated correctly and there was as little overlap as possible to avoid crosstalk. I then restarted the server, acknowledged each drive's errors, and began another non-correcting parity check to try to force more of the CRC errors. It has been running for an hour now, and I haven't gotten any errors. Last time the errors started kicking up in the first 15 minutes. I think I will let it run for another hour just to be sure, but it appears that either poor seating or crosstalk was my issue and it has now been resolved. Thanks guys for all your help!
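Since the CRC counter on a drive is cumulative, what matters after a fix is whether it goes up again, not its absolute value. A minimal Python sketch of that before/after comparison follows. The drive names and counts are invented for illustration, and `new_crc_errors` is a hypothetical helper, not part of any tool.

```python
def new_crc_errors(before, after):
    """Return {drive: increase} for drives whose CRC count went up."""
    increases = {}
    for drive, count in after.items():
        # A drive missing from the first snapshot is treated as starting at 0.
        delta = count - before.get(drive, 0)
        if delta > 0:
            increases[drive] = delta
    return increases


# Invented snapshots: counts noted before and after the second parity check.
before = {"parity": 7, "disk1": 12, "disk2": 3}
after = {"parity": 7, "disk1": 12, "disk2": 4}

print(new_crc_errors(before, after))
```

An empty result after a full check would suggest the reseated cabling fixed things; any drive listed is still seeing transmission errors.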

Link to comment

CRC errors are supposed to be corrected (the transfer is retried), but they do affect throughput, so the fact that the parity check completed without an error is not really unexpected. Crosstalk in SATA data cables has always been a potential problem. There are (or used to be) shielded SATA cables, but they are much bulkier and more expensive...
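To illustrate why a handful of CRC errors costs time rather than data, here is a toy Python sketch of detect-and-retry. It is only an analogy: the real check happens in the SATA link hardware, and `zlib.crc32` is used here as a stand-in checksum. The function names and the deliberate bit flip are assumptions made up for the example.

```python
import zlib


def send_frame(payload, corrupt):
    """Return (bytes as received, CRC computed by the sender)."""
    crc = zlib.crc32(payload)
    if corrupt:
        # Flip one bit in the first byte to mimic noise picked up on the cable.
        received = bytes([payload[0] ^ 0x01]) + payload[1:]
    else:
        received = payload
    return received, crc


def transfer(payload, bad_attempts):
    """Resend until the CRC matches; return (data, number of CRC errors)."""
    crc_errors = 0
    while True:
        received, crc = send_frame(payload, corrupt=crc_errors < bad_attempts)
        if zlib.crc32(received) == crc:
            return received, crc_errors
        # Mismatch: count the error and try the transfer again.
        crc_errors += 1


data, errors = transfer(b"sector contents", bad_attempts=2)
print(data == b"sector contents", errors)
```

The data arrives intact in the end, but each failed attempt is a wasted transfer, which is where the throughput hit comes from.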

Link to comment

Spoke too soon I guess. I got 1 more CRC error at the 1 hr 5 min mark. I did run the parity check for 2 hours and didn't get any more errors. So I am sure that it is something to do with the cables. I am leaning towards it being crosstalk. The cables I am using are longer than I need; they are the 1m variety. So even after freeing them from the zip ties and rerouting them, there are still places where they overlap, like where I have them doubled over themselves due to the excess length. I have already ordered 0.5m cables, so hopefully that will resolve this issue entirely.

Link to comment

Replacing them with shorter cables is fine. If the problem persists, you should check the SAS/SATA sockets on the backplane that connect to the disks. I agree with using contact cleaner first, then checking whether any socket is broken; it is a common issue.

 

You never know how the previous owner used the equipment. Many users apply too much force when inserting a disk and never consider whether this will damage the socket or plug.

 

The worst case would be an electronic problem, which is harder to fix.

Edited by Vr2Io
Link to comment
