UDMA CRC issues


I have a 3U Supermicro system running 2x Xeon X5650. I have 17 disks in total, and at least 3 of them are logging UDMA_CRC errors fairly frequently. The drives run from 2x LSI 9211-8i HBAs in IT mode through a Supermicro direct-attach backplane. I have already replaced the cables from the HBAs to the backplane, and things have improved. I have been fighting this for a few weeks now, with drives dropping out of the array, etc. After the cable swap I was finally able to get all drives added back into the array, but I am still seeing 3 drives producing UDMA_CRC errors. I have a new backplane on order and plan to swap it in. The power supply is a Supermicro 920-SQ.

 

My question is: can the HBA be causing this, and if so, wouldn't it affect all drives? Or do you think this is something in the backplane? I have swapped cables twice now. The first set, cheap ones from Amazon, caused more problems than I had originally, so I ordered some better-quality ones.

 

The screenshot shows errors on 2 drives, but the SMART test says they're OK.

 

Any advice on where to look next? Is replacing the backplane a valid idea?

Drives.PNG

 

mars-diagnostics-20180530-0838.zip

 

disk2.PNG

disk15.PNG

Edited by unsainted
added diagnostics
3 minutes ago, unsainted said:

The screenshot shows errors on 2 drives, but the SMART test says they're OK.

Those are read errors, possibly related to connection issues, but the parity sync will have some corruption because of them.

 

4 minutes ago, unsainted said:

My question is: can the HBA be causing this, and if so, wouldn't it affect all drives? Or do you think this is something in the backplane?

I would say the backplane is the more likely culprit. You could swap cables and disks on the affected slots and see if the CRC errors stay with the slots.
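To make the slot swap test easier to read, here is a rough Python sketch of one way to track it. It assumes you save the text of `smartctl -A /dev/sdX` for each disk right after the swap and again after some I/O. The CRC counter (SMART attribute 199) is stored on the disk and only goes up, so any growth between the two snapshots can be attributed to the slot the disk was sitting in during that time. The slot labels, serial numbers, and sample output below are made up for illustration, and `crc_count` and `growth_by_slot` are hypothetical helper names, not part of any tool.

```python
# Illustrative smartctl -A excerpt (values are invented, not from the thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1234
"""


def crc_count(smart_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def growth_by_slot(before, after):
    """Attribute CRC growth to the slot each disk occupies.

    before/after: {slot_label: (disk_serial, crc_count)}, both taken with the
    disks in the same (post-swap) positions. Disks are matched by serial.
    """
    prev = {serial: count for serial, count in before.values()}
    return {
        slot: count - prev[serial]
        for slot, (serial, count) in after.items()
        if serial in prev
    }


# Hypothetical snapshots: slot names and serials are placeholders.
before = {"slot2": ("SERIAL_A", 100), "slot9": ("SERIAL_B", 5)}
after = {"slot2": ("SERIAL_A", 160), "slot9": ("SERIAL_B", 5)}
print(crc_count(SAMPLE))
print(growth_by_slot(before, after))
```

If the counter keeps growing on whichever disk sits in a given slot, that points at the slot, cable, or backplane. If the growth follows the same disk to its new slot, the disk itself becomes the suspect.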

2 minutes ago, johnnie.black said:

Those are read errors, possibly related to connection issues, but the parity sync will have some corruption because of them.

 

I would say the backplane is the more likely culprit. You could swap cables and disks on the affected slots and see if the CRC errors stay with the slots.

I will definitely try that. It's about halfway through the parity sync, so once that is done I can try your suggestion.

1 minute ago, johnnie.black said:

SMART for both disks looks fine, and both have a very high number of CRC errors, so more likely connection related.

 

You can let the sync finish but you'll need to do a correcting parity check after to fix those sync errors.

Thanks! I was looking through the syslog and, wow, it seems I have a few other issues too.

EDAC MC0: 369 CE error on CPU#0Channel#1_DIMM#0

 

Bad RAM stick(s)?

 

I have 96GB in the server, but only 80GB is being reported.
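For anyone wanting to see which DIMM the EDAC messages point at, here is a small Python sketch that totals the correctable error (CE) counts per DIMM label from syslog text. The regular expression is built only from the single line quoted above, so treat it as an assumption: other kernels or EDAC drivers may format the message differently. `ce_totals` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the one syslog line quoted in this thread (assumption:
# other EDAC messages in the log follow the same shape).
CE_LINE = re.compile(r"EDAC MC(\d+): (\d+) CE error on (\S+)")


def ce_totals(log_text):
    """Sum correctable-error counts per (memory controller, DIMM label)."""
    totals = Counter()
    for mc, count, label in CE_LINE.findall(log_text):
        totals[(int(mc), label)] += int(count)
    return totals


log = "EDAC MC0: 369 CE error on CPU#0Channel#1_DIMM#0\n"
for (mc, label), total in ce_totals(log).most_common():
    print(f"MC{mc} {label}: {total} correctable errors")
```

If one label dominates the totals, that DIMM (or its slot) is the first candidate to reseat or swap.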

 
