firrae

Extreme UDMA CRC Error Counts


Hi there,


After digging around on Google and the forums, I believe the problems with my array come down to UDMA CRC errors on a number of my drives, but honestly I'm not sure where to start looking for the cause. From what I've read, it could be one or a combination of three things:

  1. My SAS to SATA breakout cables (maybe they are cross-talking, which would make them the likely candidate?). I've tried 2 different brands and still get the issue, though both brands' cables looked the same apart from slightly different colours. - https://www.amazon.ca/gp/product/B0736J45V2/
  2. My drive cages: I have a Rosewill RSV-L4412, which came with 3 drive cages (I can't remember their part number). - https://www.rosewill.com/product/rosewill-rsv-l4412-4u-rackmount-server-case-or-chassis-12-sata-sas-hot-swap-drives-5-cooling-fans-included/
  3. My SAS controller, which is a Fujitsu (?) card flashed to be an LSI 9211-8i in "IT" mode.

 

At this point I suspect the cables, but I'd be interested in hearing what others think. 8 of my disks connect through these breakout cables; the other 4 go directly to the motherboard SATA ports. What I find interesting is that the drives on the breakout cables seem to have the issue much worse, though this is only a short-term observation so far, made since I read about this. Also, the cage that's wired directly to the motherboard currently has only 3 drives in it, while the other cages are fully loaded with 4.

I'm curious which of these options people think would serve me best in trying to solve this:

  1. Get different breakout cables.
  2. Get new drive cages.
  3. Change out the controller.

In any case I'd be interested in seeing the recommendations people have on this.

This all comes from seeing what I think are VERY high read error counts while rebuilding my array after changing out a drive. Attached is the diagnostics file from my server. It's in the middle of rebuilding that drive, as I mentioned, so whatever I decide, I'm a couple of days away from actually implementing it, assuming I can even get the parts at this point.

I'm interested to see what people think. Thanks!

tower-diagnostics-20200318-1415.zip


Quick update: after I finished writing this, I noticed that some of my SATA cable based drives were also getting these errors, but not all of them. The parity drive reports 0 over its entire life, but the drive nearest it shows a CRC error count that is increasing, though more slowly than on the drives attached to the SAS to SATA breakouts. Maybe this points to a combination of the cables and the cages? I really don't know at this point.
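For anyone wanting to compare these counts across drives, here is a minimal Python sketch that pulls SMART attribute 199 (UDMA_CRC_Error_Count) out of `smartctl -A` style output. The sample text and its values are made up for illustration and are not taken from the attached diagnostics; the function name `udma_crc_count` is likewise hypothetical.

```python
# Illustrative `smartctl -A` excerpt. The values are invented for this
# example and do not come from the diagnostics attached to this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1842
"""


def udma_crc_count(smart_output):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 whitespace-separated columns:
        # the ID is the first one and the raw value is the tenth.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


print(udma_crc_count(SAMPLE))
```

On a real system the text would come from running `smartctl -A /dev/sdX` against each drive; this sketch assumes the raw value for attribute 199 is a plain integer.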

Mar 18 13:03:31 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.00.00), ChipRevision(0x03), BiosVersion(07.39.00.00)

 

Update the LSI firmware to 20.00.07.00; this is a known issue with 20.00.00.00.
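To check other systems for the same firmware, the version can be read from the kernel log line quoted above. The following is a rough Python sketch, assuming the `FWVersion(...)` field always has this dotted numeric form; the names `firmware_version` and `RECOMMENDED` are made up for this example.

```python
import re

# Kernel log line quoted in this post.
LOG_LINE = (
    "Mar 18 13:03:31 Tower kernel: mpt2sas_cm0: LSISAS2008: "
    "FWVersion(20.00.00.00), ChipRevision(0x03), BiosVersion(07.39.00.00)"
)

# Firmware version recommended in this post, as a tuple for comparison.
RECOMMENDED = (20, 0, 7, 0)


def firmware_version(line):
    """Extract FWVersion from an mpt2sas log line as a tuple of ints."""
    match = re.search(r"FWVersion\((\d+(?:\.\d+)*)\)", line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


found = firmware_version(LOG_LINE)
# Tuples compare element by element, so (20, 0, 0, 0) < (20, 0, 7, 0).
needs_update = found is not None and found < RECOMMENDED
print(found, needs_update)
```

Comparing tuples of integers avoids the pitfalls of comparing version strings directly.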


@johnnie.black is that a newer issue, or has it been that way since before v6? I only got into UNRAID as v6 was launching and don't remember there being an issue until recently.

4 minutes ago, johnnie.black said:

Always, but the errors weren't reported before because the attribute wasn't monitored.

Interesting.

Otherwise, if you don't mind me asking, does the system look fine at a cursory glance? I do still have UNRAID reporting high read errors as well. Other than the glaring "it's rebuilding a drive" thing, of course.

Edited by firrae

15 minutes ago, firrae said:

I do still have UNRAID reporting high read errors as well. 

Likely related to the same issue. Do the firmware upgrade first and see how it goes. BTW no point in rebuilding a disk with multiple disk errors.

6 minutes ago, johnnie.black said:

BTW no point in rebuilding a disk with multiple disk errors.

What do you think the path forward would be, then? I'm not sure what I should do. I have multiple disks reporting read errors, but none show issues other than the CRC errors in SMART. Should I stop the rebuild, flash the firmware, and then... what? Rebuild if a parity check goes well?


Well, I can't thank you enough @johnnie.black! After updating the firmware I see 0 read errors and the CRC errors have completely stopped increasing on all drives. Thanks a bunch for pointing this out, I would NEVER have thought of this being an issue.
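Since the UDMA CRC counter is a lifetime total that does not reset, "fixed" here means the numbers stop growing, not that they return to zero. A small Python sketch of that check, comparing two snapshots of the per-drive counts, could look like this. The device names, the counts, and the function name `still_increasing` are all hypothetical.

```python
def still_increasing(before, after):
    """Return {drive: growth} for drives whose CRC count rose between snapshots."""
    return {
        drive: after[drive] - before[drive]
        for drive in before
        if drive in after and after[drive] > before[drive]
    }


# Hypothetical snapshots taken some time apart (not real data from this thread).
before = {"sdb": 1842, "sdc": 97, "sdd": 0}
after = {"sdb": 1842, "sdc": 97, "sdd": 0}

# An empty result means no drive logged new CRC errors in the interval.
print(still_increasing(before, after))
```

A drive that appears in the result with a positive number is still logging new errors and deserves a closer look at its cable or bay.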

On 3/18/2020 at 2:29 PM, johnnie.black said:

Mar 18 13:03:31 Tower kernel: mpt2sas_cm0: LSISAS2008: FWVersion(20.00.00.00), ChipRevision(0x03), BiosVersion(07.39.00.00)

 

Update the LSI firmware to 20.00.07.00; this is a known issue with 20.00.00.00.

Can you tell me how to update the LSI firmware? I have the same issue.

Thank you 

1 minute ago, Rene said:

Can you tell me how to update the LSI firmware,

Download the latest firmware package from Broadcom's support site then boot with a DOS flash drive and follow the instructions included with the package.

3 minutes ago, johnnie.black said:

Download the latest firmware package from Broadcom's support site then boot with a DOS flash drive and follow the instructions included with the package.

ok thank you
