UDMA CRC Error count increasing


breadman

10 hours ago, dev_guy said:

given the marginal foundation Unraid is based on?

This is nonsense but beyond the scope of the discussion dealing with CRC/DMA errors.

 

For sure, 100%, DMA/CRC errors are hardware faults, not caused by software, file systems, etc.  They are reported by physical controllers and indicate a physical h/w problem.  In my experience, these kinds of errors commonly originate with bad cables or connectors, or simply faulty components.  Another overlooked cause is faulty or overloaded power supplies.  Back when we offered server products, we were always careful to source single-rail PSUs so that the full capacity of the power supply could be fed to the hard drives.  Servers with multi-rail PSUs might have a high overall wattage rating, but any one rail is a fraction of that; and, typically, one rail would serve the entire hard drive array.  I'm sure you can deduce what the problem is with this arrangement.
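
To put rough numbers on the single-rail versus multi-rail point, here is a minimal Python sketch. Every figure in it (rail limits, drive count, spin-up current per drive) is an assumption chosen only for illustration, not a value from this thread or from any datasheet, and `rail_headroom_amps` is a hypothetical helper name.

```python
# Illustrative arithmetic only: all numbers below are assumed, not measured.

def rail_headroom_amps(rail_limit_amps: float,
                       drive_count: int,
                       spinup_amps_per_drive: float) -> float:
    """Return the 12 V current left on one rail if every drive spins up at once.

    A negative result means the drives together ask for more current than
    that one rail is rated to deliver.
    """
    return rail_limit_amps - drive_count * spinup_amps_per_drive


# Assumed scenario: 12 drives, each drawing 2 A on 12 V during spin-up.
DRIVES = 12
SPINUP_AMPS = 2.0

# Assumed single-rail PSU: the whole 12 V capacity (60 A here) is available.
single_rail = rail_headroom_amps(60.0, DRIVES, SPINUP_AMPS)

# Assumed multi-rail PSU: the drive array hangs off one 18 A rail.
multi_rail = rail_headroom_amps(18.0, DRIVES, SPINUP_AMPS)

print(f"single-rail headroom: {single_rail:+.1f} A")
print(f"multi-rail headroom:  {multi_rail:+.1f} A")
```

With these assumed figures the single rail keeps a comfortable margin while the one rail feeding the array is oversubscribed, even though both supplies could carry the same overall wattage rating.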

 

I haven't looked at many low-level Linux device drivers for several years, but I'll take a look at a few and see if they retry CRC/DMA errors.  Adding retry logic in the md/unraid driver might be something for us to consider.

 

As has been stated correctly, Unraid only disables devices which fail writes because what else can you do if a write fails (and presumably all retries fail)?  But sure, if there is a lot of other activity in the server causing a transient dip in voltage, then maybe a retry would succeed.
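
As a rough illustration of the "retry before disabling" idea, here is a small userspace Python model. It is not the md/unraid driver or any real kernel code; `write_with_retry`, `make_flaky_writer`, and the default of three attempts are all hypothetical choices made for the sketch.

```python
from typing import Callable


def write_with_retry(write_once: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Try a write up to max_attempts times.

    write_once performs a single write attempt and returns True on success.
    Returns True as soon as one attempt succeeds. Returns False only when
    every attempt failed, which is the point at which a caller would have
    to disable the device.
    """
    for _ in range(max_attempts):
        if write_once():
            return True
    return False


def make_flaky_writer(failures_before_success: int) -> Callable[[], bool]:
    """Build a simulated device that fails a fixed number of times, then works.

    This stands in for a transient fault such as a brief voltage dip.
    """
    remaining = failures_before_success

    def write_once() -> bool:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return False
        return True

    return write_once


# A transient fault (two failed attempts) is absorbed by the retries.
transient_ok = write_with_retry(make_flaky_writer(2), max_attempts=3)

# A persistent fault still exhausts the retries, so the device gets disabled.
persistent_ok = write_with_retry(make_flaky_writer(10), max_attempts=3)

print("transient fault survived:", transient_ok)
print("persistent fault survived:", persistent_ok)
```

The trade-off in a policy like this is that a genuinely failing device is detected slightly later, in exchange for not dropping a healthy device over a one-off glitch.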

Link to comment
  • 3 weeks later...
On 1/4/2023 at 7:47 PM, limetech said:

This is nonsense but beyond the scope of the discussion dealing with CRC/DMA errors.

 

For sure, 100%, DMA/CRC errors are hardware faults, not caused by software, file systems, etc.  They are reported by physical controllers and indicate a physical h/w problem.  In my experience, these kinds of errors commonly originate with bad cables or connectors, or simply faulty components.  Another overlooked cause is faulty or overloaded power supplies.  Back when we offered server products, we were always careful to source single-rail PSUs so that the full capacity of the power supply could be fed to the hard drives.  Servers with multi-rail PSUs might have a high overall wattage rating, but any one rail is a fraction of that; and, typically, one rail would serve the entire hard drive array.  I'm sure you can deduce what the problem is with this arrangement.

 

I haven't looked at many low-level Linux device drivers for several years, but I'll take a look at a few and see if they retry CRC/DMA errors.  Adding retry logic in the md/unraid driver might be something for us to consider.

 

As has been stated correctly, Unraid only disables devices which fail writes because what else can you do if a write fails (and presumably all retries fail)?  But sure, if there is a lot of other activity in the server causing a transient dip in voltage, then maybe a retry would succeed.

 

Hi, Tom

 

I have been using unRAID for 13 years now. I love it.

 

However, this issue did come up for me when using 6.** specifically. I opened a ticket with support and it was closed as a hardware issue, which I accepted. After getting that news I installed TrueNAS (some years ago). It still runs today on the same hardware without issue.

 

Money is not much of an issue for me, as I use unRAID for my photography business. I invested another 5k in 2 systems: all brand-new hardware, including SAS/SATA controllers and cables. This error hit the first system after about 3 months of use, disabling 2 of my disks overnight. The second system lasted about 5 months before hitting 1 failed disk overnight.

 

After replacing the disks on both systems, the failures came at months 4 and 6, with different disks plugged into different cables. I then proceeded to install TrueNAS on both systems; they have been running for a year and a bit.

 

I built a new unRAID system at this time with new hardware. The problem cropped up around month 8 this time, using NAS-rated WD Reds.

 

I'm at a loss at this point, very torn. I love unRAID, but I cannot trust either hardware in general or unRAID... I'm just not sure which one it is right now.

Link to comment
On 1/21/2023 at 5:24 AM, MacModMachine said:

 

Hi, Tom

 

I have been using unRAID for 13 years now. I love it.

 

However, this issue did come up for me when using 6.** specifically. I opened a ticket with support and it was closed as a hardware issue, which I accepted. After getting that news I installed TrueNAS (some years ago). It still runs today on the same hardware without issue.

 

Money is not much of an issue for me, as I use unRAID for my photography business. I invested another 5k in 2 systems: all brand-new hardware, including SAS/SATA controllers and cables. This error hit the first system after about 3 months of use, disabling 2 of my disks overnight. The second system lasted about 5 months before hitting 1 failed disk overnight.

 

After replacing the disks on both systems, the failures came at months 4 and 6, with different disks plugged into different cables. I then proceeded to install TrueNAS on both systems; they have been running for a year and a bit.

 

I built a new unRAID system at this time with new hardware. The problem cropped up around month 8 this time, using NAS-rated WD Reds.

 

I'm at a loss at this point, very torn. I love unRAID, but I cannot trust either hardware in general or unRAID... I'm just not sure which one it is right now.

 

@MacModMachine Thank you for sharing your experiences with Unraid. They mirror my own. The simple fact is that TrueNAS runs perfectly on the exact same hardware on which Unraid wrongly disabled perfectly good drives, and that's true right down to every SATA cable being the same. Tom, and the Unraid fan boys, need to acknowledge this is a real issue and stop blaming it on convenient excuses when it's clearly an Unraid issue. I no longer trust Unraid for storage and just use it as a last-resort backup and Docker platform, given how many times Unraid has disabled perfectly good drives running on hardware that any other OS has no issues with. It's a very real problem, and Unraid users shouldn't accept the whole hardware blame game. If Unraid disables a drive and that drive passes diagnostics, just try a better NAS operating system instead of replacing your cable, controller, motherboard, power supply, etc., as the Unraid faithful insist you should. Unraid disables perfectly good drives on perfectly good hardware, but the people who matter pretend the problem doesn't exist.

Edited by dev_guy
Link to comment
  • 2 weeks later...

dev_guy

 

Bro bro bro bro bro, this is not how finding faults works in complex layered systems. Why are you arguing like this?

 

Relax and let them be wrong.

 

WHAT ARE THE FACTS, bro? Can you do lower-level fact checking?

 

DO NOT FIGHT THE software author, that's my job he he he. Tell him what happened and in what situation; his job is to figure out whether his software can somehow help with it.

 

Let's see the facts

 

unRAID - CRC errors

FreeNAS BSD - no CRC errors

 

what is different?

 

Quite a lot actually. The things that have to do with power management are different. Including CPU microcode and many parameters about each chip in the system starting with the CPU.

 

I vote to move unRAID to an Ubuntu base.

Link to comment
1 hour ago, breadman said:

ayo, thanks everyone, for helping out! Turns out it was the SATA connector port on the motherboard. The SATA port was chipped and blocked off. Fixed it using good old superglue.

 

As I have said for years, the whole SATA connector scheme is a poster child for 'How not to Design a Connector System'.   It depends on a friction fit between two plastic parts, which is provided by a very small plastic nib in the cable-end connector.  The actual friction force is often quite low.  Often just bumping the cable will cause the plug to unseat to the point that the connection is maintained only by the force of gravity holding the two parts together.  Now add a bit of vibration and you can see why there are CRC errors and occasionally worse problems!

 

The cables with the metal locks on them were a step forward but they usually leave the plastic nib out of the plug end.  Then Western Digital did this:  

        https://support-en.wd.com/app/answers/detailweb/a_id/15954

 

Now the metal lock has nothing to lock against and there is no nib to provide the force fit.  No idea if WD continued with this design 'concept' or if any other drive manufacturers adopted it...
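
For anyone wanting to check whether reseating or replacing a cable actually stopped the errors: the raw value of SMART attribute 199 (UDMA_CRC_Error_Count) is a running total, so what matters is whether it keeps climbing. Below is a minimal Python sketch that compares two saved copies of `smartctl -A` output. The sample text is made up for illustration, the helper names are hypothetical, and the parser assumes the usual ten-column attribute table layout.

```python
from typing import Optional


def crc_error_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 199, or None if it is absent.

    Expects the text printed by `smartctl -A`, where each attribute row has
    ten whitespace-separated columns and the raw value is the last one.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def crc_errors_increased(before: str, after: str) -> bool:
    """Report whether the CRC error total grew between two readings."""
    old = crc_error_count(before)
    new = crc_error_count(after)
    if old is None or new is None:
        raise ValueError("attribute 199 not found in one of the readings")
    return new > old


# Made-up sample rows, shaped like smartctl attribute table output.
READING_BEFORE = (
    "  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7300\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12\n"
)
READING_AFTER = (
    "  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7324\n"
    "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       15\n"
)

print("CRC errors still increasing:", crc_errors_increased(READING_BEFORE, READING_AFTER))
```

A total that stays flat after the cable or port was fixed suggests the link is now clean; the old count itself will not go back down.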

Link to comment
