UDMA CRC Error count increasing



Just now, trurl said:

the disk can't be communicated with at all

And very often after fixing connection problems, the disk is perfectly fine, but it is out-of-sync due to the write error.

 

I don't know how other systems handle this situation. Do they just give up, the failed write is lost, and the disk isn't available for further writing? I'm pretty sure traditional hardware RAID just keeps on working by updating parity so the writes can continue and be recovered by rebuild.

Link to comment
10 minutes ago, trurl said:

Very often when a user posts diagnostics with a disabled disk, it is clear from those diagnostics that the disk can't be communicated with at all, since there will be no SMART report for it, and syslog has lots of entries where it has retried getting a response.

As I've said, that's not been my experience at all. I've had the opposite experience. Every drive Unraid has disabled, and there have been several over the years, has been perfectly fine, often with only a single CRC error in the SMART data. There have been countless other posts on this forum describing similar experiences: drives that test fine and show only one or more CRC errors, yet Unraid disabled them.

 

The common wisdom is to replace the SATA cable, replace the disk controller, replace the power supply, etc. But, in reality, the user might be better off replacing Unraid with a more fault-tolerant operating system that doesn't cry wolf, or claim the sky is falling, when there is no real problem with hardware that other NAS operating systems are perfectly happy with.

Edited by dev_guy
Link to comment
1 minute ago, dev_guy said:

Every drive Unraid has disabled, and there have been several over the years, has been perfectly fine, often with only a single CRC error in the SMART data.

  

I have already agreed the drive can be perfectly fine several times.

 

But it is out-of-sync due to write failure. The emulated drive can continue to accept writes. The initial failed write, and any subsequent writes to the disabled/emulated disk, update parity so all those writes can be recovered by rebuilding from parity.

Link to comment
1 minute ago, trurl said:

  

I have already agreed the drive can be perfectly fine several times. … The initial failed write, and any subsequent writes to the disabled/emulated disk, update parity so all those writes can be recovered by rebuilding from parity.

 

And that goes back to my comments about Unraid having a fragile parity scheme. Let's face it: Unraid doesn't use a proven open-source fault-tolerant storage stack like the well-proven Linux mdadm, ZFS, etc. AFAIK, it uses a unique proprietary parity system built on top of an ancient 1993 file system (XFS), the rather buggy and unstable FUSE layer on top of that, and some proprietary "glue" to make it all sort of work. Yeah, it has some advantages, but it clearly has some disadvantages as well, in performance and in disabling perfectly good drives.

Link to comment
5 minutes ago, trurl said:

If a disk that is being written can't be communicated with at all, what should Unraid do?

But that's the whole point. It's EXTREMELY unlikely the disk is that broken when the same disk, untouched, works perfectly. When Unraid disables a disk, the first thing I do is shut down the server and reboot it into Linux from a USB thumb drive. And, guess what, the drive Unraid disabled works just fine, passes all diagnostics, has no unrecoverable errors in the SMART data, etc.

 

If it's a full moon and the wind blows the wrong way, Unraid seems to disable perfectly good drives. I don't know exactly why, but that's very much my experience and the experience of many others. I sincerely doubt Unraid has any sort of robust retry logic; instead it simply disables a drive at the first hint of trouble.

Edited by dev_guy
Link to comment
7 minutes ago, trurl said:

If a disk that is being written can't be communicated with at all, what should Unraid do?

And to clarify: if a disk can't be written to at all, or is experiencing easily reproducible problems, I have NO issue with it being disabled and kicked out of the array. But that's not what we're discussing here. Basically, if a drive works perfectly with everything you can throw at it after Unraid disabled it, something is wrong with Unraid.

Link to comment

I have already said many times that the disk might be good. I have even said that is often the case. I'm not sure you are understanding everything I have already written, since I am repeating myself and you are arguing about things I have already agreed to.

39 minutes ago, dev_guy said:

if a disk can't be written to at all,..., I have NO issue with it being disabled and kicked out of the array

This is exactly what happens.

 

Often the only reason a disk can't be written to at all is that it can't be communicated with. Bad connections are much more common than bad disks. If you take a good disk that had a bad connection and put it in another system, it might work perfectly fine, since it is a good disk and it has been reconnected.

 

If a disk can't be written for whatever reason, something has to be done at that time. Unraid emulates the failed write by updating parity, and disables the disk since it is now out-of-sync. The emulated disk can still be written many times after that; all those writes update parity, and all those emulated writes can be recovered by rebuilding. Updating parity to emulate failed writes is what is done in many RAID and other real-time parity implementations.
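To make the parity mechanism concrete, here is a minimal toy sketch in Python of single-parity (XOR) bookkeeping. It illustrates the general technique described above, not Unraid's actual implementation: a write to an unreachable disk is folded into parity, and a later rebuild recovers it.

```python
# Toy single-parity (XOR) model; an illustration, not Unraid's code.

def rebuild_block(parity: bytes, other_blocks: list) -> bytes:
    """Reconstruct the missing disk's block from parity plus the other disks."""
    out = bytearray(parity)
    for blk in other_blocks:
        out = bytearray(b ^ x for b, x in zip(out, blk))
    return bytes(out)

def update_parity(parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    """Fold one block's change into parity: new_parity = parity ^ old ^ new."""
    return bytes(p ^ o ^ n for p, o, n in zip(parity, old_block, new_block))

# Three data disks; parity is their XOR.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

# Disk 2 is unreachable. To "write" new contents to it, first emulate a read
# of its old contents from parity, then fold the change into parity.
old_d2 = rebuild_block(parity, [d0, d1])   # equals d2
new_d2 = b"\xff\x00"
parity = update_parity(parity, old_d2, new_d2)

# A later rebuild recovers the write that never physically landed on disk 2.
assert rebuild_block(parity, [d0, d1]) == new_d2
```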

 

We see good disks disabled in many, many threads. We never suggest that replacing the disk should always be the way to fix a disabled disk. Blindly rebuilding without even trying to determine the cause is something I would never recommend. We always want to see its SMART report, and maybe run SMART self-tests on it, if it is unclear whether the disk or something else is to blame.

 

Replacing for rebuild is safer in one respect: if you keep the original with its contents intact, then if there are problems during rebuild you still have the contents on the original disk, assuming it is good enough to be read.

 

But disks aren't free, and if it is a good disk we often try to clear up whatever the actual problem is and rebuild to the same disk. If there are still problems, the rebuild can be retried after fixing the problem. And there are usually still options even when another disk has a problem during rebuild.

 

Not only has my experience with my own Unraid systems been good, but I have also had mostly good experiences helping other people on this forum recover from disabled disks and much worse.

 

 

Link to comment
16 hours ago, trurl said:

Unrelated to this thread, but I have reviewed some of your post history and find it is often knowledgeable and helpful. I don't disagree with some of your criticisms on other threads.

I appreciate that. I'm an electrical engineer by education with extensive experience in the IT world. The SATA interface was designed to be robust and seems to work very well for years on end, even with really marginal hardware, cables, etc., in most every application EXCEPT Unraid, which is all too happy to disable even new NAS-grade prime drives connected with quality cables to expensive SATA cards, for no good reason.

 

I'm also repeating myself in saying that I don't have log data to analyze because, by default, Unraid log data is destroyed by even a reboot. But the evidence is that Unraid is well known to disable perfectly good drives for no good reason.

 

You keep bringing up the question of what Unraid is supposed to do if there's a drive error. How about a few retries? Honestly, there's lots of evidence Unraid doesn't attempt any sort of reasonable error recovery when it encounters what could be a brief transient error reading from or writing to a drive. That's perhaps the biggest issue here. Unraid seems to just fall on its face at the slightest I/O error.

 

As I mentioned previously, my dog might be chasing his toy and accidentally run into my Unraid server sitting on the floor. Unraid might be doing some file I/O when the server is jostled, and the result seems to be that Unraid would rather disable a drive because of my clumsy dog than retry the same operation a second later, when it would work perfectly. That's the problem the Unraid fan boys don't want to acknowledge.

 

The whole "I've been running Unraid for years with no issues" argument is roughly the same as "I've never had a drive fail so I don't need to do backups" argument. My anecdotal evidence with multiple perfectly good Unraid drives being disabled is equally as valid as anyone else's anecdotal evidence of not having Unraid problems. 

 

I'd really like to hear Tom weigh in on this, as it's a genuinely serious issue. Disabling a drive is not a trivial thing. The recovery process can easily result in permanent, unrecoverable data loss, and that risk is entirely preventable if Unraid wouldn't disable perfectly good drives in the first place.

Link to comment
12 hours ago, dev_guy said:

I appreciate that. I'm an electrical engineer by education with extensive experience in the IT world. …

Yes.

 

Why does Unraid not just "pause" the disk, so to speak, where it keeps the drive in the array and still uses it, skipping the failed write and retrying in 10 seconds? If that fails 5 times (50 seconds of retries), then show a warning that the write failed. Let the user decide if they want to continue with the potential for data loss, tell them they may have a hardware problem, and pause the array state.
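As a rough sketch of what that policy could look like, here is a minimal Python version. All names are hypothetical placeholders, not Unraid internals; assume the supplied write callable raises OSError on failure.

```python
import time

MAX_RETRIES = 5       # five attempts...
RETRY_DELAY_S = 10    # ...ten seconds apart, per the proposal above

def warn_user(msg: str) -> None:
    print("WARNING:", msg)   # stand-in for a real notification hook

def pause_array() -> None:
    print("array paused pending user decision")   # stand-in for array-state hook

def write_with_retries(write_block, block_no: int, data: bytes) -> bool:
    """write_block is any callable that raises OSError when the write fails."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            write_block(block_no, data)
            return True
        except OSError as err:
            print(f"block {block_no}: attempt {attempt}/{MAX_RETRIES} failed: {err}")
            time.sleep(RETRY_DELAY_S)   # give a jostled drive time to recover
    # Retries exhausted: warn and pause instead of silently disabling the disk.
    warn_user(f"block {block_no}: write still failing after {MAX_RETRIES} "
              "attempts; possible hardware problem")
    pause_array()
    return False
```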

 

My main job is performance engineering for a Fortune 500 company. I use disks daily and interact with large disk arrays. This issue genuinely boggles my mind, as does not saving syslogs on system failure. I had to set up a syslog server specifically for this.
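For anyone wanting the same safety net, here is a minimal Python sketch of shipping log messages to a remote syslog collector so they survive a reboot; the address is a placeholder for your own server.

```python
import logging
from logging.handlers import SysLogHandler

# Forward log messages to a remote syslog collector (UDP port 514 by
# default) so they survive a crash or reboot of the local machine.
logger = logging.getLogger("array-monitor")
logger.setLevel(logging.INFO)
logger.addHandler(SysLogHandler(address=("192.168.1.50", 514)))  # placeholder IP

logger.info("disk sdi: write error observed; capturing diagnostics")
```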

 

Limetech knows this software is not going on enterprise hardware in most cases. Bad SATA cables, crap power, and faulty hot-swap disk cages are common; they should have accounted for this possibility and built some sort of mechanism to deal with it, instead of relying on customers to troubleshoot for days trying to correct it. Even though in every case my hardware was OK, I still ran into the issue, which is another story.

 

I feel like the current implementation was just: kick the disk and let the customer figure it out. This issue is not just us two users complaining; I see it on Reddit a lot. Users complain of CRC errors, fix it, and lo and behold, the same user is back months later with the same issue. They are replacing and rebuilding over a potential problem that may not even exist.

 

Link to comment
On 12/20/2022 at 2:38 AM, splendidthunder said:

Yes. Why does Unraid not just "pause" the disk, so to speak, where it keeps the drive in the array and still uses it, skipping the failed write and retrying in 10 seconds? …

@splendidthunder I appreciate your input, and your support for the view that this is something Unraid could greatly improve if they'd only acknowledge the issue.

 

There are ways to have Unraid write log data to an SSD cache drive/array, which I, unfortunately, didn't know about until relatively recently, as I never encountered that option in any of the docs, setup guides, etc. It's such a basic thing that they should make it more obvious, and arguably even put the logs on your SSD cache by default if you have one. And even if you don't have a cache, they could write the logs to the system folder on the array if you don't enable drive spin-down.

 

But instead, by default, your log is destroyed every time you power down or reboot your Unraid server, which conveniently destroys the evidence of what happened to your wrongly disabled, perfectly good drive. I've learned the hard way to reboot ASAP into Linux to get the data off disabled drives, out of fear the emulated data will also become corrupted, since every access to the emulated data and/or the rebuild beats on all the drives. I've lost data that way. But the reboot, by default, destroys the logs.

 

You are absolutely correct that Unraid should be optimized for consumer-grade hardware and not put data at risk by wrongly disabling drives that suffer a brief transient problem. Disabling drives puts data at significant risk and should be a last resort, instead of happening all too often when there is no ongoing problem. I get that Unraid is trying to maintain the integrity of the parity, but kicking perfectly good drives out of the array is not the best option and greatly increases the risk of data loss.

 

And you are also correct that this is a common issue, and the Unraid fans just blame your cables, drives, power supply, SATA interface, etc. The real issue for me is that the exact same hardware works perfectly with TrueNAS, Open Media Vault, etc., and the disabled drive/cable/controller/power supply passes even extensive extended diagnostics. This seems to be something Unraid just wants to pretend isn't a problem when it really is. People complain about TrueNAS/FreeNAS being demanding on hardware but, honestly, it's been way more trouble-free on the exact same hardware Unraid didn't like.

Link to comment
On 12/19/2022 at 4:45 PM, dev_guy said:

I don't have log data to analyze

We always hope users will get diagnostics before rebooting. But you can also set up a syslog server.

 

Have you ever looked at the syslog before rebooting? Lots of retries, but the drive just doesn't respond because it has disconnected. We see this all the time in diagnostics. And I am not talking only about my personal anecdotal experience; I have looked at literally thousands of diagnostics from other users. It is easy to find this evidence in many threads. Can you link me to your evidence?

 

 

Link to comment

Why do you keep talking about perfectly good drives when I am talking about bad connections to perfectly good drives? The whole point of parity is so the system can continue to work even when a drive can't be accessed, and this includes the ability to continue to write data by updating parity and then recover that data by rebuilding.

Link to comment
2 minutes ago, trurl said:

We always hope users will get diagnostics before rebooting. But you can also set up a syslog server. …

 

I've already explained multiple times that I don't have log data of a drive being disabled, due to Unraid's unfortunate default log setup. I don't know what you mean by drives being "disconnected", as I don't think that's part of the SATA interface. The drives I'm talking about are very much still connected, in that you can read the SMART data, run self-tests on them, etc., but Unraid has disabled them from the array. If something is "disconnecting" drives, it's likely Unraid, NOT the hardware.

Link to comment
3 minutes ago, trurl said:

Why do you keep talking about perfectly good drives when I am talking about bad connections to perfectly good drives? The whole point of parity is so the system can continue to work even when a drive can't be accessed, and this includes the ability to continue to write data by updating parity and then recover that data by rebuilding.

 

Because the drives, cables, everything test fine without touching anything, including the connections. So your concept of "bad connections" has the same end result, which is that it needlessly puts data at risk by disabling drives that no other software or operating system seems to have any issues with. Rebuilding data can be a 24+ hour process of continuous pounding on all the drives in the array, which can trigger additional problems when the array is already in a fragile state.

Link to comment
13 minutes ago, trurl said:

 

Looking at the syslog for the above, I see a bunch of I/O errors in immediate succession. I don't see any retries of the same I/O command or sector. The drive (sdi) is also obviously still connected and communicating, as it later passes a SMART read and is successfully unmounted, with both logged before the server is shut down. I'm not saying the issue isn't sometimes a bad cable or connection. I'm just saying Unraid could handle such errors better and seems far more sensitive to such issues.
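One way to sanity-check that from a syslog excerpt is to count how many distinct sectors appear in the error lines: repeats of the same sector would suggest retries, while a run of different sectors suggests new commands. A hedged Python sketch, assuming the classic kernel message format (which varies by kernel version):

```python
import re
from collections import Counter

# Assumes the older kernel format "blk_update_request: I/O error, dev sdi,
# sector 123456"; adjust the pattern for your kernel's exact wording.
pattern = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def summarize_io_errors(syslog_text: str) -> Counter:
    """Count I/O error lines per (device, sector) pair."""
    return Counter(pattern.findall(syslog_text))

sample = """\
kernel: blk_update_request: I/O error, dev sdi, sector 100200
kernel: blk_update_request: I/O error, dev sdi, sector 100208
kernel: blk_update_request: I/O error, dev sdi, sector 100216
"""
for (dev, sector), n in summarize_io_errors(sample).items():
    print(f"{dev} sector {sector}: {n} error(s)")
# Every count is 1 and the sectors are consecutive: that pattern looks like
# new writes marching forward, not retries of the same failed command.
```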

Link to comment

Yes, a bunch.

Line 5135: attempts to read the disk begin, and they continue with practically nothing else happening in the syslog except attempts to read the disk.

Line 6209: attempts to write back to the disk begin, which is where it actually gets disabled, and those continue with practically nothing else happening in the syslog except attempts to write the disk,

until line 7231.

So, over 1000 attempts to access the disk before it was disabled.

Link to comment
3 hours ago, trurl said:

Yes, a bunch. … So, over 1000 attempts to access the disk before it was disabled.

 

Putting it in perspective, none of those attempts appear to be retries. Unraid was just blindly trying to proceed when there obviously was a problem. And those 1000 attempts happened in around 1 second or less, so there was no attempt to allow a mechanical hard drive a normal recovery period.

 

As a hypothetical example, if a drive gets jostled during an I/O operation, it may issue errors and then have to mechanically recalibrate the heads. If Unraid just continues to beat on the drive while it's recalibrating, it's likely to keep getting errors until the recalibration is completed. By the time the drive has sorted itself out, Unraid has already disabled it.

 

A much better strategy would be for Unraid to retry on the FIRST error. If the retry fails, wait a few seconds and try again. Instead, Unraid seems to just blindly go ahead with new I/O operations and push what's often a perfectly good drive off the edge of a cliff. Unraid seems to be creating its own mess here, often triggered by a brief transient glitch.

 

EDIT: I should add that the problem could be at different levels. Slackware is not an especially well-regarded distro these days and struggles with newer hardware. Unraid apparently further modifies the Slackware kernel and layers other things on top of it, like the FUSE layer with its known bugs. So, in terms of file I/O, Unraid isn't built on a very solid foundation, especially compared to mdadm, ZFS, etc.

Edited by dev_guy
Link to comment
54 minutes ago, anpple said:

Try to unplug and replug, or change a SATA cable plz.

This is the typical advice for those who get a drive disabled by Unraid when the problem is most often Unraid itself. Hello, Microsoft? "My Windows PC crashed and won't boot!" Microsoft: "Have you tried unplugging it and plugging it back in?" Yeah, that's useless advice in this case.

 

The main issue here is how Unraid handles, and perhaps even creates, file I/O errors. It's not about SATA cables when any other operating system or even drive diagnostic software is perfectly happy with the same cable, drive, etc. 

Link to comment
  • 2 weeks later...
On 12/19/2022 at 1:45 PM, dev_guy said:

That's the problem the Unraid fan boys don't want to acknowledge. 

Respectfully, please refrain from using that term; it's not helpful.

 

We can continue this discussion if you want; I am open to an honest technical exchange, and to making code changes if necessary. Some brief comments. Re: CRC errors: those typically indicate some kind of corruption in the data path between the device controller and the device itself, usually as a result of bad cables, connectors, or power supply issues. In general, Unraid relies on the Linux device drivers to handle retries and assumes that if a write has failed, then the driver and the device itself have exhausted all attempts at recovery, and it would be pointless to waste more time re-issuing the same command over and over.
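For illustration, a CRC catches exactly this kind of data-path corruption. A toy Python sketch, using zlib's CRC-32 as a stand-in for the link-level CRC that SATA applies to each transfer:

```python
import zlib

# The sender computes a CRC over the payload and transmits both.
payload = b"512 bytes of sector data..."
crc_sent = zlib.crc32(payload)

# A single bit flipped on a marginal cable corrupts the payload in transit.
corrupted = bytearray(payload)
corrupted[3] ^= 0x01
crc_received = zlib.crc32(bytes(corrupted))

# The mismatch is what gets reported as a CRC error, and the drive's
# "UDMA CRC Error Count" SMART attribute ticks up.
print(crc_sent == crc_received)   # False
```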

 

Re: SMART: it's well known that many drives fail which have perfectly clean SMART reports. The fact that you see a disabled drive and the only thing in the SMART report is a single CRC error is suspect.

Link to comment
6 hours ago, limetech said:

Unraid relies on the Linux device drivers to handle retries and assumes that if a write has failed, then the driver and the device itself have exhausted all attempts at recovery, and it would be pointless to waste more time re-issuing the same command over and over.

 

Thank you @limetech for responding. The above is exactly the point I've been trying to make. A supposedly fault-tolerant storage array is not basic Linux file I/O, and Unraid should not rely on an ancient 1993 file system (XFS), a Linux distro with very few fans (Slackware), and FUSE, which has many known problems. But, as you said, that's exactly what Unraid is doing.

 

Unraid does not handle drive I/O errors in a fault-tolerant way. That's why there are so many documented examples of perfectly good drives being disabled by Unraid on this forum, Reddit, and elsewhere. Blaming it all on SATA cables and power supplies is just a convenient excuse when the exact same drive, cable, and power supply work perfectly with other storage operating systems. This is very much an Unraid problem.

 

Why does Unraid disable drives that are not eligible for warranty replacement and that work perfectly even when subjected to a 24+ hour torture test with the exact same cable, power supply, SATA interface, motherboard, etc.? And why do those same drives, cables, etc. work perfectly with TrueNAS, Open Media Vault, etc.?

 

This problem is even worse in that a wrongly disabled drive leaves the array in a fragile state, and the parity rebuild, which can take days, can trigger another I/O error and cause unnecessary data loss, all from disabling a perfectly good drive. This is not a minor problem.

 

This is very much an Unraid issue, and it would be great if Unraid addressed it instead of blaming cables, etc. What's the harm in adding some decent retry code, especially given the marginal foundation Unraid is built on?

 

To put it really simply: Unraid should not disable drives that work perfectly under other operating systems with the same cable, same power supply, same SATA interface, and same motherboard, and that are not eligible for warranty replacement. By kicking perfectly good drives out of the array, Unraid is needlessly putting user data at risk.

Link to comment
