UDMA CRC Error count increasing


breadman

Recommended Posts

Howdy y'all. First, I'd like to preface this by saying I'm a noob when it comes to servers and such. I recently built my first Unraid server and have just started to notice this error (UDMA CRC error count: increasing number).

 

After digging around, people had suggested the SATA connectors or the PSU, so I checked the connections, reseated them, and moved them. The one major call-out I see is that my disk 1 and disk 1 have two different usages in the screenshot below. I'm assuming I need to either rsync the array somehow or rebuild my array.

 

 

UDMA CRC error count 12.PNG

2 hours ago, breadman said:

I'm assuming I need to either rsync the array somehow or rebuild my array.

 

NO! CRC errors are always corrected. Their main effect is that they can slow down data transfers because of the time required to re-transmit the data until it is received correctly (as verified by the calculated CRC matching the received CRC). If you post up your diagnostics file, we will have a look at it. In general, you can ignore CRC errors after the fact. (I currently have one drive with 1823 CRC errors that I first found in August of this year. I reseated the connectors. There has been no increase since then, so I consider the problem resolved.) You cannot reset the count! If I remember correctly, on the Dashboard page, if you click on the error for that disk ('SMART' column), you can set it to ignore the error. You will not see it again unless the count increases.
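To make the retransmission idea concrete, here is a minimal Python sketch of a receiver that recomputes the CRC and asks for the frame again on a mismatch (purely illustrative; request_frame is a made-up stand-in, and this is not how the SATA link layer is actually coded):

```python
import zlib

def crc_ok(payload: bytes, received_crc: int) -> bool:
    """Return True if the recomputed CRC matches the CRC that arrived with the frame."""
    return zlib.crc32(payload) == received_crc

def transfer_with_retries(request_frame, max_retries: int = 8) -> bytes:
    """Re-request the frame until its CRC checks out (request_frame is a hypothetical callable).

    Each failed attempt costs time (the slowdown mentioned above), but the data that
    finally gets through is correct, which is why the error is 'corrected' even though
    the SMART counter keeps the running total forever.
    """
    for _ in range(max_retries):
        payload, crc = request_frame()      # ask the sender for the frame again
        if crc_ok(payload, crc):
            return payload                  # good copy received; only the error count grew
    raise IOError("retries exhausted: the link-level problem becomes a real I/O error")
```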

17 minutes ago, breadman said:

@Frank1940

 

I ended up reseating the SATA connectors, but sadly it's still increasing the count. I am using an M.2 to 5-port SATA adapter on my ITX motherboard. The onboard SATA connectors on the motherboard are all dead, so a quick fix was buying the adapter; I'm not sure if that could also be triggering the CRC notification.

 

For some reason CRC errors are an ongoing issue with Unraid. I've never had any of my other NAS builds, servers, PCs, etc. have issues with CRC errors; it's only been Unraid. By default, Unraid will forever flag a drive with even a single CRC error as a problem drive unless you tell it to ignore CRC errors. But, even worse, Unraid seems to remove drives from the array for a single CRC error, which is just wrong and can often be harmful.

 

Everyone agrees CRC errors are generally minor, such as someone bumping the cable and causing a one-time, tiny, harmless glitch that's automatically corrected. But, in some cases, Unraid wants to forever condemn that drive from the array, which is just wrong when the same drive will pass 24 hours of extensive testing with zero errors.

 

To put it simply, I've had many drives kicked out of Unraid arrays for even a single CRC error, but none of them have been shown to have any actual problems. It's something that needs to be addressed by Unraid.

13 minutes ago, dev_guy said:

Unraid seems to remove drives from the array for a single CRC error

This is not true!   Unraid will only disable a drive if a write to it fails for some reason.  A CRC error by itself never causes this, but if the recovery from the CRC error continually fails then the write will fail.

19 minutes ago, itimpi said:

This is not true!   Unraid will only disable a drive if a write to it fails for some reason.  A CRC error by itself never causes this, but if the recovery from the CRC error continually fails then the write will fail.

I have multiple experiences to the contrary, as supported by many posts here where there are zero write errors but Unraid still disables a drive. I agree Unraid should not disable a drive unless a write fails, but that's not how Unraid works in reality. For example, there are many cases where Unraid disables a drive during a parity rebuild even though there are zero writes to any drive except the parity drive.

Edited by dev_guy
17 hours ago, breadman said:

@Frank1940

 

I ended up reseating the SATA connectors, but sadly it's still increasing the count. I am using an M.2 to 5-port SATA adapter on my ITX motherboard. The onboard SATA connectors on the motherboard are all dead, so a quick fix was buying the adapter; I'm not sure if that could also be triggering the CRC notification.

 

I googled this and figured out what you are doing. My comment is that the problem could be in this adapter. Most of these types of items are of Chinese manufacture by unknown companies and are then distributed by vendors whose only company location is a PO box. I would be willing to bet that they never considered that this item might be used in a server environment where all five SATA ports are active at one time. (Look at the heat sink on the LSI cards that are intended for server use.)

 

As to my experience, about ten years ago I purchased a very cheap two-port SATA card for use in my Unraid server. (The MB had only four SATA ports on it.) One day, I looked at the SMART attributes on that server and found I had close to 20,000 CRC errors on one disk. I bit the bullet and ordered a used LSI card, replaced that two-port card, and never had another CRC error on the drive. (And no, @dev_guy, the drive was never disabled!)

 

I can't (and won't) say that your M.2 to 5-port converter is bad, but you should be evaluating it. Does it have a heat sink on it to keep the chipset cool? Does the adapter run hot when you are doing a parity check? Does it look like it is well built? What do other users say about it in their reviews?


The worst CRC situation I have seen did involve a drive being disabled by unRAID but I was able to recover. 

 

Notifications one day indicated CRC errors were growing rapidly on a particular drive and it accumulated over 1100 errors in about 15 minutes.  Finally, the drive was red-balled before I was able to get to the server to look at the problem.  All those CRC errors eventually resulted in a write error.  I told unRAID to temporarily ignore the errors on that drive and rebuilt the drive onto itself after replacing the SATA data cable and making sure it was well connected on both ends.

 

Of course, you cannot reset the CRC count on a drive, so it continues to report 1108 CRC errors, but the count has not increased in more than a year.

 

That is the one and only case in 11 years with unRAID where I "lost" a drive in the array due to CRC errors.  In fact it is the only disabled drive situation I have ever seen in my two unRAID systems (knock on wood).

Edited by Hoopster
6 hours ago, JorgeB said:

Give us an example with the logs showing that.

 

I believe the post linked below is an example, but I'm not sure there's a log to definitively prove it. I will admit I'm mostly going by anecdotal data here, from various users on the forum and my own personal experience. I have had multiple drives disabled where the only SMART issue is a few CRC errors. The drives test perfectly outside of Unraid. I'm aware that if Unraid thinks the file system on a drive is corrupted it may disable it, which could be the case in the post linked below, but I've also had disabled drives pass the XFS file system check.

 

I've copied the entire contents of nearly full disabled drives as a precaution, using Linux, without a single read error. Many here have reported similar experiences where they have a disabled drive that has nothing wrong with it. It could be that the CRC error occurred during a write, which should normally just cause a RETRY rather than a hard ERROR.

 

I have an LSI SATA controller with high-quality cables. I do think a backplane design with NO cables is a better solution, but one relatively few seem to be using with Unraid. But, weirdly, in all my years of working with PCs, other DIY NAS builds using other operating systems, commercial NAS products, etc., I've never had any CRC errors that I even know of, let alone having them be a problem. Nor have I ever had a drive disabled that wasn't obviously defective. There seems to be something about Unraid that possibly even creates CRC errors and causes it to disable perfectly good drives.

 

I will certainly capture the log and diagnostics if this happens again. But here's an example of the sort of thing I'm talking about:

 

1 hour ago, Frank1940 said:

 

As to my experience, about ten years ago I purchased a very cheap two-port SATA card for use in my Unraid server. (The MB had only four SATA ports on it.) One day, I looked at the SMART attributes on that server and found I had close to 20,000 CRC errors on one disk. I bit the bullet and ordered a used LSI card, replaced that two-port card, and never had another CRC error on the drive. (And no, @dev_guy, the drive was never disabled!)

 

I can't (and won't) say that your M.2 to 5-port converter is bad, but you should be evaluating it. Does it have a heat sink on it to keep the chipset cool? Does the adapter run hot when you are doing a parity check? Does it look like it is well built? What do other users say about it in their reviews?

 

The LSI controllers need heat sinks, as most are an old design and generally capable of doing onboard hardware RAID with the right firmware. I have two of them, and they waste power given that the servers they're in are generally powered 24/7. You can get a modern SATA controller that uses the same ASMedia family of chips used on many motherboards, such as the ASM1064. These run far cooler, are far more power efficient, and work just fine with Unraid. There have been issues with JMicron chips and Chinese "unknown" chips, but you can get something like a Syba or IO Crest controller that uses an ASM-series chip for only a few dollars more than the no-name Chinese junk. So the size of the heat sink isn't the sole indicator of the suitability of a SATA card, and I'd argue most more modest builds are likely better off with a more modern, smaller, more power-efficient controller.

18 hours ago, dev_guy said:

For example, there are many cases where Unraid disables a drive during a parity rebuild even though there are zero writes to any drive except the parity drive.

If a READ command fails, Unraid returns the parity calculated value and attempts to WRITE that value back to the drive where the read failed. If the write succeeds, the "errors" column is incremented, and things continue. If the write fails, Unraid disables that disk, and all further read and write operations to that data slot are calculated and implemented using parity.

 

So, it is quite possible to have a data read operation result in Unraid disabling a disk, but only if the address that failed the read ALSO failed the resulting write.
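As a rough illustration of that sequence (a sketch only, with every helper name hypothetical; not Unraid's actual md driver code):

```python
def handle_failed_read(disk, sector, reconstruct_from_parity, write_sector,
                       increment_error_count, disable_disk):
    """Sketch of the sequence described above; every helper passed in is hypothetical.

    1. The read failed, so the sector's contents are reconstructed from parity
       plus the other data disks.
    2. That reconstructed value is written back to the failing sector.
    3. If the write-back succeeds, only the 'errors' column is incremented.
    4. If the write-back also fails, the disk is disabled and emulated from parity.
    """
    reconstructed = reconstruct_from_parity(disk, sector)   # what the sector should hold
    try:
        write_sector(disk, sector, reconstructed)           # attempt to repair in place
        increment_error_count(disk)                         # shows up as an 'error', not a disable
    except IOError:
        disable_disk(disk)                                  # red-ball: further I/O goes via parity
    return reconstructed                                    # the caller still gets valid data
```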

18 minutes ago, JonathanM said:

If a READ command fails, Unraid returns the parity calculated value and attempts to WRITE that value back to the drive where the read failed. If the write succeeds, the "errors" column is incremented, and things continue. If the write fails, Unraid disables that disk, and all further read and write operations to that data slot are calculated and implemented using parity.

 

So, it is quite possible to have a data read operation result in Unraid disabling a disk, but only if the address that failed the read ALSO failed the resulting write.

 

Thanks for the helpful explanation. That might explain a lot about why Unraid drives get disabled when, in fact, there's nothing wrong with them besides perhaps a very brief, unfortunately timed SATA/CRC issue. A single failed write, to me, is not cause to condemn a drive, but rather to issue a warning and try again. Unraid is the only place I've used XFS; all my other servers have used Ext4, Btrfs, and ZFS. And Unraid, of course, is also using a somewhat unique parity scheme.

29 minutes ago, dev_guy said:

A single failed write, to me, is not cause to condemn a drive, but rather to issue a warning and try again.

 

The problem is that the drive is now out of sync with the parity information. The parity information is actually correct and the data drive's information is wrong! This could be further compounded if another drive were to fail completely (say, head damage to a platter). If that drive were rebuilt using the out-of-sync first drive, the rebuilt drive would also contain bad data because of the wrong data on the first drive. This is the reason that a single write failure will disable a drive!!!  (EDIT: The drive will then be emulated using the parity information. Writes can actually be made to this emulated drive. Unraid will behave as if it still has a physical drive attached in that position.)
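For intuition on why the emulated drive can still return correct data, here is a tiny single-parity sketch in Python (plain XOR over made-up stripes; not Unraid's actual on-disk layout):

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Three data disks and one parity disk, one tiny "stripe" each.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)            # parity = d0 ^ d1 ^ d2

# Disk 1 is disabled: its contents are emulated from parity plus the surviving disks.
emulated_d1 = xor_blocks(d0, d2, parity)
assert emulated_d1 == d1                   # the emulated drive returns the correct data
```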

 

Regarding a retry of the write: I believe that most modern hard drives already do this in the background. A modern hard drive is a very complex, proprietary software/mechanical system designed to recover from most situations and only throws an error when things are really, really busted!

Edited by Frank1940
6 hours ago, Frank1940 said:

 

The problem is that the drive is now out of sync with the parity information. The parity information is actually correct and the data drive's information is wrong! This could be further compounded if another drive were to fail completely (say, head damage to a platter). If that drive were rebuilt using the out-of-sync first drive, the rebuilt drive would also contain bad data because of the wrong data on the first drive. This is the reason that a single write failure will disable a drive!!!

 

Regarding a retry of the write: I believe that most modern hard drives already do this in the background. A modern hard drive is a very complex, proprietary software/mechanical system designed to recover from most situations and only throws an error when things are really, really busted!

 

The read should be retried multiple times before any attempt is made to write. If the read passes on a retry, I don't see how the parity would be corrupted. It would be better to log the read error and issue a warning. Unraid should arguably NOT try to write to a parity-protected drive it can't read, as that could indeed corrupt the parity, as you suggest.

 

I have various disk diagnostic programs capable of a variety of tests, with some granularity as to the options. It's normal for something like a surface-scan test to have, by default, 3 retries for any read or write errors. That's on top of whatever the drive might be doing internally. Something as simple as vibration, say your dog bumping into your server, can cause a single I/O operation to fail. Unraid should be more robust than that.
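The sort of retry wrapper being argued for is simple to express; a minimal sketch assuming a hypothetical read_sector helper (not a claim about how Unraid or any diagnostic tool implements it):

```python
import time

def read_with_retries(read_sector, disk, sector, retries: int = 3, delay_s: float = 0.1):
    """Retry a flaky read a few times before escalating (read_sector is a hypothetical helper).

    A transient glitch (a bumped cable, a bit of vibration) usually succeeds on the
    next attempt; only a persistent failure propagates upward as a real error.
    """
    last_error = None
    for _ in range(retries + 1):          # initial attempt plus the configured retries
        try:
            return read_sector(disk, sector)
        except IOError as exc:
            last_error = exc
            time.sleep(delay_s)           # brief pause before trying again
    raise last_error                      # genuine failure: let the caller decide what to do
```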

 

You are correct that modern drives have a lot going on. But they also log errors in their SMART data. And Unraid seems unique in how often it disables perfectly good drives with, at most, a few CRC errors and no other errors, not even a single uncorrectable read error. And the same drive will pass even extensive worst-case diagnostics.

 

If the devs think the current Unraid scheme is best, that's fine; I get that everything is a trade-off. But it doesn't change the fact that Unraid creates headaches by disabling perfectly good drives. It forces the user to take worrying and usually time-consuming action to address a non-issue, with the server in limp mode for an extended period and a higher risk of data loss. It's not an issue I've ever had with OpenMediaVault (which, like other OS options, uses simple, well-proven mdadm RAID), FreeNAS (using well-proven ZFS), Qnap, Synology, Asustor, etc. All of these NAS/servers have their own issues, but kicking perfectly good drives out of their arrays isn't one of them.

 

We also need to accept that XFS is a thirty-year-old file system on top of which Unraid layers additional code, and FUSE, which is arguably a recipe for problems compared to, say, mdadm. I get that there are a lot of Unraid loyalists on this forum and the level of support is great. But it would be nice if more would acknowledge areas that could be improved instead of defending the status quo.

 

The screenshots show an example of one of my Unraid servers. Disk 2 is perfectly fine. It has exactly 1 CRC error and that's it. It passes extensive diagnostics outside of the server without a single issue. Yet Unraid disabled it. Here are screenshots and the SMART report for the perfectly healthy drive Unraid disabled. I don't have a log of the actual event, as the server has since been rebooted so I could remove the presumably "bad" drive for testing and possible data recovery. Unraid unfortunately doesn't keep persistent logs because the operating system runs from a USB thumb drive. But the drive turned out to have nothing wrong with it.

 

Does anyone want to explain how it was acceptable for Unraid to disable this perfectly functioning drive with all the hassle and worry that creates?

 

Screen Shot 2022-11-14 at 10.38.49 AM.png

Screen Shot 2022-11-14 at 10.40.23 AM.png

WL6000GSA6457_WOL240371821-20221114-1039 2.txt

Edited by dev_guy
Added screenshots
6 hours ago, dev_guy said:

The screenshots show an example of one of my Unraid servers. Disk 2 is perfectly fine. It has exactly 1 CRC error and that's it. It passes extensive diagnostics outside of the server without a single issue. Yet Unraid disabled it.

The problem is Unraid can't determine WHAT made the write fail, only that the drive did, indeed, fail to properly record the write that was sent to it. The drive is most likely not at fault, but the fact remains: Unraid sent data, and the drive was unable to send back the signal that the write succeeded. Whether it was a loose cable, an HBA burp, RAM corruption, whatever, doesn't really matter to Unraid; the fact remains that data sent to the drive wasn't acknowledged, and must be treated as lost.

 

You are correct in that decisions were made to prioritize continuity, so there is as little disruption to the flow of data overall as possible. I suppose it could be possible to change that behaviour so the data stream would grind to a halt while Unraid tries multiple times to get communication going again with the drive, but I suspect you would have even more complaints about how slow Unraid is at file I/O.

 

Bottom line, a healthy system just doesn't have these sorts of issues. A read is issued, the data is returned. A write is issued, it's acknowledged. If that doesn't happen, the issue that's causing the problem needs to be addressed.

1 minute ago, JonathanM said:

The problem is Unraid can't determine WHAT made the write fail, only that the drive did, indeed, fail to properly record the write that was sent to it. The drive is most likely not at fault, but the fact remains: Unraid sent data, and the drive was unable to send back the signal that the write succeeded. Whether it was a loose cable, an HBA burp, RAM corruption, whatever, doesn't really matter to Unraid; the fact remains that data sent to the drive wasn't acknowledged, and must be treated as lost.

 

You are correct in that decisions were made to prioritize continuity, so there is as little disruption to the flow of data overall as possible. I suppose it could be possible to change that behaviour so the data stream would grind to a halt while Unraid tries multiple times to get communication going again with the drive, but I suspect you would have even more complaints about how slow Unraid is at file I/O.

 

Bottom line, a healthy system just doesn't have these sorts of issues. A read is issued, the data is returned. A write is issued, it's acknowledged. If that doesn't happen, the issue that's causing the problem needs to be addressed.

 

I'm sorry, but this is complete nonsense. Given that disk I/O errors are extremely rare unless a drive is seriously failing (not what we're discussing here), it will make zero difference to Unraid's performance to have a few retries, which can happen in a fraction of a second, to help prevent drives from being disabled for no good reason in the extremely rare instance there's an I/O error.

 

I'm also suggesting this all often starts with READ errors which, as you explained, result in Unraid then trying to write to the drive that it couldn't even read from. That just seems wrong, especially given the write attempt may corrupt the parity of the entire array, creating an even bigger problem. But, regardless of whether it's a read or write error, a few retries are going to make zero difference in real world performance unless something is seriously wrong. And my whole point is Unraid disables drives when there is literally NOTHING wrong with the drive.

12 hours ago, dev_guy said:

Given that disk I/O errors are extremely rare unless a drive is seriously failing

That's where you are mistaken. In this example, the disk we are discussing is perfectly fine; the fault lies in the communication path, not the drive itself. Cables, the HBA, and bad RAM can all cause bits to not make the trip intact.

 

12 hours ago, dev_guy said:

given the write attempt may corrupt the parity

No. The write keeps parity in sync. If the write is NOT completed successfully, parity is no longer in sync with the disk in question, so the disk must be removed from the parity equation. If the read failed, the parity equation is used to determine what should have been returned.
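To make "the write keeps parity in sync" concrete, here is a hedged single-parity sketch (illustrative only; not Unraid's actual md driver code, which also supports dual parity and works at the block layer):

```python
def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Classic read-modify-write: new_parity = old_parity XOR old_data XOR new_data.

    Both the data sector and the parity sector must land on disk; if the data write
    is lost, parity now describes contents the disk does not actually hold, which is
    exactly why the out-of-sync disk has to drop out of the parity equation.
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Continuing the tiny three-disk stripe from the earlier sketch:
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))
new_d1 = b"BBBX"
new_parity = update_parity(parity, d1, new_d1)
assert new_parity == bytes(a ^ b ^ c for a, b, c in zip(d0, new_d1, d2))
```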

 

12 hours ago, dev_guy said:

few retries are going to make zero difference in real world performance unless something is seriously wrong

Absolutely correct. It's the situation where continual retries fail that causes performance issues. The SATA command chain ALREADY does multiple retries internally before it throws in the towel and says the data is unable to be read or written. Once that happens, Unraid assumes the hardware did its job correctly and acts on the result.

 

12 hours ago, dev_guy said:

And my whole point is Unraid disables drives when there is literally NOTHING wrong with the drive. 

Yes, that happens fairly regularly. But until the problem that caused the data to not be read and written correctly is solved, there is nothing Unraid can do about it. Either the drive acknowledged a successful write or it didn't. Whether that's the drive's fault or not isn't relevant to this discussion.


@JonathanM Thanks for helping explain things further. I'm not familiar enough with what, say, a Linux mdadm RAID 5 does if there's an I/O error, but it appears to be far more resilient to transient SATA issues than Unraid. I agree there's likely a glitch somewhere upstream of the healthy drive being disabled, but those glitches, including even a single CRC error, are enough to have a drive disabled when arguably it should not be.

 

The retry issue is easily managed. Unraid already keeps error counters and could simply issue a warning and wait to disable a drive until the retry count grows above some threshold. That would prevent retries from ever impacting performance. It's obvious the SATA data link is not robust enough to rely on solely, given Unraid's parity scheme. This is especially true for the sort of inexpensive, DIY-grade builds Unraid encourages, using whatever old hardware someone has lying around.
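Something like that counter-and-threshold idea could look like the following sketch (a sketch of the suggestion only, with made-up thresholds; not anything Unraid actually does):

```python
from collections import defaultdict

class DriveErrorPolicy:
    """Count per-drive I/O errors; warn early, disable only past a threshold."""

    def __init__(self, warn_at: int = 1, disable_at: int = 10):
        self.counts = defaultdict(int)
        self.warn_at = warn_at
        self.disable_at = disable_at

    def record_error(self, drive: str) -> str:
        """Return the action the array layer should take for this error."""
        self.counts[drive] += 1
        if self.counts[drive] >= self.disable_at:
            return "disable"              # persistent problem: fall back to parity emulation
        if self.counts[drive] >= self.warn_at:
            return "warn"                 # notify the user, keep the drive in the array
        return "ignore"

policy = DriveErrorPolicy()
print(policy.record_error("disk2"))       # 'warn' on the first transient error, not 'disable'
```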

 

I want to stress that in many cases there's nothing significantly wrong with the drive, cable, controller, motherboard, etc. The same disabled drive can end up working just fine on the same cable and same SATA port after going through all the hassle of getting Unraid to re-adopt it.

 

I guess this is just another significant downside (another being write performance) to Unraid's parity scheme. And, like I said, all things are a trade-off. I appreciate the education and would encourage the powers that be to consider a simple retry scheme, which should greatly reduce all the issues around perfectly healthy drives being kicked out of the array. Such a scheme has zero downside and would save a lot of headaches.

21 minutes ago, dev_guy said:

And, like I said, all things are a trade-off. I appreciate the education and would encourage the powers that be to consider a simple retry scheme, which should greatly reduce all the issues around perfectly healthy drives being kicked out of the array.

In all the logs I have seen, there are typically something like a dozen retries done before the write is actually failed.

1 hour ago, itimpi said:

In all the logs I have seen, there are typically something like a dozen retries done before the write is actually failed.

 

I very much wish Unraid had persistent logs so I'd have a record of all my drives that have been disabled yet are perfectly fine. But I understand why Unraid doesn't want to trash the USB thumb drive or spin up array drives to make log entries. It would be nice if, on systems with an SSD cache where the System folder is located on the cache, Unraid could store persistent logs there instead of only in RAM. That would also protect the logs in the event of power loss. That's perhaps another feature request for a future release?

 

The above is in fact an issue with Unraid alternatives that don't use a dedicated drive for the OS. The drive spin-down feature on a Qnap, for example, is basically worthless, as periodic OS writes are constantly spinning up the array. But OpenMediaVault, for example, puts the OS on its own drive, typically a small SSD, and that works great. It does eat up a SATA port (or M.2 slot), but with quality small SSDs starting around $25 it's a great solution. And any Unraid system that has a cache drive could benefit from having the logs saved there.
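As a rough idea of what a user-side workaround could look like, here is a small sketch that periodically mirrors the in-RAM syslog to a cache path (the /var/log/syslog source, the /mnt/cache/system/logs destination, and the interval are all assumptions; this is not a built-in Unraid feature being described):

```python
import shutil
import time
from pathlib import Path

SRC = Path("/var/log/syslog")                  # assumed location of the in-RAM syslog
DST_DIR = Path("/mnt/cache/system/logs")       # hypothetical share on the cache pool

def mirror_syslog() -> None:
    """Copy the current in-RAM syslog to the cache so it survives a reboot or power loss."""
    DST_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.copy2(SRC, DST_DIR / f"syslog-{stamp}.txt")

if __name__ == "__main__":
    while True:
        mirror_syslog()
        time.sleep(15 * 60)                    # every 15 minutes; adjust to taste
```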

I certainly will be saving the logs next time this happens to me, before powering the server off to remove, test, and potentially back up the "failed" drive. The issue remains, and @JonathanM even agrees, that Unraid is known for disabling perfectly good drives, while no other NAS or server OS I've had experience with is anywhere near as prone to doing so. Something is different about Unraid that is causing this problem. I wish I had the logs to know more.

Edited by dev_guy
