UDMA CRC Error count increasing


breadman


4 minutes ago, Hoopster said:

Are you looking for something other than the syslog server functionality in unRAID?

 

I have a persistent syslog setup on both my servers.  They write to a flash drive (not the same one as unRAID boot flash).  One of my syslogs goes back almost a year.

 

Thanks for that! I assumed that functionality required a network server to log the data to (as is one of the options). I didn't realize you could have a local syslog file on a cache SSD. I will absolutely be enabling that on my Unraid servers immediately. 

Link to comment
28 minutes ago, Hoopster said:

Although mine is a very specific use case, I documented my syslog setup in this post.  It uses a <custom> syslog location.

 

Unraid has arguably more support than any of the other DIY NAS/Server OS options except perhaps FreeNAS/TrueNAS and that's great. I've read countless Unraid guides, spent time with the official Unraid docs, watched countless videos by Spaceinvader and many others, all on how to optimally set up Unraid, and somehow I've never seen anything about local persistent logs.

 

Unraid tries to be a beginner-friendly OS, but it quickly turns into a geek fest in many ways, especially when things go wrong, such as a perfectly healthy drive being disabled in your array. It's all too easy to make obscure mistakes that could cost you a big portion of your data. Many on this forum have ended up restoring from corrupt parity, having to reformat a drive disabled by Unraid (destroying all the data on it), losing data during an unnecessarily lengthy parity rebuild, being confused about how Unraid emulates a disabled drive, losing data while replacing a drive in the array, having their array completely disabled because Unraid mistakenly thinks two drives have failed, or having Unraid's parity become corrupted because their server was simply physically bumped. The list goes on.

 

Unraid has a lot of compelling advantages which is why I paid for multiple licenses for my multiple Unraid servers. But Unraid also seems to be a fragile snowflake when compared to most any other network storage solution which are generally far more robust. At least I'll now be able to have persistent logs to capture Unraid's failures so thanks again for that.

Link to comment
34 minutes ago, dev_guy said:

But Unraid also seems to be a fragile snowflake when compared to most any other network storage solution

Since I have never used anything but unRAID, I am not in a position to compare.  All I can say is that in the 11 years I have used unRAID, I have never lost any data except via user error.  My wife accidentally deleted an array folder full of thousands of photos.  Fortunately, I keep multiple backups and that problem was easily solved.

 

I have been through a couple of rounds of drive replacements/rebuilds with every disk on both my arrays (including cache drives/pools) without a single problem.

 

I am not an unRAID "power user" as I don't run many VMs or take full advantage of all the features of unRAID; however, I am also not a basic user using it only as a NAS.  I have a fair number of Docker containers in active use and run several user scripts.  I also use unRAID as the Windows backup location for three PCs in the house and frequently access it remotely.  It gets a pretty good workout.

 

I think the more it gets pushed to its limits (I have no idea how you use unRAID), the more likely it is that something "breaks."  The good thing in my view is that there are a lot of limits that can be pushed in unRAID and those keep expanding.  Yes, it is a double-edged sword.

 

I recognize that you have had legitimate issues with unRAID and am not attempting to minimize that.  My one "concern" with unRAID is that it has been growing so fast in popularity lately that many true "noobs" are attracted to it and serious mistakes can easily be made.  I started my unRAID journey knowing zero about Linux but at least with some technical background and understanding.  It was a simple NAS then and I was able to grow with unRAID as it added new features and functionality.

 

unRAID started out as a "geek toy" (not said disparagingly) and still has some of those roots.  Not everything is bulletproof and a lot of bullets get fired by new and, sometimes, even experienced users.  I don't know of any OS that is bulletproof.

 

The team at Limetech does a great job listening to users and addressing concerns.  Could they be more proactive in preventing issues from occurring in the first place? Perhaps.  However, I am satisfied with where unRAID is now and where it is going in the future.

Edited by Hoopster
Link to comment

@Hoopster Thanks for all that. I agree there's a lot to like about Unraid. But I'm not sure about "The team at Limetech does a great job listening to users and addressing concerns" as they often ignore well documented problems release after release. I'm not at all trying to push Unraid to its limits as you suggest. For me Unraid falls on its face just trying to be even the most basic file server when it disables a perfectly good drive in the array, creating all sorts of headaches in doing so. To me it doesn't matter if Unraid is great at hosting Docker apps, or passing through a graphics card to a VM, if it often fails at the basic task of being a file server by kicking perfectly good drives out of service and creating lots of pain in doing so. That's an issue the "team at Limetech" seems to want to ignore.

Link to comment
8 hours ago, dev_guy said:

Unraid falls on its face just trying to be even the most basic file server

It works for thousands of people. You insist it's the OS's fault, but it works fine for the vast majority of folks. Very few people sign up to the forums just to say things are working well, so you mostly just see the issues.

 

Yes, there are some combinations of hardware that cause issues, but Limetech can't control that; all they can do is offer suggestions of what hardware seems to work well. The worst part is that the hardware list is a moving target as the Linux kernel evolves and new products come to market.

Link to comment
3 hours ago, JonathanM said:

Yes, there are some combinations of hardware that cause issues, but Limetech can't control that; all they can do is offer suggestions of what hardware seems to work well. The worst part is that the hardware list is a moving target as the Linux kernel evolves and new products come to market.

 

I appreciate your support and the support of others here. It's one of the best things Unraid has going for it. I do understand about hardware compatibility issues, the changing Linux kernel, etc. I've run Unraid on everything from a Supermicro Xeon rack server with ECC RAM to a J5040 build that draws only 12 watts from the wall with the drives in standby, along with a few other hardware platforms. Depending on the build, I've used only the motherboard SATA ports, 2 different LSI 8 port PCIe cards, or ASMedia PCIe cards. I've used a wide variety of drives with Unraid including brand new WD and Seagate NAS drives.

 

In all my years of using Unraid on a wide variety of hardware, one thing that's been consistent is CRC errors resulting in perfectly good drives being disabled. As I've said, it's not a problem I've ever had with any other NAS/server OS or commercial NAS products like Qnap. It's also not been specific to just one hardware platform, type of SATA controller, drive, etc.

 

The reason I have two LSI 8 port boards is I replaced the first one when I started having CRC issues. I first changed out the cables, as is common wisdom here and, when that didn't solve the problem, I replaced the entire LSI board as is also common wisdom here. This was in an enterprise grade Supermicro Xeon rack server. It wasn't the cables, LSI controller, or the drives. The problem, in my opinion, is Unraid. That exact same hardware, right down to the drives Unraid wrongly disabled, worked flawlessly for over a year with FreeNAS/TrueNAS and ZFS. So yeah, I blame the Unraid OS for some things. But I'll stop beating the dead horse. I do appreciate the support here from everyone regardless.

Link to comment
25 minutes ago, dev_guy said:

In all my years of using Unraid on a wide variety of hardware, one thing that's been consistent is CRC errors resulting in perfectly good drives being disabled.

 

In every case where CRC errors came up for me, it was always the cable: cheap SATA cables...

No problems with the controllers or drives, ever.

Edited by Zonediver
Link to comment
  • 5 weeks later...
On 11/13/2022 at 6:20 PM, dev_guy said:

 

For some reason CRC errors are an ongoing issue with Unraid. I've never had any of my other NAS builds, servers, PC's, etc. have issues with CRC errors. It's only been Unraid. By default Unraid will forever flag a drive with even a single CRC error as being a problem drive unless you tell Unraid to ignore CRC errors. But, even worse, Unraid seems to remove drives from the array for a single CRC error which is just wrong and can often be harmful.

 

Everyone agrees CRC errors are generally minor such as someone bumped the cable causing a one time tiny harmless glitch that's automatically corrected. But, in some cases, Unraid wants to forever condemn that drive from the array which is just wrong when that same drive will pass 24 hours of extensive testing with zero errors.

 

To put it simply I've had many drives kicked out of Unraid arrays for even a single CRC error but none of them have been shown to have any actual problems. It's something that needs to be addressed by Unraid.

Thank god someone said this. It NEEDS to be said.

This issue has existed since 6.**, and it's almost out of control now.

I have done around 15 builds in the last 2 years. All exhibit CRC errors and kicked disks every few months. The hardware is all fine and has been swapped: disk controllers, cables, power, everything.

The issue still exists.

I did an experiment with two builds: one with an Intel 10400 CPU and an Adaptec 71605, and one with a Ryzen 5600G and an LSI 9201, with four 8 TB disks in each. I ran them until the inevitable CRC errors happened (60-70 days). One had a kicked disk.

I wiped both machines and put TrueNAS Core on them. They ran for 6 months without any errors, no CRC errors, nothing. (I touched nothing in the systems; they were still in the rack.)

Sure, there will be people saying it's cables, power, etc., or "I never have issues," or the like.

I have been using Unraid for 14 years now. I have never had so many problems as I have had in the last few years. I have never bought so many HBAs and packages of SAS cables in my life to try to fix a problem.

Link to comment

@splendidthunder

@dev_guy

Two basic misconceptions here.

 

Unraid CANNOT cause UDMA CRC errors. UDMA CRC errors are recorded by the drive firmware when the drive receives inconsistent data as determined by checksum. Unraid does not generate or check the checksum; that is all up to the firmware.

 

Unraid DOES NOT disable a disk due to UDMA CRC errors. Unraid only disables a disk when it has a write failure. It does this because the failed write makes the drive out-of-sync with parity. That initial failed write and any subsequent writes to the (now emulated/disabled) disk can be recovered from parity by rebuilding.
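
The emulation and rebuild described above rest on XOR parity. A minimal sketch in Python, assuming single parity and tiny made-up "disks" of three bytes each; `xor_blocks` and the sample data are hypothetical and only illustrate the principle, not Unraid's actual driver code:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical array: three data disks plus one parity disk.
data_disks = [b"\x10\x20\x30", b"\x0f\x0e\x0d", b"\xaa\xbb\xcc"]
parity = xor_blocks(data_disks)

# Disk 1 is disabled after a failed write; its contents are emulated
# from parity plus the remaining disks.
survivors = [d for i, d in enumerate(data_disks) if i != 1]
emulated = xor_blocks(survivors + [parity])

# A write to the emulated disk only updates parity; a later rebuild
# recomputes the same bytes and puts them on the physical disk.
new_data = b"\x01\x02\x03"
parity = xor_blocks(survivors + [new_data])
rebuilt = xor_blocks(survivors + [parity])
```

In this model the failed write is never lost: it lives in parity until the rebuild writes it back, which is why the disk has to be rebuilt rather than simply re-enabled.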

 

Often a write failure will not even be accompanied by a UDMA CRC error because the drive never received any data at all to perform the checksum on. And often a UDMA CRC error will not be accompanied by a write failure since the data is resent. It is of course possible that a UDMA CRC error and a write failure are caused by the same hardware issue, so you could get both at the same time.
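
A rough sketch of that distinction, assuming a toy model in which the drive checks each transfer and the host resends on a mismatch. `zlib.crc32` stands in for the link-level CRC (the real one is computed in hardware and is a different polynomial), and `Drive`, `send` and `flaky_link` are hypothetical names:

```python
import zlib

class Drive:
    """Toy receiver: verifies each frame and counts checksum mismatches."""
    def __init__(self):
        self.crc_error_count = 0   # never reset, like the SMART attribute
        self.stored = None

    def receive(self, payload: bytes, crc: int) -> bool:
        if zlib.crc32(payload) != crc:
            self.crc_error_count += 1
            return False           # ask the host to resend
        self.stored = payload
        return True

def send(drive, payload, link, max_tries=3):
    """Host side: resend until the drive accepts the frame or tries run out."""
    crc = zlib.crc32(payload)
    for _ in range(max_tries):
        if drive.receive(link(payload), crc):
            return True
    return False                   # only this would surface as a write failure

def flaky_link():
    """Link that corrupts only the first frame it carries."""
    state = {"first": True}
    def carry(payload):
        if state["first"]:
            state["first"] = False
            return bytes([payload[0] ^ 0x01]) + payload[1:]
        return payload
    return carry

drive = Drive()
ok = send(drive, b"sector data", flaky_link())
```

Here the counter goes up by one while the write still succeeds on the retry, so a CRC error on its own does not imply a failed write.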

 

UDMA CRC errors will not reset; they are permanently recorded in the drive firmware. Unraid monitors certain SMART attributes by default, and UDMA CRC errors are one of those. You can acknowledge the current count of any monitored SMART attribute by clicking the SMART warning on the Dashboard, and Unraid will warn again if it increases.
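
The acknowledge-then-warn-again behaviour can be sketched in a few lines of Python. This is an illustration only: the sample row is made up in the layout `smartctl -A` prints, and `raw_value` and `should_warn` are hypothetical helpers, not part of Unraid:

```python
def raw_value(smart_table: str, attribute: str) -> int:
    """Return the RAW_VALUE (tenth column) of a named attribute row."""
    for line in smart_table.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == attribute:
            return int(fields[9])
    raise KeyError(attribute)

def should_warn(current: int, acknowledged: int) -> bool:
    """Warn only when the count has grown past the acknowledged value."""
    return current > acknowledged

# Made-up sample row; attribute 199 is the UDMA CRC error counter.
sample = "199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 4"
count = raw_value(sample, "UDMA_CRC_Error_Count")
```

Because the drive never resets the counter, comparing against the last acknowledged value is the only way to tell an old, already-fixed cabling glitch from a problem that is still occurring.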

 

You have control over which SMART attributes are monitored in Disk Settings (for all disks) or for a specific disk in its settings. You also have control over how these are reported in Notification Settings.

 

Since this was your first post to the forum, if you continue to have problems start a new thread with your diagnostics and maybe we can help you get to the bottom of it.

Link to comment
2 hours ago, trurl said:

@splendidthunder

@dev_guy

Since this was your first post to the forum, if you continue to have problems start a new thread with your diagnostics and maybe we can help you get to the bottom of it.

 

I have 107 posts to the forum and have been an Unraid user for many years so this is hardly my first post. But I appreciate your comments. I'm aware of how CRC errors work and know they're a function of the SATA protocol, firmware, etc. My point is they only seem to be an issue with Unraid. I've literally had Unraid disable a drive and then used the exact same hardware, drives, cables, motherboard, chassis, power supply, etc. to run TrueNAS Core and it's still running perfectly with zero issues.

 

TrueNAS is perfectly happy with the exact same hardware Unraid wasn't happy with. I've also never had Open Media Vault, Synology, or Qnap disable a perfectly good drive. I realize that's anecdotal evidence but many others here have had similar experiences with Unraid disabling perfectly good drives that test fine and/or no other OS has an issue with. There's a lot of evidence it's an Unraid issue.

 

Unraid seems unique in its propensity to disable perfectly good drives for even a single CRC error. I can't prove if the CRC error triggered a write error but I've changed my logging to try and trap such events in the future. Regardless, in my experience, 4 other NAS operating systems don't have this problem but Unraid does. It's as simple as that.

 

It can be a serious problem for Unraid to wrongly disable a drive as the rebuild process can create other problems and/or data can be lost in other ways. Part of the issue here is Unraid's parity system is inherently fragile.  A single bit error anywhere in the array can destroy your ability to recover from a drive failure. And, to make it worse, the process of rebuilding parity after a failure is rather likely to trigger a new CRC error and perhaps disable another drive. If two drives are disabled a normal Unraid config falls on its face. The good news, if you can call it that, is you're only likely to lose the files on the failed drive(s) not the entire array.
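
To make the single-bit claim concrete, here is a small Python sketch assuming plain single XOR parity and two made-up data disks; `xor_bytes` is a hypothetical helper and this is not Unraid's code:

```python
def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

disk_a = b"\x11\x22\x33\x44"
disk_b = b"\x55\x66\x77\x88"
parity = xor_bytes(disk_a, disk_b)

# Silent corruption: one bit flips on disk_a after parity was computed.
corrupt_a = bytes([disk_a[0] ^ 0x04]) + disk_a[1:]

# disk_b then fails and is rebuilt from the corrupted survivor plus parity.
rebuilt_b = xor_bytes(corrupt_a, parity)
wrong_bits = sum(bin(x ^ y).count("1") for x, y in zip(rebuilt_b, disk_b))
```

In this model the flipped bit reappears at the same offset on the rebuilt disk, while the rest of the rebuilt data is intact. How much damage that does in practice depends on what was stored at that offset.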

 

In many cases, the data on the "failed" Unraid drive is just fine and is 100% readable despite Unraid disabling the drive. But a user has to know the best practice is arguably to get the data off that drive before they attempt any recovery or replacing the drive. Unraid creates problems and endangers data while TrueNAS, OMV, Synology, and Qnap don't have similar issues even, in some cases, on the exact same drives/hardware.

Link to comment
7 hours ago, dev_guy said:

this is hardly my first post

I was referring to the user who quoted you.

 

It has been very rare in my nearly 12 years on Unraid that I have had a drive disabled. Most of my rebuilding has been for upsizing drives. I have never had more than 9 drives in either of my servers, but I expect most people with many more drives don't have these frequent disabled disks either. I often wonder if power distribution isn't really the problem for some people instead of SAS/SATA cables and connections.

 

These CRC errors really are just about the hardware, Unraid only reports them. They are not the cause of disabled disks, just another symptom of a problem that might result in a disabled disk.

 

As for disabled disks, what should Unraid do when a disk can't be written? It keeps working by emulating the disk so you can recover the missed writes from parity.

 

Link to comment
1 minute ago, trurl said:

I was referring to the user who quoted you.

 

It has been very rare in my nearly 12 years on Unraid that I have had a drive disabled. Most of my rebuilding has been for upsizing drives. I have never had more than 9 drives in either of my servers, but I expect most people with many more drives don't have these frequent disabled disks either. I often wonder if power distribution isn't really the problem for some people instead of SAS/SATA cables and connections.

 

These CRC errors really are just about the hardware, Unraid only reports them. They are not the cause of disabled disks, just another symptom of a problem that might result in a disabled disk.

 

As for disabled disks, what should Unraid do when a disk can't be written? It keeps working by emulating the disk so you can recover the missed writes from parity.

 

You seem to live in a fantasy world. I've had multiple Unraid servers disable perfectly good drives multiple times. You seem to want to ignore the real issues here.

Link to comment
3 minutes ago, dev_guy said:

You seem to live in a fantasy world. I've had multiple Unraid servers disable perfectly good drives multiple times. You seem to want to ignore the real issues here.

You seem to be a devout Unraid fan boy where Unraid can do no wrong. But, the reality is Unraid increasingly does a lot wrong as can be easily proven.

Link to comment
1 hour ago, dev_guy said:

disable perfectly good drives

Often the case that a drive is perfectly good but can't be written for some reason. I always say bad connections are much more common than bad disks.

 

1 hour ago, trurl said:

what should Unraid do when a disk can't be written? It keeps working by emulating the disk so you can recover the missed writes from parity.

And this is pretty much what parity does for you in more traditional RAID implementations.

Link to comment
On 12/16/2022 at 8:31 AM, trurl said:

@splendidthunder

@dev_guy

 

 

Since this was your first post to the forum, if you continue to have problems start a new thread with your diagnostics and maybe we can help you get to the bottom of it.

I have already done this with support, on 3 different systems. Every time, support said it was hardware, so I replaced everything, including power, cables, and controllers. This is not my normal account; I cannot remember which email I used on here before.

Support wants to blame hardware for everything. They had no interest in what I had shown them.

On 6 of the customer systems, we swapped to TrueNAS with the hardware in place. All backups have been working without any errors (including CRC) for a year and a few months now.

So to blame hardware (yes, I'm very aware UDMA CRC errors are hardware related) is just the easy way out. The problem could be how Unraid reads SMART data, how it registers that there is in fact a CRC error, or how it handles what happens when a CRC error occurs. Either way, I'm not going to be QA for Unraid. That's just unacceptable.

I guess customer support is right though: I had 12 total servers in the field doing backups of data, and they all had hardware problems causing CRC errors (3 different batches of different hardware too, with different disks in all of them; the customer supplies the disks new). It does not make sense. The hardware is fine; TrueNAS has proven that.

Link to comment
4 hours ago, splendidthunder said:

I have already done this with support, on 3 different systems. Every time, support said it was hardware, so I replaced everything, including power, cables, and controllers. This is not my normal account; I cannot remember which email I used on here before.

Support wants to blame hardware for everything. They had no interest in what I had shown them.

On 6 of the customer systems, we swapped to TrueNAS with the hardware in place. All backups have been working without any errors (including CRC) for a year and a few months now.

So to blame hardware (yes, I'm very aware UDMA CRC errors are hardware related) is just the easy way out. The problem could be how Unraid reads SMART data, how it registers that there is in fact a CRC error, or how it handles what happens when a CRC error occurs. Either way, I'm not going to be QA for Unraid. That's just unacceptable.

I guess customer support is right though: I had 12 total servers in the field doing backups of data, and they all had hardware problems causing CRC errors (3 different batches of different hardware too, with different disks in all of them; the customer supplies the disks new). It does not make sense. The hardware is fine; TrueNAS has proven that.

 

Thanks for supporting my position. I'm kind of amazed how the Unraid fans try to dismiss these issues as being some other problem when everything points to the real issue being Unraid. There's a huge amount of evidence that Unraid wrongly disables perfectly good drives. Yet, rather than acknowledge the issue, Unraid and their fan boys seem to prefer to ignore and/or discredit anyone who suffered potential data loss from this very real issue. It's really disappointing.

 

And, as is nearly always the case lately, Tom Mortensen, the man responsible for Unraid, is silent on the issue despite being all too happy to take our money.

Edited by dev_guy
Link to comment
8 hours ago, splendidthunder said:

how Unraid reads SMART data, how it registers that there is in fact a CRC error, or how it handles what happens when a CRC error occurs

It reads the SMART data the same way everything else does. The drive supplies the data the same way to anything that asks for it. All it does with these is report them. It doesn't "handle" them in any other way. Disabled disks are only caused by write failures to the disk.

On 12/16/2022 at 7:31 AM, trurl said:

Often a write failure will not even be accompanied by a UDMA CRC error because the drive never received any data at all to perform the checksum on. And often a UDMA CRC error will not be accompanied by a write failure since the data is resent. It is of course possible that a UDMA CRC error and a write failure are caused by the same hardware issue, so you could get both at the same time.

 

 

Link to comment
3 minutes ago, trurl said:

Just for completeness. A read error can cause Unraid to try to get the data from the parity calculation and write it back to the disk. If that write fails, the disk is disabled. ONLY write errors disable disks.
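
The rule in that quote can be written down as a small decision function. This is a sketch of the described behaviour only, with hypothetical names (`Outcome`, `handle_read`) and boolean inputs standing in for real I/O results:

```python
from enum import Enum

class Outcome(Enum):
    OK = "ok"
    CORRECTED = "corrected from parity"
    DISABLED = "disk disabled"

def handle_read(read_ok: bool, write_back_ok: bool) -> Outcome:
    """Model of the rule quoted above; both flags are hypothetical inputs."""
    if read_ok:
        return Outcome.OK
    # Read failed: the data is recomputed from parity and written back.
    if write_back_ok:
        return Outcome.CORRECTED
    # Only a failed write takes the disk out of the array.
    return Outcome.DISABLED
```

Under this model a read error alone never disables a disk; it has to be followed by a failed write of the reconstructed data.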

 

I can only repeat what I've already said that a large number of Unraid users have had perfectly good drives disabled while this almost never happens with other NAS operating systems. Even some very experienced users here have admitted it's a side effect of how Unraid's parity works. I can't factually dispute if there is or isn't a write error but it doesn't really matter if the result is a perfectly good drive being disabled. A drive that will pass 24+ hours of continuous testing and then go on to work perfectly in a TrueNAS ZFS or Open Media Vault RAID array for a year or more. And, as I've said, this is true even with the exact same SATA cables, controller, motherboard, power supply, chassis, etc. 

 

It's also true that every drive Unraid has disabled on me is not eligible for warranty replacement, as they all pass diagnostics and work perfectly with everything but Unraid.

Link to comment
On 12/16/2022 at 8:28 PM, trurl said:

Often the case that a drive is perfectly good but can't be written for some reason. I always say bad connections are much more common than bad disks.

 

2 minutes ago, dev_guy said:

every drive Unraid has disabled on me is not eligible for warranty replacement

There is no need to replace a good disk that has been disabled; you just need to rebuild it to recover the failed and emulated writes so it is back in sync.

Link to comment
1 minute ago, trurl said:

 

There is no need to replace a good disk that has been disabled; you just need to rebuild it to recover the failed and emulated writes so it is back in sync.

 

The problem is it can take 24+ hours of pounding on ALL the drives in the array to rebuild a drive that was wrongly disabled and that opens the door to causing another failure during the rebuild process. There is clearly something more sensitive and failure prone in the criteria Unraid uses to disable drives.

Link to comment
