dev_guy

Members
  • Posts: 126
  • Joined
  • Last visited
  • Days Won: 1

dev_guy last won the day on November 16 2022

dev_guy had the most liked content!



dev_guy's Achievements

Apprentice (3/14)

Reputation: 15

Community Answers

  1. @MacModMachine Thank you for sharing your experiences with Unraid. They mirror my own. The simple fact is TrueNAS runs perfectly on the exact same hardware where Unraid wrongly disabled perfectly good drives, right down to every SATA cable being the same. Tom, and the Unraid fanboys, need to acknowledge this is a real issue and stop blaming it on convenient excuses when it's clearly an Unraid issue. I no longer trust Unraid for storage and just use it as a last-resort backup and Docker platform, given how many times Unraid has disabled perfectly good drives running on hardware that no other OS has issues with. It's a very real problem, and Unraid users shouldn't accept the whole hardware blame game. If Unraid disables a drive and that drive passes diagnostics, just try a better NAS operating system instead of replacing your cable, controller, motherboard, power supply, etc., as the Unraid faithful insist. Unraid disables perfectly good drives on perfectly good hardware, but the people who matter pretend the problem doesn't exist.
  2. Thank you @limetech for responding. The above is exactly the point I've been trying to make. A supposedly fault-tolerant storage array is not basic Linux file I/O, and Unraid should not rely on an ancient 1993 file system (XFS), a Linux distro with very few fans (Slackware), and FUSE, which has many known problems. But, as you said, that's exactly what Unraid is doing. Unraid does not handle drive I/O errors in a fault-tolerant way. That's why there are so many documented examples of perfectly good drives being disabled by Unraid on this forum, on Reddit, and elsewhere. Blaming it all on SATA cables and power supplies is just a convenient excuse when the exact same drive, cable, and power supply work perfectly with other storage operating systems. This is very much an Unraid problem. Why does Unraid disable drives that are not eligible for warranty replacement and that work perfectly even when subjected to a 24+ hour torture test with the exact same cable, power supply, SATA interface, motherboard, etc.? And why do those same drives, cables, etc., work perfectly with TrueNAS, Open Media Vault, etc.? The problem is even worse in that a wrongly disabled drive leaves the array in a fragile state, and the parity rebuild, which can take days, can cause another I/O error and unnecessary data loss from disabling yet another perfectly good drive. This is not a minor problem. It is very much an Unraid issue, and it would be great if Unraid addressed it instead of blaming cables, etc. What's the harm in adding some decent retry code, especially given the marginal foundation Unraid is built on? To put it simply: Unraid should not disable drives that work perfectly under other operating systems with the same cable, same power supply, same SATA interface, and same motherboard, and that are not even eligible for warranty replacement. By kicking perfectly good drives out of the array, Unraid is needlessly putting user data at risk.
  3. This is the typical advice for those who get a drive disabled by Unraid when the problem is most often Unraid itself. Hello, Microsoft? "My Windows PC crashed and won't boot!" Microsoft: "Have you tried unplugging it and plugging it back in?" Yeah, that's useless advice in this case. The main issue here is how Unraid handles, and perhaps even creates, file I/O errors. It's not about SATA cables when any other operating system, or even drive diagnostic software, is perfectly happy with the same cable, drive, etc.
  4. Putting it in perspective, none of those attempts appear to be retries. Unraid was just blindly trying to proceed when there obviously was a problem, and those 1000 attempts happened in around a second or less, so there was no attempt to allow a normal recovery period for a mechanical hard drive. As a hypothetical example, if a drive gets jostled during an I/O operation, it may issue errors and then have to mechanically recalibrate the heads. If Unraid just continues to beat on the drive while it's recalibrating, it's likely to keep getting errors until the recalibration is complete. By the time the drive has sorted itself out, Unraid has already disabled it. A much better strategy would be for Unraid to retry on the FIRST error; if the retry fails, wait a few seconds and try again (see the sketch below). Instead, Unraid seems to just blindly go ahead issuing new I/O operations and push what's often a perfectly good drive off the edge of a cliff. Unraid seems to be creating its own mess here, often triggered by a brief transient glitch. EDIT: I should add the problem could be at different levels. Slackware is not an especially well-regarded distro these days and struggles with newer hardware. Unraid apparently further modifies the Slackware kernel and layers other things on top of it, like the notoriously buggy FUSE. So, in terms of file I/O, Unraid isn't built on a very solid foundation, especially compared to mdadm, ZFS, etc.
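
     For illustration, here's a rough Python sketch of the kind of retry logic I mean. The names (DiskIOError, write_fn) are made up for the example; this is not Unraid's actual code:

     import time

     class DiskIOError(Exception):
         """Hypothetical stand-in for a low-level read/write failure."""

     def write_with_retry(write_fn, data, retries=3, delay_s=2.0):
         # Attempt the SAME operation again before escalating, pausing
         # between attempts so a drive recovering from a transient event
         # (vibration, brief power dip) has time to recalibrate.
         for attempt in range(1, retries + 1):
             try:
                 return write_fn(data)
             except DiskIOError:
                 if attempt == retries:
                     raise  # only now escalate, e.g. mark the drive failed
                 time.sleep(delay_s)  # let the mechanics settle

     Even something that simple would ride out a one-second glitch instead of disabling the drive.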
  5. Looking at the syslog for the above, I see a bunch of I/O errors in immediate succession. I don't see any retries of the same I/O command or sector. The drive (sdi) is also obviously still connected and communicating, as it later passes a SMART read and is successfully unmounted, with both logged before the server is shut down. I'm not saying the issue is never a bad cable or connection. I'm just saying Unraid could handle such errors better and seems far more sensitive to them.
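
     You can check this yourself. Here's a quick Python sketch that tallies kernel I/O error lines per sector; the regex matches the common "I/O error, dev sdX, sector N" form, and the path is an assumption, so adjust for your setup:

     import re
     from collections import Counter

     # Kernel lines like: "blk_update_request: I/O error, dev sdi, sector 123456"
     SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")

     def sector_attempts(path="/var/log/syslog"):
         # A sector seen more than once suggests a retry; a long run of
         # distinct sectors suggests the OS kept issuing NEW I/O instead.
         counts = Counter()
         with open(path, errors="replace") as log:
             for line in log:
                 m = SECTOR_RE.search(line)
                 if m:
                     counts[(m.group(1), int(m.group(2)))] += 1
         return counts

     for (dev, sector), n in sector_attempts().items():
         print(f"{dev} sector {sector}: {n} attempt(s)")

     In the syslog above, every sector shows up exactly once, which is what you'd expect if nothing is being retried.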
  6. Because the drives, cables, everything test fine without touching anything, including the connections. So your concept of "bad connections" has the same end result: it needlessly puts data at risk by disabling drives that no other software or operating system has any issue with. Rebuilding data can be a 24+ hour process of continuous pounding on all the drives in the array, which can trigger additional problems when the array is already in a fragile state.
  7. I've already explained multiple times that I don't have log data of a drive being disabled, due to Unraid's unfortunate default log setup. I don't know what you mean by drives being "disconnected", as I don't think that's part of the SATA interface. The drives I'm talking about are very much still connected, in that you can read the SMART data, run self-tests on them, etc., but Unraid has disabled them from the array. If something is "disconnecting" drives, it's likely Unraid, NOT the hardware.
  8. @splendidthunder I appreciate your input and support around this being something Unraid could greatly improve if they'd only acknowledge the issue. There are ways to have Unraid write log data to an SSD cache drive/array, which I, unfortunately, didn't know about until relatively recently, as I never encountered that option in any of the docs, setup guides, etc. It's such a basic thing they should make it more obvious, and arguably even put the logs on your SSD cache by default if you have one. And even if you don't have a cache, they could write the logs to the system folder on the array if you don't enable drive spin-down. But instead, by default, your log is destroyed every time you power down or reboot your Unraid server, which conveniently destroys the evidence of what happened to your wrongly disabled, perfectly good drive (see the sketch below for one workaround). I've learned the hard way to reboot ASAP into Linux to get the data off disabled drives, out of fear the emulated data will also become corrupted, as Unraid beats on all the drives for every access to the emulated data and/or during the rebuild. I've lost data that way. But the reboot, by default, destroys the logs. You are absolutely correct that Unraid should be optimized for consumer-grade hardware and not put data at risk by wrongly disabling drives that suffer a brief transient problem. Disabling a drive very significantly puts data at risk and should be a last resort, instead of happening all too often when there is no ongoing problem. I get that Unraid is trying to maintain the integrity of the parity, but kicking perfectly good drives out of the array is not the best option and greatly increases the risk of data loss. And you are also correct that this is a common issue and the Unraid fans just blame your cables, drives, power supply, SATA interface, etc. The real issue for me is the exact same hardware works perfectly with TrueNAS, Open Media Vault, etc., and the disabled drive/cable/controller/power supply passes even extensive extended diagnostics. This seems to be something Unraid just wants to pretend isn't a problem when it really is. People complain about TrueNAS/FreeNAS being demanding on hardware but, honestly, it's been way more trouble-free on the exact same hardware Unraid didn't like.
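
     If nothing else, you can mirror the log yourself. Here's a small Python sketch that periodically copies the in-RAM syslog to persistent storage; the paths are assumptions (Unraid keeps /var/log in RAM, and /mnt/cache is a typical cache mount), so adjust for your system:

     import os
     import shutil
     import time

     SRC = "/var/log/syslog"                                # lives in RAM
     DEST = "/mnt/cache/appdata/syslog-backup/syslog.copy"  # survives reboot

     def mirror_syslog(interval_s=300):
         # Copy the log every few minutes so the evidence of a disabled
         # drive survives the reboot that would otherwise erase it.
         os.makedirs(os.path.dirname(DEST), exist_ok=True)
         while True:
             try:
                 shutil.copy2(SRC, DEST)
             except OSError as e:
                 print(f"copy failed: {e}")
             time.sleep(interval_s)

     mirror_syslog()

     A copy made every five minutes obviously can't capture the last moments before a hard crash, but it's far better than losing everything on every reboot.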
  9. I appreciate that. I'm an electrical engineer by education with extensive experience in the IT world. The SATA interface was designed to be robust and seems to work very well for years on end, even with really marginal hardware, cables, etc., in most every application EXCEPT Unraid, which is all too happy to disable even new NAS-grade drives connected with quality cables to expensive SATA cards for no good reason. I'm also repeating myself in that I don't have log data to analyze because, by default, Unraid log data is destroyed by even a reboot. But the evidence is that Unraid is well known to disable perfectly good drives for no good reason. You keep bringing up the question of what Unraid is supposed to do if there's a drive error. How about a few retries? Honestly, there's lots of evidence Unraid doesn't attempt any sort of reasonable error recovery when it encounters what could be a brief transient error reading from or writing to a drive. That's perhaps the biggest issue here. Unraid seems to just fall on its face at the slightest I/O error. As I mentioned previously, my dog might be chasing his toy and accidentally run into my Unraid server sitting on the floor. Unraid might be doing some file I/O when the server is jostled, and the result seems to be that Unraid would rather disable a drive because of my clumsy dog than retry the same operation a second later, when it would work perfectly. That's the problem the Unraid fanboys don't want to acknowledge. The whole "I've been running Unraid for years with no issues" argument is roughly the same as the "I've never had a drive fail so I don't need backups" argument. My anecdotal evidence of multiple perfectly good Unraid drives being disabled is just as valid as anyone else's anecdotal evidence of not having Unraid problems. I'd really like to hear Tom weigh in on this, as it's a genuinely serious issue. Disabling a drive is not a trivial thing. The recovery process can easily result in permanent, unrecoverable data loss, and that risk is entirely preventable if Unraid wouldn't disable perfectly good drives in the first place.
  10. And to clarify: if a disk can't be written to at all, or is experiencing easily reproducible problems, I have NO issue with it being disabled and kicked out of the array. But that's not what we're discussing here. Basically, if a drive works perfectly with everything you can throw at it after Unraid disabled it, something is wrong with Unraid.
  11. But that's the whole point. It's EXTREMELY unlikely the disk is that broken when the same disk, untouched, works perfectly. When Unraid disables a disk, the first thing I do is shut down the server and reboot it into Linux from a USB thumb drive. And, guess what, the drive Unraid disabled works just fine, passes all diagnostics, has no unrecoverable errors in the SMART data, etc. (see the sketch below for the kind of check I run). If it's a full moon and the wind blows the wrong way, Unraid seems to disable perfectly good drives. I don't know exactly why, but that's very much my experience and the experience of many others. I sincerely doubt Unraid has any sort of robust retry logic; instead it simply disables a drive at the first hint of trouble.
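
     For example, here's a small Python wrapper around smartctl (from smartmontools) like I use after booting from USB; the device node is hypothetical, so point it at the disk Unraid disabled:

     import subprocess

     def smart_report(device):
         # "smartctl -a" prints the full SMART report, including attribute
         # 199 (UDMA_CRC_Error_Count), which implicates the cable/link
         # rather than the platters, plus any logged unrecoverable errors.
         result = subprocess.run(
             ["smartctl", "-a", device],
             capture_output=True, text=True, check=False,
         )
         print(result.stdout)
         return result.returncode  # nonzero bits flag SMART concerns

     smart_report("/dev/sdb")  # hypothetical device node; adjust to your system

     You can follow that with "smartctl -t long" for an extended self-test; the drives Unraid disabled on me pass it every time.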
  12. And that goes back to my comments about Unraid having a fragile parity scheme. Let's face it: Unraid doesn't use a proven open-source fault-tolerant storage stack like the well-proven Linux mdadm, ZFS, etc. AFAIK, it uses a unique proprietary parity system built on top of an ancient 1993 file system (XFS), the rather buggy and unstable FUSE layer on top of that, and some proprietary "glue" to make it all sort of work. Yeah, it has some advantages, but it clearly has disadvantages as well, both in performance and in disabling perfectly good drives.
  13. As I've said, that's not been my experience at all; I've had the opposite experience. Every drive Unraid has disabled, and there have been several over the years, has been perfectly fine, often with only a single CRC error in the SMART data. There have been countless other posts on this forum describing similar experiences: drives that test fine and have only one or more CRC errors, yet Unraid disabled them. The common wisdom is to replace the SATA cable, replace the disk controller, replace the power supply, etc. But, in reality, the user might be better off replacing Unraid with a more fault-tolerant operating system that doesn't cry wolf, or insist the sky is falling, when there is no real problem with hardware that other NAS operating systems are perfectly happy with.
  14. The problem is it can take 24+ hours of pounding on ALL the drives in the array to rebuild a drive that was wrongly disabled, and that opens the door to another failure during the rebuild process. There is clearly something more sensitive and failure-prone in the criteria Unraid uses to disable drives.
  15. I can only repeat what I've already said: a large number of Unraid users have had perfectly good drives disabled, while this almost never happens with other NAS operating systems. Even some very experienced users here have admitted it's a side effect of how Unraid's parity works. I can't factually dispute whether there was or wasn't a write error, but it doesn't really matter if the result is a perfectly good drive being disabled: a drive that will pass 24+ hours of continuous testing and then go on to work perfectly in a TrueNAS ZFS or Open Media Vault RAID array for a year or more. And, as I've said, this is true even with the exact same SATA cables, controller, motherboard, power supply, chassis, etc. It's also true that every drive Unraid has disabled on me is ineligible for warranty replacement because it passes all diagnostics and works perfectly with everything but Unraid.