Posts posted by c3

  1. Durp...

     

    Blindly updating the docker from time to time, I never knew to do the upgrade from the GUI... Now I get

    This version of Nextcloud is not compatible with PHP 7.2.
    You are currently running 7.2.14.

    My docker containers (mariadb-nextcloud and nextcloud) are up-to-date, but I guess that does not help.

     

    Without a GUI, how can I proceed?

  2. You need to decide which factor is your primary concern: data durability (protection against data loss) or data availability. As mentioned, backups dramatically improve data durability. But if you are after data availability, you'll need to handle all the hardware factors: power supplies (as mentioned), memory (ECC and DIMM failure/sparing), cooling, and probably networking (LACP, etc.).



    SME Storage


    Some posts expressed worry that, in case unRAID encounters a bad SATA connection, a hot spare (when available) will kick in.
    That is exactly what I would want! The first priority is the health of the array.



    The sparing process can be scripted. As a subject matter expert with your vast experience, this will be straightforward; Perl and Python are available in the Nerd Tools. This may allow you to worry less while working.

    However, I am not sure it would be "hot", as the array must be stopped to reassign the drive. You could implement something like NetApp's maintenance garage function: test the drive, and then resume or fail it.
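
    Purely as an illustration, here is a minimal Python sketch of the monitoring half of such a script. It assumes smartctl (smartmontools) is installed; the device list and attributes watched are placeholders, and the actual reassignment is left out since, as noted, the array has to be stopped for that.

    #!/usr/bin/env python3
    # Sketch only: flag drives that look like candidates for sparing.
    # Assumes smartmontools is installed; the device list is a placeholder.
    import subprocess

    DEVICES = ["/dev/sdb", "/dev/sdc"]   # placeholder: your array devices
    WATCHED = {"5": "Reallocated_Sector_Ct", "197": "Current_Pending_Sector"}

    def smart_raw_values(dev):
        """Return {attribute_id: raw_value} parsed from 'smartctl -A'."""
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        values = {}
        for line in out.splitlines():
            parts = line.split()
            if parts and parts[0] in WATCHED:
                values[parts[0]] = int(parts[-1])   # RAW_VALUE is the last column
        return values

    for dev in DEVICES:
        for attr_id, raw in smart_raw_values(dev).items():
            if raw > 0:
                # A real script would notify, or begin an orderly array stop, here.
                print(f"{dev}: {WATCHED[attr_id]} = {raw}, candidate for sparing")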
  3. 17 hours ago, SSD said:

    But unfortunately we can't arrange such a study.

    That is a great idea!

     

    It would be a simple plugin to gather, anonymize, and ship the data. I already have a system which tracks millions of drives, so a baby clone of that could be used. It could be fronted with a page showing population statistics. A rough sketch of the gather/anonymize/ship step is below.

     

    I wonder how many would opt in to share drive information from unRAID servers?
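
    For illustration only, a rough Python sketch of that gather/anonymize/ship step. It assumes smartctl for the data and uses a salted SHA-256 hash to anonymize the serial number; the collection endpoint URL and salt are hypothetical placeholders.

    #!/usr/bin/env python3
    # Sketch of an opt-in drive-stats reporter: gather SMART data, anonymize, ship.
    # ENDPOINT and SALT are hypothetical placeholders.
    import hashlib, json, subprocess, urllib.request

    ENDPOINT = "https://example.com/drive-stats"   # hypothetical collector
    SALT = "per-server-random-salt"                # generated once per server

    def report(dev):
        info = subprocess.run(["smartctl", "-i", "-A", dev],
                              capture_output=True, text=True).stdout
        serial = ""
        for line in info.splitlines():
            if line.startswith("Serial Number:"):
                serial = line.split(":", 1)[1].strip()
        payload = {
            # One-way hash so the population can be tracked without exposing the drive.
            "drive_id": hashlib.sha256((SALT + serial).encode()).hexdigest(),
            "smart": info,
        }
        req = urllib.request.Request(ENDPOINT,
                                     data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    report("/dev/sdb")   # example invocation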

  4. Yeah, I too have a bunch of these and do not have any bad reports. The fans fail, but I have had them a long time, so that is to be expected. I have never had a speed problem. I stopped using them because of cost. If you are putting together lots of drives, the 24/48/60 drive units are cheaper than bolting these in, and the backplane is nicer than lots of cables. If you just need 5-10 drives, these are great.

  5. 3 hours ago, jeffreywhunter said:

     

    Hey thanks for the suggestion.  But I had already tried that again when the error showed up this time.  Yep, it worked then and cleared the log, but when it showed up again this time the error continues (and why did it come back?).  So apologies for a terse post, should have included that history, will try to do better next time.

     

    Naw, I am still thinking the config needs to be changed to remove the old directory names, and include the new ones if you want them.

     

  6. 45 minutes ago, jeffreywhunter said:

    I renamed a couple of shares and I'm seeing this in my log.  How do I correct it?  When I go into settings, the old directories are not in the list.

     

    
    Jan 26 16:40:25 HunterNAS cache_dirs: ----------------------------------------------
    Jan 26 16:40:25 HunterNAS cache_dirs: ERROR: included directory "DocArchive" does not exist.
    Jan 26 16:40:25 HunterNAS cache_dirs: ERROR: included directory "ISO\ Files" does not exist.
    Jan 26 16:40:26 HunterNAS cache_dirs: cache_dirs process ID 9360 started

     

    Thanks!

     

     

    Wow, the same thing you reported on March 1, 2017; you might want to try the same fix this year and see if it works.

  7. 13 hours ago, Lev said:

    First depends on your filesystem type. I'll assume XFS, cause IMHO that's really the only one worth picking.

     

    My advice is based on finding nothing and only on simple experimentation over the last two months on this very question. At 8TB drives, I set 1% free as a target. That's roughly 120GB IIRC in my settings that I set on shares as a minimum free space limit to enforce. In the Disk Settings as well I set 99% as the warning level and 100% as the critical level simply cause I don't want to see the color red in the GUI. I wish this setting allowed a decimal place, cause I'd set 99% as the warning and 99.5% as the critical if I could.

    Thanks for the XFS plug; as a dev I cannot really come here and promote it myself.

    1% of 8TB is 80GB; your 120GB is 1.5%, which you probably did to avoid the warning at 99%.

     

    I :+1: your request for the setting to allow decimals, as I am constantly dealing with people who claim they are 100% out of space, yet have hundreds of GB available. When you have room for thousands or millions more files of your average size, you're not out of space just because you see 100% from df. The arithmetic is sketched below.
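
    A quick Python sketch of the numbers from this thread (decimal units, as drives are sold):

    # Quick check of the numbers above (decimal units, as drives are sold).
    drive = 8 * 10**12              # 8TB in bytes
    one_percent = drive * 0.01      # 80GB, so 1% of 8TB is 80GB
    min_free = 120 * 10**9          # the 120GB minimum-free setting
    print(one_percent / 10**9)      # 80.0
    print(min_free / drive * 100)   # 1.5  (percent of the drive kept free)
    # With whole-percent thresholds, 99% is the finest warning level available;
    # allowing decimals would let 99.5% be used instead.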

  8. It depends...

     

    If the data is written and never deleted, changed, or grown, you can fill a filesystem very full. However, once you begin making changes (creating holes, i.e. fragmenting), things can become very ugly. Depending on the filesystem, you may notice it as soon as 80% full, but certainly in the 90+% range the filesystem begins doing a lot of extra work if files are being changed.

     

    Also, almost all filesystems keep improving in this area.

     

    That said, if you want to put 8TB of cold storage (what about atime, etc.?), you can go very full, 99%, just be sure it is truly cold. I would even single-thread the copy to avoid fragmentation, but that is really just a read-speed thing; a sketch is below.
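
    A minimal sketch of that single-threaded copy in Python; the source and destination paths are placeholders, and it simply walks the tree and copies one file at a time so the filesystem sees a single sequential writer:

    #!/usr/bin/env python3
    # Sketch: fill a cold-storage disk with one sequential writer to limit fragmentation.
    # SRC and DST are placeholder paths.
    import os, shutil

    SRC = "/mnt/source"
    DST = "/mnt/disk3/cold"

    for root, dirs, files in os.walk(SRC):
        rel = os.path.relpath(root, SRC)
        target_dir = os.path.join(DST, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in sorted(files):   # one file at a time, in a stable order
            shutil.copy2(os.path.join(root, name), os.path.join(target_dir, name))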

  9. On 5/5/2017 at 8:25 AM, axipher said:

    So going from 6 DATA drives down to 4 drives would be just the following:

     

    - Unassign Data Drive 1

    - Shutdown

    - Unplug Data Drive 1 and Plug in new Drive

    - Reboot

    - Assign new Data Drive 1

    - Let it do something???

     

    Repeat for Data Drive 2, 3, 4

     

    Then what about Drives 5 and 6, just unassign and unplug them and then let it rebuild the data on to the drives 1 through 4?  Or would I need to stay with 6 data drives now?

    For drives 5 and 6 (or 3 and 4):

    At the end you will need to change the config and rebuild parity.

  10. I am sorry for the delay; I was busy working on things like data checksums on the XFS roadmap. Everyone there understood why, it was just a question of timing and who would do the work.

     

    I did take time to have discussions with several disk drive manufacturers about ECC performance, which remains Reed-Solomon (RS). Two of them indicated I might get my hands on the details, under NDA, of a non-current device (like from the 2TB era). We spent a lot of time talking about the impact that adding read heads (TDMR) will have on this whole process. There was a pretty good joke about how going from helium to vacuum drives would help stabilize the head, but then how do you fly a head in a vacuum: maglev. I guess you had to be there.

     

    Since I was told RS is still being used, Lemma 4 ends with (emphasis is mine):

    "If that occurs, then Lemma 4 implies that there must have been more than e errors. We cannot correct them, but at least we have detected them. Not every combination of more than e errors can be detected, of course—some will simply result in incorrect decoding".

    This is the foundation of UDE (undetected disk errors; the paper is paywalled), which drives the need for filesystem-level checksums.

     

    UDE is a larger set of conditions, especially when talking spinning rust. You can see where TDMR will help.

     

    To improve your chances of avoiding anything that might contain my ideas, work, or decisions, use Microsoft products, but only the newer filesystems.

     

    In other news, quietly slipped into unRAID 6.3.2 was f2fs, which is very exciting (or worrisome), especially post kernel 4.10, probably the next roll of unRAID. f2fs now has SMR support (it took 2 years). But the stuff I work on takes a long time to do, and even longer to get rid of.  SMR is just the beginning, but the whole concept of decoupling the write head from the read head/track width was fundamental to driving density. Other filesystems will follow, and/or use dm-zoned. Doing so will have benefits for flash write durability. Best of luck avoiding it.

     

    But things like filesystem checksums will be needed outside the spinning-rust world, and the OP is probably grateful.

  11. If all you are looking for is weakness in ECC, there are plenty of papers on Reed-Solomon (and its replacements). It comes down to a game of: show me your ECC method, and there is a dataset whose errors will go undetected. As mentioned early on, the more recent case of this is SHA. It was widely used and thought to be collision "impossible", until 2005 when a non-brute-force method was published. Since then the non-brute-force method has been improved and weaponized. It is important to note the "impossible" was always really just impractical: the time/probability was so large people just ignored it. Some organizations, like CERN, work at a larger scale and understand that these things cannot be ignored, but must be measured at scale.

     

    If the data you are trying to protect is x bits long, protecting it with fewer bits y leaves a gap. In the case of the 512-byte sector, the ECC is forty 10-bit symbol fields, so 512 bytes is protected by 50 bytes. This is all done at the DSP level, so the bits-to-bytes and baud-rate considerations apply. In the 4K sector the ECC is 100 bytes. Yeah, more data protected with a smaller portion allocated to protection. And remember, unlike SHA, these 100 bytes have to not only detect errors but correct them.
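
    Just to put those two numbers side by side, a quick sketch of the overhead using the figures above:

    # ECC overhead from the figures above.
    ecc_512 = 40 * 10 / 8          # forty 10-bit symbols -> 50 bytes
    print(ecc_512 / 512 * 100)     # ~9.8% of a 512-byte sector spent on ECC
    ecc_4k = 100                   # bytes of ECC on a 4K (4096-byte) sector
    print(ecc_4k / 4096 * 100)     # ~2.4%: more data, proportionally less protection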

     

    This double-edged sword is played against the manufacturer's goal of quality vs. cost. Each quality improvement becomes an opportunity to reduce cost: when the spindle motor wobble/vibration is improved, the track width is narrowed. The error rate goes down with one and up with the other. The current range is 1 error in 10^14 to 10^17 bits. Some will take that as impossible.

     

    I am sorry for the old papers. I'll stop doing that. They are from when I worked for an enterprise storage manufacturer. I have since moved up the stack to OS and filesystems, where we flat out tell disk drive manufacturers there is no expectation of perfect, and work to build robust systems on imperfect hardware.

  12. While everyone wants to talk about Annual Failure Rate (AFR), there are plenty of other measures, some more important depending on how the device is used. SSDs have a lower AFR than HDDs. SSDs also have higher uncorrected error rates. This means SSDs lose data more often, but keep working. You can read the study from last year.

     

    Many years ago the BER numbers from drive manufacturers were confirmed by CERN. Pretty much everyone agrees there is no guarantee that data will be written and read without corruption. CERN found 500 errors in reading 5x10^15 bits.
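
    Worked out as a quick sketch, that observation comes to roughly one error per 10^13 bits read:

    # The CERN observation above, expressed as an error rate.
    errors = 500
    bits_read = 5 * 10**15
    print(errors / bits_read)   # 1e-13, i.e. about one bad bit per 10^13 bits read
    # Compare with the 1-in-10^14 to 1-in-10^17 range quoted by manufacturers.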

     

    "Silent data corruption is an unrecognized threat that may affect as many as 1 in 90 SATA drives." from NEC as an enterprise storage provider.

  13. Quote

    Untrue! Rather than (try to) get into the theory of error detection and correction [at the level implemented in HDD firmware] (which I doubt ANY reader of this forum is competent to do--I'm surely not!), consider this: If a collision was easy (Hell! forget easy; if a collision was even *possible*), hard drives would only be used by careless hobbyists.

     

    Unfortunately, I do see drives returning corrupt data on a regular basis. Enterprise storage manufacturers actually fight over who detects and handles this better. Enterprise storage uses drives which offer variable (512/520/524/528 bytes per sector) configurations for additional checksum storage. They don't do this just because they want you to buy more drives.
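
    For a sense of scale, a quick sketch of the extra bytes those long-sector formats set aside per sector:

    # Extra integrity bytes per sector in the long-sector formats listed above.
    for sector in (520, 524, 528):
        extra = sector - 512
        print(f"{sector}-byte sectors: {extra} extra bytes "
              f"({extra / 512 * 100:.1f}% overhead) for checksum/integrity data")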

     

    Additional sources

    https://bartsjerps.wordpress.com/2011/12/14/oracle-data-integrity/

    http://blog.fosketts.net/2014/12/19/big-disk-drives-require-data-integrity-checking/

     

  14. Look at the OP. First line: SMART errors, and the disconnect between them and data loss. For me, data corruption equals data loss.

     

    The drive is reporting uncorrected errors; these are not ECC collisions, these are known bad/lost data.

     

    The topic title talks about reallocation and corruption. I was trying to make it clear that reallocation is the best thing that happened in this whole mess. The drive has learned not to use the bad spot, but the learning was costly. Again, per the OP's first line, the reallocation is not the problem; it was the errors before the reallocation.

     

    I don't believe the SMART data was polled frequently enough to state that there were never any pending sectors. Pending is not a permanent state.

     

    I have not quoted anyone, and I would hope that people understand this means nothing is directed at anyone in particular.

  15. I have said this many times before: SMART counters do not predict errors, they only report errors which have already occurred, after the data loss.

     

    ECC has a very limited ability to detect errors, and an even more limited ability to correct them. A reported uncorrected error means the drive has data it knows is bad because it failed ECC, but which it was unable to correct. But a collision (easy with ECC) means you cannot really trust data just because it passes ECC. Read up on SHA collisions if you want to know what this means or how bad it is.

     

    Enter check-summing.

     

    BTRFS is one filesystem which has checksums for the data (ZFS is another). XFS and others have checksums only for the metadata, which is deemed more important. Filesystems like WAFL/CephFS/OneFS actually do correction in the event of a checksum error. A user-level sketch of the detection idea is below.
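
    The same idea can be applied one level up, in user space. A minimal, purely illustrative Python sketch: record a SHA-256 per file, then re-verify later and flag anything whose contents have silently changed (detection only; correction needs a second copy or parity, as the filesystems above provide). The paths in the usage comment are placeholders.

    #!/usr/bin/env python3
    # Sketch: user-level data checksumming to detect silent corruption on re-read.
    import hashlib, json, os

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build(tree, manifest):
        sums = {}
        for root, _, files in os.walk(tree):
            for name in files:
                p = os.path.join(root, name)
                sums[p] = sha256(p)
        with open(manifest, "w") as f:
            json.dump(sums, f)

    def verify(manifest):
        with open(manifest) as f:
            for path, digest in json.load(f).items():
                if sha256(path) != digest:
                    print("CHECKSUM MISMATCH:", path)   # detected, not corrected

    # Usage (placeholder paths): build once, verify on later scrubs.
    # build("/mnt/user/photos", "photos.sums"); verify("photos.sums")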

     

    The general direction is less reliable storage devices at lower costs. If this event had been a video file, the data loss would probably have been unnoticeable. And if you don't want to lose data, a higher-level protection scheme (replication (backup)/RAID/EC) needs to be in place, which is exactly what happened in this case. The data was protected by a higher-level scheme, a backup.

     

    One last thing: a reallocation does not corrupt or lose data. The reallocation happens long after the loss. Reallocation is done on the write of new data to a location previously found to be unreliable; the loss is what caused the reallocation. Prior to a reallocation you might see a pending sector, or ECC corrected and uncorrected errors.

     

     

     

     

  16. 15 minutes ago, bjp999 said:

     

    True, but if you are copying fresh data to the drive, seems it'd be somewhat unlikely you'd be making hits inside the persistent cache (depends on the block size in the persistent cache compared with unRAID read block size). Drives already have some level of caching which would likely come into play with very localized activity.

    The persistent cache is a write cache; there is no "hit" involved.

     

    The persistent cache is measured in GB; the cache you are referring to is measured in MB and might better be called the disk buffer. The way that buffer is used is different on SMR vs. PMR. Using it for writes puts the data at risk from power failures; the persistent cache does not, hence the name persistent.

  17. 1 hour ago, garycase said:

    I think that'd be a great choice.   In fact, you can even use an archive for parity if you want -- as long as you don't have a lot of simultaneous writes by different processes/users that's very unlikely to hit the shingled performance "wall".

     

    Actually, SMR drives like Seagate's 8TB Archive offer better performance than PMR for simultaneous random writes, up to the persistent cache size. The SMR drives put those writes into the persistent cache without the seek overhead of the PMR drives. This is very similar to MLC/TLC/QLC being used as an SLC cache on SSDs; for example, the Hellfire has ~17GB of cache, less than the Seagate 8TB Archive, and the WD Black went with ~5.5GB.
     

  18. It should be noted that hot adding and hot swapping are very different. Hot swapping requires much more than AHCI. unRAID does not support hot swapping, or even hot adding drives to the array.

     

    On the Intel chipsets, under AHCI, the hot swap feature must be enabled via BIOS and software.

     

    That being said, I recently hot-plugged a drive into a Linux machine (which I have done thousands of times, a la echo "- - -"), and had it crash. The machine crashed at the time of insertion, not detection, so I suspect a physical interaction. I power cycled and never looked back.

     

    Given the lack of support by unRAID, and the potential for mishap, best practice would be to avoid hot plugging. At the very least, be aware there is risk, and with unRAID there is limited upside.

     

    Regarding hot swap drives, I assume if I stopped the array and, let's say, added a new drive by just popping it into an empty slot, that there is no way for unRAID to know about that drive until a reboot? Likewise, if I wanted to replace a drive with a larger one, or even replace a failed drive, reboots would be in order for any of these scenarios?

    unRAID will not know about the drive until you configure it. But Linux will likely detect the drive after insertion; if not, you can use the echo "- - -" trick to have the HBA rescan, as sketched below.
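
    A minimal Python sketch of that rescan (same effect as the echo "- - -" one-liner), assuming root and the standard sysfs layout:

    #!/usr/bin/env python3
    # Sketch: ask every SCSI host (HBA port) to rescan for newly attached drives.
    # Equivalent to: echo "- - -" > /sys/class/scsi_host/hostN/scan  (run as root)
    import glob

    for scan_path in glob.glob("/sys/class/scsi_host/host*/scan"):
        with open(scan_path, "w") as f:
            f.write("- - -")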
