robobub


Community Answers

  1. I think you just encountered the issue I did with temperature pause/resume. From the comments, it looks like the logic also applies to pause/resume from a mover action. I made a pull request here. Making these changes in /usr/local/bin/parity.check seems to be working for my case, though I did not check the second case yet. TESTING logs from before the change are attached for context.
  2. So it turns out the issue I stated earlier (RAM that went bad) was actually something else: a bad SATA cable or PSU, after some more googling. Since replacing the RAM and verifying with memtest, some writes were submitted that were ignored until the SATA link was reset, which actually fits this scenario even more precisely than plain bad RAM. However, there is indeed a problem with not having checksumming and parity integrated together, as apandey's intuition suggested, at least on a copy-on-write filesystem (which is the default for BTRFS). There is still a solution I used to recover without a full rebuild, but it's advanced.

Some more details: while I was able to recover the particular file that had this corruption (which manifested as "Input/Output Error"), parity is still out of sync. Why? Not from the recovery process; it was already out of sync. The emulated drive did not have those BTRFS checksum errors when accessing the file, and the array's stats showed 0 in the write column for all drives while recovering (and a bunch of similar reads even though only the emulated device was mounted, which shows emulation was working). But the physical drive did have those BTRFS checksum errors.

The problem: updating or deleting the file on a copy-on-write filesystem does not touch the corrupted sectors. As I mentioned earlier, parity had the correct bits and the file was written correctly; parity was out of sync because the physical drive had incorrect bits. When a file is updated or deleted, BTRFS by default places the new data in new sectors, unless the file was created with the chattr attribute that disables COW. That means the physical sectors that used to hold the data, while now unreferenced by the BTRFS filesystem, still hold data that is out of sync with parity. So if one needed to restore from those sectors on one of the other drives, the result would be incorrect. Eventually BTRFS will reclaim that space, which can be sped up via balance or defrag, but that's not really acceptable.

The solution is to tell unRAID to parity-sync specific sectors. You can do this with dd or other low-level tools after looking up the physical locations associated with the file by inspecting the BTRFS filesystem metadata, but I strongly recommend against it unless you know what you are doing. I was able to get it working with dd, btrfs-progs, and help from #btrfs on IRC.

Some snippets of the errors: when trying to access the particular file, "Input/Output Error" is returned. I did the steps outlined below, and after re-mounting with the disk, a parity check has shown no sync errors so far.

Steps (WARNING: you will lose data unless you know what you are doing and how to navigate low-level filesystem metadata):
   1. Stop the array.
   2. (Optional) Back up /boot/config.
   3. Select "no device" for the disk with the error.
   4. Start the array in maintenance mode.
   5. Mount the device with `-o ro,loop,norecovery`.
   6. Copy the file in question to somewhere not on the disk/array. I put it on my `/tmp` RAM disk, but /boot or an unassigned device also works.
   7. Stop the array.
   8. Tools -> New Config, preserving all configurations (some settings like shutdown timeout are reset; you can also restore /boot/config).
   9. Start the array with "Parity is valid".
   10. Write 0's to the sectors with sync errors. For btrfs, you have to look up the logical offsets of each extent associated with the file and zero each extent for its full size (see the sketch below).
   11. (Optional) Start a parity check; there should be no corrections, but you can uncheck "write corrections" to see if you messed something up. There is some irony in this step, as it causes essentially the same wear/workload as rebuilding, at least for HDDs, where reads and writes likely cause similar wear.
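What step 10 amounted to, as a minimal illustrative sketch: every path, device name, and offset below is made up, and the real offset translation came from btrfs-progs and the #btrfs folks, so treat this as the shape of the procedure rather than a copy-paste recipe.

```bash
# Illustrative sketch only -- re-syncing specific stale sectors by zeroing
# them THROUGH unRAID's parity-protected md device, so parity is updated
# along with the disk. A wrong offset destroys data.

MD=/dev/md3                           # example: parity-protected device for disk3
FILE=/mnt/disk3/share/corrupted.bin   # example file with csum errors

# List the file's extents. On btrfs the "physical" column is a btrfs
# logical address, NOT a device offset; it must be translated via the
# chunk tree first, e.g.:
#   btrfs inspect-internal dump-tree -t chunk /dev/sdX1
filefrag -v "$FILE"

# With a real device byte offset and length for one extent
# (assuming 4K-aligned extents):
OFFSET=123456789   # example device offset in bytes
LENGTH=1048576     # example extent length in bytes
dd if=/dev/zero of="$MD" bs=4096 \
   seek=$((OFFSET / 4096)) count=$((LENGTH / 4096)) conv=notrunc,fsync
```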
  3. I am playing around with a USB enclosure. I knew this enclosure didn't report serial numbers or SMART correctly, and on the device page it was no surprise to see an essentially empty list under device attributes. But I can download the report manually under Self-Test and query with smartctl. Then, days later, I see notifications in the WebGUI and emails about SMART changes for the fields it's told to monitor. It even updates the GUI to report an error that I need to acknowledge, with a tooltip showing the specific data that changed, but clicking on attributes again still shows the blank table above. So clearly unRAID is actually parsing the SMART data and handling changes; it just doesn't show up in the table. I tried all the different options under SMART controller type. Also, it appears the SN does show up under device -> identity; it's just not reported normally, I suppose. The model is an ORICO 4 Bay Hard Drive Enclosure WS400U3 (B0813KKWFJ / B08D994JVT) with a JMicron JMB394 and JMS567. I actually spent a bit of time writing scripts that would query the SMART data with smartctl, recording diffs and ignoring fields like Power Cycle Count, but it seems that wasn't really necessary (a sketch of the idea is below).
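For what it's worth, a minimal sketch of what those scripts boiled down to; the device, the `-d sat` bridge type, and the state directory are assumptions for my enclosure:

```bash
#!/bin/bash
# Snapshot SMART attributes for a drive behind a USB bridge and diff
# against the previous snapshot, ignoring noisy counters.

DEV=/dev/sdX
STATE=/boot/smart-snapshots        # hypothetical state directory
mkdir -p "$STATE"

CUR="$STATE/$(basename "$DEV").cur"
PREV="$STATE/$(basename "$DEV").prev"
[ -f "$CUR" ] && mv "$CUR" "$PREV"

# -d sat tunnels ATA commands through the USB bridge; filter out fields
# that change on every boot or spin-up.
smartctl -A -d sat "$DEV" \
  | grep -vE 'Power_Cycle_Count|Power_On_Hours|Temperature' > "$CUR"

[ -f "$PREV" ] && diff -u "$PREV" "$CUR"
```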
  4. To expand on this: pointing directly to /mnt/cache instead of /mnt/user skips the unRAID shfs layer, which incurs some CPU cost. It depends on the access pattern, but my theory is that it is a fixed cost per request, e.g. many small requests, such as certain database queries, incur a greater cost. One way to try to get a handle on this is to check for high CPU usage from shfs while your dockers are accessing the cache. You can check the CPU usage using `htop` or the glances docker app and look for shfs. All normal array access goes through here too, though, so lots of access to the array (instead of the cache) can also result in high shfs usage. To see what is being requested through the shfs layer, you can run `lsof -c shfs` and see whether the usage is more cache or array (a rough sketch below).
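A rough sketch of both checks; the process name and mount points are standard unRAID, the rest is illustrative:

```bash
# Per-process CPU for shfs (the same thing htop shows if you filter on "shfs"):
ps -C shfs -o pid,%cpu,etime,cmd

# What is currently open through the user-share layer, grouped by backing
# device: /mnt/cache vs /mnt/disk* tells you cache vs array traffic.
lsof -c shfs | awk '{print $NF}' | grep -oE '^/mnt/(cache|disk[0-9]+)' \
  | sort | uniq -c | sort -rn
```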
  5. Well, it certainly sounds like a bug, as this is not what occurs when multiple small files are selected (as opposed to a folder). I've attached a log file of this scenario; let me know if you need any more information or if I am doing something incorrectly. Thanks! unbalance.log
  6. At this point it's clear you aren't reading what I write. I told you exactly how you will know: checksums, which are built into btrfs. Let me change my language, then: if the btrfs filesystem says the emulated drive from parity is good, then it's good, not "highly likely". The only reason I used that terminology was to be mathematically precise. The File Integrity plugin will also get you to essentially the same place, in a less real-time fashion and with more steps. If you read all those same bits and the checksum comes out the same, then for all practical purposes (barring an essentially impossible random hash collision) it is good. There is no guessing here; please read up on cryptographic hash functions. As for "to an extent", that was more of a commentary on knowing "how off" your physical filesystem is. You can't really know how much of the block (or file, in the plugin's case) is affected, but it's not really important how damaged your physical filesystem is if your emulated parity drive with checksums is good. Yes, integrating those pieces together like zfs does has advantages, particularly in terms of usability, automation, and ease, but all the information is still there. We're not going in circles; I've been explaining the same scenario since my first comment. I'm more than willing to continue to educate you on this as long as you want to respond. I've already laid out a practical implementation that works, given the other comments here stating when data is and is not written from the unRAID side. You don't have to use these utilities to safeguard your data or streamline the restoration process if you don't want to.
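To make this concrete: btrfs verifies a checksum on every read, so simply reading the data back off the emulated, read-only-mounted disk is an end-to-end integrity check. A minimal sketch, with the mount point as an example:

```bash
# An I/O error here, or csum messages in the kernel log, mean the
# emulated data is bad; a clean pass means parity held the correct bits.
find /mnt/disk3 -type f -exec cat {} + > /dev/null
dmesg | grep -i csum | tail
```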
  7. Indeed, we still have a communication breakdown, so I'll try to clarify again. No, you are covering a different case than the scenario I laid out. In your scenario, the parity is synchronized with the corrupted filesystem. It is numerically valid, but it does not satisfy my condition 2: that the parity has the correct bits, as in uncorrupted. I explicitly did not say whether or not the parity is invalid in my scenario, because it depends on the point of reference: there, it is synchronized with the uncorrupted filesystem and out of sync with the physical, corrupted filesystem. In my original comment's scenario, restoring from backup incurs a cost that is not desirable: the remaining offsite backup that didn't fail is annoying to access: on a tape drive, in a different geographic location, on the cloud with expensive restore costs, etc. There are various scenarios where a write can succeed on one disk and not on another. The one I had thought up was that the connection to an externally connected SAS drive could be disrupted while the connection to the parity drive stays intact. And then there is my actual scenario with RAM corruption: any particular computation could be corrupted, and the parity computation is separate from writing the block metadata or the data itself. You could also have SATA bus or CRC errors due to a bad cable. I stated these conditions up front for my scenario; if this was your big issue, you could've brought it up in the beginning. I made no claims about the likelihood of this scenario; in fact, in my first comment I stated it's probably rare. That doesn't mean rare cases should be ignored. The filesystem I mentioned already, btrfs, checksums both the data and metadata of every write. So yes, you can with very high likelihood know whether your parity is "correct" by emulating it, and also know, to an extent, how "off" your physical filesystem is. And you'll get similar results with the File Integrity plugin and other filesystems: once you notice your checksums are off on the physical disk, you can mount the emulated disk and see if the checksum is correct. This, of course, requires that you haven't run a parity check that writes corrections to parity, something those who use that plugin explicitly recommend avoiding. There are plenty of advantages of unRAID over zfs for my and others' use cases; why would I move to zfs if I can get some protection by adding additional layers on top of unRAID's parity?
  8. So it actually does not occur with mount -o remount,compress-force=zstd by itself. My disks still had btrfs compression set on specific folders, which, at least for my setup, causes this issue. After removing the per-folder settings, I do not get this error. It would still be nice to set higher compression levels and more expensive algorithms on certain folders (at least for whatever is different in my setup versus JorgeB's), but this is good enough for now, so I'll downgrade this to "Annoyance".
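For reference, assuming the per-folder setting was applied via btrfs file properties rather than `chattr +c` (the folder path is an example), this is how such a setting is inspected and cleared:

```bash
btrfs property get /mnt/disk1/some/folder compression     # inspect current value
btrfs property set /mnt/disk1/some/folder compression zstd
btrfs property set /mnt/disk1/some/folder compression ""  # clear the per-folder setting
```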
  9. Ah, I see where the communication breakdown happened. I'll modify the first requirement to be: know that a particular drive had a failure in particular files. Yes, you don't get this by default in unRAID, but tools exist that give you this, and I would certainly suggest using them. In my recent case, it was btrfs (and its fsck/associated tools) that alerted me to the specific files and folders with errors. The File Integrity plugin that I use would likely also have done so. I grabbed those files from backup. If I had known about the option I put forth in this thread, I would've at least investigated whether requirement 2 (know parity has the right bits) was satisfied by just mounting the emulated drive as true read-only and grabbing a checksum of the file (sketch below). In my case, the issue ended up being bad RAM, so I would guess it's equally likely that the error was propagated to both the parity and the drive versus just one of the two. I also knew that requirement 3 (the rest of the disk was good) was satisfied after extensive testing of the disk in another system and discovering the root cause.
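A sketch of that check for a single file; the device, mount point, and paths are examples:

```bash
# Mount the emulated disk truly read-only and compare a checksum
# against a known-good copy from backup.
mkdir -p /mnt/restore
mount -o ro,norecovery /dev/md3 /mnt/restore
sha256sum /mnt/restore/share/file.bin /mnt/backups/share/file.bin
umount /mnt/restore
```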
  10. Just thought I'd mention some unintuitive behavior for scatter: if I select multiple individual files, it attempts to place each one on a separate drive, even if there is enough space on a single drive. If there are more individual files than target drives selected, it claims there is not enough space on the target disks. E.g., I'm moving several small log files (10 MB each), and my disks have several TB free. It will place the first log file on the first target disk, the second on the next target disk, and then, if there are more individual files than disks available, it will say "PLANNING: The following items will not be transferred, because there's not enough space in the target disks". I selected scatter because I want to move them from a specific full drive to other drives.
  11. Interesting, thanks for trying it out. Just now I reproduced the "operation not supported" again on my system after recreating my array (though not reformatting all of my disks). Any thoughts as to what it could be? Are you on 6.11.5 as well? What filesystem/encryption type do you have on cache and disk1 (I am using xfs-encrypted and btrfs-encrypted, for reference)? Is disk1 the next drive to be written to based on your allocation method? Do you have space_cache=v2 in your btrfs mount options (`mount | grep btrfs`)? EDIT: Or a diagnostics would be sufficient instead of these individual questions. My diagnostics for the original 6.8.2 and for 6.11.5 are in the OP. If you are up for trying more commands, can you do a `ls -l /mnt/cache/test; ls -l /mnt/disk*/test`? Also perhaps a `tree /mnt/cache/test` and `tree /mnt/disk*/test` before and after the mkdir and the touch steps? And maybe with `mount -o remount,compress-force=zstd /mnt/disk1`? You can undo it with `mount -o remount,compress-force=none /mnt/disk1`. EDIT: Some further testing: I tried making /mnt/disk1/test/1/2/3 with compress=no, then changing compress=zstd on just /mnt/disk1/test, and it succeeds. Then I did the inverse (folders made with compress=zstd, then setting the root to compress=no), and it fails. I then made folder test/1/2 with compress=no, then went into it and made 3/4/5 after setting test/1/2 to compress=zstd, and then verified that when touching a file and getting "operation not supported", it made /mnt/cache/1/2/ (i.e. the number of folders deep that were created with compress=no) and then stopped (a condensed repro sketch is below). Any thoughts as to what it could be are welcome.
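A condensed repro sketch of that last sequence; I'm writing the per-folder compress setting as a btrfs property here (an assumption about the mechanism), with example paths, and the touch goes through the user share, which is where the error surfaces:

```bash
mkdir -p /mnt/disk1/test/1/2          # created while compression is off
btrfs property set /mnt/disk1/test/1/2 compression zstd
mkdir -p /mnt/disk1/test/1/2/3/4/5    # created after enabling zstd
touch /mnt/user/test/1/2/3/4/5/file   # fails with "Operation not supported";
                                      # only the compress=no depth of folders
                                      # appears under /mnt/cache
```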
  12. So a single-file restore with the steps I mentioned is actually possible right now, without any modifications. The required steps are to mount the disks with explicit options that do no writes: "ro,norecovery" with btrfs. xfs, reiserfs, and ext* also have similar options, and though there is some debate about whether they actually work, creating a "ro,loop" device guarantees it with any block device (sketch below). There is an additional layer if encrypted, so one would have to verify that a luks open doesn't write any data either, just in case. Thanks, this is definitely good to know and can potentially save days of rebuilding under very specific scenarios.
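A sketch of the belt-and-braces variant; device and mount point are examples:

```bash
# A read-only loop device guarantees no write can reach the underlying
# block device, whatever the filesystem does on top of it.
LOOP=$(losetup --find --show --read-only /dev/md3)
mkdir -p /mnt/restore
mount -o ro,norecovery "$LOOP" /mnt/restore   # norecovery is the btrfs spelling
# ...copy files out...
umount /mnt/restore
losetup -d "$LOOP"
```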
  13. Ok, then can you answer my original question? What data, if any, is written when just starting the array in maintenance mode by itself? This is before any filesystems are interacted with. I did find an answer to the second half, dealing with the filesystems, specifically btrfs: if it is mounted with the options "ro,norecovery", then indeed absolutely no data is written to the disk, and this step would not create a divergence between the physical disk and the parity/emulated disk. So the remaining question is about the preceding steps.
  14. Well, I guess thank you both for indulging me as much as you have so far. I'll just re-iterate from my original comment: the discussion I attempted to provoke was for an advanced use case only. I thought it was clear from all the bold warnings in my original comment that it is not something I would suggest implementing without knowing more details of how and when information gets written, of which apparently none of us here know.
  15. You implied an answer to my question, but it wasn't explicit. I think you may have misunderstood my question, based on your adding an explanation of parity/emulation, so I'll restate it. My question is whether you can prevent writes to all disks by starting in maintenance mode and mounting as read-only, or whether some amount of data is written to the emulated disk/parity through any of these steps. My understanding of parity/emulation hasn't changed since my original comment after your explanation, so if there is a specific detail I'm getting wrong, please point it out specifically. In my original comment, I acknowledged that anything that gets written while emulating would be dangerous and cause the physical disk and the parity/emulated disk to go out of sync. Now, if it is currently impossible to prevent writes when using maintenance mode and read-only mounting, my follow-up question is whether it would theoretically be possible (with OS-level changes to unRAID) to prevent those writes. I think it'd be advantageous to be able to inspect/modify the array without any writes to anything, as is the standard for dealing with issues on individual disks, even ignoring this specific scenario.
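For what it's worth, the kernel already has a block-layer primitive that such an OS-level change could build on; a hypothetical sketch (the device is an example, and whether unRAID's md/emulation layer would tolerate it is exactly the open question):

```bash
blockdev --setro /dev/sdX   # refuse all writes at the block layer,
                            # regardless of mount options
blockdev --getro /dev/sdX   # prints 1 when read-only
# ...inspect the disk...
blockdev --setrw /dev/sdX   # restore normal behavior
```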