11rcombs

Members
  • Content Count

    44

Community Reputation

2 Neutral

About 11rcombs

  • Rank
    Advanced Member

Converted

  • Gender
    Undisclosed
  1. So in the end, what was the actual bug here, and how did it manifest? I'm mostly wondering if there's anything libsqlite's doing that relies on particular implementation-defined kernel behavior that isn't actually guaranteed, in which case I'd want to report that to the sqlite devs with a description of a repro case.
  2. > Yes maybe docker related - using a bind mount of what?

     Of whatever the affected sqlite database is on, just as opposed to using a volume. Again, I don't know how those features are implemented, but there's a chance it might hit on a different code path.

     > This is going to generate an absolute mountain of data because often you have to run days or weeks before you see corruption.

     Could maybe do something clever with `logrotate` + something to kill the affected service soon after it detects db corruption? But otherwise yeah, sometimes you just need to generate several gigabytes of logs and then sift through to find the useful couple megs.
  3. If it's an sqlite bug, it's certainly not an sqlite _regression_, since it happens on Plex (we ship our own sqlite) and in Docker containers.

     Here's a question: has anyone seen this on a standalone PMS install, or only on Docker? I haven't run into this issue since updating to the affected versions, and I run PMS directly on the host. I've seen rare cases of sqlite corruption on my system in the past, but they were attributable to the cache/mover bug (which can be ruled out as the cause of this issue, since it reproduces on systems without cache drives, and even when bypassing shfs altogether by using an individual disk md mount).

     I'm not especially familiar with how Docker volumes are implemented, but has anyone tried using a bind mount instead? Could be interesting to see if that gives different results.

     Otherwise, I'd suggest having someone who sees this frequently attach strace to an affected app, watching all filesystem I/O syscalls, and see exactly what's going on when the corruption occurs. Just make sure to use `-f` to get all threads and spawned children, and direct the output to somewhere with a lot of free space.
  4. I've just tested on 6.7.3-rc4 and confirmed that I can still reproduce the mover issue I described here: Is it possible that this issue is actually just a manifestation of that one? That would be consistent with what I've seen with that issue in the past. Has anyone seen this issue on a setup with no cache drive, or a share set not to use it (or to use it exclusively)?
  5. Super, from what I've seen that sounds like it should be robust. Thanks for the detail.
  6. Any news on this? Just want to make sure it doesn't slip through the cracks.
  7. Yeah, there's no great way around that (short of mounting the filesystem with mandatory locks enabled and using those, but not a lot of software expects to need to handle that on Linux), and the only way the kernel provides to find out if a file's open at a given time is indeed iterating /proc like fuser or lsof does. You could keep doing the fuser check as a best-effort attempt to avoid clobbering writes made directly to /mnt/cache (it's not like it's getting _worse_, at least), but still lock within shfs to guarantee no writes are dropped when using /mnt/user.
  8. I'm actually only seeing test1 on the array disk after the test. It looks like the procedure goes like this:

       • move runs in_use to check if the file has open handles
       • that check having passed, move calls ioctl 0xe008e001 to invoke the move routine in shfs
       • shfs open()s the file on both /mnt/cache and /mnt/disk[…]
       • shfs calls sendfile() to instruct the kernel to copy the file's contents from cache to array with low overhead
       • shfs close()s both files
       • shfs unlink()s the original on /mnt/cache

     During the period where two copies of the file exist (starting when it's open()ed on /mnt/disk[…] and ending when it's unlink()ed from /mnt/cache), if you open() the file on /mnt/user, shfs will give you the one on /mnt/cache. This means that any write to a point in the file before the position sendfile() has copied so far (or to any position, if the write takes place after sendfile() completes) will be dropped.

     Important life lesson I learned while researching this: do not pipe `strace -fp [shfs PID]` to a file on the user filesystem.

     You could probably cut a fair bit of the cost by calling fork()+execve() yourself (using the full path to fuser and losetup, and presumably calling getxattr directly instead of shelling out to getfattr), instead of doing what I assume is a call to system(), which expensively invokes /bin/sh to in turn invoke /bin/bash to run in_use. But yeah, doing actual locking here would solve the issue properly.
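
As a rough illustration of the fork()+execve() suggestion above, the sketch below runs a program by its full path and returns its exit status, with no shell in between. The helper name `run_direct` is invented here, and the arguments the move program would actually pass to fuser or losetup are not known from this thread, so none are shown.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: execute `path` directly with `argv`.
 * Returns the child's exit status (127 if the exec itself failed),
 * or -1 if the child could not be created or did not exit normally. */
int run_direct(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image; no /bin/sh involved. */
        execve(path, argv, environ);
        _exit(127);                     /* only reached if execve failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Compared with system(), this saves one shell startup per call, and a second one if the thing being run is itself a shell script such as in_use that can be replaced by direct calls.
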
  9. No more than ever before, but that's not really saying much. As far as I can tell, this issue has always been around (both with the current C move program and the previous rsync/shell approach).
  10. I noticed that the move program does attempt to avoid moving files that are currently open, by calling the `/usr/local/sbin/in_use` shell script (which in turn calls fuser). This prevents most issues, but misses the case where the file isn't opened for writing until _after_ the check is complete; thus, the race condition.
  11. I wrote a very basic test program to verify this:

          #include <stdint.h>
          #include <unistd.h>
          #include <fcntl.h>

          int main() {
              uint32_t val = 0;
              /* O_CREAT requires a mode argument; 0644 is used throughout. */
              int fd0 = open("test1", O_RDWR | O_CREAT, 0644);
              write(fd0, &val, sizeof(val));
              close(fd0);
              while (val < 100000) {
                  uint32_t val2 = 0;
                  int fd1 = open("test1", O_RDWR | O_CREAT, 0644);
                  int fd2 = open("test2", O_RDWR | O_CREAT, 0644);
                  /* Read the counter back from test1, then close it. */
                  read(fd1, &val2, sizeof(val2));
                  close(fd1);
                  val++;
                  val2++;
                  /* Reopen test1 and write the incremented counter to both files. */
                  fd1 = open("test1", O_RDWR | O_CREAT, 0644);
                  write(fd1, &val2, sizeof(val2));
                  write(fd2, &val, sizeof(val));
                  close(fd1);
                  close(fd2);
              }
              return 0;
          }

      On a normal disk, the resulting "test1" and "test2" files will always contain identical 4-byte integers (100000). We can see this with a hex dump:

          $ cat test1 test2 | hexdump -C
          00000000  a0 86 01 00 a0 86 01 00                           |........|

      However, if the move program is run on test1 while the program runs, we can desynchronize:

          $ echo /mnt/cache/[path]/test1 | move -d 2
          $ cat test1 test2 | hexdump -C
          00000000  9f 86 01 00 a0 86 01 00                           |........|

      Note that the two files now differ by 1: one write to test1 was lost.

      Losing writes can result in a lot of unexpected behaviors; I think it might be responsible for corruption I've seen in files downloaded by Transmission, as well as in sqlite databases (I saw corruption in my Plex Media Server database last night that appears consistent with lost writes, and happened about the same time as I ran the mover script).

      I'm not sure what the best solution for this problem is, as I'm not familiar with the internals of the mover program or the shfs ioctl it uses. One route could be to do the copy to a tmp file on the destination drive, then while holding an internal lock on the file as exposed by fuse, verify that it hasn't changed since the copy started, and only then take the place of the source. Alternately, while a file is being moved, shfs could expose it to userspace such that reads come from the source file, but writes go to both the source and the destination.
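
The first route proposed above (copy to a tmp file, verify, then swap) might look roughly like the sketch below. It is only an outline under loud assumptions: the names `copy_file` and `move_if_unchanged` and the `.moving` suffix are invented, "unchanged" is approximated by comparing size and mtime, and the internal shfs lock is not modelled at all. Without that lock the verify-and-rename step is itself racy, so this shows the shape of the idea rather than a fix.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy the contents of `src` into `tmp`. Returns 0 on success, -1 on error. */
static int copy_file(const char *src, const char *tmp)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        close(in);
        return -1;
    }

    char buf[65536];
    ssize_t n;
    int rc = 0;
    while (rc == 0 && (n = read(in, buf, sizeof(buf))) > 0) {
        char *p = buf;
        while (n > 0) {                 /* handle short writes */
            ssize_t w = write(out, p, (size_t)n);
            if (w < 0) {
                rc = -1;
                break;
            }
            p += w;
            n -= w;
        }
    }
    if (rc == 0 && n < 0)
        rc = -1;                        /* read error */
    if (fsync(out) < 0)
        rc = -1;
    close(in);
    if (close(out) < 0)
        rc = -1;
    return rc;
}

/* Returns 0 if moved, 1 if the source changed during the copy (caller
 * should retry later), -1 on error. */
int move_if_unchanged(const char *src, const char *dst)
{
    char tmp[4096];
    struct stat before, after;

    if (snprintf(tmp, sizeof(tmp), "%s.moving", dst) >= (int)sizeof(tmp))
        return -1;
    if (stat(src, &before) < 0)
        return -1;
    if (copy_file(src, tmp) < 0) {
        unlink(tmp);
        return -1;
    }

    /* ASSUMPTION: in shfs, everything from here on would run while
     * holding the internal per-file lock, so no write can slip in. */
    if (stat(src, &after) < 0 ||
        after.st_size != before.st_size ||
        after.st_mtim.tv_sec != before.st_mtim.tv_sec ||
        after.st_mtim.tv_nsec != before.st_mtim.tv_nsec) {
        unlink(tmp);
        return 1;
    }
    if (rename(tmp, dst) < 0) {
        unlink(tmp);
        return -1;
    }
    return unlink(src) < 0 ? -1 : 0;
}
```

The point of the tmp file is that the destination never appears under its final name until the copy is known to be complete and current, so a reader never sees a half-copied file.
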
  12. There's a minor regression in (or shortly before) version 6.5. Icons for plugins are now searched for in [plugin path]/images/[image name], whereas they were previously searched for simply in [plugin path]/[image name]. Could PageBuilder.php be adjusted to search in both locations, for backwards-compatibility?
  13. Currently, shfs gives non-stable (they appear to be incrementing) inode numbers for stat() calls. This results in some problems in applications that assume inode changes indicate file content changes. For instance, nano warns about the file having changed when saving if its inode has changed, and git considers an inode change to be a tree change for some purposes, which can prevent some operations from succeeding. Passing through the inodes from the underlying disk does potentially create a problem of multiple files in /mnt/user sharing the same inode. This could be worked around by returning `underlying_inode * nb_data_drives + drive_idx` (which is potentially problematic if the drive is near the 64-bit limit on inodes, but that seems extremely unlikely, and you'd have the same issue with incrementing-at-read-time inodes anyway).
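
The mapping suggested above can be written down in a few lines. This is a sketch, not shfs code: the function name `stable_inode` and the explicit overflow check are additions for illustration, while `underlying_inode`, `nb_data_drives` and `drive_idx` are the names used in the post.

```c
#include <stdbool.h>
#include <stdint.h>

/* Map (drive, inode on that drive) to one inode number that is stable
 * across stat() calls and distinct for every drive. Returns false if
 * the arguments are invalid or the result would not fit in 64 bits. */
bool stable_inode(uint64_t underlying_inode, uint64_t nb_data_drives,
                  uint64_t drive_idx, uint64_t *out)
{
    if (nb_data_drives == 0 || drive_idx >= nb_data_drives)
        return false;

    /* Reject values where inode * drives + idx would exceed UINT64_MAX. */
    if (underlying_inode > (UINT64_MAX - drive_idx) / nb_data_drives)
        return false;

    *out = underlying_inode * nb_data_drives + drive_idx;
    return true;
}
```

Because `drive_idx` is always smaller than `nb_data_drives`, two files only map to the same number if they share both the drive and the underlying inode. The mapping does change if the number of data drives changes.
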