Spitko

Everything posted by Spitko

  1. If you're going zfs -> zfs, look at znapzend. It lets you both automate your main pool's snapshots and send additional snapshots to a remote pool, so you get all the benefits of zfs snapshotting on the remote pool as well. There's no GUI for the setup, but it's not too hard to get going.
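     A plan along these lines is roughly what I mean (dataset names and retention values are placeholders; check `znapzendzetup --help` for the exact syntax on your version):
     ```
     # define a snapshot/retention plan for the source dataset, plus a copy on a second pool
     znapzendzetup create --recursive \
       SRC '7d=>1h,30d=>4h,90d=>1d' fastpool/data \
       DST:a '7d=>1h,30d=>4h,90d=>1d,1y=>1w' rustpool/backup/data

     # verify the plan, then let the znapzend daemon pick it up
     znapzendzetup list
     ```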
  2. Going to tentatively mark this as fixed; I did some jiggery-pokery with my zfs backups so they don't share pool names. I think this exposes how fragile the exclusive access system is, though, and there should probably be a way to force it on, especially now that you can have lots of tiny little purpose-built pools. I don't want to worry that one of them might have a folder that shares a name with an unraid share, both for performance reasons and because suddenly having FUSE start smudging files together is a bug factory.
  3. This is a weird situation I've been tracking for a while, but I've now caught it twice, so I'm convinced it's not a fluke. My setup:
     Pool A: a pure NVMe pool with 4x4TB drives in raid0. This has one unraid share on it. The share is configured with the zfs pool as the primary storage and the secondary storage set to none. I use this share quite a bit for various tasks.
     Pool B: 2x16TB spinning rust. This pool contains no unraid shares and is not exposed to any docker containers. znapzend sends snapshots to it at midnight every night. I've confirmed this behavior with the znapzend log file and by looking at snapshot data in zfs.
     One would expect pool B to only spin up once a night, at midnight, for the snapshots to run. However, access to pool A will sometimes spin pool B up. Normally I just randomly notice it spun up, but I've caught it doing it live twice now, doing the same specific thing. One of the jobs that writes to this disk occasionally produces invalid filenames (containing | characters), so every now and then I clean it up by opening the terminal and running this exact sequence of commands:
     root@jibril:~# cd /mnt/user/fast/hf/v3/
     root@jibril:/mnt/user/fast/hf/v3# find -name '*|*' -print0 | sed -ze "p;s/|/-/g" | xargs -0 -n2 mv
     The "fast" mount here is the share on pool A. Yet somehow this will sometimes spin up both disks in pool B. It does not spin up array disks. I can't find a rational explanation for this, and can only assume at this point that there's some underlying ZFS tomfoolery afoot.
     To get in front of the common problem: no, I'm not running zfs master or any other ZFS-related plugins besides znapzend. I had suspected it might somehow be at fault, so I killed the daemon before running the test the last time, and it still spun up pool B.
     I was about to post diagnostics but ended up finding the root of the problem in the process: unraid has somehow determined that the backup pool contains files that the primary pool also contains, and has linked them. This is... extremely non-ideal. How do I prevent unraid from doing this? The share is set up with pool A as the primary storage with no secondary, so I don't understand what logic unraid is using to snoop pool B and make this determination.
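     For anyone else debugging something similar, a quick way to see which pools or disks unraid will merge into a given share is to check for top-level folders with the same name as the share ("fast" here is just my share name; swap in your own):
     ```
     # shfs merges any top-level folder matching the share name into /mnt/user/<share>,
     # so every path listed here is a pool/disk unraid considers part of the share
     ls -d /mnt/*/fast
     ```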
  4. mDNS info won't cross through bridge mode even with forwarded ports, at least in my testing. You can still grab logs and flash devices, but they always show offline and won't show up for adoption (unless you use the ping mode, which also prevents adoption discovery). There's a PR now to accept "false" in this case, but it hasn't been accepted yet.
  5. The ESPHome docker template has `ESPHOME_DASHBOARD_USE_PING` defined by default. This should really be removed, as *any* value counts as true here, meaning everyone has this flag forced on unless they delete the entry from their install. The default network mode also needs to be host.
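     For reference, running the dashboard with host networking and without that variable would look something like this (the appdata path is just an example, and recent `esphome/esphome` images start the dashboard by default):
     ```
     docker run --rm -it --net=host \
       -v /mnt/user/appdata/esphome:/config \
       esphome/esphome
     ```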
  6. Currently, even if a share is cache only, exporting it to smb will go through /mnt/user and therefore shfs, and shfs has some fairly well documented performance limitations in certain applications. I have a narrow case where this was problematic, so I built a ZFS pool specifically for these files, and it's been working great. However, I had to jump through some hoops to get there, and it seems like something unraid could streamline relatively easily. The proposal would be to simply bypass shfs if the share is targeting volumes which don't need it. In my use case I'm targeting a ZFS pool, though ideally it would either detect all cases where shfs isn't in play, or simply give the user a checkbox to control this behavior. This lets users more easily establish high-performance pools for the narrow cases where they're needed, while still taking advantage of shfs for the majority use case. Currently, I've found workarounds via manually adding a share to the smb extras conf, but it shouldn't be necessary to do it this way.
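     For anyone wanting the workaround in the meantime, the SMB extras entry is roughly this (pool, share, and user names are placeholders for whatever you're exporting):
     ```
     [fastshare]
         path = /mnt/fastpool/fastshare
         browseable = yes
         writeable = yes
         valid users = someuser
         create mask = 0664
         directory mask = 0775
     ```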
  7. I hit an interesting bug. I've not had a chance to re-test it, but it should be easy to repro:
     1) Create a ZFS volume.
     2) Create a share for it. Set the share to Cache: Only.
     3) Rename the ZFS volume (done in the disk manager).
     The share is now broken. You can still read from it just fine, but it will report 0 bytes free and new files can't be copied to it. The workaround is easy; just edit the share and re-assign the zfs volume to it, but Unraid's UI should probably handle this automatically.
  8. Some notes on getting a Corsair Commander Pro working. I needed to go this route since my Asus mobo didn't offer any PWM control out of the box.
     1) The fan control plugin does not properly detect it; you'll get errors when you try to get the fan list. You can work around this by manually specifying it. For example, in my case I have the PWM controller set to corsaircpro - pwm1 (this is detected properly) and the PWM fan manually set to /sys/class/hwmon/hwmon2/fan1_input. Change the fan and hwmon numbers to match your device.
     2) Trying to specify a fan from the cpro in the system temp program completely breaks the interface, and indeed the entire sensors command. To get it working again, modify sensors.conf and change corsairpro-hid-3-2 (or similar) to just be corsairpro-*. Ref this github issue: https://github.com/lm-sensors/lm-sensors/issues/369
     As a note, this needs to be re-applied every reboot; I presume there's a way to fix this properly, but I'm unsure.
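     Concretely, the sensors.conf change looks like this (the exact chip name may differ on your system; copy whatever `sensors` reports and just wildcard the bus suffix, leaving any label lines that follow untouched):
     ```
     # was: chip "corsairpro-hid-3-2"  -- breaks when the hid bus number changes between boots
     chip "corsairpro-*"
     ```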
  9. As a datapoint for future generations, I use the 8x1 version of this dock (MB998IP-B). The build quality and physical design are identical, but there are some minor differences in the back:
     - Two MiniSAS HD plugs instead of SATA.
     - Dual fans.
     - SATA power plugs instead of molex.
     I only use it for SATA SSDs, but haven't had any issues with heat in light-duty use. It's worth noting that the bays here are NOT tall enough to fit most larger 2.5" HDDs; at least my 1TB doesn't fit. I believe they officially specify a 7mm limit, but I'd have to double check. Durability has been great through a bunch of swaps, and the fans are quiet enough that I can't hear them over the other fans in my system. The bay latches are all metal and haven't shown any signs of wear. Honestly my only complaint is that it's too expensive, but there's just not a lot of competition.
  10. I moved the 10g card to the 4x link on the chipset, and that appears to have solved the issue. Hopefully that helps someone else should they stumble across this thread. X399 has a single dedicated 4x outbound link, so odds are that if your motherboard has exactly one 4x slot just hanging around, it's probably that one. In the case of the ROG Zenith Extreme, it's slot 3.
  11. I was afraid of that. However this still leaves me with the crux of the problem: How can I test the system with the NIC in, without dropping more disks on the array?
  12. Ok, this one's spooky. I'm in the middle of a network upgrade, and one of the tasks was to pop a 10g network card into the unraid box as I slowly move towards multi-gig. This went mostly fine; I pulled the slot 1 GPU out and popped the ROG Areion my board came with in its place. I don't really use unraid in GUI mode, so this is (mostly) fine. (I'll deal with the GPU passthrough issues later.)
      About an hour after installation, one of my drives had a few read errors and went into emulated mode. Well... that sucks, but it wasn't TOO surprising, given it was the oldest drive in the array and a holdover from initializing the box. A bit early for an Ironwolf, but still under warranty. Except... no errors? All smart tests passed, no bad sectors... every scan I could throw at it passed. I couldn't find (read: forgot to order) a cold spare and it was going to take 2 weeks for the one I ordered to arrive, so I figured I might as well rebuild it and see what happens. Rebuild goes fine. No issues. Huh.
      The next day I do a reboot while chasing down the exciting new GPU passthrough issues and suddenly a *different* drive is disabled. But... the logs don't in any way indicate why. Even the alert just says "disabled", no errors logged. All tests green as before. Diagnostics from immediately after I got the alert are attached, but I don't even see the disable event in the log, so maybe it happened during shutdown and I didn't get the alert until reboot?
      At this point I'm fairly convinced this NIC is somehow causing the errors, but I don't have any mental model for *how*. The NIC itself was working fine, the second drive is now rebuilding without any incident, and there are no weird errors in the log. I've pulled the card for now because it's clearly dangerous, but this leaves me with two core problems:
      1) I don't know how to validate system stability with the card in.
      2) I don't know how to actually test the card without constantly putting the array into rebuild mode.
      How should/can I proceed from here? I'd like to eventually have the card in service, but the current behavior just isn't tenable.
  13. I see this as well on 6.9.2. Not sure if it's new, I was inspecting the logs after switching to a 10g network card and noticed it.
  14. This usually happens while compiling code. The behavior is as follows:
      - First, the compile will hang at a certain spot. The system is still responsive, but CL processes just spin forever.
      - In task manager, disk usage for the C drive is now stuck at 100%, though actual IO is generally fairly low at this point.
      - Around this time, Windows will start complaining in the event log: "Reset to device, \Device\RaidPort2, was issued." This happens frequently.
      - Eventually, Visual Studio itself hangs, and the system continues to become less and less responsive until it requires a manual restart. You can't kill the stuck CL processes, so something's likely hung deep in the driver.
      The VM has three disks:
      <disk type='file' device='disk'>
        <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
        <source file='/mnt/user/vms/Windows 10/vdisk1.img' index='2'/>
        <backingStore/>
        <target dev='hdc' bus='scsi'/>
        <boot order='1'/>
        <alias name='scsi0-0-0-2'/>
        <address type='drive' controller='0' bus='0' target='0' unit='2'/>
      </disk>
      <disk type='block' device='disk'>
        <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
        <source dev='/dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_S3Z8NB0M305963H'/>
        <target dev='hdd' bus='scsi'/>
        <address type='drive' controller='0' bus='0' target='0' unit='3'/>
      </disk>
      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio'/>
        <source>
          <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
        </source>
        <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
      </hostdev>
      The compile is happening on the NVMe drive that's passed through at the bottom, but the error points to one of the above drives. I would suspect the first entry (the OS is installed on this one) given the error and cause, as the middle drive is entirely idle. Ideas? For now I've copied the image to a raw NVMe device, which appears to work around the problem, but this is obviously less than ideal from a scaling perspective. As a starting point, I ran memtest overnight and it came back clear.
      Hardware:
      AMD Threadripper 1950x
      Asus ROG Zenith Extreme
      LSI Logic SAS 9207-8i
      Nothing in the unraid logs (VM or system) corresponds to the event.
  15. Thirding this; my sensors suddenly all vanished (Beta 25 if that helps). Manually editing sensors.conf and removing an extra dash from the jc42 sensor fixed this for me after my sensors all dropped off the face of the earth. Upon further sleuthing, this doesn't ACTUALLY fix the issue though; the sensors command whines and the labels don't work properly if you have overlaps (ie, "temp1" isn't properly pinned to jc42). It seems the "correct" fix is to add a bus statement before the chip, eg:
      chip "k10temp-pci-00c3"
        label "temp2" "CPU Temp"
      bus "i2c-0" "SMBus adapter"
      chip "jc42-i2c-0-19"
        label "temp1" "MB Temp"
      This fixes the error, though I'm still having trouble getting the label command to work properly; time to hit the man pages I guess. HOWEVER, it's worth noting that jc42 sensors are smbus memory sensors, so this has been mostly a goose chase; mobo temp can't yet be read, as we're still missing a driver in unraid for the ITE IT8665E. That said, selecting jc42 forever breaks the plugin since it won't remove the line from sensors.conf, and once selected it writes a bad line that breaks the sensors command. The current fix is to remove the line manually from sensors.conf; the plugin should either handle smbus sensors properly or, as a hotfix, just blacklist jc42 and handle the failure mode better.
  16. Two bugs with autofan have cropped up since updating to the latest version and unraid 6.9 beta:
      1) It does not respect min PWM, and will shut fans down when below the temp range. This is unexpected behavior and should be either toggleable or at least clearly worded, as many servers use the same fans for general system airflow.
      2) It seems to incorrectly detect the highest disk temp. As an example, from the logs:
      Jul 2 20:40:56 jibril autofan: Highest disk temp is 36C, adjusting fan speed from: OFF (0% @ 0rpm) to: 136 (53% @ 0rpm)
      Jul 2 20:42:03 jibril autofan: Highest disk temp is 35C, adjusting fan speed from: 136 (53% @ 4021rpm) to: OFF (0% @ 3448rpm)
      Jul 2 20:45:13 jibril autofan: Highest disk temp is 36C, adjusting fan speed from: OFF (0% @ 0rpm) to: 136 (53% @ 0rpm)
      Jul 2 20:46:20 jibril autofan: Highest disk temp is 35C, adjusting fan speed from: 136 (53% @ 4000rpm) to: OFF (0% @ 3579rpm)
      Jul 2 20:49:30 jibril autofan: Highest disk temp is 36C, adjusting fan speed from: OFF (0% @ 0rpm) to: 136 (53% @ 0rpm)
      Jul 2 20:50:37 jibril autofan: Highest disk temp is 35C, adjusting fan speed from: 136 (53% @ 4043rpm) to: OFF (0% @ 3448rpm)
      Jul 2 20:52:45 jibril autofan: Highest disk temp is 36C, adjusting fan speed from: OFF (0% @ 0rpm) to: 136 (53% @ 0rpm)
      Jul 2 20:53:52 jibril autofan: Highest disk temp is 35C, adjusting fan speed from: 136 (53% @ 4043rpm) to: OFF (0% @
      Meanwhile I'm getting high temp alarms on two drives; the coldest spinning drive is 44C. Of possibly interesting note: disk 1 was spun down at the time so unraid didn't show its temp, but I noticed later that the logs do seem to roughly match that drive. It's also the only non-SAS drive in my array.
  17. Your issue sounds unrelated to mine... you should probably open a new thread.
  18. I've had to turn caching off on all shares as anything writing to cache for more than a few moments brings the whole server to a crawl. Writing lots of smaller new files does seem to be more stable than large file writes though; not quite sure if that's a useful datapoint yet. Also if it helps, the SSDs are both ADATA SU635 (ASU635SS-240GQ-R). I knew going in that QLC drives were fairly flawed, but I don't think anything I'm doing here should be hitting the limitations of the tech. The drives are rated for 520/450MB/s R/W. While some people do report lower speeds, they're still an order of magnitude above what I'm getting here.
  19. I've seen a few threads on slow cache, but the performance here isn't "oh, that could be better"; it's typically worse than just writing straight to disk. As a test, I ran `dd if=/dev/zero of=file.test bs=1024k count=8k` and, well, a picture is worth a thousand words:
      853789+0 records in
      853789+0 records out
      6994239488 bytes (7.0 GB, 6.5 GiB) copied, 298.39 s, 23.4 MB/s
      btrfs filesystem df:
      Data, RAID1: total=84.00GiB, used=81.58GiB
      System, single: total=4.00MiB, used=16.00KiB
      Metadata, single: total=1.01GiB, used=156.73MiB
      GlobalReserve, single: total=84.41MiB, used=0.00B
      No balance found on '/mnt/cache'
      I have the dynamix trim plugin installed. I also tried manually trimming /mnt/cache right before running the test, just to make sure it didn't error and was really running. Pool setup: Unraid 6.7.2, no useful log output.
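      If anyone wants to reproduce this with the page cache out of the picture, a variant along these lines should be a fairer test (the path is just an example; note /dev/zero compresses trivially if compression is enabled on the pool):
      ```
      # O_DIRECT write test straight against the cache pool
      dd if=/dev/zero of=/mnt/cache/file.test bs=1M count=8192 oflag=direct status=progress
      ```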
  20. I'm a programmer, so I'm all too familiar with unix timestamps. The files were created with this docker image, by mounting a remote SMB share and copying the files from the old server to the local one. It looks like creating files locally doesn't reproduce this bug; it's likely specific to SMB->local transfers.
  21. Update! I think I found the issue; it might be a mixture of a Samba/SMB bug and possibly an Unraid bug, or alternatively a bug in Krusader (as shipped by binhex). I did a bit more digging and statted two (different) files, one that worked and one that didn't.
      File: Bad.mp4
      Size: 120134495  Blocks: 234640  IO Block: 4096  regular file
      Device: 21h/33d  Inode: 4157  Links: 1
      Access: (0666/-rw-rw-rw-)  Uid: ( 99/ nobody)  Gid: ( 100/ users)
      Access: 1969-12-31 15:59:59.000000000 -0800
      Modify: 2019-08-16 15:28:00.169449270 -0700
      Change: 2019-08-27 19:07:52.107529449 -0700
      Birth: -
      File: Good.mp4
      Size: 182610839  Blocks: 356664  IO Block: 4096  regular file
      Device: 21h/33d  Inode: 3967  Links: 1
      Access: (0666/-rw-rw-rw-)  Uid: ( 99/ nobody)  Gid: ( 100/ users)
      Access: 2019-08-27 19:21:43.859255029 -0700
      Modify: 2019-08-27 19:52:42.613932984 -0700
      Change: 2019-08-27 19:52:42.613175754 -0700
      Birth: -
      The only real difference I can see here is that the bad file has an invalid/missing access time. So, as a test, I did touch Bad.mp4 and suddenly it works fine. As a note, opening the file in a media player doesn't seem to update the access time; I assume this is a (reasonable) optimization, meaning the only way to unstick the bad files is to write a script to touch them all. Which might be a fine workaround, but before I do that, does anyone want to dig deeper here, or have a slightly less brute-force solution?
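      If I do end up going the brute-force route, something along these lines should do it (the share path is just an example; it only refreshes the access time on files whose atime predates 1980):
      ```
      find /mnt/user/media -type f ! -newerat "1980-01-01" -exec touch -a {} +
      ```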
  22. Nope, permissions are identical between the files in question. Names as well; I even tried renaming the old folder, making a new one with the same name as the old, and everything worked fine in that one (ie, my test program could now create, read, and write files in the new folder). If I rename the old folder back, the problem recurs. I checked permissions with ls -n as well, and ensured there weren't just two groups named "users" or something; the permissions are absolutely identical unless there's some additional bit/flag I'm not aware of that doesn't show in ls -ln. It's honestly the weirdest dang thing.
      Also, to be clear: the folder exists in /mnt/user, which is where I copied the files in docker and where I'm checking permissions from. I assumed copying straight to the diskN paths would be a bad idea (also because I copied more than a drive's worth of data).
      Also, as a reminder, the weirdest (by far) part is that I can manage the files from Explorer just fine. I can open them in Media Player Classic without issue, I can rename and delete them, etc. But VLC will reliably give a 'VLC is unable to open the MRL' error. BUT, if I take the exact same file and copy it to a different folder on the same share with Explorer, it works fine! This is all from the same windows machine, and at no point am I getting UACed or asked to do anything additional.
      New file permissions:
      -rw-rw-rw- 1 nobody users 605445386 Aug 7 23:41 Test.mp4
      Old file permissions:
      -rw-rw-rw- 1 nobody users 605445386 Aug 7 23:41 Test.mp4
      I can also take this new file, rename it, and copy it back to the old folder, and it still plays fine. And to rule out it being a VLC-specific quirk, I get similar results in GIMP, so it does seem to be oriented around programs likely using cross-platform toolkits like GTK. Also worth noting that VLC can play the same file from the old NAS (Synology) just fine, using the same mechanisms (mapped network drive over SMB).
      Edit: I also found another thread from 2018 with the same problem (by searching the exact VLC error, a cryptic "filesystem error: read error: No error"), no solution though.
  23. Further testing on this issue:
      1) Found the "Docker safe new perms" tool via a plugin; however, running this didn't yield any different results.
      2) Tried making a new directory and pointing my script at that; it was able to write files fine. Files written this way can be read just fine.
      3) Permissions between the "bad" files and the good ones appear identical, including user and group.
      4) Also tried logging in with a user account. This creates files under the correct user (using the "users" group), but those files can still be read as nobody just fine.
      I'm very confused now. I'd also like to get this sorted out before my trial runs out if possible (1 day left), so if anyone has any ideas on what to check or try, please let me know.
  24. I couldn't find a "docker safe" anything under Tools, but there was a "New Permissions" tool which looks like the right thing, perhaps? unraidip/Tools/NewPerms
      Ran it on all disks against one of the shares I've been testing with. No change in behavior.
      Edit: Also, to follow up on permissions, here's one of the affected files:
      -rw-rw-rw- 1 nobody users 976202725 Aug 17 00:50 test.mp4
  25. Ok, this is the weirdest thing I've seen, but it's the last quirk preventing me from retiring my synology box, so here we go. I migrated all the data over to the new shares via a docker image. This seemed to work fine, and I can view the files in explorer and add/remove/edit/open them just fine. HOWEVER, certain programs are unable to read files. They can traverse directories, but will either be unable to see any files, or give obtuse errors when trying to display or open them. I've confirmed this behavior with both Gimp and VLC under windows. For Gimp, the file open dialogue errors out when files are present in the path, and for VLC it fails to open the file (but the file open widget works just fine).
      The shares themselves are pretty straightforward. They're default-configured public shares, and on the windows machines I've tried both mapping them to network drives and going via UNC paths. I've also confirmed this behavior on two machines with completely different configurations.
      I was able to repro this in some software I wrote, and the behavior is similar; I can check if a directory exists and it will return expected results, but checking if a file exists will return false regardless, and attempting to stat or open a file will produce "not found" errors. Running as admin doesn't appear to affect the behavior. The synology box, however, does not exhibit these problems. Any ideas on what might be causing this?
      EDIT: Partial solution/workaround here; may require further investigation to prevent this from happening to others.