Posts posted by sorano

  1. 2 hours ago, BVD said:

    I'd re-read the instructions - you have 4 lines for your VFs, doing two separate things, where you should only have one line per physical port.

     

    Sorry, I'm mobile right now or I'd type out more, but it looks like you might've mixed both methods for some reason.

     

    Well, in the last post you told me I need to VFIO-bind the NIC before configuring VFs, right?

    So after that I added the first two lines:

    # VFIO bind Intel X520 before creating Virtual Functions
    bash /boot/config/vfio-pci-bind.sh 8086:154d 0000:06:00.0
    bash /boot/config/vfio-pci-bind.sh 8086:154d 0000:06:00.1


    Then the next two lines create the actual VFs:

    # Create Intel X520 Virtual Functions
    echo 4 > /sys/bus/pci/devices/0000:06:00.0/sriov_numvfs
    echo 4 > /sys/bus/pci/devices/0000:06:00.1/sriov_numvfs

    (This is the part that does not work, since sriov_numvfs is not visible under /sys/bus/pci/devices/0000:06:00.0/, so the echo does nothing.)
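
    For reference, a quick sanity check along these lines (assuming the card is still at 0000:06:00.0 / 0000:06:00.1) should show whether the kernel even exposes the SR-IOV capability for the device:

    # Does lspci see the SR-IOV capability on the port?
    lspci -vvv -s 06:00.0 | grep -A 3 'SR-IOV'
    # If the capability is wired up in sysfs, this file should exist and report the maximum VF count
    cat /sys/bus/pci/devices/0000:06:00.0/sriov_totalvfs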

     

  2. 3 hours ago, BVD said:

    You have to bind the 520 series prior to using VFs, as it's partitioning the entire device - take a look at the chipset-specific section, which goes over that in a bit more detail, but the guide covers it in pretty decent detail; just be sure to follow it through and you'll be set 👍

     

    Damn, I really hoped that would have been it, but it's still not working.

     

    Same as before: I cannot even create VFs even though the X520 is VFIO-bound:

    [screenshot]

     

    [screenshot]

     

    sriov_numvfs still does not exist for the card:

    [screenshot]

     

     

  3. After spending way too many hours trying to get a Mellanox Virtual Function running under Windows, I finally gave up and bought an Intel X520-DA2.

    [screenshot]

     

    New card, new problems.

    It's like the kernel is ignoring the SR-IOV functions of the card for some reason.

     

    [screenshot]

     

    I checked in the BIOS but could not find anything related to activating SR-IOV for the card either.

     

    Since sriov_numvfs does not exist for the device, I cannot get any VFs.
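
    One generic thing worth checking (just a hedged sanity check, not Unraid-specific) is whether the IOMMU is actually enabled, since passing VFs through to VMs depends on it:

    # Confirm VT-d / the IOMMU initialised at boot
    dmesg | grep -i -e DMAR -e IOMMU | head
    # And confirm the kernel was booted with intel_iommu=on (or equivalent)
    cat /proc/cmdline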

     

    [screenshot]

  4. 10 minutes ago, ich777 said:

    Isn't this an InfiniBand card or something like that? Unraid doesn't support InfiniBand in general, I think...

     

    You can always compile them on your own with the Custom Build Mode.

     

    EDIT: Which card do you have, exactly?

     

    I have a ConnectX-3 Pro.

     

    But I'm having issues with SR-IOV to a Windows VM as in this thread: 

     

     

    I was discussing my issue with the moderators in the VFIO Discord and one of them said I should try newer Mellanox drivers.

  5. @ich777

     

    If you have the time and the possibility, I would like to request the Mellanox 4.9-2.2.4.0 LTS driver:

    https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed/#tabs-2

     

    Release notes/Installation:

    https://docs.mellanox.com/display/MLNXENv492240/Release+Notes

     

    Linux Driver Compatibility Matrix:

    https://www.mellanox.com/support/mlnx-ofed-matrix

     

    Currently, Unraid 6.9.1 ships with:

    kernel: mlx4_core: Mellanox ConnectX core driver v4.0-0
    modinfo mlx4_en
    filename:       /lib/modules/5.10.21-Unraid/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko.xz
    version:        4.0-0
    license:        Dual BSD/GPL
    description:    Mellanox ConnectX HCA Ethernet driver
    author:         Liran Liss, Yevgeny Petrilin
    srcversion:     EE160E8DB5FA601160D41B2
    depends:        mlx4_core
    retpoline:      Y
    intree:         Y
    name:           mlx4_en
    vermagic:       5.10.21-Unraid SMP mod_unload 
    parm:           udp_rss:Enable RSS for incoming UDP traffic or disabled (0) (uint)
    parm:           pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
    parm:           pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
    parm:           inline_thold:Threshold for using inline data (range: 17-104, default: 104) (uint)

     

  6. On 3/29/2021 at 7:06 PM, BVD said:

    What version of Windows 10? It's like I said with Mellanox drivers, they're all kinda tied up in licensing crap...

     

    My suggestion would be to use something like 7zip to manually extract an older version of the drivers from the executable and then attempt manual installation.

     

    For Windows and virtual functions, Intel (and some Chelsio) are really the only surefire way to ensure compatibility, thanks to the crap NVIDIA and Mellanox have pulled with their drivers.

    It's Windows 10 20H2.

     

    But yeah, it seems to be related to that driver mess, just like you wrote. While troubleshooting I added the VF to an Ubuntu VM that just boots off the Ubuntu 20.10 live ISO, and the VF worked straight away.

     

    Guess I'm gonna be finding and testing a lot of drivers; if you have any recommendations, they would be greatly appreciated.

     

    The card is updated with the latest official firmware: 2.42.5000.

  7. Seeing this I decided to put my Mellanox ConnectX3-Pro into my UnRAID server.

     

    I created 4 VFs following @ConnectivIT's guide using Option 1 (I didn't set any static MACs yet). Then I added one of the VFs to my Win10 gaming workstation.

    However, Windows Device Manager doesn't want to start the device. There is a Mellanox ConnectX-3 VPI (MT041900) Virtual Network Adapter visible, but it is stopped with code 43. I tried installing the 5.50.53000 WinOF driver from the NVIDIA site, but still no go.
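
    In case it matters, giving a VF a static MAC (which, as mentioned above, I haven't done yet) would look roughly like this - the interface name and MAC address are just placeholders:

    # Set a fixed MAC on VF 0 of the parent interface, then confirm it
    ip link set eth2 vf 0 mac aa:bb:cc:dd:ee:ff
    ip link show eth2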

     

    Any suggestion for a solution would be greatly appreciated!

  8. On 3/15/2021 at 1:12 PM, ThatDude said:

    I'm experiencing extremely high CPU activity in my Windows 10 VM since upgrading from 6.8.x to 6.9.1 - is anyone else getting this?

     

    The VM is a BlueIris CCTV server and uses a passed-through NV710 PCIe GPU for decoding. I've checked that GPU decoding is working, and it is. But even when I stop the BlueIris service, the base VM, when idle, is consuming 100% of the 3 CPUs assigned to it. Task Manager keeps highlighting system interrupts, so I implemented the two known workarounds for those, but it hasn't made a tangible difference.

     

    Yeah, I'm also seeing a LOT higher CPU usage on my gaming VM since the update.

     

    Upgraded from 6.8.3 to 6.9.1. It was fine at the start, but after a while I started to notice that mouse movement would freeze and stutter; checking the load on the VM's isolated cores in the Unraid web UI showed 99-100% on all of them.

     

    Since then I've been trying loads of edits to get it better, including updating the XML to use Q35-5.1 since that is supposed to be a lot better with VFIO. Still haven't come up with anything really good though. 😒
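
    For reference, this is roughly how I check which machine type a VM definition ended up with (the VM name is just an example):

    # Show the machine type currently set in the VM's libvirt XML
    virsh dumpxml "Windows 10" | grep -i machine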

  9. 3 hours ago, John_M said:

     

    I'd try all the available options in the linked article before giving up. Have you tried option 2 (btrfs restore)?

     

    I had been running memtest86 v8.4 for a couple of hours to rule out bad RAM, but it was not showing any errors.

    So I decided it was worth trying btrfs restore; unfortunately the outcome was pretty similar to mounting read-only and copying. Some files restored fine, but the big img files that are of interest would just keep looping.
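
    For anyone curious, a restore attempt like mine boils down to something like this (the target path is just an example):

    # Copy whatever btrfs can still read from the damaged device to another location
    btrfs restore -v /dev/sdl1 /mnt/disk1/rescue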

    I'm going back to memtest and will leave it running overnight to get a more reliable result.

  10. Appreciate the help!

     

    I managed to mount the cache array with 

    mount -o ro,notreelog,nologreplay /dev/sdl1 /pleasework

     

    and it's currently copying the data to the main array.
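
    As a rough way to double-check the copy afterwards (the destination path is just a placeholder), a checksum-only dry-run compare could look like:

    # -r recurse, -c compare by checksum, -n dry run: lists files that are missing or differ
    rsync -rcn --out-format='%n' /pleasework/ /mnt/user/rescue/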

     

    Is there any risk that the files that are currently being copied have become corrupted?

    Is there anything I should pay extra attention to, to prevent this from happening again?

    And can I trust the Unraid web UI that this disk is the cause of the problem? It seems that whenever there are issues with the cache array, the error always points at the "first" disk in the cache array.

    [screenshot]

     

    After it's finished copying the data, is there any point in trying to run btrfs check --repair?

     

    I was planning to re-design the array after 6.9 stable, but sometimes things don't go according to plan :P.

  11. Greetings.

     

    Last night my cache array stopped working normally.

    What I suspect happened is that some docker container filled the free space.

     

    I'm unsure how to proceed to resolve this.

     

    So far I have started the array in maintenance mode and ran the btrfs check in read-only mode, which shows this at the beginning:

    [1/7] checking root items
    [2/7] checking extents
    bad key ordering 42 43
    bad key ordering 42 43
    bad key ordering 42 43
    bad key ordering 42 43
    bad block 16614422200320
    ERROR: errors found in extent allocation tree or chunk allocation
    [3/7] checking free space cache
    there is no free space entry for 16470947741696-16470947749888
    there is no free space entry for 16470947741696-16473475514368
    cache appears valid but isn't 16468106805248
    [4/7] checking fs roots
    bad key ordering 42 43

    Then it continues with a lot of

    bad key ordering 42 43

     

    and after that lots of file entries like:

    unresolved ref dir 6983009 index 30 namelen 15 name ClamWinPortable filetype 0 errors 3, no dir item, no dir index
    root 5 inode 75545 errors 2000, link count wrong

     

    What gives me some hope is this line:

    cache appears valid but isn't 16468106805248
     

    So, hopefully some btrfs magic can be made to get it back online.

     

    Edit: I never saw the cache array getting near full in the web UI, so I'm thinking something like inodes/allocation being full?
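
    For reference, this is how I'd check whether the btrfs allocation (chunks), rather than plain free space, had filled up, once/if it mounts again (mount point assumed to be /mnt/cache):

    # Shows data/metadata chunk allocation vs. usage; full metadata can look like a full disk even when free space is still reported
    btrfs filesystem usage /mnt/cache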

     

    Any help is appreciated. I've attached diagnostics.

     

    tower-diagnostics-20210124-1049.zip

  12. OK, it finished now:

     

    Quote

    Phase 1 - find and verify superblock...
            - block cache size set to 3067240 entries
    Phase 2 - using internal log
            - zero log...
    zero_log: head block 2019822 tail block 2019818
    ALERT: The filesystem has valuable metadata changes in a log which is being
    destroyed because the -L option was used.
            - scan filesystem freespace and inode maps...
    sb_fdblocks 85383563, counted 87550687
            - found root inode chunk
    Phase 3 - for each AG...
            - scan and clear agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
    Phase 5 - rebuild AG headers and trees...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - reset superblock...
    Phase 6 - check inode connectivity...
            - resetting contents of realtime bitmap and summary inodes
            - traversing filesystem ...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify and correct link counts...
    Maximum metadata LSN (2:2019828) is ahead of log (1:2).
    Format log to cycle 5.

    XFS_REPAIR Summary    Fri Jun 28 09:57:32 2019

    Phase           Start           End             Duration
    Phase 1:        06/28 09:23:44  06/28 09:23:44
    Phase 2:        06/28 09:23:44  06/28 09:34:22  10 minutes, 38 seconds
    Phase 3:        06/28 09:34:22  06/28 09:35:49  1 minute, 27 seconds
    Phase 4:        06/28 09:35:49  06/28 09:35:49
    Phase 5:        06/28 09:35:49  06/28 09:35:49
    Phase 6:        06/28 09:35:49  06/28 09:36:12  23 seconds
    Phase 7:        06/28 09:36:12  06/28 09:36:12
    Total run time: 12 minutes, 28 seconds
    done

    Then I stopped maintenance mode and started the array with the disk still unassigned, and my missing folders are back!

     

    What is my next step now? Stop the array and re-assign the disk again?

  13. Quote

    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
    ERROR: The filesystem has valuable metadata changes in a log which needs to
    be replayed.  Mount the filesystem to replay the log, and unmount it before
    re-running xfs_repair.  If you are unable to mount the filesystem, then use
    the -L option to destroy the log and attempt a repair.
    Note that destroying the log may cause corruption -- please attempt a mount
    of the filesystem before doing this.
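
    To spell out the two options from that message as I read them (run with the array in maintenance mode; /dev/md2 is the affected disk in my case):

    # Option 1: mount so the log gets replayed, then unmount and re-run xfs_repair
    # (this is the mount that already failed for me)
    mount /dev/md2 /mnt/disk2
    umount /mnt/disk2
    # Option 2, last resort per the warning above: zero the log and repair
    xfs_repair -L /dev/md2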

    @johnnie.black

  14. So I stopped the array, started it in maintenance mode, and ran the check with -nv:

     

    Quote

    Phase 1 - find and verify superblock...
            - block cache size set to 3067240 entries
    Phase 2 - using internal log
            - zero log...
    zero_log: head block 2019822 tail block 2019818
    ALERT: The filesystem has valuable metadata changes in a log which is being
    ignored because the -n option was used.  Expect spurious inconsistencies
    which may be resolved by first mounting the filesystem to replay the log.
            - scan filesystem freespace and inode maps...
    sb_fdblocks 85383563, counted 87550687
            - found root inode chunk
    Phase 3 - for each AG...
            - scan (but don't clear) agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
    No modify flag set, skipping phase 5
    Phase 6 - check inode connectivity...
            - traversing filesystem ...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify link counts...
    No modify flag set, skipping filesystem flush and exiting.

    XFS_REPAIR Summary    Thu Jun 27 23:40:59 2019

    Phase           Start           End             Duration
    Phase 1:        06/27 23:39:09  06/27 23:39:09
    Phase 2:        06/27 23:39:09  06/27 23:39:13  4 seconds
    Phase 3:        06/27 23:39:13  06/27 23:40:39  1 minute, 26 seconds
    Phase 4:        06/27 23:40:39  06/27 23:40:39
    Phase 5:        Skipped
    Phase 6:        06/27 23:40:39  06/27 23:40:59  20 seconds
    Phase 7:        06/27 23:40:59  06/27 23:40:59
    Total run time: 1 minute, 50 seconds

    So it doesn't tell me to run xfs_repair; what would my next step be?

  15. So when I stopped my array for a reboot to replace a cache SSD, it wouldn't stop and I had to power-cycle the host after a long wait.

    Once it was back up I got a warning about an array disk that had errored out.

     

    I started the array with the disk unassigned and then I saw this in the syslog:

    Quote

    Jun 27 19:40:59 Tower emhttpd: shcmd (178): mkdir -p /mnt/disk2
    Jun 27 19:40:59 Tower emhttpd: shcmd (179): mount -t xfs -o noatime,nodiratime /dev/md2 /mnt/disk2
    Jun 27 19:40:59 Tower kernel: XFS (md2): Mounting V5 Filesystem
    Jun 27 19:40:59 Tower kernel: XFS (md2): Corruption warning: Metadata has LSN (2:2019828) ahead of current LSN (2:2019822). Please unmount and run xfs_repair (>= v4.3) to resolve.
    Jun 27 19:40:59 Tower kernel: XFS (md2): log mount/recovery failed: error -22
    Jun 27 19:40:59 Tower kernel: XFS (md2): log mount failed
    Jun 27 19:40:59 Tower root: mount: /mnt/disk2: wrong fs type, bad option, bad superblock on /dev/md2, missing codepage or helper program, or other error.
    Jun 27 19:40:59 Tower emhttpd: shcmd (179): exit status: 32
    Jun 27 19:40:59 Tower emhttpd: /mnt/disk2 mount error: No file system
    Jun 27 19:40:59 Tower emhttpd: shcmd (180): umount /mnt/disk2
    Jun 27 19:40:59 Tower root: umount: /mnt/disk2: not mounted.
    Jun 27 19:40:59 Tower emhttpd: shcmd (180): exit status: 32

    Then I noticed some of the emulated shares are missing folders and files.

    I'm running Unraid 6.7.2.

     

    What should my next step be to try to salvage any data that is missing? I've attached diagnostics.

    tower-diagnostics-20190627-1838.zip

  16. Diagnostics are attached in the first post.

     

    Give me a min and I'll locate the time when it happened.

     

    In the syslog file:

    May 16 12:29:52 Tower sshd[7773]: Accepted password for root from 10.0.1.11 port 51344 ssh2
    May 16 12:30:12 Tower kernel: usb 4-1: USB disconnect, device number 2

     

    So between 12:29:52 and 12:30:12 I ran hdparm -W1 /dev/sdb

     

     

  17. OK, I forgot to mention that, after the first stop of the array, I power-cycled the cabinet that hosts the drives that had errors.

     

    I take it nothing can be done to save the lost files now?

     

    About split-level, I'm running: Automatically split any directory as required.

     

    I tried browsing the share from the web GUI; how could I verify whether they are anywhere else, @itimpi?

     

  18. The latest Fix Common Problems version was complaining that one of my disks did not have write cache enabled.

     

    So in order to fix that I ran hdparm -W1 /dev/sdb (while the array was still mounted). hdparm took longer than normal and ended with a failure; then, just seconds after hdparm ended, Unraid started reporting errors on all 4 disks connected to the same HDD cabinet as /dev/sdb. Eventually Unraid dropped one of the disks and marked it as failed.
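
    For the record, querying the current write-cache setting (read-only) versus enabling it looks like this:

    # Query only - prints whether write-caching is currently on
    hdparm -W /dev/sdb
    # Enable write cache (this is the command I ran, with the array still mounted)
    hdparm -W1 /dev/sdb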

     

    I stopped the array.

    Unassigned the disk that had been marked as failed.

    Started the array.

    Stopped the array.

    Re-assigned the disk that had been failed.

    Now it's running Parity-Sync/Data-Rebuild.

     

    The thing is, I browsed some of my shares and noticed there were folders that were just empty. Will the files inside those folders be recreated when the Parity-Sync/Data-Rebuild is completed, or did I suffer major data loss? If so, can I do anything to try to recover those files? (No backup.)

     

    Edit: Forgot diagnostics.

    tower-diagnostics-20190516-1724.zip

  19. I felt the need to clarify how I ended up VFIO-binding the HBA. It might save someone else from suffering the same fate.

    At first, when I upgraded to 6.7.0, I had problems booting Unraid properly, while safe mode worked.

     

    After troubleshooting the non-boot I found the cause to be something installed by the devpack plugin; once I removed it, I could boot normally. But by then I had already made a clean Unraid install on the USB and just copied over my config folder (forgetting the syslinux folder). The missing syslinux folder caused pcie_acs_override=downstream to disappear from the append line in my syslinux configuration, which in turn caused my HBA and GPU to end up in the same IOMMU group. The new vfio-pci.cfg will by default pass through everything in the IOMMU group, and that is what screwed me over when I started my array.
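
    For anyone else, the append line in question lives in syslinux/syslinux.cfg on the flash drive; the relevant bit looks roughly like this (only a sketch, the rest of the file omitted):

    label Unraid OS
      kernel /bzimage
      append pcie_acs_override=downstream initrd=/bzroot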

     

    So, all in all, bad luck mixed with some laziness screwed me over. Don't be lazy, people, and double-check your configurations!

     

    Now I'm hoping some btrfs guru will drop some magic repair command on me to save the day.

     

    IOMMU group 1:	[8086:1901] 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
    [8086:1905] 00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 05)
    [10de:1b81] 01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
    [10de:10f0] 01:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
    [1000:0072] 02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
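
    (A listing like the one above can be produced with the usual loop over /sys/kernel/iommu_groups, give or take formatting:)

    # Print each PCI device together with its IOMMU group number
    for d in /sys/kernel/iommu_groups/*/devices/*; do
        g=${d#/sys/kernel/iommu_groups/}; g=${g%%/*}
        echo -n "IOMMU group $g: "
        lspci -nns "${d##*/}"
    done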

     

  20. So I just upgraded to 6.7.0, and when I rebooted after adding the new vfio-pci.cfg to pass through my GPUs, Unraid somehow thought I passed through my HBA as well (and my dumb ass didn't notice), which in turn caused me to start the BTRFS cache array with just 2 SSDs (instead of the normal 10; the missing 8 disks are all connected to the "passed-through" HBA).

     

    Now my cache array is in the "Unmountable: No file system" state and I'm scared.

     

    I've rolled back the vfio-pci.cfg and Unraid can see all the SSDs, but I need help repairing my BTRFS RAID10 cache filesystem. I haven't started the array since my first fuckup, and if I try to assign the disks to the cache now I get the "All existing data on this device will be OVERWRITTEN when array is Started" warning.

     

    Technically the data should still be left on the disks, but metadata and logs are probably bonkers.
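
    (For reference, my understanding is that a non-destructive first look would be something like this; the device name is a placeholder for one of the cache SSDs:)

    # List btrfs filesystems/devices the kernel can see, without mounting anything
    btrfs filesystem show
    # Read-only consistency check against one member device (does not write)
    btrfs check --readonly /dev/sdX1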

     

    I've encountered "Unmountable: No file system" on the main array before, and at that time I repaired it with xfs_repair. But that was just one disk.

     

    Any guidance on how to proceed to repair the cache array would be very much appreciated.
