The Black Bear - Threadripper 2990WX build


Recommended Posts

1 hour ago, Gsusking2 said:

Hello @testdasi

Amazing thread! A lot of great information here. I am currently using the same mobo (had to replace an X399 board that died recently).

Can I ask what settings you used in the BIOS to get the USB to boot? I created it using the Unraid boot maker. It has been years since I had to finagle with a Gigabyte BIOS and I am completely at a loss after reading the Limetech 'getting started guide'.

If you could help shed any light on this situation it would be greatly appreciated.

Not much. Go to the boot order page and make sure the USB stick is on top (if you have UEFI enabled then you should see 2 lines corresponding to the USB stick). In my case, I just moved "SANDISK" to the top of the boot order.

Link to comment

Can someone clue me in on how to pass NVMe drives to VMs on a Designare and 2950? The two left-hand M.2 slots seem to be bound together, making it impossible to pass one through to a VM while using the other as a cache drive. But, for that matter, I can't even get the NVMe drive (set up bare metal) to boot a VM even if I pass through both M.2s.

IOMMU groups:

Quote

IOMMU group 0:[1022:1452] 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 1:[1022:1453] 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge

IOMMU group 2:[1022:1452] 00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 3:[1022:1452] 00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 4:[1022:1452] 00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 5:[1022:1452] 00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 6:[1022:1454] 00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B

IOMMU group 7:[1022:1452] 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 8:[1022:1454] 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B

IOMMU group 9:[1022:790b] 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)

[1022:790e] 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)

IOMMU group 10:[1022:1460] 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0

[1022:1461] 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1

[1022:1462] 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2

[1022:1463] 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3

[1022:1464] 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4

[1022:1465] 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5

[1022:1466] 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6

[1022:1467] 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7

IOMMU group 11:[1022:1460] 00:19.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0

[1022:1461] 00:19.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1

[1022:1462] 00:19.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2

[1022:1463] 00:19.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3

[1022:1464] 00:19.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4

[1022:1465] 00:19.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5

[1022:1466] 00:19.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6

[1022:1467] 00:19.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7

IOMMU group 12:[1022:43ba] 01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller (rev 02)

[1022:43b6] 01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller (rev 02)

[1022:43b1] 01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset PCIe Bridge (rev 02)

[1022:43b4] 02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)

[1022:43b4] 02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)

[1022:43b4] 02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)

[1022:43b4] 02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)

[1022:43b4] 02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)

[8086:1539] 04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

[8086:24fd] 05:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)

[8086:1539] 06:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

IOMMU group 13:[1022:145a] 08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function

[1022:1456] 08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor

[1022:145f] 08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller

IOMMU group 14:[1022:1455] 09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function

[1022:7901] 09:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

[1022:1457] 09:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller

IOMMU group 15:[1022:1452] 40:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 16:[1022:1453] 40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge

IOMMU group 17:[1022:1453] 40:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge

IOMMU group 18:[1022:1452] 40:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 19:[1022:1452] 40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 20:[1022:1453] 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge

IOMMU group 21:[1022:1452] 40:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 22:[1022:1452] 40:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 23:[1022:1454] 40:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B

IOMMU group 24:[1022:1452] 40:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge

IOMMU group 25:[1022:1454] 40:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B

IOMMU group 26:[1987:5012] 41:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01)

IOMMU group 27:[1987:5012] 42:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01)

IOMMU group 28:[10de:1c82] 43:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)

[10de:0fb9] 43:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)

IOMMU group 29:[1022:145a] 44:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function

[1022:1456] 44:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor

[1022:145f] 44:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller

IOMMU group 30:[1022:1455] 45:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function

[1022:7901] 45:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

 

Link to comment
16 hours ago, gtroyp said:

Can someone clue me in on how to pass NVMe drives to VMs on a Designare and 2950? The two left-hand M.2 slots seem to be bound together, making it impossible to pass one through to a VM while using the other as a cache drive. But, for that matter, I can't even get the NVMe drive (set up bare metal) to boot a VM even if I pass through both M.2s.

IOMMU groups:

 

You need to turn on ACS Override (set it to Both) to split the M.2 drives into their own IOMMU groups. And just to repeat like a broken record, the security concerns over ACS Override have virtually no impact on consumer use cases.
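For reference, on Unraid that setting just adds the ACS override parameter to the kernel command line (Settings > VM Manager does it for you, so you shouldn't need to edit anything by hand). A rough sketch of what the relevant part of /boot/syslinux/syslinux.cfg ends up looking like with "Both" selected:

    label Unraid OS
      menu default
      kernel /bzimage
      append pcie_acs_override=downstream,multifunction initrd=/bzroot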

 

In terms of booting, what do you mean by "can't get it to boot"? Did you change the boot order tag in the VM XML?
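For anyone unsure what that tag looks like: a minimal sketch is below, with the host PCI address being just an example (match it to your passed-through NVMe). Note that libvirt does not allow a per-device <boot order='...'/> together with a <boot dev='hd'/> line in the <os> section, so remove the latter if it's present.

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
      </source>
      <boot order='1'/>
    </hostdev>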

Edited by testdasi
Link to comment

I figured out why it wouldn't boot. Don't ask and I won't be embarrassed. However, nothing seems to fix the 2 drives being inseparable. I have turned on ACS Override, and I have moved one of the M.2s, but I think because they have the same controller onboard, they show up with the same ID and can't be separated (even when in different IOMMU groups). Going to buy a Samsung M.2 and see if that solves the problem...

 

Link to comment
2 hours ago, gtroyp said:

I figured out why it wouldn't boot. Don't ask and I won't be embarrassed. However, nothing seems to fix the 2 drives being inseparable. I have turned on ACS Override, and I have moved one of the M.2s, but I think because they have the same controller onboard, they show up with the same ID and can't be separated (even when in different IOMMU groups). Going to buy a Samsung M.2 and see if that solves the problem...

 

You are probably confusing IOMMU grouping with PCIe device binding / stubbing.

You are binding by ID so of course devices with the same ID will both be stubbed.

Watch SpaceInvader One's latest guide for updated instructions.
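To illustrate the difference with a generic Linux sketch (the guide covers the Unraid-specific way; the PCI addresses below are from the IOMMU list above and the commands assume the vfio-pci module is available): binding by vendor:device ID, e.g. vfio-pci.ids=1987:5012, grabs every matching device, whereas binding by PCI address targets a single one.

    # bind only the NVMe at 41:00.0 to vfio-pci, leaving the identical drive at 42:00.0 alone
    modprobe vfio-pci
    echo 0000:41:00.0 > /sys/bus/pci/devices/0000:41:00.0/driver/unbind
    echo vfio-pci > /sys/bus/pci/devices/0000:41:00.0/driver_override
    echo 0000:41:00.0 > /sys/bus/pci/drivers_probe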

 

Link to comment
On 5/22/2020 at 5:43 PM, testdasi said:

You are probably confusing IOMMU grouping with PCIe device binding / stubbing.

You are binding by ID so of course devices with the same ID will both be stubbed.

Watch SpaceInvader One's latest guide for updated instructions.

 

Truer words were never posted. All hail @SpaceInvaderOne for his help (contributing to his Patreon now).

 

Link to comment
  • 1 month later...

Long overdue updates:

  • I am so happy with the Optane performance that I added another one. This time it's the same 905P but in the 380GB 22110 M.2 form factor. I put it in the same Asus Hyper M.2 adapter / splitter, which is now fully populated (and used exclusively for the workstation VM). My workstation VM now has a 380GB Optane boot drive + 2x 960GB Optane working drives + 2x 3.84TB PM983 storage + a 2TB 970 Evo temp drive.
  • Finally bought a Quadro P2000 for hardware transcoding. Had some driver issues that didn't agree with Plex, so I spent a few days migrating to Jellyfin, and then the Plex issue was fixed. 😅 I still decided to maintain both Plex and Jellyfin. The former is for the local network (mainly because I already paid for a Plex lifetime membership) and the latter for remote access (because Jellyfin users are managed on my server instead of through a 3rd party like Plex).
  • And speaking of remote access, I finally got around to setting up letsencrypt to allow some remote access while on holiday, e.g. watching media, reading comics etc. Had to pay my ISP for this remote access privilege but it's not too bad.
  • Resisted checking out the 6.9.0 beta for quite some time and then noticed beta22 enables multiple pools, so I made the jump, only to open the zfs can of worms. 😆
    • So it started with the Unraid Nvidia custom build having the aforementioned driver clash with Plex. That forced me to look around a bit, and I noticed ich777's custom version, which has a later driver. He also builds zfs + nvidia versions, which I decided to pick just out of curiosity.
    • My original idea was to set up the 2x Intel 750 in a RAID-0 btrfs pool as my daily network-based drive. That wasn't ideal though, since I have some stuff for which I want fast NVMe speed but not the RAID-0 risk. So after some reading, I found out that a zfs pool is created on partitions instead of full disks (in fact, running zpool create on a /dev/something will create 2 partitions: p1 of type BF01 (ZFS Solaris/Apple) and p9 an 8MB BF07 (Solaris reserve), with only the BF01 partition used in the pool). So then came the grand plan (a sketch of the commands is at the end of this list):
      • Run zpool create on the Intel 750 NVMe just to set up p9 correctly, just to be safe.
      • Run gdisk to delete p1 and split it into 3 partitions: 512GB + 512GB + the rest (about 93GB).
      • Zpool p1 on each 750 in RAID 0 -> 1TB striped
      • Zpool p2 on each 750 in RAID 1 mirror -> 0.5TB mirror
      • Zpool p3 on each 750 in RAID 1 mirror -> 90+GB mirror
      • Leave p9 alone
    • So I now have a fast daily network driver (p1 striped), a safe daily network driver (p2 mirror e.g. for vdisks, docker, appdata etc.) and a document mirror (p3).
    • I then use znapzend to create snapshots automatically.
  • Some tips with zfs - because it wasn't all smooth sailing. It's quite appropriate that the zfs plugin is marked as for expert use only in the CA store.
    • I specifically use the by-id method (/dev/disk/by-id/...) to point to the partitions. I avoid the /dev/sdX-style device names since they can change.
    • Sharing zfs mounts over SMB causes spamming of sys_get_quota warnings because Samba tries to read quota information that is missing from /etc/mtab. This is because zfs import manages mounts outside of /etc/fstab (which is what creates entries in /etc/mtab).
      • The solution is pretty simple: echo a mount line into /etc/mtab for each filesystem that is exposed to SMB, even through symlinks:
        echo "[pool]/[filesystem] /mnt/[pool]/[filesystem] zfs rw,default 0 0" >> /etc/mtab

         

    • For whatever reason, a qcow2 image on the zfs filesystem + my VM config = libvirt hanging + zfs being unable to destroy the vdisk filesystem.
      • After half a day of troubleshooting and trying out various things, my current solution is to create a volume instead of a filesystem (-V to create a volume, -s to make it thin-provisioned). That automatically creates a matching /dev/zd# block device (zd instead of sd, starting with zd0, zd16, zd32, i.e. increasing by 16 for each new volume, don't ask me why) that you can mount in the VM as a block device through virtio (just like you would do to "pass through" storage by ata-id). A sketch of the workflow is at the end of this list.
      • You then use qemu-img convert to convert your vdisk file directly into /dev/zd# (target format raw), and voilà, you have a clone of your vdisk in the zfs volume. Just make sure the volume size you create matches the vdisk size.
      • Note that you might want to set cache='none' and discard='unmap' in the VM XML. The former is generally recommended (I believe to avoid double caching through the host page cache); the latter is to enable TRIM.
      • Presumably destroying a volume will change subsequent zd# numbers, requiring changes to the XML. I don't have enough VMs for it to be a problem and I also don't expect to destroy volumes often.
      • This is a good way to add snapshot and compression capabilities for an OS / filesystem that doesn't support them natively. For compression, there should be somewhat better performance since it's done on the host (with all cores available) instead of being limited to the cores assigned to the VM.
    • Copying a huge amount of data between filesystems in the same pool using console rsync seems to crash zfs - it hangs indefinitely and requires a reboot to get access back. Don't know why. Doing it through SMB is fine so far, so something is peculiar there. It doesn't affect me that much (I only discovered this when trying to clone appdata and vdisks between 2 filesystems using rsync).
    • You can use the new 6.9.0 beta feature of using a folder as the docker mount on the zfs filesystem. It works fine for me, with the major annoyance that docker creates a massive number of child filesystems. That makes zfs list very annoying to read, so after using it for a day, I moved back to just having a docker image file.
    • I create a filesystem for the appdata of each group of similar dockers. This simplifies snapshots while still allowing me some freedom in defining snapshot schedules.
    • Turning on compression improves speed but with caveats:
      • It only improves speed with highly compressible data, e.g. reading a file created by dd from /dev/zero hits 4.5TB/s (write speed was 1.2TB/s).
      • For highly incompressible stuff (e.g. archives, videos, etc.), it actually carries a speed penalty - very small with lz4, but there is a penalty.
      • You definitely want to create separate filesystems instead of just subfolders so you can manage compression accordingly.
      • gzip-9 is a fun experiment to hang your server during any IO. When people say lz4 is the best compromise, it's actually true so just stick to that.
  • Future consideration: I'm thinking of getting another PM983 to create a raidz1 pool in the host + create a volume to mount as a virtio volume. That would give me snapshots + RAID5-style redundancy + compression to use in Windows. Not sure about performance, so I may want to test it out.
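As mentioned in the list above, here is a minimal sketch of the partitioned-pool plan plus the znapzend setup. The by-id paths, pool names and dataset are placeholders rather than my exact ones (use ls -l /dev/disk/by-id/ to find yours), the gdisk partitioning itself is interactive so it isn't shown, and check znapzendzetup --help for the exact syntax:

    # striped pool on p1 of both Intel 750s (fast, no redundancy)
    zpool create -m /mnt/fastpool fastpool \
      /dev/disk/by-id/nvme-INTEL_750_SERIAL1-part1 \
      /dev/disk/by-id/nvme-INTEL_750_SERIAL2-part1

    # mirrored pool on p2 (vdisks, docker, appdata etc.)
    zpool create -m /mnt/safepool safepool mirror \
      /dev/disk/by-id/nvme-INTEL_750_SERIAL1-part2 \
      /dev/disk/by-id/nvme-INTEL_750_SERIAL2-part2

    # mirrored pool on p3 (documents)
    zpool create -m /mnt/docpool docpool mirror \
      /dev/disk/by-id/nvme-INTEL_750_SERIAL1-part3 \
      /dev/disk/by-id/nvme-INTEL_750_SERIAL2-part3

    # automated snapshots with znapzend, using a retention plan in its usual format
    znapzendzetup create --recursive SRC '1week=>1hour,1month=>1day,1year=>1week,10year=>1month' safepool
    znapzend --daemonize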
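And a sketch of the volume-instead-of-filesystem workaround for vdisks (pool, volume name, size and source path are examples only; the zvol also shows up under /dev/zvol/[pool]/[volume], which points at the matching /dev/zd#):

    # thin-provisioned 100G volume; must be at least as big as the vdisk being cloned
    zfs create -s -V 100G fastpool/win10-vdisk

    # clone the existing vdisk straight onto the zvol (-n skips target creation since the block device already exists)
    qemu-img convert -p -n -f qcow2 -O raw /mnt/user/domains/win10/vdisk1.qcow2 /dev/zvol/fastpool/win10-vdisk

    # then point a virtio disk in the VM XML at the zvol, with cache='none' and discard='unmap' as noted above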
Link to comment

A few more ZFS-related updates:

  • My original idea of running the PM983s in a RAIDZ1 pool + creating a volume to mount as virtio in the VM turned out to be a bad one. ZFS does NOT respect isolcpus, so it lags the main VM under heavy IO. It is normally OK when the volume isn't used by the VM, but once mounted, the IO wait exacerbates the lag.
  • I moved the Samsung and Crucial 2TB into 2 single-device ZFS pools to leverage the snapshot capabilities with znapzend. It works really well for incremental backup purposes.
  • I'm now looking for another 4TB+ SATA SSD to build a 4x SSD raidz1 pool as my online backup. Planning to add sync software to sync my main workstation's working NVMe with this new pool and use zfs snapshots for incrementals. That way I'll have a live backup without having to worry about cryptoviruses or the like, since a restore is just a zfs rollback away (a minimal sketch follows below).
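A minimal sketch of what that looks like in practice (dataset and snapshot names are examples):

    # list the snapshots taken for the backup dataset
    zfs list -t snapshot -r backuppool/workstation

    # roll back to a known-good snapshot (-r also destroys any snapshots newer than the target)
    zfs rollback -r backuppool/workstation@2020-08-01-120000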
Link to comment
  • 2 weeks later...

More updates after a fortnight with zfs:

  • I decided to not use it and switch to btrfs. 😂
    • The biggest reason is that ZFS does not respect isolcpus (there's even an official bug report for it, but with no fix mentioned). This is only made worse by what, after 2 weeks of use, I can only conclude is a ZFS preference for the isolated cores. Normally it's fine, but under heavy IO it's painful as it lags even web browsing.
      • In contrast, btrfs in 6.9.0 is much better. Only balance doesn't respect isolcpus now - which I can live with, as I can schedule that in the wee hours. Scrub and normal IO are both fine.
      • I also found out that the kswapd and unraidd processes don't respect isolcpus. The latter looks like an Unraid-spawned process, but at least they don't lag things as badly as zfs does under heavy IO.
    • A small annoyance is that I have to use the CLI to check on pool health and free space. No big deal, but it did play a part in the final decision.
    • The btrfs write hole issue is mostly fixed, except for scenarios that would also affect other non-zfs RAID solutions. At least with btrfs, I can have metadata and system chunks running in RAID1 (or RAID10 / RAID1C3), so a write hole is likely to only corrupt the particular data being written without killing the entire filesystem. And frequent scrubbing also helps.
  • The final nail was my epiphany moment realising btrfs can do snapshots just like ZFS. ZFS does have znapzend, which makes doing snapshots trivial. However, I was able to create a set of 4 scripts to do automated snapshots with cleanup in a way very similar to znapzend (e.g. the equivalent of the znapzend plan 1week=>1hour,1month=>1day,1year=>1week,10year=>1month). It's not as elegant but it works well enough that I have no reservations about moving to btrfs snapshots (a minimal sketch of the idea is at the end of this list).
    • Oh and btrfs also does compression. Not as well as zfs but good enough for backups.
  • So now I've taken full advantage of the Unraid 6.9.0-beta25 multi-pool feature to end up with a 1-2-3-4 setup (I didn't intend for it to be that way):
    • Array: 1x Samsung 970 Evo 2TB
      • No TRIM in the array, but I intend to use it as temp space, so it will have plenty of free space to mitigate write-speed degradation (e.g. it's sitting right now at 99.9% free). Hopefully I won't have to resort to a periodic blkdiscard to refresh it.
    • Pool1: 2x Intel 750 1.2TB in RAID0
      • The daily network driver. Most stuff is done on here.
    • Pool2: 3x Samsung 860 Evo 4TB in RAID-5 (metadata + system in RAID1C3)
      • I know it's overkill doing RAID5 data chunks + RAID1C3 metadata/system chunks, but then I'd like something irrational in my life.
      • This is my main game / Steam storage. Performance is surprisingly good even over the network (e.g. ARK only loads about 30% slower than on a passed-through NVMe, and that's the worst-case scenario due to ARK's liberal use of tiny files).
    • Pool3: 4x Kingston DC500R 7.68TB in RAID-5 (metadata + system in RAID10)
      • Running metadata + system chunks in RAID10 provides theoretically better performance (i.e. not perceivable in practice) with the same single-failure protection as RAID-5.
      • This is for online backup of my workstation data (with compression) and miscellaneous storage i.e. it's what the array used to be for me.
    • And a spare SATA port for my Bluray drive 🤣
  • It's kinda ironic that I originally used Unraid because it's NOT RAID but have evolved into running 3 RAID pools. 😅 It speaks volumes to how important the non-core features (namely VMs with PCIe pass-through and docker with readily available apps) have become over the years.
    • Now if only Limetech would remove the requirement to have at least one device in the array but maybe that's too much to ask. 😆
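As mentioned above, a minimal sketch of the idea behind those snapshot scripts (paths, naming and retention count are illustrative, not my exact scripts, and each protected path needs to be a btrfs subvolume):

    #!/bin/bash
    SRC=/mnt/pool2/steam           # subvolume to protect (example)
    DST=/mnt/pool2/.snapshots      # where snapshots are kept (example)
    KEEP=24                        # number of snapshots to retain

    mkdir -p "$DST"
    btrfs subvolume snapshot -r "$SRC" "$DST/steam_$(date +%Y%m%d-%H%M)"

    # prune everything older than the newest $KEEP snapshots
    ls -1d "$DST"/steam_* | head -n -"$KEEP" | while read -r snap; do
        btrfs subvolume delete "$snap"
    done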

 

 

Link to comment
  • 2 weeks later...

I saw a post on the forum asking if BTRFS RAID5/6 is still a big no-no, and given the lack of information out there, it's kinda funny how some outdated noise can reverberate for so long. So let's clarify.

 

The reason BTRFS RAID5/6 has been marked as "unstable" is a seemingly mysterious problem called the "write hole".

  • A write hole happens when an unclean shutdown (most commonly due to a power outage, but it could also be due to, for example, a kernel crash) causes a parity block to be wrong. That problem, on its own, is mundane with BTRFS because a scrub (with correction) will fix it.
    • Note that Unraid defaults to running BTRFS scrub without correction and, as johnnie the sage said, there's no reason to run a BTRFS scrub without correction, so remember to tick the box. 😅 (A CLI sketch is right after this list.)
  • Now if a write hole happens AND you have not scrubbed AND then a disk fails, you run into trouble because you can't recover from a wrong parity block.
    • This only affects blocks which have wrong parity. Anything that doesn't have wrong parity is still recoverable.
    • Mathematically, RAID-1 is a 2-device RAID-5 (because the parity of a single bit is the bit itself, i.e. x xor 0 = x), so the write hole affects RAID-1 as well. It's just that anything that kills RAID-1 data chunks would also kill RAID-1 metadata chunks (read further below on the default profile for metadata), so it isn't considered "unstable", because of reasons.
    • And remember, this only happens when you have TWO consecutive failures without a full scrub in between.
  • I think the main reason the BTRFS folks mark RAID5/6 as unstable is ZFS, which by design is immune to the write hole. ZFS is like your overachieving cousin - no matter how good you are, your parents always want to upgrade you to that cousin 🧐.
    • The risk of consecutive failures is also relatively higher with HDDs, so applying BTRFS RAID5/6 to an HDD pool would certainly carry more risk than ZFS on HDDs or BTRFS on SSDs.
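As flagged above, a correcting scrub is trivial from the CLI (using /mnt/cache as the example pool; unlike the GUI default, the CLI repairs by default and -r is what makes it read-only):

    btrfs scrub start /mnt/cache     # kicks off a repairing scrub in the background
    btrfs scrub status /mnt/cache    # progress plus corrected / uncorrectable error counts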

 

Now that it is looking quite a bit less mysterious, we can add a few tips to help with reliability.

  • Starting with the most straightforward: only run BTRFS RAID5/6 on an SSD pool to reduce the risk of consecutive failures.
    • SSDs fail less often (and less catastrophically) and are natively faster, increasing your chance of completing a scrub between 2 failures.
    • This point is somewhat academic for Unraid users, since most people put HDDs in the array and run a BTRFS cache pool (or multiple pools with 6.9.0).
      • But note that it would still be a good idea to run BTRFS single on the array HDDs to benefit from checksums.
  • Run scrub frequently and watch out for IO errors so you can take the appropriate actions promptly.
    • As mentioned, the issue only arises with two consecutive failures without a scrub in between, so scrubbing and prompt action are very important.
    • A side bonus with scrubbing for RAID-1/10/5/6 is that it will also correct the unlikely scenario of silent data corruption.
  • The next tip is to run metadata (and system) chunks with as much redundancy as you can.
    • That means
      • DUP profile with single disk (e.g. array / single drive) (DUP = 2 copies of data on a single device)
      • RAID1 with 2 disks even if you have data chunks in RAID-0
      • RAID1C3 with 3 disks
      • RAID1C4 with 4+ disks.
    • To understand why, we first need to delve into the 3 "chunk" types of BTRFS:
      • Data is what it says on the tin i.e. your actual data, which should occupy the most storage space
        • When one refers to a "BTRFS RAID5" pool, it typically refers only to the data chunks.
      • Metadata is, quoting from BTRFS wiki: "Data about data. In btrfs, this includes all of the internal data structures of the filesystem, including directory structures, filenames, file permissions, checksums, and the location of each file's extents.". Underline added for emphasis.
        • Metadata defaults to DUP for a single HDD, SINGLE for a single SSD, and RAID1 for a multi-device pool.
        • The reason the metadata default is SINGLE for SSDs but DUP for HDDs was originally to reduce write wear, but metadata chunks are relatively small so they don't affect longevity that much. The newer explanation is that consecutive writes usually end up in the same cell anyway, and an SSD cell tends to fail in full, so even DUP won't help if the SSD fails. That makes more sense to me, but it doesn't stop DUP from being theoretically slightly better in terms of reliability.
      • System chunk is... I have no clue what it is 😅 but from my understanding, it contains information about how to construct the pool from the various devices.
        • Same default profile as metadata.
    • The filesystem checksum is used to determine if a recovered block is good or corrupted (e.g. due to wrong parity). So as long as your filesystem is intact (or recoverable), you will, at worst, lose only the blocks with wrong parity. The key here is the ability to recover SOME data.
      • As already discussed in the 1st point, SSDs don't fail as often and rarely fail catastrophically. So if you "lose" a drive, it's likely because of (for example) a cable problem causing the drive to drop offline and not because its moving parts have worn out (because there are no moving parts).
      • Running RAID1C3 metadata with 3 disks and RAID5 data sounds like overkill theoretically, because a 2-drive failure would render all data irrecoverable. However, if it was just 2 drives dropping offline because of cables, you still have the remaining drive holding intact metadata (i.e. the filesystem!), so you just need to shut down, make sure the drives come back online, and you have a VERY GOOD CHANCE of losing only the data that was written AFTER the double failure.
        • Obviously, $#!t happens and can derail the best plan out there, but I'm not planning for a zombie apocalypse. 😉
    • You can convert the profile of your data (dconvert), metadata (mconvert), and system (sconvert) chunks using these commands:
      btrfs balance start -dconvert=[profile d] -mconvert=[profile m] /mnt/[pool name]
      btrfs balance start -sconvert=[profile s] -f /mnt/[pool name]

      For example: 

      btrfs balance start -dconvert=raid5 -mconvert=raid1c3 /mnt/cache
      btrfs balance start -sconvert=raid1c3 -f /mnt/cache

      The reason I split sconvert out from d/mconvert is that it requires the -f option (i.e. "force"). The force option is considered "dangerous", e.g. if there's a failure during the conversion then your pool info may be corrupted (i.e. "unmountable"). I have run it many times (in fact running everything together with -f) without any issue whatsoever, but you have been warned - keep a good backup of anything important. (To verify the resulting profiles, see the sketch right after this list.)

      • Needless to say, the best time to do this is right after you create the pool.
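To verify what the pool ended up with after the balance (again using /mnt/cache as the example):

    btrfs filesystem df /mnt/cache      # shows the profile per chunk type, e.g. Data, RAID5 / Metadata, RAID1C3 / System, RAID1C3
    btrfs filesystem usage /mnt/cache   # more detail, including per-device allocation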

 

That is it for now. Hope it helps clarify some of the misconceptions out there.

 

 

Edited by testdasi
Link to comment
