• [6.9.0 beta 35] BTRFS Raid 1 pool issue


    Gnomuz
    • Closed

    This morning I found all containers and VMs using the BTRFS RAID 1 cache pool unresponsive, with numerous log errors, and the syslog was spammed with messages like "Dec 13 06:56:54 NAS kernel: sd 5:0:0:0: [sdi] tag#6 access beyond end of device".

    The cache pool is composed of two SATA SSDs:

    - sdi: 840 Pro 512 GB

    - sdh: 860 Evo 500 GB

    I quickly realized that every write access to the cache pool had been returning errors since Dec 13 06:55:14.

     

    A similar issue already happened once on 6.9.0 beta 25, but today I was clever enough (?) to download the attached diagnostics before stopping the VM manager and Docker and rebooting. After the reboot, I performed a full balance and scrub (no errors) on the pool, then restarted the VMs and containers, and everything works fine again. Oddly, a parity check was launched after the reboot; for whatever reason the shutdown was considered unclean.
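    For reference, the post-reboot maintenance boils down to two commands (mount point as on my system; the guard below only prints the commands unless DO_IT=1 is set, so the sketch is safe to paste):

    ```shell
    # Print-only unless DO_IT=1: a safety guard for this sketch.
    run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

    # Full balance rewrites every block group across both pool members.
    run btrfs balance start /mnt/cache

    # Scrub re-reads all data and metadata, verifying checksums on both copies;
    # -B stays in the foreground so the exit status reflects the result.
    run btrfs scrub start -B /mnt/cache
    run btrfs scrub status /mnt/cache
    ```
    
    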

     

    It may be a hardware issue, but I've also always wondered whether it was a good idea on my part to build a RAID 1 pool from two drives of different capacities, which, by the way, has a reported size of 506GB (!) in the "Main" tab.
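    Doing the math, 506 looks suspiciously like half the summed raw capacities, whereas a two-device RAID 1 can only mirror as much as the smaller drive holds. That's just a guess on my part about what the GUI computes, sketched below:

    ```shell
    # Hypothesis: the GUI reports half the summed raw capacity, while a
    # two-device raid1 can only mirror up to the smaller drive's size.
    big=512 small=500                    # GB, marketing sizes of the two SSDs
    reported=$(( (big + small) / 2 ))    # what the "Main" tab seems to show
    usable=$small                        # what raid1 can actually mirror
    echo "reported=${reported}GB usable=${usable}GB"
    ```
    
    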

     

    Thanks in advance for having a look at the diags and hopefully giving me some ideas on how to get rid of this recurring and very worrying instability.

    nas-diagnostics-20201213-0948.zip



    User Feedback

    Recommended Comments

    This is a general support issue; one of the cache devices dropped offline:

     

    Dec 13 06:56:15 NAS kernel: ata4: softreset failed (1st FIS failed)
    Dec 13 06:56:15 NAS kernel: ata4: limiting SATA link speed to 3.0 Gbps
    Dec 13 06:56:15 NAS kernel: ata4: hard resetting link
    Dec 13 06:56:20 NAS kernel: ata4: softreset failed (1st FIS failed)
    Dec 13 06:56:20 NAS kernel: ata4: reset failed, giving up
    Dec 13 06:56:20 NAS kernel: ata4.00: disabled

    This was caused by a problem with the AMD onboard SATA controller:

    Dec 13 06:55:14 NAS kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0015 address=0xdef9e000 flags=0x0000]

     

    This issue is quite common with Ryzen boards. Look for a BIOS update, or disable IOMMU if you don't need it; a newer/different kernel might also help.
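    If you do need IOMMU for passthrough, a middle-ground option sometimes suggested is keeping it enabled but booting the kernel in pass-through mode, e.g. appending `iommu=pt` in syslinux.cfg (the entry below is just an example; adjust to your own boot label):

    ```
    label Unraid OS
      menu default
      kernel /bzimage
      append iommu=pt initrd=/bzroot
    ```

    Whether that avoids the AMD-Vi page faults depends on the board, so treat it as something to test, not a guaranteed fix.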


    Thanks for the reply.

     

    I'm on the latest BIOS version, which unfortunately is not very recent for this motherboard (AGESA 1.0.0.6). BIOS releases from ASRock Rack are rare...

    As for IOMMU, it is enabled. I understand it is necessary, because I have a GPU used in containers via the new Nvidia driver plugin, and I also pass it through to a VM sometimes (with containers disabled then, of course).

    For a newer kernel, I'll have to wait for RC2, as I have telegraf running and using smartctl, and I really don't like the idea of having my 6 HDDs spun up 24/7.

     

    If I understand correctly, the problem arises because the SSDs are connected through the onboard AMD SATA controller. I may have an option then: I have an LSI HBA for the 6 HDDs, and two spare trays in the HDD cages. So I could put the two SSDs in the spare trays and have them connected through the HBA.

    Is that a viable approach, and if so, how should I proceed without losing any data on the cache?

    Thanks in advance.


    That should fix the problem for now. You'll lose trim support, but that's a small price to pay for stability. You can just connect the SSDs to the LSI; Unraid will pick them up without you needing to do anything else.


    Thanks, but why would I lose trim support? You mean trim will not work if I connect the SSDs to the LSI HBA (IT mode) instead of the built-in SATA controller?

    I was aware you can't trim SSDs when they belong to the array.

    But my understanding was that LSI HBAs like my 9207-8i (SAS2308, FW 20.00.07.00) do support trim in IT mode, provided the attached SSD(s) have both 'Data Set Management TRIM supported (limit 8 blocks)' and 'Deterministic read ZEROs after TRIM' enabled. I just checked with hdparm -I on my SSDs; both the Samsung 840 Pro and 860 Evo have these two features enabled, lucky me for once 😉.
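    For anyone wanting to run the same check, the two capability lines can be pulled out of hdparm's output like this (the excerpt below is a captured sample standing in for a real device):

    ```shell
    # On a live system:  hdparm -I /dev/sdX | grep -icE 'TRIM supported|after TRIM'
    # Here a captured hdparm excerpt stands in for the real device.
    sample='   *    Data Set Management TRIM supported (limit 8 blocks)
       *    Deterministic read ZEROs after TRIM'
    count=$(printf '%s\n' "$sample" | grep -icE 'TRIM supported|after TRIM')
    echo "$count of 2 required TRIM features present"
    ```
    
    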

    So I think it should work fine in my case.

    I insist because, for me, possibly losing trim is not really a small price to pay, even in exchange for greater stability, especially on a cache drive that is written to heavily when transferring data from workstations to the NAS, data which is then moved to the array by the mover overnight. From my understanding, that's exactly the kind of use case where trimming an SSD makes a lot of sense.

    Edited by Gnomuz
    fw version
    17 hours ago, Gnomuz said:

    But my understanding was that LSI HBAs like my 9207-8i (SAS2308, FW 20.00.07.00) do support trim in IT mode, provided the attached SSD(s) have both 'Data Set Management TRIM supported (limit 8 blocks)' and 'Deterministic read ZEROs after TRIM' enabled. I just checked with hdparm -I on my SSDs; both the Samsung 840 Pro and 860 Evo have these two features enabled, lucky me for once

    For some time now, since firmware P17 IIRC, trim hasn't worked with LSI SAS2 models; it does work with SAS3 models (and only on SSDs with DRZAT, as you mentioned).

    1 hour ago, JorgeB said:

    For some time now, since firmware P17 IIRC, trim hasn't worked with LSI SAS2 models; it does work with SAS3 models (and only on SSDs with DRZAT, as you mentioned).

    As I had read contradictory information on this matter, I hot-plugged a spare cold-storage SanDisk SSD Plus 2TB into a spare tray. This SSD seems to have the required features, even if DRZAT is called "Deterministic read data after TRIM" here:

    root@NAS:~# hdparm -I /dev/sdj
    
    /dev/sdj:
    
    ATA device, with non-removable media
            Model Number:       SanDisk SSD PLUS 2000GB                 
            Serial Number:      2025BG457713        
            Firmware Revision:  UP4504RL
            Media Serial Num:   
            Media Manufacturer: 
            Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    Standards:
            Used: unknown (minor revision code 0x0110) 
            Supported: 10 9 8 7 6 5 
            Likely used: 10
    Configuration:
            Logical         max     current
            cylinders       16383   0
            heads           16      0
            sectors/track   63      0
            --
            LBA    user addressable sectors:   268435455
            LBA48  user addressable sectors:  3907029168
            Logical  Sector size:                   512 bytes
            Physical Sector size:                   512 bytes
            Logical Sector-0 offset:                  0 bytes
            device size with M = 1024*1024:     1907729 MBytes
            device size with M = 1000*1000:     2000398 MBytes (2000 GB)
            cache/buffer size  = unknown
            Form Factor: 2.5 inch
            Nominal Media Rotation Rate: Solid State Device
    Capabilities:
            LBA, IORDY(can be disabled)
            Queue depth: 32
            Standby timer values: spec'd by Standard, no device specific minimum
            R/W multiple sector transfer: Max = 1   Current = 1
            Advanced power management level: disabled
            DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
                 Cycle time: min=120ns recommended=120ns
            PIO: pio0 pio1 pio2 pio3 pio4 
                 Cycle time: no flow control=120ns  IORDY flow control=120ns
    Commands/features:
            Enabled Supported:
               *    SMART feature set
                    Security Mode feature set
               *    Power Management feature set
               *    Write cache
               *    Look-ahead
               *    Host Protected Area feature set
               *    WRITE_BUFFER command
               *    READ_BUFFER command
               *    DOWNLOAD_MICROCODE
                    Advanced Power Management feature set
                    SET_MAX security extension
               *    48-bit Address feature set
               *    Device Configuration Overlay feature set
               *    Mandatory FLUSH_CACHE
               *    FLUSH_CACHE_EXT
               *    SMART error logging
               *    SMART self-test
               *    General Purpose Logging feature set
               *    64-bit World wide name
               *    WRITE_UNCORRECTABLE_EXT command
               *    {READ,WRITE}_DMA_EXT_GPL commands
               *    Segmented DOWNLOAD_MICROCODE
                    unknown 119[8]
               *    Gen1 signaling speed (1.5Gb/s)
               *    Gen2 signaling speed (3.0Gb/s)
               *    Gen3 signaling speed (6.0Gb/s)
               *    Native Command Queueing (NCQ)
               *    Phy event counters
               *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
                    Device-initiated interface power management
               *    Software settings preservation
                    Device Sleep (DEVSLP)
               *    SANITIZE feature set
               *    BLOCK_ERASE_EXT command
               *    SET MAX SETPASSWORD/UNLOCK DMA commands
               *    WRITE BUFFER DMA command
               *    READ BUFFER DMA command
               *    DEVICE CONFIGURATION SET/IDENTIFY DMA commands
               *    Data Set Management TRIM supported (limit 8 blocks) <------
               *    Deterministic read data after TRIM                  <------
    Security: 
            Master password revision code = 65534
                    supported
            not     enabled
            not     locked
            not     frozen
            not     expired: security count
            not     supported: enhanced erase
            2min for SECURITY ERASE UNIT.
    Logical Unit WWN Device Identifier: 5001b444a7732960
            NAA             : 5
            IEEE OUI        : 001b44
            Unique ID       : 4a7732960
    Device Sleep:
            DEVSLP Exit Timeout (DETO): 50 ms (drive)
            Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
    Checksum: correct

     I mounted the drive with UD (Unassigned Devices), and fstrim seems to work fine with my SAS2308 HBA (IT-mode firmware 20.00.07.00):

    root@NAS:~# fstrim -v /mnt/disks/WKSTN/
    /mnt/disks/WKSTN/: 1.8 TiB (2000201273344 bytes) trimmed

    So maybe something has changed with the latest versions of the drivers? That's beyond my technical skills, but I just wanted to report it for others wondering about TRIM support with this HBA/firmware combination.

    Thanks again for your help @JorgeB !

    15 minutes ago, Gnomuz said:

    So maybe something has changed with the latest versions of the drivers?

    Most likely, it's been a while since I tested.


    As the I/O error happened again with the cache SSDs connected to the onboard SATA controller, I decided to plug them into the spare trays and use the LSI 9207-8i instead.

    As @JorgeB had told me, the SSDs were recognized by Unraid on reboot, and the cache was mounted properly.

    'sdc' is the 840 Pro SSD which didn't report any error so far.

    'sdf' is the 860 Evo SSD on which the IO errors occurred.

     

    I launched fstrim to check the functionality. Here's the worrying result:

    root@NAS:~# fstrim -v /mnt/cache
    fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

    and syslog is less terse:

    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 Sense Key : 0x5 [current]
    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 ASC=0x21 ASCQ=0x0
    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
    Dec 14 19:48:57 NAS kernel: blk_update_request: critical target error, dev sdf, sector 973080532 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
    Dec 14 19:48:57 NAS kernel: BTRFS warning (device sdf1): failed to trim 1 device(s), last error -121

    That's where I need help. My understanding, though I may be wrong, is that the trim operation succeeded on sdc (840 Pro), which would indicate the problem lies specifically with sdf (860 Evo).

    As the physical connections are completely different now (SATA cables from the onboard controller before, hot-plug trays on the HBA now), the SATA cables can be ruled out.

    I'm starting to wonder whether the 860 Evo itself has an issue. Or maybe I should reinitialize it and rebuild the RAID 1, but I must admit BTRFS is really cryptic to me...

     

    I launched a short SMART self-test, which completed without error.

    I've now launched an extended self-test (polling time 85 min); we'll see...

     

    Any help and idea will be appreciated.

    Edited by Gnomuz

    I don't know who had the brilliant idea of marking this thread as "closed" after I reopened it, but I can confirm there's still an issue there. But again, as on another issue, perhaps "Closed" means "Open" or "Under review" for some around there, who knows ...

     

    Anyway, the extended smart test completed without error.

    Even with my very limited understanding of BTRFS, I found a thread elsewhere and ran "btrfs check /dev/sdf1", which reported many errors (output too long to include here), among them "errors found in fs roots".

    I think the file system has been corrupted on this SSD due to the device going offline while still connected to the onboard controller. And I'm almost sure I would lose data if the clean SSD were to have any problem...

    So now, how could I rebuild the RAID 1 pool with both SSDs, assuming sdc is clean and sdf needs to be somehow reformatted?
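    From the threads I've found, the sequence would be something like the sketch below (my paraphrase, not verified advice; the device names are the ones from my system, and if the corruption is filesystem-wide rather than confined to one member, recreating the pool from backup may be the only clean fix). The guard only prints the commands unless DO_IT=1 is set:

    ```shell
    # Print-only unless DO_IT=1: a safety guard for this unverified sketch.
    run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

    # 1. Convert to single-copy profiles so the pool can run on one device
    run btrfs balance start -f -dconvert=single -mconvert=single /mnt/cache
    # 2. Remove the suspect member (relocates any extents still on it to sdc)
    run btrfs device remove /dev/sdf1 /mnt/cache
    # 3. Wipe stale signatures and re-add it as a fresh member
    run wipefs -a /dev/sdf1
    run btrfs device add /dev/sdf1 /mnt/cache
    # 4. Convert data and metadata back to raid1 across both members
    run btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
    ```
    
    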

    Thanks in advance for your guidance and please do not bury this thread again.


    As mentioned before, this is not a bug, it's a general support issue, and I also did mention that trim would most likely not work with that LSI. If you still need support, you should start a thread in the general support forum.


    Thanks, I just thought that following up on the initial issue would help the community understand the context and provide relevant answers. My mistake.

    So I'm going to open a thread in General Support to ask the very same question, perhaps with a reference to this (non-bug) thread, if it doesn't infringe any forum rule.


    There have been arguments on both sides of the fence (btrfs is great, btrfs is not great). My experience has been the latter, and I would recommend not using it for your cache. Switch to XFS and forget the mirror. With a BTRFS mirror I did not have a reliable experience, nor have others.

     

    BTRFS 'should' be able to cope with a disconnect, even if it had to be fixed manually. However, if both devices have got constant hardware disconnects then I guess that's going to be pretty challenging on any filesystem.

     

    From my quick reading, though, this is a kernel issue, not a hardware issue, so if you're not on the latest beta, I'd probably shift to it if I were you, given RC1 has a much newer kernel.

     

    Running SSDs via an LSI card will reduce their speed as well, BTW.

     

    Another thing I'd do is raise it directly on the kernel mailing list (it's probably already there if it's still a current issue), because the fix would eventually find its way into Unraid.

     

    Hopefully you're not already on the latest beta/kernel, as that alone might possibly solve it.

     

     

    54 minutes ago, Marshalleq said:

    There have been arguments on both sides of the fence (btrfs is great, btrfs is not great). My experience has been the latter, and I would recommend not using it for your cache. Switch to XFS and forget the mirror. With a BTRFS mirror I did not have a reliable experience, nor have others.

     


    Thanks for your thoughts on my initial problem. I'm not on the latest beta, RC1, because of an issue there which prevents disks from spinning down in configurations similar to mine (telegraf/smartctl). I'm waiting for RC2, and thus kernel 5.9+, to test the SSDs on the onboard SATA controller again, as this is a known issue with X470 boards.

     

    Btw, I only had disconnection issues with one of the SSDs, but btrfs never coped with it. For sure, a RAID 1 file system which turns unwritable when one of the pool members fails while the other is up and running, and which requires a reboot to start over, is not exactly what I expected... So let's say we share the same unreliable experience with BTRFS mirroring.

     

    For the moment I've left the cache SSDs running via the LSI, converted the cache to XFS and dropped the mirroring as you suggested. I immediately noticed that, with the same overall I/O load on the cache, the constant write load from running VMs and containers was divided by roughly 3.5 when switching from BTRFS RAID 1 to XFS. BTRFS write amplification is, at least, a reality I can confirm...

    Edited by Gnomuz

     

    2 hours ago, Gnomuz said:

    a RAID 1 file system which turns unwritable when one of the pool members fails while the other is up and running, and which requires a reboot to start over, is not exactly what I expected... So let's say we share the same unreliable experience with BTRFS mirroring.

    LOL, I know right?! This is what amazes me sometimes with the defenders of BTRFS. The basics 'seem' to be overlooked. That said, I've never been sure if it's BTRFS itself or something wonky with the way Unraid formats the mirror. I know in the past there definitely WAS a problem with the way Unraid formatted it (I got shot down for suggesting that too), but in the end that was fixed by the wonderful guys at LimeTech.

     

    In that prior example, I found that I could fix the issue on the command line but not in the GUI, but I still had issues with BTRFS getting corrupted, so I ended up giving up. Full disclaimer: I'm pretty light on BTRFS experience; however, having used probably more filesystems than most, in both large production and small residential instances, I think I'm qualified enough to say it's unreliable, at least in Unraid (the only place I've used it).

    2 hours ago, Gnomuz said:

    For the moment I've left the cache SSDs running via the LSI, converted the cache to XFS and dropped the mirroring as you suggested. I immediately noticed that, with the same overall I/O load on the cache, the constant write load from running VMs and containers was divided by roughly 3.5 when switching from BTRFS RAID 1 to XFS. BTRFS write amplification is, at least, a reality I can confirm...

    That's an impressive stat. I thought I read that write amplification was solved some time ago in Unraid, but perhaps it's come back (or perhaps a reformat is needed with accurate cluster sizes or something). BTW, I'm also sure I read in the previous beta that the spin-down issue was solved for SAS drives, which logically should also cover drives behind a controller of some sort. It was in one of the release notes.

     

    You may be interested in my post this morning on my experience migrating away from unraids array here (though still using unraid product).  Not for the inexperienced but I'm definitely looking forward to ZFS being baked in.

    Edited by Marshalleq
    6 hours ago, Marshalleq said:

    That's an impressive stat. I thought I read that write amplification was solved some time ago in Unraid, but perhaps it's come back (or perhaps a reformat is needed with accurate cluster sizes or something). BTW, I'm also sure I read in the previous beta that the spin-down issue was solved for SAS drives, which logically should also cover drives behind a controller of some sort. It was in one of the release notes.

    As for the (non-)spin-down issue, I understand it's brand new in RC1, due to the kernel upgrade to 5.9, when smartctl is used by e.g. telegraf to poll disk stats, and it should be fixed in the next RC.

    So I'm keeping away for the moment. It would really be a pity to have all disks spun up in an Unraid array without getting the benefits of a file system which by nature keeps disks spun up but gives you performance and "minor" features such as snapshots or read caching in return, wouldn't it? 😉

    As for write amplification, I had carefully followed the steps to align both SSDs' partitions to 1 MiB, IIRC, so the comparison of write loads between the BTRFS RAID 1 and XFS cache is against an already "optimized" BTRFS setup. I just glanced at the overnight stats; a factor of 3 to 4 is confirmed so far.
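    The alignment itself is easy to re-check: a 1 MiB boundary means the partition's start sector (in 512-byte sectors) is a multiple of 2048. The start value below is what I'd expect from the alignment steps; on a live system you'd read it from /sys/block/sdX/sdX1/start:

    ```shell
    # 2048 sectors * 512 bytes = 1 MiB, so start % 2048 == 0 means aligned.
    start=2048                            # example value; read yours from sysfs
    aligned=$(( start % 2048 == 0 ))      # 1 if aligned, 0 otherwise
    if [ "$aligned" -eq 1 ]; then
      echo "aligned to 1 MiB"
    else
      echo "NOT aligned"
    fi
    ```
    
    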


    Well, according to RC2's release notes, I was wrong about SAS and domains. But 1.5 of your issues are sorted in RC2 by the looks of it. :)


    Well, no upgrade on Sundays, family first ...

    As for write amplification, now that I can step back, I can confirm my preliminary findings about the evolution of the write load on the cache. I compared the average write load between 12/12 (BTRFS RAID 1) and 12/19 (XFS) over a full 24h period, and the result is clear: 1.35 MB/s vs 454 kB/s, i.e. a factor of 3.
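    For the record, the factor works out as plain arithmetic on the two 24h averages:

    ```shell
    # Both averages in kB/s (1.35 MB/s = 1350 kB/s).
    btrfs_kbps=1350
    xfs_kbps=454
    # Scale by 100 to keep two decimals with integer-only shell arithmetic.
    ratio_x100=$(( btrfs_kbps * 100 / xfs_kbps ))
    echo "write load ratio: ${ratio_x100}/100, i.e. roughly 3x"
    ```
    
    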

    As I can't believe that the overhead of RAID 1 metadata management alone could explain such a difference, it confirms for me a BTRFS weakness, whatever the partition alignment may be...

    Edited by Gnomuz

    Btrfs is known to have higher write amplification, mostly because it's COW and checksummed; it can reach 25x IIRC, there's a study about that.





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.

