• [6.9.0 beta 35] BTRFS Raid 1 pool issue


    Gnomuz
    • Closed

    This morning I found all containers and VMs using the BTRFS RAID 1 cache pool unresponsive, with numerous log errors, and the syslog was spammed with messages like "Dec 13 06:56:54 NAS kernel: sd 5:0:0:0: [sdi] tag#6 access beyond end of device".

    The cache pool is composed of two SATA SSDs:

    - sdi: 840 Pro 512 GB

    - sdh: 860 Evo 500 GB

    I quickly realized that every write access to the cache pool had been returning errors since Dec 13 06:55:14.

     

    A similar issue already happened once on 6.9.0 beta 25, but today I was clever enough (?) to download the attached diagnostics before stopping the VM manager and Docker and rebooting. After the reboot, I performed a full balance and scrub (no errors) on the pool, then restarted the VMs and containers, and everything works fine again. Oddly, a parity check was launched after the reboot; for whatever reason the shutdown was considered unclean.
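    For reference, the post-reboot maintenance boils down to two commands (mount point as on my system; the guard below only prints the commands unless DO_IT=1 is set, so the sketch is safe to paste):

    ```shell
    # Print-only unless DO_IT=1: a safety guard for this sketch.
    run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

    # Full balance rewrites every block group across both pool members.
    run btrfs balance start /mnt/cache

    # Scrub re-reads all data and metadata, verifying checksums on both copies;
    # -B stays in the foreground so the exit status reflects the result.
    run btrfs scrub start -B /mnt/cache
    run btrfs scrub status /mnt/cache
    ```
    
    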

     

    It may be a hardware issue, but I've also always wondered whether it was a good idea on my part to build a RAID 1 pool from two drives of different capacities, which, by the way, has a reported size of 506GB (!) in the "Main" tab.
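    Doing the math, 506 looks suspiciously like half the summed raw capacities, whereas a two-device RAID 1 can only mirror as much as the smaller drive holds. That's just a guess on my part about what the GUI computes, sketched below:

    ```shell
    # Hypothesis: the GUI reports half the summed raw capacity, while a
    # two-device raid1 can only mirror up to the smaller drive's size.
    big=512 small=500                    # GB, marketing sizes of the two SSDs
    reported=$(( (big + small) / 2 ))    # what the "Main" tab seems to show
    usable=$small                        # what raid1 can actually mirror
    echo "reported=${reported}GB usable=${usable}GB"
    ```
    
    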

     

    Thanks in advance for having a look at the diags and hopefully giving me some ideas on how to get rid of this recurring and very worrying instability.

    nas-diagnostics-20201213-0948.zip



    User Feedback

    Recommended Comments

    This is a general support issue; one of the cache devices dropped offline:

     

    Dec 13 06:56:15 NAS kernel: ata4: softreset failed (1st FIS failed)
    Dec 13 06:56:15 NAS kernel: ata4: limiting SATA link speed to 3.0 Gbps
    Dec 13 06:56:15 NAS kernel: ata4: hard resetting link
    Dec 13 06:56:20 NAS kernel: ata4: softreset failed (1st FIS failed)
    Dec 13 06:56:20 NAS kernel: ata4: reset failed, giving up
    Dec 13 06:56:20 NAS kernel: ata4.00: disabled

    This was caused by a problem with the AMD onboard SATA controller:

    Dec 13 06:55:14 NAS kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0015 address=0xdef9e000 flags=0x0000]

     

    This issue is quite common with Ryzen boards. Look for a BIOS update, or disable IOMMU if you don't need it; a newer/different kernel might also help.
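    If you do need IOMMU for passthrough, a middle-ground option sometimes suggested is keeping it enabled but booting the kernel in pass-through mode, e.g. appending `iommu=pt` in syslinux.cfg (the entry below is just an example; adjust to your own boot label):

    ```
    label Unraid OS
      menu default
      kernel /bzimage
      append iommu=pt initrd=/bzroot
    ```

    Whether that avoids the AMD-Vi page faults depends on the board, so treat it as something to test, not a guaranteed fix.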


    Thanks for the reply.

     

    I'm on the latest BIOS version, which unfortunately is not very recent for this motherboard (AGESA 1.0.0.6). BIOS releases from ASRock Rack are rare...

    As for IOMMU, it is enabled. I understand it is necessary, because I have a GPU used in containers via the new Nvidia driver plugin, and I also pass it through to a VM sometimes (with containers disabled then, of course).

    For a newer kernel, I'll have to wait for RC2, as I have telegraf running and using smartctl, and I really don't like the idea of having my 6 HDDs spun up 24/7.

     

    If I understand correctly, the problem arises because the SSDs are connected through the onboard AMD SATA controller. I may have an option then: I have an LSI HBA for the 6 HDDs, and two spare trays in the HDD cages. So I could put the two SSDs in the spare trays and have them connected through the HBA.

    Is that a viable approach, and if so, how should I proceed without losing any data on the cache?

    Thanks in advance.


    That should fix the problem for now. You'll lose trim support, but that's a small price to pay for stability. You can just connect the SSDs to the LSI; Unraid will pick them up without you needing to do anything else.


    Thanks, but why would I lose trim support? You mean trim will not work if I connect the SSDs to the LSI HBA (IT mode) instead of the built-in SATA controller?

    I was aware you can't trim SSDs when they belong to the array.

    But my understanding was that LSI HBAs like my 9207-8i (SAS2308, FW 20.00.07.00) do support trim in IT mode, provided the attached SSD(s) have both 'Data Set Management TRIM supported (limit 8 blocks)' and 'Deterministic read ZEROs after TRIM' enabled. I just checked with hdparm -I on my SSDs; both the Samsung 840 Pro and 860 Evo have these two features enabled, lucky me for once 😉.
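    For anyone wanting to run the same check, the two capability lines can be pulled out of hdparm's output like this (the excerpt below is a captured sample standing in for a real device):

    ```shell
    # On a live system:  hdparm -I /dev/sdX | grep -icE 'TRIM supported|after TRIM'
    # Here a captured hdparm excerpt stands in for the real device.
    sample='   *    Data Set Management TRIM supported (limit 8 blocks)
       *    Deterministic read ZEROs after TRIM'
    count=$(printf '%s\n' "$sample" | grep -icE 'TRIM supported|after TRIM')
    echo "$count of 2 required TRIM features present"
    ```
    
    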

    So I think it should work fine in my case.

    I insist because, for me, possibly losing trim is not really a small price to pay, even in exchange for greater stability, especially on a cache drive that is written to heavily when transferring data from workstations to the NAS, data which is then moved to the array by the mover overnight. From my understanding, that's exactly the kind of use case where trimming an SSD makes a lot of sense.

    Edited by Gnomuz
    fw version
    17 hours ago, Gnomuz said:

    But my understanding was that LSI HBAs like my 9207-8i (SAS2308, FW 20.00.07.00) do support trim in IT mode, provided the attached SSD(s) have both 'Data Set Management TRIM supported (limit 8 blocks)' and 'Deterministic read ZEROs after TRIM' enabled. I just checked with hdparm -I on my SSDs; both the Samsung 840 Pro and 860 Evo have these two features enabled, lucky me for once

    For some time now, since firmware P17 IIRC, trim hasn't worked with LSI SAS2 models; it does work with SAS3 models (and only on SSDs with DRZAT, as you mentioned).

    1 hour ago, JorgeB said:

    For some time now, since firmware P17 IIRC, trim hasn't worked with LSI SAS2 models; it does work with SAS3 models (and only on SSDs with DRZAT, as you mentioned).

    As I had read contradictory information on this matter, I hot-plugged a spare cold-storage SanDisk SSD Plus 2TB into a spare tray. This SSD seems to have the required features, even if DRZAT is called "Deterministic read data after TRIM" here:

    root@NAS:~# hdparm -I /dev/sdj
    
    /dev/sdj:
    
    ATA device, with non-removable media
            Model Number:       SanDisk SSD PLUS 2000GB                 
            Serial Number:      2025BG457713        
            Firmware Revision:  UP4504RL
            Media Serial Num:   
            Media Manufacturer: 
            Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    Standards:
            Used: unknown (minor revision code 0x0110) 
            Supported: 10 9 8 7 6 5 
            Likely used: 10
    Configuration:
            Logical         max     current
            cylinders       16383   0
            heads           16      0
            sectors/track   63      0
            --
            LBA    user addressable sectors:   268435455
            LBA48  user addressable sectors:  3907029168
            Logical  Sector size:                   512 bytes
            Physical Sector size:                   512 bytes
            Logical Sector-0 offset:                  0 bytes
            device size with M = 1024*1024:     1907729 MBytes
            device size with M = 1000*1000:     2000398 MBytes (2000 GB)
            cache/buffer size  = unknown
            Form Factor: 2.5 inch
            Nominal Media Rotation Rate: Solid State Device
    Capabilities:
            LBA, IORDY(can be disabled)
            Queue depth: 32
            Standby timer values: spec'd by Standard, no device specific minimum
            R/W multiple sector transfer: Max = 1   Current = 1
            Advanced power management level: disabled
            DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
                 Cycle time: min=120ns recommended=120ns
            PIO: pio0 pio1 pio2 pio3 pio4 
                 Cycle time: no flow control=120ns  IORDY flow control=120ns
    Commands/features:
            Enabled Supported:
               *    SMART feature set
                    Security Mode feature set
               *    Power Management feature set
               *    Write cache
               *    Look-ahead
               *    Host Protected Area feature set
               *    WRITE_BUFFER command
               *    READ_BUFFER command
               *    DOWNLOAD_MICROCODE
                    Advanced Power Management feature set
                    SET_MAX security extension
               *    48-bit Address feature set
               *    Device Configuration Overlay feature set
               *    Mandatory FLUSH_CACHE
               *    FLUSH_CACHE_EXT
               *    SMART error logging
               *    SMART self-test
               *    General Purpose Logging feature set
               *    64-bit World wide name
               *    WRITE_UNCORRECTABLE_EXT command
               *    {READ,WRITE}_DMA_EXT_GPL commands
               *    Segmented DOWNLOAD_MICROCODE
                    unknown 119[8]
               *    Gen1 signaling speed (1.5Gb/s)
               *    Gen2 signaling speed (3.0Gb/s)
               *    Gen3 signaling speed (6.0Gb/s)
               *    Native Command Queueing (NCQ)
               *    Phy event counters
               *    READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
                    Device-initiated interface power management
               *    Software settings preservation
                    Device Sleep (DEVSLP)
               *    SANITIZE feature set
               *    BLOCK_ERASE_EXT command
               *    SET MAX SETPASSWORD/UNLOCK DMA commands
               *    WRITE BUFFER DMA command
               *    READ BUFFER DMA command
               *    DEVICE CONFIGURATION SET/IDENTIFY DMA commands
               *    Data Set Management TRIM supported (limit 8 blocks) <------
               *    Deterministic read data after TRIM                  <------
    Security: 
            Master password revision code = 65534
                    supported
            not     enabled
            not     locked
            not     frozen
            not     expired: security count
            not     supported: enhanced erase
            2min for SECURITY ERASE UNIT.
    Logical Unit WWN Device Identifier: 5001b444a7732960
            NAA             : 5
            IEEE OUI        : 001b44
            Unique ID       : 4a7732960
    Device Sleep:
            DEVSLP Exit Timeout (DETO): 50 ms (drive)
            Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
    Checksum: correct

     I mounted the drive with UD (Unassigned Devices), and fstrim seems to work fine with my SAS2308 HBA (IT-mode firmware 20.00.07.00):

    root@NAS:~# fstrim -v /mnt/disks/WKSTN/
    /mnt/disks/WKSTN/: 1.8 TiB (2000201273344 bytes) trimmed

    So maybe something has changed with the latest versions of the drivers? That's beyond my technical skills, but I just wanted to report it for others wondering about TRIM support with this HBA/firmware combination.

    Thanks again for your help @JorgeB !

    15 minutes ago, Gnomuz said:

    So maybe something has changed with the latest versions of the drivers?

    Most likely, it's been a while since I tested.


    As the I/O error happened again with the cache SSDs connected to the onboard SATA controller, I decided to plug them into the spare trays and use the LSI 9207-8i instead.

    As @JorgeB had told me, the SSDs were recognized by Unraid on reboot, and the cache was mounted properly.

    'sdc' is the 840 Pro SSD which didn't report any error so far.

    'sdf' is the 860 Evo SSD on which the IO errors occurred.

     

    I launched fstrim to check the functionality. Here's the worrying result:

    root@NAS:~# fstrim -v /mnt/cache
    fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

    and syslog is less terse:

    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 Sense Key : 0x5 [current]
    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 ASC=0x21 ASCQ=0x0
    Dec 14 19:48:57 NAS kernel: sd 9:0:4:0: [sdf] tag#9146 CDB: opcode=0x42 42 00 00 00 00 00 00 00 18 00
    Dec 14 19:48:57 NAS kernel: blk_update_request: critical target error, dev sdf, sector 973080532 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0
    Dec 14 19:48:57 NAS kernel: BTRFS warning (device sdf1): failed to trim 1 device(s), last error -121

    That's where I need help. My understanding, though I may be wrong, is that the trim operation succeeded on sdc (840 Pro), which would indicate the problem lies specifically with sdf (860 Evo).

    As the physical connections are completely different now (SATA cables from the onboard controller before, hot-plug trays on the HBA now), the SATA cables can be ruled out.

    I'm starting to wonder whether the 860 Evo itself has an issue. Or maybe I should reinitialize it and rebuild the RAID 1, but I must admit BTRFS is really cryptic to me...

     

    I launched a short SMART self-test, which completed without error.

    I've now launched an extended self-test (polling time 85 min); we'll see...

     

    Any help and idea will be appreciated.

    Edited by Gnomuz

    I don't know who had the brilliant idea of marking this thread as "closed" after I reopened it, but I can confirm there's still an issue there. But again, as on another issue, perhaps "Closed" means "Open" or "Under review" for some around there, who knows ...

     

    Anyway, the extended smart test completed without error.

    Even with my very limited understanding of BTRFS, I found a thread elsewhere and ran "btrfs check /dev/sdf1", which reported many errors (output too long to include here), among them "errors found in fs roots".

    I think the file system has been corrupted on this SSD due to the device going offline while still connected to the onboard controller. And I'm almost sure I would lose data if the clean SSD were to have any problem...

    So now, how could I rebuild the RAID 1 pool with both SSDs, assuming sdc is clean and sdf needs to be somehow reformatted?
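    From the threads I've found, the sequence would be something like the sketch below (my paraphrase, not verified advice; the device names are the ones from my system, and if the corruption is filesystem-wide rather than confined to one member, recreating the pool from backup may be the only clean fix). The guard only prints the commands unless DO_IT=1 is set:

    ```shell
    # Print-only unless DO_IT=1: a safety guard for this unverified sketch.
    run() { if [ "${DO_IT:-0}" = 1 ]; then "$@"; else echo "would run: $*"; fi; }

    # 1. Convert to single-copy profiles so the pool can run on one device
    run btrfs balance start -f -dconvert=single -mconvert=single /mnt/cache
    # 2. Remove the suspect member (relocates any extents still on it to sdc)
    run btrfs device remove /dev/sdf1 /mnt/cache
    # 3. Wipe stale signatures and re-add it as a fresh member
    run wipefs -a /dev/sdf1
    run btrfs device add /dev/sdf1 /mnt/cache
    # 4. Convert data and metadata back to raid1 across both members
    run btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
    ```
    
    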

    Thanks in advance for your guidance and please do not bury this thread again.


    As mentioned before, this is not a bug, it's a general support issue, and I also did mention that trim would most likely not work with that LSI. If you still need support, you should start a thread in the general support forum.


    Thanks, I just thought that following up on the initial issue would help the community understand the context and provide relevant answers. My mistake.

    So I'm going to open a thread in General Support to ask the very same question, perhaps with a reference to this (non-bug) thread, if it doesn't infringe any forum rule.


    There have been arguments on both sides of the fence (btrfs is great, btrfs is not great). My experience has been the latter, and I would recommend not using it for your cache. Switch to XFS and forget the mirror. With a BTRFS mirror I did not have a reliable experience, nor have others.

     

    BTRFS 'should' be able to cope with a disconnect, even if it had to be fixed manually. However, if both devices have got constant hardware disconnects then I guess that's going to be pretty challenging on any filesystem.

     

    From my quick reading, though, this is a kernel issue, not a hardware issue, so if you're not on the latest beta, I'd probably shift to it if I were you, given RC1 has a much newer kernel.

     

    Running SSDs via an LSI card will reduce their speed as well, BTW.

     

    Another thing I'd do is raise it directly on the kernel mailing list (it's probably already there if it's still a current issue), because the fix would eventually find its way into Unraid.

     

    Hopefully you're not already on the latest beta/kernel, as that alone might possibly solve it.

     

     

    54 minutes ago, Marshalleq said:

    There have been arguments on both sides of the fence (btrfs is great, btrfs is not great). My experience has been the latter, and I would recommend not using it for your cache. Switch to XFS and forget the mirror. With a BTRFS mirror I did not have a reliable experience, nor have others.

     


    Thanks for your thoughts on my initial problem. I'm not on the latest beta, RC1, because of an issue there which prevents disks from spinning down in configurations similar to mine (telegraf/smartctl). I'm waiting for RC2, and thus kernel 5.9+, to test the SSDs on the onboard SATA controller again, as this is a known issue with X470 boards.

     

    Btw, I only had disconnection issues with one of the SSDs, but btrfs never coped with it. For sure, a RAID 1 file system which turns unwritable when one of the pool members fails while the other is up and running, and which requires a reboot to start over, is not exactly what I expected... So let's say we share the same unreliable experience with BTRFS mirroring.

     

    For the moment I've left the cache SSDs running via the LSI, converted the cache to XFS and dropped the mirroring as you suggested. I immediately noticed that, with the same overall I/O load on the cache, the constant write load from running VMs and containers was divided by roughly 3.5 when switching from BTRFS RAID 1 to XFS. BTRFS write amplification is, at least, a reality I can confirm...

    Edited by Gnomuz

     

    2 hours ago, Gnomuz said:

    a RAID 1 file system which turns unwritable when one of the pool members fails while the other is up and running, and which requires a reboot to start over, is not exactly what I expected... So let's say we share the same unreliable experience with BTRFS mirroring.

    LOL, I know right?! This is what amazes me sometimes with the defenders of BTRFS. The basics 'seem' to be overlooked. That said, I've never been sure if it's BTRFS itself or something wonky with the way Unraid formats the mirror. I know in the past there definitely WAS a problem with the way Unraid formatted it (I got shot down for suggesting that too), but in the end that was fixed by the wonderful guys at LimeTech.

     

    In that prior example, I found that I could fix the issue on the command line but not in the GUI, but I still had issues with BTRFS getting corrupted, so I ended up giving up. Full disclaimer: I'm pretty light on BTRFS experience; however, having used probably more filesystems than most, in both large production and small residential instances, I think I'm qualified enough to say it's unreliable, at least in Unraid (the only place I've used it).

    2 hours ago, Gnomuz said:

    For the moment I've left the cache SSDs running via the LSI, converted the cache to XFS and dropped the mirroring as you suggested. I immediately noticed that, with the same overall I/O load on the cache, the constant write load from running VMs and containers was divided by roughly 3.5 when switching from BTRFS RAID 1 to XFS. BTRFS write amplification is, at least, a reality I can confirm...

    That's an impressive stat. I thought I read that write amplification was solved some time ago in Unraid, but perhaps it's come back (or perhaps a reformat is needed with accurate cluster sizes or something). BTW, I'm also sure I read in the previous beta that the spin-down issue was solved for SAS drives, which logically should also cover drives behind a controller of some sort. It was in one of the release notes.

     

    You may be interested in my post this morning on my experience migrating away from unraids array here (though still using unraid product).  Not for the inexperienced but I'm definitely looking forward to ZFS being baked in.

    Edited by Marshalleq
    6 hours ago, Marshalleq said:

    That's an impressive stat. I thought I read that write amplification was solved some time ago in Unraid, but perhaps it's come back (or perhaps a reformat is needed with accurate cluster sizes or something). BTW, I'm also sure I read in the previous beta that the spin-down issue was solved for SAS drives, which logically should also cover drives behind a controller of some sort. It was in one of the release notes.

    As for the (non-)spin-down issue, I understand it's brand new in RC1, due to the kernel upgrade to 5.9, when smartctl is used by e.g. telegraf to poll disk stats, and it should be fixed in the next RC.

    So I'm keeping away for the moment. It would really be a pity to have all disks spun up in an Unraid array without getting the benefits of a file system which by nature keeps disks spun up but gives you performance and "minor" features such as snapshots or read caching in return, wouldn't it? 😉

    As for write amplification, I had carefully followed the steps to align both SSDs' partitions to 1 MiB, IIRC, so the comparison of write loads between the BTRFS RAID 1 and XFS cache is against an already "optimized" BTRFS setup. I just glanced at the overnight stats; a factor of 3 to 4 is confirmed so far.
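    The alignment itself is easy to re-check: a 1 MiB boundary means the partition's start sector (in 512-byte sectors) is a multiple of 2048. The start value below is what I'd expect from the alignment steps; on a live system you'd read it from /sys/block/sdX/sdX1/start:

    ```shell
    # 2048 sectors * 512 bytes = 1 MiB, so start % 2048 == 0 means aligned.
    start=2048                            # example value; read yours from sysfs
    aligned=$(( start % 2048 == 0 ))      # 1 if aligned, 0 otherwise
    if [ "$aligned" -eq 1 ]; then
      echo "aligned to 1 MiB"
    else
      echo "NOT aligned"
    fi
    ```
    
    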


    Well, according to RC2's release notes, I was wrong about SAS and domains. But 1.5 of your issues are sorted in RC2 by the looks of it. :)


    Well, no upgrade on Sundays, family first ...

    As for write amplification, now that I can step back, I can confirm my preliminary findings about the evolution of the write load on the cache. I compared the average write load between 12/12 (BTRFS RAID 1) and 12/19 (XFS) over a full 24h period, and the result is clear: 1.35 MB/s vs 454 kB/s, i.e. a factor of 3.
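    For the record, the factor works out as plain arithmetic on the two 24h averages:

    ```shell
    # Both averages in kB/s (1.35 MB/s = 1350 kB/s).
    btrfs_kbps=1350
    xfs_kbps=454
    # Scale by 100 to keep two decimals with integer-only shell arithmetic.
    ratio_x100=$(( btrfs_kbps * 100 / xfs_kbps ))
    echo "write load ratio: ${ratio_x100}/100, i.e. roughly 3x"
    ```
    
    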

    As I can't believe that the overhead of RAID 1 metadata management alone could explain such a difference, it confirms for me a BTRFS weakness, whatever the partition alignment may be...

    Edited by Gnomuz

    Btrfs is known to have higher write amplification, mostly because it's COW and checksummed; it can reach 25x IIRC, there's a study about that.





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.

