• [6.10.0-RC4] Pool unmountable if array is re-started after device replacement


    JorgeB
    • Solved

    Found in this thread; I was able to reproduce it in safe mode to make sure it's not plugin related. How to reproduce:

     

    -start with a redundant pool

    -replace one device

    -replacement will complete successfully and pool will work normally during/after the replacement

    -stop/start array and pool will now be unmountable:

     

    Apr 10 12:55:48 Test2 emhttpd: shcmd (354): mkdir -p /mnt/cache
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache uuid: 601ca645-abb2-463f-881e-074622a7abbb
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache found: 2
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache extra: 0
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache missing: 1
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache Label: none  uuid: 601ca645-abb2-463f-881e-074622a7abbb
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache Total devices 2 FS bytes used 1.00GiB
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache devid    1 size 111.79GiB used 5.03GiB path /dev/sdc1
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache devid    3 size 111.79GiB used 5.03GiB path /dev/sde1
    Apr 10 12:55:48 Test2 emhttpd: /mnt/cache mount error: Invalid pool config

     

    For some reason it's detecting a missing device despite both being available and detected; after rebooting, the pool mounts normally. I'm marking this urgent not because the bug directly results in data loss, but because I'm afraid some users that run into this will start adding/removing devices to try to fix it and end up nuking the pool.
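     

    For anyone wanting to double-check from the console, a quick way to confirm that both pool members are actually present and carry the same filesystem UUID is sketched below; the UUID and the device paths /dev/sdc1 and /dev/sde1 are taken from the log above and will differ on other systems:

     

    # List the devices btrfs has scanned for this filesystem (UUID from the log above)
    btrfs filesystem show 601ca645-abb2-463f-881e-074622a7abbb
    # Both partitions should report TYPE="btrfs" with that same UUID
    blkid /dev/sdc1 /dev/sde1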

     

     

     

    test2-diagnostics-20220410-1255.zip

    • Upvote 4



    User Feedback

    Recommended Comments

    13 hours ago, leejbarker said:

    A restart doesn't fix it for me...

     

    Because it's a different problem, you have an actual missing device:

    Apr 18 18:05:26 NAS emhttpd: /mnt/cache uuid: c0604d95-a95d-4473-b590-5d657cf45624
    Apr 18 18:05:26 NAS emhttpd: /mnt/cache found: 2
    Apr 18 18:05:26 NAS emhttpd: /mnt/cache extra: 0
    Apr 18 18:05:26 NAS emhttpd: /mnt/cache missing: 0
    Apr 18 18:05:27 NAS emhttpd: /mnt/cache warning, device 2 is missing
    Apr 18 18:05:27 NAS emhttpd: /mnt/cache Label: none  uuid: c0604d95-a95d-4473-b590-5d657cf45624
    Apr 18 18:05:27 NAS emhttpd: /mnt/cache Total devices 3 FS bytes used 429.72GiB
    Apr 18 18:05:27 NAS emhttpd: /mnt/cache devid    1 size 953.87GiB used 698.03GiB path /dev/nvme0n1p1
    Apr 18 18:05:27 NAS emhttpd: /mnt/cache *** Some devices missing
    Apr 18 18:05:27 NAS emhttpd: /mnt/cache mount error: Invalid pool config

     

    No valid btrfs filesystem is being detected on the other NVMe device, and whatever happened occurred before the diags were posted, since no filesystem was detected on boot. If the device was wiped with wipefs, you should be able to restore it with:

     

    btrfs-select-super -s 1 /dev/nvme1n1p1

     

    If the command completes successfully, you then need to reset the pool: unassign all cache devices, start the array to make Unraid "forget" the current cache config, stop the array, reassign all cache devices (there must not be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device), and start the array. If it still doesn't mount, please start a new thread in the general support forum and post new diags.
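     

    As a side note, before running btrfs-select-super it may be worth confirming that a backup superblock actually survived on the device; a minimal check, assuming the same device path as in the command above:

     

    # Dump all superblock copies; if wipefs only cleared the primary signature,
    # the backup copies should still show valid btrfs data
    btrfs inspect-internal dump-super -a /dev/nvme1n1p1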

    • Like 1
    Link to comment
    5 hours ago, JorgeB said:

     

    Because it's a different problem, you have an actual missing device: [...]

     

    As ever... you're a star... Thanks

    Link to comment

    Can confirm exactly this behavior on 6.10-rc4, including the working workaround (reboot).

    BTW:

    @JorgeB I assume you meant 6.10.0-rc4 in the title? Could you update the title? ;) 

    • Thanks 1
    Link to comment
    1 hour ago, ChatNoir said:

    Changed Priority to Other

    Not a big deal, but we usually only change the priority to "other" when it isn't/wasn't a bug, and this one was; it's solved now, but it was a bug.

    Link to comment
    10 hours ago, bregell said:

    Just had the same problem on 6.11.0

    This particular issue has been fixed since v6.10.0-rc5; you might have run into a different one, but we'd need the diagnostics to confirm.

    Link to comment
    On 10/3/2022 at 5:02 PM, bregell said:

    Just had the same problem on 6.11.0

    I too have just replaced a bad drive on a cache array (it would randomly disappear until reboot, among other errors), and everything worked without error; I even ran another balance to make sure things were good. (The old drive, WD-WXL608031706, is still in the machine, as I was going to format and test it to figure out why it was failing. As of writing, I haven't touched it since removing it from the cache pool.)

     

    The cache pool was operating fine after the disk replacement. I got back from a trip to find the dreaded "Unmountable: Invalid pool config" had come to call.  I tried a reboot, thinking there was a hiccup somewhere, but it came back the same.

    I have attached my diag...miicarstorage-diagnostics-20230117-1604.zip

     

    UPDATE:  So I decided to mount the "bad/replaced" drive with Unassigned Devices; it mounted and said "pool"!  So I unassigned all of the active "Unmountable" cache pool drives, started the Unraid array again, and my folders showed up in Krusader!  At the moment, I am moving the files over to my main array, and the drives are all showing read activity as it happens.  So I am hoping to at least recover my files (so I don't have to download all my Steam games again...ffs).  I will update with more news once it's finished.  

    NOW.  Why is Unassigned Devices better at BTRFS (kind of joking)?  I've had quite a few issues with BTRFS cache pools during maintenance lately.  I wish Unraid had a cache pool filesystem that is more stable, or better tools for repairs!  It seems like the only "fix" for anything is to move the files to another drive with some recovery script and start over...not really ideal!  Maybe I'm doing it wrong, but I'm getting frustrated!
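     

    (For the record, what Unassigned Devices pulled off here is presumably close to a degraded, read-only mount of the old member, which still carried a stale copy of the pool; a rough sketch with a hypothetical mount point and device name, not necessarily the exact steps the plugin takes:)

     

    # Mount a pool member read-only even though the other members are absent
    mkdir -p /mnt/recovery
    mount -o ro,degraded /dev/sdX1 /mnt/recovery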

    Edited by miicar
    UPDATE:
    Link to comment
    9 hours ago, miicar said:

    I too have just replaced a bad drive on a cache array

    Your issue has nothing to do with this bug, which was solved a long time ago; next time please post in the general support forum instead. Also, you should have posted the diags after array start, as there are no mount errors shown in the diags posted.

     

    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 4 transid 98425 /dev/sdb1 scanned by udevd (1304)
    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 1 transid 98425 /dev/sdk1 scanned by udevd (1491)
    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 2 transid 98425 /dev/sdl1 scanned by udevd (1456)
    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 7 transid 98425 /dev/sdh1 scanned by udevd (1491)
    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 5 transid 86793 /dev/sdi1 scanned by udevd (1456)
    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 3 transid 98425 /dev/sdg1 scanned by udevd (1456)
    Jan 17 15:21:22 miiCARstorage kernel: BTRFS: device fsid 163b5df8-31a8-4584-ad31-e3af8e5b294e devid 6 transid 98425 /dev/sdj1 scanned by udevd (1456)

     

    Based on the above, devid #5 is out of sync with the rest of the pool (transid 86793 vs 98425). Assuming the pool is redundant, it should mount degraded if you create a new pool and assign all devices except sdi, and disconnect sdi from the server, since it's still showing as a valid pool member and was not correctly replaced/wiped.
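     

    For anyone wanting to check this on their own pool, the last committed transid (generation) of each member can be read straight from the device superblocks; a rough sketch, using device names from the log above:

     

    # A healthy pool member reports the same generation as the others;
    # a stale device (like devid 5 / sdi1 here) shows an older value
    btrfs inspect-internal dump-super /dev/sdb1 | grep ^generation
    btrfs inspect-internal dump-super /dev/sdi1 | grep ^generation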

     

     

     

     

    • Confused 1
    Link to comment

    Have the same problem in 6.11 (current stable). My cache just imploded because I took the system down and moved the cache SSDs to different SATA ports on the HBA card.

    Link to comment



