MAJOR ISSUE: CACHE DRIVE'S FILESYSTEM GONE ?? (SOLVED)


    Helmonder

After a reboot this morning my cache drive seems to be unmountable... No idea what is going on...

     

    Syslog is attached

     

    Error messages in the log are as below:

     

    May  8 09:51:08 Tower kernel: ACPI: Early table checksum verification disabled
    May  8 09:51:08 Tower kernel: spurious 8259A interrupt: IRQ7.
    May  8 09:51:08 Tower kernel: floppy0: no floppy controllers found
    May  8 09:51:08 Tower kernel: random: 7 urandom warning(s) missed due to ratelimiting
    May  8 09:51:09 Tower rpc.statd[1802]: Failed to read /var/lib/nfs/state: Success
    May  8 09:51:09 Tower ntpd[1832]: bind(19) AF_INET6 fe80::1c3e:aeff:fe3a:defa%13#123 flags 0x11 failed: Cannot assign requested address
    May  8 09:51:09 Tower ntpd[1832]: failed to init interface for address fe80::1c3e:aeff:fe3a:defa%13
    May  8 09:51:28 Tower avahi-daemon[11706]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
    May  8 09:51:40 Tower kernel: WARNING: CPU: 2 PID: 12688 at fs/btrfs/extent-tree.c:6795 __btrfs_free_extent+0x1fd/0x8e4
    May  8 09:51:40 Tower kernel: CPU: 2 PID: 12688 Comm: mount Not tainted 4.19.37-Unraid #1
    May  8 09:51:40 Tower kernel: Call Trace:
    May  8 09:51:40 Tower kernel: BTRFS error (device nvme0n1p1): unable to find ref byte nr 1037649829888 parent 0 root 5  owner 77097 offset 230969344
    May  8 09:51:40 Tower kernel: BTRFS: Transaction aborted (error -2)
    May  8 09:51:40 Tower kernel: WARNING: CPU: 2 PID: 12688 at fs/btrfs/extent-tree.c:6801 __btrfs_free_extent+0x250/0x8e4
    May  8 09:51:40 Tower kernel: CPU: 2 PID: 12688 Comm: mount Tainted: G        W         4.19.37-Unraid #1
    May  8 09:51:40 Tower kernel: Call Trace:
    May  8 09:51:40 Tower kernel: BTRFS: error (device nvme0n1p1) in __btrfs_free_extent:6801: errno=-2 No such entry
    May  8 09:51:40 Tower kernel: BTRFS: error (device nvme0n1p1) in btrfs_run_delayed_refs:2935: errno=-2 No such entry
    May  8 09:51:40 Tower kernel: BTRFS: error (device nvme0n1p1) in btrfs_replay_log:2277: errno=-2 No such entry (Failed to recover log tree)
    May  8 09:51:40 Tower kernel: BTRFS error (device nvme0n1p1): pending csums is 134717440
    May  8 09:51:40 Tower root: mount: /mnt/cache: mount(2) system call failed: No such file or directory.
    May  8 09:51:40 Tower emhttpd: /mnt/cache mount error: No file system
    May  8 09:51:40 Tower kernel: BTRFS error (device nvme0n1p1): open_ctree failed

The cache drive is still listed as a cache drive, just with an unmountable file system. The SMART attributes do not show anything I recognise as an issue:

     

- Critical warning: 0x00
- Temperature: 36 Celsius
- Available spare: 100%
- Available spare threshold: 5%
- Percentage used: 4%
- Data units read: 155,230,378 [79.4 TB]
- Data units written: 90,224,490 [46.1 TB]
- Host read commands: 464,542,688
- Host write commands: 539,484,666
- Controller busy time: 2,395
- Power cycles: 21
- Power on hours: 2,684
- Unsafe shutdowns: 13
- Media and data integrity errors: 0
- Error information log entries: 10,922
- Warning comp. temperature time: 0
- Critical comp. temperature time: 0

     

Balance and scrub cannot be run "because array is not started" (the array is of course started and working).

     

I have started the array in maintenance mode so I can run the btrfs filesystem check in readonly mode; the results are as follows:

     

    [1/7] checking root items
    [2/7] checking extents
    ref mismatch on [1037649817600 8192] extent item 255, found 1
    data backref 1037649829888 root 5 owner 77097 offset 230969344 num_refs 0 not found in extent tree
    incorrect local backref count on 1037649829888 root 5 owner 77097 offset 230969344 found 1 wanted 0 back 0xcd9f170
    incorrect local backref count on 1037649829888 root 5 owner 77097 offset 17208183807669456896 found 0 wanted 4287137790 back 0x17974a30
    backref disk bytenr does not match extent record, bytenr=1037649829888, ref bytenr=0
    backpointer mismatch on [1037649829888 4096]
    ERROR: errors found in extent allocation tree or chunk allocation
    [3/7] checking free space cache
    [4/7] checking fs roots
    [5/7] checking only csums items (without verifying data)
    [6/7] checking root refs
    [7/7] checking quota groups skipped (not enabled on this FS)
    Opening filesystem to check...
    Checking filesystem on /dev/nvme0n1p1
    UUID: 344c37ac-26f1-4307-8451-1116b06922be
    found 238952861696 bytes used, error(s) found
    total csum bytes: 172892316
    total tree bytes: 1707900928
    total fs tree bytes: 1359200256
    total extent tree bytes: 124682240
    btree space waste bytes: 369061441
    file data blocks allocated: 1187284238336
     referenced 233465798656

     

Since errors were found I changed --readonly to --repair and started a new check, allowing BTRFS to fix itself. It looks, however, like an interactive prompt is now waiting for input that I of course cannot give through the webpage:

     

    enabling repair mode
    Opening filesystem to check...
    Checking filesystem on /dev/nvme0n1p1
    UUID: 344c37ac-26f1-4307-8451-1116b06922be
    repair mode will force to clear out log tree, are you sure? [y/N]:

To make sure something else is not rotten I stopped the array, unassigned the cache drive, started the array without the cache drive, stopped the array and re-added the cache drive. The cache drive comes back, but again without a file system.

     

Since the BTRFS repair might still work but appears to be stuck at that prompt, I want to run it through the command line. Unfortunately the /dev/ name listed for the cache drive does not seem to work; if I run:

     

btrfs check --repair /dev/nvme0n1 it comes back with a remark that there is no btrfs filesystem there..

     

I checked the log to see how the check is run through the GUI; this gives a different /dev/ name: /dev/nvme0n1p1
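The easy mistake here is lowercase L versus the digit 1: the node is nvme zero, n one, p one. A quick way to sanity-check a path before pointing btrfs at it (a sketch; the device name is the one from this thread):

```shell
# "/dev/nvme0n1p1" is all digits: nvme0 n1 p1 -- no lowercase Ls anywhere.
# Confirm the node exists and is a block device before running btrfs against it.
dev=/dev/nvme0n1p1            # partition node from this thread; adjust as needed
if [ -b "$dev" ]; then
  status=found
else
  status=missing              # e.g. on a machine without this NVMe drive
fi
echo "$dev: $status"
# lsblk -f                    # also lists every block device with its filesystem type
```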

     

    I am now running the following command:

     

btrfs check --repair /dev/nvme0n1p1

     

    Unfortunately it comes back as aborted, output is as follows:

     

    root@Tower:/dev# btrfs check --repair /dev/nvme0n1p1
    enabling repair mode
    Opening filesystem to check...
    Checking filesystem on /dev/nvme0n1p1
    UUID: 344c37ac-26f1-4307-8451-1116b06922be
    repair mode will force to clear out log tree, are you sure? [y/N]: Y
    [1/7] checking root items
    Fixed 0 roots.
    [2/7] checking extents
    ref mismatch on [1037649817600 8192] extent item 255, found 1
    repair deleting extent record: key [1037649817600,168,8192]
    adding new data backref on 1037649817600 root 5 owner 77097 offset 188153856 found 1
    Repaired extent references for 1037649817600
    data backref 1037649829888 root 5 owner 77097 offset 230969344 num_refs 0 not found in extent tree
    incorrect local backref count on 1037649829888 root 5 owner 77097 offset 230969344 found 1 wanted 0 back 0xce5cd30
    incorrect local backref count on 1037649829888 root 5 owner 77097 offset 17208183807669456896 found 0 wanted 4287137790 back 0x17a32240
    backref disk bytenr does not match extent record, bytenr=1037649829888, ref bytenr=0
    backpointer mismatch on [1037649829888 4096]
    repair deleting extent record: key [1037649829888,168,4096]
    adding new data backref on 1037649829888 root 5 owner 77097 offset 230969344 found 1
    Repaired extent references for 1037649829888
    Failed to find [253425188864, 168, 16384]
    btrfs unable to find ref byte nr 253425221632 parent 0 root 2  owner 0 offset 0
    transaction.c:195: btrfs_commit_transaction: BUG_ON `ret` triggered, value -5
    btrfs[0x43e9f2]
    btrfs(btrfs_commit_transaction+0x1ae)[0x43efce]
    btrfs[0x45d282]
    btrfs(cmd_check+0xc07)[0x45fff7]
    btrfs(main+0x8e)[0x40dcbe]
    /lib64/libc.so.6(__libc_start_main+0xeb)[0x14f732db9b5b]
    btrfs(_start+0x2a)[0x40deba]
    Aborted

    I have tried the same with the array not running... same result..

     

I ran the fix a couple more times, because I think the output was slightly different every time; maybe it was working itself through something. I got through it without an abort after 4 tries. When I now boot the array in maintenance mode and do a readonly check I get the following output:

     

    [1/7] checking root items
    [2/7] checking extents
    [3/7] checking free space cache
    [4/7] checking fs roots
    [5/7] checking only csums items (without verifying data)
    [6/7] checking root refs
    [7/7] checking quota groups skipped (not enabled on this FS)
    Opening filesystem to check...
    Checking filesystem on /dev/nvme0n1p1
    UUID: 344c37ac-26f1-4307-8451-1116b06922be
    cache and super generation don't match, space cache will be invalidated
    found 238952861696 bytes used, no error found
    total csum bytes: 172892316
    total tree bytes: 1707900928
    total fs tree bytes: 1359200256
    total extent tree bytes: 124682240
    btree space waste bytes: 369061441
    file data blocks allocated: 1187284238336
     referenced 233465798656

This basically looks error-free, I think?

     

The cache drive continues to appear as having no file system though... even after stopping and restarting the array..

     

Therefore I did the following again:

     

    I stopped the array, unassigned the cache drive, started the array without cache drive, stopped the array and re-added the cache drive. Then started the array in maintenance mode. There is no message relating to an unmountable file system any more..

     

Then I stopped the array and restarted it normally (without maintenance mode).

     

    Now the array comes back up without a missing filesystem. 

     

    Cache drive appears to be back in full operation, dockers are also running again... 

     

    Issue solved.. But any idea what went wrong here ?

     

     

    tower-syslog-20190508-0756.zip





    Recommended Comments

Difficult to say what caused the corruption, but check --repair should only be used if told to do so by a btrfs maintainer, or you risk making things worse.

     

Also, even if it's now working after the fix, you're much more likely to run into corruption again. I would recommend backing up the cache, re-formatting and restoring the data, or at least making sure backups are current.


Thanks for the heads-up! I run monthly appdata backups, but I will make an extra one right now and then do a reformat.


     

     

1) I have copied the complete contents of the cache drive to a share in my array. Before I did that I turned off Docker and KVM in settings (them being active might interfere with the copy).

     

2) I ran the copy through PuTTY, but inside "screen", to make sure an interrupted SSH session would not kill the copy (I used mc).

     

3) The copy went fine (no errors), but just to make sure it really, really was OK I compared the file sizes of the copy and the original; they were the same.
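A size comparison like that can be scripted. A minimal sketch, using temporary directories as stand-ins for the cache drive and the array share (du -sb compares apparent byte counts; a checksum pass such as rsync -rcn would be an even stronger check):

```shell
# Compare total byte counts of an original tree and its copy.
# The temp dirs here stand in for /mnt/cache and the array share.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'some data' > "$src/file"
cp "$src/file" "$dst/file"            # the "copy" being verified
src_bytes=$(du -sb "$src" | cut -f1)  # -b: apparent size in bytes
dst_bytes=$(du -sb "$dst" | cut -f1)
if [ "$src_bytes" = "$dst_bytes" ]; then
  result="sizes match"
else
  result="sizes differ"
fi
echo "$result"
rm -rf "$src" "$dst"
```

Matching totals suggest a complete copy, but since file sizes can match while contents differ, a checksum comparison is the safer final check.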

     

4) Then on to reformatting the cache drive. That is not a straightforward process, it appears... there is only a format button when a drive is not formatted.. There used to be a way around this by doing:

     

    - Stop the array

- Go to Main, select the cache drive, change the file system if you need to, then press format.

- In case the filesystem was already what you wanted, you needed to change to a filesystem you do not want, format, and then do it the other way around.

     

Now, however, there is only the option for BTRFS... so this does not work any more... To get the disk to a point where I could reformat it, in the end I did:

     

    - Stop the array

- Remove the cache drive from the array (not physically, just change to "no drive" in the cache choice tab)

    - Now start the array, the drive will show up as "unassigned drive"

- I did a limited preclear (erase only); that removes the filesystem

    - Stop the array again

    - Add the cache drive back in its original spot

    - Start the array and format the drive (which is now an option)

     

     

     

5) Now copying all the data back from the array to the cache drive..

     

     

    34 minutes ago, Helmonder said:

    Now however, there only is the option for BTRFS... So this does not work any more...

    It does, but you need to change the cache slots to 1 first, with multiple cache slots btrfs is the only option.

     

Alternatively you can wipe the SSD with blkdiscard, and you will then have the option to format it after array start.
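For anyone following along, the blkdiscard route would look roughly like this. It is destructive, so the actual call is commented out here, and the whole-disk node from this thread is an assumption:

```shell
# blkdiscard issues a TRIM over the whole device, destroying ALL data on it.
# Target the whole disk (nvme0n1), not the partition (nvme0n1p1).
dev=/dev/nvme0n1        # whole-disk node from this thread; verify with lsblk first
echo "WARNING: this would erase everything on $dev"
# blkdiscard "$dev"     # uncomment only after double-checking the target device
```

After the wipe, Unraid no longer sees a filesystem on the drive, so the format button reappears once the array is started.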


It's still a hassle... but that's to be preferred over having a format option that people might make mistakes with... I don't mind jumping through a few hoops..


    Since I am redoing the complete cache drive anyhow I decided to also recreate my docker image file and redownload all my dockers.. 

     

    Actually a very easy process:

     

1) turn off Docker in settings

2) delete the docker image file

3) turn on Docker in settings (which recreates the file)

    4) all of your dockers are now gone

    5) choose "add docker" in the docker screen and look up your docker under "user templates" in the drop down, that will reinstall the docker with all your previous sessions and mappings

    6) set the dockers to auto start if you had that before (this is not automatic)

    8 minutes ago, Helmonder said:

    5) choose "add docker" in the docker screen and look up your docker under "user templates" in the drop down, that will reinstall the docker with all your previous sessions and mappings

    This is exactly what the Previous Apps feature on the Apps page does for you. It will let you reinstall any and all dockers exactly as they were using those user templates.


    And it works out great ! 

     

    All my dockers are back up and running..

     

    It cost me a cache drive crash to start it off but effectively I have now "powerwashed" my complete cache drive..




