BTRFS Raid1 pool issue


Gnomuz


Hi all,

To make a long story short, I'm in a situation where the cache BTRFS RAID1 pool is in an unstable state, because one of the pool members went offline due to a known issue with the AMD onboard SATA controller.

For those who need more context, please see my initial post and @JorgeB's answers in this thread:

I've already run a scrub on the pool (no errors found) and also performed a full balance.
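For reference, the equivalent command-line operations would be something along these lines (the pool being mounted at /mnt/cache; -B simply runs the scrub in the foreground):

# scrub the pool and report any checksum/repair activity
btrfs scrub start -B /mnt/cache
btrfs scrub status /mnt/cache

# full balance of the pool (can take a while)
btrfs balance start --full-balance /mnt/cache
btrfs balance status /mnt/cache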

Could anyone help me rebuild this cache pool composed of an 840 Pro 512GB (sdc) and an 860 Evo 500GB (sdf), knowing that sdc is presumably clean, whereas sdf still generates a ton of errors when I run 'btrfs check /dev/sdf1'? I'm not at all a BTRFS expert; I just want to force BTRFS to rebuild the RAID1 by mirroring sdc to sdf.

I hope my request is clear; diagnostics are attached in case they help.

 

Thanks in advance.

 

nas-diagnostics-20201215-0957.zip

Edited by Gnomuz
Link to comment
2 minutes ago, JorgeB said:

The only errors I see appear to be trim-related; do you see any errors without trying to trim the pool?

 

 

You're totally right: the error appears only when I run fstrim, and it points to sdf (the device which went offline, as you know). But checking the file system on sdf1 with btrfs check yesterday evening also reported a huge number of file system errors.

Guess what: when I ran the same command this morning, I got this result:

root@NAS:~# btrfs check /dev/sdf1
Opening filesystem to check...
Checking filesystem on /dev/sdf1
UUID: c74954f2-6961-434c-b5ba-d0155fd6602d
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 217076813824 bytes used, no error found
total csum bytes: 18733168
total tree bytes: 410370048
total fs tree bytes: 298795008
total extent tree bytes: 87080960
btree space waste bytes: 96197597
file data blocks allocated: 2801561227264
 referenced 167881072640

I don't know what "cache and super generation don't match, space cache will be invalidated" means and whether it's worrying or not, but this is completely different from what I got yesterday ...

And fstrim /mnt/cache still produces the same error; I just tried.
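To be precise, by "the fstrim" I mean something like this, run against the mounted pool:

# verbose trim of the whole cache mount point
fstrim -v /mnt/cache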

 

Which brings me back to an earlier question: the fstrim command does not report any error on sdc (the other member of the pool, the 840 Pro SSD) when trimming the pool, just on sdf (the 860 Evo). Does that mean, in your view, that trim works on sdc and fails on sdf?

If so, that would clearly point to the 860 Evo not supporting trim over the LSI HBA for whatever reason (firmware?). I could then replace it with a 512GB 860 Pro (the current equivalent of the 840 Pro) to have trim supported on both members of the pool.

Remember that I successfully trimmed a spare consumer-grade SanDisk SSD Plus 2TB over the HBA, so trim support over the HBA itself is no longer in question. But the major difference is that this test SSD was formatted XFS, not BTRFS (my mistake). So maybe the BTRFS RAID1 is also part of the trim problem, not just the SSD...
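If it helps narrow things down, I suppose I could also check what the kernel and the drives themselves report about discard support, with something like this (just an idea on my side, tell me if it's meaningless behind the HBA):

# non-zero DISC-GRAN / DISC-MAX means the kernel exposes discard for the device
lsblk --discard /dev/sdc /dev/sdf

# what the drive itself claims about TRIM support
hdparm -I /dev/sdf | grep -i trim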

 

I must admit I'm a bit lost among too many possible causes and hypotheses... What would be the possible next steps?

 

Thanks in advance.

 

Link to comment
52 minutes ago, Gnomuz said:

I don't know what "cache and super generation don't match, space cache will be invalidated" means and whether it's worrying or not

That's nothing to worry about.

 

The trim issue is a known problem with SAS2 LSI controllers and the current firmware/driver. Some users found that downgrading to firmware P16 allows trim to work; while I don't personally recommend running an old firmware, it might be an option. More info below:

 

https://forums.unraid.net/topic/74493-samsung-ssd-hba-trim/page/4/

 

Link to comment

Thanks for directing me to this relevant, if frightening, thread. Well, I really don't like what I read!

 

I also understand from this thread that the issue is specific to BTRFS, which may explain why my test of trim over the HBA on an XFS-formatted SanDisk SSD was successful (?).

 

To sum up, all the options with my hardware components look like dead ends or gambles:

- downgrading my 9207-8i firmware from P20 to P16 (a risky operation, by the way) would enable trim, but may also lead to "chaos" and data loss according to one of the posters. No thanks!

- switching the cache file system from BTRFS to XFS may enable trim, but then redundancy is not an option. I just can't imagine a single-device cache for the VMs' virtual disks and the containers' data!

- connecting the SSDs back to the AMD onboard controller would re-enable trim, but would also very likely trigger the "quite common" Ryzen issue I've already experienced 3 times, which itself may or may not be fixed by a future BIOS update and/or a newer kernel.

 

Another option would be to swap the SAS2 9207-8i for a 9300-8i, which seems to be the equivalent SAS3 model. That's another 100€ for a used one from somewhere in Asia (with all the risks that entails), or almost 250€ for a brand new one from Germany. If this can be avoided, my bank account will thank me...

 

To try and reach a conclusion, even a sad one: in your view, what would be the overall impact of not trimming the cache pool until a miracle happens, i.e. until Unraid offers an alternative file system to BTRFS for redundant cache pools?

 

Thanks again 😉

Edited by Gnomuz
Link to comment
31 minutes ago, Gnomuz said:

in your view, what would be the overall impact of not trimming the cache pool

Possible performance decrease over time.

 

BTW, I believe you are putting too much weight on the redundancy factor. Redundancy only protects against a very limited set of risks; you are much more likely to lose data to something NOT covered by a redundant drive. You really need a good backup strategy. For container data there is CA Backup; for VMs you have the option of either backing them up like you would any desktop, with client software installed in the VM, or backing up the vdisks and xml with tools inside Unraid. Any of those backup options can be targeted at a location on the parity-protected array if you wish.
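As a rough illustration of the vdisk-and-xml route, something along these lines would do, with the VM shut down first (the VM name and paths below are examples only, adjust them to your setup):

# dump the VM definition and copy its vdisk to a share on the parity-protected array
mkdir -p /mnt/user/backups/vm
virsh dumpxml Win10 > /mnt/user/backups/vm/Win10.xml
rsync -a --sparse /mnt/user/domains/Win10/vdisk1.img /mnt/user/backups/vm/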

 

Once you have a solid backup strategy in place, this kind of stress over the integrity of the cache pool is greatly reduced.

 

RAID, or Unraid of any flavor, is NEVER a replacement for backups. It's only there to keep things up and running in case of a disk failure. It can't protect against data corruption or deletion by users or bad hardware; only backups can do that.

Link to comment

Thanks for the reminder about the difference between redundancy and backups, @jonathanm. CA Backup was one of my first installed plugins, VMs are backed up daily to the array (Macrium Software for Windows, Timeshift for Ubuntu), the array has 2 parity drives, and a weekly backup of the array to external drives is scheduled. So I think belt and braces are in place, but any advice is welcome.

 

My understanding of the usefulness of the cache (when built with SSDs serving workstations over a 2.5G network) is:

- performance for frequently accessed data (vdisks, containers' data, ...),

- fast-write temporary storage before the mover does its job overnight

And that's why cache redundancy appears so important to me. I can hardly imagine telling one of my users: "Well, you know what? You stored an important document at 5pm on my super secure NAS, but a piece of hardware crashed at 10pm, so your 8 hours of work have disappeared."

 

I spent two decades in capital-markets IT departments, where a 5-minute outage during market hours was simply not to be envisaged, whatever the cost. But maybe I have to lower my expectations now that my environment is more ordinary and budget-constrained, who knows 😉

Edited by Gnomuz
Link to comment
20 minutes ago, Gnomuz said:

But maybe I have to lower my expectations now that my environment is more ordinary and budget-constrained, who knows 😉

I believe you answered your own question here.

 

1 hour ago, Gnomuz said:

Another option would be to swap the SAS2 9207-8i for a 9300-8i, which seems to be the equivalent SAS3 model. That's another 100€ for a used one from somewhere in Asia (with all the risks that entails), or almost 250€ for a brand new one from Germany.

Honestly though, if you keep tabs on your equipment, the odds of a random, out-of-the-blue single-device failure recoverable by a redundant cache pool are exceedingly low. Most failures nowadays come with advance warning.

Link to comment
18 hours ago, jonathanm said:

Honestly though, if you keep tabs on your equipment, the odds of a random, out-of-the-blue single-device failure recoverable by a redundant cache pool are exceedingly low. Most failures nowadays come with advance warning.

Back after a good night's sleep, I've decided to give up redundancy on the cache and switch to XFS as the file system, following @jonathanm's advice.

So, as I'm a noob, I have a new question: what are the steps to convert my current BTRFS RAID1 cache pool to a single-device XFS one?

 

Thanks in advance.

Link to comment
21 hours ago, jonathanm said:

Honestly though, if you keep tabs on your equipment, the odds of a random, out-of-the-blue single-device failure recoverable by a redundant cache pool are exceedingly low. Most failures nowadays come with advance warning.

I don't want to derail the OP's thread, but do you mean this generally, or specifically for BTRFS/SSDs?

 

Like the OP, I have a BTRFS RAID1 cache pool, so if the risk is low and recovery is unlikely in any case, it seems like a real cost (half the space) for no real benefit.

Link to comment
7 hours ago, Gnomuz said:

what are the steps to convert my current BTRFS RAID1 cache pool to a single-device XFS one?

In a nutshell:

- change any shares that currently have data on the cache to cache:yes,
- stop the Docker and VM services (not just the containers and VMs),
- run the mover and make sure the cache pool is empty,
- stop the array,
- change the visible cache slots to one,
- change the cache's desired format to XFS,
- start the array and check that only the cache drive shows as unformatted,
- select the box to format it and press the button,
- then set the desired shares to cache:prefer, run the mover, and re-enable the Docker and VM services.

 

This is all covered in the wiki about replacing a cache drive.

Link to comment
4 hours ago, CS01-HS said:

Like the OP, I have a BTRFS RAID1 cache pool, so if the risk is low and recovery is unlikely in any case, it seems like a real cost (half the space) for no real benefit.

There are benefits to keeping BTRFS if your system works well with it. The OP has a real ongoing issue that may be solved by changing to XFS.

Link to comment
2 hours ago, jonathanm said:

In a nutshell:

- change any shares that currently have data on the cache to cache:yes,
- stop the Docker and VM services (not just the containers and VMs),
- run the mover and make sure the cache pool is empty,
- stop the array,
- change the visible cache slots to one,
- change the cache's desired format to XFS,
- start the array and check that only the cache drive shows as unformatted,
- select the box to format it and press the button,
- then set the desired shares to cache:prefer, run the mover, and re-enable the Docker and VM services.

 

This is all covered in the wiki about replacing a cache drive.

Thanks for the guidelines. I had read the wiki but was unsure about the steps for changing the cache pool type (from two devices to one) and reformatting the remaining cache device.

I wanted to be sure this was not different from replacing a faulty cache device with a brand new one.
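Just to be safe, before stopping the array I plan to double-check that the pool is really empty with something like:

# anything still sitting on the pool once the mover has finished?
ls -A /mnt/cache
du -sh /mnt/cache/* 2>/dev/null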

Link to comment

Well, nothing is as simple as expected or documented ...

 

I followed the step-by-step instructions @jonathanm was kind enough to write for me.

All containers stopped, all VMs stopped, Docker stopped, VM Manager stopped, the three shares (appdata, domains and system) changed to cache:yes, and the mover launched. There were about 210 GB on the cache, so it took quite a long time.

 

After the mover ended, I was surprised to see that the cache still had 78.5 MB used. I browsed the cache from the Main tab: only appdata is still there, so system and domains were moved properly.

Inside appdata I have data left from 3 containers:

- binhex-krusader

- plex

- speedtest-tracker

Some of their data has been moved to the array, but a few directories are left in the cache. I relaunched the mover, but nothing new happened.

 

I then browsed to the deepest directories left on the cache and noticed that all of them contain symlinks. Here is an example for binhex-krusader:

 

root@NAS:/mnt/cache/appdata/binhex-krusader/home/.icons/BLACK-Ice-Numix-FLAT/16/actions# ls -lha
total 172K
drwxrwxr-x 1 nobody users 1.6K Jul 14 23:23 ./
drwxrwxr-x 1 nobody users  104 Jul 14 23:23 ../
lrwxrwxrwx 1 nobody users   26 Jul 14 23:23 archive-insert-directory.svg -> add-folders-to-archive.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 document-sign.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 edit-entry.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 edit-map.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 editimage.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 end_of_life.svg -> dialog-cancel.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 entry-edit.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 filename-ignore-amarok.svg -> dialog-cancel.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 fileopen.svg -> document-open.svg
lrwxrwxrwx 1 nobody users   14 Jul 14 23:23 folder_new.svg -> folder-new.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 group-edit.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users   14 Jul 14 23:23 group-new.svg -> folder-new.svg
lrwxrwxrwx 1 nobody users   13 Jul 14 23:23 gtk-info.svg -> gtk-about.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 gtk-open.svg -> document-open.svg
lrwxrwxrwx 1 nobody users   16 Jul 14 23:23 gtk-yes.svg -> dialog-apply.svg
lrwxrwxrwx 1 nobody users   20 Jul 14 23:23 kdenlive-menu.svg -> application-menu.svg
lrwxrwxrwx 1 nobody users   16 Jul 14 23:23 knotes_close.svg -> dialog-close.svg
lrwxrwxrwx 1 nobody users   19 Jul 14 23:23 ktnef_extract_to.svg -> archive-extract.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 list-resource-add.svg -> list-add-user.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 mail-thread-ignored.svg -> dialog-cancel.svg
lrwxrwxrwx 1 nobody users   20 Jul 14 23:23 menu_new.svg -> application-menu.svg
lrwxrwxrwx 1 nobody users   29 Jul 14 23:23 object-align-vertical-bottom-top-calligra.svg -> align-vertical-bottom-out.svg
lrwxrwxrwx 1 nobody users   19 Jul 14 23:23 offline-settings.svg -> network-connect.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 open-for-editing.svg -> document-edit.svg
lrwxrwxrwx 1 nobody users    8 Jul 14 23:23 password-show-off.svg -> hint.svg
lrwxrwxrwx 1 nobody users   19 Jul 14 23:23 relationship.svg -> network-connect.svg
lrwxrwxrwx 1 nobody users   26 Jul 14 23:23 rhythmbox-set-star.svg -> gnome-app-install-star.svg
lrwxrwxrwx 1 nobody users   20 Jul 14 23:23 selection-make-bitmap-copy.svg -> fileview-preview.svg
lrwxrwxrwx 1 nobody users   16 Jul 14 23:23 stock_calc-accept.svg -> dialog-apply.svg
lrwxrwxrwx 1 nobody users    8 Jul 14 23:23 stock_edit.svg -> edit.svg
lrwxrwxrwx 1 nobody users   16 Jul 14 23:23 stock_mark.svg -> dialog-apply.svg
lrwxrwxrwx 1 nobody users   14 Jul 14 23:23 stock_new-dir.svg -> folder-new.svg
lrwxrwxrwx 1 nobody users   13 Jul 14 23:23 stock_view-details.svg -> gtk-about.svg
lrwxrwxrwx 1 nobody users   16 Jul 14 23:23 stock_yes.svg -> dialog-apply.svg
lrwxrwxrwx 1 nobody users   24 Jul 14 23:23 tag-folder.svg -> document-open-folder.svg
lrwxrwxrwx 1 nobody users    9 Jul 14 23:23 tag-places.svg -> globe.svg
lrwxrwxrwx 1 nobody users   18 Jul 14 23:23 umbr-coll-message-synchronous.svg -> mail-forwarded.svg
lrwxrwxrwx 1 nobody users   18 Jul 14 23:23 umbr-message-synchronous.svg -> mail-forwarded.svg
lrwxrwxrwx 1 nobody users   17 Jul 14 23:23 view-resource-calendar.svg -> view-calendar.svg
lrwxrwxrwx 1 nobody users   16 Jul 14 23:23 x-shape-image.svg -> view-preview.svg
lrwxrwxrwx 1 nobody users   13 Jul 14 23:23 zoom-fit-drawing.svg -> zoom-draw.svg
lrwxrwxrwx 1 nobody users   13 Jul 14 23:23 zoom-fit-page.svg -> page-zoom.svg
lrwxrwxrwx 1 nobody users   22 Jul 14 23:23 zoom-select-fit.svg -> zoom-fit-selection.svg

Another example from speedtest-tracker:

root@NAS:/mnt/cache/appdata/speedtest-tracker/www/node_modules/@babel/plugin-proposal-class-properties/node_modules/.bin# ls -lha
total 4.0K
drwxrwxr-x 1 911 911 12 Dec 18 03:48 ./
drwxrwxr-x 1 911 911  8 Dec 14 03:48 ../
lrwxrwxrwx 1 911 911 36 Dec 18 03:48 parser -> ../\@babel/parser/bin/babel-parser.js

And of course a lot of similar examples for the Plex Metadata directory.

 

I conclude there's an issue moving directories containing symlinks from the cache to the array.

 

And I'm now in the middle of nowhere, with 99% of the cache data transferred to the array and 78.5 MB remaining on the cache, not "movable" for a reason beyond my technical knowledge. So I can't reformat the cache to XFS; I'm basically stuck in the middle of the operation with no next step in sight...

 

Any help will be appreciated.

Edited by Gnomuz
Link to comment
6 minutes ago, itimpi said:

Have you checked whether there are files on the cache, and if so, which ones? The used space will never go to zero, as there is a basic overhead from the file system the pool has been formatted with.

Of course; appdata is not empty, and as shown in the examples, the deepest directories have something in common: they all contain what seem to be hard links. I've read in other threads that there's a problem moving these hard links, and I think that's my problem.

I also checked that the hard links remaining in the /mnt/cache/appdata directories do not exist in the corresponding /mnt/disk2/appdata directories.
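The check itself was nothing fancy, roughly along these lines:

# for every link left on the cache, test whether the same relative path exists on disk2
cd /mnt/cache/appdata
find . -type l | while IFS= read -r p; do
    [ -e "/mnt/disk2/appdata/$p" ] || [ -L "/mnt/disk2/appdata/$p" ] || echo "missing on disk2: $p"
done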

Needless to say, I don't have the faintest idea why these links are there; I just installed the containers from CA.

Edited by Gnomuz
Link to comment

I quickly analyzed the output of "ls -lhaR /mnt/cache/appdata": apart from directories, only links (hard or not, that's beyond my technical skills) remain on the cache, and there are 50150 of them (searching for lrwxrwxrwx), mainly under /mnt/cache/appdata/Plex/..., as I expected.
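In case it matters, the counting was done roughly like this:

# count link entries in the recursive listing (what I called "searching lrwxrwxrwx")
ls -lhaR /mnt/cache/appdata | grep -c '^lrwxrwxrwx'

# same idea, arguably cleaner
find /mnt/cache/appdata -type l | wc -l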

 

So the cache is definitely far from empty, and I'm quite sure that formatting it now would prevent the 3 containers concerned from ever restarting after the move back from the array to the cache.

 

For reference, I attach the output of ls -lhaR /mnt/cache/appdata after the partial move.

 

Thanks anyway for trying to help me, @itimpi! Any guidance from @JorgeB or @jonathanm, who have followed my nightmare from the beginning, would be more than welcome, if of course they are available.

appdata-content.txt

Edited by Gnomuz
SOS
Link to comment

Hard links cannot normally be moved between mount points, which implies that they cannot be moved between drives when working at the physical drive level.

 

If appdata is mapped to /mnt/user/appdata rather than /mnt/cache/appdata, it just might work, because at that level the drives should be transparent. I have not tried it, so that is just a guess.

Link to comment
18 minutes ago, itimpi said:

I do not believe that hard links can be moved between drives.

That's also what I sadly understand from other posts. But then how is it possible that containers creating these objects can be found in CA, when they are basically not compatible with one of the building blocks of Unraid, i.e. cache pool management?

 

I think I could simply uninstall the three apps, delete their data directories, and reinstall them from scratch after the reformat and the move back.

No problem for krusader, but:

- I would lose historical data from speedtest-tracker (not such a big deal)

- above all, I would have to rebuild my Plex library from scratch; it is heavily customized, with custom posters, obscure music albums manually indexed, and probably many other things I'm forgetting.

 

That would be hours if not days of patient work lost; to put it simply, an unrecoverable data loss. That is exactly what I expected not to happen with a dedicated NAS OS like Unraid. If there's no viable solution out of this dead end, I must admit "bitterness" will be an understatement for my state of mind...

Edited by Gnomuz
Link to comment
9 minutes ago, jonathanm said:

https://forums.lime-technology.com/topic/61211-plugin-ca-appdata-backup-restore-v2/

It's not as elegant as the mover solution, but it should work to back up /mnt/user/appdata

Thanks for this glimmer of hope, @jonathanm, and don't worry, elegance is not my first priority right now 😉

Just a question (one more!): now that everything is stopped (containers, VMs and their managers), but appdata is "spread" between the cache (the hard links) and the array (all the other data), can I:

- launch a backup in this dubious state with CA Backup & Restore,

- format the cache to XFS,

- move system and domains (after setting them to cache:prefer) to the "new" cache with the mover,

- delete any trace of appdata on the array,

- set appdata to cache:prefer,

- restore appdata with CA Backup and Restore,

or should I:

- move appdata back to the cache with the mover (cache:prefer),

- back it up with CA Backup,

- format the cache to XFS,

- move system and domains (after setting them to cache:prefer) to the "new" cache with the mover,

- restore appdata with CA Backup and Restore?

Or any other approach you would consider safer.

You understand that I have doubts about the restorability of a backup created while appdata is in its current unsound state.
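At the very least, I suppose I can sanity-check whatever archive CA Backup produces before relying on it, with something like this (the path and file name below are purely illustrative; I'll adapt them to wherever the plugin actually writes):

# peek inside the backup archive to confirm the appdata tree is really in there
tar -tvf /path/to/CA_backup.tar | head -n 50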

 

Link to comment
4 minutes ago, jonathanm said:

Pretty sure CA Backup will handle this correctly; perhaps @Squid will pop in and confirm.

The only reason I have any doubt is that I've never seen the mover method fail, so like you I'm wondering why it did. My supposition is that the containers in question were mapped to /mnt/cache/appdata instead of /mnt/user/appdata, as they should have been.

I think at least some of them were mapped to /mnt/cache/appdata instead of /mnt/user (I can't check right now, Docker is stopped), because I had read in this forum, under the signature of well-known veterans, that it was a good idea for performance reasons. Again, applying procedures without understanding the potential consequences was a bad move on my part...

I launched the backup in the current state. In the worst case, I have today's 3:30am daily backup available, which should do the job; I didn't do much this morning, as you can imagine!

Link to comment
11 minutes ago, Gnomuz said:

I had read in this forum, under the signature of well-known veterans, that it was a good idea for performance reasons.

Theoretically, forcing things to be mapped to a specific drive instead of a user share shouldn't cause major issues, but like you said, there can be hidden consequences.

 

Since you have CA Backup already doing its scheduled thing, personally I'd follow the procedures already set forth in CA Backup's disaster recovery and use your daily backup to restore appdata after formatting the drive. Revert the appdata settings back to cache:only if you have /mnt/cache mapped, as setting it to the default cache:prefer could end up with files on an array disk under some circumstances. Before you start the Docker and VM services, do an audit of all the shares involved and make sure the data is all where it needs to be; domains and system should move back properly with the mover after setting them to cache:prefer.
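A quick audit could be as simple as, for example:

# see where appdata / domains / system actually live before re-enabling the services
du -sh /mnt/cache/appdata /mnt/disk*/appdata 2>/dev/null
du -sh /mnt/cache/domains /mnt/disk*/domains 2>/dev/null
du -sh /mnt/cache/system  /mnt/disk*/system  2>/dev/null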

Link to comment
