Been seeing some erratic behaviour recently (after 2 years). Logs are full of BTRFS errors


Solved by JorgeB.

Good day, all.

 

I started seeing some erratic behaviour recently (after 2 years running just fine).

 

Out of the blue, I started getting alerts to my email (I've configured it that way) from Fix Common Problems informing me that the SSD cache couldn't be written to.

 

Quote

Event: Fix Common Problems - Unraid-Server
Subject: Errors have been found with your server (Unraid-Server).
Description: Investigate at Settings / User Utilities / Fix Common Problems
Importance: alert

* **Unable to write to cache-ssd**

[Screenshot: Fix Common Problems error notification]

 

When I click over to the Unraid Main page to confirm, though, here's what I see for my cache disks (two disks running as one cache pool in RAID1 mode):

 

[Screenshot: cache pool status on the Main page]

 

So there's plenty of space there.

 

I also started noticing that Plex wasn't working well, so I restarted its Docker container, but no dice. It keeps coming back up as unhealthy.

 

The logs are full of BTRFS errors, but the SMART info for both SSDs in my cache is just dandy, so I have no idea what the heck is going on.

 

I'm attaching the logs here in the hope that someone can point out what the deal is; I'm not a BTRFS pro by any stretch of the imagination, and lately I've just been happy to keep my server running quietly without any intervention (that's what I want my Unraid server to do).

 

Thank you all in advance.

 

unraid-server-diagnostics-20230227-1822.zip


Attaching the logs here after a successful reboot.

Also, the document you linked to has a piece of info that conflicts with my case (screenshot below).

 

[Screenshot: documentation excerpt]

 

Here's what I see (it's telling me to start it in maintenance mode, although the instructions are clear that I should NEVER run this in maintenance mode for BTRFS filesystems):

 

[Screenshot: Check Filesystem Status page telling me to start in Maintenance mode]

 

Any ideas?

 

Thanks

unraid-server-diagnostics-20230227-2138 (1).zip

16 minutes ago, JorgeB said:

Scrub is run in normal mode; check filesystem, which is what we want, is run in maintenance mode.

I'm sorry, did I misunderstand the paragraph below?

I'm just trying to make sure I'm not going to destroy the data there by double checking. What am I missing?

 

Quote

 

If the file system is XFS or ReiserFS (but NOT BTRFS), then you must start the array in Maintenance mode, by clicking the Maintenance mode check box before clicking the Start button. This starts the unRAID driver but does not mount any of the drives.

If the file system is BTRFS, then make sure the array is started, and NOT in Maintenance mode.

 

 

Thanks again.
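The distinction JorgeB is drawing (a scrub runs with the pool mounted and the array started normally; a filesystem check runs against the unmounted device in Maintenance mode) can be sketched with the underlying commands. This is a hedged sketch, not something from the thread: the mount point /mnt/cache and device /dev/sdf1 are assumptions based on output later in the thread, and DRY_RUN=1 (the default here) only prints what would be run.

```shell
#!/bin/sh
# Hedged sketch of the two operations being discussed. Assumptions (not
# confirmed by the thread's author): the pool mounts at /mnt/cache and one
# member device is /dev/sdf1. DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi; }

# Scrub: array started in NORMAL mode, filesystem mounted. Reads every block
# and verifies checksums; on a raid1 pool it can repair from the good copy.
run btrfs scrub start /mnt/cache
run btrfs scrub status /mnt/cache

# Check: array started in MAINTENANCE mode, filesystem unmounted. Walks the
# metadata trees; --readonly guarantees no changes are made.
run btrfs check --readonly /dev/sdf1
```

Setting DRY_RUN=0 would execute the commands for real, which should only be done deliberately and with the pool in the appropriate state for each operation.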

1 hour ago, JorgeB said:

As mentioned, and although it's not very clear, that passage is talking about a scrub, not a filesystem check.

 

Thank you again for your patience, and I apologize for asking again; I'm just trying to make sure my data won't be wrecked, so please bear with me.

I've clicked the link you posted, and it took me straight to that part of the post.

 

I've got BTRFS running my cache, so what exactly should I do to run said "Check Filesystem"? I went over the instructions posted, but couldn't find anything indicating that I should bring a BTRFS cache into Maintenance mode and then run any different commands.

 

Here's what I'm reading when I click the link you shared:

 

Quote

 

Checking and fixing drives in the webGui

The instructions here are designed to check and fix the integrity of the file system of a data drive, while maintaining its parity info.

Preparing to test

The first step is to identify the file system of the drive you wish to test or repair. If you don't know for sure, then go to the Main page of the webGui, and click on the name of the drive (Disk 3, Cache, etc). Look for File system type, and you will see the file system format for your drive (should be xfs, reiserfs, or btrfs).

If the file system is XFS or ReiserFS (but NOT BTRFS), then you must start the array in Maintenance mode, by clicking the Maintenance mode check box before clicking the Start button. This starts the unRAID driver but does not mount any of the drives.

If the file system is BTRFS, then make sure the array is started, and NOT in Maintenance mode.

Important note! Drives formatted with XFS or ReiserFS must be checked and repaired in Maintenance mode. Drives formatted with BTRFS cannot be checked or repaired in Maintenance mode! They must be checked and repaired with the array started in the regular mode.

From the Main screen of the webGui, click the name of the disk that you want to test or repair. For example, if the drive of concern is Disk 5, then click on Disk 5. If it's the Cache drive, then click on Cache.

You should see a page of options for that drive, beginning with various partition, file system format, and spin down settings. The section following that is the one you want, titled Check Filesystem Status. There is a box with the 2 words Not available in it. This is the command output box, where the progress and results of the command will be displayed. Below that is the Check button that starts the test or repair, followed by the options box where you can type in options for the test/repair command. The options box may already have default options in it, for a read-only check of the file system. For more help, click the Help button in the upper right.

Running the test

The default for most file system formats (XFS and ReiserFS) is a read-only check of the file system, with no changes made to the drive.

If the file system is ReiserFS, then the options box will be blank initially, which implies the --check option. Always begin a ReiserFS drive test session with the --check option!

If the file system is XFS, then the options box will contain the -n option (it means check only, no modification yet). We recommend adding the -v option (it means "verbose" for greater message display), so add a v to the -n, making it -nv.

If the file system is BTRFS, then you want the scrub command, not the Balance command. The options box should already have the recommended options in it.

Once the options are correct, click the Check button to start the operation.

The progress and results will be displayed, and you will review them, and decide what action to take next. It is necessary to refresh the screen periodically, using the Refresh button underneath it or your browser's refresh button or keystroke, until the display indicates the test or repair is complete.

If the results display a successful test, with no corruptions found, then you are done! Skip down to After the test and repair.

Running the repair

If however issues were found, the display of results will indicate the recommended action to take. Typically, that will involve repeating the command with a specific option, clearly stated, which you will type into the options box (including any hyphens, usually 2 leading hyphens).

Then (if not BTRFS) click the Check button again to run the repair.

Progress and results will again be displayed, and it's possible that you will have to run it again, with perhaps a different option. If the repairs were successful, it's recommended to run the read-only check one more time, to verify all is well now.

For ReiserFS drives

For more info on the reiserfsck tool and its options, see reiserfsck. There is also more information and some warnings about reiserfsck options in the Running reiserfsck section below.

Important! If you are instructed to rerun the repair with the --rebuild-tree option, then PLEASE read the warnings in the Running reiserfsck section below!

Important! If you are instructed to rerun the repair with the --rebuild-sb option, then you cannot continue here in the webGui! You will have to follow the instructions in the Drives formatted with ReiserFS using unRAID v5 or later section below.

For XFS drives

For more info on the xfs_repair tool and its options, see xfs_repair. See also the Drives formatted with XFS section below.

For BTRFS drives and pools

Use the scrub command. Use your best judgement, we have little experience yet. Do read the Drives formatted with BTRFS section below.

For more info on the scrub command and its options, see btrfs scrub.

Unfortunately, the scrub command can detect but not fix errors on single drives. If errors are detected, you will want to go to the Redoing a drive formatted with BTRFS section.

If it's a drive pool, the scrub command can fix certain but not all errors. So far there is not a reliable BTRFS tool that can fix ALL possible errors. If the result indicates that there are still uncorrectable errors, then you will have to copy off all data and reformat the drive or pool anew (see Redoing a drive formatted with BTRFS).

unRAID is committed to staying up-to-date with BTRFS development, so once better tools are ready, they will be available here too.

During and at the conclusion of the command, a report will be displayed, in the command output box. If errors are detected, this report may specify additional actions to take. For XFS and BTRFS, we don't yet have much expertise to advise you. Use your best judgement. Most of us will probably do whatever it suggests we do.

If your file system has only minor issues, then running the first option suggested should be all that is necessary. If it finds more significant corruption, then it may create a lost+found folder and place in it the files, folders, and parts of files it can recover. It will then be up to you to rename them and restore those files and folders to their correct locations. Many times, it will be possible to identify them by their contents and size.

Note: If the repair command performs write operations to repair the file system, parity will be maintained.

 

 

Thank you once again.

15 minutes ago, JorgeB said:

Start the array in maintenance mode, click on cache, then click on "CHECK":

[Screenshot: Check Filesystem Status section with the CHECK button]

Use the default settings, then post the output.

Here you go:

[1/7] checking root items
[2/7] checking extents
backref 1565130752 root 18446612693123346832 not referenced back 0x29cb4a0
tree backref 1565130752 root 2 not found in extent tree
incorrect global backref count on 1565130752 found 2 wanted 1
backpointer mismatch on [1565130752 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sdf1
UUID: e0fd7e5a-50c1-410a-89e6-f41a0c634d63
found 149333053440 bytes used, error(s) found
total csum bytes: 145464004
total tree bytes: 377913344
total fs tree bytes: 171819008
total extent tree bytes: 34390016
btree space waste bytes: 57082664
file data blocks allocated: 3987076739072
 referenced 145612578816
 

 

[Screenshot: Check Filesystem Status output in the webGui]

18 hours ago, JorgeB said:

File system has issues, suggest backing up anything important then try to repair, or possibly better just format the pool and restore the data.

I've followed the steps to format (and fully erase the contents of) the pool of two SSDs in BTRFS (by changing the filesystem to encrypted and then back to normal).

Then I created a new docker.img (as recommended by the backup restoration tool) and started the Docker service.

I then restored the appdata from a backup from last week, which took several hours.

 

Even though I can see over 100GB of data (as expected) in the appdata folder, including all my older Docker applications, none of them show up in my Docker tab at all.

So the data is physically there, but Docker doesn't recognize it as it should...

 

Since this is my first time restoring the data (I'm using the Backup/Restore Appdata tool under Settings), am I doing something wrong here?

 

Diagnostics are attached, in case they're needed.

 

I'll also be happy to post screenshots, if it helps.

 

Thank you.

unraid-server-diagnostics-20230301-1733.zip

41 minutes ago, Digaumspider said:

I then restored the appdata from a backup from last week, which took several hours

This restores the docker containers' variable data; it does not restore the containers themselves or the settings you used when installing them previously.

 

How to do this (with their original settings intact) is covered here in the online documentation accessible via the Manual link at the bottom of the Unraid GUI or the Documentation link at the bottom of every forum page.
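To make that distinction concrete, here is a hedged sketch (the paths are assumptions about a stock Unraid install, not something stated in this thread): the backup tool covers the appdata share, while each container's install-time settings live as XML templates on the flash drive, which is what reinstalling from "Previous Apps" draws on.

```shell
#!/bin/sh
# Sketch only; paths assume a stock Unraid install and won't exist elsewhere.
APPDATA=/mnt/user/appdata                                 # variable data (what the backup tool restores)
TEMPLATES=/boot/config/plugins/dockerMan/templates-user   # per-container XML templates (ports, paths, env)

for d in "$APPDATA" "$TEMPLATES"; do
  if [ -d "$d" ]; then
    ls "$d"
  else
    echo "not present here: $d"   # e.g. when run outside Unraid
  fi
done
```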

 

 

3 hours ago, itimpi said:

This restores the docker containers' variable data; it does not restore the containers themselves or the settings you used when installing them previously.

 

How to do this (with their original settings intact) is covered here in the online documentation accessible via the Manual link at the bottom of the Unraid GUI or the Documentation link at the bottom of every forum page.

 

 

Thanks.

 

I wasn't aware of that. Now I know.

I'll mark Jorge's response as the solution, though, as his advice allowed me to rebuild the cache and bring it back to a functional state, to the point where I can use Docker normally again.

 

Also, on the documentation: for what it's worth, I tried following the steps there, but if you scroll up and look at my screenshot, the official documentation conflicts directly with the solution to the problem, so I thought it best to double (and triple) check here first before going ahead.

 

It might be helpful to folks in the future if the documentation were updated to reflect the correct steps.

 

Again, thank you for the help.

Link to comment
1 hour ago, Digaumspider said:

Also, on the documentation: for what it's worth, I tried following the steps there, but if you scroll up and look at my screenshot, the official documentation conflicts directly with the solution to the problem, so I thought it best to double (and triple) check here first before going ahead

I am going to try and add more detail to the documentation to make it clearer how the different file system types should be handled from a check/repair perspective. I guess shortly we will have to add ZFS into the mix as well.

