BTRFS errors on the cache pool

November 25, 2024

Hello, what should I do about so many errors in my BTRFS cache pool?

According to Scrutiny, both drives are fine. Both are NVMe drives.

Do I need to replace a drive? I have already replaced the RAM and am currently running a different set at stock speed.

November 25, 2024

Community Expert

You may need to run a BTRFS scrub to check for and correct errors. Copy important data off the pool while you still can.

As general troubleshooting, also check the file system and SMART data. Errors don't mean immediate drive failure, but they can cause one or be a sign of one.

1. Confirm the Cause of Errors

Check System Logs: Go to Tools > System Log in the Unraid interface or use SSH to view the logs (/var/log/syslog). Look for BTRFS-related errors. Common culprits are:

Metadata corruption

Misconfiguration in the cache pool

Connection issues (less likely with NVMe)
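If you prefer the terminal, something like the following can pull the relevant lines out of the log. This is only a sketch: `btrfs_log_lines` is a made-up helper name, the default path is the /var/log/syslog mentioned above, and the keyword list is an assumption you may want to widen.

```shell
# Print log lines that mention BTRFS together with an error-like keyword.
# btrfs_log_lines is a hypothetical helper, not an Unraid or btrfs command.
btrfs_log_lines() {
    local logfile="${1:-/var/log/syslog}"
    grep -i 'btrfs' "$logfile" | grep -iE 'error|warning|corrupt'
}

# Usage (on the server):
#   btrfs_log_lines | tail -n 50
```

Filtering in two steps keeps the pattern readable; informational BTRFS lines (for example a scrub starting) are dropped.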

BTRFS Scrub: Perform a BTRFS scrub to verify the integrity of the data and correct any recoverable errors.

Go to Main > Cache Pool.

Click on the Cache Pool name.

Scroll down to the Scrub section (it sits below Balance) and start the scrub.

Review the scrub results for uncorrectable errors.

2. Backup Critical Data

Before proceeding with repairs or resets, ensure that you back up all critical data from the cache pool. You can use:

Unraid mover: Move appdata and other shares off the cache to the array.

Manual backup: Copy files directly using an SSH or file management plugin like Krusader.
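For the manual route, a plain copy from the terminal is usually enough. The sketch below assumes GNU `cp`; `backup_tree` is a hypothetical helper name, and both paths in the usage line are placeholders (in particular, /mnt/disk1 is just an example target on the array). It is probably wise to stop your containers first so appdata is not changing during the copy.

```shell
# Copy a directory tree, preserving ownership, permissions and timestamps.
# backup_tree is a hypothetical helper; the paths below are placeholders.
backup_tree() {
    local src="$1" dest="$2"
    # "src/." copies the directory contents, including hidden files.
    mkdir -p "$dest" && cp -a "$src"/. "$dest"/
}

# Usage (on the server):
#   backup_tree /mnt/cache/appdata /mnt/disk1/backup/appdata
```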

3. Repair the BTRFS Cache Pool

Option A: BTRFS File System Check

Stop the array from the Unraid Main tab.

Use the command line or Unraid GUI to check and repair the filesystem:

SSH into your server or use the terminal in Unraid.

Identify your cache devices (for example /dev/sdX for SATA or /dev/nvmeXn1 for NVMe; the pool itself is mounted at /mnt/cache).

btrfs check --repair /dev/sdX1

Replace /dev/sdX1 with your actual device.

⚠️ Note: The --repair flag should only be used as a last resort.
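Before reaching for --repair, a read-only pass is the safer first step, and btrfs check is meant to run on an unmounted filesystem. The sketch below wraps both ideas. The helper names are hypothetical, and the mount-table check is a convenience only: for a multi-device pool just one member shows up as the mount source, so a "not mounted" answer for the other member is not a guarantee.

```shell
# Succeed if the given device appears as a mount source.
# The optional second argument (default /proc/mounts) exists only so the
# check can be tried against a sample file.
device_is_mounted() {
    local dev="$1" table="${2:-/proc/mounts}"
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Read-only check that refuses to run while the device is mounted.
safe_btrfs_check() {
    local dev="$1"
    if device_is_mounted "$dev"; then
        echo "$dev is mounted; stop the array first" >&2
        return 1
    fi
    btrfs check --readonly "$dev"
}

# Usage (on the server, array stopped):
#   safe_btrfs_check /dev/sdX1
```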

Option B: Recreate the Cache Pool

*if the file system and SMART checks pass...

If the repair process fails or errors persist:

Stop the Array and unassign the drives from the cache pool.

Use wipefs or a similar tool to clear existing BTRFS metadata from the drives:

wipefs -a /dev/sdX

Replace /dev/sdX with the correct device.

Reassign the drives to the cache pool in the Unraid GUI.

Format the cache pool (ensure you select BTRFS again if needed).

4. Optimize the BTRFS Cache Pool Configuration

RAID Profile: For a 2-drive NVMe pool, ensure you're using a balanced RAID profile (e.g., RAID1 for redundancy or RAID0 for performance).

Go to the terminal and verify the RAID level:

btrfs filesystem df /mnt/cache

Change the RAID level if needed:

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
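To confirm the conversion took, the `btrfs filesystem df` output can be boiled down to one profile per block group type. This is a rough sketch; `btrfs_profiles` is a hypothetical helper, and it assumes the usual output shape of lines like `Data, RAID1: total=..., used=...`.

```shell
# Reduce `btrfs filesystem df` output (read from stdin) to "<type> <profile>".
# Example input line:  Data, RAID1: total=100.00GiB, used=40.00GiB
btrfs_profiles() {
    awk -F'[,:]' 'NF >= 2 { gsub(/ /, "", $2); print $1, $2 }'
}

# Usage (on the server):
#   btrfs filesystem df /mnt/cache | btrfs_profiles
```

After a completed RAID1 conversion, the Data, System and Metadata lines should all report RAID1; GlobalReserve normally stays `single`.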

*disk setting trim?

Trim: Periodically run TRIM on NVMe drives to maintain performance.

Go to Settings > Scheduler > Dynamix SSD TRIM and enable periodic trims.

5. Monitor the System

Keep an eye on errors in the Unraid dashboard and logs after performing the above steps.

Use the Scrutiny plugin or other monitoring tools to track drive health.

6. Address Potential Root Causes

Since you've already replaced RAM and reduced speeds, consider other possible contributors:

PCIe Lane Configuration: Ensure NVMe drives are receiving adequate PCIe lanes and bandwidth.

Cooling: NVMe drives may throttle or misbehave under high temperatures. Check temperatures during heavy workloads.

Please post your diagnostics file; the syslog alone is not enough information to assist.

November 25, 2024

Author

I ran a scrub a couple of times. It fixes some errors, but they keep coming back.

My cache doesn't have much load on it, mainly appdata for my containers. I have already had to recreate the Docker image a couple of times.

I probably need to try a repair. The pool is currently in RAID1.

If that doesn't help, I will replace the drive.

Apparently the server UI is not responding now, but it still works; I can see errors being written to my other box via remote syslog.

fortress-diagnostics-20241125-0334.zip

Edited November 25, 2024 by J05u

November 25, 2024

Community Expert

This is looking like disk failure, sorry...

From the logs, the key issues revolve around I/O errors and write failures on nvme2n1p1. This indicates persistent issues, such as:

Write Errors: error writing primary super block to device 1 is critical, as the superblock contains the filesystem's metadata. Corruption here can destabilize the pool.

I/O Errors: Repeated lost page write due to IO error suggests the device is experiencing connectivity or hardware issues.

BTRFS Device Errors: Incremental errors like wr (writes), rd (reads), and flush suggest BTRFS is encountering challenges maintaining consistency on this device.
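Those per-device counters can be read with `btrfs device stats`. A small filter like the one below shows only the counters that are above zero, which makes it easier to see which device is accumulating errors. `nonzero_device_stats` is a hypothetical helper, and it assumes the usual one-counter-per-line output such as `[/dev/nvme2n1p1].write_io_errs    12`.

```shell
# Print only the non-zero counters from `btrfs device stats` output on stdin.
# Each input line looks like:  [/dev/nvme2n1p1].write_io_errs    12
nonzero_device_stats() {
    awk '$NF ~ /^[0-9]+$/ && $NF + 0 > 0'
}

# Usage (on the server):
#   btrfs device stats /mnt/cache | nonzero_device_stats
```

Run it before and after a scrub; counters that keep climbing on one device point at that device or its connection.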

Backup Critical Data Immediately

Move appdata and other important shares from the cache to the array using the mover or a manual copy method.

Ensure a complete backup of anything stored on the cache pool.

Run a Detailed Disk Health Check

Even though Scrutiny shows the drives as healthy, further checks can confirm:

SMART Test: Run long SMART self-tests on nvme2n1 (the device that holds the nvme2n1p1 partition) and on the second cache drive.

Check NVMe Health: Look for signs of wear or connectivity issues using tools like nvme-cli (if supported by Unraid):

nvme smart-log /dev/nvme2n1
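The smart-log output is long, so it can help to filter it down to the fields that most directly indicate trouble. The sketch below assumes the field names printed by nvme-cli's default human-readable output (`critical_warning`, `percentage_used`, `media_errors`, `num_err_log_entries`); these may differ between nvme-cli versions, and `nvme_health_summary` is a hypothetical helper name.

```shell
# Keep only the health-related fields from `nvme smart-log` output on stdin.
# Field names are assumed from nvme-cli's default text output.
nvme_health_summary() {
    grep -E '^(critical_warning|media_errors|num_err_log_entries|percentage_used)[[:space:]]*:'
}

# Usage (on the server):
#   nvme smart-log /dev/nvme2n1 | nvme_health_summary
```

A non-zero critical_warning or a growing media_errors count is a stronger sign of a failing drive than the BTRFS counters alone.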

Attempt an Advanced Repair

Perform a BTRFS repair operation:

btrfs check --repair /dev/nvme2n1p1

⚠️ Use the --repair flag cautiously, as it can cause data loss. Always back up first.

Recreate the Cache Pool (Preferred if Errors Persist)

If errors continue or repair doesn't help:

Stop the array.

Unassign the problematic NVMe drive from the cache pool.

Wipe the drive's BTRFS metadata using:

wipefs -a /dev/nvme2n1

Recreate the cache pool with a fresh RAID1 configuration and restore your data from the backup.

Test the Drives Individually

If the issue persists with nvme2n1p1:

Remove it from the cache pool and run intensive tests to isolate hardware or driver issues.

Operate the pool temporarily with just the second NVMe drive to rule out other system issues.

Check for PCIe Configuration and Stability

Verify NVMe drives are seated correctly and receiving adequate PCIe lanes.

Ensure system cooling is sufficient to prevent throttling or thermal issues.

Update the motherboard BIOS and Unraid to the latest versions.

If you continue encountering errors after performing the above steps, replacing the drive might be necessary.

March 2, 2025

On 11/24/2024 at 8:32 PM, bmartino1 said:

BTRFS Scrub: Perform a BTRFS scrub to verify the integrity of the data and correct any recoverable errors.

Go to Main > Cache Pool.

Click on the Cache Pool name.

Select Scrub.

I don't see any scrub option on that screen?

March 2, 2025

Community Expert

Please post the diagnostics.

March 3, 2025

Community Expert

On 3/2/2025 at 8:27 AM, taflix said:

I don't see any scrub option on that screen?

Please post a diag file...

You may not be on a BTRFS-formatted disk.

Main:

*Note the FS

Click the pool name; in my case it's called cache:

It is under the Balance section...

!!!scroll down!!!!

This is the same for ZFS. Otherwise you will need to run the terminal commands targeting the disks.

Check your filesystems!

lsblk -f
df -T
blkid
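If you only want a yes/no answer for one mount point, a tiny check like this works too. It is a sketch assuming GNU `stat`; `fs_type` and `is_btrfs` are hypothetical helper names, and /mnt/cache in the usage line is just the pool name from my example.

```shell
# Print the filesystem type that holds the given path (GNU stat).
fs_type() {
    stat -f -c %T "$1"
}

# Succeed only if the path lives on btrfs.
is_btrfs() {
    [ "$(fs_type "$1")" = "btrfs" ]
}

# Usage (on the server):
#   is_btrfs /mnt/cache && echo "btrfs pool" || echo "not btrfs, no scrub section"
```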

Run the scrub manually:

#Before starting, list all Btrfs file systems:
mount -t btrfs

#Start a Btrfs Scrub (you must set the path...)
#example: btrfs scrub start /mnt
btrfs scrub start /dev/sdX

#check status:
#example btrfs scrub status /mnt
btrfs scrub status -d /dev/sdX

Alternatively, there is a script you can run (this is what Unraid does ...)

for mount in $(mount -t btrfs | awk '{print $3}'); do
    btrfs scrub start "$mount"
done

*Run Scrub on All Btrfs Volumes (If Multiple)
