November 25, 2024

Hello, what should be done if I have so many errors in my BTRFS cache pool? According to Scrutiny, both drives are fine; both are NVMe drives. Do I need to replace the drive? I already replaced the RAM and am currently running another set at stock speed.

syslog.log
November 25, 2024 Community Expert

You may need to run a btrfs scrub to check for and correct issues. Copy important data off while you can. Other general troubleshooting: check the file system and SMART. Errors don't mean immediate drive failure, but they can cause one or be a sign of one.

1. Confirm the Cause of Errors

Check System Logs: Go to Tools > System Log in the Unraid interface, or use SSH to view the logs (/var/log/syslog). Look for BTRFS-related errors. Common culprits are:
- Metadata corruption
- Misconfiguration in the cache pool
- Connection issues (less likely with NVMe)

BTRFS Scrub: Perform a BTRFS scrub to verify the integrity of the data and correct any recoverable errors.
- Go to Main > Cache Pool.
- Click on the Cache Pool name.
- Select Scrub.
- Review the scrub results for uncorrectable errors.

2. Backup Critical Data

Before proceeding with repairs or resets, back up all critical data from the cache pool. You can use:
- Unraid mover: Move appdata and other shares off the cache to the array.
- Manual backup: Copy files directly over SSH or with a file management plugin like Krusader.

3. Repair the BTRFS Cache Pool

Option A: BTRFS File System Check
- Stop the array from the Unraid Main tab.
- Use the command line or the Unraid GUI to check and repair the filesystem: SSH into your server or use the terminal in Unraid.
- Identify your cache devices (/dev/sdX, or /dev/nvmeXn1 for NVMe drives; mounted at /mnt/cache).
- Run the check:

    btrfs check --repair /dev/sdX1

  Replace /dev/sdX1 with your actual device.

⚠️ Note: The --repair flag should only be used as a last resort.

Option B: Recreate the Cache Pool (if the file system check and SMART pass)

If the repair process fails or errors persist:
- Stop the array and unassign the drives from the cache pool.
- Use wipefs or a similar tool to clear the existing BTRFS metadata from the drives:

    wipefs -a /dev/sdX

  Replace /dev/sdX with the correct device.
- Reassign the drives to the cache pool in the Unraid GUI.
- Format the cache pool (make sure you select BTRFS again if needed).

4. Optimize the BTRFS Cache Pool Configuration

RAID Profile: For a 2-drive NVMe pool, make sure you are using a suitable RAID profile (e.g., RAID1 for redundancy or RAID0 for performance). Go to the terminal and verify the RAID level:

    btrfs filesystem df /mnt/cache

Change the RAID level if needed:

    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

*Disk setting: TRIM?

Trim: Periodically run TRIM on NVMe drives to maintain performance. Go to Settings > Scheduler > Dynamix SSD TRIM and enable periodic trims.

Monitor the System

Keep an eye on errors in the Unraid dashboard and logs after performing the above steps. Use the Scrutiny plugin or other monitoring tools to track drive health.

Address Potential Root Causes

Since you've already replaced RAM and reduced speeds, consider other possible contributors:
- PCIe Lane Configuration: Ensure NVMe drives are receiving adequate PCIe lanes and bandwidth.
- Cooling: NVMe drives may throttle or misbehave under high temperatures. Check temperatures during heavy workloads.

Please post a diagnostics file; the syslog alone is not enough information to assist.
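For the monitoring step, here is a rough sketch of one way to keep an eye on the pool's error counters from the terminal. It assumes the pool is mounted at /mnt/cache, as in the commands above, and that `btrfs device stats` prints one "[device].counter value" pair per line. The function name nonzero_counters is made up for this example; it only filters text and never touches the disks.

```shell
# Hypothetical helper: keep only the btrfs device-stats counters that are non-zero.
# Reads `btrfs device stats` output on stdin (assumed format, one pair per line:
#   [/dev/nvme0n1p1].write_io_errs    0 )
# and prints "[device].counter=value" for every counter greater than zero.
nonzero_counters() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 "=" $2 }'
}

# Usage on a live system (read-only; /mnt/cache is the pool path from this thread):
#   btrfs device stats /mnt/cache | nonzero_counters
```

No output means every counter is zero. Counters that keep growing between runs, on one device only, point at that device or its slot rather than at the pool as a whole.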
November 25, 2024 Author

I ran a scrub a couple of times. It fixes some errors, but this keeps happening. My cache doesn't have much load on it, mainly appdata for my containers. I have already had to recreate the Docker image a couple of times.

I probably need to try a repair; the pool is currently in RAID1. If that doesn't help, I will replace the drive.

Apparently the server UI is not responding now, but it still works: I see errors being written on my other box via remote syslog.

fortress-diagnostics-20241125-0334.zip

Edited November 25, 2024 by J05u
November 25, 2024 Community Expert

This is looking like disk failure, sorry...

From the logs, the key issues revolve around I/O errors and write failures on nvme2n1p1. This indicates persistent issues, such as:
- Write Errors: "error writing primary super block to device 1" is critical, as the superblock contains the filesystem's metadata. Corruption here can destabilize the pool.
- I/O Errors: Repeated "lost page write due to IO error" messages suggest the device is experiencing connectivity or hardware issues.
- BTRFS Device Errors: Increasing counters such as wr (writes), rd (reads), and flush suggest BTRFS is having trouble maintaining consistency on this device.

Backup Critical Data Immediately
- Move appdata and other important shares from the cache to the array using the mover or a manual copy method.
- Ensure a complete backup of anything stored on the cache pool.

Run a Detailed Disk Health Check

Even though Scrutiny shows the drives as healthy, further checks can confirm:
- SMART Test: Run long SMART tests on nvme2n1p1 and the second cache drive.
- Check NVMe Health: Look for signs of wear or connectivity issues using tools like nvme-cli (if supported by Unraid):

    nvme smart-log /dev/nvme2n1

Attempt an Advanced Repair

Perform a BTRFS repair operation:

    btrfs check --repair /dev/nvme2n1p1

⚠️ Use the --repair flag cautiously, as it can cause data loss. Always back up first.

Recreate the Cache Pool (Preferred if Errors Persist)

If errors continue or the repair doesn't help:
- Stop the array.
- Unassign the problematic NVMe drive from the cache pool.
- Wipe the drive's BTRFS metadata using:

    wipefs -a /dev/nvme2n1

- Recreate the cache pool with a fresh RAID1 configuration and restore your data from the backup.

Test the Drives Individually

If the issue persists with nvme2n1p1:
- Remove it from the cache pool and run intensive tests to isolate hardware or driver issues.
- Operate the pool temporarily with just the second NVMe drive to rule out other system issues.
Check for PCIe Configuration and Stability
- Verify NVMe drives are seated correctly and receiving adequate PCIe lanes.
- Ensure system cooling is sufficient to prevent throttling or thermal issues.
- Update the motherboard BIOS and Unraid to the latest versions.

If you continue encountering errors after performing the above steps, replacing the drive might be necessary.
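To see how the wr, rd, and flush counters mentioned above are developing, one option is to pull the most recent counter line per device out of the syslog. This is only a sketch: it assumes the kernel logs these counters in the form "bdev /dev/... errs: wr N, rd N, flush N, corrupt N, gen N", and the function name latest_bdev_errs is invented for this example.

```shell
# Hypothetical helper: print the most recent btrfs error-counter line per device.
# Reads syslog text on stdin and looks for the (assumed) kernel message fragment
#   bdev /dev/nvme2n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
# Devices are printed in the order they first appear in the log.
latest_bdev_errs() {
    awk '
        match($0, /bdev [^ ]+ errs: .*/) {
            # Drop the leading "bdev " so the line starts with the device path.
            line = substr($0, RSTART + 5, RLENGTH - 5)
            dev = line
            sub(/ .*/, "", dev)
            if (!(dev in last)) order[++n] = dev
            last[dev] = line
        }
        END { for (i = 1; i <= n; i++) print last[order[i]] }
    '
}

# Usage on a live system (read-only):
#   latest_bdev_errs < /var/log/syslog
```

If only one device shows up, or only one device has non-zero numbers, that supports testing or replacing that drive first.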
March 2, 2025

On 11/24/2024 at 8:32 PM, bmartino1 said:
> BTRFS Scrub: Perform a BTRFS scrub to verify the integrity of the data and correct any recoverable errors. Go to Main > Cache Pool. Click on the Cache Pool name. Select Scrub.

I don't see any scrub option on that screen?
March 3, 2025 Community Expert

On 3/2/2025 at 8:27 AM, taflix said:
> I don't see any scrub option on that screen?

Please post a diag file... you may not be on a btrfs-formatted disk.

On the Main tab, note the FS column, then click the pool name (in my case it is called cache). The Scrub option is under the Balance section, so scroll down! This is the same for ZFS.

Otherwise you will need to run the terminal commands targeting the disks. Check your filesystems:

    lsblk -f
    df -T
    blkid

Run the scrub manually:

    # Before starting, list all Btrfs file systems:
    mount -t btrfs
    # Start a Btrfs scrub (you must set the path...)
    # example: btrfs scrub start /mnt
    btrfs scrub start /dev/sdX
    # Check status:
    # example: btrfs scrub status /mnt
    btrfs scrub status -d /dev/sdX

Alternatively, there is a script you can run (this is what Unraid does ...):

    # Run scrub on all Btrfs volumes (if multiple)
    for mount in $(mount -t btrfs | awk '{print $3}'); do
        btrfs scrub start "$mount"
    done
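One caveat with looping over every btrfs mount: the same filesystem can be mounted more than once (subvolumes, bind mounts), so a plain loop may try to start a scrub on it several times. Below is a minimal sketch of a variant that picks one mount point per filesystem. It assumes /proc/mounts-style input ("device mountpoint fstype options ..."), and the function name btrfs_mountpoints is made up for this example.

```shell
# Hypothetical helper: print one mount point per btrfs filesystem.
# Reads /proc/mounts-format lines on stdin: "<device> <mountpoint> <fstype> <options> ...".
# Only the first mount point seen for each device is kept, so a filesystem that is
# mounted several times is listed once. Mount points containing spaces show up
# escaped (as \040) in /proc/mounts and would need decoding before use.
btrfs_mountpoints() {
    awk '$3 == "btrfs" && !seen[$1]++ { print $2 }'
}

# Usage on a live system (starts a scrub on each btrfs filesystem once):
#   btrfs_mountpoints < /proc/mounts | while IFS= read -r mnt; do
#       btrfs scrub start "$mnt"
#   done
```

Running `btrfs_mountpoints < /proc/mounts` on its own is a quick way to confirm whether the pool is btrfs at all: if the pool's mount point is not in the list, there will be no scrub option for it.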