BTRFS recurring I/O error issues: Frustrated and Depleted


Recommended Posts

Seriously, what good is a RAID-protected filesystem if it fails to protect your data?

 

Most important info first, my questions:

Why does Unraid/btrfs not detect and notify me that my SSDs are going haywire?

Why does Unraid/btrfs allow writing to a bad pool, rendering the data inaccessible?

Why is btrfs not able to restore my data after a drive failure? (RAID 6, 1 drive detached)

Why is btrfs not able to correct errors even in RAID 6, which should survive two simultaneous drive failures? (RAID 6, all drives accessible.) In my understanding, an uncorrectable error would mean I have lost at least 3 of my 6 drives at once; it's impossible that this is a purely hardware-based failure.
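For reference, the per-device error counters that should have warned me can be checked by hand at any time; a minimal sketch, assuming the pool is mounted at /mnt/cache (adjust to your mount point):

# print per-device write/read/flush/corruption/generation error counters;
# they persist across reboots until explicitly reset
btrfs device stats /mnt/cache

# -c makes the exit code non-zero if any counter is non-zero,
# which is handy for a simple cron-based alert
btrfs device stats -c /mnt/cache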

 

Writeup:

In recent months I got very frustrated with my btrfs cache pool, and I think it is not ready for prime time at all. Unfortunately, there is no alternative, as ZFS does not allow a protected cache. Maybe I am doing something wrong, but I frequently get uncorrectable I/O errors and suddenly disappearing drives.

 

For the past 5 years I had 2x240GB in RAID 1, which I upgraded a year ago to 3x240GB in RAID 5. I also added another pool with 2x120GB in RAID 0 (metadata RAID 1).
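(Unraid builds these pools through the GUI, but for clarity the second pool's layout corresponds roughly to this mkfs call, with placeholder device names:)

# data striped across both SSDs (raid0), metadata mirrored (raid1)
mkfs.btrfs -d raid0 -m raid1 /dev/sdX1 /dev/sdY1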

 

  1. It started with corrupted files appearing in the RAID 0 pool. The bad files were logged and I was able to delete them.
    1. As this issue kept reappearing, I recreated the complete pool, but to no avail.
  2. Suddenly, one day my server froze and I had to kill it via the power switch. After rebooting I saw a lot of I/O errors on the RAID 5 pool, and one drive was not recognized by Unraid anymore. After another reboot the drive appeared, but it suddenly disappeared when I tried to add it to the pool.
    1. After clearing the disk via a USB-SATA adapter on another machine and then preclearing it via Unraid, I was able to add the drive to the pool again, and it suddenly worked flawlessly again.
    2. Tons of my photos were corrupted, and I tried every fucking btrfs tutorial and command out there on the internet to save them, but it didn't work. The data is gone despite using RAID 5.
  3. After this devastating data-loss fiasco, I removed the RAID 0 pool and created one huge RAID 6 pool, 3x240GB + 3x120GB. It worked flawlessly for 2 months.
  4. Today I suddenly noticed that one of the drives was no longer accessible and the logs showed tons of btrfs errors. I stopped everything and rebooted; the drive was not showing up anymore, and btrfs showed a ton of errors.
    1. AGAIN: every fucking tutorial to save the data failed. I always hit some unsolvable error and the command gets aborted (a sketch of the typical commands follows after this list).
    2. Reattached the drive via a USB-SATA adapter and wow: the drive is suddenly back and can be accessed.
      1. Well, I am not gonna use this drive anymore: btrfs device remove...
        1. Result: aborted, I/O error. ARE YOU FUCKING KIDDING ME? I cannot even remove a bad drive; what should I do?
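For reference, this is roughly the kind of command sequence those tutorials walk through, sketched with placeholder names (/dev/sdX1 is the flaky drive, /dev/sdY1 a replacement, /dev/sdZ1 a healthy pool member; adjust to your setup):

# mount the pool degraded and read-only via a healthy member,
# to get at the data while a device is missing or flaky
mount -o degraded,ro /dev/sdZ1 /mnt/recovery

# or, as a last resort, copy files off an unmountable filesystem directly
btrfs restore -v /dev/sdZ1 /mnt/backup/rescued/

# replace the flaky drive in place instead of remove + re-add
btrfs replace start /dev/sdX1 /dev/sdY1 /mnt/cache

# if the drive is already gone, drop the missing device
# (requires a read-write degraded mount)
btrfs device remove missing /mnt/cache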

 

BTW: my system has ECC RAM installed, so definitely no issues there.

tower-diagnostics-20211213-2249.zip

Link to comment
Dec 13 21:48:39 Tower kernel: BTRFS info (device sdg1): bdev /dev/sdh1 errs: wr 22953727, rd 19328774, flush 221212, corrupt 0, gen 0
Dec 13 21:48:39 Tower kernel: BTRFS info (device sdg1): bdev /dev/sdr1 errs: wr 0, rd 0, flush 0, corrupt 65, gen 0
Dec 13 21:48:39 Tower kernel: BTRFS info (device sdg1): bdev /dev/sdb1 errs: wr 50983508, rd 45647922, flush 817018, corrupt 0, gen 0

 

This shows you had multiple devices dropping offline in the past, probably not at the same time, but you want to stay on top of that: the longer a pool goes without a correcting scrub after a device drops and reconnects, the more likely you are to run into issues. Unraid doesn't warn the user if a pool device drops; see here for better pool monitoring. Also note that btrfs raid5/raid6 still has some known issues, so I only recommend using it with stable hardware; if your devices keep dropping you will run into problems with it, especially if they are not fixed immediately.
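A minimal routine after any unclean drop/reconnect, assuming the pool is mounted at /mnt/cache:

# start a correcting scrub; on a read-write mount it rewrites
# bad copies from the good copies on the other devices
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache

# once the scrub comes back clean, zero the per-device counters
# so any new errors stand out immediately
btrfs device stats -z /mnt/cache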

 

I run various raid5 pools without issues, but I keep on top of them and have backups. Also keep in mind that no RAID is a substitute for backups.

Link to comment
  • 3 weeks later...

I'm having what sounds like very similar issues.

 

I have RMA'd disks, replaced RAM, replaced all SATA cables, and moved to an HBA splitter, and yet I still get endless corruption issues and ATA dropout errors whenever I try to copy anything to my RAID 5 cache.

 

I'm at the point where I have decided to move to TrueNAS because of a year or more of these issues, but I'm stuck with Unraid until I get a new machine.

Link to comment
