[Urgent] Btrfs errors due to one disk being offline for a bit

July 29, 2016

Hi guys,

I'm running the new rc3 and just realized that the dashboard showed the log being 54% full (it's usually at about 2%).

When I clicked on the log, I saw that it was constantly being flooded with btrfs errors (here's a snapshot: http://pastebin.com/xgTdwLK6 )

No errors are reported in the GUI and everything seems fine. The docker containers and the VM are running fine.

I immediately stopped the dockers and the VM, then checked the drives in the cache pool. One of them (sdd) shows that it can't retrieve attributes or capabilities.

The diagnostics file is too large to attach. Here it is on Google Drive: https://drive.google.com/open?id=0B6Pdr6ZXGPHDWmYxbWNva3lOQ2M

Any help would be appreciated.

Thanks

EDIT: I stopped the array and it shows no device for that cache drive. It seems the drive has dropped. I did put in a new drive cage recently, and I suspect the issue is either with the cage or the SSD was not properly making contact with the ports in the back.

July 29, 2016

Author

Hi again,

Just got home, checked the connections, restarted the server and now the drive is accessible again.

However, I am getting some btrfs errors during array start, possibly due to an information mismatch (writes happened while one drive was offline).

What is the best way to fix that? Should I run a balance again, or a scrub? Will it fix itself automatically? (I did try to restart the array one more time and I'm still getting the errors.)

Here are the lines in the syslog:

Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): disk space caching is enabled
Jul 29 19:22:24 Tower kernel: BTRFS: has skinny extents
Jul 29 19:22:24 Tower kernel: verify_parent_transid: 1 callbacks suppressed
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397037187072 wanted 1505374 found 1473345
Jul 29 19:22:24 Tower kernel: repair_io_failure: 34 callbacks suppressed
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037187072 (dev /dev/sdd1 sector 82385888)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037191168 (dev /dev/sdd1 sector 82385896)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037195264 (dev /dev/sdd1 sector 82385904)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037199360 (dev /dev/sdd1 sector 82385912)
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397037203456 wanted 1505374 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037203456 (dev /dev/sdd1 sector 82385920)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037207552 (dev /dev/sdd1 sector 82385928)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037211648 (dev /dev/sdd1 sector 82385936)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037215744 (dev /dev/sdd1 sector 82385944)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): bdev /dev/sdd1 errs: wr 2117450, rd 100550, flush 220783, corrupt 0, gen 0
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397050064896 wanted 1505371 found 1473350
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397050064896 (dev /dev/sdd1 sector 82411040)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397050068992 (dev /dev/sdd1 sector 82411048)
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1396998029312 wanted 1505361 found 1473319
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036580864 wanted 1505372 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397049393152 wanted 1505371 found 1473350
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036630016 wanted 1505372 found 1473346
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1396962344960 wanted 1505332 found 1473274
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036646400 wanted 1505372 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036843008 wanted 1505372 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS: detected SSD devices, enabling SSD mode

Thanks
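
For anyone reading along, here is a minimal Python sketch for tallying syslog lines like the ones above: how many "parent transid verify failed" errors there are, how far behind the stale copy is (wanted minus found), how many reads were corrected per device, and the per-device error counters. The function name `summarize` and the regular expressions are my own assumptions based only on the line formats pasted in this post, not an official btrfs log format.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this thread (an assumption,
# not a documented format; other kernel versions may word these differently).
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)")
CORRECTED_RE = re.compile(
    r"read error corrected: ino \d+ off \d+ \(dev (\S+) sector \d+\)")
STATS_RE = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)")


def summarize(lines):
    """Tally btrfs messages from an iterable of syslog lines."""
    summary = {
        "transid_failures": 0,
        "max_generation_gap": 0,      # largest (wanted - found) seen
        "corrected_reads": Counter(),  # device -> corrected read count
        "device_stats": {},            # device -> last reported counters
    }
    for line in lines:
        if m := TRANSID_RE.search(line):
            wanted, found = int(m.group(2)), int(m.group(3))
            summary["transid_failures"] += 1
            summary["max_generation_gap"] = max(
                summary["max_generation_gap"], wanted - found)
        elif m := CORRECTED_RE.search(line):
            summary["corrected_reads"][m.group(1)] += 1
        elif m := STATS_RE.search(line):
            keys = ("wr", "rd", "flush", "corrupt", "gen")
            values = (int(v) for v in m.groups()[1:])
            summary["device_stats"][m.group(1)] = dict(zip(keys, values))
    return summary


if __name__ == "__main__":
    # Path is an assumption; point it at wherever your syslog lives.
    with open("/var/log/syslog", errors="replace") as fh:
        print(summarize(fh))
```

If every corrected read and every non-zero counter points at the same device (here /dev/sdd1), that fits the theory that only the drive that dropped has stale data.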

July 30, 2016

Author

So I tried to start docker, but the docker.img would not mount. I got errors about the filesystem.

So I did a scrub of the cache pool. It found 1500 errors, and about 16000 checksum errors. I let it correct them all. No more btrfs error messages in the syslog.

But docker still would not start. I nuked the docker.img and recreated it. Everything seems to work now, but I can't be sure whether any other files were corrupted during the process (one thing I did not try was to do the scrub for the docker.img).

Can someone clarify the process of recovering or protecting the data in the cache pool in a btrfs raid1 setting? I thought the mirroring was foolproof (albeit with a loss of 50% disk space), but now I'm not so sure.

With the array, if a disk drops and misses writes, it red balls and you let it rebuild. For the cache pool, I was not given an option. Should I expect corrupt files now just like the docker.img?

Thanks

July 30, 2016

Community Expert

If there are no errors during a scrub, in theory all files are good.

Docker.img is different. Running a scrub on it might fix it, or might not; it seems very prone to corruption, even when used on a single xfs or reiserfs disk.

July 31, 2016

Author

Thanks johnnie.black

I've been monitoring it for the last couple of days and did not notice any other issues or signs of corruption. All containers and VMs are running just fine. It seems docker.img was the only affected file. Who knows, maybe a scrub on it would have fixed it.

Anyway, so the only question is, if something like that happens again where a drive drops, I should just reboot and run a scrub? That's the btrfs way? No rebuild necessary as long as the disk itself is fine?

Thanks

July 31, 2016

Community Expert

A scrub and a balance should be enough. Another option is clearing the "failed" disk and adding it to the pool again; it will be balanced automatically.
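
As a rough illustration of the scrub-then-balance sequence, here is a small Python sketch that builds the corresponding `btrfs` commands and, by default, only prints them. The mount point `/mnt/cache` and the helper names `recovery_commands` and `run_recovery` are assumptions for the example, so adjust them to your own pool. The final `btrfs device stats` step is my addition for checking the per-device error counters afterwards; it is not part of the advice above.

```python
import subprocess


def recovery_commands(mount_point):
    """Return the btrfs commands for the sequence, in order."""
    return [
        # -B keeps the scrub in the foreground so the next step waits for it.
        ["btrfs", "scrub", "start", "-B", mount_point],
        # Plain balance, as suggested above (no filters applied here).
        ["btrfs", "balance", "start", mount_point],
        # Extra step (assumption): show per-device error counters afterwards.
        ["btrfs", "device", "stats", mount_point],
    ]


def run_recovery(mount_point, dry_run=True):
    """Print the commands, or run them one by one when dry_run is False."""
    handled = []
    for argv in recovery_commands(mount_point):
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            # check=True raises CalledProcessError on a non-zero exit,
            # which stops the sequence before the next step.
            subprocess.run(argv, check=True)
        handled.append(argv)
    return handled


if __name__ == "__main__":
    run_recovery("/mnt/cache")  # dry run; pass dry_run=False to execute
```

Keeping the default as a dry run is deliberate: a balance rewrites a lot of data, so it is worth reading the commands before running them as root.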
