July 29, 2016 Hi guys, I'm running the new rc3 and just noticed that the dashboard showed the log at 54% full (it's usually at about 2%). When I clicked on the log, I saw that it was constantly being flooded with btrfs errors (here's a snapshot: http://pastebin.com/xgTdwLK6 ). No errors are reported in the GUI and everything seems fine. The docker containers and the VM were running fine. I immediately stopped the dockers and the VM, then checked the drives in the cache pool. One of them (sdd) shows that it can't retrieve attributes or capabilities. The diagnostics file is too large to attach, so here it is on Google Drive: https://drive.google.com/open?id=0B6Pdr6ZXGPHDWmYxbWNva3lOQ2M Any help would be appreciated. Thanks EDIT: I stopped the array and it shows no device for that cache drive. It seems the drive has dropped. I did put in a new drive cage recently, and I suspect the issue is either with the cage or that the SSD was not making proper contact with the ports in the back.
July 29, 2016 Author Hi again, Just got home, checked the connections, restarted the server and now the drive is accessible again. However, I am getting some btrfs errors during array start, possibly due to an information mismatch (writes happened while one drive was offline). What is the best way to fix that? Should I run a balance again, or a scrub? Will it fix itself automatically? (I did try to restart the array one more time and I am still getting the errors.) Here are the lines in the syslog:
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): disk space caching is enabled
Jul 29 19:22:24 Tower kernel: BTRFS: has skinny extents
Jul 29 19:22:24 Tower kernel: verify_parent_transid: 1 callbacks suppressed
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397037187072 wanted 1505374 found 1473345
Jul 29 19:22:24 Tower kernel: repair_io_failure: 34 callbacks suppressed
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037187072 (dev /dev/sdd1 sector 82385888)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037191168 (dev /dev/sdd1 sector 82385896)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037195264 (dev /dev/sdd1 sector 82385904)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037199360 (dev /dev/sdd1 sector 82385912)
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397037203456 wanted 1505374 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037203456 (dev /dev/sdd1 sector 82385920)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037207552 (dev /dev/sdd1 sector 82385928)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037211648 (dev /dev/sdd1 sector 82385936)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037215744 (dev /dev/sdd1 sector 82385944)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): bdev /dev/sdd1 errs: wr 2117450, rd 100550, flush 220783, corrupt 0, gen 0
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397050064896 wanted 1505371 found 1473350
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397050064896 (dev /dev/sdd1 sector 82411040)
Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397050068992 (dev /dev/sdd1 sector 82411048)
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1396998029312 wanted 1505361 found 1473319
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036580864 wanted 1505372 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397049393152 wanted 1505371 found 1473350
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036630016 wanted 1505372 found 1473346
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1396962344960 wanted 1505332 found 1473274
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036646400 wanted 1505372 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397036843008 wanted 1505372 found 1473344
Jul 29 19:22:24 Tower kernel: BTRFS: detected SSD devices, enabling SSD mode
Thanks
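For anyone reading a log like the one above, a small script can make the pattern easier to see: how many "parent transid verify failed" errors there are, how far the stale copy lags behind (wanted minus found), and which device the corrected reads point to. This is only a rough sketch. The function name `summarize_btrfs_log` and the regular expressions are illustrative assumptions based on the message shapes quoted in this post, not an official parser, and other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the syslog lines quoted in this thread.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (?P<bytenr>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)
CORRECTED_RE = re.compile(
    r"read error corrected: ino \d+ off \d+ \(dev (?P<dev>\S+) sector \d+\)"
)
ERRS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def summarize_btrfs_log(lines):
    """Summarize btrfs kernel messages from an iterable of syslog lines."""
    summary = {
        "transid_failures": 0,        # count of transid mismatches
        "max_generation_gap": 0,      # largest (wanted - found) seen
        "corrected_reads": Counter(), # corrected reads per device
        "device_errs": {},            # last error counters seen per device
    }
    for line in lines:
        m = TRANSID_RE.search(line)
        if m:
            summary["transid_failures"] += 1
            gap = int(m.group("wanted")) - int(m.group("found"))
            summary["max_generation_gap"] = max(summary["max_generation_gap"], gap)
            continue
        m = CORRECTED_RE.search(line)
        if m:
            summary["corrected_reads"][m.group("dev")] += 1
            continue
        m = ERRS_RE.search(line)
        if m:
            counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
            summary["device_errs"][m.group("dev")] = counters
    return summary


# Three lines copied from the syslog excerpt in this post.
SAMPLE = [
    "Jul 29 19:22:24 Tower kernel: BTRFS error (device sdk1): parent transid verify failed on 1397037187072 wanted 1505374 found 1473345",
    "Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): read error corrected: ino 1 off 1397037187072 (dev /dev/sdd1 sector 82385888)",
    "Jul 29 19:22:24 Tower kernel: BTRFS info (device sdk1): bdev /dev/sdd1 errs: wr 2117450, rd 100550, flush 220783, corrupt 0, gen 0",
]

if __name__ == "__main__":
    print(summarize_btrfs_log(SAMPLE))
```

In this excerpt every corrected read and every error counter refers to /dev/sdd1, the drive that dropped, which fits the guess that it missed writes while offline.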
July 30, 2016 Author So I tried to start docker, but the docker.img would not mount. I got errors about the filesystem. So I did a scrub of the cache pool. It found 1500 errors and about 16000 checksum errors. I let it correct them all. No more btrfs error messages in the syslog. But docker still would not start. I nuked the docker.img and recreated it. Everything seems to work now, but I can't be sure whether any other files were corrupted during the process (one thing I did not try was to do the scrub for the docker.img). Can someone clarify the process of recovering or protecting the data in the cache pool in a btrfs raid1 setting? I thought the mirroring was foolproof (albeit with a loss of 50% disk space) but now I'm not so sure. With the array, if a disk drops and misses writes, it red balls and you let it rebuild. For the cache pool, I was not given an option. Should I expect corrupt files now, just like the docker.img? Thanks
July 30, 2016 Community Expert If there are no errors during a scrub, in theory all files are good. Docker.img is different: maybe running a scrub on it would fix it, maybe not. It seems very prone to corruption, even when used on a single xfs or reiserfs disk.
July 31, 2016 Author Thanks, johnnie.black. I've been monitoring it for the last couple of days and did not notice any other issues or signs of corruption. All containers and VMs are running just fine. It seems docker.img was the only affected file. Who knows, maybe a scrub on it would have fixed it. Anyway, the only remaining question is: if something like that happens again where a drive drops, should I just reboot and run a scrub? Is that the btrfs way? No rebuild necessary as long as the disk itself is fine? Thanks
July 31, 2016 Community Expert A scrub and a balance should be enough. Another option is clearing the "failed" disk and adding it to the pool again; it will be balanced automatically.
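The scrub-then-balance sequence suggested above could be scripted roughly as follows. This is a minimal sketch, not a tested recovery procedure. The mount point `/mnt/cache` is an assumption about where the cache pool is mounted, and the helper names `recovery_commands` and `run_recovery` are made up for illustration. The script defaults to a dry run that only prints the commands; actually running them needs root and a mounted btrfs pool.

```python
import subprocess


def recovery_commands(mount_point="/mnt/cache"):
    """Return the btrfs commands for a scrub followed by a balance.

    mount_point is an assumed location for the cache pool; adjust as needed.
    """
    return [
        # -B keeps the scrub in the foreground so the balance starts only after it ends.
        ["btrfs", "scrub", "start", "-B", mount_point],
        # Full balance; some btrfs-progs versions print a warning before starting.
        ["btrfs", "balance", "start", mount_point],
        # Show per-device error counters (wr, rd, flush, corrupt, gen).
        ["btrfs", "device", "stats", mount_point],
    ]


def run_recovery(mount_point="/mnt/cache", dry_run=True):
    """Print each command and, unless dry_run is True, execute it."""
    for cmd in recovery_commands(mount_point):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_recovery()  # dry run: prints the three commands without executing them
```

Keeping the command list separate from the runner makes it easy to review exactly what would be executed before passing `dry_run=False`.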