(SOLVED) Redundant Cache Pool is having issues, can't start dockers or VMs

MrFrizzy · February 20, 2022

I deleted a folder from one of my unassigned drives and wanted to see if I could recover it. I shut down the machine, pulled the drive out, and tried a recovery through my Windows machine (since the drive is NTFS). I was not successful so I put the drive back into the machine and powered up. Once in Unraid, I noticed that both that unassigned drive and one of my cache pool drives were not showing up. I powered back down and found that I had unplugged the two SATA cables at the controller card. Plugged those back in, booted back up, and noticed that my dockers and VMs would not start up. Dockers kept complaining about something refusing the request and the VMs couldn't talk with libvrt. Turns out, my cache pool that was in RAID 1 was showing a capacity of 1TB (500GB + 500GB) with over 400GB free which is very wrong, it was acting more like RAID 0. When using the terminal to try and access my cache pool, I would just get a message about "wrong fs type, bad option, bad superblock on /dev/sdl1, missing codepage or helper program, or other error.".

I tried a scrub which didn't fix the issue. I tried turning off the array and then bringing it back online with no change. I removed one of the cache drives from the pool and started the array with no change. I tried adding the drive back into the cache pool and it did say it would overwrite the data, but I proceeded anyway which made things worse I think. The system keeps throwing messages about both of the cache pool drives being missing though they are there (probably because they are inaccessible/unmountable). Since I couldn't get anything working even after following this page: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-543490, I pulled the drives out of the unraid machine and attached them to my Debian virtual machine hosted on my Windows desktop. I have not been able to get anywhere on there either.

Using UFS Explorer pro on Debian, the software recognizes that both drives do not have a file system. One drive does show to have a "Btrfs component partition" while the other shows nothing partition wise. Running a scan on both drives shows the correct data and folder structure for both drives (they are identical). Since I do not have a license for this software, I can't recover/copy any of the data off to another drive. If I could do that, I would just wipe the drives afterwards, create a new cache pool, and move the data back over.

Can anyone help direct me on what to try next? Is there a way to rebuild/restore the file system so the drives are accessible again? Do I need to somehow rebuild the RAID 1 with these drives and then try the previously linked post again? I don't have the drives in the Unraid machine, but can put them back if that makes things easier (I am much more familiar with Debian than I am with Unraid and it's interface).

Thanks!

Edited February 25, 2022 by MrFrizzy

JorgeB · February 21, 2022

8 hours ago, MrFrizzy said:

I tried adding the drive back into the cache pool and it did say it would overwrite the data, but I proceeded anyway which made things worse I think.

This deletes the filesystem superblock on the device(s), you might be able to recover with:

btrfs-select-super -s 1 /dev/sdX1

replace X with correct device, don't forget the 1 in the end, do this for both devices and if the command is successful add them to a new pool and post diags after arrays start.

MrFrizzy · February 21, 2022

22 minutes ago, JorgeB said:
This deletes the filesystem superblock on the device(s), you might be able to recover with:
btrfs-select-super -s 1 /dev/sdX1
replace X with correct device, don't forget the 1 in the end, do this for both devices and if the command is successful add them to a new pool and post diags after arrays start.

I will certainly try that once the restore of the data to a backup drive has completed. What is interesting is that the drive I deleted the file system on is the one that I can get a restore going on without any trouble. The drive that has a "btrfs component partition" (according to UFS Explorer) refuses to do anything I have tried so far.

MrFrizzy · February 21, 2022

5 hours ago, MrFrizzy said:

I will certainly try that once the restore of the data to a backup drive has completed. What is interesting is that the drive I deleted the file system on is the one that I can get a restore going on without any trouble. The drive that has a "btrfs component partition" (according to UFS Explorer) refuses to do anything I have tried so far.

The restore was successful. I ran the command on the same drive (the deleted one) with success. I then noticed that my Debian 11 VM recognized both drives, let me mount them, and everything seems to be intact. I didn't run the command on the drive with the "btrfs component partition". I moved them back into the Unraid machine, booted, up and found that the array and cache were both up along with the dockers and VMs. However, my second pool was missing the first drive so I shut the machine back down, checked the connections, booted back up, and now the second pool is doing the same thing the first one is. It shows RAID 1, but the total capacity and free capacity values are definitely wrong.

Attached is the diagnostic zip. I am letting the parity check and rebalance run (usually takes about 24 hours for the main array).

Edited February 21, 2022 by MrFrizzy

JorgeB · February 21, 2022

Both pools have single and raid1 data profiles in use, this will happen when they are mounted with a missing device, run a balance to raid1 to fix it, you can prevent this issue by disabling array auto start, at least when making config changes, and making sure everything is there before starting it

MrFrizzy · February 21, 2022

@JorgeB The rebalance to RAID1 errs out on the cache pool and the process is still running on the second pool. Below is the error I am getting.

ERROR: error during balancing '/mnt/cache': Input/output error

Output of dmesg | tail

[21077.202819] BTRFS error (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 297, gen 694
[21077.202838] BTRFS error (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 298, gen 694
[21077.208635] BTRFS warning (device sdj1): csum failed root -9 ino 586 off 855498752 csum 0x20dbc7d1 expected csum 0x55eff1d8 mirror 1
[21077.208637] BTRFS error (device sdj1): bdev /dev/sdl1 errs: wr 0, rd 0, flush 0, corrupt 183, gen 0
[21077.208792] BTRFS warning (device sdj1): csum failed root -9 ino 586 off 855498752 csum 0x26049628 expected csum 0x55eff1d8 mirror 2
[21077.208795] BTRFS error (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 299, gen 694
[21080.116515] BTRFS info (device sdj1): balance: ended with status: -5

Considering that the cache drive also has my VMs and docker containers, certainly some files changed between the last time it was in a valid RAID 1 and the current rebalance since I have both in auto-start, right? I have found and removed the uncorrectable file (all of Krusader actually) and am trying another rebalance. The process on the second pool also failed with the same error, but the ino listed is on the cache pool so I don't quite understand that one. I will wait for the rebalance to complete successfully on the cache pool before I try it again on the second pool.

JorgeB · February 21, 2022

2 minutes ago, MrFrizzy said:
csum failed 

This means data corruption was detected, you can run a scrub and then check the syslog for a list of the corrupt files, these need to be deleted/restored from backups hen run the balance.

MrFrizzy · February 21, 2022

I removed the 9 uncorrectable files, ran a successful scrub, and a successful rebalance. Running the same procedure on the second pool now.

MrFrizzy · February 21, 2022

@JorgeB The cache pool is working as expected now! Correct size and free space is shown!

The second pool is also showing the correct capacity and free space after running a scrub and balance. It did have a few files fail checksum of which I have removed, but the second balance has erred out on another csum failed message for a file that does not exist (both find /mnt/[secondpool] -inum 946 and btrfs inspect-internal inode-resolve 946 /mnt/[secondpool] fail to find the file). I am trying the scrub again to see if that can fix anything but it will ~~likely take an hour or two~~ take over 3 hours to complete.

[43939.203266] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 44, gen 0
[43939.203270] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 45, gen 0
[43939.252058] BTRFS warning (device sdb1): csum failed root -9 ino 946 off 199262208 csum 0xe16b1f3e expected csum 0x0b5860dc mirror 1
[43939.252064] BTRFS error (device sdb1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 57, gen 0
[43939.252201] BTRFS warning (device sdb1): csum failed root -9 ino 946 off 199262208 csum 0x90983c1a expected csum 0x0b5860dc mirror 2
[43939.252214] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 46, gen 0
[43941.686335] BTRFS info (device sdb1): balance: ended with status: -5

I do have a few questions for you.

Does Unraid ever do a scrub and/or balance on btrfs drives? There really isn't a need to unless there is a RAID1 pool, right?
Should I script out a scrub and/or rebalance once a month alongside the parity check on the array?

Edited February 21, 2022 by MrFrizzy

JorgeB · February 22, 2022

9 hours ago, MrFrizzy said:

Does Unraid ever do a scrub and/or balance on btrfs drives?

Scrub no, balance only if there's a device added or removed.

9 hours ago, MrFrizzy said:

There really isn't a need to unless there is a RAID1 pool, right?

Balance is usually not needed, except in some particular use cases, running a scrub once a month or so it's not a bad idea, especially if the pool is not monitored for errors.

MrFrizzy · February 23, 2022

I'm having a lot of problems with this second pool. Each time I do a scrub, more uncorrectable errors are found. This 3rd scrub is the first one to complete and not terminate early. Since there are uncorrectable, the balance fails as well. I am copying off all of the data now and am getting read corrections in the logs for files that have not come up in the scrub or balance. Not sure what is going on, but I am reaching the ends of my wits here. What should have been a simple file recovery has turned into 3 days of headache.

I plan to copy all of the data off, blow out that second pool, recreate it from scratch, and repopulate the drive from the copy and backups (where needed). I will also implement some sort of scheduled scrub to help maintain data integrity.

ChatNoir · February 23, 2022

5 hours ago, MrFrizzy said:

Each time I do a scrub, more uncorrectable errors are found

Have you run a memtest ?

JorgeB · February 23, 2022

54 minutes ago, ChatNoir said:

run a memtest ?

Do this.

MrFrizzy · February 24, 2022

2 out of 4 matched sticks have errors. I have no idea how long those have been failing, but it has to have been quite some time. Even some files from backups are failing checksum. I've pulled out the bad DIMMs, added my one spare DIMM, and passed a full run of memtest. I've also pulled off all of the "bad" files from the second pool (since they are the same as what I have backed up) and am running another scrub to see if any more errors come up. The main cache pool passes scrub no problem, even before finding the bad DIMMs.

The parity check is passing every time I run it. Does the main array not compare checksums like the btrfs pools?

JorgeB · February 24, 2022

1 minute ago, MrFrizzy said:

Does the main array not compare checksums like the btrfs pools?

No, but a few sync errors after a parity check would be expected if the errors are bad/frequent enough.

MrFrizzy · February 24, 2022

To be clear, the second btrfs pool does not push files to the array. The shares on the second pool will only ever use the (2) drives in that pool, not the cache and not the array.

The scrub is still running on the second pool, but I have gotten some errors like the below which don't point to a file as far as I can tell. Any clue how I can address those?

Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 48, gen 0
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 49, gen 0
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): unable to fixup (regular) error at logical 12169646080 on dev /dev/sdb1
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): unable to fixup (regular) error at logical 12169646080 on dev /dev/sde1
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 49, gen 0
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): bdev /dev/sde1 errs: wr 0, rd 0, flush 0, corrupt 50, gen 0
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): unable to fixup (regular) error at logical 12169650176 on dev /dev/sdb1
Feb 24 10:45:55 Tower kernel: BTRFS error (device sdb1): unable to fixup (regular) error at logical 12169650176 on dev /dev/sde1

EDIT:

From the researching I've done, it seems that whatever was there has been deleted or overwritten but there is some reference to the extent that contains that block.

I ran the below command and got the shown result. I also shifted the logical address +/- 5,000 then +/- 100,000, but got the same return for all 4 of those as that seen below.

btrfs inspect-internal logical-resolve -v -P 12169746080 /mnt/[secondpool]
ioctl ret=0, total_size=65536, bytes_left=65520, bytes_missing=0, cnt=0, missed=0

At this point, do I just need to move forward with rebuilding the csum tree (--init-extent-tree --init-csum-tree) and see if I end up with any messed up files? Or should I format the pool and move everything back over from backups?

Edited February 24, 2022 by MrFrizzy

JorgeB · February 24, 2022

48 minutes ago, MrFrizzy said:

Or should I format the pool and move everything back over from backups?

I would recommend this.

MrFrizzy · February 25, 2022

@JorgeB Thank you for all your help these last few days! Everything seems to be happy and stable now! You've undoubtedly saved me hours if not days of troubleshooting and headache, so again, thank you for the help!

(SOLVED) Redundant Cache Pool is having issues, can't start dockers or VMs

Recommended Posts

MrFrizzy

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

MrFrizzy

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

MrFrizzy

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

ChatNoir

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

JorgeB

Link to comment

MrFrizzy

Link to comment

Join the conversation