BTRFS cache pool drive failed potentially causing corruption? (6.9.2)

henris · September 22, 2023

I have a two NVMe SSD cache pool taken into use in 2019 on my second server. We had a power outage and although UPS seemed to allow for controlled shutdown the system was not behaving correctly after restarting the server. I noticed some dockers (Plex) not working correctly and started troubleshooting. I had not received any notifications about issues and also the main Unraid GUI pages (Main, Dashboard, Docker) did not indicate any issue. When I took a look in syslog I saw a flood of BTRFS related warnings and errors. Seemed like the whole server was on fire, well at least the cache pool.

I started reading on FAQ and similar problem threads. I got confused fast. I've been using UnRAID since 2009 and pretty good with it but the cache pool BTRFS mechanism, how to see the status of it, how to troubleshoot it and in this case how to fix it seems overwhelming. I've read this FAQ entry and this addition to it. And several troubleshooting threads. And also this "how to monitor btrfs pool for errors" which I will take into use.

My questions are:

How can I see what is actually broken? From the smart logs and "btrfs dev stats /mnt/cace/" it seems like it is my /dev/nvme1n1p1 SSD which has failed. It just baffles me that this is not at all reflected in the UnRAID GUI.
How can I see what data is corrupted or lost? Are there some specific command I can run to see a list of corrupted files?
Why would I have corrupted data? I thought running a RAID1 cache pool would protect me from a single cache drive failure but now I seem to have a single drive failure and still experiencing at least functional loss (ie. unable to run dockers properly).
What is the recommend way to fix this? I have replacement SSDs ready but I cannot connect them at the same time (only two M.2 slots). I'm especially unsure about trusting the data currently in cache pool. I do have CA backups available.

My whole system is currently down so all help is greatly appreciated! I promise to document my path and end result in this thread.

Diagnostics attached. This is though AFTER one shutdown but seems to show the same behavior.

tms-740-diagnostics-20230922-0733.zip

JorgeB · September 22, 2023

One of the devices dropped offline in the past, first thing is to run a correcting scrub and post the results.

henris · September 22, 2023

I will do this and post results. Just to be sure; it is safe to run scrub on the pool regardless of the state of the devices in it? This cannot cause any more corruption / data loss?

JorgeB · September 22, 2023

9 minutes ago, henris said:

it is safe to run scrub on the pool regardless of the state of the devices in it?

Yes.

henris · September 22, 2023

When I try to run scrub via GUI I get "aborted" as the status:

And pretty much the same thing via shell:

image.png.a33359b46ffbec14a706d583f954335f.png

I forgot that I had run the scrub via GUI last night when doing initial troubleshooting. Initially I got the same "aborted" but after I stopped the VMs and Dockers I was able to start the scrub which run for ~ 6 minutes and reported millions of unrecoverable errors. Unfortunately I did not get a screenshot of that result before hitting the Scrub again now...

JorgeB · September 22, 2023

If the scrub is aborting recommend copying what you can from the pool and re-format, also see here for better pool monitoring, in case the device drops again.

henris · September 22, 2023

5 hours ago, JorgeB said:

If the scrub is aborting recommend copying what you can from the pool and re-format, also see here for better pool monitoring, in case the device drops again.

I really appreciate your replies but I'm now really confused. According to SMART report my second cache pool NVMe SSD has failed. Surely I cannot just re-format without replacing the failed drive first? The cache pool seems to in read-only mode and files can be read from it without causing any errors in syslog. Should I just start reading BTRFS manual and try to figure out what is going on? How can this (btrfs pool) be part of Unraids critical cache feature if it so fragile and untroubleshootable? Should I just start from scratch and if so is there something better than BTRFS cache pools? ZFS pools?

I have already purchased two larger replacement NVMe SSDs since the current ones are already four years old and close to their recommended TBW. I'm willing to bite the bullet and start from scratch but it would be great know that the new mechanism actually worked. The only reason for using RAID1 cache pool was to get protection from drive failure and when the drive failure occurred Unraid was totally unaware of it.

Sorry for the ranting, I really like Unraid, it's been serving me well for over a decade. This issue happened at the most inconvenient time and I don't have enough time to investigate this properly.

henris · September 22, 2023

Reading from this:

- When a drive fails in two disk BTRFS RAID1 pool, the pool continues to operate in read-write mode (though some comments indicate that it might go to read-only mode)

- If you reboot in this state, the pool will mounted in read-only mode

- You can mount the pool in read-write mode with some special command (degraded mount)

I could not find the BTRFS command to see the current state of the pool (read-write vs read-only, general healt, anything). The closest is the "device stats" but it provides cumulative historical data, not the current state. Am I missing something here?

JorgeB · September 23, 2023

12 hours ago, henris said:

According to SMART report my second cache pool NVMe SSD has failed

Yes, sorry, missed that, media is read only, so backup what you can, replace the device, create new pool and restore the data, if the current pool is not mounting post new diags.

henris · September 24, 2023

Started the replacement process by doing a full rsync archive copy to a standalone SSD:

rsync -av /mnt/cache/ /mnt/disks/download/cache_backup/ --log-file=log.txt

This seemed to run fine except for one error reported:

vm/domains/hassos_ova-4.16.qcow2
rsync: [sender] read errors mapping "/mnt/cache/vm/domains/hassos_ova-4.16.qcow2": Input/output error (5)
ERROR: vm/domains/hassos_ova-4.16.qcow2 failed verification -- update discarded.

sent 315,302,030,978 bytes  received 10,276,919 bytes  447,568,925.33 bytes/sec
total size is 296,255,909,280  speedup is 0.94
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=3.2.3]

If this is the only corrupted file I will be glad. The "hassos_ova-4.16.qcow2" can be just re-downloaded.

I will next shutdown the server and replace the two 500GB SSDs with new 1TB ones. Then create a new pool and restore the data to it.

JorgeB · September 24, 2023

1 hour ago, henris said:

If this is the only corrupted file I will be glad.

It should be, any others would also fail to copy with an i/o error, assuming COW is enabled.

henris · September 25, 2023

I just successfully re-created my 2 x NVMe SSD cache pool replacing old 500GB with 1TB.

Steps:

Stopped VM/Docker services
Created full backup of cache pool contents with "rsync -av" to a separate SSD
Shutdown the server
Replaced the SSDs
Started the server
Had some jitters since the server refused to boot from usb. Hav had this issue occasionally and finally it booted. Did not change any settings. I think it is due the Asus mb bios getting confused about the 25+ potential boot drives. Took the usb out, made a copy of it to make sure it was still fine and put it back in. And the Unraid was booted.
Stopped the array
Assigned the new SSDs to cache pool
Formatted the SSDs
Restored the cache pool contents with "rsync -av"
Started VM/Docker services
Started verifying docker services. Still going through them but the main ones like Plex seemed to be fully functional. I will check logs for any suspicious issues but looks good so far.

Short rant about BTRFS pool management and troubleshooting tools:

It is a short rant since there ain't no tools for seeing the pool or device status.
Pool was in read-only mode and there was no way to see it
One of the two devices of the pool had failed and there was no way to see it
The only thing "visible" of any issue was the BTRFS device error counts which are NOT reflected in the Unraid GUI
I cannot be sure if the data on the remaining SSD was ok or not. Though apart from one file I was able to copy the data off it.

I will be building a new server in the near future. I will be very closely looking at the ZFS pools if they would provide better experience.

The only file I lost was hassos_ova-4.16.qcow2. Initially I thought this was no biggie since I could just re-download it if needed. But I soon realised that it was the actual disk image of my Home Assistant environment. And then I realised that I had no backup of it anywhere... Arghh... Having no backup is on me, I cannot understand how I missed backing it up.

I still have the old SSDs. I think I will put the non-failed on in a M.2 NVMe enclosure and try to see if the missing file could somehow be recovered. If someone has an idea how to do this, please chime in. If this fails, I guess it is always good to start from scratch sometimes. Fortunately I had mostly prototyping stuff in the HASS but some special things like KNX integration contained own developed parts.

henris · September 25, 2023

Like always the celebration was too early... There seems to be quite some corruption in different places. My other Windows vm has a corrupted filesystem beyond repair. Also the docker file seems corrupt. The docker service hang and I was unable to get dockers to stop. Now rebooting the server. I will try to get the cache pool in unmounted state so I can run filesystem check. Then I will decide if I will just revert to previous docker backups and re-create the docker image.

henris · September 25, 2023

Re-created docker image:

Read the instructions here:
Made sure I had all the docker templates backed up if anything went haywire in the process. These are stored in the flash /boot/config/plugins/dockerMan/templates-user
Deleted the docker image using GUI (docker service was already stopped)
Started docker service
Installed all needed dockers through Apps/Previous Apps page.

I also checked the cache pool's filesystem integrity before the above:

Started the array in maintenance mode
Run the check through GUI (btrsfs check /mnt/cache/ did not work for some reason)
Results seemed ok -> cannot explain the corruption of docker image and windows vm image

[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p1
UUID: 39320790-03d4-4117-a978-033abe08a975
found 309332566016 bytes used, no error found
total csum bytes: 301153416
total tree bytes: 941031424
total fs tree bytes: 576176128
total extent tree bytes: 43843584
btree space waste bytes: 138846305
file data blocks allocated: 1659942002688
 referenced 308089815040

JorgeB · September 26, 2023

10 hours ago, henris said:

cannot explain the corruption of docker image and windows vm image

Most likely those shares were using the old NOCOW default, those won't have checksums, so no way for btrfs to check the data.

henris · September 27, 2023

I was able to read the missing Home Assistant VM image when I mounted the SSD on Ubuntu Live usb. The BTRFS file system was mounted in degraded mode and all the files were readable. When I tried the same when the SSD was mounted in Unraid, some of the files were unreadable.

In the mean while the re-created cache pool with the new SSDs has been functioning properly. It is still a BTRFS pool, I will make the switch to something else once I have upgraded my server.

I still have an opinion that BTRFS is missing critical troubleshooting and management tools for pools and is not meant for production. In my mind it is a summer project which has the functionalities but was left unfinished regarding the non-functional aspects.

BTRFS cache pool drive failed potentially causing corruption? (6.9.2)

Recommended Posts

henris

Link to comment

JorgeB

Link to comment

henris

Link to comment

JorgeB

Link to comment

henris

Link to comment

JorgeB

Link to comment

henris

Link to comment

henris

Link to comment

JorgeB

Link to comment

henris

Link to comment

JorgeB

Link to comment

henris

Link to comment

henris

Link to comment

henris

Link to comment

JorgeB

Link to comment

henris

Link to comment

Join the conversation