February 26, 2015

unRAID OS Version: 6.0-beta14

Description: Attempting to use drives that were once part of the array, and have been upgraded/replaced but not cleared, in a cache pool corrupts the new array drive(s). That's probably confusing so...

How to reproduce:

Recently my cache drive suffered a hardware failure that I wasn't prepared for. Total loss unless a head swap is successful, not worth the cost of professional recovery. Annoying but otherwise meh, whatever. (It probably is not necessary to fail a cache drive to reproduce this error.)

So I decided to replace two of my old but still functioning 500GB array drives with new 2TB drives. All my array drives are formatted btrfs. In turn, I precleared the 2TB drives and then, via the web interface, stopped the array, selected the 2TB drive to replace the 500GB drive, started the array, and let it rebuild and then expand the filesystem in upgrade mode. Again for the second 2TB drive. This was all successful, no problems.

At this point I have the two new 2TB drives in the array, and the two 500GB drives they replaced are still attached and unmodified. Without formatting, clearing, or zeroing, I adjust the web interface so it has two slots for cache (empty, because my cache drive failed and hadn't been replaced). I assign the two 500GB drives to the cache pool, already formatted btrfs from when they were in the array, wanting a RAID1 configuration to protect against future cache failures.

This is where things go wrong. I noticed the web interface is reporting 2.5TB total for cache size? Also the cache didn't get reformatted, it still has the files from *one* of the drives on it. I decide I should unassign them and zero/format them myself. IIRC, when I attempted to stop the array at this point, the web interface froze up and would never respond again. Or maybe it was after I unassigned the cache drives and tried to restart the array.

Because the web UI was hung, I wound up hard rebooting (I know, I know) and when I could see the web interface again, one of the new 2TB drives was reported as unmountable. Also a parity check had started, reporting thousands of errors before I stopped it.

Now, I don't know the internals of what's going on behind the scenes, but I suspect something along the lines of the 500GB and 2TB drives sharing UUIDs because they were rebuilt/copied from parity. Then when a command is issued to create a RAID1 mirror, it pulls in the 2TB array drive instead of the non-array 500GB one.

If it matters, the 500GB drives were /dev/sdf and /dev/sdk. The 2TB drives were /dev/sdb and /dev/sdd.

Expected results: System would detect and prevent this corruption.

Actual results: It didn't.

Other information: I blame myself. I should've formatted/cleared before assigning them to the cache pool.

Fortunately I seem to have successfully used btrfs restore to recover data from the unmountable drive. I didn't have an inventory so something may be missing, and there may yet be some corrupted bits somewhere, but so far all the files I've looked at seem fine. I've run btrfs restore on both 500GB drives and the unmountable 2TB array drive, and will be comparing checksums/file lists.

I've noticed while running btrfs restore on 500GB /dev/sdk1, the still unmountable 2TB is constantly spun up, suggesting to me that restore is somehow redirecting to the 2TB drive to get the data.
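For anyone wanting to do the same kind of checksum/file-list comparison between restored copies, here is a minimal sketch in Python. It is only an illustration, not what was actually run in this thread: the function names `tree_checksums` and `compare_trees` are made up for the example, and it assumes each `btrfs restore` run wrote its files into its own directory.

```python
import hashlib
from pathlib import Path


def tree_checksums(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            sums[path.relative_to(root).as_posix()] = digest.hexdigest()
    return sums


def compare_trees(left_root, right_root):
    """Report files missing on either side and files whose contents differ."""
    left = tree_checksums(left_root)
    right = tree_checksums(right_root)
    common = set(left) & set(right)
    return {
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
        "differ": sorted(p for p in common if left[p] != right[p]),
    }
```

Calling something like `compare_trees("/mnt/restore/sdk1", "/mnt/restore/md4")` (both paths are hypothetical) would list files present in only one restored copy and files whose checksums disagree. Matching checksums only show the copies agree with each other, not that either one is free of corruption.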
February 26, 2015

Ok, something is not right here. The devices that were formatted with btrfs and attached to the array should not have been affected by anything you did with the cache pool unless those devices were at one point in the cache pool, which could explain some odd behavior.

> Also the cache didn't get reformatted, it still has the files from *one* of the drives on it.
When you create a cache pool with devices that were already initialized with btrfs, the first device that is assigned to the pool is considered the "master" and the second device becomes the slave to that master. I use the term initialized because you don't really "format" with BTRFS. I realize the webGui still uses that term, but that's just to avoid confusing people with BTRFS-specific terminology. Because the devices were already formatted with BTRFS, data that was on device 1 will then be copied to device 2 through "rebalancing." In this instance that happens when the array starts, which leads me to my first question: When the array was started after assigning the two 500GB disks to the cache pool, did it take a while to start?

> when I attempted to stop the array at this point, the web interface froze up and would never respond again.

I need to test this myself to see if the rebalance event will hold the system from stopping the array. When you say it "froze", did you see any messages at the bottom? Did you attempt to connect via telnet or SSH before resorting to the hard reset?

> Now, I don't know the internals of what's going on behind the scenes, but I suspect something along the lines of the 500GB and 2TB drives sharing UUIDs because they were rebuilt/copied from parity.

Please connect via telnet or SSH and type the following command:

btrfs filesystem show

Please copy and paste the output of that command here. This will help determine which btrfs devices are pooled together and which are single.

> I blame myself. I should've formatted/cleared before assigning them to the cache pool.

Do not blame yourself for this. We need to clean up handling for these scenarios where you move devices from the array to the cache. I could see this as a viable way to repurpose drives that are on their way out the door (put them in a btrfs RAID1 cache pool, where their failure poses far less risk to data than the failure of an array device would).

> Fortunately I seem to have successfully used btrfs restore to recover data from the unmountable drive.

This is one of the big reasons I like BTRFS. It combines COW with checksumming, which can make it very easy to recover from corruption.

> I've noticed while running btrfs restore on 500GB /dev/sdk1, the still unmountable 2TB is constantly spun up...

This may have been a bug in beta14 (spin status reporting in the webGui). Beta14b is out and should resolve spin detection issues.
February 26, 2015 (Author)

> Ok, something is not right here. The devices that were formatted with btrfs and attached to the array should not have been affected by anything you did with the cache pool unless those devices were at one point in the cache pool, which could explain some odd behavior.

These were new out of the box 2TB drives that I precleared and then added to the array as upgrades. Nothing else.

> When the array was started after assigning the two 500GB disks to the cache pool, did it take a while to start?

I didn't notice that it took particularly long.

> When you say it "froze", did you see any messages at the bottom?

I did not notice any messages. I made multiple attempts to connect to a new instance in a new web browser, unsuccessfully.

> Did you attempt to connect via telnet or SSH before resorting to the hard reset?

Actually I said that wrong. I was connected via telnet, preparing to use preclear to zero the 500GB drives after I'd unassigned them from the cache, when things froze up. Rather than walking to the closet to push the reset button, I entered "reboot" via telnet, which was still responding. I probably should have stopped right there and done something differently.
root@Tower:/mnt/disk2# btrfs filesystem show
Label: none uuid: 9bdb8100-09f2-48f8-ad99-070433c571a2
    Total devices 1 FS bytes used 178.63GiB
    devid 1 size 465.76GiB used 181.04GiB path /dev/md1

Label: none uuid: 5152b326-51bf-4a49-b693-8b50ff27b8ce
    Total devices 1 FS bytes used 1.25TiB
    devid 1 size 1.82TiB used 1.26TiB path /dev/md2

Label: none uuid: 5e509301-4b78-4998-a3e5-c802a5a49b07
    Total devices 1 FS bytes used 802.88GiB
    devid 1 size 931.51GiB used 848.04GiB path /dev/md3

Label: none uuid: d21b7ca9-7dec-40b5-aef1-5f4edad05084
    Total devices 1 FS bytes used 464.03GiB
    devid 1 size 465.76GiB used 465.76GiB path /dev/md5

Label: none uuid: 65748c8e-22eb-4eaa-b5a5-cc2ebde8396e
    Total devices 1 FS bytes used 454.84GiB
    devid 1 size 465.76GiB used 460.04GiB path /dev/md6

Label: none uuid: af5cc7a3-dc3a-4464-b198-0f825128b45b
    Total devices 1 FS bytes used 452.23GiB
    devid 1 size 465.76GiB used 458.04GiB path /dev/md7

Label: none uuid: 95313e3c-6287-4613-aaf3-cca88a939435
    Total devices 1 FS bytes used 1.31TiB
    devid 1 size 1.82TiB used 1.32TiB path /dev/md8

Label: none uuid: a6245f72-4823-45c3-b561-48e247e6c77d
    Total devices 2 FS bytes used 459.36GiB
    devid 1 size 1.82TiB used 464.04GiB path /dev/md4
    devid 2 size 465.76GiB used 0.00B path /dev/sdk1

Btrfs v3.18.2

The new drives are md2 and md4.

EDIT: Maybe my premature reboot interrupted the rebalance, and then automatic recovery attempts after reboot caused it to grab the array drive due to their twin nature? Maybe trying to unassign them while the rebalance was ongoing caused the web UI to hang? The 500GB drive was mostly full, so the rebalance should have taken a good while.
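The interesting entry above is the last one, where a single filesystem UUID spans both /dev/md4 and /dev/sdk1. As a rough sketch of how that could be spotted automatically, the Python below parses `btrfs filesystem show` text and lists any filesystem with more than one device. The helper names (`to_bytes`, `parse_show`, `multi_device`) are hypothetical, the sample is trimmed from the output above, and the parsing assumes the v3.18-style layout shown in this thread; other btrfs-progs versions may print it differently.

```python
import re

# Two entries trimmed from the output posted above.
SAMPLE = """\
Label: none uuid: 95313e3c-6287-4613-aaf3-cca88a939435
    Total devices 1 FS bytes used 1.31TiB
    devid 1 size 1.82TiB used 1.32TiB path /dev/md8
Label: none uuid: a6245f72-4823-45c3-b561-48e247e6c77d
    Total devices 2 FS bytes used 459.36GiB
    devid 1 size 1.82TiB used 464.04GiB path /dev/md4
    devid 2 size 465.76GiB used 0.00B path /dev/sdk1
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_bytes(text):
    """Convert a size such as '1.82TiB' or '0.00B' to a byte count."""
    number, unit = re.fullmatch(r"([\d.]+)([KMGT]iB|B)", text).groups()
    return float(number) * UNITS[unit]


def parse_show(output):
    """Return a list of {'uuid': ..., 'devices': [(path, size_bytes), ...]}."""
    filesystems = []
    for line in output.splitlines():
        uuid = re.search(r"uuid:\s*(\S+)", line)
        dev = re.search(r"devid\s+\d+\s+size\s+(\S+)\s+used\s+\S+\s+path\s+(\S+)", line)
        if uuid:
            filesystems.append({"uuid": uuid.group(1), "devices": []})
        elif dev and filesystems:
            filesystems[-1]["devices"].append((dev.group(2), to_bytes(dev.group(1))))
    return filesystems


def multi_device(filesystems):
    """Keep only filesystems that span more than one device."""
    return [fs for fs in filesystems if len(fs["devices"]) > 1]


for fs in multi_device(parse_show(SAMPLE)):
    total = sum(size for _, size in fs["devices"])
    paths = ", ".join(path for path, _ in fs["devices"])
    print(f"{fs['uuid']}: {paths} ({total / 1e12:.2f} TB)")
```

On the sample, the combined size of /dev/md4 and /dev/sdk1 comes out at about 2.5 TB in decimal units, which may be where the 2.5TB cache size shown in the web interface came from, though that is only a guess.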
February 27, 2015 (Author)

I'm about ready to pull out the 500GB drives and reformat the unmountable md4. Is there any other helpful information I can provide before I start meddling with the current state of things?
February 28, 2015 (Author)

Okay... There's something more seriously weird at work here, I think. Without going over everything else in the interim that led me to this point, here's the current oddity:

I've pulled out both 500GB drives. I've also pulled out the unmountable 2TB drive after some attempts to have it in the array and formatted. I cleaned out my boot flash drive and extracted a clean copy of beta14b. So I'm starting from a new config, parity sync is underway. I have no cache assigned.

root@Tower:~# btrfs filesystem show
Label: none uuid: 9bdb8100-09f2-48f8-ad99-070433c571a2
    Total devices 1 FS bytes used 178.63GiB
    devid 1 size 465.76GiB used 181.04GiB path /dev/md1

Label: none uuid: 95313e3c-6287-4613-aaf3-cca88a939435
    Total devices 1 FS bytes used 879.94GiB
    devid 1 size 1.82TiB used 904.04GiB path /dev/md2

Label: none uuid: 5152b326-51bf-4a49-b693-8b50ff27b8ce
    Total devices 1 FS bytes used 896.95GiB
    devid 1 size 1.82TiB used 948.04GiB path /dev/md3

Label: none uuid: 5e509301-4b78-4998-a3e5-c802a5a49b07
    Total devices 1 FS bytes used 802.88GiB
    devid 1 size 931.51GiB used 848.04GiB path /dev/md4

Label: none uuid: 65748c8e-22eb-4eaa-b5a5-cc2ebde8396e
    Total devices 1 FS bytes used 454.84GiB
    devid 1 size 465.76GiB used 460.04GiB path /dev/md5

Label: none uuid: d21b7ca9-7dec-40b5-aef1-5f4edad05084
    Total devices 1 FS bytes used 464.03GiB
    devid 1 size 465.76GiB used 465.76GiB path /dev/md6

Label: none uuid: af5cc7a3-dc3a-4464-b198-0f825128b45b
    Total devices 1 FS bytes used 452.23GiB
    devid 1 size 465.76GiB used 458.04GiB path /dev/md7

Label: none uuid: 19dba89f-f28f-4741-b8d0-141cd5c51bef
    Total devices 1 FS bytes used 685.68GiB
    devid 1 size 931.51GiB used 526.76GiB path /dev/sde1

Btrfs v3.18.2

Look at that last entry. My *PARITY* drive is sde. I didn't mis-select it in the new config. All the other drives are mountable and have their files, working fine.
February 28, 2015 (Author)

On my end, it appears to be an artifact of having *all* the array drives formatted btrfs. I moved all my files off of disk1, reformatted it as xfs, rebooted, and things seem normal:

(command run with the array started, then with it stopped)

root@Tower:~# btrfs filesystem show
Label: none uuid: 95313e3c-6287-4613-aaf3-cca88a939435
    Total devices 1 FS bytes used 879.94GiB
    devid 1 size 1.82TiB used 904.04GiB path /dev/md2

Label: none uuid: 5152b326-51bf-4a49-b693-8b50ff27b8ce
    Total devices 1 FS bytes used 1.05TiB
    devid 1 size 1.82TiB used 1.05TiB path /dev/md3

Label: none uuid: 5e509301-4b78-4998-a3e5-c802a5a49b07
    Total devices 1 FS bytes used 802.88GiB
    devid 1 size 931.51GiB used 848.04GiB path /dev/md4

Label: none uuid: 65748c8e-22eb-4eaa-b5a5-cc2ebde8396e
    Total devices 1 FS bytes used 454.84GiB
    devid 1 size 465.76GiB used 460.04GiB path /dev/md5

Label: none uuid: d21b7ca9-7dec-40b5-aef1-5f4edad05084
    Total devices 1 FS bytes used 464.03GiB
    devid 1 size 465.76GiB used 465.76GiB path /dev/md6

Label: none uuid: af5cc7a3-dc3a-4464-b198-0f825128b45b
    Total devices 1 FS bytes used 452.23GiB
    devid 1 size 465.76GiB used 458.04GiB path /dev/md7

Btrfs v3.18.2

root@Tower:~# btrfs filesystem show
Label: none uuid: 5e509301-4b78-4998-a3e5-c802a5a49b07
    Total devices 1 FS bytes used 802.88GiB
    devid 1 size 931.51GiB used 848.04GiB path /dev/sdc1

Label: none uuid: 5152b326-51bf-4a49-b693-8b50ff27b8ce
    Total devices 1 FS bytes used 1.05TiB
    devid 1 size 1.82TiB used 1.05TiB path /dev/sdd1

Label: none uuid: 65748c8e-22eb-4eaa-b5a5-cc2ebde8396e
    Total devices 1 FS bytes used 454.84GiB
    devid 1 size 465.76GiB used 460.04GiB path /dev/sde1

Label: none uuid: 95313e3c-6287-4613-aaf3-cca88a939435
    Total devices 1 FS bytes used 879.94GiB
    devid 1 size 1.82TiB used 904.04GiB path /dev/sdi1

Label: none uuid: d21b7ca9-7dec-40b5-aef1-5f4edad05084
    Total devices 1 FS bytes used 464.03GiB
    devid 1 size 465.76GiB used 465.76GiB path /dev/sdj1

Label: none uuid: af5cc7a3-dc3a-4464-b198-0f825128b45b
    Total devices 1 FS bytes used 452.23GiB
    devid 1 size 465.76GiB used 458.04GiB path /dev/sdl1

Btrfs v3.18.2

(I put the 3 pulled drives back in, so some letters are different. Parity is now sdg.)
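The two listings above show the same six filesystem UUIDs, once under /dev/mdN (array started) and once under the raw partitions (array stopped). A small Python sketch of lining them up by UUID is below. The function names `pair_by_uuid` and `unmatched` are invented for this example, and the dictionaries are typed in by hand from the output above rather than read from the system.

```python
# Device path -> filesystem UUID, copied from the two listings above.
STARTED = {
    "/dev/md2": "95313e3c-6287-4613-aaf3-cca88a939435",
    "/dev/md3": "5152b326-51bf-4a49-b693-8b50ff27b8ce",
    "/dev/md4": "5e509301-4b78-4998-a3e5-c802a5a49b07",
    "/dev/md5": "65748c8e-22eb-4eaa-b5a5-cc2ebde8396e",
    "/dev/md6": "d21b7ca9-7dec-40b5-aef1-5f4edad05084",
    "/dev/md7": "af5cc7a3-dc3a-4464-b198-0f825128b45b",
}
STOPPED = {
    "/dev/sdc1": "5e509301-4b78-4998-a3e5-c802a5a49b07",
    "/dev/sdd1": "5152b326-51bf-4a49-b693-8b50ff27b8ce",
    "/dev/sde1": "65748c8e-22eb-4eaa-b5a5-cc2ebde8396e",
    "/dev/sdi1": "95313e3c-6287-4613-aaf3-cca88a939435",
    "/dev/sdj1": "d21b7ca9-7dec-40b5-aef1-5f4edad05084",
    "/dev/sdl1": "af5cc7a3-dc3a-4464-b198-0f825128b45b",
}


def pair_by_uuid(md_listing, raw_listing):
    """Map each md device to the raw partition carrying the same UUID."""
    by_uuid = {uuid: path for path, uuid in raw_listing.items()}
    return {md: by_uuid.get(uuid) for md, uuid in md_listing.items()}


def unmatched(md_listing, raw_listing):
    """Raw partitions whose btrfs UUID belongs to no md device."""
    known = set(md_listing.values())
    return sorted(path for path, uuid in raw_listing.items() if uuid not in known)


for md, raw in pair_by_uuid(STARTED, STOPPED).items():
    print(md, "->", raw)
print("unmatched:", unmatched(STARTED, STOPPED))
```

With these inputs every md device pairs with exactly one partition and nothing is left over. A partition that did show up in `unmatched`, like the sde1 parity entry in the earlier listing, would be the kind of thing worth a second look before assigning any drives to a pool.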