
Cache pool won't balance to RAID-1


t33j4y


I can't seem to get my cache pool of two 500GB 7200RPM SATA disks to balance.

 

I am running v6-rc1 with parity and four data disks on a Plus license, so I am at the data-disk limit for that license.

 

Parity and data disks are on an M1015 - cache is on native SATA.

 

What I have done is:

 

1. Backed up existing cache drive

2. Shut down server and removed existing cache drive

3. Installed two 500GB 7200RPM SATA disks.

4. Started server

5. Increased number of cache drives to two

6. Selected to format unmountable drives (the two cache drives)

7. Copied the backed-up data back to the cache

8. Restarted server (just for good measure)

9. Verified that Crashplan, Crashplan Desktop and Plex dockers are working.

 

They show up fine and seem to be operational. However, the total size is reported as 1TB, and attempts at running the balance operation end with the user interface idling at 100% remaining and never updating. The log shows that a lot is going on under the hood, but once all operations seem to be done (i.e. no more relocating of block groups showing up in the log), navigating away and back to the cache page shows "no balance found on /mnt/cache" - which suggests to me that the pool is STILL operating as either JBOD or RAID-0, and not the RAID-1 that the balance operation should result in.
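
For reference, my understanding is that the pool's actual profile can be checked from a terminal with the stock btrfs tools, and that a RAID1 conversion can also be started by hand - something like this, assuming the pool is mounted at /mnt/cache (I have not tried the manual conversion yet):

# show which profile (single / RAID0 / RAID1) the data and metadata chunks are using
btrfs filesystem df /mnt/cache

# check whether a balance operation is currently running
btrfs balance status /mnt/cache

# convert both data and metadata to RAID1 - what I expected the webGui balance to do
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache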

 

I'm quite unsure what to do next to try and remedy this, so any suggestions would be most welcome.

 

Thanks in advance!

 

Attached: screenshots of the main screen and the cache page, plus the syslog.

syslog.zip

Cache.png
Main.png

Link to comment

FYI, I think your RAID1 is fine. I think this is just a misunderstanding of what "no balance found" means.

 

It means there is no active balance operation running, which is expected. If you look at the two devices in the pool above the "no balance found" message, you can see that ~100GB of space is being consumed from each of the devices. There is a slight delta in the sizes because metadata is not stored in a RAID1 format.

 

You can test the validity of this by removing a device from the pool and starting the array.
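
If you want to double check from the command line, the stock btrfs tools should tell you directly (the path below is just illustrative):

# prints "No balance found on '/mnt/cache'" whenever no balance is in progress - that is the normal idle state
btrfs balance status /mnt/cache

# the Data / Metadata / System lines show which profile each chunk type is using
btrfs filesystem df /mnt/cache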

 

Link to comment

Great - thank you for the swift reply. I will try removing the backup drive tomorrow.

 

Is there some reason that the main page reports the storage size as 1TB rather than 500GB (which is what would be available if it were RAID1)?

Link to comment


I have 2 x 120GB SSDs and mine shows 120GB total and 38% used on the dashboard, which is correct.
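
If you want to see how btrfs itself accounts for the space, something like this should show the raw device size versus the estimated usable space (needs a reasonably recent btrfs-progs; /mnt/cache is assumed):

# "Device size" is the raw total; "Free (estimated)" reflects the RAID1 overhead
btrfs filesystem usage /mnt/cache
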
Link to comment

Well, I tested this out now (removing a drive to verify that the drives are actually mirrored in RAID1), and I am not so sure all is well...

 

1. Stopped the array and pulled Cache2 drive.

2. Restarted the array - Cache came up as unmountable and consequently no cache drive was available.

3. Stopped the array and reduced number of cache disks to 1, thinking that might help out.

4. Started the array and found that it made no difference.

5. Stopped the array and reconnected Cache2 drive.

6. Started the array and the cache drive was again available.

 

Main still shows two disks and a disk pool of 1 TB. Dash still shows utilization at minus 78%  :o

 

So going by that I seem to be able to conclude that my pool is either

 

a. Not a raid1 pool; or

b. Somehow plagued by something else; or

c. Both of the above

 

 

I've attached the log which shows a flurry of btrfs "stuff" going on when restarting after the removal of Cache2.
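
If it is useful, I could also try mounting the remaining cache device by hand with the degraded option to see what btrfs itself reports - something along these lines (the device name is taken from the attached log, and Unraid normally handles the mounting itself):

# hypothetical manual check, outside of the webGui
mkdir -p /mnt/test
mount -o degraded,ro /dev/sdf1 /mnt/test    # sdf1 per the attached log; adjust to the remaining device
btrfs filesystem df /mnt/test               # shows whether data is single / RAID0 / RAID1
umount /mnt/test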

 

In hope of advice,

 

Thomas

log.zip

Link to comment


Will review this and get back to you. Not sure why that didn't work for you. Will confer with Tom.

Link to comment

Ok, I successfully recreated this today in the lab and figured out the issue.  We had almost all the steps right, but one detail was missed.  Here are the steps again - the correction to the procedure is the note in step 2 about not changing the number of slots:

 

STARTING POSITION:  Two disks or more in a btrfs cache pool; array is running.

 

1 - Stop Array

2 - Unassign the disk from cache 2 (but do not change the # of slots)

3 - Physically detach cache 2 from the system (remove power / SATA)

4 - Start the array

5 - Open the log and notice a "balance" occurring (you'll see a number of entries such as this):

Jun 2 08:49:55 Tower kernel: BTRFS info (device sdf1): relocating block group 36561747968 flags 17

6 - When the balance completes, the device you physically removed will be deleted from the cache pool.

 

There are a few "bugs" with this right now in that you can't monitor the status of the balance from anywhere in the webGui. You just need to wait for the log to stop printing the messages about relocating block groups, and then you'll see this in the log when it's done:

 

Jun 2 08:52:55 Tower kernel: BTRFS info (device sdf1): disk deleted missing
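
For what it's worth, my understanding is that under the hood this corresponds to the standard btrfs handling of a missing device - roughly the equivalent of the following, although the webGui drives it for you and the path here is only illustrative:

# with the pool mounted degraded (one member missing), drop the absent device;
# btrfs relocates its block groups onto the remaining disk(s), which is the activity shown in the syslog
btrfs device delete missing /mnt/cache

# afterwards the pool should list only the remaining device(s)
btrfs filesystem show /mnt/cache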

 

Let me know if this works for you.

Link to comment

Thanks - I'm going to copy the contents of my cache disk (which also houses dockers and apps) and test later today.

 

Maybe purging disk two and then adding it back once I am done testing will make the free-space reporting correct itself as well :-)

Link to comment

Hi.

 

Have now tested, and unfortunately no luck  :(  No balance is starting - the cache comes up as unmountable again.

 

Here's an excerpt of the immediate log entries.

 

Jun  2 20:09:20 Tower kernel: BTRFS: open /dev/sdg1 failed
Jun  2 20:09:20 Tower kernel: BTRFS info (device sdf1): allowing degraded mounts
Jun  2 20:09:20 Tower kernel: BTRFS info (device sdf1): disk space caching is enabled
Jun  2 20:09:20 Tower kernel: BTRFS: has skinny extents
Jun  2 20:09:20 Tower kernel: BTRFS: failed to read tree root on sdf1
Jun  2 20:09:20 Tower logger: mount: wrong fs type, bad option, bad superblock on /dev/sdf1,

 

Have attached the log file.

 

I am strongly considering reverting to a single cache disk, but I'd like to hear if you have any other actions to test before I reconfigure. For now, I have reinstated cache2, which made the cache functional again. I still believe the pool is actually running in RAID0 mode.
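
In case it helps the diagnosis, these are the checks I was planning to run from a terminal next, using the device names from the log (sdf/sdg) - please say if that is a bad idea:

# list the pool members btrfs knows about (should flag a missing device)
btrfs filesystem show

# read-only consistency check of the remaining device (makes no changes unless --repair is given)
btrfs check /dev/sdf1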

 

Thanks.

log2.zip

Link to comment

Archived

This topic is now archived and is closed to further replies.
