Replacing Cache drive in array

September 26, 2015

I have one of my drives starting to fail. The reallocated sector count is going up each day. So, I just want to verify whether replacing a cache disk in a cache array is the same as replacing a data disk?

1 - Stop the array

2 - Unassign the old drive, if it's still assigned

3 - Power down

4 - [ Optional ] Pull the old drive (you may want to leave it installed for Preclearing or testing)

5 - Install the new drive

6 - Power on

7 - Assign the new drive in the slot of the old drive

8 - Go to the Main -> Array Operation section

9 - Put a check in the Yes, I'm sure checkbox (next to the information indicating the drive will be rebuilt), and click the Start button
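On the rising reallocated sector count mentioned at the top of this post: one way to track it day to day is to read SMART attribute 5 from the attribute table that `smartctl -A` prints. The snippet below is only a rough sketch. It assumes the usual ten-column attribute layout, and the sample text and its numbers are made up for illustration, not taken from the failing drive.

```python
def reallocated_sectors(smart_attr_text):
    """Return the raw Reallocated_Sector_Ct value, or None if not found.

    Expects text in the attribute-table layout printed by `smartctl -A`:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smart_attr_text.splitlines():
        fields = line.split()
        # Attribute 5 is the reallocated sector count; RAW_VALUE is column 10.
        if len(fields) >= 10 and fields[0] == "5" and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None


# Hypothetical sample output (values invented for this example).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       24
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13500
"""

print(reallocated_sectors(SAMPLE))
```

Saving the returned number once a day and comparing it with the previous reading shows whether the count is still climbing.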

September 26, 2015

No. The cache drive is not fault-tolerant, so there's no "rebuild" involved. You'll lose everything on the old cache drive ... so be sure to copy anything you've got stored on it to a backup location before you replace it. [You can simply create a folder in your array and save everything there.]

September 26, 2015

Author

Sorry - it's not the individual cache drive. It's part of a cache array, the btrfs. As this is a different raid type than the data array, I wanted to make sure the procedure was the same.

September 26, 2015

Re-read your question, and noted you're not just replacing your cache disk, but are replacing "... a cache disk in a cache array ..."

If you're in fact replacing a single disk in a protected btrfs cache pool, then it is in fact fault-tolerant. In that case I do not know the steps necessary to replace the drive. I suspect you need to first remove the drive from the cache pool [see the process here: http://lime-technology.com/forum/index.php?topic=39774.msg379017#msg379017 ]; then add a new drive to the pool and do another balance operation.

September 26, 2015

Author

Just tried un-assigning the drive from the cache pool and starting the array to see what happens. The cache array was unmountable. Leaves me a little concerned about what happens if the drive actually fails...

Hopefully someone who has done this can chime in.

September 26, 2015

Did you "balance" the pool before trying that? (per JonP's suggestion in the thread I referred to above)

September 27, 2015

Author

Running that right now to test that

September 27, 2015

Author

Well, that was a bad thing to do, apparently... Now one of the drives, not the one that was having reallocated sector issues, no longer spins up. But it isn't being marked as failed, and a bunch of errors are showing up in the log:

Sep 27 12:52:12 Tower kernel: sd 1:0:7:0: [sdr] tag#0 CDB: opcode=0x2a 2a 00 00 00 00 c0 00 00 08 00
Sep 27 12:52:12 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:12 Tower kernel: sd 1:0:7:0: [sdr] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Sep 27 12:52:12 Tower kernel: sd 1:0:7:0: [sdr] tag#0 CDB: opcode=0x2a 2a 00 00 0b b1 e0 00 00 60 00
Sep 27 12:52:12 Tower kernel: sd 1:0:7:0: [sdr] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Sep 27 12:52:12 Tower kernel: sd 1:0:7:0: [sdr] tag#0 CDB: opcode=0x2a 2a 00 00 00 00 c0 00 00 08 00
Sep 27 12:52:12 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:13 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:13 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:13 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:14 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:14 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1
Sep 27 12:52:14 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1

If I stop the array, that drive is marked as an unassigned drive, within that cache pool... But... it's there... assigned...
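For anyone sifting through a long syslog like the excerpt above, a small script can tally which device the btrfs write errors are landing on. This is just a sketch: the regular expression is written against the exact message wording shown in the log lines in this post, and the function name is made up for the example.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: BTRFS: lost page write due to I/O error on /dev/sdr1"
LOST_WRITE = re.compile(r"BTRFS: lost page write due to I/O error on (\S+)")


def count_lost_writes(lines):
    """Count 'lost page write' errors per device path."""
    counts = Counter()
    for line in lines:
        match = LOST_WRITE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the log excerpt in this post.
sample = [
    "Sep 27 12:52:12 Tower kernel: sd 1:0:7:0: [sdr] tag#0 CDB: opcode=0x2a 2a 00 00 00 00 c0 00 00 08 00",
    "Sep 27 12:52:12 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1",
    "Sep 27 12:52:13 Tower kernel: BTRFS: lost page write due to I/O error on /dev/sdr1",
]

for device, hits in count_lost_writes(sample).items():
    print(device, hits)
```

Feeding it the full syslog (for example, the lines of the file opened with `open(path)`) would show whether the errors are confined to one pool member or spread across several.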

September 27, 2015

Clearly you need to contact JonP and see what the best way to proceed is with regards to getting your cache pool working. Based on your problem, and the one discussed in the thread I referred to above, it seems that resolving failed drives in a cache pool is nowhere near as easily done as those in the array.

September 27, 2015

Author

Thanks! I have reached out to him. I think more documentation around the cache pool and dealing with failures is definitely needed.

So... tried to stop the array again, the web interface went for a dump on spinning down drives and crashed. Went to the console, used powerdown to shut down server. Booted back up and... the drive that wasn't doing anything is back in the array, but still generating errors.

Going to try and stop the VMs and copy them off.

EDIT - cache is offline. Although it looks to be up in the GUI, it is offline... cannot access anything stored on cache...

EDIT 2 - looks like I can access content on the cache now, just super slow.

EDIT 3 - VMs showed up, but the Docker tab is missing from the GUI and none are running

Using MC to copy everything off and might just blow the whole thing away and start from scratch. If so, big issue with that and failures...

tower-syslog-20150927-1317.zip

September 27, 2015

Author

This message is displayed on the console:

"mount: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error. In some cases useful info is found in syslog - try dmesg | tail or so"

Ya, cryptic.

September 28, 2015

Author

Have to say I'm pretty upset with how this new cache array is working out. No clear documentation on removing and replacing a drive. Ran a balance and that seems to have pooched the whole thing. One disk is spun down now and won't spin up. Docker won't start, won't even make the img file. And after a couple of days only Gary has replied in here... Have to say, I have lost all confidence in the btrfs array setup at this point. If a drive actually failed, at least from what I am seeing, all hell would break loose. I cannot even stop the array any more, it just hangs, which is a PAIN when trying to troubleshoot. I have a bunch of dockers and VMs and now nada... fun times...

September 28, 2015

Since the btrfs cache pool is a limetech supported feature, I think it is time to email Tom and get him directly involved. The community here has very little experience with btrfs cache pools, and what experience I personally have had trying to help people with them has been frustrating, to say the least. Commands that "should work" end up not working for whatever reason, and the best advice I could end up giving was to salvage the data and rebuild the pool from scratch. On the plus side, I have personally been able to recover data, and help others recover data as well, but not by using the built-in redundancy features.

It's very possible that once we gain the knowledge needed to recover from events like these, that it will become just as smooth as replacing an array disk. At this point in time, however, things are not smooth. It's also possible that something else is at play with your particular situation, and nothing could have been done to help. We just don't know.

September 28, 2015

Fair points regarding the lack of documentation on this. I will work some up to get added to the wiki asap. Now to resolving your current issue.

From your last results, it sounds like you said you recovered everything off the cache pool but your Docker image isn't working properly, is that correct?

Can you upload your system diagnostics file from the Tools -> Diagnostics page?

September 28, 2015

Author

So, yes, I had to rebuild the cache array from scratch....

Some things I noticed - I unassigned all the drives in the pool, started the array and then stopped the array, reassigned my 4 drives back to the pool, hoping that would "wipe" that cache array. It didn't. I ended up doing a new config, reassigning all my drives to the data array, and starting the array without running a parity check. Things working so far. Then assigned one drive to the cache. Funny enough, as it was previously a btrfs formatted drive, it just spun up, but empty. Progress. Assigned the other three drives, now a 4 drive cache array, empty and no errors in the syslog. Yippee

To test, I went to start Docker with a new file. Nada. Won't run. Not sure what is going on at this point. I am clicking around and notice something VERY weird. At some point, a "cache" directory, with nothing in it, was made on my data drives, drives one and four. I have no clue how these got there. I deleted them. Restarted the array, and bam, Docker starts.

I have now copied back my data, started a fresh docker image and reloaded my dockers from the templates and my VMs are up and running with no errors.

Overall this has been the weirdest experience with unRaid I have ever had. Just seemed to be a bunch of errors and problems with no clear solutions, even for just removing a failing drive.

Thanks for looking in, jonp - at this point everything looks to be running ok. That said, how much testing was done in regards to what happens when drives fail, etc? Seems a few others and I could only fix our cache arrays by complete rebuilds. As well, weird how everything went nuts after a rebalance.

Rebalance - is this something that should be run regularly? The command on the web page looks a little weird - -dconvert=raid1 -mconvert=raid1. I am not a Linux guy at all, but it almost looks like two processes being called by the same function? A dconvert and an mconvert?

Cheers
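On the `-dconvert=raid1 -mconvert=raid1` question: in btrfs-progs these are two filters passed to a single `btrfs balance start` run, not two separate processes. `-d` applies to data block groups and `-m` to metadata block groups, and `convert=raid1` asks the balance to rewrite them in the raid1 profile. The sketch below only assembles and prints the command line, it does not run anything. The `/mnt/cache` mount point and the helper name are assumptions for illustration.

```python
import shlex


def balance_convert_argv(mount_point, data_profile="raid1", metadata_profile="raid1"):
    """Build the argument list for one balance run with two convert filters."""
    return [
        "btrfs", "balance", "start",
        f"-dconvert={data_profile}",      # filter for data block groups
        f"-mconvert={metadata_profile}",  # filter for metadata block groups
        mount_point,
    ]


# Assumed mount point of the cache pool; adjust to the actual system.
argv = balance_convert_argv("/mnt/cache")
print(shlex.join(argv))
```

Whether a balance should be run on a schedule is a separate question that this sketch does not answer; it only shows how the two options fit into one command.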

September 28, 2015

JonP => Definitely good that you jumped in here ... and clearly there needs to be FAR better documentation on the use of cache pools. It's not at all clear that btrfs is "ready for prime time" based on the experiences of Whaler_99 and others in a few other threads with cache pool issues. The idea of a fault-tolerant pool is GREAT ... it eliminates the "files aren't protected if you use cache" problem => but it sounds like that protection isn't very reliable in the current implementation [or perhaps it is, but there is a completely different and non-intuitive technique for recovering them].

November 7, 2015

Author

Just following up on this... There still seems to be a complete lack of documentation on this, in regards to support in the event of a failed drive or upgrading existing drives to larger ones (or is that even possible?).

When can we expect some updated information on what is now a VERY important and integral part of unRAID from the LimeTech team?

November 18, 2015

Author

Is there any way to tag the admin team on this thread? There doesn't seem to have been any response to this as of yet or updated documentation.

Although it's a great solution/option, I am frankly pretty nervous now about running dockers, and more so virtual machines, on this cache array when it seems to be a bit less than ideal, has no real documentation for troubleshooting, and is something not many people on the site have a lot of experience with.

First - in an array, how do you add drives, what can you expect? (let's not assume it is the same as the data array)

Second - If you want to upgrade existing drives, how and what can you expect? (Again, we cannot assume here)

Third - in the event of a failing drive, what do you do? (this clearly needs a lot of work simply based on my experience and a few others)

Fourth - in the event of a failed drive, what do you do?

Hopefully we can see this all fleshed out over the coming months. I see 6.1.4 was released and work is ongoing on 6.2, great news.... but how about some updates on the current solution and issues that have been seen?
