[SOLVED] Assistance recovering cache


Hi all. I have a Proliant DL380pG8 running unRAID 6.8.3. I had 2 x 240GB SSDs set up as a cache pool. The other day, one of my cache drives just went offline. I saw a number of errors in unRAID like:
 

Jul 31 16:41:29 UNRAID01 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 4442242, rd 272474, flush 44513, corrupt 0, gen 0

Jul 31 16:41:28 UNRAID01 kernel: BTRFS warning (device sdh1): lost page write due to IO error on /dev/sdh1

Jul 31 16:41:27 UNRAID01 kernel: BTRFS error (device sdh1): error writing primary super block to device 1

But since I still had a healthy drive in the pool and my system was still running, I just ordered a replacement drive.
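For anyone reading log lines like the ones above, here is a minimal Python sketch that pulls the counters out of the `errs:` summary line. The regular expression is modelled only on the line quoted in this post, and `looks_like_io_failure` is a hypothetical heuristic (not part of btrfs or unRAID): `wr`, `rd` and `flush` count failed I/O requests, while `corrupt` and `gen` count checksum and generation mismatches, so large I/O counts with zero `corrupt`/`gen` suggest the device stopped answering rather than returning bad data.

```python
import re

# Pattern modelled on the kernel log line quoted above (an assumption:
# other kernel versions may word the message differently).
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<fs>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTER_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(line):
    """Return (block device, counters dict), or None if the line does not match."""
    match = ERRS_RE.search(line)
    if match is None:
        return None
    counters = {name: int(match.group(name)) for name in COUNTER_NAMES}
    return match.group("bdev"), counters


def looks_like_io_failure(counters):
    """Hypothetical heuristic: I/O errors present, no checksum/generation errors."""
    io_errors = counters["wr"] + counters["rd"] + counters["flush"]
    return io_errors > 0 and counters["corrupt"] == 0 and counters["gen"] == 0


line = (
    "Jul 31 16:41:29 UNRAID01 kernel: BTRFS error (device sdh1): "
    "bdev /dev/sdh1 errs: wr 4442242, rd 272474, flush 44513, corrupt 0, gen 0"
)
bdev, counters = parse_btrfs_errs(line)
print(bdev, counters, looks_like_io_failure(counters))
```

This only reads text that is already in the syslog, so it does not touch the drives themselves.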

Yesterday the new drive arrived and I tried to swap it in. I stopped the array, then removed what I thought was the bad drive. Unfortunately, I actually pulled out the good cache drive instead. "No problem," I thought, "the array is offline, I'll just put that one back in."

 

But when I slotted it back in, it was immediately stuffed as well. The LED indicator on the drive was orange, which according to HP means "The drive has failed", and unRAID couldn't see the drive at all. The same happened when I tried to put the old failed cache drive in.

 

I've ruled out a controller or a drive caddy error - the new drive is recognised by the server and unRAID in both slots. So I tried connecting the two old cache drives into a normal Linux box to see what it says. When I do so, I get the following error: ata1 comreset failed errno=-16.

 

Some Googling tells me that this could simply be due to errors in the partition or maybe the btrfs file system. However, I'm loath to try any repairs at this stage in case I do further damage. My gut tells me this is a data error rather than an actual hardware error, because two drives failing at the same moment would be too much of a coincidence...

 

I do have an appdata backup from the night before, but I'm running a number of mysql databases in Docker containers and it would be a real pain to lose a day's data on them.

 

Thanks in advance. I'd really appreciate any advice anyone could give!

Thanks for the feedback. Yes, the drives are both showing as failed on the server. However, I was hoping that the problem is a locked partition or something like that, rather than a physical failure. The reason I suspect that is the timing. As I say, one failed, but the other was looking fine until I pulled it out (while the array was stopped); as soon as I put it back in, it failed straight away.

When I put the drive in another Linux server, I get the following error: ata1 comreset failed errno=-16. I read elsewhere that this could be a BTRFS or a partition problem.

Even if I can't recreate the cache pool with these drives, I'm hoping I can get the data off them? For example, this post: https://askubuntu.com/questions/62295/how-to-fix-a-comreset-failed-error says to run fsck. But I wanted to check here whether this is a good idea or whether it might be destructive to the drive.

Also, if the drive is part of a cache pool, will it have a full copy of the data on it (like a standard mirror) or are both drives needed to reconstruct the data?

1 hour ago, TheSpook said:

I read elsewhere that this could be a BTRFS or a partition problem.

No, it's a hardware problem.

 

1 hour ago, TheSpook said:

Even if I can't recreate the cache pool with these drives, I'm hoping I can get the data off them?

Not unless you can get them detected.

 

1 hour ago, TheSpook said:

will it have a full copy of the data on it (like a standard mirror) or are both drives needed to reconstruct the data?

Depends on which profile you were using; the default is raid1, so you'd just need one.
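As a rough illustration of the profile point, here is a small Python sketch that reads the text printed by `btrfs filesystem df` on a pool that is still mounted and reports whether every block group type uses a profile that keeps data on more than one device. The sample output is illustrative only (it is not from this server), and `REDUNDANT_PROFILES` and `one_device_is_enough` are hypothetical names made up for this example. It cannot help with a drive that is not detected at all.

```python
# Profiles that keep data readable with one device missing.
# This set is an assumption for the sketch; "single", "raid0" and "dup"
# are deliberately left out because they do not protect against a lost device.
REDUNDANT_PROFILES = {"RAID1", "RAID1C3", "RAID1C4", "RAID10", "RAID5", "RAID6"}


def parse_profiles(df_output):
    """Map each block group type (Data, Metadata, System, ...) to its set of profiles."""
    profiles = {}
    for line in df_output.splitlines():
        head, sep, _ = line.partition(":")
        if not sep or "," not in head:
            continue  # not a "Type, PROFILE: total=..., used=..." line
        kind, profile = (part.strip() for part in head.split(",", 1))
        profiles.setdefault(kind, set()).add(profile.upper())
    return profiles


def one_device_is_enough(profiles):
    """True only if Data, Metadata and System all use redundant profiles."""
    for kind in ("Data", "Metadata", "System"):
        found = profiles.get(kind)
        if not found or not found <= REDUNDANT_PROFILES:
            return False
    return True


# Illustrative sample only, in the format printed by `btrfs filesystem df`.
sample = """\
Data, RAID1: total=100.00GiB, used=80.00GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=1.00GiB
GlobalReserve, single: total=100.00MiB, used=0.00B
"""
profiles = parse_profiles(sample)
print(profiles, one_device_is_enough(profiles))
```

Checking this while the pool is healthy tells you in advance whether a single surviving member would hold a full copy.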

OK - sounds like I have to bite the bullet and restore the backup. So here is my plan of attack:

 

1. Tell the system to stop using a cache drive (how do I do that?)

2. Put the new SSD in and build a new cache

3. Restore the Appdata to the new cache

4. Will my docker containers magically come back (at the moment I'm seeing no containers because they were all in appdata)?

5. Order a second SSD for redundancy on the cache again (not that it really helped last time!)

 

Thanks again - I really appreciate the assistance.

  • JorgeB changed the title to [SOLVED] Assistance recovering cache
