[SOLVED] Assistance recovering cache

August 5, 2020

Hi all. I have a ProLiant DL380p G8 running unRAID 6.8.3. I had 2 x 240GB SSDs set up as a cache pool. The other day, one of my cache drives just went offline. I saw a number of errors in unRAID like:

Jul 31 16:41:29 UNRAID01 kernel: BTRFS error (device sdh1): bdev /dev/sdh1 errs: wr 4442242, rd 272474, flush 44513, corrupt 0, gen 0

Jul 31 16:41:28 UNRAID01 kernel: BTRFS warning (device sdh1): lost page write due to IO error on /dev/sdh1

Jul 31 16:41:27 UNRAID01 kernel: BTRFS error (device sdh1): error writing primary super block to device 1

But since I still had a healthy drive in the pool and my system was still running, I just ordered a replacement drive.
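For anyone reading log lines like the ones above: the counters on the "bdev ... errs" line are btrfs's per-device error totals. wr, rd and flush count I/O errors, while corrupt and gen count checksum and generation mismatches. Here is a minimal Python sketch that pulls them out of such a line. The function name parse_bdev_errs is made up for illustration and is not part of unRAID or btrfs.

```python
import re

# Matches the per-device counters btrfs prints in kernel log lines such as
# "bdev /dev/sdh1 errs: wr 4442242, rd 272474, flush 44513, corrupt 0, gen 0".
BDEV_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_bdev_errs(line):
    """Return (device, counters) for a btrfs 'bdev ... errs' line, or None.

    Hypothetical helper for illustration only.
    """
    m = BDEV_ERRS.search(line)
    if m is None:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters


# The log line quoted in this post.
line = (
    "Jul 31 16:41:29 UNRAID01 kernel: BTRFS error (device sdh1): "
    "bdev /dev/sdh1 errs: wr 4442242, rd 272474, flush 44513, corrupt 0, gen 0"
)
device, counters = parse_bdev_errs(line)
print(device, counters)
```

In the line from this post, the write, read and flush counters are in the thousands to millions while corrupt and gen are both 0, which is what you would expect when the device itself stops answering I/O rather than returning bad data.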

Yesterday the new drive arrived and I tried to replace it. I stopped the array, then removed what I thought was the bad drive. Unfortunately, I actually pulled out the good cache drive instead. "No bad" I thought "the array is offline, I'll just put that one back in".

But when I slotted it back in, it was immediately also stuffed. The LED indicator on the drive was orange, which according to HP means "The drive has failed", and unRAID couldn't see the drive at all. The same happened when I tried to put the old failed cache drive in.

I've ruled out a controller or a drive caddy error - the new drive is recognised by the server and unRAID in both slots. So I tried connecting the two old cache drives into a normal Linux box to see what it says. When I do so, I get the following error: ata1 comreset failed errno=-16.

Some Googling tells me that this could simply be due to errors in the partition or maybe the btrfs file system. However, I'm loath to try any repairs at this stage in case I do further damage. My gut tells me this is a data error rather than an actual hardware error, because the coincidence would be too high...

I do have an appdata backup from the night before, but I'm running a number of mysql databases in Docker containers and it would be a real pain to lose a day's data on them.

Thanks in advance. I'd really appreciate any advice anyone could give!

August 5, 2020

Author

I should also add that I have tried the steps listed here:

The problem is, the drives aren't showing at all. Even if I run fdisk -l, I can't see the two old cache drives.
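As a cross-check on fdisk -l, here is a minimal Python sketch that lists the block devices the Linux kernel has registered under /sys/block. If a drive does not appear there either, the kernel never detected it, so no partition or file system tool can reach it. The function name list_block_devices and its sys_block parameter are hypothetical, chosen for illustration.

```python
import os


def list_block_devices(sys_block="/sys/block"):
    """Return the names of disks the kernel has registered, e.g. ['sda', 'sdb'].

    Hypothetical helper: it only reads directory names under sysfs.
    """
    try:
        names = sorted(os.listdir(sys_block))
    except FileNotFoundError:
        # Not a Linux system, or sysfs is not mounted.
        return []
    # Skip loop and RAM pseudo-devices, which are not physical drives.
    return [n for n in names if not n.startswith(("loop", "ram"))]


if __name__ == "__main__":
    for name in list_block_devices():
        print(name)
```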

Edited August 5, 2020 by TheSpook

August 5, 2020

Community Expert

Please post the diagnostics: Tools -> Diagnostics

August 5, 2020

Author

Sure. Attached. Thanks for helping!

unraid01-diagnostics-20200805-1749.zip

August 5, 2020

Community Expert

The diags were taken after rebooting, so I can't see what happened, but the cache devices are not being detected. That's a hardware issue, and you need to fix that first.

August 5, 2020

Author

Thanks for the feedback. Yes, the drives are both showing as failed on the server. However, I was hoping that the problem is one of a locked partition or something like that, rather than a physical failure. The reason I suspect that is the timing. As I say, one failed, but the other was looking fine until I pulled it out (while the array was stopped), and as soon as I put it back in, it failed straight away.

When I put the drive in another Linux server, I get the following error: ata1 comreset failed errno=-16. I read elsewhere that this could be a btrfs or a partition problem. Even if I can't recreate the cache pool with these drives, I'm hoping I can get the data off them? For example, this post: https://askubuntu.com/questions/62295/how-to-fix-a-comreset-failed-error says to run fsck. But I wanted to check here whether this is a good idea or whether it might be destructive to the drive.

Also, if the drive is part of a cache pool, will it have a full copy of the data on it (like a standard mirror) or are both drives needed to reconstruct the data?

August 5, 2020

Community Expert

1 hour ago, TheSpook said:

I read elsewhere that this could be a btrfs or a partition problem.

No, it's a hardware problem.

1 hour ago, TheSpook said:

Even if I can't recreate the cache pool with these drives, I'm hoping I can get the data off them?

Not unless you can get them detected.

1 hour ago, TheSpook said:

will it have a full copy of the data on it (like a standard mirror) or are both drives needed to reconstruct the data?

Depends on which profile you were using. The default is raid1, so you'd just need one.
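For future reference, on a pool that is still mounted and working, the profile shows up in the output of btrfs filesystem df, where each line starts with the block group type followed by its profile. Below is a minimal Python sketch that reads the profiles out of that text. The function name parse_profiles is hypothetical, and the sample output uses invented sizes purely to illustrate the line format. It is not taken from this system.

```python
import re

# Each line of `btrfs filesystem df <mountpoint>` starts with
# "<type>, <profile>:", for example "Data, RAID1: total=..., used=...".
PROFILE_LINE = re.compile(r"^(?P<kind>\w+), (?P<profile>\w+):")


def parse_profiles(df_output):
    """Map block group type (Data, Metadata, ...) to its profile name.

    Hypothetical helper for illustration only.
    """
    profiles = {}
    for line in df_output.splitlines():
        m = PROFILE_LINE.match(line.strip())
        if m:
            profiles[m.group("kind")] = m.group("profile")
    return profiles


# Illustrative sample only: the sizes are invented, not from this server.
sample = """\
Data, RAID1: total=30.00GiB, used=21.50GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=410.00MiB
GlobalReserve, single: total=16.00MiB, used=0.00B
"""
profiles = parse_profiles(sample)
print(profiles)
```

If Data and Metadata both report RAID1, each device holds a full copy and either one alone is enough to recover from, provided it can be detected.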

August 5, 2020

Author

OK - sounds like I have to bite the bullet and restore the backup. So is this my plan of attack:

1. Tell the system to stop using a cache drive (how do I do that?)

2. Put the new SSD in and build a new cache

3. Restore the Appdata to the new cache

4. Will my docker containers magically come back (at the moment I'm seeing no containers because they were all in appdata)?

5. Order a second SSD for redundancy on the cache again (not that it really helped last time!)

Thanks again - I really appreciate the assistance.

August 5, 2020

Community Expert

38 minutes ago, TheSpook said:

Tell the system to stop using a cache drive (how do I do that?)

Since there's no cache it won't be used, though if you're using /mnt/cache paths for anything it will try to use them and fail.

August 6, 2020

Author

Thanks for your help. I ended up using Backup-Restore Appdata to get back the copy from the night before. Not too much lost, but a good lesson that next time I should do a manual backup before replacing a failed drive.
