
Need help replacing a cache disk


Solved by JorgeB


Hi,

 

I had a cache pool of two 1TB NVMe devices: a Kingston A2000 and a Samsung PM981a. Because the Kingston gets disconnected now and then, I decided to replace it with a new 1TB drive (WD SN770). I figured it'd be easy since they are all 1TB and the cache was RAID 1, so the data is on the PM981a as well. So I did the following:

- Stop VM and Docker services

- Shutdown server

- Replace A2000 with SN770 in server

- Replace A2000 with SN770 in the cache pool assignment

- Start array

 

After this, it asked me to format. However, it showed it had to format both the new SN770 and the old PM981a. I figured this was some sort of mistake, since it should just copy the contents of the PM981a over to the new SN770 in the array. However, it actually did format both devices.

 

I still have the A2000 unwiped, so I'm hoping I can restore all data from that SSD. What would be the best way to do this? I removed the PM981a from the system and put the A2000 back in. Now I have the A2000 as nvme1n1 and the SN770 as nvme0n1. Somehow I can't seem to mount the old A2000 as an unassigned device; it says the filesystem is incorrect:

Mount of 'nvme1n1p1' failed: 'mount: /mnt/disks/50026B768465C999: wrong fs type, bad option, bad superblock on /dev/nvme1n1p1, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call.
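One possible explanation, which is an assumption and not something confirmed in this thread: a btrfs RAID 1 member whose partner device is absent refuses a normal mount and reports this generic error, while btrfs does accept a `degraded` mount option for that case. The sketch below only builds the read-only inspection and mount commands as lists so they can be reviewed before anything is run. The mount point `/mnt/recovery` is a hypothetical name.

```python
import shlex


def degraded_mount_commands(partition, mountpoint):
    """Build (but do not run) commands to inspect and mount a lone btrfs pool member.

    Assumes the partition still holds a btrfs filesystem that was part of a
    two-device pool. Nothing here touches the disk.
    """
    return [
        # Show what filesystem signature is present on the partition.
        ["blkid", partition],
        # List the devices btrfs believes belong to this filesystem.
        ["btrfs", "filesystem", "show", partition],
        # Read-only mount that tolerates a missing pool member.
        ["mount", "-t", "btrfs", "-o", "ro,degraded", partition, mountpoint],
    ]


if __name__ == "__main__":
    # /mnt/recovery is a hypothetical mount point chosen for illustration.
    for cmd in degraded_mount_commands("/dev/nvme1n1p1", "/mnt/recovery"):
        print(shlex.join(cmd))
```

Mounting read-only keeps the old pool member unchanged while its contents are checked.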

 

Am I screwed or can we still fix this?

Running Unraid 6.12.1 by the way.

Edited by Ruuddie

I don't really understand what you were trying to do. You mentioned replacing the Kingston with the WD, but the diags only show those two connected; where is the Samsung? To do a normal replacement you would just assign the new device in place of the old one, leaving the Samsung alone.

 

Disconnect the WD and post new diags with just the Kingston and the Samsung connected.


The first step was a reboot to disconnect the Kingston and connect the WD SN770 instead (as that is my go-to setup). After that failed, I rebooted and switched to having the Kingston and the WD SN770 installed (I figured the Samsung might still have some files on it, so I decided to keep it offline as a backup). This second configuration is what I sent diags of; unfortunately I don't have any diags of the first step.

 

I'm back on the Kingston and Samsung now and included the diags of this setup, but I don't want to just assign them to the cache and pray. How should I do this?

Both the Kingston and the Samsung are unassigned devices now, because of my earlier attempts at swapping them around.

unraid-diagnostics-20230701-1238.zip

Edited by Ruuddie
  • Solution

I'm not sure how you arrived at this scenario, since the previous diags are missing, but currently the Samsung and the WD NVMe are both part of the same pool. It's a new pool, so it won't have any of your data. The Kingston is the remaining member of the old pool; depending on how the pool was configured, it might still be recoverable. Try this:

 

-start the array without any device assigned to the pool so it can be reset

-stop array

-assign only the Kingston NVMe to the pool and start array

 

Post new diags after that.
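To see for yourself which devices belong to which pool before reassigning anything, the output of `btrfs filesystem show` can be grouped by filesystem UUID. The sketch below parses that output in Python. The sample text, UUIDs, device paths, and sizes are all made up for illustration and are not taken from the diagnostics in this thread.

```python
import re

# Hypothetical sample in the layout printed by `btrfs filesystem show`.
SAMPLE = """\
Label: none  uuid: 11111111-aaaa-bbbb-cccc-000000000001
\tTotal devices 2 FS bytes used 144.00KiB
\tdevid    1 size 931.51GiB used 2.03GiB path /dev/nvme0n1p1
\tdevid    2 size 931.51GiB used 2.03GiB path /dev/nvme2n1p1

Label: none  uuid: 22222222-aaaa-bbbb-cccc-000000000002
\tTotal devices 2 FS bytes used 350.00GiB
\tdevid    1 size 931.51GiB used 400.00GiB path /dev/nvme1n1p1
\t*** Some devices missing
"""


def pools_by_uuid(text):
    """Group device paths by filesystem UUID and flag pools with missing members."""
    pools = {}
    current = None
    for line in text.splitlines():
        uuid_match = re.search(r"uuid:\s*(\S+)", line)
        if uuid_match:
            # A new filesystem section starts at every "uuid:" line.
            current = {"paths": [], "missing": False}
            pools[uuid_match.group(1)] = current
            continue
        if current is None:
            continue
        path_match = re.search(r"\bpath\s+(\S+)", line)
        if path_match:
            current["paths"].append(path_match.group(1))
        elif "devices missing" in line.lower():
            current["missing"] = True
    return pools


if __name__ == "__main__":
    for uuid, info in pools_by_uuid(SAMPLE).items():
        state = "member missing" if info["missing"] else "complete"
        print(uuid, info["paths"], state)
```

In the made-up sample, two devices share one UUID (a freshly created pool) and a single device sits alone under another UUID with a missing member, which is the kind of split described above.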


Cool, it does seem to recognise data on the Kingston! It's doing a BTRFS operation now. I see the used data going down slowly (started at 400GB, down to 333GB now), I hope it's not removing my files 😛 But the used and free data combined (used 350GB and free 300GB) is lower than my 1TB storage, so I hope it's just doing some background wizardry and it's all good in a couple of hours.

 

If it works, I'll first migrate all data to my normal disks, put in my new (empty) cache NVMe drives and move it all back. Not going to risk the direct swap again.

unraid-diagnostics-20230701-1321.zip

Edited by Ruuddie
9 minutes ago, Ruuddie said:

It's doing a BTRFS operation now.

This is normal: it's doing a balance to remove the missing device. The data will all be there, at least everything that was there when this device was last used with the pool.

 

When the balance finishes you can add the Samsung NVMe device to the pool to create a mirror. Then, if you still want to replace the Kingston, just do a direct replacement; there's no need to stop the Docker/VM services.


Thanks a lot for your help! Removing all devices from the cache pool, starting it, stopping it, and adding back the A2000 with all the data on it worked well.

I decided not to take any risks and just moved all files over from the cache to the array. This process actually failed 3 times on a corrupted file. Each time I had to reboot the server (I couldn't stop the mover) and repeat the trick of removing the device from the cache pool, starting, stopping and adding it again, because the array wouldn't start with it assigned.
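For anyone in the same spot, a copy that records unreadable files and carries on can avoid stalling on one bad file. The following is a minimal sketch and not how Unraid's mover works. The function name and the example paths are hypothetical, and a drive that drops off the bus entirely would still interrupt it.

```python
import os
import shutil


def copy_tree_skipping_errors(src_root, dst_root):
    """Copy every file under src_root into dst_root.

    Files that cannot be read or written are skipped and returned as a list
    of (source_path, error_message) tuples instead of aborting the run.
    """
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                # copy2 also preserves timestamps and permission bits.
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError as exc:
                # I/O errors from a failing SSD surface here as OSError.
                failed.append((src, str(exc)))
    return failed


if __name__ == "__main__":
    # Hypothetical source and destination; adjust to your own shares.
    for path, reason in copy_tree_skipping_errors("/mnt/cache/appdata", "/mnt/disk1/appdata"):
        print("skipped:", path, "->", reason)
```

The returned list doubles as a record of which files need restoring from another backup.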

 

I guess my A2000 is half dead, which probably was the cause of all my issues in the first place (random disconnects).

 

Thanks again for your tips. I just finished moving all data from the cache SSD to my disk array. Now I'll build a brand new cache pool (great moment to switch to ZFS!) and move the data back.
