dan4UR Posted April 27, 2022
Hello guys, today I went to the Unraid Dashboard and I don't know what's going on. Every Monday I receive an array health report, and so I did on Monday the 25th: everything just fine [PASS]. Now I noticed that the Cache 1 NVMe is gone ... and my shares are partially unprotected. What's going on here? I rebooted my rig hoping it would help, but it's still the same.
apollon-diagnostics-20220427-0803.zip
JorgeB Posted April 27, 2022
Is there supposed to be a cache1 NVMe device? If yes, it's not being detected at the hardware level.
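For anyone wanting to verify this from the Unraid console, two read-only commands show whether the kernel enumerated the device at all (nothing here modifies state):

lspci | grep -i 'non-volatile'    # NVMe controllers visible on the PCIe bus
ls -l /dev/nvme*                  # device nodes the nvme driver created

If the controller does not even appear in the lspci output, the problem sits below the OS: the M.2 slot, power, or the drive itself.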
dan4UR Posted April 27, 2022
49 minutes ago, JorgeB said: Is there supposed to be a cache1 NVMe device? If yes, it's not being detected at the hardware level.
That's the point. I have two identical WD NVMe drives set up as a cache pool, running for more than a year. Now the Cache 1 drive is no longer detected ...
itimpi Posted April 27, 2022
1 hour ago, dan4UR said: Now the Cache 1 drive is no longer detected ...
Probably means it has failed.
dan4UR Posted April 27, 2022
That's what I think. I'll shut down my rig and connect a monitor, since it's headless. Maybe I'll find some information in the BIOS.
dan4UR Posted May 10, 2022
Hello, I had some big problems with my machine. I've got an Enermax AIO water cooler and this little guy just died on me, which is how I found my server shut down. I started it again, and right after booting, on the Dashboard, I saw my lost cache NVMe. It was listed under UD 😀 (yeah, it's alive). Then I saw my cache is: Unmountable: no filesystem. Right after I realized this, the whole machine hard shut down again. Went to the cellar and confirmed: yes, my server is offline. Took it and connected it to a monitor to check the BIOS. Oh hell. The CPU temp went within seconds from 30°C to something like 109°C, and then it shut down. Checked the water cooler, disassembled it, cleaned the CPU and cooler block, applied new thermal paste and booted again. Same result. Then I changed to the boxed air cooler and et voilà, the CPU is chilling at 42°C 😇
But what to do now with my broken cache pool? Disk 2 is still part of the pool, but disk 1 sits under unassigned devices. Should I still check whether NVMe 1 is broken? SMART doesn't show any errors. The only information I found was for Dev 1 (cache pool NVMe 1):
How can I restore the whole thing? All my Docker and download temp data is/was stored in the cache pool ... 😪
JorgeB Posted May 11, 2022
Please post the diagnostics.
dan4UR Posted May 11, 2022
6 hours ago, JorgeB said: Please post the diagnostics.
Hey @JorgeB, two diagnostics: the first after a fresh reboot with the array stopped (...-1536.zip), the second after manually starting the array (...-1537.zip). Dashboard screen:
I hope someone can help me find a solution for rescuing or rebuilding my cache pool.
apollon-diagnostics-20220511-1536.zip apollon-diagnostics-20220511-1537.zip
JorgeB Posted May 11, 2022
May 11 15:36:33 Apollon kernel: BTRFS error (device nvme0n1p1): super_num_devices 1 mismatch with num_devices 1 found here
The superblock is corrupt, and the other pool member is way out of sync, though that's expected if it dropped offline earlier. You can try a backup superblock to see if it works: stop the array and type:
btrfs-select-super -s 1 /dev/nvme0n1p1
Then reboot and post new diags.
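Some background on what this command does: btrfs stores backup copies of the superblock at fixed offsets (64 MiB, and 256 GiB on large enough devices), and btrfs-select-super -s 1 overwrites the damaged primary with backup copy 1. The copies can be inspected non-destructively first, with the array stopped:

btrfs inspect-internal dump-super -s 1 /dev/nvme0n1p1    # print backup copy 1 only
btrfs inspect-internal dump-super -fa /dev/nvme0n1p1     # print all copies in full

These only read from the device, so they are safe to run before committing to the restore.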
dan4UR Posted May 11, 2022
And here we go, see diags after the reboot.
apollon-diagnostics-20220511-2141.zip
Edit: Or should I also start the array?
JorgeB Posted May 12, 2022
11 hours ago, dan4UR said: Or should I also start the array?
Yes, sorry, forgot to mention that.
dan4UR Posted May 12, 2022
And here is the new file, but still the same 🤔
apollon-diagnostics-20220512-0951.zip
JorgeB Posted May 12, 2022
You can try this: physically disconnect the other NVMe device, the one that is currently unassigned, and try again.
dan4UR Posted May 12, 2022
16 minutes ago, JorgeB said: You can try this: physically disconnect the other NVMe device, the one that is currently unassigned, and try again.
Okay, will try it. Not that nice, since the NVMes sit under a heavy heat spreader on my mainboard. But hopefully it will work after that 🤗
JorgeB Posted May 12, 2022
If the device is in its own IOMMU group and can be bound to vfio-pci, that's also an option; the important part is that the device is not visible to Unraid.
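As a sketch of what that binding looks like at the kernel level (the PCI address 0000:04:00.0 is a made-up example; find the real one with lspci):

echo vfio-pci > /sys/bus/pci/devices/0000:04:00.0/driver_override   # steer future probes to vfio-pci
echo 0000:04:00.0 > /sys/bus/pci/drivers/nvme/unbind                # detach the nvme driver
echo 0000:04:00.0 > /sys/bus/pci/drivers_probe                      # re-probe; vfio-pci claims the device

On Unraid 6.9+ the same result can be had without the shell by ticking the device under Tools > System Devices and rebooting; either way the controller stays powered, but no /dev/nvme* node is created for it.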
dan4UR Posted May 12, 2022
Or maybe it's possible to deactivate the device in the BIOS? Or would it still be visible to Unraid?
JorgeB Posted May 12, 2022
It should be possible to disable it in the BIOS, but only if that M.2 slot is shared with a PCIe slot and you can set which one to use.
dan4UR Posted May 12, 2022
You know what, after my home office session I'll disconnect it from the MB and give it a try. When it's done, do I just boot and start the array, or is there anything else to do?
JorgeB Posted May 12, 2022
1 minute ago, dan4UR said: just boot and start the array
This, just check that you disconnected the correct one.
dan4UR Posted May 12, 2022
Just now, JorgeB said: This, just check that you disconnected the correct one.
That will be the hardest part of the operation 🤣
dan4UR Posted May 20, 2022
Hey, back again. I did not have much time the last few days, but I think I have some partially good news. Disconnected the NVMe placed under UD and started my rig. But what to do now?
Delete dev1 under Historical Devices? (I think that way, when I put it back in, Unraid shouldn't place it under UD again.)
Stop the array and set the number of disks of the cache pool to 1, then back up my data/move it to the array? Or should I back up/move my data to the array now?
I think first I'll wait for @JorgeB to look at my diagnostics and confirm everything is OK.
apollon-diagnostics-20220520-1035.zip
JorgeB Posted May 20, 2022 (Solution)
That's good news! UD historical devices don't really matter for this, but you can remove it now or later. I assume you plan to re-add the other device to the pool? If yes, first make sure backups are up to date, then you'll need to wipe the other device before adding it back to the pool. You can do it like this:
-check that array auto start is disabled, shut down the server
-reconnect the other NVMe device
-power on the server, don't start the array
-wipe the unassigned device with: blkdiscard /dev/nvme#n1 (replace # with the correct number; not sure if 6.9.2 needs -f for blkdiscard when existing data is detected, if yes use it)
-assign it back to the pool
-start the array to begin the balance
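Because blkdiscard irreversibly discards everything on the device, it is worth confirming which nvme number the reconnected drive got before running it. The serial number is the safest match; /dev/nvme1 and /dev/nvme1n1 below are only example names:

lsblk -o NAME,SIZE,MODEL,SERIAL    # map nvmeXn1 names to models and serials
smartctl -i /dev/nvme1             # cross-check the serial of the wipe candidate
blkdiscard -f /dev/nvme1n1         # then discard; add -f only if it refuses without it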
dan4UR Posted May 20, 2022
12 minutes ago, JorgeB said: (...) I assume you plan to re-add the other device to the pool?
That's correct.
12 minutes ago, JorgeB said: If yes, first make sure backups are up to date
Is it enough to change the share settings from "Cache: Only" to "Yes" and start the mover? I'll also shut down any Docker/VM from the Settings tab, and maybe run a CA Backup. Losing some temporary files is no big deal, since I can get them back, but Docker instances with all their settings would be horror to lose.
12 minutes ago, JorgeB said: (...) -wipe the unassigned device with: blkdiscard /dev/nvme#n1 (...)
After I've added the freshly wiped NVMe back to the cache pool, is it just a matter of setting the old cache shares back to "Only" and starting the mover?
JorgeB Posted May 20, 2022
IMHO moving everything to the array and back is overkill; just make sure anything important like appdata is backed up. You should always have backups of anything important, redundancy is not a substitute. When you add the device it will keep the existing pool data, and you don't even need to shut down Docker/VMs, they can stay online; the data is just replicated to the other device.
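The replication mentioned here happens as a btrfs balance back to the pool's raid1 profile once the array starts, and it can be watched from the console (/mnt/cache is Unraid's default mount point for the cache pool):

btrfs balance status /mnt/cache     # progress while the re-replication runs
btrfs filesystem show /mnt/cache    # both NVMe devices should be listed with data
btrfs filesystem df /mnt/cache      # Data/Metadata should report RAID1 when done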
dan4UR Posted May 20, 2022
2 hours ago, JorgeB said: (...) -wipe the unassigned device with: blkdiscard /dev/nvme#n1 (...)
Had to use it with -f.
2 hours ago, JorgeB said: -assign it back to the pool
Now the drive is still listed under UD. Should I just add it back to the cache pool, or hit the Format button first?