
Cache drive mounted read-only - Docker and VMs do not start.



Hello world,

Since the first reboot after upgrading to 6.12.1, I'm facing an issue where one of my cache drives is marked as read-only (reported by the "Fix Common Problems" plugin). Both cache drives are mounted and there is enough space available. Any idea how to fix this? The data seems to be available, since I run them in a RAID 1 configuration. After an additional reboot the drive was not recognized, but it came back after yet another reboot. Both SSDs are mounted on a PCIe card in an HP MicroServer Gen10 and haven't had any issues in years...

Nevertheless, my Docker containers and VM (running HASS) will not start any more, and I would really like to get them running again with the available data.

I appreciate any help, as I'm more of a user than an expert :)

 

[screenshot attached]

onkel-diagnostics-20230627-2052.zip


One of the NVMe devices dropped offline:

 

Jun 27 20:39:08 ONKEL kernel: nvme nvme1: Abort status: 0x0
Jun 27 20:39:08 ONKEL kernel: nvme nvme1: I/O 162 (I/O Cmd) QID 2 timeout, aborting
Jun 27 20:39:08 ONKEL kernel: nvme nvme1: Abort status: 0x0
Jun 27 20:39:35 ONKEL kernel: Uhhuh. NMI received for unknown reason 24 on CPU 0.
Jun 27 20:39:35 ONKEL kernel: Dazed and confused, but trying to continue
Jun 27 20:39:38 ONKEL kernel: nvme nvme1: I/O 0 QID 2 timeout, reset controller
Jun 27 20:39:47 ONKEL kernel: nvme nvme1: failed to set APST feature (2)
Jun 27 20:39:47 ONKEL kernel: nvme nvme1: 4/0/0 default/read/poll queues
Jun 27 20:40:17 ONKEL kernel: nvme nvme1: I/O 0 QID 2 timeout, disable controller
Jun 27 20:40:53 ONKEL kernel: nvme nvme1: I/O 14 QID 0 timeout, disable controller

 

Since your pool is not redundant, it went read-only. Power cycle the server (not just a reboot) and post new diags after array start.
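
(For reference, a quick way to double-check the read-only state and the btrfs per-device error counters from the Unraid terminal; the /mnt/cache mount point is an assumption based on the default cache pool name.)

# "ro" in the options column means the filesystem was forced read-only
grep /mnt/cache /proc/mounts

# per-device error counters (write/read/flush/corruption/generation)
btrfs device stats /mnt/cache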


Hi JorgeB,

OK, here are the diags. I powered the server off and started it again. Unraid came up but didn't start the array automatically; I had to start it manually, and starting took quite a while. After the array started I created the diags. I hope this will help us get further.

 

"Since your pool is not redundant" - The pool isn't but the drives should. Is it a better way to clone the pool instead of the drives? Thought I'm save using two redundant SSDs in one pool :S

 

Thanks in advance!

onkel-diagnostics-20230628-1950.zip


It dropped again 3 minutes after the pool mounted; try re-seating the device or swapping slots with the other one.

 

26 minutes ago, onkelsmily said:

The pool isn't, but the drives should be.

It's using the single profile, not the default raid1, so when one device drops, as is happening here, the pool goes read-only. You can convert to raid1 if you can get the other device to stay up.
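
(A minimal sketch of that conversion from the command line, assuming the pool is mounted at /mnt/cache and both devices are online and writable; the same conversion is normally available from the pool's Balance options in the GUI.)

# convert both data and metadata to the raid1 profile
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

# verify the resulting profiles
btrfs filesystem usage /mnt/cache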

 

 


Hello,

I swapped the SSDs on the QNAP card, but it seems to be the same behaviour :(

Does this mean the SSD is damaged and I'm not able to restore? Where can I see which profile I used? The pool size was just one SSD, and I thought I had set up the pool correctly...

I'm starting to worry about my Docker configurations :S


One more thing to report: after the system failed to recognize the SSD one time, I did a New Config to assign the drives to the correct places again. Could this be an issue, or is it irrelevant...?

 

After I set the basic disk settings back to btrfs and auto-start, the cache is now unmountable due to an unsupported or no file system... very ugly.

 

[screenshot attached]


root@ONKEL:~# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Patriot M.2 P300 512GB                  
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
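
(The output above lists no partitions at all. As a last-resort check, one could look for any surviving filesystem signature on the raw device; this is purely speculative and may well return nothing.)

# any remaining partition/filesystem signatures?
blkid /dev/nvme0n1

# does btrfs still find a superblock on the device?
btrfs filesystem show /dev/nvme0n1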


OK, this is really strange, because I didn't work on the server except for the upgrade. The only place where this could have happened is the mentioned "New Config" assignment when the drive failed the first time. Could that be the reason why the drive was wiped - because before that I still saw the orange folders on the remaining drive? If not, I really don't know where this could have happened. If that is the case, I assume recovery of the data will also not work :S

Is there a manual on how to set up the cache drives in RAID 1 correctly? If I need to reinstall everything from scratch, I'd like to do it correctly this time.

Thanks :)

 

 


OK, then I'm really confused, because there was no error message or warning, and originally I was sure that I had configured RAID 1, which, as you mentioned, is not the current state. I really don't know where this could have happened :(

Is there any possibility that the data remains on the other drive (nvme0n1) and that the cache just doesn't start because of the failing disk? Before I format everything I'd like to be sure about that. Is there a way to remove the failing disk from the pool and load only the other one?

 

This is the error log from the disk; it may be telling us the same thing we already identified...
[screenshot of the disk error log]


If you look at the original diags you can see how the pool was configured:

 

                  Data     Metadata  System                              
Id Path           single   single    single   Unallocated Total     Slack
-- -------------- -------- --------- -------- ----------- --------- -----
 2 /dev/nvme0n1p1 60.00GiB   1.00GiB 32.00MiB   415.91GiB 476.94GiB     -
 3 /dev/nvme1n1p1  1.00GiB         -        -   475.94GiB 476.94GiB     -
-- -------------- -------- --------- -------- ----------- --------- -----
   Total          61.00GiB   1.00GiB 32.00MiB   891.85GiB 953.88GiB 0.00B
   Used           54.25GiB 475.09MiB 16.00KiB  

 

Not sure how the pool ended up like this, but it's not normal. First, note that you have devices 2 and 3; device #1 isn't there, so a device was removed or replaced at some point. Then notice that both data and metadata use the single profile; when a user balances a pool to the single (non-redundant) profile, metadata normally stays raid1 (or DUP, depending on the Unraid version). In any case, note that the other NVMe device, the one that's empty now (nvme0n1), held almost all the data and all the metadata; the remaining device has at most 1GiB of data and no metadata. Sorry, but I don't see any way of recovering data from this pool without the other device.
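
(For completeness, the device numbering can also be checked directly on a mounted pool; a sketch assuming the pool is mounted at /mnt/cache.)

# list member devices with their devid numbers; a gap in the numbering
# (e.g. no devid 1) indicates a device was removed or replaced earlier
btrfs filesystem show /mnt/cache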

