
Cache drive mounted read-only - Docker and VMs do not start.



Hello world,

Since the first reboot after upgrading to 6.12.1, I'm facing an issue where one of my cache drives is marked as read-only (reported by the "Fix Common Problems" plugin). Both cache drives are mounted and there is enough space available. Any idea how to fix this? The data seems to be available, since I run them in a RAID 1 configuration. After an additional reboot the drive was not recognized, but it came back after yet another reboot. Both SSDs are mounted on a PCIe card in an HP MicroServer Gen10 and haven't had any issues in years...

Nevertheless, my Docker containers and VM (running HASS) will not start any more, and I would really like to get them running again with the available data.

I appreciate any help, as I'm more of a user than an expert :)

 

[screenshot attached]

onkel-diagnostics-20230627-2052.zip


One of the NVMe devices dropped offline:

 

Jun 27 20:39:08 ONKEL kernel: nvme nvme1: Abort status: 0x0
Jun 27 20:39:08 ONKEL kernel: nvme nvme1: I/O 162 (I/O Cmd) QID 2 timeout, aborting
Jun 27 20:39:08 ONKEL kernel: nvme nvme1: Abort status: 0x0
Jun 27 20:39:35 ONKEL kernel: Uhhuh. NMI received for unknown reason 24 on CPU 0.
Jun 27 20:39:35 ONKEL kernel: Dazed and confused, but trying to continue
Jun 27 20:39:38 ONKEL kernel: nvme nvme1: I/O 0 QID 2 timeout, reset controller
Jun 27 20:39:47 ONKEL kernel: nvme nvme1: failed to set APST feature (2)
Jun 27 20:39:47 ONKEL kernel: nvme nvme1: 4/0/0 default/read/poll queues
Jun 27 20:40:17 ONKEL kernel: nvme nvme1: I/O 0 QID 2 timeout, disable controller
Jun 27 20:40:53 ONKEL kernel: nvme nvme1: I/O 14 QID 0 timeout, disable controller

 

Since your pool is not redundant, it went read-only. Power cycle the server (not just a reboot) and post new diags after array start.
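
(For reference, a quick way to double-check the read-only state and the btrfs per-device error counters from the Unraid terminal; the /mnt/cache mount point is an assumption based on the default cache pool name.)

# "ro" in the options column means the filesystem was forced read-only
grep /mnt/cache /proc/mounts

# per-device error counters (write/read/flush/corruption/generation)
btrfs device stats /mnt/cache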


Hi JorgeB,

OK, here are the diags. I powered the server off and started it again. Unraid came up but didn't start the array automatically; I had to start it manually, and starting took quite a while. After the array started I created the diags. I hope this will help us get further.

 

"Since your pool is not redundant" - The pool isn't but the drives should. Is it a better way to clone the pool instead of the drives? Thought I'm save using two redundant SSDs in one pool :S

 

Thanks in advance!

onkel-diagnostics-20230628-1950.zip


It dropped again 3 minutes after the pool mounted; try re-seating the device or swapping slots with the other one.

 

26 minutes ago, onkelsmily said:

The pool isn't, but the drives should be.

It's using the single profile, not the default raid1, so when one device drops, as is happening here, the pool goes read-only. You can convert to raid1 if you can get the other device to stay up.
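
(A minimal sketch of that conversion from the command line, assuming the pool is mounted at /mnt/cache and both devices are online and writable; the same conversion is normally available from the pool's Balance options in the GUI.)

# convert both data and metadata to the raid1 profile
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

# verify the resulting profiles
btrfs filesystem usage /mnt/cache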

 

 


Hello,

I swapped the SSDs on the QNAP card, but it seems to be the same behaviour :(

Does this mean the SSD is damaged and I'm not able to restore? Where can I see which profile I used? The pool size was just one SSD, and I thought I had set up the pool correctly...

I'm starting to worry about my Docker configurations :S


One more thing to report: after the system failed to recognize the SSD one time, I did a New Config to assign the drives to the correct places again. Could this be an issue, or is it irrelevant...?

 

After I set the basic disk settings back to btrfs and auto-start, the cache is now unmountable due to an unsupported or no file system... very ugly.

 

[screenshot attached]


root@ONKEL:~# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: Patriot M.2 P300 512GB                  
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
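
(The output above lists no partitions at all. As a last-resort check, one could look for any surviving filesystem signature on the raw device; this is purely speculative and may well return nothing.)

# any remaining partition/filesystem signatures?
blkid /dev/nvme0n1

# does btrfs still find a superblock on the device?
btrfs filesystem show /dev/nvme0n1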


OK, this is really strange, because I didn't work on the server except for the upgrade. The only place where this could have happened is the mentioned "New Config" assignment when the drive failed the first time. Could that be the reason why the drive was wiped - because before that I still saw the orange folders on the remaining drive? If not, I really don't know where this could have happened. If that is the case, I assume recovery of the data will also not work :S

Is there a manual on how to set up the cache drives in RAID 1 correctly? If I need to reinstall everything from scratch, I'd like to do it correctly this time.

Thanks :)

 

 


OK, then I'm really confused, because there was no error message or warning, and originally I was sure that I had configured RAID 1, which, as you mentioned, is not the current state. I really don't know where this could have happened :(

Is there any possibility that the data remains on the other drive (nvme0n1) and that the cache just doesn't start because of the failing disk? Before I format everything I'd like to be sure about that. Is there a way to remove the failing disk from the pool and load only the other one?

 

This is the error log from the disk; it may be telling us the same thing we already identified...
[screenshot of the disk error log]


If you look at the original diags you can see how the pool was configured:

 

                  Data     Metadata  System                              
Id Path           single   single    single   Unallocated Total     Slack
-- -------------- -------- --------- -------- ----------- --------- -----
 2 /dev/nvme0n1p1 60.00GiB   1.00GiB 32.00MiB   415.91GiB 476.94GiB     -
 3 /dev/nvme1n1p1  1.00GiB         -        -   475.94GiB 476.94GiB     -
-- -------------- -------- --------- -------- ----------- --------- -----
   Total          61.00GiB   1.00GiB 32.00MiB   891.85GiB 953.88GiB 0.00B
   Used           54.25GiB 475.09MiB 16.00KiB  

 

Not sure how the pool ended up like this, but it's not normal. First, note that you have devices 2 and 3; device #1 isn't there, so a device was removed or replaced at some point. Then notice that both data and metadata use the single profile; when a user balances a pool to the single (non-redundant) profile, metadata normally stays raid1 (or DUP, depending on the Unraid version). In any case, note that the other NVMe device, the one that's empty now (nvme0n1), held almost all the data and all the metadata; the remaining device has at most 1GiB of data and no metadata. Sorry, but I don't see any way of recovering data from this pool without the other device.
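
(For completeness, the device numbering can also be checked directly on a mounted pool; a sketch assuming the pool is mounted at /mnt/cache.)

# list member devices with their devid numbers; a gap in the numbering
# (e.g. no devid 1) indicates a device was removed or replaced earlier
btrfs filesystem show /mnt/cache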

