onkelsmily Posted June 27, 2023

Hello world,

since the first reboot after upgrading to 6.12.1 I'm facing an issue where one of my cache drives is marked as read-only (reported by the "Fix Common Problems" plugin). Both cache drives are mounted and there is enough space available. Any idea how to fix this? The data still seems to be available, since I run the drives in a RAID 1 configuration.

After an additional reboot the drive was not recognized, but it came up again after yet another reboot. Both SSDs are mounted on a PCIe card in an HP MicroServer Gen10 and haven't had any issues in years...

Nevertheless, my Dockers and VM (with HASS) will no longer start, and I would really like to get them running again with the existing data. I appreciate any help, as I'm more a user than an expert.

onkel-diagnostics-20230627-2052.zip
JorgeB Posted June 28, 2023

One of the NVMe devices dropped offline:

Jun 27 20:39:08 ONKEL kernel: nvme nvme1: Abort status: 0x0
Jun 27 20:39:08 ONKEL kernel: nvme nvme1: I/O 162 (I/O Cmd) QID 2 timeout, aborting
Jun 27 20:39:08 ONKEL kernel: nvme nvme1: Abort status: 0x0
Jun 27 20:39:35 ONKEL kernel: Uhhuh. NMI received for unknown reason 24 on CPU 0.
Jun 27 20:39:35 ONKEL kernel: Dazed and confused, but trying to continue
Jun 27 20:39:38 ONKEL kernel: nvme nvme1: I/O 0 QID 2 timeout, reset controller
Jun 27 20:39:47 ONKEL kernel: nvme nvme1: failed to set APST feature (2)
Jun 27 20:39:47 ONKEL kernel: nvme nvme1: 4/0/0 default/read/poll queues
Jun 27 20:40:17 ONKEL kernel: nvme nvme1: I/O 0 QID 2 timeout, disable controller
Jun 27 20:40:53 ONKEL kernel: nvme nvme1: I/O 14 QID 0 timeout, disable controller

Since your pool is not redundant, it went read-only. Power cycle the server (not just a reboot) and post new diagnostics after the array starts.
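[Editor's note: the "failed to set APST feature" line points at NVMe power-state handling. A mitigation commonly suggested for NVMe controllers that time out like this - not something recommended in this thread, so treat it as a hedged sketch - is disabling APST via a kernel boot parameter in Unraid's syslinux.cfg:

  # /boot/syslinux/syslinux.cfg -- add the parameter to the "append"
  # line of the boot entry you actually use, for example:
  label Unraid OS
    menu default
    kernel /bzimage
    append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot

Setting nvme_core.default_ps_max_latency_us=0 keeps the drive out of its deeper autonomous power states; it costs a little idle power but often stops the timeout / disable-controller loop.]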
onkelsmily Posted June 28, 2023

Hi JorgeB,

OK, here are the diags. I powered the server off and started it again. Unraid came up but didn't start the array automatically; I had to do it manually, and starting took quite a while. After the array started I created the diagnostics. I hope this helps to get further.

"Since your pool is not redundant" - the pool isn't, but the drives should be. Would it be better to clone the pool instead of the drives? I thought I was safe using two redundant SSDs in one pool.

Thanks in advance!

onkel-diagnostics-20230628-1950.zip
JorgeB Posted June 28, 2023

It dropped again 3 minutes after the pool mounted. Try re-seating the device, or swap slots with the other one.

26 minutes ago, onkelsmily said:
The pool isn't but the drives should.

It's using the single profile, not the default raid1, so when one device drops, as is happening here, the pool goes read-only. You can convert to raid1, if you can get the other device to stay up.
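[Editor's note: for context, converting an existing two-device btrfs pool to raid1 is done with a balance. A minimal sketch, assuming the pool is mounted at the usual Unraid path /mnt/cache:

  # rewrite both data and metadata chunks into the raid1 profile
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache

The balance rewrites existing chunks, so it needs both devices online and can take a while on a well-filled pool.]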
onkelsmily Posted June 28, 2023

Hello,

I swapped the SSDs on the QNAP card, but the behaviour seems to be the same. Does this mean the SSD is damaged and I won't be able to restore? Where can I see which profile I used? The pool size shown was just one SSD, and I thought I had set the pool up correctly... I'm starting to worry about my Docker configurations.
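[Editor's note: to answer the profile question in general terms, a standard btrfs command shows it directly, assuming the pool mounts at /mnt/cache:

  btrfs filesystem df /mnt/cache

Lines like "Data, RAID1" / "Metadata, RAID1" in the output mean a redundant pool, while "Data, single" means each chunk exists on only one device. The same information appears in the diagnostics quoted later in this thread.]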
onkelsmily Posted June 28, 2023 (edited)

One more thing to report: after the system failed to recognize the SSD one time, I did a New Config to assign the drives to the correct slots again. Could this be an issue, or is it irrelevant...?

After I set the basic disk settings back to btrfs and auto-start, the cache is now unmountable due to "unsupported or no file system"... very ugly.

Edited June 28, 2023 by onkelsmily
onkelsmily Posted June 29, 2023

Here they are:

onkel-diagnostics-20230629-2006.zip
JorgeB Posted June 29, 2023

One of the devices is missing the filesystem; possibly it was wiped. Post the output of:

  fdisk -l /dev/nvme0n1
onkelsmily Posted June 29, 2023 Author Share Posted June 29, 2023 root@ONKEL:~# fdisk -l /dev/nvme0n1 Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors Disk model: Patriot M.2 P300 512GB Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Quote Link to comment
JorgeB Posted June 30, 2023

That device has no partition, suggesting it was wiped. We can see if it's recoverable. Type:

  sfdisk /dev/nvme0n1

then type 2048, hit return, and post a screenshot of the results.
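[Editor's note: the idea here is that sfdisk in interactive mode lets you recreate a partition starting at a given sector; if the old filesystem began there, sfdisk will report the existing signature on the partition it is about to create. A minimal sketch of the session (2048 and, later in the thread, 64 are the two partition starts tried; Ctrl+C aborts without writing anything):

  sfdisk /dev/nvme0n1
  # at the prompt, enter only the start sector and accept the defaults:
  2048
  # sfdisk prints the partition it would create; look for a message
  # about an existing filesystem signature, then press Ctrl+C to quit
  # without touching the disk
]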
onkelsmily Posted June 30, 2023

Hi JorgeB,

here is the result. It did something, but the array is still unmountable.

Regards!

onkel-diagnostics-20230630-2254.zip
JorgeB Posted July 1, 2023

That command by itself doesn't do anything; it's just trying to see if it can recover the old partition. There's no old partition starting on sector 2048, so retry with 64. Type:

  sfdisk /dev/nvme0n1

then type 64, hit return, and post a screenshot of the results.
JorgeB Posted July 2, 2023

You can abort that with CTRL + C. It looks like the device was fully wiped, possibly with a full device trim (blkdiscard), so I don't see any option to recover the data.
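[Editor's note: for completeness, two standard, non-destructive checks that a device really holds no filesystem traces (util-linux tools available on Unraid; illustrative, not steps requested in the thread):

  blkid -p /dev/nvme0n1            # low-level probe for filesystem signatures
  hexdump -C -n 512 /dev/nvme0n1   # dump the first sector; all zeroes after a full trim

No output from blkid together with an all-zero first sector is consistent with a blkdiscard-style wipe.]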
onkelsmily Posted July 2, 2023

OK, this is really strange, because I didn't work on the server except for the upgrade. The only place where this could have happened is the mentioned "New Config" assignment after the drive failed the first time. Could that be a reason why the drive was wiped - because before that I still saw the orange folders on the remaining drive? If not, I really don't know where this could have happened. If that was the case, I assume recovery of the data won't work either.

Is there a manual on how to set up the cache drives in RAID 1 correctly? If I need to reinstall everything from scratch, I'd like to do it right this time.

Thanks
JorgeB Posted July 2, 2023

A new config by itself would not wipe drives. By default a two-device pool will be raid1, but you can post new diags once done to confirm.
onkelsmily Posted July 2, 2023

OK, then I'm really confused, because there was no error message or warning, and originally I was sure that I had configured RAID 1 - which, as you mentioned, is not the actual status. I really don't know where this could have happened.

Is there any possibility that the data remains on the other drive (nvme0n1) and that the cache just doesn't start because of the failing disk? Before I format everything, I'd like to be sure about that. Is there a way to remove the failing disk from the pool and load only the other one?

This is the error log from the disk, probably telling the same thing we identified already...
JorgeB Posted July 2, 2023

2 minutes ago, onkelsmily said:
Is there any possibility that the data remains on the other drive

Don't think so, but post the output of:

  btrfs fi show
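[Editor's note: as background, the generic btrfs way to bring up a pool with one member absent is a degraded, read-only mount of the surviving device. A sketch, assuming /dev/nvme1n1p1 is the remaining pool member and /x is a hypothetical scratch mount point (not a step prescribed in this thread):

  mkdir -p /x
  mount -o degraded,ro /dev/nvme1n1p1 /x

Whether anything usable shows up depends entirely on which chunks lived on that device - with single-profile data and metadata mostly on the wiped disk, as the next post shows, there is little for the mount to find.]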
JorgeB Posted July 3, 2023

If you look at the original diags you can see how the pool was configured:

                   Data      Metadata   System
Id Path            single    single     single    Unallocated  Total      Slack
-- --------------  --------  ---------  --------  -----------  ---------  -----
 2 /dev/nvme0n1p1  60.00GiB  1.00GiB    32.00MiB  415.91GiB    476.94GiB  -
 3 /dev/nvme1n1p1  1.00GiB   -          -         475.94GiB    476.94GiB  -
-- --------------  --------  ---------  --------  -----------  ---------  -----
   Total           61.00GiB  1.00GiB    32.00MiB  891.85GiB    953.88GiB  0.00B
   Used            54.25GiB  475.09MiB  16.00KiB

I'm not sure how the pool ended up like this, but it's not normal. First, note that you have devices 2 and 3 - device #1 isn't there - so a device was removed or replaced before. Then notice that both data and metadata use the single profile; when a user balances a pool to the single (non-redundant) profile, metadata normally stays raid1 (or DUP, depending on the Unraid version). In any case, the other NVMe device, the one that's empty now (nvme0n1), held almost all the data and all the metadata, while the remaining device has at most 1GB of data and no metadata. Sorry, but I don't see any way of recovering data from this pool without the other device.
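[Editor's note: for contrast, on a correctly configured two-device raid1 pool the same report would show raid1 under both Data and Metadata, with each device carrying a full copy - roughly like this (illustrative sketch only, not the user's real output):

                   Data      Metadata  System
Id Path            RAID1     RAID1     RAID1     Unallocated
-- --------------  --------  --------  --------  -----------
 2 /dev/nvme0n1p1  55.00GiB  1.00GiB   32.00MiB  420.91GiB
 3 /dev/nvme1n1p1  55.00GiB  1.00GiB   32.00MiB  420.91GiB

With raid1 chunks mirrored on both members, losing either single device still leaves a complete copy behind.]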