dan4UR Posted April 27, 2022
Hello guys, today I went to the Unraid Dashboard and I don't know what's going on. Every Monday I receive an array health report, and so I did on Monday the 25th: everything just fine [PASS]. Now I noticed that the Cache 1 NVMe is gone ... and my shares are partially unprotected. What's going on here? I rebooted my rig hoping it would help, but it's still the same.
apollon-diagnostics-20220427-0803.zip
JorgeB Posted April 27, 2022
Is there supposed to be a cache1 NVMe device? If yes, it's not being detected at the hardware level.
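For anyone wanting to verify this from the Unraid console, two read-only commands show whether the kernel enumerated the device at all (nothing here modifies state):

lspci | grep -i 'non-volatile'    # NVMe controllers visible on the PCIe bus
ls -l /dev/nvme*                  # device nodes the nvme driver created

If the controller does not even appear in the lspci output, the problem sits below the OS: the M.2 slot, power, or the drive itself.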
dan4UR Posted April 27, 2022
49 minutes ago, JorgeB said: Is there supposed to be a cache1 NVMe device? If yes, it's not being detected at the hardware level.
That's the point. I have two identical WD NVMe drives set up as a cache pool, running for more than a year. Now the Cache 1 drive is no longer detected ...
itimpi Posted April 27, 2022
1 hour ago, dan4UR said: Now the Cache 1 drive is no longer detected ...
Probably means it has failed.
dan4UR Posted April 27, 2022
That's what I think. I'll shut down my rig and connect a monitor, since it's headless. Maybe I'll find some information in the BIOS.
dan4UR Posted May 10, 2022
Hello, I had some big problems with my machine. I've got an Enermax AIO water cooler and this little guy just died on me, which is how I found my server shut down. I started it again, and right after booting, on the Dashboard, I saw my lost cache NVMe. It was listed under UD 😀 (yeah, it's alive). Then I saw my cache is: Unmountable: no filesystem. Right after I realized this, the whole machine hard shut down again. Went to the cellar and confirmed: yes, my server is offline. Took it and connected it to a monitor to check the BIOS. Oh hell. The CPU temp went within seconds from 30°C to something like 109°C, and then it shut down. Checked the water cooler, disassembled it, cleaned the CPU and cooler block, applied new thermal paste and booted again. Same result. Then I changed to the boxed air cooler and et voilà, the CPU is chilling at 42°C 😇
But what to do now with my broken cache pool? Disk 2 is still part of the pool, but disk 1 sits under unassigned devices. Should I still check whether NVMe 1 is broken? SMART doesn't show any errors. The only information I found was for Dev 1 (cache pool NVMe 1):
How can I restore the whole thing? All my Docker and download temp data is/was stored in the cache pool ... 😪
JorgeB Posted May 11, 2022
Please post the diagnostics.
dan4UR Posted May 11, 2022
6 hours ago, JorgeB said: Please post the diagnostics.
Hey @JorgeB, two diagnostics: the first after a fresh reboot with the array stopped (...-1536.zip), the second after manually starting the array (...-1537.zip). Dashboard screen:
I hope someone can help me find a solution for rescuing or rebuilding my cache pool.
apollon-diagnostics-20220511-1536.zip apollon-diagnostics-20220511-1537.zip
JorgeB Posted May 11, 2022
May 11 15:36:33 Apollon kernel: BTRFS error (device nvme0n1p1): super_num_devices 1 mismatch with num_devices 1 found here
The superblock is corrupt, and the other pool member is way out of sync, though that's expected if it dropped offline earlier. You can try a backup superblock to see if it works: stop the array and type:
btrfs-select-super -s 1 /dev/nvme0n1p1
Then reboot and post new diags.
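Some background on what this command does: btrfs stores backup copies of the superblock at fixed offsets (64 MiB, and 256 GiB on large enough devices), and btrfs-select-super -s 1 overwrites the damaged primary with backup copy 1. The copies can be inspected non-destructively first, with the array stopped:

btrfs inspect-internal dump-super -s 1 /dev/nvme0n1p1    # print backup copy 1 only
btrfs inspect-internal dump-super -fa /dev/nvme0n1p1     # print all copies in full

These only read from the device, so they are safe to run before committing to the restore.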
dan4UR Posted May 11, 2022
And here we go, see diags after the reboot.
apollon-diagnostics-20220511-2141.zip
Edit: Or should I also start the array?
JorgeB Posted May 12, 2022
11 hours ago, dan4UR said: Or should I also start the array?
Yes, sorry, forgot to mention that.
dan4UR Posted May 12, 2022
And here is the new file, but still the same 🤔
apollon-diagnostics-20220512-0951.zip
JorgeB Posted May 12, 2022
You can try this: physically disconnect the other NVMe device, the one that is currently unassigned, and try again.
dan4UR Posted May 12, 2022
16 minutes ago, JorgeB said: You can try this: physically disconnect the other NVMe device, the one that is currently unassigned, and try again.
Okay, will try it. Not that nice, since the NVMes sit under a heavy heat spreader on my mainboard. But hopefully it will work after that 🤗
JorgeB Posted May 12, 2022
If the device is in its own IOMMU group and can be bound to vfio-pci, that's also an option; the important part is that the device is not visible to Unraid.
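As a sketch of what that binding looks like at the kernel level (the PCI address 0000:04:00.0 is a made-up example; find the real one with lspci):

echo vfio-pci > /sys/bus/pci/devices/0000:04:00.0/driver_override   # steer future probes to vfio-pci
echo 0000:04:00.0 > /sys/bus/pci/drivers/nvme/unbind                # detach the nvme driver
echo 0000:04:00.0 > /sys/bus/pci/drivers_probe                      # re-probe; vfio-pci claims the device

On Unraid 6.9+ the same result can be had without the shell by ticking the device under Tools > System Devices and rebooting; either way the controller stays powered, but no /dev/nvme* node is created for it.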
dan4UR Posted May 12, 2022
Or maybe it's possible to deactivate the device in the BIOS? Or would it still be visible to Unraid?
JorgeB Posted May 12, 2022
It should be possible to disable it in the BIOS, but only if that M.2 slot is shared with a PCIe slot and you can set which one to use.
dan4UR Posted May 12, 2022
You know what, after my home office session I'll disconnect it from the MB and give it a try. When it's done, do I just boot and start the array, or is there anything else to do?
JorgeB Posted May 12, 2022
1 minute ago, dan4UR said: just boot and start the array
This, just check that you disconnected the correct one.
dan4UR Posted May 12, 2022
Just now, JorgeB said: This, just check that you disconnected the correct one.
That will be the hardest part of the operation 🤣
dan4UR Posted May 20, 2022
Hey, back again. I did not have much time the last few days, but I think I have some partially good news. Disconnected the NVMe placed under UD and started my rig. But what to do now?
Delete dev1 under Historical Devices? (I think that way, when I put it back in, Unraid shouldn't place it under UD again.)
Stop the array and set the number of disks of the cache pool to 1, then back up my data/move it to the array? Or should I back up/move my data to the array now?
I think first I'll wait for @JorgeB to look at my diagnostics and confirm everything is OK.
apollon-diagnostics-20220520-1035.zip
JorgeB Posted May 20, 2022 (Solution)
That's good news! UD historical devices don't really matter for this, but you can remove it now or later. I assume you plan to re-add the other device to the pool? If yes, first make sure backups are up to date, then you'll need to wipe the other device before adding it back to the pool. You can do it like this:
-check that array auto start is disabled, shut down the server
-reconnect the other NVMe device
-power on the server, don't start the array
-wipe the unassigned device with: blkdiscard /dev/nvme#n1 (replace # with the correct number; not sure if 6.9.2 needs -f for blkdiscard when existing data is detected, if yes use it)
-assign it back to the pool
-start the array to begin the balance
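Because blkdiscard irreversibly discards everything on the device, it is worth confirming which nvme number the reconnected drive got before running it. The serial number is the safest match; /dev/nvme1 and /dev/nvme1n1 below are only example names:

lsblk -o NAME,SIZE,MODEL,SERIAL    # map nvmeXn1 names to models and serials
smartctl -i /dev/nvme1             # cross-check the serial of the wipe candidate
blkdiscard -f /dev/nvme1n1         # then discard; add -f only if it refuses without it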
dan4UR Posted May 20, 2022
12 minutes ago, JorgeB said: (...) I assume you plan to re-add the other device to the pool?
That's correct.
12 minutes ago, JorgeB said: If yes, first make sure backups are up to date
Is it enough to change the share settings from "Cache: Only" to "Yes" and start the mover? I'll also shut down any Docker/VM from the Settings tab, and maybe run a CA Backup. Losing some temporary files is no big deal, since I can get them back, but Docker instances with all their settings would be horror to lose.
12 minutes ago, JorgeB said: (...) -wipe the unassigned device with: blkdiscard /dev/nvme#n1 (...)
After I've added the freshly wiped NVMe back to the cache pool, is it just a matter of setting the old cache shares back to "Only" and starting the mover?
JorgeB Posted May 20, 2022
IMHO moving everything to the array and back is overkill; just make sure anything important like appdata is backed up. You should always have backups of anything important, redundancy is not a substitute. When you add the device it will keep the existing pool data, and you don't even need to shut down Docker/VMs, they can stay online; the data is just replicated to the other device.
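The replication mentioned here happens as a btrfs balance back to the pool's raid1 profile once the array starts, and it can be watched from the console (/mnt/cache is Unraid's default mount point for the cache pool):

btrfs balance status /mnt/cache     # progress while the re-replication runs
btrfs filesystem show /mnt/cache    # both NVMe devices should be listed with data
btrfs filesystem df /mnt/cache      # Data/Metadata should report RAID1 when done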
dan4UR Posted May 20, 2022
2 hours ago, JorgeB said: (...) -wipe the unassigned device with: blkdiscard /dev/nvme#n1 (...)
Had to use it with -f.
2 hours ago, JorgeB said: -assign it back to the pool
Now the drive is still listed under UD. Should I just add it back to the cache pool, or hit the Format button first?