AjaxMpls Posted February 24, 2023

I had a pair of Crucial MX500 SSDs in RAID1 for my cache pool and had noticed the logs filling up with IO errors. Thinking these errors were related to the previously reported BTRFS problems with the Crucial firmware, I planned to replace the cache drives with another brand. In the interim, I applied the update from 6.11.3 to 6.11.5 and rebooted. After the reboot, the array refused to start, complaining that the cache drives were unmountable. I foolishly tried removing and re-adding the drives to the pool, but assigned them to the wrong positions and lost the MBRs - the cache pool was a total loss.

In any case, I am now trying to rebuild with my replacement cache drive installed. I precleared it, formatted it, added it to the cache pool, and clicked Start Array. The array does not start, though: the parity check begins, but the array stays offline. I let the parity check run, thinking the array might start once it finished, but it hangs at 29.1% complete. I left it there for about an hour with no further progress. I've rebooted again and am still seeing the same behavior - the parity check starts but the array does not. I am also getting a message about a stale configuration. I'm not sure where to go from here.

unraid-diagnostics-20230223-1858.zip
JorgeB Posted February 24, 2023

If a parity check started, the array is started. Are you using Firefox? If so, reboot first and then use a different browser.

P.S. If there was important data there, the old cache might still be recoverable, provided the drives were only wiped by Unraid and you still have them untouched.
AjaxMpls Posted February 24, 2023

Last night I rebooted again, and this time the parity check did finish. It still showed the array stopped, but after rebooting once more everything looks normal: the array is started and I was able to start the Docker service.

I do still have the removed cache disks. I had tried mounting them with Unassigned Devices without luck. Any other suggestions to recover the data?
JorgeB Posted February 24, 2023

Post new diags with both cache disks connected, and identify which devices they are.
AjaxMpls Posted February 24, 2023

Thanks, attached. It's sdh & sdg.

unraid-diagnostics-20230224-0948.zip
JorgeB Posted February 24, 2023

Post the output of:

btrfs-select-super -s 1 /dev/sdg

and

btrfs-select-super -s 1 /dev/sdh
AjaxMpls Posted February 24, 2023

root@unraid:~# btrfs-select-super -s 1 /dev/sdg
No valid Btrfs found on /dev/sdg
ERROR: open ctree failed
root@unraid:~# btrfs-select-super -s 1 /dev/sdh
No valid Btrfs found on /dev/sdh
ERROR: open ctree failed
JorgeB Posted February 24, 2023

Oops, sorry, my bad - it should be:

btrfs-select-super -s 1 /dev/sdg1

and

btrfs-select-super -s 1 /dev/sdh1
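For context: btrfs-select-super overwrites the damaged primary superblock with the backup copy selected by -s; copy 1 sits 64MiB into the filesystem (byte offset 67108864), which is why the command has to target the partitions (sdg1/sdh1) rather than the whole disks. If you want a read-only sanity check before writing anything, something along these lines dumps that backup copy without modifying the device (exact options may vary with the btrfs-progs version):

btrfs inspect-internal dump-super -s 1 /dev/sdg1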
AjaxMpls Posted February 24, 2023

root@unraid:~# btrfs-select-super -s 1 /dev/sdg1
warning, device 5 is missing
using SB copy 1, bytenr 67108864
root@unraid:~# btrfs-select-super -s 1 /dev/sdh1
using SB copy 1, bytenr 67108864
root@unraid:~#
JorgeB Posted February 24, 2023

Now assign both devices to a pool. It must be a new pool, or a pool that was reset, i.e. it can't show a "data on this device will be deleted" warning for any of the pool devices. Then start the array and the old pool should import; if it doesn't, post new diags.
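Once the array is up, a quick way to confirm the pool imported with both members (assuming the pool is named cache and is mounted at /mnt/cache - adjust the path if yours differs):

btrfs filesystem show /mnt/cache
btrfs filesystem usage /mnt/cache

Both devices should be listed, with neither reported as missing.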
AjaxMpls Posted February 24, 2023

Nice! Yes, it did import as a new pool and I can see my shares again. I tried toggling the Docker service off and on, but it's still not starting up. Will I need to reboot again?
JorgeB Posted February 24, 2023

A reboot should not be needed; please post diags.
AjaxMpls Posted February 24, 2023

Posted. I'm having a hard time determining from the logs which drive is throwing the IO errors, as both are referenced.

unraid-diagnostics-20230224-1105.zip
JorgeB Posted February 24, 2023

One of the pool devices is out of sync; it likely dropped offline in the past. Run a correcting scrub and post the results.
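If you want to run it from the command line instead of the GUI, a correcting scrub is roughly (assuming the pool is mounted at /mnt/cache):

btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache

On a read-write mount, scrub repairs correctable errors by default. To see which of the two devices is actually accumulating errors, btrfs device stats /mnt/cache reports per-device counters.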
AjaxMpls Posted February 24, 2023

Scrub complete.

unraid-diagnostics-20230224-1133.zip
JorgeB Posted February 24, 2023

And the output of the scrub?
AjaxMpls Posted February 24, 2023

UUID:            40f53594-6f8d-42e0-afaa-a94807c15c5c
Scrub started:   Fri Feb 24 11:14:51 2023
Status:          finished
Duration:        0:17:43
Total to scrub:  778.31GiB
Rate:            749.75MiB/s
Error summary:   verify=13827 csum=1531847
  Corrected:      1545674
  Uncorrectable:  0
  Unverified:     0
JorgeB Posted February 24, 2023

OK, now reboot to clear the syslog and post new diags after the array starts (and after starting Docker, if it's not enabled).
AjaxMpls Posted February 24, 2023

Still getting a "Docker service failed to start" message after reboot.

unraid-diagnostics-20230224-1202.zip
JorgeB Posted February 24, 2023

Recreate the docker image.
AjaxMpls Posted February 24, 2023

Does this mean I need to configure all my containers anew after they are recreated?
JorgeB Posted February 24, 2023

Nope, see the link.
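(Roughly, from memory - the linked FAQ entry is the authoritative version: stop the Docker service under Settings → Docker, delete docker.img from that same page, re-enable the service, then use Apps → Previous Apps to reinstall the containers. The templates keep all of your port/path/variable settings, and the configuration inside appdata is untouched, so nothing needs to be set up from scratch.)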
AjaxMpls Posted February 24, 2023

OK, that was really very impressive. All my docker containers appear to be in working order now, and one of my VMs is working properly. I do have a Windows 10 VM that will not boot now, though - it bluescreens repeatedly, so I'm guessing that one is a loss.

So, some lessons learned on this one. I'll get those sketchy cache disks replaced and keep the appdata folder backed up. What is the best practice for backing up VMs, since snapshots are not supported? I'm fine with shutting the VM down before backing up - just copy the vdisk + libvirt?

unraid-diagnostics-20230224-1353.zip
JorgeB Posted February 25, 2023 (Solution)

Feb 24 13:22:14 unraid kernel: BTRFS info (device loop3): bdev /dev/loop3 errs: wr 0, rd 0, flush 0, corrupt 25, gen 0

libvirt.img is also corrupt, but with that one, if you re-create it you will need to reconfigure the VMs - do you have a backup?

11 hours ago, AjaxMpls said: "since snapshots are not supported?"

Snapshots are supported - it's how I back up my VMs. There's just no GUI support, but you can take them manually (or with a script) or by using the Snapshot plugin.
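A minimal manual sketch of that, assuming the VM vdisks live under /mnt/cache/domains and that domains is (or has been converted to) a btrfs subvolume rather than a plain directory:

# read-only snapshot of the domains subvolume, named with today's date
btrfs subvolume snapshot -r /mnt/cache/domains /mnt/cache/domains_snap_$(date +%Y%m%d)

Taken with the VM shut down it's a clean copy; taken live it's crash-consistent. A read-only snapshot like this can also be shipped to another btrfs pool with btrfs send / btrfs receive.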
AjaxMpls Posted February 26, 2023

Thanks for your help, Jorge. I did not have a backup, so I recreated that VM and have now configured backups using the https://github.com/danioj/unraid-autovmbackup script.