Cache pool drive failure. Need assistance.


August 30, 2024

The cache pool is two 2TB NVMe drives in a mirrored BTRFS pool.

About a week ago, I woke up to my cache pool being offline due to errors. This happened once before, about 9 months ago. I was able to mostly recover this time. I wasn't sure, but I suspected I may have run out of memory (I now think that's wrong). I've been watching the logs for the last week and all seemed fine until today. I noticed a bunch of errors again, but the drive hadn't gone offline yet. I figured I'd stop the array, remove the problem drive, and be able to continue on the remaining drive. I rebooted and the cache is reporting no file system. So now I'm not sure what to do.

The data should be there; I think I just screwed up by removing the other drive. Can anyone assist with getting my remaining drive back online?

unraid-diagnostics-20240830-1248 2.zip

Edited August 30, 2024 by WashingtonMatt
Type-o


August 30, 2024

Author

I was able to use this post to mount my drive read-only. I'm currently copying my data to the array; hopefully it's not corrupted. Both drives are currently in Unassigned Devices. Oddly, the "bad" drive is showing some reads as I copy the data, and the system log is barfing a ton of errors...

I could use some help in determining what actually went wrong. The whole point of mirroring the drives was to be able to survive a failure.


August 31, 2024

Community Expert

The syslog doesn't show the start of the issue, but it looks like one of the NVMe devices dropped offline. Once the backup is done, power cycle the server and post new diags after array start.



September 8, 2024

Author

Finally getting an opportunity to get back to this. These things never happen at a convenient time. To get working again, I just recreated the cache with a single drive. Things seem to have been running fine with no errors.

I would still like to run the other drive in its own pool, but it seems to be locked up by Unraid. I'm not really sure of the proper way to proceed. The attached diagnostics are from just after a cold boot and an attempt to mount the drive via Unassigned Devices.

unraid-diagnostics-20240908-1349.zip


September 8, 2024

Community Expert

If you want to use it with a different pool, wipe the second device with

blkdiscard -f /dev/nvmeXn1

Then add it to a new pool and format it.
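Since blkdiscard discards every block on the target, it may be worth confirming that the device is the intended one and is not mounted before running it. The following is a minimal sketch of such a guard, not an Unraid tool: the helper names device_in_use and safe_wipe are hypothetical, the check only recognizes NVMe-style partition names (nvmeXn1pN), and safe_wipe prints the command for review rather than running it. A mounts check is only a basic sanity test; it will not notice, for example, a device that belongs to a multi-device pool mounted through another member.

```shell
# Hypothetical helper: succeeds (exit 0) if the device or one of its
# NVMe-style partitions (e.g. /dev/nvme1n1p1) is listed in the mounts table.
# $1 = whole-device path, $2 = mounts file (defaults to /proc/mounts).
device_in_use() {
    awk -v d="$1" '
        $1 == d || index($1, d "p") == 1 { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}

# Hypothetical dry-run wrapper: refuses if the device is mounted,
# otherwise prints the wipe command from the post above for review.
safe_wipe() {
    if device_in_use "$1" "${2:-/proc/mounts}"; then
        echo "refusing: $1 is mounted" >&2
        return 1
    fi
    echo "blkdiscard -f $1"
}
```

For example, safe_wipe /dev/nvme1n1 would print the blkdiscard line only if nothing under /dev/nvme1n1 is currently mounted; the device name here is a placeholder, just like nvmeXn1 above.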


September 8, 2024

Author

Yes, that worked, thank you.

I'm still unclear on what happened with my original corruption issue. I'm not convinced it's a hardware issue. Both times this occurred, I think I was pushing the limits of available system memory with many VMs and dockers running; then, when overnight backup tasks started running, the cache pool corrupted and went read-only. Does that seem like a possible cause?

What's a good way to test this cache drive?


September 9, 2024

Community Expert

NVMe devices dropping offline is usually not a device problem. If it happens again, post new diags before rebooting.
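On the question of how to keep an eye on the pool in the meantime, one option is to watch the per-device BTRFS error counters, which should stay at zero on a healthy pool. The sketch below is an illustration only: the function name nonzero_btrfs_errors is hypothetical, it assumes the usual `[device].counter  value` line layout printed by `btrfs device stats`, and /mnt/cache is assumed to be the pool's mount point.

```shell
# Hypothetical filter: reads `btrfs device stats` style lines on stdin and
# prints only the counters whose value is non-zero.
# Exit status is 1 if any non-zero counter was seen, 0 otherwise.
nonzero_btrfs_errors() {
    awk '
        $2 ~ /^[0-9]+$/ && $2 > 0 { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example use (assumes the pool is mounted at /mnt/cache):
#   btrfs device stats /mnt/cache | nonzero_btrfs_errors
```

A non-zero counter only says that errors were recorded on that device at some point; it does not by itself distinguish a failing drive from one that dropped off the bus, which is why the diagnostics taken before a reboot still matter.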
