(SOLVED) Loss of VMs, Dockers, Plugins

SmokeyColes · May 26, 2021

Hi

This morning my place had a power cut, and after rebooting I can no longer mount my system_cache.

My system cache had:

appdata
domains
isos
system

It was mirrored to protect it against the loss of a drive.

I only transferred across to the new setup on Monday. Yesterday I spent my time sorting the remaining files, and Plex.

The only backups I managed were libvirt.img and my USB drive; backing up the rest was on my to-do list for today.

I really don't want to lose those VMs. One had home automation software on it, which I had fully registered.

Brand new server and this happens, I am gutted 😔

Can someone please give advice?

Edited June 2, 2021 by SmokeyColes

ChatNoir · May 26, 2021

Hello @SmokeyColes, your system diagnostics could provide more information.

Please go to Tools / Diagnostics and attach the full zip file to your next post.

SmokeyColes · May 26, 2021

haccnas-diagnostics-20210526-0822.zip

Please see attached

JorgeB · May 26, 2021

See here for some recovery options, first one to try in this case would be the nologreplay mount option.

SmokeyColes · May 26, 2021

Hi @JorgeB

Forgive me, my ability isn't there. Here's what I tried:

mkdir /x

mount -o degraded,usebackuproot,ro /dev/nvme0n1 /x

Before I move on to the BTRFS restore (safe to use) step, I just want to check that I've entered the command correctly.

I'm happy to pay for your time on this, with the help you've provided - you deserve a couple of pints anyway for it all.

Edited May 26, 2021 by SmokeyColes

JorgeB · May 26, 2021

Try this option first, and note that the device in your command was missing the partition (nvme0n1p1, not nvme0n1):

mount -o ro,notreelog,nologreplay /dev/nvme0n1p1 /x

SmokeyColes · May 26, 2021

Oh thank goodness; honestly my nerves! I'm copying everything across in MC (as per your instructions) into a new usr/share now.

Copying it.

What would you suggest my next step?

Should I move to step 2, BTRFS restore (safe to use)?

Or should I format, which I think is what I need to do, and then once formatted move it all back and restart the server?

Please PM me or let me know your PayPal email too. I'm buying you a few pints; it's the least I can do for not losing my VMs.

Edited May 26, 2021 by SmokeyColes

JorgeB · May 26, 2021

10 minutes ago, SmokeyColes said:

What would you suggest my next step?

After everything is copied you can try to init the log tree:

First unmount filesystem with:

umount /x

Then:

btrfs rescue zero-log /dev/nvme0n1p1

Start the array normally and see if the pool mounts. If it does and all looks fine, nothing more should be needed, but make sure you do regular backups of anything important.

SmokeyColes · May 26, 2021

I have spoken too soon.

appdata and system copied with no issues.

1 of the 3 VMs transferred, but the 2 I really needed did not.

2 of the 3 ISOs did not either, but in fairness I can redownload those (I'm not excessively worried). I would like to recover the VMs, though.

Should I move onto init log tree, or should I try something else?

Edited May 26, 2021 by SmokeyColes

JorgeB · May 26, 2021

That suggests those files might be corrupt. Please post the current syslog.

SmokeyColes · May 26, 2021

syslog.txt

Attached

JorgeB · May 26, 2021

Yes, they are failing checksums. I also noticed data corruption was detected in your other pool, so you likely have bad RAM or another hardware problem.

You can use btrfs restore, since it won't check for that, but the files will still be corrupt. Depending on the severity, they might still work.

SmokeyColes · May 26, 2021

OK, I will give it a shot. Would it be worth trying to mount the other NVMe? Or does it mount in a mirrored pair?

With the RAM should I run memtest?

JorgeB · May 26, 2021

3 minutes ago, SmokeyColes said:

Or does it mount in a mirrored pair?

When you mount one it mounts the other.

4 minutes ago, SmokeyColes said:

With the RAM should I run memtest?

Yep.

SmokeyColes · May 26, 2021

I'm having a really bad day.

Neither VM worked, so I have decided to format. But for now the server is OOA, and won't be switched on until new memory modules are bought.

I'm purely to blame for this; I bought second-hand DDR4 and am now paying for it! 30208 errors at 21% of the memtest.

I wouldn't mind if you could just tell me what you meant about corrupt data in the other pool. I sent over a number of TV shows to the download_cache then ran mover to the array.

Is it safe to say that my array now has some corrupt data as well?

Sounds to me like I need to delete the files I sent across to my data drives, and format the system_cache and download_cache all over again (after replacing the RAM).

Once I hear back, I'll mark this as SOLVED and want to thank you for your help.

JonathanM · May 26, 2021

15 minutes ago, SmokeyColes said:

Is it safe to say that my array now has some corrupt data as well?

It's suspect, but not proven corrupt until you compare checksums. Unless you already have a solution in place to do that, many times it's faster to assume corruption and recopy.

There are many ways of doing checksum comparisons of files, almost as many as there are methods of copying the files in the first place. I recommend doing some research and getting a handle on at least one way of doing it, so that you won't get caught flat-footed again.

10,000 foot overview.

You can compare files bit by bit. There are programs that read all the bits in each file in parallel and compare as they go. That method is SLOW, especially over a network.

You can compare the checksums of the files. A checksum is a short string that, depending on the method used to obtain it, is almost guaranteed to be different if the file is different. That means once you know the name, size, and checksum of each file, you can transfer just that information to the other system and compare it to the list generated from the copied files. If everything matches, you are 99.999% sure you don't have corruption. The number of 9s in that confidence level can be huge; the chances of a hash collision are pretty much 0 for any decent-sized files.

Generating the list of checksums for both the original and copy, plus comparing the list, can be time consuming, so if the amount of data is small, recopying is faster.
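To make the checksum approach above a bit more concrete, here is a minimal Python sketch. It is only one of many ways to do this and is not taken from any of the posts: it assumes SHA-256 as the hash, and the names `sha256_of`, `build_manifest`, and `compare_manifests` are made up for illustration. It builds a name/size/checksum list for each side and reports files that are missing from the copy or differ from the original.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large VM images never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its (size, checksum)."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            manifest[name] = (path.stat().st_size, sha256_of(path))
    return manifest


def compare_manifests(original: dict, copy: dict) -> tuple:
    """Return (missing, mismatched) file names, judged against the original."""
    missing = sorted(name for name in original if name not in copy)
    mismatched = sorted(
        name for name in original
        if name in copy and original[name] != copy[name]
    )
    return missing, mismatched
```

You would call `build_manifest` once on the source and once on the copy (the directories are yours to fill in) and pass both results to `compare_manifests`. Only the small manifests need to travel between machines, not the files themselves.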

JorgeB · May 26, 2021

24 minutes ago, SmokeyColes said:

I wouldn't mind if you could just tell me what you meant about corrupt data in the other pool. I sent over a number of TV shows to the download_cache then ran mover to the array.

You can run a scrub on that pool to look for any corrupt files; if there are any, they will be listed in the syslog. Files corrupted on btrfs would be detected, and you'd get an i/o error when trying to move or copy them, like the ones you got above. However, some files might have been corrupted after leaving btrfs, at the time they were written to the xfs array disks, and there is no way to find those.

JonathanM · May 26, 2021

If the files were corrupted in RAM before being written to the cache, BTRFS would have no way of determining that the incoming data was bad.

JorgeB · May 26, 2021

5 minutes ago, jonathanm said:

If the files were corrupted in RAM before being written to the cache, BTRFS would have no way of determining that the incoming data was bad.

Correct. It can detect some corruption if there are errors during a read, i.e. when the checksum calculated for the stored block fails to match the written one because of RAM errors. I believe it's also possible for either the block or the checksum to be corrupted before being stored, and those cases can be detected too, since the two won't match. But if the corruption happens before the data is written, and the checksum is calculated for the already-corrupt block, then it would go undetected, unless there were more errors during reads.
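The difference between those cases can be shown with a toy model. The Python sketch below is a simulation under stated assumptions, not how btrfs is implemented: it uses `zlib.crc32` as a stand-in checksum, and `flip_bit`, `store`, and `verify` are hypothetical helpers. It flips a single bit either after or before the checksum is computed and then checks whether a read would notice.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, imitating a RAM error."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def store(block: bytes) -> tuple:
    """Toy 'write': checksum the block exactly as it was handed over."""
    return block, zlib.crc32(block)


def verify(stored: tuple) -> bool:
    """Toy 'read': recompute the checksum and compare with the stored one."""
    block, checksum = stored
    return zlib.crc32(block) == checksum


original = b"virtual disk contents" * 64

# Case 1: the bit flips after the checksum was computed.
# Block and checksum disagree, so the read fails verification.
block, checksum = store(original)
flipped_after = (flip_bit(block, 1234), checksum)

# Case 2: the bit flips before the data reaches the filesystem.
# The checksum is computed over already-bad data, so verification passes.
flipped_before = store(flip_bit(original, 1234))

print("flip after checksum detected: ", not verify(flipped_after))
print("flip before checksum detected:", not verify(flipped_before))
```

In the second case the stored block differs from the original data, yet nothing on the storage side can tell, which is why comparing against the source is the only way to catch it.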

JonathanM · May 26, 2021

Bad RAM is bad juju. You really never know what it is or isn't going to corrupt, since pretty much every operation uses RAM to move data. Honestly, best case scenario is RAM so bad the machine just crashes hard, that way you at least can't get much done before figuring out you have a major issue.

SmokeyColes · June 2, 2021

Hi, just an update before I close the post. I was fairly lucky, as a lot of the files I had transferred to the cache had not been moved from the cache to the array as I had thought.

I was easily able to identify the corrupted files, as they were the ones left behind on the cache (the move from cache to array was done with the new RAM in).

So I have no idea, but it's sorted now. I lost about 40GB overall.

I have slowly been working my way through repairing the damage to the corrupt VMs, taking regular backups as I go. This is where it hit me hardest (in time and effort).

Unfortunately I received a new error tonight, which I'm no doubt going to raise with you pros again (I'm not having much luck!)

Thank you all for your kind help and support.
