Cache seems corrupted, nothing to see in GUI


WEHA


I shut down my Unraid server to replace the RAM (64GB to 128GB).

Once it restarted, I enabled Docker and virtual machines...

The Docker page says the service failed to start, and the VMs barely start, claiming corruption everywhere.

 

I checked the syslog and saw these messages everywhere:

BTRFS error (device nvme0n1p1): parent transid verify failed on 1326431879168 wanted 214008171 found 213104704

BTRFS error (device nvme0n1p1): parent transid verify failed on 612809326592 wanted 213820848 found 213103567
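(For anyone following along: these lines can be pulled from the system log on the console with something like the command below, assuming the standard Unraid syslog location:

grep -i 'BTRFS error' /var/log/syslog | tail -n 50

This just filters out the most recent btrfs error messages.)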

 

dev stats:

[/dev/nvme0n1p1].write_io_errs    291117377
[/dev/nvme0n1p1].read_io_errs     382285749
[/dev/nvme0n1p1].flush_io_errs    2039859
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     51
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
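(For reference, these counters come from btrfs device stats and can be read, and reset, from the console; a minimal sketch assuming the pool is mounted at the usual /mnt/cache:

btrfs device stats /mnt/cache        # print the per-device error counters shown above
btrfs device stats -z /mnt/cache     # print them and reset the counters to zero

Resetting the counters makes it easier to tell whether new errors appear after a hardware change.)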

 

fi usage

Overall:
    Device size:                   3.64TiB
    Device allocated:              3.46TiB
    Device unallocated:          177.91GiB
    Device missing:                  0.00B
    Used:                          3.00TiB
    Free (estimated):            326.36GiB      (min: 326.36GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

                  Data    Metadata System               
Id Path           RAID1   RAID1    RAID1     Unallocated
-- -------------- ------- -------- --------- -----------
 1 /dev/nvme0n1p1 1.73TiB  4.00GiB  64.00MiB    88.95GiB
 2 /dev/nvme1n1p1 1.73TiB  4.00GiB  64.00MiB    88.95GiB
-- -------------- ------- -------- --------- -----------
   Total          1.73TiB  4.00GiB  64.00MiB   177.91GiB
   Used           1.50TiB  1.62GiB 272.00KiB  
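(The allocation report above is the output of btrfs filesystem usage; assuming the same /mnt/cache mount point:

btrfs filesystem usage /mnt/cache

The "Data ratio: 2.00" line reflects the RAID1 profile: every block is stored twice, once on each NVMe.)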

 

 

What would be the best course of action to make sure my data is not gone?

I can still access the shares, but I'm concerned about their state :(

 

How can I identify the problem? Is it the NVMe disk or the motherboard?

The NVMe disk is in a motherboard slot, so there's no add-on card involved.

 

Stopping and starting the array does not make the error count go up, though.

 

34 minutes ago, JorgeB said:

Start by running a scrub. More info here.
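(A scrub can be started from the pool's page in the GUI, or from the console; a sketch assuming the pool is mounted at /mnt/cache:

btrfs scrub start /mnt/cache      # runs in the background
btrfs scrub status /mnt/cache     # shows progress and corrected/uncorrectable error counts

On a RAID1 pool the scrub verifies checksums and, where a block is bad on one device, rewrites it from the good copy on the other, for data that actually has checksums.)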

In the meantime I had disconnected the NVMe that generated the errors, to recover from the other NVMe disk; I did not start the array yet.

Then I saw your message and reconnected it, but now Unraid has started and does not recognize the reconnected disk??

 

Am I f'ed now?

 

EDIT: just to be clear, the device is listed but shown in blue, so when I select it, it says "data will be overwritten".

8 minutes ago, JorgeB said:

If you didn't start the array without it, it should still accept both devices, but if it doesn't, it's best to just leave the known good one; if all is well after array start, then wipe and re-add the other one.

So I started the array with the "bad" NVMe unassigned, but it seems it is still using it?

The status shows both devices, and the syslog contains many errors about it again.

On the main page it still shows the "bad" one as unassigned :(

This scares me...

 

When I click the cache disk in the GUI, it says a balance is running? :s


3 minutes ago, JorgeB said:

Run a scrub and, if all errors are corrected, reset the pool. To do that:

 

1. Stop the array.
2. If the Docker/VM services are using the cache pool, disable them.
3. Unassign all cache devices.
4. Start the array to make Unraid "forget" the current cache config.
5. Stop the array.
6. Reassign all cache devices (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device).
7. Re-enable Docker/VMs if needed.
8. Start the array.

Run the scrub while the balance is running?

3 minutes ago, JorgeB said:

Might as well.

OK, balance canceled and the scrub is now running.

It now says it's 4TB instead of 2TB (it's 2x 2TB NVMe).

The running status is now counting up corruption & generation errors on the bad one; is this a problem?

It was 0 just before the scrub.
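(The balance and scrub state can also be checked from the console; again assuming the /mnt/cache mount point:

btrfs balance status /mnt/cache   # shows whether a balance is still running
btrfs balance cancel /mnt/cache   # cancels a running balance
btrfs scrub status /mnt/cache     # progress plus corrected / uncorrectable error counts

Corruption and generation counters climbing on the bad device during the scrub should just be the bad copies being detected; as long as the scrub finishes with no uncorrectable errors, they should have been repaired from the other device.)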


Scrub has finished, no uncorrectable errors.

I have started to move data off the cache; the VMs that were started earlier are still having problems.

I suppose they were "permanently" corrupted when they were started before the scrub.

I didn't try Docker yet; I'm assuming it will be the same.

Is there a reason these errors are not taken into account when starting the array?

I'm guessing I will have to spend hours getting my VMs fixed... not to mention other data that may have been corrupted even though there was still a good drive, from what I can tell :(

 

1 hour ago, JorgeB said:

As mentioned in the link above, any share set to NOCOW (the default for the system shares) can't be fixed, since checksums are disabled.

Sure, but you also mentioned hoping that btrfs stats would be used in the near future... that was 2 years ago :(

I do have the notification from the linked thread, but this error count only appeared after the reboot (it was 0 before) and Unraid just happily started the array...

So the notification was useless in this case.

55 minutes ago, JorgeB said:

That part can be corrected by the user: just create new shares with COW on, it's what I've always done.
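(Whether an existing share folder was created NOCOW can be checked with lsattr; as an illustration, assuming the pool is mounted at /mnt/cache and the stock system/domains share names:

lsattr -d /mnt/cache/system /mnt/cache/domains

A 'C' in the attribute flags means NOCOW: files created there get no checksums, so btrfs can neither detect nor repair corruption in them, which is why the scrub couldn't help those shares.)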

I never knew about COW & NOCOW, because those shares are created by default and all the rest are set to Auto by default...

This should be flagged more prominently in the shares list, like the warning sign.

To me, that warning sign showing a green light means the share is safe, but in reality it's not...

I just spent 16 hours overnight with no sleep to get everything working again because of a silly parameter...

Very disappointing; thank you for your quick responses though.

 

Just now, JorgeB said:

COW enabled has some disadvantages for VMs, like increased fragmentation and write amplification; that's why LT disabled it by default for those shares, but I still prefer to have it enabled for data integrity.

True, but that is negated when we're talking about SSDs, is it not?

I too prefer integrity and was not aware my data was not safe.

Just now, JorgeB said:

Fragmentation yes, but the increased write amplification is not so good, since it can reduce the SSD's life.

There are arguments for both but I still think the share list gives a false sense of security.

I should never have started the array with the "bad" one still connected, as this corrupted everything on the domains & system shares.

It's disappointing that, after n years, it still seems too difficult to check btrfs stats and to show the NOCOW/COW status more clearly.

For something that is a NAS at its core, this seems quite relevant and important.

On 11/3/2020 at 1:33 PM, JorgeB said:

Fragmentation yes, but the increased write amplification is not so good, since it can reduce the SSD's life.

So I read about the "bug" that causes excessive writes to SSDs, especially EVOs...

Mine are rated for 1200 TBW and are at around 1500 TBW now (in 2 years' time) :(

In the new beta there is a solution, but there are also issues.

My thought is: can I upgrade to the new beta, recreate the cache (on new drives) with the new partition layout, and revert back to 6.8.3 if the need arises?

