Cache seems corrupted, nothing to see in GUI


WEHA


I shut down my Unraid server to replace the RAM (64GB to 128GB).

Once it restarted, I enabled Docker and virtual machines...

The Docker page says the service failed to start, and the VMs barely start, claiming corruption everywhere.

 

I checked the syslog and saw these messages everywhere:

BTRFS error (device nvme0n1p1): parent transid verify failed on 1326431879168 wanted 214008171 found 213104704

BTRFS error (device nvme0n1p1): parent transid verify failed on 612809326592 wanted 213820848 found 213103567
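(For anyone following along: these lines can be pulled from the system log on the console with something like the command below, assuming the standard Unraid syslog location:

grep -i 'BTRFS error' /var/log/syslog | tail -n 50

This just filters out the most recent btrfs error messages.)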

 

dev stats:

[/dev/nvme0n1p1].write_io_errs    291117377
[/dev/nvme0n1p1].read_io_errs     382285749
[/dev/nvme0n1p1].flush_io_errs    2039859
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     51
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
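(For reference, these counters come from btrfs device stats and can be read, and reset, from the console; a minimal sketch assuming the pool is mounted at the usual /mnt/cache:

btrfs device stats /mnt/cache        # print the per-device error counters shown above
btrfs device stats -z /mnt/cache     # print them and reset the counters to zero

Resetting the counters makes it easier to tell whether new errors appear after a hardware change.)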

 

fi usage

Overall:
    Device size:                   3.64TiB
    Device allocated:              3.46TiB
    Device unallocated:          177.91GiB
    Device missing:                  0.00B
    Used:                          3.00TiB
    Free (estimated):            326.36GiB      (min: 326.36GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

                  Data    Metadata System               
Id Path           RAID1   RAID1    RAID1     Unallocated
-- -------------- ------- -------- --------- -----------
 1 /dev/nvme0n1p1 1.73TiB  4.00GiB  64.00MiB    88.95GiB
 2 /dev/nvme1n1p1 1.73TiB  4.00GiB  64.00MiB    88.95GiB
-- -------------- ------- -------- --------- -----------
   Total          1.73TiB  4.00GiB  64.00MiB   177.91GiB
   Used           1.50TiB  1.62GiB 272.00KiB  
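(The allocation report above is the output of btrfs filesystem usage; assuming the same /mnt/cache mount point:

btrfs filesystem usage /mnt/cache

The "Data ratio: 2.00" line reflects the RAID1 profile: every block is stored twice, once on each NVMe.)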

 

 

What would be the best course of action to make sure my data is not gone?

I can still access the shares, but I'm concerned about their state :(

 

How can I identify the problem? Is it the NVMe disk or the motherboard?

The NVMe disk is in a motherboard slot, so there's no add-on card involved.

 

Stopping and starting the array does not make the error count go up, though.

 

34 minutes ago, JorgeB said:

Start by running a scrub. More info here.
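(A scrub can be started from the pool's page in the GUI, or from the console; a sketch assuming the pool is mounted at /mnt/cache:

btrfs scrub start /mnt/cache      # runs in the background
btrfs scrub status /mnt/cache     # shows progress and corrected/uncorrectable error counts

On a RAID1 pool the scrub verifies checksums and, where a block is bad on one device, rewrites it from the good copy on the other, for data that actually has checksums.)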

In the meantime I had disconnected the NVMe that generated the errors, to recover from the other NVMe disk; I did not start the array yet.

Then I saw your message and reconnected it, but now Unraid has started and does not recognize the reconnected disk??

 

Am I f'ed now?

 

EDIT: just to be clear, the device is listed but shown in blue, so when I select it, it says "data will be overwritten".

8 minutes ago, JorgeB said:

If you didn't start the array without it, it should still accept both devices, but if it doesn't, it's best to just leave the known good one; if all is well after array start, then wipe and re-add the other one.

So I started the array with the "bad" NVMe unassigned, but it seems it is still using it?

The status shows both devices, and the syslog contains many errors about it again.

On the main page it still shows the "bad" one as unassigned :(

This scares me...

 

When I click the cache disk in the GUI, it says a balance is running? :s


3 minutes ago, JorgeB said:

Run a scrub and, if all errors are corrected, reset the pool. To do that:

 

1. Stop the array.
2. If the Docker/VM services are using the cache pool, disable them.
3. Unassign all cache devices.
4. Start the array to make Unraid "forget" the current cache config.
5. Stop the array.
6. Reassign all cache devices (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device).
7. Re-enable Docker/VMs if needed.
8. Start the array.

Run the scrub while the balance is running?

3 minutes ago, JorgeB said:

Might as well.

OK, balance canceled and the scrub is now running.

It now says it's 4TB instead of 2TB (it's 2x 2TB NVMe).

The running status is now counting up corruption & generation errors on the bad one; is this a problem?

It was 0 just before the scrub.
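(The balance and scrub state can also be checked from the console; again assuming the /mnt/cache mount point:

btrfs balance status /mnt/cache   # shows whether a balance is still running
btrfs balance cancel /mnt/cache   # cancels a running balance
btrfs scrub status /mnt/cache     # progress plus corrected / uncorrectable error counts

Corruption and generation counters climbing on the bad device during the scrub should just be the bad copies being detected; as long as the scrub finishes with no uncorrectable errors, they should have been repaired from the other device.)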


Scrub has finished, no uncorrectable errors.

I have started to move data off the cache; the VMs that were started earlier are still having problems.

I suppose they were "permanently" corrupted when they were started before the scrub.

I didn't try Docker yet; I'm assuming it will be the same.

Is there a reason these errors are not taken into account when starting the array?

I'm guessing I will have to spend hours getting my VMs fixed... not to mention other data that may have been corrupted even though there was still a good drive, from what I can tell :(

 

1 hour ago, JorgeB said:

As mentioned in the link above, any share set to NOCOW (the default for the system shares) can't be fixed, since checksums are disabled.

Sure, but you also mentioned hoping that btrfs stats would be used in the near future... that was 2 years ago :(

I do have the notification from the linked thread, but this error count only appeared after the reboot (it was 0 before) and Unraid just happily started the array...

So the notification was useless in this case.

55 minutes ago, JorgeB said:

That part can be corrected by the user: just create new shares with COW on, it's what I've always done.
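(Whether an existing share folder was created NOCOW can be checked with lsattr; as an illustration, assuming the pool is mounted at /mnt/cache and the stock system/domains share names:

lsattr -d /mnt/cache/system /mnt/cache/domains

A 'C' in the attribute flags means NOCOW: files created there get no checksums, so btrfs can neither detect nor repair corruption in them, which is why the scrub couldn't help those shares.)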

I never knew about COW & NOCOW, because those shares are created by default and all the rest are set to Auto by default...

This should be flagged more prominently in the shares list, like the warning sign.

To me, that warning sign showing a green light means the share is safe, but in reality it's not...

I just spent 16 hours overnight with no sleep to get everything working again because of a silly parameter...

Very disappointing; thank you for your quick responses though.

 

Just now, JorgeB said:

COW enabled has some disadvantages for VMs, like increased fragmentation and write amplification; that's why LT disabled it by default for those shares, but I still prefer to have it enabled for data integrity.

True, but that is negated when we're talking about SSDs, is it not?

I too prefer integrity and was not aware my data was not safe.

Just now, JorgeB said:

Fragmentation yes, but the increased write amplification is not so good, since it can reduce the SSD's life.

There are arguments for both but I still think the share list gives a false sense of security.

I should never have started the array with the "bad" one still connected, as this corrupted everything on the domains & system shares.

It's disappointing that, after n years, it still seems too difficult to check btrfs stats and to show the NOCOW/COW status more clearly.

For something that is a NAS at its core, this seems quite relevant and important.

On 11/3/2020 at 1:33 PM, JorgeB said:

Fragmentation yes, but the increased write amplification is not so good, since it can reduce the SSD's life.

So I read about the "bug" that causes excessive writes to SSDs, especially EVOs...

Mine are rated for 1200 TBW and are at around 1500 TBW now (in 2 years' time) :(

In the new beta there is a solution, but there are also issues.

My thought is: can I upgrade to the new beta, recreate the cache (on new drives) with the new partition layout, and revert back to 6.8.3 if the need arises?

