DeathStar Darth Posted June 17, 2021

TL;DR: NVMe cache pool has gone into read-only mode. How do I recover (put it back into read/write mode and scrub)?

A few hours after a long-running (several days) pre-clear of a new disk completed, my server unexpectedly reset itself. That was pretty unusual in itself. Upon reboot, I kept being directed to BIOS setup rather than POST. After multiple attempts using the old tricks - reseating hardware, CMOS factory defaults etc. - failed, I took the USB boot drive and put it in my Mac to have a look at the contents. All the files present were readable, and I took a backup so that details of disks, shares, networks etc. would be available to me later. I noticed that the syslinux folder was missing from the USB, and ldlinux.c32 etc. was not present anywhere (https://forums.unraid.net/topic/37416-boot-failed-failed-to-load-ldlinuxc32-help/?do=findComment&comment=360045). I downloaded a fresh copy of the install files from LT and copied them onto my USB, making sure I didn't overwrite the config files, which appeared intact.

After some more 'fun' with the BIOS (and a BIOS upgrade, as it was a year and six or so releases out of date) I eventually got it to bring up the Unraid boot options. Unraid booted fine... but I noticed one of the two NVMe drives in my cache pool was not appearing. Rebooted, looked at the BIOS, and saw only one being reported. More BIOS changes (using new options made available by the earlier BIOS upgrade) and both were visible again in the BIOS. Rebooted into Unraid, re-added the 2nd NVMe back into the cache pool (and added the newly pre-cleared disk 3 to the array).

All appeared OK, but then I noticed Fix Common Problems flagging an issue: "Unable to write to cache_app" / "Drive mounted read-only or completely full. Begin Investigation Here". In this case it is a read-only issue, not a completely-full issue, as shown on the main dashboard (which shows the utilisation exactly as I recall it before all the issues started).

Running btrfs dev stats /dev/cache returns the hoped-for 0 counts (this is the SSD pool), but running btrfs dev stats /dev/cache_apps shows issues (this is the pool the NVMe drives are in):

[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 42491
[/dev/nvme1n1p1].read_io_errs 3211
[/dev/nvme1n1p1].flush_io_errs 1416
[/dev/nvme1n1p1].corruption_errs 2777
[/dev/nvme1n1p1].generation_errs 0

Trying to scrub it with btrfs scrub start -dB /dev/nvme1n1p1 highlights the fact that it's in read-only mode:

ERROR: scrubbing /dev/nvme1n1p1 failed for device id 2: ret=-1, errno=30 (Read-only file system)
Scrub device /dev/nvme1n1p1 (id 2) canceled
Scrub started: Thu Jun 17 20:28:50 2021
Status: aborted
Duration: 0:00:00
Total to scrub: 522.03GiB
Rate: 0.00B/s
Error summary: no errors found

Checking the syslog shows the pain.

So my question is, where do I go from here?
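As a quick sanity check, something like the following confirms whether the pool really is mounted read-only rather than full (the /mnt/cache_apps mount point is an assumption based on the pool name above, so adjust to suit):

findmnt -no OPTIONS /mnt/cache_apps   # look for "ro" among the mount options
dmesg | grep -i btrfs | tail -n 20    # the error that forced the filesystem read-only usually shows up here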
JorgeB Posted June 18, 2021

Please post the diagnostics: Tools -> Diagnostics
DeathStar Darth Posted June 18, 2021 Author

1 hour ago, JorgeB said: Please post the diagnostics: Tools -> Diagnostics

thedeathstar-diagnostics-20210618-0921.zip
JorgeB Posted June 18, 2021

12 hours ago, DeathStar Darth said:
[/dev/nvme1n1p1].write_io_errs 42491
[/dev/nvme1n1p1].read_io_errs 3211
[/dev/nvme1n1p1].flush_io_errs 1416
[/dev/nvme1n1p1].corruption_errs 2777
[/dev/nvme1n1p1].generation_errs 0

The log doesn't cover it, but those errors suggest one of the NVMe devices dropped offline at some point in the past; see here for better pool monitoring in the future. As for the current issue, reboot and then immediately run a scrub to see if it can bring the other device up to date. If the fs goes read-only again, your best bet is to back up and reformat.
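A minimal sketch of that scrub step, assuming the pool mounts at /mnt/cache_apps (the pool name from the first post; adjust the path to your setup):

btrfs scrub start -B /mnt/cache_apps   # -B stays in the foreground and prints a summary when it finishes
btrfs scrub status /mnt/cache_apps     # check progress from another shell while it runs
btrfs dev stats /mnt/cache_apps        # the error counters should stop growing once the pool is healthy again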
DeathStar Darth Posted June 19, 2021 Author

On 6/18/2021 at 9:36 AM, JorgeB said: Log doesn't cover it but those errors suggest one of the NVMe devices dropped offline at some point in the past, see here for better pool monitoring for the future. As for the current issue, reboot then immediately run a scrub to see if it can bring the other device up to date, if the fs goes read only again best bet is to backup and reformat.

Yes, when I eventually got Unraid to fire up after the series of unfortunate events, the 2nd NVMe was not showing, nor was it showing in the BIOS on closer inspection. I'd already come across that post whilst trying to resolve the issue, and it was immediately added to my userscripts 👍

I've already rebooted and it stays in RO mode. I've tried to copy the whole contents of the FS to another drive ahead of reformatting, and whilst many thousands of files copy cleanly, I am seeing a lot of others with errors:

cp: error reading 'foobar.file': Input/output error

I'm wondering if it's worth me trying to:
- stop the array
- remove the nvme that went missing (nvme1n1)
- restart the array - and see if it's in RW mode
- scrub (if in RW mode), or btrfs restore -v /dev/nvme1n1 /mnt/disk3/restore (the nvme that has always been present, and the newly added disk in the array)?

Edited June 19, 2021 by DeathStar Darth
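For context, at the plain btrfs level (outside Unraid's own array handling, so this is a sketch rather than the supported route) copying data off just the surviving member with the failing device removed or disconnected would look something like the following; /mnt/recovery and the choice of /dev/nvme0n1p1 as the intact device are assumptions:

mkdir -p /mnt/recovery
mount -o degraded,ro /dev/nvme0n1p1 /mnt/recovery   # mount the pool with one member missing, read-only for safety
cp -a /mnt/recovery/. /mnt/disk3/restore/           # copy off whatever still reads cleanly
umount /mnt/recovery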
JorgeB Posted June 20, 2021

17 hours ago, DeathStar Darth said: I am seeing a lot of others with errors cp: error reading 'foobar.file': Input/output error

These suggest there's data corruption, but I would need to see the diags to confirm. If that's true you could use btrfs restore (the corrupt files will still be corrupt), but it's likely your best option.
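A minimal sketch of the restore route; the source device and destination folder here are assumptions (the source should be one of the pool members, and the pool should not be mounted while restore runs):

btrfs restore -v -D /dev/nvme0n1p1 /mnt/disk3/restore   # dry run: list what would be recovered without writing anything
btrfs restore -v /dev/nvme0n1p1 /mnt/disk3/restore      # then the actual copy to the destination folder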