Version: 6.9.2: NVMe btrfs pool gone into read-only mode



TL;DR: NVMe cache pool has gone into read-only mode. How do I recover it (put it in read/write mode and scrub)?

 

A few hours after a long-running (several days) pre-clear of a new disk completed, my server unexpectedly reset itself.

 

That was pretty unusual in itself.

 

Upon reboot, I kept on being directed to BIOS setup rather than POST.

 

After multiple attempts using the old tricks (reseating hardware, CMOS factory defaults, etc.) failed, I took the USB boot drive and plugged it into my Mac to have a look at the content. All the files present were readable, and I took a backup to ensure details of disks, shares, networks etc. would be available to me later.

 

I noticed that the syslinux folder was missing from the USB, and ldlinux.c32 etc. were not present anywhere (https://forums.unraid.net/topic/37416-boot-failed-failed-to-load-ldlinuxc32-help/?do=findComment&comment=360045).

 

I downloaded a fresh copy of the install files from LT and copied them onto my USB, ensuring I didn't overwrite the config files that appeared intact.

 

After some more 'fun' with the BIOS (and a BIOS upgrade, as it was a year and 6 or so releases out of date) I eventually got it to take me to the Unraid boot options.

Unraid booted fine... but I noticed one of the 2 NVMe drives in my cache pool was not appearing.

 

Reboot and look at the BIOS, and see only one is being reported. More BIOS changes (using new options made available by the earlier BIOS upgrade) and both are visible again in the BIOS.

 

Reboot into Unraid.

 

Re-add the 2nd NVMe to the cache pool (and add the new pre-cleared disk (disk3) to the array).

 

All appeared OK, but then I noticed Fix Common Problems flagging an issue: "Unable to write to cache_app", "Drive mounted read-only or completely full. Begin Investigation Here". In this case it is a read-only issue, not a completely-full one, as shown on the main dashboard (which shows the utilisation exactly as I recall it before all the issues started):

 

image.png.430538ca9b2aeafad8bafd74cc8db67b.png

 

Running 

btrfs dev stats /dev/cache

returns the hoped-for 0 count (this is an SSD pool), but running

btrfs dev stats /dev/cache_apps 

shows issues (this is the pool the NVMe drives are in):

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    42491
[/dev/nvme1n1p1].read_io_errs     3211
[/dev/nvme1n1p1].flush_io_errs    1416
[/dev/nvme1n1p1].corruption_errs  2777
[/dev/nvme1n1p1].generation_errs  0
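
For anyone wanting to pick out the failing counters without reading the whole listing, here is a minimal sketch (not part of the original post) that filters btrfs dev stats output down to the non-zero lines. The function name nonzero_btrfs_stats is made up for illustration, and it assumes the two-column "[device].counter value" format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `btrfs dev stats` output on stdin and print
# only the counters that are non-zero.
# Exit status: 0 if every counter is zero, 1 if at least one is non-zero.
nonzero_btrfs_stats() {
    awk '
        # Expected line shape: [/dev/nvme1n1p1].write_io_errs    42491
        NF == 2 && $2 ~ /^[0-9]+$/ && $2 > 0 { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

It would be fed from the real command, for example: btrfs dev stats <pool path> | nonzero_btrfs_stats. On Unraid the pool path is probably /mnt/cache_apps, but that is an assumption, so use whichever path worked for you. With the listing above, only the four non-zero nvme1n1p1 lines would be printed.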

 

Trying to scrub it with 

btrfs scrub start -dB /dev/nvme1n1p1

highlights the fact it's in read-only mode 

ERROR: scrubbing /dev/nvme1n1p1 failed for device id 2: ret=-1, errno=30 (Read-only file system)

Scrub device /dev/nvme1n1p1 (id 2) canceled
Scrub started:    Thu Jun 17 20:28:50 2021
Status:           aborted
Duration:         0:00:00
Total to scrub:   522.03GiB
Rate:             0.00B/s
Error summary:    no errors found

 

Checking the syslog shows the pain:

image.thumb.png.80e5b10067de2b679aaa6db85641bd67.png

 

image.thumb.png.d1a8847deccc5d6d1720a50ec87c6ef8.png

 

image.thumb.png.f72e34298c5d1c42d8cfea4287659f49.png

 

image.thumb.png.06e6d0680da70b1102c7531f765c1a56.png

 

So my question is, where do I go from here?

 

12 hours ago, DeathStar Darth said:

[/dev/nvme1n1p1].write_io_errs    42491
[/dev/nvme1n1p1].read_io_errs     3211
[/dev/nvme1n1p1].flush_io_errs    1416
[/dev/nvme1n1p1].corruption_errs  2777
[/dev/nvme1n1p1].generation_errs  0

 

The log doesn't cover it, but those errors suggest one of the NVMe devices dropped offline at some point in the past. See here for better pool monitoring in the future.

 

As for the current issue, reboot and then immediately run a scrub to see if it can bring the other device up to date. If the filesystem goes read-only again, the best bet is to back up and reformat.

 

 

On 6/18/2021 at 9:36 AM, JorgeB said:

 

The log doesn't cover it, but those errors suggest one of the NVMe devices dropped offline at some point in the past. See here for better pool monitoring in the future.

 

As for the current issue, reboot and then immediately run a scrub to see if it can bring the other device up to date. If the filesystem goes read-only again, the best bet is to back up and reformat.

 

 

 

Yes, when I eventually got Unraid to fire up after the series of unfortunate events, the 2nd NVMe was not showing, nor, on closer inspection, was it showing in the BIOS.

 

I'd already come across that post whilst trying to resolve the issue, and it was immediately added to my userscripts 👍

 

I've already rebooted and it stays in RO mode. I've tried to copy the whole content of the FS to another drive ahead of reformatting, and whilst many thousands of files copy cleanly, I am seeing a lot of others with errors:

cp: error reading 'foobar.file': Input/output error
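
When a bulk copy like this hits unreadable files, a list of exactly which ones failed is useful for deciding what to restore from elsewhere. Below is a minimal sketch of one way to get that list. It is not what was run in the post: the function name copy_logging_failures and its three arguments are invented for illustration, and it assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper: copy every regular file under SRC to the same
# relative path under DST. Files that cannot be copied are recorded in LOG
# (one relative path per line) instead of stopping the run.
copy_logging_failures() {
    src=${1%/}      # drop a trailing slash so prefix stripping works
    dst=${2%/}
    log=$3
    : > "$log"      # start with an empty log
    find "$src" -type f | while IFS= read -r file; do
        rel=${file#"$src"/}
        # Create the target directory, then copy; any failure is logged.
        # Note: cp may leave a partial file behind when a read fails.
        if ! { mkdir -p "$dst/$(dirname -- "$rel")" &&
               cp -p -- "$file" "$dst/$rel"; } 2>/dev/null; then
            printf '%s\n' "$rel" >> "$log"
        fi
    done
}
```

For example, copy_logging_failures /mnt/cache_apps /mnt/disk3/backup /tmp/failed.log, where all three paths are placeholders. Every path listed in the log should be treated as missing or damaged in the copy.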

 

 

I'm wondering if it's worth trying to:

  • stop the array
  • remove the NVMe that went missing (nvme1n1)
  • restart the array and see if the pool comes up in RW mode
  • scrub (if in RW mode)

 

Or should I run btrfs restore -v /dev/nvme1n1 /mnt/disk3/restore (from the NVMe that has always been present, to the newly added disk in the array)?

Edited by DeathStar Darth
17 hours ago, DeathStar Darth said:

I am seeing a lot of others with errors


cp: error reading 'foobar.file': Input/output error

Those errors suggest there's data corruption, but I would need to see the diagnostics to confirm. If that's the case, you could use btrfs restore; the affected files will still be corrupt, but it's likely your best option.

 

 
