DeathStar Darth Posted June 17, 2021

TL;DR: NVMe cache pool has gone into read-only mode. How do I recover (put it back into read/write mode and scrub)?

A few hours after a long-running (several days) pre-clear of a new disk completed, my server unexpectedly reset itself. That was pretty unusual in itself. Upon reboot, I kept being directed to BIOS setup rather than POST. After multiple attempts using the old tricks - reseating hardware, CMOS factory defaults etc. - failed, I took the USB boot drive and put it in my Mac to have a look at the contents. All the files present were readable, and I took a backup so that details of disks, shares, networks etc. would be available to me later. I noticed that the syslinux folder was missing from the USB, and ldlinux.c32 etc. was not present anywhere (https://forums.unraid.net/topic/37416-boot-failed-failed-to-load-ldlinuxc32-help/?do=findComment&comment=360045). I downloaded a fresh copy of the install files from LT and copied them onto my USB, making sure I didn't overwrite the config files, which appeared intact.

After some more 'fun' with the BIOS (and a BIOS upgrade, as it was a year and six or so releases out of date) I eventually got it to bring up the Unraid boot options. Unraid booted fine... but I noticed one of the two NVMe drives in my cache pool was not appearing. Rebooted, looked at the BIOS, and saw only one being reported. More BIOS changes (using new options made available by the earlier BIOS upgrade) and both were visible again in the BIOS. Rebooted into Unraid, re-added the 2nd NVMe back into the cache pool (and added the newly pre-cleared disk 3 to the array).

All appeared OK, but then I noticed Fix Common Problems flagging an issue: "Unable to write to cache_app" / "Drive mounted read-only or completely full. Begin Investigation Here". In this case it is a read-only issue, not a completely-full issue, as shown on the main dashboard (which shows the utilisation exactly as I recall it before all the issues started).

Running btrfs dev stats /dev/cache returns the hoped-for 0 counts (this is the SSD pool), but running btrfs dev stats /dev/cache_apps shows issues (this is the pool the NVMe drives are in):

[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0
[/dev/nvme1n1p1].write_io_errs 42491
[/dev/nvme1n1p1].read_io_errs 3211
[/dev/nvme1n1p1].flush_io_errs 1416
[/dev/nvme1n1p1].corruption_errs 2777
[/dev/nvme1n1p1].generation_errs 0

Trying to scrub it with btrfs scrub start -dB /dev/nvme1n1p1 highlights the fact that it's in read-only mode:

ERROR: scrubbing /dev/nvme1n1p1 failed for device id 2: ret=-1, errno=30 (Read-only file system)
Scrub device /dev/nvme1n1p1 (id 2) canceled
Scrub started: Thu Jun 17 20:28:50 2021
Status: aborted
Duration: 0:00:00
Total to scrub: 522.03GiB
Rate: 0.00B/s
Error summary: no errors found

Checking the syslog shows the pain.

So my question is, where do I go from here?
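As a quick sanity check, something like the following confirms whether the pool really is mounted read-only rather than full (the /mnt/cache_apps mount point is an assumption based on the pool name above, so adjust to suit):

findmnt -no OPTIONS /mnt/cache_apps   # look for "ro" among the mount options
dmesg | grep -i btrfs | tail -n 20    # the error that forced the filesystem read-only usually shows up here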
JorgeB Posted June 18, 2021

Please post the diagnostics: Tools -> Diagnostics
DeathStar Darth Posted June 18, 2021 Author

1 hour ago, JorgeB said: Please post the diagnostics: Tools -> Diagnostics

thedeathstar-diagnostics-20210618-0921.zip
JorgeB Posted June 18, 2021

12 hours ago, DeathStar Darth said:
[/dev/nvme1n1p1].write_io_errs 42491
[/dev/nvme1n1p1].read_io_errs 3211
[/dev/nvme1n1p1].flush_io_errs 1416
[/dev/nvme1n1p1].corruption_errs 2777
[/dev/nvme1n1p1].generation_errs 0

The log doesn't cover it, but those errors suggest one of the NVMe devices dropped offline at some point in the past; see here for better pool monitoring in the future. As for the current issue, reboot and then immediately run a scrub to see if it can bring the other device up to date. If the fs goes read-only again, your best bet is to back up and reformat.
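A minimal sketch of that scrub step, assuming the pool mounts at /mnt/cache_apps (the pool name from the first post; adjust the path to your setup):

btrfs scrub start -B /mnt/cache_apps   # -B stays in the foreground and prints a summary when it finishes
btrfs scrub status /mnt/cache_apps     # check progress from another shell while it runs
btrfs dev stats /mnt/cache_apps        # the error counters should stop growing once the pool is healthy again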
DeathStar Darth Posted June 19, 2021 Author

On 6/18/2021 at 9:36 AM, JorgeB said: Log doesn't cover it but those errors suggest one of the NVMe devices dropped offline at some point in the past, see here for better pool monitoring for the future. As for the current issue, reboot then immediately run a scrub to see if it can bring the other device up to date, if the fs goes read only again best bet is to backup and reformat.

Yes, when I eventually got Unraid to fire up after the series of unfortunate events, the 2nd NVMe was not showing, nor was it showing in the BIOS on closer inspection. I'd already come across that post whilst trying to resolve the issue, and it was immediately added to my userscripts 👍

I've already rebooted and it stays in RO mode. I've tried to copy the whole contents of the FS to another drive ahead of reformatting, and whilst many thousands of files copy cleanly, I am seeing a lot of others with errors:

cp: error reading 'foobar.file': Input/output error

I'm wondering if it's worth me trying to:
- stop the array
- remove the nvme that went missing (nvme1n1)
- restart the array - and see if it's in RW mode
- scrub (if in RW mode), or btrfs restore -v /dev/nvme1n1 /mnt/disk3/restore (the nvme that has always been present, and the newly added disk in the array)?

Edited June 19, 2021 by DeathStar Darth
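For context, at the plain btrfs level (outside Unraid's own array handling, so this is a sketch rather than the supported route) copying data off just the surviving member with the failing device removed or disconnected would look something like the following; /mnt/recovery and the choice of /dev/nvme0n1p1 as the intact device are assumptions:

mkdir -p /mnt/recovery
mount -o degraded,ro /dev/nvme0n1p1 /mnt/recovery   # mount the pool with one member missing, read-only for safety
cp -a /mnt/recovery/. /mnt/disk3/restore/           # copy off whatever still reads cleanly
umount /mnt/recovery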
JorgeB Posted June 20, 2021

17 hours ago, DeathStar Darth said: I am seeing a lot of others with errors cp: error reading 'foobar.file': Input/output error

These suggest there's data corruption, but I would need to see the diags to confirm. If that's true you could use btrfs restore (the corrupt files will still be corrupt), but it's likely your best option.
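A minimal sketch of the restore route; the source device and destination folder here are assumptions (the source should be one of the pool members, and the pool should not be mounted while restore runs):

btrfs restore -v -D /dev/nvme0n1p1 /mnt/disk3/restore   # dry run: list what would be recovered without writing anything
btrfs restore -v /dev/nvme0n1p1 /mnt/disk3/restore      # then the actual copy to the destination folder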