Cache pool btrfs missing issue/disappeared/now unassigned


Solved by JorgeB


Hi everyone,

So I had a weird sequence of events happen with my cache disk.

I got a "Cache pool BTRFS missing device(s)" warning for my cache disk. I then ran a scrub with "fix errors" checked, but the error still showed. Then I stopped the array and started it again, and the cache disk disappeared when the array came back up.

Then I restarted the whole Unraid system, and now the disk shows up in Unassigned Devices. I'm hoping it can be remounted or salvaged somehow. I've attached diagnostics and a screenshot of how it appears; it's the dev3 device.

Any help would be great. Thank you!

Screen Shot 2021-12-22 at 2.34.12 PM.png

zigplex2-diagnostics-20211222-1430.zip

dlandon replied:

You might try reassigning it back to the array, but I'd probably do a UD (Unassigned Devices) file system check first. You'd have to mount it, then click the check icon.

It looks mountable in UD if you'd rather just mount it and copy your data off. I'd suggest setting it to read-only to stop any write activity: click the three gears and set 'Read Only' to On.

Adam Kesher replied:

So I did try this: I ran the scrub and got the following in the log. I changed the last digits of the local IPs; the XX address is an old IP this server had before I gave it a dedicated IP (YY) on my router. Could this be part of the issue, or is it just a problem with running the scrub?

 

Quote

 

Dec 22 12:09:31 Zigplex2 nginx: 2021/12/22 12:09:31 [error] 7434#7434: *7170 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.1.XX, server: , request: "GET /plugins/unassigned.devices/include/fsck.php?device=/dev/nvme0n1p1&fs=btrfs&luks=&serial=Samsung_SSD_970_EVO_1TB_S467NX0M827957N&check_type=ro&type=Done HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.1.YY", referrer: "http://192.168.1.YY/Main"

 

3 hours ago, dlandon said:

OK, looking at the message again, I see where the UD fsck had a problem. I assume the scrub didn't show any results?

Close any browsers you had been using to access the server, then start the browser again and retry.

OK, I ran it again. The same upstream timeout error popped up, but this time I also got:

 

Quote

Dec 22 15:53:30 Zigplex2 kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 1 with status: 0

 

The pop-up window for the scrub only says this (though it did show 'transferring data from IP' for a bit):

Quote

FS: btrfs

Executing file system scrub: /sbin/btrfs scrub start -B -R -d -r /dev/nvme0n1p1 2>&1
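For reference, the command in that pop-up can be reproduced as below. This is a minimal Python sketch, assuming btrfs-progs at `/sbin/btrfs` (as in the pop-up) and root privileges; the helper names `scrub_command` and `run_scrub` are hypothetical. The flag notes in the comments follow the btrfs-scrub man page. Note that `-r` makes this a read-only scrub, so that run reports errors but does not repair them.

```python
import subprocess


def scrub_command(device, read_only=True):
    """Build the same argv as the UD scrub pop-up (flag notes from btrfs-scrub(8))."""
    cmd = [
        "/sbin/btrfs", "scrub", "start",
        "-B",  # do not background; wait and print statistics when finished
        "-R",  # raw print mode: full counters instead of a summary
        "-d",  # separate statistics for each device
    ]
    if read_only:
        cmd.append("-r")  # read-only: report errors, do not try to repair them
    cmd.append(device)
    return cmd


def run_scrub(device, read_only=True):
    """Run the scrub (needs root and btrfs-progs) and return the completed process."""
    return subprocess.run(
        scrub_command(device, read_only),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode/stdout yourself instead of raising
    )


print(" ".join(scrub_command("/dev/nvme0n1p1")))
```

Passing `read_only=False` drops `-r`, which is what you would want if the goal is to let the scrub fix errors.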

Attaching diagnostics after all this:

zigplex2-diagnostics-20211222-1859.zip

Edited by Adam Kesher
missing info, adding diagnostics
5 hours ago, JorgeB said:

If it was a single device cache you can just re-assign it.

It is a single-device cache.

 

When I stop the array to reassign it, this is how it appears in Unassigned Devices:

Screen Shot 2021-12-23 at 9.05.41 AM.png

 

If I click that blue disk name, I get this:

Screen Shot 2021-12-23 at 9.06.08 AM.png

 

When I assign it to the cache pool, this "Historical Devices" entry shows up:

 

Screen Shot 2021-12-23 at 9.07.53 AM.png

 

 

Is this something I need to worry about (formatting or anything like that) before starting the array with the disk back in the cache pool?

 

Do I need to change the disk's mount point first, or do I just assign the disk?

 

Thanks for all your help, everyone. I'm just paranoid.

Screen Shot 2021-12-23 at 9.09.37 AM.png

On 12/23/2021 at 9:28 AM, JorgeB said:

The pool is showing 3 slots. If only one device was assigned, it's still OK, but make sure you've started the array once before without any device assigned. There must not be an "all data on this device will be deleted" warning for the pool device.

Thanks for this.

 

Reassigning it worked and I was up and running fine yesterday, but today I woke up to the same "Cache pool BTRFS missing device(s)" warning.

 

Is there anything else I can do to prevent this from happening? When I ran the scrub yesterday there were no errors, which is very frustrating. All the help has been great.

zigplex2-diagnostics-20211224-1116.zip

Edited by Adam Kesher
added diagnostics

I stopped the array and the cache disappeared again, but after a full shutdown and restart it reappeared and I got this:

 

It auto-mounted and was correctly assigned again. This is super confusing: when I had the issue from my previous post, the 'missing devices' warning appeared overnight while the machine was running, with no restart or shutdown. Diagnostics from this boot are included.

Screen Shot 2021-12-24 at 12.15.49 PM.png

zigplex2-diagnostics-20211224-1218.zip

Solution (JorgeB):

The NVMe device dropped offline:

 

Dec 23 22:48:36 Zigplex2 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Dec 23 22:48:36 Zigplex2 kernel: nvme 0000:08:00.0: enabling device (0000 -> 0002)
Dec 23 22:48:36 Zigplex2 kernel: nvme nvme0: Removing after probe failure status: -19
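If you want to check whether the drive keeps dropping out, you could scan the syslog from each set of diagnostics for the two kernel messages quoted above (`status: -19` corresponds to `-ENODEV`, i.e. the device is no longer there). This is a minimal Python sketch; the helper name `find_nvme_dropouts` is hypothetical, it assumes classic `Mon DD HH:MM:SS host kernel: message` syslog lines, and the patterns only match the message wording seen in this thread.

```python
import re

# Patterns copied from the kernel messages in this thread (assumption: other
# kernel versions may word NVMe failures differently).
DROPOUT_PATTERNS = [
    re.compile(r"nvme nvme\d+: controller is down; will reset"),
    re.compile(r"nvme nvme\d+: Removing after probe failure status: -?\d+"),
]


def find_nvme_dropouts(syslog_text):
    """Return (timestamp, message) pairs for syslog lines matching a dropout pattern."""
    hits = []
    for line in syslog_text.splitlines():
        if any(p.search(line) for p in DROPOUT_PATTERNS):
            timestamp = " ".join(line.split()[:3])   # e.g. "Dec 23 22:48:36"
            message = line.split("kernel: ", 1)[-1]  # text after the "kernel:" tag
            hits.append((timestamp, message))
    return hits


sample = """\
Dec 23 22:48:36 Zigplex2 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Dec 23 22:48:36 Zigplex2 kernel: nvme 0000:08:00.0: enabling device (0000 -> 0002)
Dec 23 22:48:36 Zigplex2 kernel: nvme nvme0: Removing after probe failure status: -19
"""

for timestamp, message in find_nvme_dropouts(sample):
    print(timestamp, "|", message)
```

The middle line (`enabling device`) is deliberately not matched, since it also appears during normal device initialization.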

 

Look for a BIOS update; the setting below also helps sometimes. Failing that, try a different brand/model of device or a different board.

 

Some NVMe devices have issues with power states on Linux. Try this: on the Main GUI page, click on Flash, scroll down to "Syslinux Configuration", make sure it's set to "Menu View" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.
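The GUI edit described above is the simplest route. If you would rather script the change against the config file itself, here is a minimal Python sketch that appends the parameter to the `append` line of the default boot option (the `label` block containing `menu default`) and does nothing if it is already present. The helper name and the sample config are hypothetical. The file is typically `/boot/syslinux/syslinux.cfg` on the flash drive, but verify that (and keep a backup) before writing anything back.

```python
PARAM = "nvme_core.default_ps_max_latency_us=0"


def add_kernel_param(cfg_text, param=PARAM):
    """Append `param` to the append line of the 'menu default' label block.

    Hypothetical helper: assumes each boot option starts with a 'label' line
    and the default one contains 'menu default'. Running it twice is a no-op.
    """
    lines = cfg_text.splitlines()

    # Collect (start, end) index ranges, one per 'label' block.
    blocks, start = [], None
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("label "):
            if start is not None:
                blocks.append((start, i))
            start = i
    if start is not None:
        blocks.append((start, len(lines)))

    for start, end in blocks:
        if "menu default" not in (l.strip().lower() for l in lines[start:end]):
            continue  # not the default boot option
        for i in range(start, end):
            stripped = lines[i].strip()
            if stripped.lower().startswith("append "):
                if param not in stripped.split():
                    lines[i] = lines[i].rstrip() + " " + param
                return "\n".join(lines) + "\n"
    raise ValueError("no append line found in a 'menu default' label block")


# Illustrative excerpt only; check your own file before changing anything.
sample_cfg = """\
default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
"""

print(add_kernel_param(sample_cfg))
```

Appending at the end of the line rather than splicing text in right after `initrd=/bzroot` avoids mangling lines like `initrd=/bzroot,/bzroot-gui`; for a plain `append initrd=/bzroot` line the result is exactly the line shown above.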
