6.9.2 - Cache pool BTRFS missing device



Suddenly had the message "Cache pool BTRFS missing device"

The pool I use for VMs and Docker, which runs on two NVMe drives, had a problem where one drive suddenly went missing.

I closed the VMs, took a diagnostics, and rebooted.

On reboot the pool appears fine and the VMs and Dockers are running, but I don't know how to check the health of the cache pool. Will BTRFS have fixed any differences automatically? Is there something I should do to force checks? Is there anything I must avoid doing after this issue?

Diagnostics attached (there might be a mess of other issues in there as I have a tendency to fiddle...)

tower-diagnostics-20210424-0857.zip

Please post after-reboot diags so we can see the current pool status.
 
P.S. You need to run xfs_repair on the URBACKUP UD device.
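(With the UD drive unmounted, the general form would be something like the command below; /dev/sdX1 is just a placeholder for the URBACKUP device's first partition.)

xfs_repair -n /dev/sdX1    # -n checks only; drop -n to actually repair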
Thanks, I have seen recent errors on that. I thought they were due to backups from locations with intermittent connectivity.
Pool status looked OK, but I will get some diagnostics. It's currently doing a parity check due to an unrelated unplanned reboot.



18 hours ago, JorgeB said:

You can also take a look here since it might take me longer to reply due to the weekend:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582

Checking btrfs dev stats showed lots of errors. 
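
(For reference, the device error counters can be checked from the console with something like the command below; the pool mount point is assumed to be /mnt/cache here.)

btrfs dev stats /mnt/cache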

 

I first ran a scrub without the "Repair corrupted blocks" option and it showed the following (which I initially took to mean no errors, but the error summary says otherwise).

 

UUID:             e8b8d9ec-0ad2-4867-b3cf-87b43a0d9d15
Scrub started:    Sun Apr 25 07:16:27 2021
Status:           finished
Duration:         0:03:07
Total to scrub:   1.10TiB
Rate:             6.02GiB/s
Error summary:    verify=1438 csum=167501
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0
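
(For anyone doing this from the console rather than the GUI, the two runs correspond roughly to the commands below; /mnt/cache is assumed as the pool mount point, -B waits for completion, -d prints per-device stats and -r keeps the scrub read-only.)

btrfs scrub start -Bdr /mnt/cache    # read-only check, reports but does not repair
btrfs scrub start -Bd /mnt/cache     # read-write scrub, repairs correctable errors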

I then ran it with the "Repair corrupted blocks" option anyway, and it corrected a huge number of errors. (It seemed odd at first that the read-only scrub didn't highlight anything needing correction, until I looked again at the line "Error summary:    verify=1438 csum=167501" from the first run.)

 

It finished, but it's odd that the "verify" count is slightly lower than in the first scrub run, and the "Corrected" figure doesn't exactly match the csum count (though I don't know whether it should).

UUID:             e8b8d9ec-0ad2-4867-b3cf-87b43a0d9d15
Scrub started:    Sun Apr 25 07:28:00 2021
Status:           finished
Duration:         0:03:07
Total to scrub:   1.10TiB
Rate:             6.02GiB/s
Error summary:    verify=1347 csum=167501
  Corrected:      168848
  Uncorrectable:  0
  Unverified:     0

 

I ran another scrub just to be sure, and this time there are clearly no errors.

UUID:             e8b8d9ec-0ad2-4867-b3cf-87b43a0d9d15
Scrub started:    Sun Apr 25 07:39:58 2021
Status:           finished
Duration:         0:03:03
Total to scrub:   1.10TiB
Rate:             6.15GiB/s
Error summary:    no errors found

I set up the script from the FAQ on both the main pool and the cache pool so I will be notified of errors.
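
(For reference, the FAQ script boils down to something along the lines of the sketch below; the mount point is an assumption for my setup, and the notify script path and options are those of a stock Unraid 6 install, so adjust as needed and run it on a schedule per pool.)

#!/bin/bash
# Minimal sketch: raise an Unraid warning if any BTRFS device error counter
# on the pool is non-zero. Adjust MOUNT per pool; schedule via cron/User Scripts.
MOUNT=/mnt/cache
if mountpoint -q "$MOUNT" && btrfs dev stats "$MOUNT" | grep -vq ' 0$'; then
    /usr/local/emhttp/webGui/scripts/notify -i warning \
        -s "BTRFS device errors on $MOUNT" \
        -d "btrfs dev stats reports non-zero error counters on $MOUNT"
fi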

Then I started some Dockers, which was fine. I started some VMs and then I had the error again, repeating every minute, just in case I missed it the first time:

[screenshot of the "Cache pool BTRFS missing device" warning notification]

 

Not sure what to do now. I can't check cables as there are none; the NVMe drives plug directly into the motherboard.

Time to take some extra backups (assuming it's not too late!)

 


Hi @JorgeB, many thanks for your support on this - really appreciate it.

 

I have now rebooted (having switched off auto-start of Dockers and VMs first) and scrubbed the pool again to fix the errors. I will double check the error counts and zero them.
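
(The counters can be checked and zeroed in one go with something like the following, assuming the pool is mounted at /mnt/cache; -z prints the stats and then resets them.)

btrfs dev stats -z /mnt/cache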

 

Trying to fix the UD URBACKUP disk I get the following: 

root@Tower:~# xfs_repair /dev/sdh
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
unable to verify superblock, continuing...
.found candidate secondary superblock...
unable to verify superblock, continuing...
................................................

The dots then continue to fill up the window - not sure how long it will take but I'll just leave it running.

When that's done I can reboot again and will take some diagnostics hoping that it's clean.

 

It's still unclear why the nvme drive is dropping off. I can try re-seating them but they've been good for quite a few months at least.
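
(In case it helps narrow things down, the kernel log should show when the controller drops off; for example:)

dmesg | grep -i nvme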

14 minutes ago, jsebright said:

Trying to fix the UD URBACKUP disk I get the following:

root@Tower:~# xfs_repair /dev/sdh
[...]

You have used the wrong device name in the xfs_repair command.  If using raw device names you need to include the partition number (e.g. /dev/sdh1).  You can only omit the partition number if using the /dev/mdX type device names.   Also note that raw device names will invalidate parity whereas the /dev/mdX type names do not.

 

Earlier you were talking about a BTRFS-formatted drive, while now it is an XFS one you seem to be trying to fix; just checking this is intentional.

 

 

18 minutes ago, itimpi said:

You have used the wrong device name in the xfs_repair command. [...]

Ah, thanks. I just cancelled it and rebooted. This is an unassigned drive, not part of the array, so it looks like /dev/sdh1 is OK for this.
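(So, with the drive unmounted in UD, the corrected command should be along these lines, assuming it still shows up as sdh after the reboot; -n first does a check-only pass.)

xfs_repair -n /dev/sdh1    # dry run, report problems only
xfs_repair /dev/sdh1       # actual repair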

 

The BTRFS cache issue was the primary error (and probably still is); it's just that @JorgeB spotted another issue with this other drive that needs fixing. One problem turns into two...

 


You're getting the device missing errors because now the other NVMe device (nvme1n1) has dropped offline. Look for a board BIOS update; this sometimes also helps:

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on the flash device, scroll down to "Syslinux Configuration", make sure it's set to "Menu View" (on the top right) and add this to your default boot option, after "append" and before "initrd=/bzroot":

 

nvme_core.default_ps_max_latency_us=0
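
For example, the default boot entry in syslinux.cfg would end up looking something like this (the exact label and any existing append options may differ on your system):

label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot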

Reboot and see if it makes a difference.


This problem occurred again, and then I think I worked out what was going on. The device was "disappearing" when I started a VM, but only one particular VM. I had had to fiddle with that VM a day or so ago because it wouldn't start, and something must have got messed up, so the VM was trying to take control of the NVMe drive.

I could spot the device in the XML, but I'm not confident enough to edit it directly. Just saving the VM settings from the forms view didn't clear the device, but selecting all the possible USB devices and the one PCIe device, saving, then clearing them all and saving again seems to have sorted it out.
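
(For anyone hitting the same thing, the passthrough entry shows up as a hostdev block in the VM's XML and can be checked without editing anything, e.g. with the command below; VM_NAME is a placeholder for the actual VM name.)

virsh dumpxml "VM_NAME" | grep -A6 '<hostdev'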

Thanks for the support - I know a bit more about checking disks now.

 
