
Log filled with BTRFS errors



I've noticed my log file getting full for seemingly no reason and found it filled with the following error:

 

BTRFS error (device nvme0n1p1): bad tree block start, want 31566561280 have 16777216

 

The numbers on each line in the log file are different. I can't see any other symptoms; all of my VMs, shares, and Docker containers are functioning as far as I can tell. 

 

I've attached the diagnostics archive.

 

Should I run a BTRFS Scrub operation on the cache drive? Is this a sign of an impending hardware failure? 

tower-diagnostics-20210429-1343.zip


Oof - that is a lot of log spam; enough, in fact, that there's very little helpful information left in the kernel logs. I would advise backing up super-critical data Just In Case(tm), because there is definitely something unhealthy with that BTRFS partition, and continuing to use it in this condition could compound things. Good catch.

 

Personally, I would absolutely recommend performing the scrub. After that, to make sure there isn't an underlying problem causing the BTRFS issue, you could let the system run for a few hours (or less, if you see anything important in the logs) and post a fresh diagnostics.zip for us to check over. Looking at your SMART data, I see reports of unexpected power loss -- that could easily explain the BTRFS errors you're seeing, but it wouldn't hurt to be thorough.
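
If you'd rather kick it off from the command line than the GUI's Scrub button, something along these lines should work -- assuming your cache pool is mounted at the usual /mnt/cache location (adjust the path if yours differs):

# start a scrub on the cache pool
btrfs scrub start /mnt/cache

# check progress and the error summary once it finishes
btrfs scrub status /mnt/cache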

13 hours ago, codefaux said:

Personally, I would absolutely recommend performing the scrub. [...]

 

I ran the scrub and here are the results (not much): 

Scrub started: Thu Apr 29 14:41:28 2021

Status: finished

Duration: 0:01:54

Total to scrub: 270.54GiB

Rate: 2.37GiB/s

Error summary: verify=13 csum=4 Corrected: 0 Uncorrectable: 17 Unverified: 0

 

I'll reboot, verify the BIOS settings mentioned in JorgeB's post (I'm on a Ryzen X570/3950X - thank you for that, Jorge), then rebuild and upload the diagnostics. I suspect the corruption happened when I was fighting hardware pass-through to a VM and had to hard-reset the machine a few times; that matches the info in the Ryzen stability thread.

 

To be continued! Many thanks for your help thus far. :)

9 hours ago, DougP said:

had to hard-reset the machine a few times

I would definitely agree that was likely the issue.

 

 

9 hours ago, DougP said:

Error summary: verify=13 csum=4 Corrected: 0 Uncorrectable: 17 Unverified: 0

 

 

I agree with @JorgeB here, too - there were errors, and they were not corrected, so you'll absolutely want to scan the log to see which files are damaged. The filesystem structure is intact, but the contents of the files behind those 17 uncorrectable errors could be anything from a single flipped bit to complete garbage.
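
The scrub's per-file complaints usually land in the kernel log as "checksum error ... (path: ...)" lines, so something like this should pull them out -- assuming the messages made it into the syslog (paths and commands here are just my guess at your setup):

# list the files btrfs flagged during the scrub
grep -i 'checksum error' /var/log/syslog

# or pull anything BTRFS-related still in the kernel ring buffer
dmesg | grep -i btrfs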

 

 

-- 

 

For safety, if you've had to hard reset, I believe it would be best to bring the array up in maintenance mode and scan all of the disks to be sure there wasn't a corrupt write, THEN consider parity-related operations. In maintenance mode, unRAID will update parity as you correct filesystem errors.

 

I've crashed a few dozen times over the last month while chasing what turned out to be a known issue, and every time I now come up in maintenance mode, run xfs_repair on every single drive, come up in normal mode, scrub BTRFS, and THEN resume working on it. I do it this way because when I didn't, containers and/or normal access would hit corrupt files and cause problems, which would cause more problems I couldn't figure out.
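
Roughly, the routine looks like this on my box -- the /dev/mdX names are how the array disks show up in maintenance mode on my Unraid version, and /mnt/cache is where my pool mounts, so treat this as a sketch rather than gospel:

# array started in maintenance mode: check each data disk first (-n = report only, change nothing)
xfs_repair -n /dev/md1

# re-run without -n on any disk the check flagged, then repeat for /dev/md2, /dev/md3, ...
xfs_repair /dev/md1

# back in normal mode, scrub the btrfs cache pool
btrfs scrub start /mnt/cache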

 

I emphasize that this is just my belief, mostly because I haven't actually verified that it's sane practice. If someone with demigod status like @JorgeB cares to weigh in, I'd love to hear whether it's a bad idea before a) continuing to do it, or b) recommending it, lol...
