
Log filled with BTRFS errors



I've noticed my log file getting full for seemingly no reason and found it filled with the following error:

 

BTRFS error (device nvme0n1p1): bad tree block start, want 31566561280 have 16777216

 

The numbers on each line in the log file are different. I can't see any other symptoms; all of my VMs, shares, and Docker containers are functioning as far as I can tell. 

 

I've attached the diagnostics archive.

 

Should I run a BTRFS Scrub operation on the cache drive? Is this a sign of an impending hardware failure? 

tower-diagnostics-20210429-1343.zip


Oof - that is a lot of log spam; enough, in fact, that there's very little helpful information left in the kernel logs. I would advise backing up super-critical data Just In Case(tm), because there is definitely something unhealthy with that BTRFS partition, and continuing to use it in this condition could compound things. Good catch.

 

Personally, I would absolutely recommend performing the scrub. After that, to make sure there isn't an underlying problem causing the BTRFS issue, you could let the system run for a few hours (or less, if you see anything important in the logs) and post a fresh diagnostics.zip for us to check over. Looking at your SMART data, I see reports of unexpected power loss -- that could easily explain the BTRFS errors you're seeing, but it wouldn't hurt to be thorough.
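
If you'd rather kick it off from the command line than the GUI's Scrub button, something along these lines should work -- assuming your cache pool is mounted at the usual /mnt/cache location (adjust the path if yours differs):

# start a scrub on the cache pool
btrfs scrub start /mnt/cache

# check progress and the error summary once it finishes
btrfs scrub status /mnt/cache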

13 hours ago, codefaux said:

Personally, I would absolutely recommend performing the scrub. [...]

 

I ran the scrub and here are the results (not much): 

Scrub started: Thu Apr 29 14:41:28 2021

Status: finished

Duration: 0:01:54

Total to scrub: 270.54GiB

Rate: 2.37GiB/s

Error summary: verify=13 csum=4 Corrected: 0 Uncorrectable: 17 Unverified: 0

 

I'll reboot, verify the BIOS settings mentioned in JorgeB's post (I'm on a Ryzen X570/3950X - thank you for that, Jorge), then rebuild and upload the diagnostics. I suspect the corruption happened when I was fighting hardware pass-through to a VM and had to hard-reset the machine a few times; that matches the info in the Ryzen stability thread.

 

To be continued! Many thanks for your help thus far. :)

9 hours ago, DougP said:

had to hard-reset the machine a few times

I would definitely agree that was likely the issue.

 

 

9 hours ago, DougP said:

Error summary: verify=13 csum=4 Corrected: 0 Uncorrectable: 17 Unverified: 0

 

 

I agree with @JorgeB here, too - there were errors, and they were not corrected, so you'll absolutely want to scan the log to see which files are damaged. The filesystem structure is intact, but the contents of the files behind those 17 uncorrectable errors could be anything from a single flipped bit to complete garbage.
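
The scrub's per-file complaints usually land in the kernel log as "checksum error ... (path: ...)" lines, so something like this should pull them out -- assuming the messages made it into the syslog (paths and commands here are just my guess at your setup):

# list the files btrfs flagged during the scrub
grep -i 'checksum error' /var/log/syslog

# or pull anything BTRFS-related still in the kernel ring buffer
dmesg | grep -i btrfs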

 

 

-- 

 

For safety, if you've had to hard reset, I believe it would be best to bring the array up in maintenance mode and scan all of the disks to be sure there wasn't a corrupt write, THEN consider parity-related operations. In maintenance mode, unRAID will update parity as you correct filesystem errors.

 

I've crashed a few dozen times over the last month while chasing what turned out to be a known issue, and every time I now come up in maintenance mode, run xfs_repair on every single drive, come up in normal mode, scrub BTRFS, and THEN resume working on it. I do it this way because when I didn't, containers and/or normal access would hit corrupt files and cause problems, which would cause more problems I couldn't figure out.
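
Roughly, the routine looks like this on my box -- the /dev/mdX names are how the array disks show up in maintenance mode on my Unraid version, and /mnt/cache is where my pool mounts, so treat this as a sketch rather than gospel:

# array started in maintenance mode: check each data disk first (-n = report only, change nothing)
xfs_repair -n /dev/md1

# re-run without -n on any disk the check flagged, then repeat for /dev/md2, /dev/md3, ...
xfs_repair /dev/md1

# back in normal mode, scrub the btrfs cache pool
btrfs scrub start /mnt/cache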

 

I emphasize that this is just my belief, mostly because I haven't actually verified that it's sane practice. If someone with demigod status like @JorgeB cares to weigh in, I'd love to hear whether it's a bad idea before a) continuing to do it, or b) recommending it, lol...
