BTRFS error transid verify failed - 2 callbacks suppressed

February 18, 2023

Hi,

I woke this morning to find most of my Docker containers stopped. It looks like CA Backup stopped them ready for a backup: the backup folder was created, but it is empty. I can see all these BTRFS errors on nvme2n1p1, but I cannot work out which physical drive this is. This is the third NVMe-related problem I have had on this new motherboard in as many weeks, so I really need to get to the bottom of it. Any help will be greatly appreciated!

image.png.fe233f5d6541bb270244a9146851f967.png

unraid1-diagnostics-20230218-0934.zip
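
On the question of which physical drive sits behind a name like nvme2n1p1: on Linux, the model and serial number of an NVMe drive are normally exposed in sysfs under /sys/block/<disk>/device, and the serial can then be matched against the label on the drive. The following is only a minimal sketch under that assumption. The helper names `parent_disk` and `describe_nvme` are made up for illustration and are not part of Unraid or any tool mentioned in this thread.

```python
import re
from pathlib import Path


def parent_disk(partition: str) -> str:
    """Strip the partition suffix from an NVMe name (nvme2n1p1 -> nvme2n1)."""
    match = re.fullmatch(r"(nvme\d+n\d+)(p\d+)?", partition)
    if match is None:
        raise ValueError(f"not an NVMe device name: {partition!r}")
    return match.group(1)


def describe_nvme(partition: str, sysfs_root: Path = Path("/sys/block")) -> dict:
    """Read the model and serial of the disk behind an NVMe partition.

    Assumes the usual sysfs layout; a missing attribute is reported as None.
    """
    device_dir = sysfs_root / parent_disk(partition) / "device"
    info = {}
    for field in ("model", "serial"):
        attr = device_dir / field
        info[field] = attr.read_text().strip() if attr.exists() else None
    return info


if __name__ == "__main__":
    # Device name taken from the errors in this thread.
    print(describe_nvme("nvme2n1p1"))
```

If the attributes are not where the sketch expects them, it prints None for both fields rather than failing, so an empty result only means the assumption about the sysfs layout did not hold on that system.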

February 18, 2023

Author

I also just saw this in the log when I started to shut down after taking the diags:

Feb 18 09:48:42 UNRAID1 kernel: BUG: Bad rss-counter state mm:00000000ee5ad6d6 type:MM_ANONPAGES val:1
Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 83360 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: I/O error, dev loop3, sector 188192 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state A) in __btrfs_update_delayed_inode:999: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: BTRFS info (device loop3: state EA): forced readonly
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in __btrfs_run_delayed_items:1092: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: BTRFS warning (device loop3: state EA): Skipping commit of aborted transaction.
Feb 18 09:48:45 UNRAID1 kernel: BTRFS: error (device loop3: state EA) in cleanup_transaction:1982: errno=-5 IO failure
Feb 18 09:48:45 UNRAID1 kernel: docker0: port 4(vethb7c8447) entered disabled state
Feb 18 09:48:45 UNRAID1 kernel: vethd950096: renamed from eth0
Feb 18 09:48:45 UNRAID1 root: Error response from daemon: error while removing network: network br0 id b58d2467fa57ae7061498d882571927884c846ee782347c86f3071b85475f1f0 has active endpoints

February 18, 2023

Community Expert

There was some previous data corruption on the pool detected by btrfs, and now there's this error:

Feb 18 03:43:10 UNRAID1 kernel: BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected

This means new data corruption was detected before the data was written to the filesystem, and it's usually a sign of bad RAM. Start by running memtest, then post new diags after array start.
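
The two kinds of BTRFS kernel messages quoted in this thread (the per-device error counters and the write time tree block corruption line) can be picked out of a syslog with a short script. This is a sketch, not an official tool: the regular expressions are written only against the log lines shown here, and the function names `latest_error_counters` and `find_write_time_corruption` are hypothetical.

```python
import re

# Matches lines such as:
#   BTRFS error (device loop3): bdev /dev/loop3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^):]+)[^)]*\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

# Matches lines such as:
#   BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected
TREE_RE = re.compile(
    r"BTRFS error \(device (?P<device>[^):]+)[^)]*\): "
    r"block=(?P<block>\d+) write time tree block corruption detected"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_error_counters(log_lines):
    """Return the most recent counter values seen for each block device."""
    latest = {}
    for line in log_lines:
        match = ERRS_RE.search(line)
        if match:
            latest[match.group("bdev")] = {k: int(match.group(k)) for k in COUNTERS}
    return latest


def find_write_time_corruption(log_lines):
    """Return (device, block) pairs for write time tree block corruption lines."""
    hits = []
    for line in log_lines:
        match = TREE_RE.search(line)
        if match:
            hits.append((match.group("device"), int(match.group("block"))))
    return hits
```

A non-zero `corrupt` or `gen` counter points at corruption detected earlier on that device, while `wr`, `rd` and `flush` count I/O errors.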

Also:

Feb 16 08:44:56 UNRAID1 kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Feb 16 08:44:56 UNRAID1 kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]

Macvlan call traces are usually the result of having Docker containers with a custom IP address, and they will end up crashing the server. I recommend switching to ipvlan: Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right).

February 18, 2023

Author

Thank you @JorgeB

1. I have run memtest, which passed, but I had another 4 sticks, which I have now swapped in for the existing RAM.

2. You say there was existing corruption? How do I address that, please?

3. Is this not an NVMe problem then? "BTRFS error (device nvme2n1p1): block=161059291136 write time tree block corruption detected"

4. I will look into ipvlan now.

Thanks again!

February 18, 2023

Community Expert

Post new diags after array start.

February 18, 2023

Author

4 minutes ago, JorgeB said:

Post new diags after array start.

Attached, many thanks

unraid1-diagnostics-20230218-1125.zip

February 18, 2023

Community Expert

Run a correction scrub on the pool to see if any errors are detected. If there are uncorrectable errors, post new diags after the scrub.

February 18, 2023

Author

9 minutes ago, JorgeB said:

Run a correction scrub on the pool to see if any errors are detected. If there are uncorrectable errors, post new diags after the scrub.

Hi, which pool do you mean please?

February 18, 2023

Community Expert

cachetwo

February 18, 2023

Community Expert

And the cache pool as well, since the write that was aborted because of corruption was on that one.

February 18, 2023

Author

OK, so I have scrubbed all 3 pools and no errors were found on any of them!

February 18, 2023

Community Expert
Solution

That's good. See here for how to reset the stats for cachetwo, and see there as well for how to better monitor the pools. If more corruption errors appear, you likely still have some hardware issue.
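
The linked posts are not reproduced in this copy of the thread. As a rough sketch of what resetting and monitoring can look like from the command line, assuming the standard `btrfs device stats` command and assuming the pool is mounted at /mnt/cachetwo, the helpers below build the command and flag any non-zero counter. The names `stats_command`, `parse_device_stats` and `nonzero_counters` are invented for this example.

```python
import re
from typing import Dict, List

# Matches output lines such as: [/dev/nvme2n1p1].corruption_errs  0
STAT_RE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)\s*$")


def stats_command(pool: str, reset: bool = False, mount_root: str = "/mnt") -> List[str]:
    """Build a `btrfs device stats` command; -z resets the counters after printing."""
    cmd = ["btrfs", "device", "stats"]
    if reset:
        cmd.append("-z")
    cmd.append(f"{mount_root}/{pool}")
    return cmd


def parse_device_stats(output: str) -> Dict[str, Dict[str, int]]:
    """Parse `btrfs device stats` text into {device: {counter: value}}."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in output.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            stats.setdefault(match.group("dev"), {})[match.group("name")] = int(
                match.group("value")
            )
    return stats


def nonzero_counters(stats: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Keep only the devices and counters that are not zero."""
    result = {}
    for dev, counters in stats.items():
        bad = {name: value for name, value in counters.items() if value}
        if bad:
            result[dev] = bad
    return result
```

The list returned by `stats_command("cachetwo")` can be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting stdout fed to `parse_device_stats`; run periodically, an empty result from `nonzero_counters` means no new errors have been recorded since the last reset.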

February 18, 2023

Sorry for the stupid question but how do I run such a scrub process?

February 18, 2023

Community Expert

3 minutes ago, Civic1201 said:

Sorry for the stupid question but how do I run such a scrub process?

Click on the pool then scroll down to the scrub section.
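
For those who prefer the command line, a scrub can also be started with `btrfs scrub start`. The sketch below assumes the pool is mounted at /mnt/<pool name> (for example /mnt/cachetwo), which may not match every setup; `scrub_command` and `run_scrub` are hypothetical helper names. Without the `-r` (read-only) flag, a scrub attempts to repair what it can, which corresponds to the correction scrub mentioned earlier in the thread.

```python
import subprocess
from typing import List


def scrub_command(pool: str, read_only: bool = False, mount_root: str = "/mnt") -> List[str]:
    """Build a `btrfs scrub start` command for a pool.

    -B keeps the scrub in the foreground and prints statistics when it ends.
    -r makes the scrub read-only, so nothing is corrected.
    """
    cmd = ["btrfs", "scrub", "start", "-B"]
    if read_only:
        cmd.append("-r")
    cmd.append(f"{mount_root}/{pool}")
    return cmd


def run_scrub(pool: str) -> subprocess.CompletedProcess:
    """Run a correcting scrub and return the finished process (needs root)."""
    return subprocess.run(scrub_command(pool), capture_output=True, text=True, check=False)


# Example (not executed here): result = run_scrub("cachetwo"); print(result.stdout)
```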

February 18, 2023

Author

1 hour ago, JorgeB said:

That's good. See here for how to reset the stats for cachetwo, and see there as well for how to better monitor the pools. If more corruption errors appear, you likely still have some hardware issue.

Thank you, I will read through this to better monitor the pools. Again, huge thanks for your help!

Solved by JorgeB
