BTRFS Read Errors


Solved by Jarsky.


Hi all, 

 

I just logged into my UnRAID server and noticed a pop-up from File Integrity about checksum issues.

Looking at the log, I realised there are ongoing BTRFS errors for my NVMe drives.

 

I'm running UnRAID 6.10.3.

ASRock X570 Taichi with 4.60 BIOS (November 2021). 

2 x 1TB Crucial P1 M.2 NVMe drives on the board.

 

The drives are in a mirrored (btrfs RAID1) "cache pool" used for holding my Docker containers and VMs.


 

I'm seeing these errors in the log:

 

Aug 13 06:46:20 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 752, gen 0
Aug 13 06:46:20 TOWER kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 819355 off 11784499200 (dev /dev/nvme0n1p1 sector 1268009080)
Aug 13 06:46:31 TOWER kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 819355 off 49962971136 csum 0x53014952 expected csum 0x26cf4179 mirror 2
Aug 13 06:46:31 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 753, gen 0
Aug 13 06:46:31 TOWER kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 819355 off 49962971136 (dev /dev/nvme0n1p1 sector 1671624720)
Aug 13 07:52:39 TOWER kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 819355 off 50563215360 csum 0x4fe4c983 expected csum 0x7c3f67e3 mirror 1
Aug 13 07:52:39 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 754, gen 0
Aug 13 07:52:40 TOWER kernel: BTRFS info (device nvme0n1p1): read error corrected: ino 819355 off 50563215360 (dev /dev/nvme0n1p1 sector 1019641256)

 

Aug 14 02:06:16 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 755, gen 0
Aug 14 02:07:54 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 516561707008 on dev /dev/nvme1n1p1, physical 469262540800, root 5, inode 819355, offset 1276887040, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 14 02:07:54 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 756, gen 0
Aug 14 02:07:59 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 757, gen 0
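The "errs: wr, rd, flush, corrupt, gen" fields in these lines are btrfs's per-device error counters. They are cumulative and persist across reboots until reset (on a live system, btrfs device stats /mnt/nvme_mirror prints them directly), so a rising corrupt value means new checksum failures rather than the same one being reported again. As a minimal sketch, assuming the log has been saved to a text file, the latest corrupt count per member device can be pulled out like this (latest_corrupt is a hypothetical helper name, not an Unraid or btrfs tool):

```shell
#!/bin/sh
# latest_corrupt: hypothetical helper. Reads BTRFS kernel log lines on stdin
# and prints "<member device> <corrupt counter>", keeping the last value seen
# for each device (the counters only grow until they are reset).
latest_corrupt() {
    sed -n 's|.*bdev \(/dev/[^ ]*\) errs:.* corrupt \([0-9]*\),.*|\1 \2|p' |
        awk '{ last[$1] = $2 } END { for (d in last) print d, last[d] }' |
        sort
}
```

Usage: latest_corrupt < syslog.txt (the file name is just an example). Lines without a "bdev ... errs:" counter, such as the "read error corrected" messages, are ignored.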

 

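The NVMe drives look physically healthy from smartctl: neither reports media/data integrity errors or error log entries. Note, though, that nvme0n1 fails the overall health self-assessment (Critical Warning 0x04, Percentage Used 99%).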

 

root@TOWER:~#  smartctl --all -H /dev/nvme0n1


smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.46-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P1SSD8
Serial Number:                      1942E22465EB
Firmware Version:                   P3CR013
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 19e22465eb
Local Time is:                      Mon Aug 15 17:20:18 2022 NZST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        5       5
 1 +     4.60W       -        -    1  1  1  1       30      30
 2 +     3.80W       -        -    2  2  2  2       30      30
 3 -   0.0500W       -        -    3  3  3  3     1000    1000
 4 -   0.0040W       -        -    4  4  4  4     6000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    99%
Data Units Read:                    24,395,362 [12.4 TB]
Data Units Written:                 279,765,903 [143 TB]
Host Read Commands:                 272,087,789
Host Write Commands:                5,433,505,136
Controller Busy Time:               69,775
Power Cycles:                       82
Power On Hours:                     22,735
Unsafe Shutdowns:                   37
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    66
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               47 Celsius
Temperature Sensor 5:               55 Celsius
Thermal Temp. 1 Transition Count:   21217
Thermal Temp. 2 Transition Count:   795
Thermal Temp. 1 Total Time:         28233845
Thermal Temp. 2 Total Time:         4910317

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

 

root@TOWER:~#  smartctl --all -H /dev/nvme1n1


smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.46-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P1SSD8
Serial Number:                      1934E21B2513
Firmware Version:                   P3CR013
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 19e21b2513
Local Time is:                      Mon Aug 15 17:20:23 2022 NZST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        5       5
 1 +     4.60W       -        -    1  1  1  1       30      30
 2 +     3.80W       -        -    2  2  2  2       30      30
 3 -   0.0500W       -        -    3  3  3  3     1000    1000
 4 -   0.0040W       -        -    4  4  4  4     6000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    89%
Data Units Read:                    26,242,139 [13.4 TB]
Data Units Written:                 278,143,832 [142 TB]
Host Read Commands:                 280,790,435
Host Write Commands:                5,490,535,695
Controller Busy Time:               64,057
Power Cycles:                       79
Power On Hours:                     22,622
Unsafe Shutdowns:                   34
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               45 Celsius
Temperature Sensor 5:               54 Celsius
Thermal Temp. 1 Transition Count:   318
Thermal Temp. 1 Total Time:         1965751

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
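The Critical Warning field in the SMART/Health log is a bitmask defined by the NVMe specification, which is why nvme0n1 shows 0x04 next to its FAILED verdict while nvme1n1 shows 0x00. A minimal POSIX sh sketch that decodes it, assuming the bit meanings of the NVMe 1.3 SMART/Health log (bits 0 to 4; decode_critical_warning is a hypothetical name):

```shell
#!/bin/sh
# decode_critical_warning: hypothetical helper. Takes the Critical Warning
# value printed by smartctl (e.g. 0x04) and prints one line per set bit.
decode_critical_warning() {
    value=$(( $1 ))   # shell arithmetic accepts the 0x hex prefix
    if [ "$value" -eq 0 ]; then
        echo "no critical warnings"
        return 0
    fi
    [ $(( value & 1 ))  -ne 0 ] && echo "available spare below threshold"
    [ $(( value & 2 ))  -ne 0 ] && echo "temperature outside threshold"
    [ $(( value & 4 ))  -ne 0 ] && echo "NVM subsystem reliability degraded"
    [ $(( value & 8 ))  -ne 0 ] && echo "media placed in read-only mode"
    [ $(( value & 16 )) -ne 0 ] && echo "volatile memory backup device failed"
    return 0
}
```

For example, decode_critical_warning 0x04 prints "NVM subsystem reliability degraded", matching the "NVM subsystem reliability has been degraded" line smartctl shows for nvme0n1.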

 

I ran btrfs check on the filesystem, and I'm getting errors on nvme0n1p1:

 

root@TOWER:~# btrfs check --force /dev/nvme0n1p1 


root 5 inode 8752743 errors 2001, no inode item, link count wrong
        unresolved ref dir 8749677 index 366 namelen 10 name _00365.wal filetype 1 errors 4, no inode ref
root 5 inode 8752744 errors 2001, no inode item, link count wrong
        unresolved ref dir 8749678 index 211 namelen 23 name 000000091-000000001.tsm filetype 1 errors 4, no inode ref
root 5 inode 8752745 errors 2001, no inode item, link count wrong
        unresolved ref dir 3480794 index 196000 namelen 45 name 8e6baa014e0c6025c722620f6b38c16c550e0892.json filetype 1 errors 4, no inode ref
root 5 inode 8752746 errors 2001, no inode item, link count wrong
        unresolved ref dir 8749677 index 367 namelen 10 name _00366.wal filetype 1 errors 4, no inode ref
root 5 inode 8752747 errors 2001, no inode item, link count wrong
        unresolved ref dir 3480794 index 196002 namelen 45 name 823ae318b4984e81a0759b9ca9f73f9ea0c4c759.json filetype 1 errors 4, no inode ref
ERROR: errors found in fs roots
found 215344644096 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 89866240
total fs tree bytes: 0
total extent tree bytes: 89686016
btree space waste bytes: 22181978
file data blocks allocated: 0
 referenced 0

 

root@TOWER:~# btrfs check --force /dev/nvme1n1p1 


Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/nvme1n1p1
UUID: f7f0ec9a-281c-4352-8557-336143708cd1
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 916449837056 bytes used, no error found
total csum bytes: 881724912
total tree bytes: 2476294144
total fs tree bytes: 834617344
total extent tree bytes: 523468800
btree space waste bytes: 472071382
file data blocks allocated: 15728413237248
 referenced 736101736448

 

 

Is this caused by low free space? I only have 80 GB free out of the 1 TB.

Do I need to delete or move data?

Do I need to unmount it and run a repair, e.g. btrfs check --repair /dev/nvme0n1p1?

 

Any help would be appreciated.

 

jarskynas-diagnostics-20220815-1745.zip


From a scrub, it looks like the corruption is in a VM disk image.

I have a backup from a few months ago, but are there any other repairs I can do, short of restoring from backup?

 

Aug 15 18:48:14 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 795, gen 0
Aug 15 18:48:14 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 566991183872 on dev /dev/nvme0n1p1
Aug 15 18:48:14 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 796, gen 0
Aug 15 18:48:14 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 567391494144 on dev /dev/nvme0n1p1
Aug 15 18:48:21 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 797, gen 0
Aug 15 18:48:21 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 573802946560 on dev /dev/nvme0n1p1
Aug 15 18:50:44 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 824454868992 on dev /dev/nvme1n1p1, physical 729911062528, root 5, inode 819355, offset 12810076160, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:50:44 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 797, gen 0
Aug 15 18:50:44 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 824454868992 on dev /dev/nvme1n1p1
Aug 15 18:50:51 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 817108246528 on dev /dev/nvme1n1p1, physical 736523083776, root 5, inode 819355, offset 1667567616, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:50:51 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 798, gen 0
Aug 15 18:50:52 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 817108246528 on dev /dev/nvme1n1p1
Aug 15 18:51:01 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 835594616832 on dev /dev/nvme1n1p1, physical 745345777664, root 5, inode 819355, offset 7420325888, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:51:01 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 799, gen 0
Aug 15 18:51:02 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 835594616832 on dev /dev/nvme1n1p1
Aug 15 18:51:11 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1063512186880 on dev /dev/nvme0n1p1, physical 666227712000, root 5, inode 819355, offset 9483952128, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:51:11 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 798, gen 0
Aug 15 18:51:11 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 1063512186880 on dev /dev/nvme0n1p1
Aug 15 18:53:36 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 800, gen 0
Aug 15 18:53:36 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1173223571456 on dev /dev/nvme1n1p1
Aug 15 18:53:42 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 799, gen 0
Aug 15 18:53:42 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 967017721856 on dev /dev/nvme0n1p1
Aug 15 18:54:11 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1234893328384 on dev /dev/nvme1n1p1, physical 940633542656, root 5, inode 819355, offset 6063378432, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:54:11 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 801, gen 0
Aug 15 18:54:11 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 1234893328384 on dev /dev/nvme1n1p1
Aug 15 18:54:23 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 802, gen 0
Aug 15 18:54:23 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1263566585856 on dev /dev/nvme1n1p1
Aug 15 18:54:41 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1145505271808 on dev /dev/nvme0n1p1, physical 868479881216, root 5, inode 819355, offset 9617289216, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:54:41 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 800, gen 0
Aug 15 18:54:41 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 1145505271808 on dev /dev/nvme0n1p1
Aug 15 18:54:42 TOWER kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1147382370304 on dev /dev/nvme0n1p1, physical 870356979712, root 5, inode 819355, offset 7115976704, length 4096, links 1 (path: VM/Windows Server 2022/vdisk1.img)
Aug 15 18:54:42 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 801, gen 0
Aug 15 18:54:43 TOWER kernel: BTRFS error (device nvme0n1p1): fixed up error at logical 1147382370304 on dev /dev/nvme0n1p1
Aug 15 18:54:43 TOWER kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 2 with status: 0
Aug 15 18:55:26 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 802, gen 0
Aug 15 18:55:26 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1173223571456 on dev /dev/nvme0n1p1
Aug 15 18:56:53 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 803, gen 0
Aug 15 18:56:53 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1263566585856 on dev /dev/nvme0n1p1
Aug 15 18:57:17 TOWER kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 1 with status: 0
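Scrub warnings that btrfs can tie to a file end with "(path: ...)", relative to the root of the filesystem (root 5 is the top-level subvolume). A minimal sketch for listing each affected file once from a saved copy of the log (affected_paths is a hypothetical name); prefix the results with the pool's mount point, /mnt/nvme_mirror here, to get full paths:

```shell
#!/bin/sh
# affected_paths: hypothetical helper. Reads scrub log lines on stdin and
# prints each distinct "(path: ...)" value once. Lines without a path,
# such as "unable to fixup" errors, produce no output.
affected_paths() {
    sed -n 's/.*(path: \(.*\))[[:space:]]*$/\1/p' | sort -u
}
```

Run against the log above, this should print a single entry, VM/Windows Server 2022/vdisk1.img, even though it is reported many times.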

 


I don't think there's an issue with my RAM. It's a Zen 2 system and has been running at 3000 MHz for 2.5 years. Only this one pool has corruption; the array and cache have no errors. I already repaired the file on my main array that didn't pass its checksum. I'm pretty confident the corruption occurred when electricians switched off the power abruptly several times, and I don't currently have it on a UPS (since the last battery died).

 

I've restored my VM from backup and updated it, so I'm no longer getting checksum errors for its vdisk.

 

However I still have uncorrectable errors on the pool. 

It's a mirrored pool, so what is the best way to resolve this?

Can I break the pool, format the drive and resync?

Or do I need to copy everything off that pool to somewhere else and reformat the pool? 

 

Aug 15 23:01:21 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Aug 15 23:01:21 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 513608630272 on dev /dev/nvme1n1p1
Aug 15 23:02:08 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug 15 23:02:08 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 189270196224 on dev /dev/nvme1n1p1
Aug 15 23:02:27 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Aug 15 23:02:27 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 513608630272 on dev /dev/nvme0n1p1
Aug 15 23:03:09 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug 15 23:03:09 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 189270196224 on dev /dev/nvme0n1p1
Aug 15 23:06:16 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Aug 15 23:06:16 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 380153503744 on dev /dev/nvme1n1p1
Aug 15 23:08:23 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Aug 15 23:08:23 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1217718784000 on dev /dev/nvme1n1p1
Aug 15 23:09:18 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Aug 15 23:09:18 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 380153503744 on dev /dev/nvme0n1p1
Aug 15 23:09:34 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
Aug 15 23:09:34 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 567391494144 on dev /dev/nvme1n1p1
Aug 15 23:09:42 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
Aug 15 23:09:42 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 573802946560 on dev /dev/nvme1n1p1
Aug 15 23:11:45 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Aug 15 23:11:45 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1217718784000 on dev /dev/nvme0n1p1
Aug 15 23:13:47 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
Aug 15 23:13:47 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 567391494144 on dev /dev/nvme0n1p1
Aug 15 23:14:04 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
Aug 15 23:14:04 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 573802946560 on dev /dev/nvme0n1p1
Aug 15 23:17:45 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
Aug 15 23:17:45 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1173223571456 on dev /dev/nvme1n1p1
Aug 15 23:18:44 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
Aug 15 23:18:44 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1263566585856 on dev /dev/nvme1n1p1
Aug 15 23:25:39 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
Aug 15 23:25:39 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1173223571456 on dev /dev/nvme0n1p1
Aug 15 23:28:13 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
Aug 15 23:28:13 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1263566585856 on dev /dev/nvme0n1p1
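An "unable to fixup" line means scrub found no good copy to repair that block from. In this log every such logical address is reported once for nvme0n1p1 and once for nvme1n1p1, which suggests both copies of the mirror failed verification, so a mirrored scrub alone cannot correct them. A sketch that lists each unrecoverable address with the number of devices that reported it, assuming the log is saved as text (unfixable_addresses is a hypothetical name):

```shell
#!/bin/sh
# unfixable_addresses: hypothetical helper. Reads scrub log lines on stdin and
# prints "<logical address> <number of devices reporting it>", sorted by address.
unfixable_addresses() {
    sed -n 's/.*unable to fixup (regular) error at logical \([0-9]*\) on dev \([^ ]*\).*/\1 \2/p' |
        sort -u |                                   # one line per (address, device) pair
        awk '{ n[$1]++ } END { for (a in n) print a, n[a] }' |
        sort -n
}
```

A count of 2 on a two-device mirror means neither copy could be used; the first column is what btrfs inspect-internal logical-resolve expects as its logical argument.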

 

 

Jarsky said:

I'm pretty confident the corruption occurred when electricians switched off the power abruptly several times, and I don't currently have it on a UPS (since the last battery died).

That should not cause data corruption; it might cause filesystem corruption.

 

Jarsky said:

It's a mirrored pool, so what is the best way to resolve this?

Delete the corrupt file(s) and restore from backup if available.

JorgeB said:

That should not cause data corruption; it might cause filesystem corruption.

 

Delete the corrupt file(s) and restore from backup if available.

 

How can I identify the files though? 

Using inspect-internal, it's giving me nothing:

 

root@TOWER:/mnt/nvme_mirror# btrfs inspect-internal logical-resolve -v -P 513608630272 /mnt/nvme_mirror
ioctl ret=0, total_size=65536, bytes_left=65520, bytes_missing=0, cnt=0, missed=0
root@TOWER:/mnt/nvme_mirror# echo $?
0

 

I thought, from the remaining errors, that they were a filesystem issue on nvme0n1p1?

 

 

JorgeB said:

Run a scrub; it should list the affected file(s) in the syslog.

 

I already showed the syslog after a scrub above

 


 

  • Solution

Since it was late, I decided to shut everything down and put the array into maintenance mode.

The NVMe drives showed a burst of read/write activity for about a minute, which seemed to correct the remaining BTRFS errors on the pool. A scrub afterwards came back clean:

 

Scrub started:    Tue Aug 16 01:06:35 2022
Status:           finished
Duration:         0:23:17
Total to scrub:   1.53TiB
Rate:             1.12GiB/s
Error summary:    no errors found

 

  • 1 year later...

This is an old thread, but I've seen others having the same issue of not being able to find the associated file.

I just ran into it again yesterday after a power outage rebooted my server several times. 

So, for the record:

 

Select your pool, tick "Repair Corrupted Blocks" and click "SCRUB". You should see errors like these in the syslog:

Apr 18 19:57:42 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
Apr 18 19:57:42 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 5507816349696 on dev /dev/nvme0n1p1
Apr 18 19:57:46 TOWER kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme1n1p1 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
Apr 18 19:57:46 TOWER kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 5507816349696 on dev /dev/nvme1n1p1

 

This gives you the device. If you have a lot of mounts, you can find its mount point by running:

 

root@TOWER:/var/log# cat /proc/mounts | grep '/mnt/' | grep -v -E 'tmpfs|shfs'
/dev/md1p1 /mnt/disk1 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md2p1 /mnt/disk2 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md3p1 /mnt/disk3 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md4p1 /mnt/disk4 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md5p1 /mnt/disk5 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md6p1 /mnt/disk6 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md7p1 /mnt/disk7 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md8p1 /mnt/disk8 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md9p1 /mnt/disk9 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md10p1 /mnt/disk10 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/md11p1 /mnt/disk11 xfs rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
/dev/nvme2n1p1 /mnt/cache btrfs rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ 0 0
/dev/nvme0n1p1 /mnt/nvme_mirror btrfs rw,noatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/ 0 0
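The same lookup can be scripted. Note that a multi-device btrfs pool is listed in /proc/mounts under only one of its members (here /dev/nvme0n1p1, never /dev/nvme1n1p1), so in this example the device to search for is the one in the "(device ...)" prefix of the log line, not necessarily the one after "on dev". A minimal sketch, where mount_for_device is a hypothetical name and the mounts table defaults to /proc/mounts:

```shell
#!/bin/sh
# mount_for_device: hypothetical helper.
# $1: device path as printed in the log, e.g. /dev/nvme0n1p1
# $2: optional mounts table in /proc/mounts format (defaults to /proc/mounts)
# Prints the mount point(s) for that device, or nothing if it is not listed.
mount_for_device() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}
```

On the system above, mount_for_device /dev/nvme0n1p1 should print /mnt/nvme_mirror, while /dev/nvme1n1p1 prints nothing even though it is part of the same pool.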

 

The one we're looking for here is /mnt/nvme_mirror 

We can then run btrfs inspect-internal logical-resolve, as below, to find the file that occupies that logical address:

 

root@TOWER:/var/log# btrfs inspect-internal logical-resolve -o 5507816349696 /mnt/nvme_mirror/
/mnt/nvme_mirror//VM/VM1/vm1.img

 

So we can see that our offending file is /mnt/nvme_mirror/VM/VM1/vm1.img

 

We can then move the offending file (here, its whole VM folder) somewhere else, e.g.:

mv /mnt/nvme_mirror/VM/VM1 /mnt/user0/share

 

Repeat for any other offending logical block numbers reported. 
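Rather than resolving each address by hand, a loop can run logical-resolve for every address. Per the btrfs-inspect-internal man page, -o ignores offsets and finds all references to an extent instead of a single block, which is probably why it found a file where the earlier -v -P attempt returned cnt=0. This sketch assumes the addresses are given one per line on stdin; resolve_addresses and the RESOLVE_CMD override are hypothetical, with RESOLVE_CMD defaulting to the command used above so the loop can be dry-run with echo:

```shell
#!/bin/sh
# resolve_addresses: hypothetical helper. Reads logical addresses (one per
# line) on stdin and prints the file(s) owning each one under mount point $1.
# RESOLVE_CMD is expanded unquoted on purpose so it splits into command + options.
RESOLVE_CMD=${RESOLVE_CMD:-btrfs inspect-internal logical-resolve -o}

resolve_addresses() {
    mnt=$1
    while read -r addr; do
        [ -n "$addr" ] || continue   # skip blank lines
        $RESOLVE_CMD "$addr" "$mnt" || echo "no file found for $addr" >&2
    done | sort -u                   # several addresses often map to one file
}
```

For example, printf '%s\n' 5507816349696 | resolve_addresses /mnt/nvme_mirror/ should print the same /mnt/nvme_mirror//VM/VM1/vm1.img path shown above.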

Then re-run the scrub; it should find no errors.

 

Move the files back to their original directories

Re-run the scrub one more time; it should still come back with no errors.

 


