
Cache Drive FS corruption, repair?



I woke up this morning to Docker and the VM manager being down. The cache drive (XFS), which these files are stored on, is now "unmountable" (wrong or no file system).

 

I put the array into maintenance mode and attempted a repair; you can see the results below. My backup of it is from a few months ago, so I can recover most of what was on it, but if possible I would definitely like to get this working again. On the "Main" page it still reads out the temperature and shows the disk as active and "Healthy"; I'm not sure whether those readings are current or just the last ones taken before the failure. I was on 6.10.0 and pushed the update to 6.10.3 today (the disk failed before the update).

 

Any help would be greatly appreciated. I use the VM for work and have a lot of other things down that are somewhat important and time-sensitive.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agi unlinked bucket 23 is 73394135 in ag 1 (inode=1147135959)
sb_icount 1021248, counted 1021376
sb_ifree 8167, counted 6971
sb_fdblocks 164425577, counted 166053595
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 0
        - agno = 2
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 1147135959, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 1147135959 nlinks from 0 to 1
No modify flag set, skipping filesystem flush and exiting.
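For reference, the output above is the read-only pass (the ALERT shows the -n flag was used, so nothing was changed). The usual sequence from the command line looks roughly like this; it's only a sketch, and the device path /dev/nvme0n1p1 is a placeholder for whatever partition Unraid lists for the cache pool, not taken from this system:

xfs_repair -n /dev/nvme0n1p1                                       # check only, writes nothing
mkdir -p /tmp/x && mount /dev/nvme0n1p1 /tmp/x && umount /tmp/x    # replay the dirty log first, as the ALERT suggests
xfs_repair /dev/nvme0n1p1                                          # actual repair, writes corrections
xfs_repair -L /dev/nvme0n1p1                                       # last resort: zero the log, recent metadata changes are lost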

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
agi unlinked bucket 23 is 73394135 in ag 1 (inode=1147135959)
sb_icount 1021248, counted 1021376
sb_ifree 8167, counted 6971
sb_fdblocks 164425577, counted 166053595
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 0
clearing reflink flag on inodes when possible
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 1147135959, moving to lost+found
Phase 7 - verify and correct link counts...
Maximum metadata LSN (72:1879323) is ahead of log (1:2).
Format log to cycle 75.
xfs_repair: Flushing the data device failed, err=61!
Cannot clear needsrepair due to flush failure, err=61.
xfs_repair: Flushing the data device failed, err=61!

fatal error -- File system metadata writeout failed, err=61.  Re-run xfs_repair.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 3
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
xfs_repair: Flushing the data device failed, err=61!
Cannot clear needsrepair due to flush failure, err=61.
xfs_repair: Flushing the data device failed, err=61!

fatal error -- File system metadata writeout failed, err=61.  Re-run xfs_repair.

 

I ran it without any flags a couple of times and got this. It's always possible a log or something filled the drive up and caused this problem.
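When xfs_repair keeps dying on the flush like this, it's worth confirming whether the device is accepting writes at all before re-running it. A minimal sketch (the NVMe device name is a placeholder):

dmesg | tail -n 100       # look for NVMe/block-layer write errors around the repair attempts
smartctl -a /dev/nvme0    # drive health: critical warnings, available spare, media errors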

 

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.46-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 2TB
Serial Number:                      S6B0NG0R405728R
Firmware Version:                   2B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,496,877,862,912 [1.49 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b41150549f
Local Time is:                      Thu Jun 30 09:39:16 2022 PDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- available spare has fallen below threshold
- media has been placed in read only mode

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x09
Temperature:                        36 Celsius
Available Spare:                    0%
Available Spare Threshold:          10%
Percentage Used:                    56%
Data Units Read:                    3,017,526,991 [1.54 PB]
Data Units Written:                 2,839,436,501 [1.45 PB]
Host Read Commands:                 5,464,158,312
Host Write Commands:                4,063,944,349
Controller Busy Time:               45,841
Power Cycles:                       457
Power On Hours:                     4,340
Unsafe Shutdowns:                   28
Media and Data Integrity Errors:    9,994
Error Information Log Entries:      9,994
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               49 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

 

4 minutes ago, live4soccer7 said:

Could the "failed" status have anything to do with the unmountable drive/filesystem?

 

9 minutes ago, live4soccer7 said:
- media has been placed in read only mode

The device is in read-only mode; that's why xfs_repair is failing to write the corrections. You'll need to replace it and restore the data from backups if available.
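Those two bullet points in the SMART section are smartctl decoding the NVMe critical-warning byte (0x09 here: bit 0 = available spare below threshold, bit 3 = media placed in read-only mode). To re-check just that part, something like the following should do (device name is a placeholder):

smartctl -H /dev/nvme0                                 # overall health verdict
smartctl -A /dev/nvme0 | grep -i 'critical warning'    # raw critical-warning value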

6 minutes ago, live4soccer7 said:

What would cause it to go into read-only mode?

 

31 minutes ago, live4soccer7 said:
- available spare has fallen below threshold

 

 

7 minutes ago, live4soccer7 said:

If it is in read-only mode, can I still extract data off it?

It would be easy if the filesystem were still mounting. Since it's not, and it can't be repaired in place, there are basically two options: use a file-recovery utility like UFS Explorer, or clone the drive to another device and then run xfs_repair on the clone.
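A rough sketch of the clone-then-repair route, assuming GNU ddrescue is available (it may need to be installed separately on Unraid) and with both device names and the map-file path as placeholders — double-check them, since the target device gets overwritten:

ddrescue -f /dev/nvme0n1 /dev/sdX /tmp/cache-clone.map    # copy everything readable, keep a map of bad areas
xfs_repair /dev/sdX1                                      # repair the XFS partition on the clone, not the original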


Flash devices come with spare cells to replace ones that go bad. For that device, once the spare space drops below 10% you get a SMART warning; it's now at 0%, which I assume is why the device has gone read-only.

 

Also note that according to this the device was just a little past the halfway point of its rated life, but that's only an indication; I have one currently at 187% and still going strong.

 

1 hour ago, live4soccer7 said:
Percentage Used:                    56%
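If you do replace the drive, the two counters being discussed here are easy to keep an eye on going forward; a one-liner sketch, device name again a placeholder:

smartctl -A /dev/nvme0 | grep -E 'Percentage Used|Available Spare'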

 

12 hours ago, live4soccer7 said:

If using a recovery tool like UFS, would this impair the ability to clone the drive?

No, but if you have a spare device available, cloning would be my first option; no need to buy another program. Just note that if you clone to a larger device it won't mount with Unraid, but you can use UD (Unassigned Devices).

