KptnKMan Posted October 1, 2020

Hi all,

Something strange is happening with my array and I'm not sure if I should be very worried or what to do.

Edit: I'm running 6.8.3, the latest stable; no changes for some time now.

So today I ran a monthly parity check on my 34TB array. A little over 18 hours in, I checked the status and saw that there were 450 errors listed. The current log showed that one of my disks (Disk 4 / ata7) was playing up and having an issue.

I stopped the parity check, stopped the array, and checked the log of the disk. It looked like the disk was having some kind of initialisation error, but I foolishly didn't take a screenshot or a note.

I brought the array back online and saw that the reported usage was the same, with Disk 4 still having issues. When accessing the array over the LAN, I noticed many files missing, and my VMs wouldn't start. Many files and directories appeared to be missing, despite the reported array size being correct. The VMs would not start because files like the GPU BIOS and the virtio-win-0.1.173-2.iso image were missing.

At this point I decided to completely shut down the system, leave it for a little bit, then start up clean. Now the array is mounted, but Disk 4 is showing "Unmountable: No file system", with the option to format the disk available further down. The missing files were still missing at first, but after a short time seemed to have reappeared; I haven't verified everything. The array usage now seems to report what looks like an incorrect total.

Any advice on what I should or can do? Thanks for any help.

blaster-diagnostics-20201001-2003.zip

Edited October 1, 2020 by KptnKMan
JorgeB Posted October 1, 2020

Please post the diagnostics: Tools -> Diagnostics
KptnKMan Posted October 1, 2020 (Author)

Yeah, sorry, I forgot to attach it. Generated a new one and added it to the original post.
JorgeB Posted October 1, 2020

Check the filesystem on disk4.
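For reference, the same check the Unraid GUI performs can be run from the console. A minimal sketch, assuming the array is started in Maintenance mode and Disk 4 maps to /dev/md4 (on Unraid the mdX number matches the disk slot; your device may differ):

```shell
# Read-only check first: -n = no modify, -v = verbose.
# Use the /dev/mdX device, not /dev/sdX, so that any later
# repairs keep parity in sync.
xfs_repair -nv /dev/md4
```

Running with -n first is the safe move: it reports what would be fixed without touching the disk.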
KptnKMan Posted October 1, 2020 (Author)

Ran the check in Maintenance mode, with the -nv options. Results:

Phase 1 - find and verify superblock...
        - block cache size set to 3043288 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1126679 tail block 1126656
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 278251709, counted 279719268
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
data fork in ino 1567417 claims free block 195322
data fork in ino 1567417 claims free block 195323
data fork in ino 1567419 claims free block 250450
data fork in ino 1567419 claims free block 250451
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
data fork in ino 12884902030 claims free block 1610613854
data fork in ino 12884902030 claims free block 1610613855
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 7
        - agno = 6
        - agno = 0
        - agno = 1
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (3:1137604) is ahead of log (3:1126679).
Would format log to cycle 6.
No modify flag set, skipping filesystem flush and exiting.
XFS_REPAIR Summary    Thu Oct  1 20:31:14 2020

Phase           Start           End             Duration
Phase 1:        10/01 20:30:17  10/01 20:30:18  1 second
Phase 2:        10/01 20:30:18  10/01 20:30:18
Phase 3:        10/01 20:30:18  10/01 20:30:50  32 seconds
Phase 4:        10/01 20:30:50  10/01 20:30:50
Phase 5:        Skipped
Phase 6:        10/01 20:30:50  10/01 20:31:14  24 seconds
Phase 7:        10/01 20:31:14  10/01 20:31:14

Total run time: 57 seconds
JorgeB Posted October 1, 2020

You need to run it without -n or nothing will be done, and if it asks for -L, use it.
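Spelled out, the sequence being suggested is roughly the following. This is a sketch, not a definitive procedure; the /dev/md4 device and /mnt/disk4 mount point are assumptions for illustration:

```shell
# 1. Attempt the real repair (no -n, so changes are written):
xfs_repair -v /dev/md4

# 2. If it refuses because of a dirty log, mounting the
#    filesystem once replays the log; then unmount and repair:
#      mount /dev/md4 /mnt/disk4 && umount /mnt/disk4
#      xfs_repair -v /dev/md4

# 3. Only if the mount itself fails, zero the log as a last
#    resort. This discards in-flight metadata updates and can
#    cause some corruption, which is why it is the last step:
#      xfs_repair -Lv /dev/md4
```

Steps 2 and 3 are commented out deliberately; -L should only ever be used when a mount is impossible, exactly as the xfs_repair error message itself warns.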
KptnKMan Posted October 1, 2020 (Author)

Ok thanks, running without any options produced this response. I'll try again with -L, as advised and as suggested in the response:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
KptnKMan Posted October 1, 2020 (Author)

Check complete using -L. Results:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_fdblocks 278251709, counted 279719268
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
data fork in ino 1567417 claims free block 195322
data fork in ino 1567417 claims free block 195323
data fork in ino 1567419 claims free block 250450
data fork in ino 1567419 claims free block 250451
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
data fork in ino 12884902030 claims free block 1610613854
data fork in ino 12884902030 claims free block 1610613855
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 1
        - agno = 5
        - agno = 4
        - agno = 7
        - agno = 3
        - agno = 6
        - agno = 0
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (3:1137604) is ahead of log (1:2).
Format log to cycle 6.
done
KptnKMan Posted October 1, 2020 (Author)

Well, I ran the check with -nv again, as recommended by the documentation. Result before I start the array normally:

Phase 1 - find and verify superblock...
        - block cache size set to 3043288 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 0 tail block 0
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 2
        - agno = 1
        - agno = 3
        - agno = 4
        - agno = 7
        - agno = 5
        - agno = 6
        - agno = 0
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Thu Oct  1 20:45:37 2020

Phase           Start           End             Duration
Phase 1:        10/01 20:44:38  10/01 20:44:40  2 seconds
Phase 2:        10/01 20:44:40  10/01 20:44:40
Phase 3:        10/01 20:44:40  10/01 20:45:13  33 seconds
Phase 4:        10/01 20:45:13  10/01 20:45:13
Phase 5:        Skipped
Phase 6:        10/01 20:45:13  10/01 20:45:37  24 seconds
Phase 7:        10/01 20:45:37  10/01 20:45:37

Total run time: 59 seconds
KptnKMan Posted October 1, 2020 (Author)

Thanks @JorgeB, looks like the array started back up correctly. I tried to browse for lost+found and couldn't see the dir, but I'll check with the CLI. For now it looks like it's working. Thanks so much for your help.
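Checking for orphaned files from the console might look like this (a sketch; the /mnt/disk4 path assumes Unraid's usual per-disk share layout):

```shell
# xfs_repair only creates lost+found when it actually had to
# move disconnected inodes there, so its absence is a good sign.
ls -la /mnt/disk4/lost+found 2>/dev/null \
  || echo "no lost+found - nothing was orphaned"
```

If the directory does exist, the files in it are recovered inodes named by inode number, and they have to be identified and moved back by hand.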