BTRFS Errors after cache upgrade

jackfalveyiv · March 8, 2023

I've been experiencing a litany of issues with my server, seemingly around cache drive corruption. I formatted the drive several times before buying a new one and installing it. After two days, I'm starting to get the errors shown in the screenshot below. What is happening? I don't see an issue with my applications yet; I'm just trying to get ahead of whatever this may be.

trescommas-diagnostics-20230308-1415.zip

JorgeB · March 8, 2023

Mar  8 10:12:21 TresCommas kernel: ata6.00: exception Emask 0x10 SAct 0xc0000 SErr 0x4890000 action 0xe frozen
Mar  8 10:12:21 TresCommas kernel: ata6.00: irq_stat 0x08400040, interface fatal error, connection status changed
Mar  8 10:12:21 TresCommas kernel: ata6: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
Mar  8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED
Mar  8 10:12:21 TresCommas kernel: ata6.00: cmd 60/98:90:b0:4a:0b/00:00:68:03:00/40 tag 18 ncq dma 77824 in
Mar  8 10:12:21 TresCommas kernel:         res 40/00:00:e0:c5:29/00:00:d5:00:00/40 Emask 0x10 (ATA bus error)
Mar  8 10:12:21 TresCommas kernel: ata6.00: status: { DRDY }
Mar  8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED
Mar  8 10:12:21 TresCommas kernel: ata6.00: cmd 60/80:98:78:3c:7f/00:00:68:03:00/40 tag 19 ncq dma 65536 in
Mar  8 10:12:21 TresCommas kernel:         res 40/00:00:e0:c5:29/00:00:d5:00:00/40 Emask 0x10 (ATA bus error)
Mar  8 10:12:21 TresCommas kernel: ata6.00: status: { DRDY }
Mar  8 10:12:21 TresCommas kernel: ata6: hard resetting link
Mar  8 10:12:27 TresCommas kernel: ata6: link is slow to respond, please be patient (ready=0)
Mar  8 10:12:31 TresCommas kernel: ata6: COMRESET failed (errno=-16)
Mar  8 10:12:31 TresCommas kernel: ata6: hard resetting link
Mar  8 10:12:35 TresCommas kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar  8 10:12:36 TresCommas kernel: ata6.00: supports DRM functions and may not be fully accessible

There are issues with disk3: check/replace the cables. It's also a good idea to recreate the docker image on cache instead of disk3.
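
For anyone wanting to spot this pattern in their own syslog, here is a minimal Python sketch that counts link-level error messages per ATA port. The function name `summarize_ata_errors` and the list of marker strings are illustrative assumptions based only on the messages quoted above; other kernels or controllers may word them differently.

```python
import re
from collections import Counter

# Matches "kernel: ata6: ..." and "kernel: ata6.00: ..." style lines.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Marker strings copied from the log excerpt above (assumed, not exhaustive).
SIGNS = (
    "interface fatal error",
    "failed command",
    "hard resetting link",
    "COMRESET failed",
)

def summarize_ata_errors(lines):
    """Count link-level error messages per ATA port."""
    counts = Counter()
    for line in lines:
        m = ATA_LINE.search(line)
        if m and any(sign in m.group(2) for sign in SIGNS):
            counts[m.group(1)] += 1
    return counts

sample = [
    "Mar  8 10:12:21 TresCommas kernel: ata6.00: irq_stat 0x08400040, interface fatal error, connection status changed",
    "Mar  8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED",
    "Mar  8 10:12:21 TresCommas kernel: ata6: hard resetting link",
    "Mar  8 10:12:35 TresCommas kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
]
print(summarize_ata_errors(sample))
```

A port that keeps accumulating these messages is a candidate for a cable or power check, as suggested above; the count alone does not say whether the drive itself is at fault.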

jackfalveyiv · March 8, 2023

I turned the array off to rebuild the docker image, then I didn't have an option to delete the vdisk and create a new docker img. When I turned the array back on, my disk3 is now reporting as unmountable. What's my next logical move here?

Edited March 8, 2023 by jackfalveyiv

jackfalveyiv · March 8, 2023

Fresh diagnostic posted below

trescommas-diagnostics-20230308-1454.zip

jackfalveyiv · March 8, 2023

Booted to maint mode, tried a Check Filesystem Status -nv and got the following:


Phase 1 - find and verify superblock...
        - block cache size set to 1404320 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1197993 tail block 1197987
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 2
        - agno = 5
        - agno = 3
        - agno = 9
        - agno = 15
        - agno = 4
        - agno = 13
        - agno = 7
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 0
        - agno = 14
        - agno = 16
        - agno = 6
        - agno = 8
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (4:1198031) is ahead of log (4:1197993).
Would format log to cycle 7.
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Mar  8 15:27:15 2023

Phase		Start		End		Duration
Phase 1:	03/08 15:27:06	03/08 15:27:06
Phase 2:	03/08 15:27:06	03/08 15:27:07	1 second
Phase 3:	03/08 15:27:07	03/08 15:27:11	4 seconds
Phase 4:	03/08 15:27:11	03/08 15:27:11
Phase 5:	Skipped
Phase 6:	03/08 15:27:11	03/08 15:27:15	4 seconds
Phase 7:	03/08 15:27:15	03/08 15:27:15

Total run time: 9 seconds

itimpi · March 8, 2023

That is quite standard. You need to rerun it with the -L option (and without -n, otherwise no changes will be made).
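
As a rough illustration of that decision, the Python sketch below looks for the log alert in the text of a check run and suggests the next step. The helper name `suggest_next_step` and its wording are hypothetical; the alert string is taken from the output pasted above, and the advice simply restates what that alert and the reply say.

```python
# Text fragment from the ALERT in the xfs_repair -n output above.
LOG_ALERT = "valuable metadata changes in a log"

def suggest_next_step(repair_output: str) -> str:
    """Suggest a next step from the text of an `xfs_repair -n` run."""
    if LOG_ALERT in repair_output:
        # The alert itself recommends replaying the log by mounting first;
        # -L discards the log, so it is the fallback when mounting fails.
        return ("dirty log: try mounting to replay it; "
                "if the mount fails, rerun without -n and with -L")
    return "no log alert: rerun without -n to apply any repairs"

sample = """Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
"""
print(suggest_next_step(sample))
```

In this thread the disk was already unmountable, which is why going straight to -L was the practical choice.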

jackfalveyiv · March 8, 2023

Thank you. Here's the output from running with the -L option:


Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 8
        - agno = 13
        - agno = 6
        - agno = 7
        - agno = 1
        - agno = 10
        - agno = 11
        - agno = 14
        - agno = 12
        - agno = 16
        - agno = 15
        - agno = 3
        - agno = 4
        - agno = 9
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (4:1198044) is ahead of log (1:2).
Format log to cycle 7.
done

itimpi · March 8, 2023

Have you tried restarting the array in normal mode? I would expect the drive to now mount OK.

jackfalveyiv · March 8, 2023

I'll give that a try. The disk is still displaying as unmountable in the Main screen, but I'll report back after the next startup attempt.

itimpi · March 8, 2023

1 minute ago, jackfalveyiv said:

The disk is still displaying as unmountable in the Main screen,

It will if you have not restarted the array in normal mode.

jackfalveyiv · March 8, 2023

It did start up and mount. I'm about to rebuild the docker image and I'll report back.

jackfalveyiv · March 8, 2023

My system is back up and running. To summarize, when migrating data off the cache for an upgrade, then back again, it looks like my System share was still on disk3 when I started up the docker service. This looks like it caused the btrfs errors that eventually crashed the disk and made it unmountable. Thanks JorgeB and itimpi for your suggestions and getting me the correct solution.
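
A quick way to check where a share physically lives before starting the docker service could look like the Python sketch below. It assumes the usual Unraid layout where each device is mounted under `/mnt` (for example `/mnt/disk3`, `/mnt/cache`) and that the share folder is named `system`; the function name `share_locations` is hypothetical, and `user`/`user0` are skipped because they are merged views rather than physical devices.

```python
from pathlib import Path

def share_locations(share: str, mnt_root: str = "/mnt") -> list[str]:
    """Return the device mount names under mnt_root that hold the share."""
    skip = {"user", "user0"}  # merged user-share views, not physical devices
    found = []
    for dev in sorted(Path(mnt_root).iterdir()):
        if dev.name in skip or not dev.is_dir():
            continue
        if (dev / share).is_dir():
            found.append(dev.name)
    return found

if __name__ == "__main__":
    # Assumed share name; adjust to match the share holding docker.img.
    if Path("/mnt").is_dir():
        print(share_locations("system"))
```

If the result lists an array disk such as `disk3` in addition to (or instead of) `cache`, the share has not been fully moved back to the cache.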

JorgeB · March 9, 2023

11 hours ago, jackfalveyiv said:

This looks like it caused the btrfs errors that eventually crashed the disk and made it unmountable.

What likely caused the problems with both the docker image and the disk filesystem were the ATA errors I mentioned above, so make sure you check/replace cables.

jackfalveyiv · March 9, 2023

Noted. Replacing the cables in the coming day or two, and I received a read error this morning, fresh diagnostic posted below. Is this the beginning of a full hd failure?

trescommas-diagnostics-20230309-0743.zip

JorgeB · March 9, 2023

Still looks like a power/connection problem.

jackfalveyiv · March 9, 2023

Ok, new cables arrive tomorrow and everything will get swapped then. Will update at that point. Thanks.

jackfalveyiv · March 9, 2023

Question: if I have the system turned on but the array unmounted, am I safe to unplug/plug-in a drive? I'm realizing I need to label my drives somehow so that I know which is which the next time I need to do some troubleshooting. Thanks in advance.

JorgeB · March 9, 2023

Usually yes, but if the hardware doesn't fully support hot plug it can cause issues.

jackfalveyiv · March 9, 2023

Understood, thank you.
