jackfalveyiv Posted March 8, 2023
I've been experiencing a litany of issues with my server, seemingly related to cache drive corruption. I formatted the drive several times before buying a new one and installing it. After two days, I'm starting to get the errors seen in the screenshot below. What is happening? I don't see an issue with my applications yet; I'm just trying to get ahead of whatever this may be.
trescommas-diagnostics-20230308-1415.zip
JorgeB Posted March 8, 2023

Mar 8 10:12:21 TresCommas kernel: ata6.00: exception Emask 0x10 SAct 0xc0000 SErr 0x4890000 action 0xe frozen
Mar 8 10:12:21 TresCommas kernel: ata6.00: irq_stat 0x08400040, interface fatal error, connection status changed
Mar 8 10:12:21 TresCommas kernel: ata6: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
Mar 8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED
Mar 8 10:12:21 TresCommas kernel: ata6.00: cmd 60/98:90:b0:4a:0b/00:00:68:03:00/40 tag 18 ncq dma 77824 in
Mar 8 10:12:21 TresCommas kernel: res 40/00:00:e0:c5:29/00:00:d5:00:00/40 Emask 0x10 (ATA bus error)
Mar 8 10:12:21 TresCommas kernel: ata6.00: status: { DRDY }
Mar 8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED
Mar 8 10:12:21 TresCommas kernel: ata6.00: cmd 60/80:98:78:3c:7f/00:00:68:03:00/40 tag 19 ncq dma 65536 in
Mar 8 10:12:21 TresCommas kernel: res 40/00:00:e0:c5:29/00:00:d5:00:00/40 Emask 0x10 (ATA bus error)
Mar 8 10:12:21 TresCommas kernel: ata6.00: status: { DRDY }
Mar 8 10:12:21 TresCommas kernel: ata6: hard resetting link
Mar 8 10:12:27 TresCommas kernel: ata6: link is slow to respond, please be patient (ready=0)
Mar 8 10:12:31 TresCommas kernel: ata6: COMRESET failed (errno=-16)
Mar 8 10:12:31 TresCommas kernel: ata6: hard resetting link
Mar 8 10:12:35 TresCommas kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 8 10:12:36 TresCommas kernel: ata6.00: supports DRM functions and may not be fully accessible

These are issues with disk3; check/replace the cables. It's also a good idea to recreate the docker image on cache instead of disk3.
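For anyone triaging a similar syslog, the kernel messages quoted above can be tallied per ATA port to see how often a link misbehaves. The following is only a minimal sketch: the function name `count_link_events`, the `LINK_MARKERS` list, and the choice of which substrings count as "link trouble" are illustrative assumptions, not part of Unraid or the kernel. It does not map an `ataN` port to a disk slot.

```python
import re
from collections import Counter

# Matches kernel lines such as
# "Mar 8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED"
# and captures the port ("ata6") plus the rest of the message.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings treated as signs of link trouble (an illustrative assumption).
LINK_MARKERS = (
    "interface fatal error",
    "failed command",
    "hard resetting link",
    "COMRESET failed",
)


def count_link_events(lines):
    """Return a Counter keyed by (port, marker) for the given syslog lines."""
    events = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not an ataN message (for example the "res ..." lines)
        port, message = match.groups()
        for marker in LINK_MARKERS:
            if marker in message:
                events[(port, marker)] += 1
    return events


# A few lines copied from the log quoted in this thread.
SAMPLE = [
    "Mar 8 10:12:21 TresCommas kernel: ata6.00: irq_stat 0x08400040, interface fatal error, connection status changed",
    "Mar 8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED",
    "Mar 8 10:12:21 TresCommas kernel: ata6.00: failed command: READ FPDMA QUEUED",
    "Mar 8 10:12:21 TresCommas kernel: ata6: hard resetting link",
    "Mar 8 10:12:31 TresCommas kernel: ata6: COMRESET failed (errno=-16)",
    "Mar 8 10:12:31 TresCommas kernel: ata6: hard resetting link",
]

for (port, marker), count in sorted(count_link_events(SAMPLE).items()):
    print(port, marker, count)
```

Repeated resets on a single port, as in this sample, are consistent with the cable/connection advice given in the thread, though a count alone does not prove the cause.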
jackfalveyiv Posted March 8, 2023
I turned the array off to rebuild the docker image, but then I didn't have an option to delete the vdisk and create a new docker image. When I turned the array back on, disk3 was reporting as unmountable. What's my next logical move here?
Edited March 8, 2023 by jackfalveyiv
jackfalveyiv Posted March 8, 2023
Fresh diagnostic posted below:
trescommas-diagnostics-20230308-1454.zip
jackfalveyiv Posted March 8, 2023
Booted to maint mode, tried a Check Filesystem Status -nv and got the following:

Phase 1 - find and verify superblock...
        - block cache size set to 1404320 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1197993 tail block 1197987
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 2
        - agno = 5
        - agno = 3
        - agno = 9
        - agno = 15
        - agno = 4
        - agno = 13
        - agno = 7
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 0
        - agno = 14
        - agno = 16
        - agno = 6
        - agno = 8
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (4:1198031) is ahead of log (4:1197993).
Would format log to cycle 7.
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Wed Mar 8 15:27:15 2023

Phase      Start            End              Duration
Phase 1:   03/08 15:27:06   03/08 15:27:06
Phase 2:   03/08 15:27:06   03/08 15:27:07   1 second
Phase 3:   03/08 15:27:07   03/08 15:27:11   4 seconds
Phase 4:   03/08 15:27:11   03/08 15:27:11
Phase 5:   Skipped
Phase 6:   03/08 15:27:11   03/08 15:27:15   4 seconds
Phase 7:   03/08 15:27:15   03/08 15:27:15

Total run time: 9 seconds
itimpi Posted March 8, 2023
That is quite standard. You need to rerun it with the -L option added (and without -n, so that changes can actually be made).
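The dry-run output above carries three signals worth spotting before choosing the next step: the dirty-log ALERT, the "No modify flag set" notice (nothing was changed), and the metadata LSN being ahead of the log. A minimal sketch of pulling those out of saved output follows. The function name `summarize_dry_run` and the returned keys are hypothetical, and the matching relies only on the message text shown in this thread, which may differ in other xfs_repair versions.

```python
import re

# Message text as it appears in the output posted in this thread.
LSN_PATTERN = re.compile(
    r"Maximum metadata LSN \((\d+):(\d+)\) is ahead of log \((\d+):(\d+)\)"
)


def summarize_dry_run(output):
    """Extract a few signals from the text of an `xfs_repair -n` run."""
    # Collapse whitespace so messages wrapped across lines still match.
    text = " ".join(output.split())
    match = LSN_PATTERN.search(text)
    lsn_gap = None
    if match:
        meta_cycle, meta_block, log_cycle, log_block = (int(g) for g in match.groups())
        lsn_gap = ((meta_cycle, meta_block), (log_cycle, log_block))
    return {
        # The log holds metadata changes that have not been replayed.
        "dirty_log": "valuable metadata changes in a log" in text,
        # -n was in effect, so the run was read-only.
        "no_modify": "No modify flag set" in text,
        # ((metadata cycle, block), (log cycle, block)) or None.
        "lsn_gap": lsn_gap,
    }


# Excerpt of the dry-run output posted earlier in the thread.
SAMPLE_DRY_RUN = """
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.
Phase 7 - verify link counts...
Maximum metadata LSN (4:1198031) is ahead of log (4:1197993).
Would format log to cycle 7.
No modify flag set, skipping filesystem flush and exiting.
"""

print(summarize_dry_run(SAMPLE_DRY_RUN))
```

When `dirty_log` is true, the ALERT itself notes that mounting the filesystem would replay the log. In this thread the disk was unmountable, which is why -L (zeroing the log, at the cost of the unreplayed changes) was the suggested route.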
jackfalveyiv Posted March 8, 2023
Thank you. Here's the output from running with the -L option:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 8
        - agno = 13
        - agno = 6
        - agno = 7
        - agno = 1
        - agno = 10
        - agno = 11
        - agno = 14
        - agno = 12
        - agno = 16
        - agno = 15
        - agno = 3
        - agno = 4
        - agno = 9
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (4:1198044) is ahead of log (1:2).
Format log to cycle 7.
done
itimpi Posted March 8, 2023
Have you tried restarting the array in normal mode? I would expect the drive to now mount OK.
jackfalveyiv Posted March 8, 2023
I'll give that a try. The disk is still displaying as unmountable in the Main screen, but I'll report back after the next startup attempt.
itimpi Posted March 8, 2023
1 minute ago, jackfalveyiv said: "The disk is still displaying as unmountable in the Main screen,"
It will if you have not restarted the array in normal mode.
jackfalveyiv Posted March 8, 2023
It did start up and mount. I'm about to rebuild the docker image and I'll report back.
jackfalveyiv Posted March 8, 2023 (Solution)
My system is back up and running. To summarize: when I migrated data off the cache for an upgrade and then back again, it looks like my System share was still on disk3 when I started up the docker service. This looks like it caused the btrfs errors that eventually crashed the disk and made it unmountable. Thanks JorgeB and itimpi for your suggestions and for getting me to the correct solution.
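A quick way to check where a share's folders physically live, before starting the docker service, is to look for the share directory on each individual mount. The sketch below assumes the usual Unraid layout, with per-device mounts such as `/mnt/disk3` and `/mnt/cache` and the merged views at `/mnt/user` and `/mnt/user0`; those paths, the share name `system`, and the helper name `find_share_copies` are assumptions for illustration and were not confirmed in this thread.

```python
import os


def find_share_copies(mnt_root, share="system"):
    """List the mounts under mnt_root that contain a top-level `share` folder."""
    # Merged views would always show the share, so they are skipped (assumed names).
    merged_views = {"user", "user0"}
    found = []
    for name in sorted(os.listdir(mnt_root)):
        if name in merged_views:
            continue
        if os.path.isdir(os.path.join(mnt_root, name, share)):
            found.append(name)
    return found


if os.path.isdir("/mnt"):
    # On a server laid out as assumed above, a result like ['cache', 'disk3']
    # would mean the share is split across the cache and an array disk.
    print(find_share_copies("/mnt"))
```

If the share shows up on an array disk as well as the cache, the docker image may be running from the array, which matches the situation described in the summary above.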
JorgeB Posted March 9, 2023
11 hours ago, jackfalveyiv said: "This looks like it caused the btrfs errors that eventually crashed the disk and made it unmountable."
What likely caused the problems with both the docker image and the disk filesystem were the ATA errors I mentioned above, so make sure you check/replace the cables.
jackfalveyiv Posted March 9, 2023
Noted. I'll be replacing the cables in the coming day or two. I also received a read error this morning; fresh diagnostic posted below. Is this the beginning of a full HD failure?
trescommas-diagnostics-20230309-0743.zip
JorgeB Posted March 9, 2023
Still looks like a power/connection problem.
jackfalveyiv Posted March 9, 2023
Ok, new cables arrive tomorrow and everything will get swapped then. Will update at that point. Thanks.
jackfalveyiv Posted March 9, 2023
Question: if I have the system turned on but the array unmounted, am I safe to unplug/plug in a drive? I'm realizing I need to label my drives somehow so that I know which is which the next time I need to do some troubleshooting. Thanks in advance.
JorgeB Posted March 9, 2023
Usually yes, but if the hardware doesn't fully support hot plug it can cause issues.
jackfalveyiv Posted March 9, 2023
Understood, thank you.
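On the labeling question above: Linux systems using udev normally expose each drive's model and serial number as a symlink name under `/dev/disk/by-id`, so a device-to-identifier list can be printed once and copied onto physical labels without unplugging anything. This is a minimal sketch. The helper name `map_ids_to_devices` is hypothetical, and skipping `wwn-` entries and partition links is just a choice to keep the list short.

```python
import os
import re

# Partition links look like "<id>-part1"; whole-disk links have no such suffix.
PARTITION_SUFFIX = re.compile(r"-part\d+$")


def map_ids_to_devices(by_id_dir="/dev/disk/by-id"):
    """Map kernel device names (e.g. 'sdb') to their persistent by-id names."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        if PARTITION_SUFFIX.search(name) or name.startswith("wwn-"):
            continue  # keep only whole-disk, human-readable identifiers
        # Each entry is a symlink to the device node; resolve it to get the name.
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping.setdefault(os.path.basename(target), []).append(name)
    return mapping


if os.path.isdir("/dev/disk/by-id"):
    for device, ids in sorted(map_ids_to_devices().items()):
        print(device, ", ".join(ids))
```

The serial number in each identifier is typically also printed on the drive's own sticker, which gives a way to match a label to a drive with the system powered off.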