Previously working cache currently "unmountable: not mounted" after hanging due to (?) full syslog file (SOLVED)



I hope I'll make sense here.

 

Uptime had been uninterrupted since upgrading to 6.9.1 (stable launch) until, all of a sudden, I couldn't access my containers remotely. Upon checking I noticed a full syslog and that Docker/VMs were down. I restarted, and the cache (xfs) was then showing as an Unassigned Device; as soon as I reassigned it to the pool I got the error shown in the title.

I attempted to follow this guide, and after mounting there's only a single folder left with some of my NC data. With the cache taken off the array, it appears as btrfs with 0 bytes used/0 bytes free until it's formatted.

 

Any idea what happened here? Any chance I can recover the data?

Thanks for your time!

tower-diagnostics-20210609-2209.zip

34 minutes ago, Tzundoku said:

I attempted to follow this guide,

That guide is for btrfs, not xfs.

 

The NVMe device dropped offline:

 

Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 998 QID 21 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 999 QID 21 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 1000 QID 21 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 968 QID 5 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 969 QID 5 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 934 QID 12 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 935 QID 12 timeout, aborting
Jun  7 15:19:21 Tower kernel: nvme nvme0: I/O 936 QID 12 timeout, aborting
Jun  7 15:19:51 Tower kernel: nvme nvme0: I/O 968 QID 5 timeout, reset controller
Jun  7 15:20:21 Tower kernel: nvme nvme0: I/O 12 QID 0 timeout, reset controller
Jun  7 15:21:15 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jun  7 15:21:15 Tower kernel: nvme nvme0: Abort status: 0x371
### [PREVIOUS LINE REPEATED 7 TIMES] ###
Jun  7 15:21:36 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jun  7 15:21:36 Tower kernel: nvme nvme0: Removing after probe failure status: -19
Jun  7 15:21:56 Tower kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jun  7 15:21:56 Tower kernel: XFS (nvme0n1p1): log I/O error -5
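
(For anyone wanting to spot this in their own logs: NVMe timeouts and resets like the above can be pulled out with a quick grep. A minimal sketch, assuming the stock /var/log/syslog location on the server; the same search works on the syslog file inside the diagnostics zip:)

# Look for NVMe timeouts/controller resets and XFS I/O errors in the syslog
grep -iE 'nvme.*(timeout|reset|abort)|xfs.*error' /var/log/syslog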

 

Post diags after rebooting.

On 6/11/2021 at 9:13 AM, JorgeB said:

Thanks a bunch.

 

I tried it as per the guide, and it came up with this:

 

Quote

Phase 1 - find and verify superblock...
        - block cache size set to 1536600 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 277078 tail block 276881
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 18594701, counted 18594717 in ag 1
sb_icount 124608, counted 150912
sb_ifree 3051, counted 373
sb_fdblocks 75725874, counted 44572093
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
data fork in ino 270006436 claims free block 33749787
imap claims in-use inode 270006436 is free, correcting imap
data fork in ino 270060728 claims free block 33757579
        - agno = 2
bad nblocks 6502724 for inode 539059680, would reset to 6502740
bad nextents 154002 for inode 539059680, would reset to 154000
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (1,185726-185741) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
bad nblocks 6502724 for inode 539059680, would reset to 6502740
bad nextents 154002 for inode 539059680, would reset to 154000
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 270006436, would move to lost+found
disconnected inode 270060730, would move to lost+found
Phase 7 - verify link counts...
would have reset inode 270060728 nlinks from 1 to 2
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sat Jun 12 13:31:34 2021

Phase           Start           End             Duration
Phase 1:        06/12 13:31:33  06/12 13:31:33
Phase 2:        06/12 13:31:33  06/12 13:31:33
Phase 3:        06/12 13:31:33  06/12 13:31:34  1 second
Phase 4:        06/12 13:31:34  06/12 13:31:34
Phase 5:        Skipped
Phase 6:        06/12 13:31:34  06/12 13:31:34
Phase 7:        06/12 13:31:34  06/12 13:31:34

Total run time: 1 second
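
For reference, a read-only check like the one quoted above is typically run with the array started in maintenance mode, either via the Check button on the device's page or from the console with something like the following (the device path is an assumption taken from the syslog earlier in the thread; adjust it to your own cache device):

# Read-only check: reports problems but writes no changes (-n)
# /dev/nvme0n1p1 is assumed from the earlier syslog - substitute your cache partition
xfs_repair -n /dev/nvme0n1p1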

 

 

I might have misled you, as I got confused over what I was seeing: the drive that I noticed was btrfs wasn't actually the NVMe cache. The real NVMe cache (a similar model to the btrfs one) wasn't appearing among any of my devices (since it had apparently dropped offline), and I didn't realize I was looking at the wrong drive.

The nvme cache is currently showing (after powering off and taking the server off power for a while), but it appears as a new device if I attempt to set it to the cache pool.

The above filesystem status corresponds to the correct nvme drive.

7 minutes ago, Tzundoku said:

The nvme cache is currently showing (after powering off and taking the server off power for a while), but it appears as a new device if I attempt to set it to the cache pool.

That's OK. As long as there's no warning on the right side that "all data on this device will be deleted at array start", you can just start the array; if it doesn't mount, run another filesystem check, but without -n.
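
(If doing it from the console rather than the GUI, a rough sketch of the same repair, with the array in maintenance mode; the device path is again assumed from the earlier syslog:)

# Repair run: same as before but without -n, so fixes are actually written
# /dev/nvme0n1p1 is assumed - substitute your cache partition
xfs_repair /dev/nvme0n1p1
# If it complains about a dirty log, mounting the filesystem once to replay the
# log is preferred; xfs_repair -L (zero the log) is the last-resort option.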

1 hour ago, JorgeB said:

That's OK. As long as there's no warning on the right side that "all data on this device will be deleted at array start", you can just start the array; if it doesn't mount, run another filesystem check, but without -n.

 

That worked!
 

