
6.11.5 to 6.12.1 upgrade caused drives to become unmountable




After upgrading from 6.11.5 to 6.12.1 I could not start my array because my data drive was showing as "Unmountable: Unsupported or no file system".

 

I added a new drive to the system with the intent of rebuilding from parity. The new drive happens to be larger than the old parity drive, so I did a Parity Swap Procedure. The copy succeeded as far as I can tell. When I began the rebuild on the new data drive (i.e., the old parity drive), it started showing the same "Unmountable" error. The rebuild hasn't finished yet, but I expect to end up in the same state I started in. What is this error trying to tell me? Could it be related to the OS update?

 

Some additional info: The old data drive is now completely idle, so I decided to see if I could diagnose it any further:

 

root@Mnemosyne:~# mount /dev/sde1 foo
mount: /root/foo: wrong fs type, bad option, bad superblock on /dev/sde1, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
root@Mnemosyne:~# dmesg | tail -10
[91533.996720] less[12157]: segfault at 1a496 ip 00000000004057e7 sp 00007ffc1d2720e0 error 4 in less[402000+1a000] likely on CPU 1 (core 0, socket 0)
[91533.996781] Code: 8b 04 24 41 8b 8c 24 40 80 00 00 e9 05 fd ff ff 48 c7 c5 d8 f4 42 00 45 85 ed 0f 84 93 00 00 00 bf fa 00 00 00 e8 89 0d 01 00 <8b> 55 00 85 d2 75 08 41 bd 01 00 00 00 eb ab 48 c7 c0 e8 2d 43 00
[113884.439807] XFS (sde1): Mounting V5 Filesystem
[113884.503022] XFS (sde1): Corruption warning: Metadata has LSN (2:346049) ahead of current LSN (2:345452). Please unmount and run xfs_repair (>= v4.3) to resolve.
[113884.503041] XFS (sde1): log mount/recovery failed: error -22
[113884.503292] XFS (sde1): log mount failed
[114218.204375] XFS (sde1): Mounting V5 Filesystem
[114218.250007] XFS (sde1): Corruption warning: Metadata has LSN (2:346049) ahead of current LSN (2:345452). Please unmount and run xfs_repair (>= v4.3) to resolve.
[114218.250018] XFS (sde1): log mount/recovery failed: error -22
[114218.250192] XFS (sde1): log mount failed

 

This led me to this thread, which seems vaguely related.
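
In any case, the kernel log points at xfs_repair as the next step. A minimal read-only check of the now-unassigned disk would look something like this (a sketch; the device is assumed to still be /dev/sde1, and an assigned array disk should instead be checked through its /dev/mdXp1 device so parity stays in sync):

# -n is no-modify mode: report problems only, write nothing
xfs_repair -n /dev/sde1

# the kernel message asks for xfs_repair >= 4.3; check the installed version
xfs_repair -V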

 

diagnostics-20230623-1623.zip

2 minutes ago, trurl said:

I assume you mean you REPLACED drive1, your only data disk. ADDING is a totally different thing than REPLACING.

I added a drive to the system, but I replaced the data drive in the array. The old data drive is currently in the system, powered, etc., but not assigned. In fact, the mount and dmesg logs above are from the original data disk.

3 hours ago, jgopel said:

I added a drive to the system

Since you don't have a disk2 now, I don't see how you could have ADDED a drive. ADDING means adding a disk to a new slot. If you're trying to put a different disk in a slot that is already assigned, that is REPLACING. These are very different things handled in very different ways because they have different requirements to maintain parity.

1 hour ago, trurl said:

Since you don't have a disk2 now, I don't see how you could have ADDED a drive. ADDING means adding a disk to a new slot. If you're trying to put a different disk in a slot that is already assigned, that is REPLACING. These are very different things handled in very different ways because they have different requirements to maintain parity.

I haven't looked closely at the contents of the diagnostics to know what you're going off of, but here are the disks in the system per lshw:

 

root@Mnemosyne:~# lshw -class disk
  *-disk                    
       description: SCSI Disk
       product: SanDisk 3.2Gen1
       vendor: USB
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: 1.00
       serial: 0501e7d20d2b890a2a0f
       size: 57GiB (61GB)
       capabilities: removable
       configuration: ansiversion=6 logicalsectorsize=512 sectorsize=512
     *-medium
          physical id: 0
          logical name: /dev/sda
          size: 57GiB (61GB)
          capabilities: partitioned partitioned:dos
  *-disk:0
       description: ATA Disk
       product: WDC WD181KFGX-68
       vendor: Western Digital
       physical id: 0
       bus info: scsi@5:0.0.0
       logical name: /dev/sdb
       version: 0A83
       serial: 2JK9KVBB
       size: 16TiB (18TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=9d7f5e68-b1c3-4bfb-ab9a-b848cf92a01c logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: ST14000NM000J-2T
       physical id: 1
       bus info: scsi@6:0.0.0
       logical name: /dev/sdc
       version: SN01
       serial: ZR701WCM
       size: 12TiB (14TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=2b9706e5-5ebd-f477-59bc-3324c5e6672f logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: Samsung SSD 850
       physical id: 2
       bus info: scsi@7:0.0.0
       logical name: /dev/sdd
       version: 2B6Q
       serial: S250NXAGC05587N
       size: 476GiB (512GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512
  *-disk:3
       description: ATA Disk
       product: ST14000NM001G-2K
       physical id: 3
       bus info: scsi@8:0.0.0
       logical name: /dev/sde
       version: SN03
       serial: ZL2N0TCM
       size: 12TiB (14TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=f459f699-abb9-3556-ca5f-11f834a78b64 logicalsectorsize=512 sectorsize=4096

The 2 Seagate drives were the original data (/dev/sde - 001G) and original parity/new data (/dev/sdc - 000J). The new parity drive is the Western Digital (/dev/sdb - KFGX). 001G was replaced in the array by KFGX (though I haven't gone to the effort of PHYSICALLY removing it from the system yet). It is fine for you to think of 001G as having been replaced by KFGX from the perspective of the array.

 

Does that clear things up at all?

1 hour ago, jgopel said:

contents of the diagnostics

I am referring to the disk assignments as they appear in the webUI and as Unraid keeps track of them.

 

One place you can see your disk assignments (assuming none are disconnected) is in the smart folder of diagnostics. A better place is vars.txt in the system folder of diagnostics, since this will usually show the disks that are assigned even if they aren't connected.

 

The "logical name"  isn't guaranteed to stay the same from one boot to another, and sometimes a drive will disconnect and reconnect as a different "logical name". To keep track of disk assignments, Unraid uses the serial number for the disk or whatever the controller reports for the id. This may be different than the serial number if you are using RAID or some other type of controller such as USB to connect to the disk.

 

But we can go from what you have listed above.

 

sda is your flash drive. This is the drive Unraid boots from, it is not part of the array or any pools. It is in your list above as just "disk".

sdb is your parity disk. The parity disk is sometimes referred to as slot0 or disk0 of the array, and it is coincidentally disk: 0 in your list above.

sdc is the disk assigned to slot1, or Disk 1 of the array as it appears in the webUI, and it is coincidentally disk: 1 in your list above.

sdd is the disk (SSD) assigned as the only disk in the pool named 'cache'. It is not in the array, so it is not slot2 or Disk 2 in the webUI, but it appears as disk: 2 in your list.

sde is an HDD, currently unassigned (not in the array or pools), so it is not slot3 or Disk 3 in the webUI, but it appears as disk: 3 in your list.

 

The disk assignments are the important things for most purposes, but the other ways to identify disks can be useful.

 

So, you currently only have parity and Disk 1 assigned to the array.

 

If you assign a different disk as Disk 1, it REPLACES Disk 1. Replacing a disk usually is followed by rebuilding the disk from the parity calculation.

 

If you assign a disk to a new slot, such as Disk 2 in your case, then that is ADDING a disk. This is usually followed by clearing the disk (unless it has been precleared) so parity will remain valid. A cleared disk is all zeros, so it has no effect on parity.
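
For what it's worth, the "all zeros" point follows directly from how single parity works: parity is effectively a byte-wise XOR across the data disks, and anything XORed with zero is unchanged. A tiny sketch with a made-up byte value:

data=0xA5                        # a byte from an existing data disk
parity=$(( data ^ 0x00 ))        # XOR with the matching byte of a cleared (all-zero) disk
printf 'before: %#x  after: %#x\n' "$data" "$parity"    # identical, so parity stays valid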

 

So, REPLACING and ADDING are very different.

 


All that is sort of a digression from your actual problem though.

 

1 hour ago, jgopel said:

The 2 Seagate drives were the original data (/dev/sde - 001G) and original parity/new data (/dev/sdc - 000J). The new parity drive is the Western Digital (/dev/sdb - KFGX). 001G was replaced in the array by KFGX (though I haven't gone to the effort of PHYSICALLY removing it from the system yet). It is fine for you to think of 001G as having been replaced by KFGX from the perspective of the array.

This does clear things up, at least from the perspective of ADDING vs REPLACING array disks.

 

3 hours ago, jgopel said:

did a Parity Swap Procedure. The copy succeeded as far as I can tell.

I'm not sure I see any "parity copy" in your syslog. Did you reboot after the parity copy?

 

When your syslog starts, these are your array assignments:

Jun 22 08:52:29 Mnemosyne kernel: mdcmd (1): import 0 sdc 64 13672382412 0 ST14000NM000J-2TX103_ZR701WCM
Jun 22 08:52:29 Mnemosyne kernel: md: import disk0: (sdc) ST14000NM000J-2TX103_ZR701WCM size: 13672382412 
Jun 22 08:52:29 Mnemosyne kernel: mdcmd (2): import 1 sde 64 13672382412 0 ST14000NM001G-2KJ103_ZL2N0TCM
Jun 22 08:52:29 Mnemosyne kernel: md: import disk1: (sde) ST14000NM001G-2KJ103_ZL2N0TCM size: 13672382412 

and Disk 1 is already unmountable:

Jun 22 08:52:32 Mnemosyne emhttpd: mounting /mnt/disk1
Jun 22 08:52:32 Mnemosyne emhttpd: shcmd (29): mkdir -p /mnt/disk1
Jun 22 08:52:32 Mnemosyne emhttpd: shcmd (30): mount -t xfs -o noatime,nouuid /dev/md1p1 /mnt/disk1
Jun 22 08:52:32 Mnemosyne kernel: SGI XFS with ACLs, security attributes, no debug enabled
Jun 22 08:52:32 Mnemosyne kernel: XFS (md1p1): Mounting V5 Filesystem
Jun 22 08:52:32 Mnemosyne kernel: XFS (md1p1): Corruption warning: Metadata has LSN (2:346049) ahead of current LSN (2:345452). Please unmount and run xfs_repair (>= v4.3) to resolve.
Jun 22 08:52:32 Mnemosyne kernel: XFS (md1p1): log mount/recovery failed: error -22
Jun 22 08:52:32 Mnemosyne kernel: XFS (md1p1): log mount failed
Jun 22 08:52:32 Mnemosyne root: mount: /mnt/disk1: wrong fs type, bad option, bad superblock on /dev/md1p1, missing codepage or helper program, or other error.
Jun 22 08:52:32 Mnemosyne root:        dmesg(1) may have more information after failed mount system call.
Jun 22 08:52:32 Mnemosyne emhttpd: shcmd (30): exit status: 32
Jun 22 08:52:32 Mnemosyne emhttpd: /mnt/disk1 mount error: Unsupported or no file system
Jun 22 08:52:32 Mnemosyne emhttpd: shcmd (31): rmdir /mnt/disk1

 

Later in the syslog I see disk assignments changing, eventually winding up where you say, but not in the way I would expect for a parity swap. There doesn't seem to be enough time elapsed in the syslog for a "parity copy", but I don't see any other way those disk assignment changes could have been made without New Config.

 

Maybe I missed something and someone else will take a look.

 

3 hours ago, jgopel said:

old data drive is now completely idle, so I decided to see if I could diagnose it any further

You shouldn't need to resort to the command line to mount an unassigned device. Most users have the Unassigned Devices plugin installed, but I see you don't. You might try that and see if you can mount the original Disk 1.

 

If it is unmountable, maybe we can try to repair the filesystem on the original or the currently emulated Disk 1.
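
If it comes to that, the non-destructive check would be something like this (array started in Maintenance mode; /dev/md1p1 is the emulated Disk 1 device seen in the syslog above, and the webUI's filesystem check runs essentially the same command):

# report-only: -n lists problems without writing anything
xfs_repair -n /dev/md1p1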

 

I'll wait until tomorrow and see if anyone else has any ideas.

 

 

12 hours ago, trurl said:

I'm not sure I see any "parity copy" in your syslog. Did you reboot after the parity copy?

 

I don't recall whether or not I rebooted. I did do a parity copy and it did take quite a while (about 48 hours). Looking at my uptime, I feel like I must have rebooted after that, though I can't recall why I would have done so.

 

12 hours ago, trurl said:

You shouldn't need to resort to the command line to mount an unassigned device. Most users have Unassigned Devices plugin installed, but I see you don't. You might try that and see if you can mount the original disk1.

 

I'm just very comfortable on the command line. At the point that the GUI wasn't giving me much information, I just started treating it like I would any of the other headless systems that I deal with. I'll get that plugin installed and see if there's any extra info.


Restarted the array in maintenance mode and ran the xfs_repair filesystem check with the -n option on the new data drive (000J). Here are the results:

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 7
        - agno = 2
        - agno = 4
        - agno = 3
        - agno = 5
        - agno = 6
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 0
        - agno = 11
        - agno = 12
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (2:346032) is ahead of log (2:345452).
Would format log to cycle 5.
No modify flag set, skipping filesystem flush and exiting.

Been short on time to work on this, so I haven't been making much progress, but I did try a normal xfs_repair (i.e., no -n) on the old parity drive via the webUI. That resulted in the following output:

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
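
The error output points at -L as the fallback, so the next step would look like this (same emulated device as before; -L throws away the XFS log, which is why it is only recommended after a mount attempt has already failed):

# last resort: zero the log, then repair
xfs_repair -L /dev/md1p1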

 


Ran xfs_repair with the -L option, as the error message suggested, and got the following output:

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 9
        - agno = 3
        - agno = 1
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 10
        - agno = 2
        - agno = 11
        - agno = 12
clearing reflink flag on inodes when possible
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (2:346049) is ahead of log (1:2).
Format log to cycle 5.
done

 

Seems like the system is back to normal. Thanks for the help!
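
For anyone hitting the same thing later, a quick sanity check after the repair might look like this (a sketch; paths as used earlier in this thread):

# with the array started normally, the disk should mount and show its contents
df -h /mnt/disk1
ls /mnt/disk1/lost+found 2>/dev/null    # anything xfs_repair could not reconnect ends up here

# optionally re-run the read-only check (array in Maintenance mode) and expect a clean report
xfs_repair -n /dev/md1p1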

