
Lost ALL array drives instantly, unmountable. (6.12.8)


Solved by JorgeB

Recommended Posts

Seems I'm having a bad weekend.

I decided to overhaul my tower-build server into a rackmount Supermicro 36-slot SAS chassis so I can add more disks.

On the first startup, all my array drives got the status Unmountable: Unsupported partition layout.

All drives were XFS before this happened.

 

image.thumb.png.355df30724588afc5e2133405041fb6a.png

 

I have no clue why this happened. I didn't change the mainboard or controller; it was a 1:1 transfer of all hardware into the new case, plus some added disks (not formatted yet).

 

The only difference is that before, I used three mini-SAS to 4x SATA breakout cables to connect my hard drives; now I'm using a single SAS cable to the chassis expander. That shouldn't make much of a difference, should it?

The disks have no SMART errors and are recognised by Unraid.

 

In maintenance mode, I clicked on a random disk and ran an XFS repair with the default -n option.

This is the output:

Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
would write modified primary superblock
Primary superblock would have been modified.
Cannot proceed further in no_modify mode.
Exiting now.
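For reference, the check the GUI runs should be equivalent to a read-only xfs_repair from the console; a minimal sketch, where the md device number is just a placeholder for the disk I picked:

xfs_repair -n /dev/md2p1   # -n = no-modify, report problems only; md2p1 is an assumed example device, not necessarily my disk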

 

 

Any clues what happened, and more importantly, can this be fixed without losing the data?

I've read something about running xfs_repair -V on each disk, but before doing that, I wanted to consult this forum first ;)

 

In my diag file you will also see 12 Seagate disks; those are new and unrelated to the array. I already tried booting without those new disks, and it made no difference.

 

 

 

parodius-diagnostics-20240315-2035.zip

Link to comment

Maybe the expander remapped the partitions?

 

@JorgeB may have a better option, but I seem to remember the solution was to unassign one drive at a time: start the array, stop the array, reassign the drive, and let it rebuild. Several hours for each drive.

 

Whatever you do, don't remove more than one drive at a time. To test my solution, unassign one drive, start the array, and see if the emulated drive mounts properly.

Link to comment
Posted (edited)
3 hours ago, JonathanM said:

Maybe the expander remapped the partitions?

 

@JorgeB may have a better option, but I seem to remember the solution was to unassign one drive at a time: start the array, stop the array, reassign the drive, and let it rebuild. Several hours for each drive.

 

Whatever you do, don't remove more than one drive at a time. To test my solution, unassign one drive, start the array, and see if the emulated drive mounts properly.

 

Yeah, I also think it has something to do with the expander, because that's the only component that really changed.

I have swapped disks between direct-to-controller and expanders in the past and never had such an issue, although that was on HW RAID. To my knowledge, expanders are just SAS switches; they don't do anything with the data on the drives.

Unless this was a problem that started before the hardware changes and only showed up on the next boot, coincidentally after I moved to the new chassis.

 

I tried to unassign one disk, but it won't let me start the array due to the missing disk. Or do I need to remove the drive physically?

image.png.8f3280578b6e5dcd5f74cf084b08bc02.png

 

 

Edited by Deler7
Link to comment

OK.

With the array stopped, I mark one disk as 'no device':

image.thumb.png.74d747e500ba54d1503e067195581d88.png

...

...

image.thumb.png.9e5e763d4e4bb38a6399e6abc68fd48d.png

 

This way, I can't start the array, as it's missing a disk.

Note: disk 7 is unassigned, but that's from replacing a defective drive a few weeks (and several reboots) ago.

Note 2: the Seagate 4TB drives are the new, unrelated disks mentioned earlier.

 

When I start the array with all the correct disks assigned:

image.thumb.png.3a3feda74191f67e18d7205659310377.png

...

...

image.thumb.png.61c302e59828371d9fa8f67848cfd9d5.png

 

It wants me to format the WDC drives. Of course I'm not doing that, as they contain data I would love to get back.

 

Link to comment
Posted (edited)

Ah, when I checked the box, START remained grey. I closed all browser tabs, cleared the cache, logged back into the server, and now it turns orange, so I can proceed.

Removed the disk, checked the box, started the array.

 

image.thumb.png.e4ab364aa4affaeec9c5ff3c7d474d82.png

 

Stopped the array and reassigned the disk I had just unassigned.

image.thumb.png.86215928c2a84aaf9fc3a69845ac307b.png

 

And started the array.

 

image.thumb.png.6c52bc7ee48681313132268eb2d39f53.png

 

I assume the waiting game starts now? ;)

image.png.41d3819eb543d6a443ef10ead798493e.png

image.png.d7dabfdd641a525b7d394c42dc618e65.png

 

 

 

 

Edited by Deler7
Link to comment

After a night of spinning, the rebuild is complete.

image.thumb.png.567dd9230e904e9b5981068619884803.png

 

13 minutes ago, JorgeB said:

This has happened before with those Adaptec RAID controllers; they can apparently sometimes overwrite the MBR of the disks. It's not good news that the emulated disk didn't mount, though. If you didn't reboot since doing that, post the diagnostics.

 

I haven't rebooted since my previous steps. My new diagnostics are below 👍

 

 

parodius-diagnostics-20240316-1120.zip

Link to comment

Alright.

 

root@PARODIUS:~# blkid
/dev/sda1: LABEL_FATBOOT="UNRAID" LABEL="UNRAID" UUID="2736-60C3" BLOCK_SIZE="512" TYPE="vfat"
/dev/loop1: TYPE="squashfs"
/dev/mapper/nvme3n1p1: LABEL="nvme-two" UUID="4128737249177850753" UUID_SUB="5585851513646370683" BLOCK_SIZE="4096" TYPE="zfs_member"
/dev/nvme0n1p1: LABEL="cache" UUID="17440156396158875726" UUID_SUB="14585744415475683753" BLOCK_SIZE="4096" TYPE="zfs_member"
/dev/nvme3n1p1: UUID="ad1f4b74-88bc-409b-8586-e81baf646027" TYPE="crypto_LUKS"
/dev/nvme2n1p1: UUID="b8d4d0c9-ec99-4505-8cce-c9a19a817ba1" TYPE="crypto_LUKS"
/dev/loop2: UUID="139601bd-7ef3-471e-9dc5-5e5e4f78d045" BLOCK_SIZE="512" TYPE="xfs"
/dev/loop0: TYPE="squashfs"
/dev/mapper/nvme2n1p1: LABEL="nvme-one" UUID="18166102060409839496" UUID_SUB="9952805448310778193" BLOCK_SIZE="4096" TYPE="zfs_member"
/dev/nvme1n1p1: LABEL="cache" UUID="17440156396158875726" UUID_SUB="926196455089919428" BLOCK_SIZE="4096" TYPE="zfs_member"
/dev/sdt1: PARTUUID="30cecb74-aef3-47de-99a5-1b6d6033178a"

 

and

 

From DISK2

root@PARODIUS:~# gdisk /dev/sdy
GPT fdisk (gdisk) version 1.0.9.1

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning: Invalid CRC on main header data; loaded backup partition table.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.

Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!
Main header: ERROR
Backup header: OK
Main partition table: ERROR
Backup partition table: OK

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
 1 - Use current GPT
 2 - Create blank GPT

Your answer: ^C

 

I did the same for DISK3 through DISK12; they all have results identical to the above.
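If anyone wants to check all disks in one go, a read-only loop like this should do it (the device names are placeholders for my array disks, not the real ones):

for d in /dev/sdy /dev/sdz /dev/sdaa; do   # placeholder device names
    echo "=== $d ==="
    gdisk -l "$d"                          # -l = list the partition table and exit, nothing is written
done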

 

In case it's of any help, this is the output for the rebuilt DISK1 (see below):

root@PARODIUS:~# gdisk /dev/sdt
GPT fdisk (gdisk) version 1.0.9.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): ^C

 

 

 

 

Link to comment
  • Solution

The gdisk output confirms the partitions got clobbered, likely by the RAID controller. You can try to fix one with gdisk and see if it mounts afterwards; to play it safer, I would recommend cloning the disk first with dd and doing it on the clone.

Link to comment
Posted (edited)

A small update from my side.

I made a dd copy of one disk to one of my newly installed HDDs before potentially screwing things up (more) ;)
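For anyone wondering, the clone itself was just a plain dd, roughly like this, with sdX/sdY as placeholders for the damaged array disk and the spare backup disk:

dd if=/dev/sdX of=/dev/sdY bs=1M status=progress   # sdX = damaged array disk, sdY = spare disk (placeholders); triple-check the device names before running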

Not being sure how to use the gdisk tool, I just followed a guide and ran gdisk on the drive that had just been backed up. I went for the "r" option (recovery and transformation options), then chose "b" (use backup GPT header, rebuilding main), and when finished, write and exit.
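In case it helps someone, the sequence looked roughly like this (sdX is a placeholder for the affected disk, and the prompts are abbreviated):

gdisk /dev/sdX                  # placeholder device; try it on the clone first
# main menu:      r   -> recovery and transformation options
# recovery menu:  b   -> use backup GPT header (rebuilding main)
# then:           w   -> write table to disk and exit, confirm with y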

Then I rebooted the server, but still no luck: same error message.

Then, in maintenance mode in the Unraid UI, I selected the same disk and ran xfs_repair without any arguments. I rebooted the server again, did a regular ARRAY START, and there it was: this particular disk, with all its files accessible again!
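From the console that should be roughly equivalent to the following, again with the md device number as a placeholder and the array started in maintenance mode:

xfs_repair /dev/md2p1   # no -n this time, so repairs are actually written; placeholder device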

There were a few (really just a few) files in the lost+found folder, but that's OK for me.

So I guess I'm on the right track(?)

 

At the moment I'm dd'ing every disk to the newly installed hard drives, to at least have a backup. When that's finished, should I follow the same procedure on each disk?

 

Edited by Deler7
Link to comment
1 hour ago, JorgeB said:

If it worked for one it should work for the other ones, except maybe disk1, since that one was rebuilt.

 

 

It worked, even for DISK1 👍

image.thumb.png.36f62c315fcefd045b84b7b41dbed2eb.png

 

Thank you all very much for the support!!!

 

  • Like 1
Link to comment
20 hours ago, JorgeB said:

If you can, I would recommend replacing that controller, since it's a common issue with them.

 

Oh yes, I immediately ordered a cheap LSI 9211-8i controller and installed it in the server yesterday. As I expected some complications, I kept my backups, but the change from Adaptec to LSI went flawlessly. It was basically plug & play: all drives were recognised and the array started without errors.

To make sure everything is OK, I started a parity check.

image.png.e174d393ab6f3ecb7634eee8eede7da4.png

Perfect!

All my issues are resolved now.

 

However, this is perhaps good information for others who read this thread in the future with the same issue: I have found the source of my original problem!

I can now reproduce the issue.

 

It was not the swap from the SAS-to-SATA breakout cables to the SAS expander that caused this.

It was a change of controller mode.

 

The Adaptec 7 series has 4 different controller modes you can pick for operation:

- Auto.

- RAID hide RAW. Acts as a pure HW RAID controller; unassigned disks are not exposed to the OS.

- RAID expose RAW (default). Same as above, but unassigned disks that are not part of a HW RAID are exposed to the OS. This is the factory-default setting, i.e. on new cards.

- HBA. No RAID volumes; all disks are exposed raw to the OS, comparable to IT mode. This is the mode you should run with Unraid.

 

Here is the thing: although you might expect "RAID expose RAW" to be the same as "HBA" when there are no RAID volumes, IT ISN'T. Switching between those two modes somehow messes up your partition tables. I did some tests on a discardable array: create the array while the controller is in HBA mode, then switch the controller to RAID expose RAW, and you get exactly the issue from my first post in this topic (drives are unmountable). Even when switching back to HBA mode, the damage is already done and the disk partitions need to be fixed.

 

Well, how could this happen to me, when I wasn't touching the controller settings while migrating to the new enclosure?

My mainboard's UEFI BIOS. 🤬

As some are aware, when you install the 7 series (and perhaps newer) Adaptec cards in a mainboard with UEFI, you no longer get the legacy CTRL+A menu option at boot; instead, you enter the controller BIOS within your mainboard's UEFI BIOS, under the Add-On Devices sub-menu.

 

For some unknown reason, the mainboard also remembers which slot the controller is plugged into and keeps its settings per slot. Meaning: when you change the PCIe slot of the controller, at least on my Gigabyte Z590 mainboard, it resets the controller to its original default setting, "RAID expose RAW".

 

And indeed, while migrating to the enclosure, I did a one-time boot of the system with the controller in a different PCIe slot, as I wanted to test a GPU in its main slot.

I didn't check the array at that time; I just did a shutdown and moved the controller back to its original slot, but the 'damage' was already done. Hence my issue above.

 

This only seems to happen on my Gigabyte Z590 UEFI-based mainboard; my older dual-Xeon legacy-BIOS mainboard does not behave like this. On the legacy board I can move the controller to any PCIe slot I like and it does not change its mode, but when I install the controller in my Gigabyte Z590, it stores the settings per PCIe slot.

 

My lesson learned: do NOT change the PCIe slot of Adaptec cards in UEFI-based mainboards without verifying the settings before the first boot. Or buy an LSI HBA in IT mode and never look back ;)

  • Like 1
Link to comment
