[Solved] Multiple unmountable disks with "wrong encryption key" (but it's correct?)


augot


I've been banging my head against a wall for the last two days with this and I'm finally out of ideas... If anyone can figure out what's happening here, I'm all ears.

 

Last week I decided to do a few upgrades to my server. New motherboard (ASRock X570 Taichi) with a third full 16x slot, and an Adaptec 71605 to go in it as I expand my array from 8 drives to 11 courtesy of a few spares I had lying around, with a plan to eventually get up to 16 drives. At the same time, I'd move the 2x2TB m.2 SSD pool used for my Windows VM over to become part of the main cache pool, and replace them with 2x2TB m.2 NVMe drives. All dockers/VMs/scripts/etc disabled until everything's done, ofc, and I'd wait to start changing drives around until after I'd checked everything from the old setup was still working. So just to be clear:

 

Old setup: 8 HDDs via an LSI card, cache made up of 3 SATA SSDs, Windows VM pool of 2 m.2 SATA SSDs

New setup: 11 HDDs, cache of 5 SSDs (mix of SATA and m.2), Windows VM pool of 2 m.2 NVMe drives

CPU, RAM, PSU, etc, are all kept from the previous rig.

 

Everything looked good after making the hardware switchovers (even though only UEFI boot seems to work with this new mobo), so I started the process of moving my Cache and Windows pool drives around. Started by removing one of the two Windows disks and starting a new BTRFS balance. Unfortunately it got to 71% and the server started to freeze up, eventually necessitating a hard reset.

 

Ever since, I've been unable to start the array normally. It works fine in safe mode - entering the disk encryption password on the main tab, hitting "Start", the array + two pools start up. But if I try after a normal boot, maintenance mode or otherwise, most of the drives - usually disks 6-11 of the array (but sometimes 7-11, or 5-11), plus all the disks in both pools - refuse to mount because of an "incorrect encryption key". Now, I know for a fact the password is fine (I wouldn't be able to start the array in safe mode, after all), but I cannot figure out what on earth is happening. It seems like the system decrypts each disk one by one, but then after a few disks something is happening to make it think the encryption key is suddenly incorrect for all subsequent disks.

 


 

At first I thought it was due to something screwy around the pool disks that I was switching around - even in safe mode the same encryption key error was happening when I tried to include either of them in the cache pool, so I bit the bullet and formatted them clean, accepting I'd be spending some time redownloading from my offsite backups to get back to normal. But even though that resolved the issue in safe mode, it still left the problem in place for a normal boot. Replugging individual drives and changing cables didn't do anything, nor did reseating the Adaptec card - not surprising, considering the pool disks are connected directly to the mobo's SATA ports yet are also affected by this. (And, again, I can start the array fine in safe mode, so it must be something else - like a plugin, perhaps?)

 

I've attached my diagnostics, taken after a new boot and attempt to start the array. Any ideas much appreciated!!

tower-diagnostics-20210927-2251.zip


...I thought I'd narrowed it down to Unbalance after systematically deleting plugins one by one until something changed, but this now feels like something more fundamentally wrong than just the one plugin:

 

1) Reflashed my USB stick with my backup with all my old plugins, deleted the Unbalance plugin alone, started the array in maintenance mode, all OK

2) Re-downloaded Unbalance from community apps, seemingly installed OK, GUI accessible (although disk info not available because of maintenance mode ofc)

3) Disabled Unbalance server in its settings, then stopped the array and restarted in normal mode

4) Instantly get an alert that one of my cache drives is now missing, and the Unraid GUI otherwise becomes unresponsive

5) After a hard reset, deleted Unbalance again and tried booting into maintenance mode - no dice, back to disks not being decrypted

 

I'm going to try deleting all plugins entirely off my flash drive - installed and removed history - to really test this, but I'm wondering if this is perhaps a hardware or BIOS issue, or something else beyond plugins which changes between safe and normal boot modes. A situation where Unraid is expecting or requesting some kind of data too slowly (or too quickly) and it's causing disk read issues across the system as it kind of "times out". Would explain the pattern of disks decrypting in sequence for only the first few and then stopping, in the same pattern, over and over again. Hmm.

 

EDIT: Deleting the entire plugins folder has worked - I can start the array normally again. But I did notice something new. There were *four* cache pools when I booted. "Cache" and "Windows", as per usual, but also "._cache" and "._windows", both without any drives assigned and only one slot each. Where on earth did they come from? I deleted them entirely before booting and now it seems to be functioning normally, but this seems very odd to me... I'm guessing my efforts to swap disks between the two pools has caused it somehow, and now there are multiple plugins which think that there are two of each pool and it's causing everything to gum up?

39 minutes ago, augot said:

EDIT: Deleting the entire plugins folder has worked - I can start the array normally again. But I did notice something new. There were *four* cache pools when I booted. "Cache" and "Windows", as per usual, but also "._cache" and "._windows", both without any drives assigned and only one slot each. Where on earth did they come from? I deleted them entirely before booting and now it seems to be functioning normally, but this seems very odd to me... I'm guessing my efforts to swap disks between the two pools has caused it somehow, and now there are multiple plugins which think that there are two of each pool and it's causing everything to gum up?

Are you using a Mac? It has a tendency to create ._xxx type files unless you disable this for network drives, and it is possible the unexpected presence of one has triggered some strange behaviour. If so, it sounds like something that should be fixed at the Unraid level to ignore such files.

3 minutes ago, itimpi said:

Are you using a Mac? It has a tendency to create ._xxx type files unless you disable this for network drives, and it is possible the unexpected presence of one has triggered some strange behaviour. If so, it sounds like something that should be fixed at the Unraid level to ignore such files.

I am - my Macbook is my only computer atm, until I get my Windows VM back up and running, and I deleted the plugins folder from the flash drive using Finder. Strange!

 

I'm doing clean installs of all my old plugins and setting them up manually instead of copying any old files across, so fingers crossed however many of them were having issues won't from now on...
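For anyone else tidying a flash drive from a Mac, here's a minimal sketch of spotting and stripping the AppleDouble files Finder leaves behind. The paths are demo stand-ins, not real Unraid ones, and the link to the phantom "._cache"/"._windows" pools is my plausible guess, not confirmed:

```shell
# Demo only: /tmp/flashdemo stands in for the Unraid flash drive (/boot).
# macOS Finder writes "._<name>" AppleDouble files and ".DS_Store" alongside
# anything it touches on a non-Mac volume; plausibly a "._cache.cfg" next to a
# real pool config is where the phantom "._cache" pool came from.
mkdir -p /tmp/flashdemo/config/pools
touch /tmp/flashdemo/config/pools/cache.cfg /tmp/flashdemo/config/pools/._cache.cfg
touch /tmp/flashdemo/config/.DS_Store

# List the metadata files, then delete only them:
find /tmp/flashdemo \( -name '._*' -o -name '.DS_Store' \) -print -delete

ls /tmp/flashdemo/config/pools    # the real cache.cfg is untouched
```

Deleting from a shell this way avoids Finder re-creating the metadata as it goes.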


Damn - I thought this was fixed, but unfortunately it's still happening. Without any plugins installed it doesn't happen every time, but it still happens maybe two times out of three as I've been trying to re-establish my various configurations and settings, and stress-testing general system stability by starting/stopping the array. If I launch the array without any plugins installed directly after booting, it's generally fine - but the longer the system is up, the less likely it is that restarting the array will be possible without a full system restart. Same if plugins are reinstalled between array launches.

 

I discovered that the Cache pool's BTRFS filesystem was corrupted with uncorrectable errors, so (with everything already backed up) I kept it simple by wiping the whole pool and starting fresh, like I'd already done with the Windows pool. I thought that maybe it was some kind of BTRFS-related process that was the culprit, since a balance would start on the Cache pool (whenever I could start the array properly), but the system would still become gradually unstable, correlating with the progress of the balance (which never finished). And I figured maybe something about that filesystem corruption had been the cause of the decryption errors - it was corrupted in such a way that the decrypting process itself was being destabilised somehow. (I have no idea if this mental model makes sense with how Unraid actually works, but this gelled with what I've been seeing - however, no filesystem errors have been found on the array, just the pool, so it still doesn't really make sense that a BTRFS problem would cause an XFS array to not decrypt too, I imagine...)
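As an aside on separating device-level trouble from filesystem-level trouble: btrfs keeps per-device error counters that persist across remounts, which can hint whether corruption is coming from the drives/controller/RAM rather than the filesystem itself. The `btrfs device stats` command is real; the device names and counter values below are made-up sample output so the parsing can be shown self-contained:

```shell
# On a live pool you would run:
#   btrfs device stats /mnt/cache        # show per-device error counters
#   btrfs device stats -z /mnt/cache     # reset them after fixing the cause
# Non-zero write_io_errs/corruption_errs implicate the device or its path
# (cable, controller, RAM) rather than btrfs's own logic.
# Simulated here against made-up sample output:
sample='[/dev/mapper/sdb1].write_io_errs    0
[/dev/mapper/sdb1].corruption_errs  12
[/dev/mapper/sdc1].write_io_errs    0
[/dev/mapper/sdc1].corruption_errs  0'

# Print only the counters that are non-zero:
echo "$sample" | awk 'NF == 2 && $2 + 0 > 0 { print $1, $2 }'
```

With flaky RAM in the write path, counters like these can climb even though the disks themselves are healthy.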

 

I reduced each pool down to one drive, then added drives back to Cache one at a time - waiting for each balance to complete, stopping the array, and adding the next - until it was back to full strength (5 drives). Then I added the second drive back to the Windows pool. Sometimes drives weren't added properly - they were added to the pool without being formatted, or without being encrypted - so I'd have to stop the array and try again until Unraid actually recognised they needed to be formatted/encrypted correctly AND the array was entirely decrypted in general. Eventually I got there, but clearly something is going very wrong with how my Unraid is handling filesystems and drive interactions. And this is with both my existing copy and a completely new install on the flash drive (with the old config folder copied over minus the plugins), just in case it was some kind of corruption in the system files that was the culprit.

 

 


I'd prefer to keep it, but I suppose I could switch to encrypting individual folders/files on an as-needed basis rather than full-disk encryption, if this really is unavoidable. How time-consuming would it be to change for the array, though? It's nearly 50TB of data atm, which would probably take around a month to download again from my offsite backup, and I really want to avoid that if possible.


Update to this, in case anyone else finds themselves in the same situation: I'm thinking of exchanging this new mobo, because whatever's happening now feels like it must be hardware related. I've removed encryption from both pools but I'm still getting uncorrectable BTRFS errors appearing on them whenever I start loading them back up with data; meanwhile, I've found (from wiping a few more of the array drives as a test) that having more than the first four disks in the array encrypted causes the decryption errors to inevitably recur (sporadically with 5-6 encrypted array drives, always with more than that). And, in general, the system just feels... deeply unstable. I've never had so many complete system lock-ups, so frequently, from doing things that my previous setup wouldn't even blink at.

 

It's a shame, because I've seen plenty of people say that the X570 Taichi has served them well as an Unraid board - but as it's the only substantially different change to my system in terms of hardware, logically it seems more prudent to just get something else and see if it still happens, rather than spend more days chasing ghosts in software.

5 hours ago, JonathanM said:

What is your memory clocked at compared to the maximum supported by your CPU for the given number of sticks? The limiting factor is not normally the rating on the RAM, and the system will be very unstable if you exceed the capabilities of your CPU.

On first diags the RAM was at 2133 for a 5950X, way below the worst case for this generation of CPU.


Yes - I thought it might be RAM-related too at first, and one of the very first troubleshooting steps I took was to run memtest and also tune down the memory speed. On my previous board (Asus Prime X570-P) I’d been running these same sticks at full speed (3600) with the same CPU, which I know is faster than Unraid officially supports - but it hadn’t caused any issues, so I stuck with it.

 

I’m aware that Ryzen can in general be temperamental when it comes to Unraid, and I’m wondering if this is one of those unlucky cases where the hardware combination just doesn’t work. Too much I/O for a delicate storage controller, that kind of thing. I’m gonna try flashing the mobo BIOS back a revision or two (it’s been on the most recent version, 4.60, released August 2021, for all this) to see if maybe that helps, but it feels like clutching at straws tbh.


Update on this: I think I'm narrowing in on what's happened, and it's something to do with crypts and corrupted LUKS headers.

 

I've now pretty much fully Ship of Theseus'd my server, and the behaviour continues to persist. I've now tried 3 different motherboards - another X570 Taichi, and a Gigabyte X570 Aorus Pro (and flashed through multiple BIOS revisions on each). I've hooked my new Adaptec card up, and gone back to my old LSI card too. I've swapped drive cables in and out, changed the locations of devices in the PCIe slots, tried it without the NVMe drives, with just the SSDs, with just the HDDs, and with no HDDs at all. I've even tried a completely fresh install of Unraid, on a completely new USB stick with a trial license. Nothing changes. Encrypted disks - BTRFS or XFS - cannot be opened at all now, and I cannot create new ones. So the culprit must be in the few things which have persisted through all these changes: namely, the CPU, the RAM, or the disks themselves. I think it's the latter.

 

If I take a completely fresh install of Unraid - and then install only Unassigned Devices (and Plus) - I can delete any existing disk partition fine. I can also then seemingly format any empty disk fine and create a new encrypted XFS or BTRFS filesystem. But if I try to mount any of those newly-formatted disks, this happens:

 

Quote

Oct 1 17:56:40 Tower unassigned.devices: Adding disk '/dev/mapper/Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P'...
Oct 1 17:56:40 Tower unassigned.devices: Using Unraid api to open the 'crypto_LUKS' device.
Oct 1 17:56:42 Tower unassigned.devices: Mount drive command: /sbin/mount -o rw,noatime,nodiratime,discard '/dev/mapper/Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P' '/mnt/disks/Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P'
Oct 1 17:56:42 Tower kernel: squashfs: Unknown parameter 'discard'
Oct 1 17:56:42 Tower unassigned.devices: Mount of '/dev/mapper/Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P' failed: 'mount: /mnt/disks/Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P: special device /dev/mapper/Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P does not exist. '
Oct 1 17:56:42 Tower unassigned.devices: Partition 'Samsung_SSD_870_EVO_2TB_S6PNNJ0R404700P' cannot be mounted.

 

From researching LUKS, my suspicion is that Unraid is essentially failing to correctly transmit the encryption passphrase to any of my disks - so of course, when it tries to use that same passphrase to mount the disk immediately after formatting, the two passphrases don't actually match. My only real guess for why this is happening is that the LUKS headers on all the drives have been corrupted... somehow? But it must be happening sometime during formatting and partitioning with the crypt process, and since this is even happening with completely fresh installs of Unraid on multiple USB sticks, it can't be down to system file corruption. It must be because of something being preserved on each disk between wipes.
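For reference, cryptsetup can separate "the header is damaged" from "the passphrase doesn't match". The two commands below are real, but `/dev/sdX1` is a placeholder; the executable part only simulates the header magic against a scratch file, since no real encrypted device is assumed:

```shell
# On a real system:
#   cryptsetup luksDump /dev/sdX1                 # dumps the header; errors out if the header is corrupt
#   cryptsetup open --test-passphrase /dev/sdX1   # checks the passphrase only, maps nothing
# A LUKS header begins with the magic bytes "LUKS" 0xBA 0xBE at offset 0, so
# even a glance at the first bytes shows whether a header exists at all.
# Simulated here against a scratch file standing in for /dev/sdX1:
printf 'LUKS\272\276' > /tmp/fake_header.bin      # \272\276 is octal for 0xBA 0xBE
head -c 4 /tmp/fake_header.bin; echo              # prints: LUKS
```

If `luksDump` succeeds but `--test-passphrase` fails, the header is intact and the stored key simply doesn't match what's being typed - which is what the theory above would predict.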

 

I'm aware I could just not use encryption and everything works fine, but - besides still preferring to use encryption if possible - the broader principle here is that I *really* want to figure out what the hell has happened here, for both my own sanity and in case this is a bug that affects anyone else who might find this thread in future.

 

 

1 hour ago, JorgeB said:

This part here sounds more like a UD plugin issue, do you still have problems without the UD plugin installed?

 

Unfortunately yes. The only times I was able to boot normally and decrypt all drives was when all plugins were uninstalled, not just UD - but even then only sometimes, and even then the system would inevitably become unstable over time until it locked up completely.

 

However! I think I've had a breakthrough. I booted up into a live Ubuntu instance off a USB stick after my last comment and had a look through all my disks via Disk Utility there. I could unlock a few of them using the correct passphrase I'd set in Unraid, but not most of them (which goes to my theory that something was causing the passphrase to be corrupted when being set). Disk Utility was able to successfully format all of them with encryption (XFS for HDDs, BTRFS for SSDs), and mount them afterwards too.

 

I've just booted back into Unraid again, and now everything's mounting properly with encryption - and that's with all plugins still installed, for the first time since this whole saga began. I'm going to leave it overnight to see if the system remains stable this time, but things already feel back to their snappy and responsive best. I'm cautiously optimistic.

  • 2 weeks later...

Wanted to bump this just to give a final update after managing to solve the issue, for anyone who might experience something similar in future and google this. The problem WAS the RAM in the end.

 

After my last post, the system unfortunately locked up again after a couple of days and the unmountable disks issue returned. I decided to start again from the beginning, in case one of my original troubleshooting steps had missed something obvious - and since I'd run memtest as one of the very first things, I repeated that again (even though the first time around it had run repeatedly without finding any errors). But this time... so many errors that it halted testing within less than a minute of starting. *Three* of my four sticks of RAM, it turned out, were so badly broken that it's a miracle I was even able to boot. Why didn't anything show up the first time? I haven't the foggiest. But I guess it goes to show the importance of retracing your steps if you end up stuck down a dead end.

 

That memory kit was only a year old, so it's currently wending its way back to Taiwan to be RMA'd - in the meantime I've borrowed a couple of spare sticks from a friend, and my system is back to its stable self. I did have one further strange issue - one of my array drives died while rebuilding parity, with a corrupt XFS filesystem on the emulated disk that just *would not* repair. Hitting the check button in the GUI gave me an error: "update.php: missing csrf_token". Troubleshooting steps I found elsewhere on the forum (safe mode, closing browser windows/tabs, etc.) didn't fix it, I couldn't fix it via the command line because it was an encrypted disk, and even a fresh Unraid install on another flash drive (copying only the disk config over from my main flash drive and nothing else, so parity would be carried over) wouldn't work. Very strange error. The only way to fix it was to new config the array, start rebuilding parity again from scratch, and restore that one disk's contents (again) from backups. Assuming no more drives fail, I'm home and dry. Phew.

  • ChatNoir changed the title to [Solved] Multiple unmountable disks with "wrong encryption key" (but it's correct?)
  • 1 year later...

I've been getting this issue - random drives that say incorrect key when starting the array. I can start in safe mode without any issues. I've been removing plugins, and it seems like I can start the array when I remove My Servers version 2023.01.23.1223. I'm on Unraid 6.11.5.

 

There is no error in the log either, so my guess is that this is a timeout issue of some sort.

 

 


