
Crash during rebuild, changed one disk from 4TB to 8TB


Solved by rookie701010


Hi there,

 

This seems to be a GUI-related trap (bug???). I added a 4TB drive to my array; the parity disk is 18TB. The drive was zeroed, I formatted it with XFS, and everything was okay. There was no user data on it. Then I decided to upgrade my SSD cache array (which went fine) and to replace the 4TB disk with an 8TB one (plus an additional fan for better airflow). The system restarted, I replaced the disk in the array, and the rebuild started. No pre-clearing, no formatting beforehand. The information on Unraid's Main page said 8TB free on this disk, everything fine. The rebuild crashed reliably, twice: the whole system went unresponsive, with no screen output, and it was not reachable over the network.

The way out of this was to erase the disk (with the array stopped) and then start the array. The disk was then formatted, and afterwards the rebuild started.

This seems like a handling issue: a rebuild onto a replacement disk of a different size should only start after formatting.
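For reference, the erase step described above can be sketched from the command line. This is only a sketch: the real replacement disk would be something like /dev/sdX (a placeholder, double-check the device name before running anything destructive), and the dry run below defaults to a scratch image file instead of a real disk.

```shell
# Sketch of zeroing the start of a replacement disk so Unraid treats it as
# unformatted. DISK defaults to a scratch image file for a safe dry run;
# point it at the real device (e.g. /dev/sdX) only when you are sure.
DISK=${DISK:-/tmp/fake_disk.img}

# Zero the first 16 MiB, which wipes the partition table and any
# filesystem signatures at the front of the device.
dd if=/dev/zero of="$DISK" bs=1M count=16 conv=notrunc status=none

echo "zeroed first 16 MiB of $DISK"
```

With the array stopped, a disk cleared like this shows up as unformatted, and starting the array then offers to format it before anything else happens.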

 

I'm running Unraid version 6.11.3.

 

with best regards

 

rookie701010

16 minutes ago, itimpi said:

Rebuild overwrites all sectors so formatting a disk would be pointless.

Then it shouldn't crash, and it should handle it as a 4TB disk (not optimal, but okay). There was no data on the original, yet the rebuild crashed... which raises some interesting questions. Rebuilding onto an erased and formatted disk is effectively just rebuilding parity, with *exactly the same visuals* as restoring a disk. Or am I missing something? The I/O stats show that the replacement disk is being written to, which would imply restoring the data and the 4TB file system. This looks inconsistent.

 

 

2022-12-04 21_57_26-unraid_Main – Mozilla Firefox.png

Edited by rookie701010
4 hours ago, rookie701010 said:

Then it shouldn't crash, and it should handle it as a 4TB disk (not optimal, but okay). There was no data on the original, yet the rebuild crashed... which raises some interesting questions. Rebuilding onto an erased and formatted disk is effectively just rebuilding parity, with *exactly the same visuals* as restoring a disk. Or am I missing something? The I/O stats show that the replacement disk is being written to, which would imply restoring the data and the 4TB file system. This looks inconsistent.

 

 

2022-12-04 21_57_26-unraid_Main – Mozilla Firefox.png

The rebuild should start by restoring the original 4TB file system to the new disk; if that completes successfully, Unraid will then try to mount the drive and expand the file system to fill the whole disk.


By "horribly wrong" I mean completely unresponsive: no network, no console. So a hard reset is the only way to re-awaken the box. Maybe I can set up some forensics, like a `dmesg -W` in an SSH terminal on another server, and hope that something shows up. However, right now parity is rebuilding and the new disk is being precleared. I would need to duplicate this on another setup.
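The forensics idea above could look roughly like the following: stream the kernel log from the crashing box to another machine over SSH, so the last messages survive the hang. The hostname and log path are hypothetical; `dmesg --follow` (the long form of `-w`; `-W`/`--follow-new` needs a newer util-linux) typically requires root. The real command is kept as a comment, and the runnable part only demonstrates the tee pipeline with a fake log line:

```shell
# Sketch: from another box, stream the crashing server's kernel log into a
# file so the last lines before the hang are preserved. HOST is hypothetical.
HOST=${HOST:-tower.local}
LOG=/tmp/unraid_dmesg.log

# Real command (run manually; needs SSH access and root for dmesg):
#   ssh root@"$HOST" dmesg --follow | tee "$LOG"

# Dry run: feed a fake kernel line through the same tee pipeline.
printf '%s\n' '[  123.456789] example: last message before the hang' | tee "$LOG"
```

Even if the box dies mid-write, everything flushed over the network up to that point remains on the observing machine.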

Edited by rookie701010

Okay, rsyslog is enabled and appears to be working. The parity rebuild again ended in a hard crash. Now Unraid is in "zombie" mode, with VMs running and a stale configuration, while the parity check is progressing. But now we have a log 👯‍♀️ The array shows as not started but still provides its services... anyway, let's see what the parity check will do.

2022-12-06 19_32_36-jglathe@rpifour01_ _var_log.png
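For anyone replicating this: forwarding the syslog to another machine is what makes post-crash logs possible at all. On the receiving side (a stock rsyslogd on another Linux box), a minimal configuration looks roughly like this; the port, IP, and file path are placeholders, not anything Unraid-specific:

```
# Receiving host, e.g. /etc/rsyslog.d/remote.conf (port and path are examples)
module(load="imudp")             # accept syslog over UDP
input(type="imudp" port="514")
*.*   /var/log/remote.log        # write everything (local + remote) here

# Sending side, if configured by hand rather than via the Unraid GUI:
# *.* @192.168.1.10:514          # single @ = UDP, @@ = TCP
```

The trade-off with UDP is that messages can be dropped, but for catching the last lines before a hard hang it is usually good enough.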

  • 2 weeks later...
  • Solution

Update to add: the crashes kept coming, and as is almost always the case, it turned out to be something hardware-related. The culprit here is the RAM (pretty sure); I just changed it to Kingston HyperX Fury/Renegade 3600, also 128GB. Since the change required some disassembly, I also upgraded the CPU to a Ryzen 9 3950X. What's not to like :) Since this box runs VMs and containers, more cores are a good thing.

Why I am so sure it's the RAM: I had similar issues with this kit in completely different hardware, after 18 months, so there appears to be a degradation issue. I changed everything (!) else in the box, same behaviour. Changed the RAM, stable... although the MSI boards seem to have ageing effects, too.

I will close this issue now; it is at least linked to the hardware problem. There was no useful info in the rsyslog, by the way: the last entry was some hourly cron job, and then the box went completely unresponsive.

