
Crash during rebuild, changed one disk from 4TB to 8TB


Solved by rookie701010


Hi there,

 

This seems to be a GUI-related trap (bug???). I added a 4TB drive to my array; the parity disk is 18TB. The drive was zeroed, I formatted it with XFS, and everything was okay. There was no user data on it. Then I decided to upgrade my SSD cache array (which went fine) and to replace the 4TB disk with an 8TB one (plus an additional fan for better airflow). The system restarted, I replaced the disk in the array, and the rebuild started. No pre-clearing, no formatting beforehand. The information on Unraid's Main page said 8TB free on this disk, everything fine. The rebuild crashed reliably, twice: the whole system went unresponsive, with no screen output, and it was not reachable over the network.

The way out of this was to erase the disk (with the array stopped) and then start the array. The disk was then formatted, and afterwards the rebuild started.

This seems like a handling issue: a rebuild onto a replacement disk of a different size should only start after formatting.
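For reference, the erase step described above can be sketched from the command line. This is only a sketch: the real replacement disk would be something like /dev/sdX (a placeholder, double-check the device name before running anything destructive), and the dry run below defaults to a scratch image file instead of a real disk.

```shell
# Sketch of zeroing the start of a replacement disk so Unraid treats it as
# unformatted. DISK defaults to a scratch image file for a safe dry run;
# point it at the real device (e.g. /dev/sdX) only when you are sure.
DISK=${DISK:-/tmp/fake_disk.img}

# Zero the first 16 MiB, which wipes the partition table and any
# filesystem signatures at the front of the device.
dd if=/dev/zero of="$DISK" bs=1M count=16 conv=notrunc status=none

echo "zeroed first 16 MiB of $DISK"
```

With the array stopped, a disk cleared like this shows up as unformatted, and starting the array then offers to format it before anything else happens.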

 

I'm running Unraid version 6.11.3.

 

with best regards

 

rookie701010

16 minutes ago, itimpi said:

Rebuild overwrites all sectors so formatting a disk would be pointless.

Then it shouldn't crash, and it should handle it as a 4TB disk (not optimal, but okay). There was no data on the original, yet the rebuild crashed... which raises some interesting questions. Rebuilding onto an erased and formatted disk is effectively just rebuilding parity, with *exactly the same visuals* as restoring a disk. Or am I missing something? The I/O stats show that the replacement disk is being written to, which would imply restoring the data and the 4TB file system. This looks inconsistent.

 

 

2022-12-04 21_57_26-unraid_Main – Mozilla Firefox.png

Edited by rookie701010
4 hours ago, rookie701010 said:

Then it shouldn't crash, and it should handle it as a 4TB disk (not optimal, but okay). There was no data on the original, yet the rebuild crashed... which raises some interesting questions. Rebuilding onto an erased and formatted disk is effectively just rebuilding parity, with *exactly the same visuals* as restoring a disk. Or am I missing something? The I/O stats show that the replacement disk is being written to, which would imply restoring the data and the 4TB file system. This looks inconsistent.

 

 

2022-12-04 21_57_26-unraid_Main – Mozilla Firefox.png

The rebuild should start by restoring the original 4TB file system to the new disk; if that completes successfully, Unraid will then try to mount the drive and expand the file system to fill the whole disk.


By "horribly wrong" I mean completely unresponsive: no network, no console. So a hard reset is the only way to re-awaken the box. Maybe I can set up some forensics, like a `dmesg -W` in an SSH terminal on another server, and hope that something shows up. However, right now parity is rebuilding and the new disk is being precleared. I would need to duplicate this on another setup.
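The forensics idea above could look roughly like the following: stream the kernel log from the crashing box to another machine over SSH, so the last messages survive the hang. The hostname and log path are hypothetical; `dmesg --follow` (the long form of `-w`; `-W`/`--follow-new` needs a newer util-linux) typically requires root. The real command is kept as a comment, and the runnable part only demonstrates the tee pipeline with a fake log line:

```shell
# Sketch: from another box, stream the crashing server's kernel log into a
# file so the last lines before the hang are preserved. HOST is hypothetical.
HOST=${HOST:-tower.local}
LOG=/tmp/unraid_dmesg.log

# Real command (run manually; needs SSH access and root for dmesg):
#   ssh root@"$HOST" dmesg --follow | tee "$LOG"

# Dry run: feed a fake kernel line through the same tee pipeline.
printf '%s\n' '[  123.456789] example: last message before the hang' | tee "$LOG"
```

Even if the box dies mid-write, everything flushed over the network up to that point remains on the observing machine.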

Edited by rookie701010

Okay, rsyslog is enabled and appears to be working. The parity rebuild again ended in a hard crash. Now Unraid is in "zombie" mode, with VMs running and a stale configuration, while the parity check is progressing. But now we have a log 👯‍♀️ The array shows as not started but still provides its services... anyway, let's see what the parity check will do.

2022-12-06 19_32_36-jglathe@rpifour01_ _var_log.png
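For anyone replicating this: forwarding the syslog to another machine is what makes post-crash logs possible at all. On the receiving side (a stock rsyslogd on another Linux box), a minimal configuration looks roughly like this; the port, IP, and file path are placeholders, not anything Unraid-specific:

```
# Receiving host, e.g. /etc/rsyslog.d/remote.conf (port and path are examples)
module(load="imudp")             # accept syslog over UDP
input(type="imudp" port="514")
*.*   /var/log/remote.log        # write everything (local + remote) here

# Sending side, if configured by hand rather than via the Unraid GUI:
# *.* @192.168.1.10:514          # single @ = UDP, @@ = TCP
```

The trade-off with UDP is that messages can be dropped, but for catching the last lines before a hard hang it is usually good enough.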

  • 2 weeks later...
  • Solution

Update to add: the crashes kept coming, and as is almost always the case, it turned out to be something hardware-related. The culprit here is the RAM (pretty sure); I just changed it to Kingston HyperX Fury/Renegade 3600, also 128GB. Since the change required some disassembly, I also upgraded the CPU to a Ryzen 9 3950X. What's not to like :) Since this box runs VMs and containers, more cores are a good thing.

Why I am so sure it's the RAM: I had similar issues with this kit in completely different hardware, after 18 months, so there appears to be a degradation issue. I changed everything (!) else in the box, same behaviour. Changed the RAM, stable... although the MSI boards seem to have ageing effects, too.

I will close this issue now; it is at least linked to the hardware problem. There was no useful info in the rsyslog, by the way: the last entry was some hourly cron job, and then the box went completely unresponsive.

