
Power Issue - Drive Recovery & Array Rebuild



Hi All,

 

A few questions - back story below if it's of any use, but it's mainly a rant so it can be skipped:

 

1. Say you have an array with 2 failed drives - the parity drive and a data disk. If you take the parity drive and the disk out of the array and fix the disk with an XFS repair from the CLI, can you add that disk back into the array before adding and rebuilding the parity drive - the key being without losing the disk's data? Or does a new disk in an array have to be a blank disk? So basically, if your filesystem is XFS, can you simply add the drive to an array and resync parity while keeping all the files on the disk intact?
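
By 'XFS Repair CLI' I mean roughly the following, run against the disk while it's out of the array (the device name is just an example for illustration):

xfs_repair -n /dev/sdX1   # dry run: report problems without changing anything
xfs_repair /dev/sdX1      # then the actual repair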

 

2. Do you always have to go through a rebuild when you get a red X? Even if the drive is fine and the issue is intermittent power drops or loose cables while writing.

 

3. With UD, can you write to an NTFS drive, or even an XFS drive, using Krusader or a Windows VM without issue?

 

4. If a drive is showing a green ball but with write errors, and the parity drive has also failed, is that drive lost? As in, on the next reboot it will show the red cross of death and I've lost the data? Or, at best, can I recover the data but need to add a drive to the array and copy everything over again, growing older and older in the process...

 

Many Thanks

Back Story

 

I'm new to Unraid, less than 2 weeks in, and have never posted in forums - other than tinkering with a few Pis I've always been a Windows fanboy. I've been running a 'server' for nearly a decade for Plex, and have recently got into Docker on WSL and into Linux in general. I also had a Synology NAS, which recently failed on both of its RAID 1s after I gave it a clean and didn't insert the PCIe SATA adapter properly.

 

I decided to take the plunge and go for Unraid. I tore down my gaming PC to temporarily build into, and set up 6 drives (2x 16TB and 4x 18TB Exos X series) with 2x 2TB WD Blue NVMes as cache drives. I assigned an 18TB as parity, leaving 5 drives for data, and everything was going great. I even said to myself, this is awesome - why didn't I do this years ago instead of dealing with Windows updates, crashes, Windows 11 pop-ups, poor performance, etc.?

 

So, as the jinx gods laughed upon me, I took down my Windows 10 'server' and gave it a big clean. I ripped out the 9 drives: 4x 2TB WD Reds, a 6TB Red, an 8TB Seagate green and 2 Exos 16s. I then added 9 Exos 18/16TB drives - 6 in the array built earlier and 3 as unassigned devices to be added later. There were also 2x 1TB 2.5" QVOs as UDs, also to be added later.

 

Then on to the boot - firstly, nothing was showing on the screen, so I did the usual debugging of ripping things out until I got a boot, then adding one thing back at a time until it didn't. It turned out to be the GPU (Gigabyte GTX 1070 Aero); after a BIOS tweak to use the internal GPU only, I was into Unraid. All drives showed as they did on the other computer, and the GPU was showing (Nvidia plugin). Great!

 

I then went to add the GPU to the Plex docker and noticed the container I had didn't have a GPU option. I pulled down another container and stupidly pointed its config folder at that of the other Plex container. It froze, so I thought, oops, I'd best shut down, which I did by clicking the shutdown button, and it worked. But on reboot I got a red X on drive 1. I panicked and got onto Google looking for how to recover. I was then trying to access data from it to make a copy when another drive showed missing, and then it started going nuts, so I quickly shut down.

 

I thought for a second, and my best guess from struggling to boot was that the 550W PSU wasn't enough for the 10900K, 4 channels of 16GB 3200 sticks, 9x 140mm Noctua fans, 2 NVMes, 2x 2.5" SSDs, a GTX 1070, 9x 18/16TB Exos HDDs, a 2.5Gbps PCIe NIC, a 2-port SAS HBA feeding 8 SATA, and a 4-way DVB TV tuner. So I threw in my gaming computer's 850W and booted up. There was still a red X on disk 1, so I cleared one of the 18TBs I was planning on using as parity 2 and, after removing disk 1, added the other drive to the array to rebuild. It failed after a few minutes. So I removed that and added the original one again. That failed too after a few minutes. I turned off and thought maybe the HBA had too many drives, so I hooked the new 18TB to the board's SATA. After 28 painful and depressing hours it rebuilt and all data was available. All PSUs are Seasonic Focus Platinums, by the way - you should never cheap out on a PSU!

 

I enabled Plex quickly last night, as family and friends were missing it with it being down all weekend - the container was the old version; the new one wasn't there. I tried it out: all working well and much faster than VMware, even without the GPU added yet.

 

Got up this morning, all still working well. During my panicked 28 hours I had a look at what was on disk 1 and could see it was a few docker containers and the 'system' share. I wanted these on the NVMe that I was going to back up with a script. Every time I hit the Move button they wouldn't move, though - no Google answers on this other than disabling Docker and VMs, which still didn't work. It turned out that setting a share to 'Only' on cache after it has already saved data to a disk will not move that data back to the SSD. Same with data on the cache for a share I had set to disk only. When I changed to 'Prefer', added the cache to the disk shares, and hit Move... waaaayyyy - it all started moving to where it should be. A bit before hitting Move, I had started a preclear on the previous disk 1 drive.

 

My heart sank again as I noticed after about 5 minutes that the read/write numbers were no longer moving. I found the log button at the top of the screen and there were a lot of red errors writing to disk 0 (assumed parity). And then bang... another red X, on parity this time. I was thinking, ahhh, another 28 hours of waiting now, but at least it's working - and then I noticed the pop-up saying 2 drives had failed (an orange one for the 2 drives [inc. parity], a red one for the parity). Disk 3 was showing 650-odd errors and the numbers were no longer changing. I've been trying to copy data from that drive to another UD drive, both using Krusader and a Windows VM I set up. Nothing is letting me copy the whole folders, or even sub-folders. I was going to try rsync tonight when I get home, as trying to do this on my phone is just frustrating (I have WireGuard set up), especially after a week of 4-5 hours' sleep thanks to this little project.
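
The rsync I have in mind is roughly this - the source and destination mount points are just examples for my setup:

rsync -av --progress /mnt/disk3/ /mnt/disks/ud_backup/   # copy disk 3's contents to the UD drive, preserving permissions and timestamps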

 

So yeah - Unraid seemed awesome, but at the moment I'm glad I'm still on the trial, as my poor little ticker can't cope with this constant data-loss risk. Yes, I should have backed up twice, but I wanted to get this up and running, and most of it was done remotely through my phone, so it was easier to set preclear going on the drive I had just transferred the data from than to lose half a day of the server doing nothing.

 

I have to admit I'm close to just installing Windows again and going back to the devil I know, but the power of Unraid compels me!

 

Thanks

Link to comment

Wish you had asked for help while your "backstory" was going on.

 

Maybe obvious, but in general, whether rebuilding, repairing a filesystem, or whatever, you need your hardware to be working well. If you are having hardware problems, you need to take care of those before attempting a rebuild or repair.

 

If you don't know what your problem is, or aren't completely sure you know what to do, please ask for help. And save us from having to ask for it by posting your diagnostics.

 

1.

 

Not enough details to give a good answer. If you only have single parity, you should only have one disabled disk. Unless a data disk is unmountable, there's probably no reason to repair its filesystem.

 

Whether it would be better to rebuild parity or rebuild the data disk depends. Something needs to be rebuilt, since the array is out of sync.

 

A disk is disabled when a write to it fails. After it is disabled, the physical disk isn't used again. Any reads from the disk are emulated from the parity calculation by reading all other disks. Any writes to the disk, including that initial failed write, are emulated by updating parity. If you don't rebuild the data disk, data that was written to the emulated disk would be lost. And there is some possibility that the initial failed write was filesystem metadata, which might result in corruption.
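
If it helps to see why emulation works: single parity is an XOR across the data disks, so any one missing disk can be recomputed from parity plus all the others. A toy sketch with three single-byte 'disks' (the values are made up):

d1=$((0xA5)); d2=$((0x3C)); d3=$((0x0F))
parity=$(( d1 ^ d2 ^ d3 ))     # what the parity disk stores for this position
echo $(( parity ^ d1 ^ d3 ))   # with disk 2 missing, this prints 60 (0x3C), its original byte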

 

2. See my answer in 1.

 

3. Yes. For Krusader, whether or not UD drives are accessible would depend on the path mappings you have.

 

4. If a write to a disk fails, it is disabled and won't be green. If the data is recovered (by rebuilding?), it's not clear what you mean by adding a drive and copying.

 

And, in general, moving or copying to other disks in the array is not the best way to back up anything from an array disk that is disabled, since all disks are involved in emulating that disabled disk, and your array already has no redundancy unless you have more parity disks than disabled disks. Best to back that data up somewhere off the array.

 

Of course, you must always have another copy of anything important and irreplaceable. Parity is not a substitute for backups.

Link to comment
39 minutes ago, DangerPete said:

the red cross of death

Seldom fatal. That is the whole point of parity.

 

And, we have helped people recover from much worse.

 

Sometimes a user will make a not-so-bad situation worse by trying to fix it without knowing exactly what to do.

 

A common mistake is to format a disk before rebuilding it. Maybe they think you always format before using a new disk, or maybe the drive was unmountable so they thought it needed formatting.

 

Format is NEVER part of rebuild.

Link to comment

I have one disk that is showing green at the moment, and I know the issue is too many drives on one SATA power connector (6), so I'm going to shut down when I get home, rewire (I have new cables for the HBA too), and bring it back up. Thanks very much for the advice on formatting - I had the mentality that once a drive has the red cross it's game over. Backups will be made as soon as I can get stable access to the data!

 

Thanks again - I may need to ask for help if disk 3 drops from the array, as I will have no parity (there's a red X on my one and only parity drive).

Link to comment

I think it's back up now - I rewired everything with new SATA cables, running 3 drives max per cable, and it seems to be up and running now, rebuilding parity to the drive. I'll be sure to include the diagnostics if it all goes wrong. Disk 3 is showing as normal; I assume it has errors logged somewhere, probably linked to parity going bad due to too much power being drawn from one cable.

 

Is there any way of stress testing this now? I usually run Prime95 on Windows.

 

I had a lot of issues after rewiring, as the BIOS decided to go back to PEG instead of the iGPU. It's an MSI mobo, so that's going in the bin and I'm getting an Asus.

 

Thanks again for your help

Link to comment
6 minutes ago, DangerPete said:

prime95

I think that is mostly about the CPU and related systems. The main thing I will say about that on your server is don't overclock.

 

2 minutes ago, DangerPete said:

stress testing

The drives and related hardware are the important thing, and rebuilding is a good test of the array.
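
If you want to exercise each drive directly, a SMART extended self-test reads the whole surface. A sketch - the device name is just an example:

smartctl -t long /dev/sdb   # start the extended self-test (runs in the background, can take many hours)
smartctl -a /dev/sdb        # check progress and results later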

 

4 minutes ago, DangerPete said:

Disk 3 is showing as normal, I assume it has errors somewhere probably linked to parity going bad due to too much power being drawn from one cable.

Disk 3 problems are not due to parity problems. Possibly the parity and disk 3 problems are due to similar and/or related causes.

Link to comment

Hi trurl, thanks for all the info. Ah, oops - it's properly overclocked, to 5.2GHz on all 10 cores. I'll knock the overclock off once it's finished building parity. Is XMP OK, with the RAM running at 3200?

 

I did rebuild drive 1 and it seemed to run fine. It was only when I ran the mover manually that it showed the errors and parity failed - I'm not sure if there is some corruption on a drive. Is there any way of scanning for it, or is that just running a 'check filesystem status' on each drive and correcting errors? Doesn't that break parity though, as you are changing bits on a single drive, right?

 

I honestly think it was hooking up 6 of these power-hungry Exos drives to one SATA cable. It's seemed fine since changing to a max of 3 on each, and I've been pre-clearing 2 drives at 240MB/s and running parity at about 210MB/s while Plex and several dockers are live. I also installed two new SAS-to-SATA cables and moved some HDDs to the onboard SATA, so less traffic is running through the HBA, as pretty much every drive was hooked up to it before.

 

SMART shows healthy; I think the oldest drive has 4 months of power-on time with 1250 spin-ups. All drives were given a preclear session, including the new ones.

 

Once parity is finished I'm going to back everything up over the next few days and then try things out.

 

Thanks again - you've given me much more confidence with it, knowing that the red X isn't all doom.

 

Link to comment
7 hours ago, DangerPete said:

running a 'check filesystem status' on each drive and correcting errors? Does that break parity though as you are changing bites on a single drive right?

Depends on how you do it. Filesystem repairs on array data disks must use the md# device or you will invalidate parity. If you do it through the webUI it will use the correct command. Some people try to do it from the command line and get the command wrong.
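
For example - the disk number here is just an illustration, the array must be started in Maintenance mode, and on recent Unraid versions the device may be named md1p1 instead:

xfs_repair -v /dev/md1      # correct: writes go through the md device, so parity stays in sync
# xfs_repair -v /dev/sdb1   # wrong: repairs the raw partition behind parity's back and invalidates it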

 

 

Link to comment

Thanks Jonathan - funnily enough, I checked the XMP/CPU overclock, and both have been disabled since I moved to the server hardware. Originally I had no video output at all, so I reset the BIOS, which didn't fix the issue but did remove all my previous overclocks. I had needed to pull the power pins out of my GPU, boot into the BIOS, change to iGPU instead of PEG, shut down, and then plug the GPU back in, and that got me a good boot into Unraid.

 

Seems like I have the wrong hardware for all this anyway. I went for the 10900K because of the extra 2 cores over the 11900K, but 10th gen has fewer PCIe lanes, which has started to sting me too. It was built for Windows and VMs, and I've spent enough on this project already... once I've ordered the licence, that is (if I can get everything up and running with fewer problems than the Windows solution).

Link to comment
