3rd drive failed during rebuild of 2 others


Solved by trurl


After upgrading to a new machine and moving drives over I had 1 drive fail, followed by another within a few hours. 

I put in 2 new drives and began the rebuild process and in the middle of that, a 3rd drive failed. 

 

At this point I figured it was an issue with drives not being fully seated. I had to stop the rebuild since it hung once the 3rd drive failed.

 

I shut the machine down and pushed all drives in tight to make sure they were fully seated. I also put the original 2 back in.

 

When I booted back up, the 2 new disks that were being rebuilt showed as emulated, but when I tried to browse the emulated disks, all data was gone.

 

I'm guessing the parity is shot at this point. I was able to browse the 3rd failed disk that caused the rebuild failure and it is fine. I also mounted the original 2 failed disks and they still have all the data.

 

The good news is all data is intact on the respective drives; I just don't know where to go from here.

 

Is it possible to add the original 2 drives back without having them try to rebuild, and then simply rebuild parity?  Or what steps should I take?

 

 

  • Solution
27 minutes ago, meat said:

Is it possible to add the original 2 drives back without having them try to rebuild, and then simply rebuild parity?

Yes, New Config. But if there were any writes to the emulated disks, those writes would be lost if you don't rebuild the data. Sounds like that might be your best solution anyway, at least rebuilding parity wouldn't write to any disks except parity.

 

In any case, probably better to try to identify the problem before trying to fix it.

 

Attach Diagnostics to your NEXT post in this thread.


I will have to read up more on the "New Config" option. I thought that might be an option, but after reading the warnings on it I'm a bit unsure how to do it without losing all data on all disks. As for losing anything written to the emulated disks, it should be extremely minimal and unimportant. I had all Dockers and VMs off for nearly the entire time; there may have been a bit of data written to the shares from my NVR, but that's no big deal.

 

A second option I considered, though I'm not sure it would work, would be to remove the drives and shrink the array, then mount the 2 drives and copy the data over to the array. But it sounds like the "New Config" option would be smoother... if it works.

unraid-diagnostics-20240130-1734.zip

2 hours ago, meat said:

A second option I considered but not sure if it would work would be to remove the drives and shrink the array

Both options are possible, and both would be accomplished by New Config. Do you want to shrink the array?

 

New Config doesn't do anything except allow you to assign disks however you want, and optionally (by default) rebuild parity. Only parity would be written, nothing would be changed about your data disks unless you accidentally assign one to either parity slot.

 

Possibly you are only having connection problems, but not clear from syslog since disks 3 and 8 were probably already disabled when you rebooted.

 

According to entries in syslog, it looks like these disks were also unmountable, and then you formatted them. That would explain why the data is gone. But apparently it was the emulated disks you formatted, since you say the physical disks have data. So a rebuild can't recover anything now anyway.

 

Please don't format any disk that has your data, including emulated disks. If a disk is unmountable, format will make it mountable again, but it will also make it empty. The correct solution for an unmountable disk that has data on it is to check the filesystem.
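A minimal sketch of that filesystem check for the XFS case (not from this thread; `md1` is a placeholder slot number, and on Unraid the array must be started in maintenance mode first):

```shell
# Sketch only: check an unmountable XFS array disk instead of formatting it.
# With the array started in maintenance mode, the disk in slot N appears
# as /dev/mdN; "md1" below is a placeholder for your disk slot.
DISK=/dev/md1
if [ -e "$DISK" ] && command -v xfs_repair >/dev/null 2>&1; then
    xfs_repair -n "$DISK"   # -n = no-modify mode: report problems, change nothing
else
    echo "sketch only: $DISK or xfs_repair not available on this system"
fi
# Re-running without -n performs the actual repair. A format would instead
# create a fresh, empty filesystem and discard the emulated data.
```

The point is the `-n` dry run: review what it reports before letting it change anything, and never reach for format.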

 

You have a very large number of disks, so I won't tediously examine SMART report for each of them.

 

Start the array with disks 3 and 8 unassigned.

 

Then post new diagnostics so I can make sure your other disks are still mountable and have data.

 

Also, tell me which (if any) disks have SMART (👎) warnings on the Dashboard page.


Ok, I went ahead and started the New Config using the original drives in slots 3 and 8. I'll let this run overnight and see how it goes. At first glance it looked like the data was on both drives and that this should work. I'll report back after the parity rebuilds. Thanks for your help.

12 hours ago, trurl said:

Also, tell me which (if any) disks have SMART (👎) warnings on the Dashboard page.

Disk 5 is the only one with a SMART error on the dashboard.

The rebuild ran most of the night and hung up at 5 AM when both Disk 3 (ST2000LM015-2E8174_ZDZBD5RC - 2 TB (sdd)) and Disk 9 (ST2000LM015-2E8174_ZDZB5F6A - 2 TB (sdb)) started getting read errors at the same time. Over 60,000,000 read errors each.

 

Not sure what the issue is, whether it's the backplane, the controller, or something else hardware-related. I doubt it's the drives. I'll attach the diagnostics from this morning.

 

If I need to, I can move all drives back to my old server and see if it rebuilds there. If everything works, I could then move all data to the disks housed in my disk shelf, since the only issues I've had are with the drives directly attached to the new server. I'll wait to do anything until I get feedback from here, though.

 

thanks

unraid-diagnostics-20240131-0805.zip

23 minutes ago, meat said:

Disk 5 is only one with a SMART error on the dashboard.

3 reallocated sectors is fine. You can acknowledge it by clicking on the warning, and it will warn again if the count increases.

 

23 minutes ago, meat said:

Disk 3 (ST2000LM015-2E8174_ZDZBD5RC - 2 TB (sdd)) and Disk 9 (ST2000LM015-2E8174_ZDZB5F6A - 2 TB (sdb)) started getting read errors at the same time.

Both of these disks disconnected and reconnected as different devices. Perhaps they are appearing in your Unassigned Devices now.

I took the trouble to look at SMART for each of your WD disks, since the default monitored attributes for those may not be enough, but they look fine also. You should add attributes 1 and 200 to the monitored attributes for each of those by clicking on the disk to get to its settings.
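For reference, a hedged command-line sketch of reading those two attributes with smartmontools (`/dev/sdX` is a placeholder device; attribute 200 is Multi_Zone_Error_Rate on WD drives):

```shell
# Sketch only: print SMART attributes 1 (Raw_Read_Error_Rate) and
# 200 (Multi_Zone_Error_Rate on WD disks). /dev/sdX is a placeholder.
DEV=/dev/sdX
if command -v smartctl >/dev/null 2>&1 && [ -e "$DEV" ]; then
    smartctl -A "$DEV" | awk '$1 == "1" || $1 == "200"'
else
    echo "sketch only: smartctl or $DEV not available on this system"
fi
```

The GUI route (clicking the disk and adding the attributes to its monitored list) is the supported way in Unraid; this just shows what the raw values look like.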

 

One way to clean things up would be to reduce the number of disks. You have a very large number of small disks, some very small, and probably many are old. Newer, larger disks will perform better. More disks require more hardware and more power, and each additional disk is an additional point of failure.

32 minutes ago, trurl said:

Both of these disks disconnected and reconnected as different devices. Perhaps they are appearing in your Unassigned Devices now.

They still show in the array for now, but the rebuild really isn't doing anything even though it says it's running. I'll likely have to reboot, since I don't know if I can get the rebuild to stop.

 

I acknowledged the reallocated-sector warning; I didn't know you could actually do that. And I will add 1 and 200 to all drives.

 

I do have several drives and plan to consolidate them eventually. I have a 24-drive disk shelf (3.5") and 16 drive bays on the new server (2.5"), plus 2 NVMe. The old server only had 8 drive bays (2.5").

 

Is the HPE Smart Array P816i-a SR Gen10 controller supported? The old server was a Dell and it worked fine.

 

 

Edited by meat
2 minutes ago, trurl said:

Only to the WD drives. That would give false alarms on others.

Gotcha, thanks. And I just looked: those 2 disks are in Unassigned Devices now. I think I'll just power down and move everything back to the old server so I can at least get everything online and parity rebuilt (hopefully). Then I can safely start moving data off the 2.5" disks.

2 hours ago, trurl said:

For example, my 32TB server only has one parity and 4 data disks, and each of those disks (8TB) would be considered small to medium sized today.

How long does it take to do a parity check with 8TB or larger drives like that? My largest drives are 4TB and a check takes over 24 hours. Would it take twice as long with 8TB drives, or a week if I were to use 16s?


The usual estimate is 2-3 hours per TB of parity, assuming no controller bottlenecks. My monthly parity check runs from 12AM-6AM each night and finishes on the third night, taking about 16 hours total.

 

3 hours ago, trurl said:

Newer larger disks will perform better.

This is mostly because of data density. Of course, the hardware and electronics were improved to allow that much more data in the same space.

56 minutes ago, meat said:

How long does it take to do a parity check with 8TB or larger drives like that? My largest drives are 4TB and a check takes over 24 hours. Would it take twice as long with 8TB drives, or a week if I were to use 16s?

The parity check time is almost completely determined by the size of the parity drive. If it takes 24 hours for 4TB on your setup, expect it to take twice that for an 8TB parity drive. However, your speeds seem a bit slow, so maybe your disk controller is limiting them.

 

If your checks are taking that long, do you have the Parity Check Tuning plugin installed? It lets you offload the check into increments run at idle times (albeit at the expense of an extended elapsed time).
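That scaling works out to simple arithmetic (the 2-3 hours per TB figure is just the rule of thumb from this thread, not a guarantee):

```shell
# Estimate parity-check duration from parity-drive size, using the
# rough 2-3 hours per TB rule of thumb (assumes no controller bottleneck).
parity_tb=8                  # hypothetical 8TB parity drive
low=$((parity_tb * 2))       # optimistic end of the estimate
high=$((parity_tb * 3))      # pessimistic end of the estimate
echo "${parity_tb}TB parity: roughly ${low}-${high} hours per check"
```

By the same rule, a 16TB parity drive would land around 32-48 hours, not a week, assuming the controller keeps up.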

43 minutes ago, trurl said:

This is mostly because of data density. Of course, the hardware and electronics were improved to allow that much more data in the same space.

What do you recommend, CMR or SMR?

 

12 minutes ago, itimpi said:

The parity check time is almost completely determined by the size of the parity drive. If it takes 24 hours for 4TB on your setup, expect it to take twice that for an 8TB parity drive. However, your speeds seem a bit slow, so maybe your disk controller is limiting them.

 

If your checks are taking that long, do you have the Parity Check Tuning plugin installed? It lets you offload the check into increments run at idle times (albeit at the expense of an extended elapsed time).

Thanks, I kinda figured. I may very well have a bottleneck on my disk controller. I've never run any speed tests or even really looked into it. I am using a NetApp disk shelf for the majority of my drives and always wondered how those speeds compare, or if it's any different. After my rebuild is done, I'll look into the tuning plugin you mentioned.

1 hour ago, meat said:

What do you recommend, CMR or SMR?

 

Depends on how you use it. You should definitely use CMR for parity since it gets rewritten much more than other disks. If data disks are mostly WORM (write once read many) SMR should be fine.


Thank you both for your help. I moved things back to the Dell and am currently rebuilding parity; fingers crossed it completes. Then I will start cleaning things up. I suspect the issues are with the HP; I'll have to dig into that more. Maybe I'll get an evaluation version of Unraid and just run some tests, but if I get all the data moved to the drives on my NetApp device, I can use it exclusively for disks and still use the horsepower of the HP.

