AgentXXL

[SOLVED] Failed Data Drive After Adding 1 New Data Drive?


I've had a glitch with adding a new 10TB drive to unRAID. My unRAID was running well and the last parity check succeeded, so I assumed all was good. The new 10TB drive was pre-cleared and stress tested successfully while still in the USB enclosure. To prep for adding the drive to my system, I disabled Docker and Array Autostart in Settings. I then did a clean power-down.

 

While powered down, I shucked the new 10TB from its USB enclosure, and added it to a free bay in my case. I then powered up and it booted normally with no unusual messages that I saw during the boot. The array was stopped as expected, so I then assigned the new 10TB drive to a free slot in the unRAID webgui.

 

I then started the array and the first thing I did was enable the option to format the newly added disk. While the format was running, I noticed that one of my newer 8TB drives, which had been in the system for at least a month, had a red X beside it. Hovering over the drive with the mouse revealed that the drive was disabled and its contents were being emulated.

 

I then downloaded the SMART log (attached) and saw only a few reported UDMA CRC errors, which in my experience aren't a major concern as they just require a retransmission of the data to/from the drive, and that's handled automatically. I then decided to stop the array after the format finished on the new 10TB. I tried to start the array again, but the 8TB drive remained disabled.
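For reference, the counter in question is SMART attribute 199 (UDMA_CRC_Error_Count). A quick way to pull it out of smartctl output is sketched below; the sample line is a stand-in for what `smartctl -A /dev/sdX` would print, since the real command needs the physical drive attached.

```shell
# Sample attribute line mimicking `smartctl -A` output (stand-in only;
# on the real system the line would come from the drive itself).
SAMPLE='199 UDMA_CRC_Error_Count   0x003e   200   200   000    Old_age   Always       -       3'

# Attribute ID is the first field, the raw error count is the last.
CRC_COUNT=$(echo "$SAMPLE" | awk '$1 == 199 {print $NF}')
echo "UDMA CRC error count: $CRC_COUNT"
```

A rising 199 count generally points at the connection (cable, backplane, controller) rather than the platters themselves, which fits it being recoverable by retransmission.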

 

I then powered down, removed the suspect 8TB drive and attached it to my Ubuntu system, where the drive was seen normally, and a quick browse of the XFS filesystem showed that things appeared to be intact. Regardless, I had a new 8TB drive arrive in the mail this morning that I was planning to pre-clear, stress test, and keep ready in case of failure. That new 8TB drive is now pre-clearing.

 

My plan is to wait for the pre-clear to finish, then add the new drive to the array as a replacement for the potentially failed disk. In the meantime I've left the array in the STOPPED state (autostart still disabled). This means it's unavailable for the nightly backups of my other systems and of course for my Plex docker. That's fine - I can tolerate missing a nightly backup from each of my other systems, and Plex use isn't required either. If I really want to watch something, I can use my Netflix/Amazon Prime/other streaming options.

 

I assume once I add the new 8TB as a replacement, the next array start will trigger the rebuild. I have a couple of questions:

 

1. My suspected failed drive appears good, but is there anything I should do with it on my Ubuntu system to verify it further?

2. Since the new 10TB drive is already added, formatted and part of the array, would it be quicker to copy the data from the suspect drive on my Ubuntu system to the array? This would be over the network, but I suppose I could also add the disk as an unassigned volume and copy to the array.

3. If I don't do the copy as described in item 2, how long (estimated; system specs are in my signature) would a rebuild from parity take for the new 8TB? Once the 8TB drive has finished its pre-clear/stress test, I could add it as a new drive instead of a replacement, but this means I'd have to redo the configuration without the suspect disk and then let parity rebuild with the two new disks in place (1 x 10TB, 1 x 8TB).
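If the copy route in question 2 wins out, the safe pattern is copy-then-verify. A minimal sketch, using temporary directories as stand-ins for the suspect drive's mount point and an array share (the real paths would obviously differ):

```shell
# Stand-ins for the real mount points (assumptions for illustration):
# SRC would be the suspect drive mounted on Ubuntu or via Unassigned
# Devices, DST an unRAID user share.
SRC=$(mktemp -d)
DST=$(mktemp -d)
mkdir -p "$SRC/backups"
echo "nightly backup data" > "$SRC/backups/sample.txt"

# Copy everything, preserving attributes.
cp -a "$SRC"/. "$DST"/

# Verify by re-reading both sides; silent exit 0 means identical trees.
diff -r "$SRC" "$DST" && echo "copy verified"
```

Over the network, `rsync -a` for the copy followed by `rsync -ac --dry-run` gives the same copy-then-checksum-verify behaviour with restartability, which matters for a 6.2TB transfer.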

 

The parity rebuild would take about 24 hours based on the last runs, and then I'd have 18TB (raw) of empty storage. This is plenty for me to copy over the data (about 6.2TB) from the suspect 8TB drive. Once copied over, I can then redo the pre-clear/stress test on the suspect drive. If it passes, I can keep it as a spare for future failures.

 

I'm fairly confident that my parity is fine, but rather than try to rebuild and/or re-use the suspect disk in a new config, I'm still leaning towards the copy method vs a parity rebuild using the new 8TB drive. Thoughts? I didn't examine the attached SMART log too closely. I removed the references to the serial number, not that it's really useful to anyone. Any suggestions/recommendations appreciated!

 

Dale 

ST8000DM004-2CX188-20190710-1256.txt


OK, I see another post made by @BiGs and responded to by @jonathanm and @itimpi, related to my questions above:

 

 

Since I had powered down and then restarted after removing the suspect disk, my config now shows the suspect Disk 8 with a red X but the config field is now set to 'unassigned' instead of the name/model/sn of the missing disk. So inadvertently I've created a new config that is valid - I haven't tried starting the array yet though.

 

I'm still letting my new 8TB finish its pre-clear but now I'm wondering if it is still possible to assign the new 8TB to the Disk 8 slot when cleared and have parity rebuild it? Because Disk 8 still has a red X but nothing assigned, I'm assuming that any disk large enough assigned to that slot will still have the data rebuilt from parity?

 

Or am I now better off starting the array with the new config and then copying the apparently accessible data from the suspect disk over the network or via the Unassigned Devices plugin? Or will Disk 8 contents still be emulated and I'll need to use the unbalance plugin or Krusader/MC to copy the data from the emulated disk to free space in the array?

 

Thanks again for any assistance or suggestions!

 

Dale

 


new drives should mean new cables, right? Always buy new cables, they are cheap and not worth the pain in the ass they cause when they are screwed up.

you probably knocked a cable a little loose, I'd finish the preclear and add it back in.

48 minutes ago, Abzstrak said:

new drives should mean new cables, right? Always buy new cables, they are cheap and not worth the pain in the ass they cause when they are screwed up.

you probably knocked a cable a little loose, I'd finish the preclear and add it back in.

It's a Norcotek RPC-4220 enclosure... like this one: http://www.norcotek.com/product/rpc-4220/ . The new 10TB drive was just put into one of the empty drive trays and then inserted into a free bay on the front of the case.

 

So nope, no new cables - the hot-swap drive trays are accessible from the front. The cabling is one SFF-8087 mini-SAS to 4 SATA breakout for the motherboard SATA ports, and the remaining 4 'shelves' of 4 disks are cabled with SFF-8087 to SFF-8087 cables fed from the LSI 9201-16i HBA.

 

I did swap out one of the SFF-8087 to SFF-8087 cables for one shelf as drives on that shelf of 4 saw more UDMA CRC errors. The new 10TB was installed 2 shelves up and the 8TB that 'failed' is on a shelf with 2 other 8TB drives and has been error-free since the original build. 

 

I'm suspecting the 8TB drive is fine, but I'll let the pre-clear finish on the new 8TB before I decide how to proceed. That'll be sometime tomorrow.

 

Dale

 

Edited by AgentXXL


Well, the pre-clear on the new/replacement 8TB finished successfully, so I added it to the array in the slot (Disk #8) with the red X, formerly occupied by the suspect 8TB disk. However, it's just sitting there showing read errors. Was I supposed to format it with XFS after the pre-clear? That step isn't indicated in the 'Replace a Data Drive' procedure. Step 9 mentions needing to check the 'Yes I'm sure' checkbox when starting the array, but that wasn't offered.

 

Right now the array is started with the replacement in the same slot as the suspect drive, and the suspect (but apparently fine when mounted on my Ubuntu system) drive is completely removed.

 

The only option I see is to do a 'Check' which is apparently only a read-check of all disks. Is this necessary before it will start the rebuild? Right now it's showing that contents are still emulated. Suggestions on how to proceed?

 

Dale

 


Here's the diagnostics... I'm looking at syslog.txt and see that it's got a lot of info about the replacement disk but nothing that stands out so far. But I'm a newb at analyzing unRAID diagnostics. Any help appreciated!

animnas-diagnostics-20190711-1616.zip


The syslog shows that Unraid started the rebuild process on disk8, but then started getting lots of write errors. Not quite sure why, but that explains why it is not getting anywhere.


I'm suspecting I may have a controller issue... as mentioned above (in my 3rd post) I'm using a Norcotek enclosure with 20 hot-swap drive bays. The shelf that has 4 slots for 8TB drives shows the 1st 2 fine, but the 3rd slot is where the replacement (and the original suspect disk) are located. It's connected via a mini-SAS SFF-8087 cable to the LSI 9201-16i.

 

I'll try stopping the array and moving the disk to a slot on another shelf or to the slot I've reserved for a 2nd parity drive (when I can afford it). The slot reserved for the 2nd parity drive is connected to a motherboard SATA port so that will tell us if it's a controller/cable issue or still something else.


No change. Tried the slot reserved for 2nd parity drive that's connected to MB SATA, and also a port on another shelf that's connected via a different SFF-8087 cable to the LSI 9201-16i.

 

I'm thinking that I may need to use the unbalance plugin or Krusader to copy the emulated contents to elsewhere on the array. I have enough free space to allow this since adding the new 10TB which started this whole adventure.

 

Or what about using dd on my Ubuntu system to mirror the original suspect drive to the new replacement and then re-inserting it? I also still haven't tried a 'Check', even though that says it's a read-only check of all drives.

 

One final option might be to first use dd to mirror the contents of the suspect 8TB to the replacement 8TB. Then do a new configuration and let parity rebuild. Thoughts?
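For the dd route, a whole-device mirror is just a raw block copy. Sketched here on scratch files rather than the real disks (on the actual hardware the `if=`/`of=` paths would be the two drives' `/dev` nodes, and getting them backwards is destructive, so the stand-ins are deliberate):

```shell
# Stand-in "devices": two scratch files instead of /dev/sdX and /dev/sdY.
SRC_DEV=$(mktemp)
DST_DEV=$(mktemp)
head -c 1048576 /dev/urandom > "$SRC_DEV"   # 1 MiB of test data

# Raw sector-for-sector copy. conv=noerror keeps going past read
# errors; sync pads short reads so offsets stay aligned.
dd if="$SRC_DEV" of="$DST_DEV" bs=64k conv=noerror,sync

# Verify the mirror is bit-identical.
cmp "$SRC_DEV" "$DST_DEV" && echo "mirror verified"
```

On a drive that is actually suspect, ddrescue is usually the better tool than plain dd, since it retries bad regions and keeps a log of what it couldn't read.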

 

Dale


So questions... if unRAID started the rebuild and for some reason had the write errors, moving the replacement disk to a new slot/SATA port would still leave the disk disabled? If so, since I've already pre-cleared and verified the replacement, could I not just remove it from the array and format it as XFS under Unassigned Devices? Then re-add it to the array while attached to the MB SATA port and hope it rebuilds?

 

Dale

1 minute ago, AgentXXL said:

if unRAID started the rebuild and for some reason had the write errors, moving the replacement disk to a new slot/SATA port would still leave the disk disabled?

Yes, you need to re-enable the drive, like linked above.

Edited by johnnie.black

2 minutes ago, AgentXXL said:

 If so, since I've already pre-cleared and verified the replacement, could I not just remove it from the array and format it as XFS under Unassigned Devices? Then re-add it to the array while attached to the MB SATA port and hope it rebuilds?

The rebuild process overwrites the existing contents so you gain nothing by attempting a format outside the array.

5 minutes ago, johnnie.black said:

Yes, you need to re-enable the drive, like linked above.

The re-enable process has now re-started the rebuild of Disk 8 from parity. The drive is still attached to the MB SATA port but once the rebuild is complete I'll likely want to move it back to the shelf with the other 8TB drives.

 

I'm not happy with the new but long SFF-8087 cables I'm using for the connections to the 4 shelves controlled by the 9201-16i. I'll order some new shorter cables which should hopefully reduce potential for UDMA CRC errors. I'll also do some testing with the 9201-16i... I do note that it's running an older firmware (19.0) in IT mode so I may also upgrade it to the latest.

 

Thanks for the help!

 

Dale


I also have a 24 slot norco case with sas connections. Backplanes can fail like cables and sometimes they have a bit of play that doesn't get a good connection. I would try shutting down and reinserting the missing disk and booting up again next time. Occasionally I do have a disk missing on bootup and this fixes it. If you are running through a sas controller you prob have to jbod/pass through the disk again as I do, manually via the card's bios or unraid never sees it. You prob know this and looks like you fixed it but thought I'd mention.

6 hours ago, BiGs said:

I also have a 24 slot norco case with sas connections. Backplanes can fail like cables and sometimes they have a bit of play that doesn't get a good connection. I would try shutting down and reinserting the missing disk and booting up again next time. Occasionally I do have a disk missing on bootup and this fixes it. If you are running through a sas controller you prob have to jbod/pass through the disk again as I do, manually via the card's bios or unraid never sees it. You prob know this and looks like you fixed it but thought I'd mention.

Thanks @BiGs... I definitely have things configured as recommended on the LSI controller. And I've been lucky that this Norcotek unit has been quite reliable when it was in use as my FreeNAS server, but that system used 8 SATA ports on the motherboard and only 6 more from an older Intel HBA. So some of the bays have never had drives connected since I bought it over 6 years ago. 

 

I've ordered some short 6G certified MiniSAS cables - 11.5" compared to the 30" existing. I'm hoping they'll help reduce crosstalk as right now the excessive length of the existing cables has them looped back and forth a couple of times inside the case. Far too much length for what's needed.

 

When the new cables arrive, I'm going to shut the whole system down so I can disassemble the unit to do a clean and apply some contact cleaner/stabilant on the MiniSAS connectors, the 4 x SAS/SATA connectors per shelf, and the power connectors on each shelf. I'll pay special attention to the shelves that hadn't seen any use until I built my unRAID setup in this case 2 months ago, as there could be oxidation on their connectors.

 

Regardless, the parity rebuild of my failed disk is well underway - should be done sometime tomorrow. After that I'll use something like FreeFileSync (or maybe even Krusader natively in its unRAID docker) to compare the files on the rebuilt drive to the files on the original, suspect drive.
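That comparison can also be done straight from the command line. A sketch with temp directories standing in for the rebuilt Disk 8 and the original drive's mount points (the real paths would replace them):

```shell
# Stand-ins for the two mount points being compared (assumptions).
REBUILT=$(mktemp -d)
ORIGINAL=$(mktemp -d)
echo "episode01.mkv data" > "$REBUILT/episode01.mkv"
echo "episode01.mkv data" > "$ORIGINAL/episode01.mkv"

# Recursive byte-for-byte comparison; a silent exit 0 means every
# file on the rebuilt disk matches the original.
if diff -rq "$ORIGINAL" "$REBUILT" > /dev/null; then
    echo "all files match"
else
    echo "differences found - keep the suspect drive until resolved"
fi
```

The byte-for-byte pass is slow on 6TB+ of data but it's the only comparison that would actually catch silent corruption from a flaky link, which is the failure mode suspected here.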

 

Since the original drive is still fully mountable and browseable on my Ubuntu system, I suspect it's a controller/cable/oxidation issue that caused the failure after adding the new 10TB drive. If all files appear the same after the comparison, I'll re-do a pre-clear and stress test on the original 8TB and hopefully put it on the shelf as a spare, ready to use should I experience another failure.

 

And hopefully the cleanup of the connectors and application of some Stabilant 22/Deoxit will help where oxidation may have played a role. That and I'm pretty sure I'll also update the firmware on the LSI. Even though it's not likely at fault, it won't hurt to have the latest version with better support for larger drives.

 

Hope your issues continue to remain resolved. And I agree with what was said in your thread about removing drives that aren't immediately needed for the array. It's nice to have some spare space, but it's also lifetime hours wasted on the drives that don't see any use.

 

Dale

 

