Drives keep having errors and being disabled


GoChris

Over the past couple, maybe a few now, I keep having drives "fail". It's been several different drives, and not a single one shows anything wrong in its SMART status. I had recently installed a 3rd HBA card and removed an expander. I thought maybe that was the culprit, but I can't recall whether the first drive failed before or after I put that in. Either way, I removed the new card and have still had a couple of drives fail since. I have done rebuilds on all of them; sometimes a 2nd drive "fails" during a rebuild, sometimes after. This latest time, and one other time, a drive had tons of errors during a rebuild but thankfully wasn't marked as disabled. Although really I don't expect most of this data to be 100% once I figure out the problem.

 

I've checked the connections and ran a memory test overnight, no errors. The drives have failed in at least 2 different backplanes (Norco 4224 case).

 

What should I check, what should I do? There have been something like 6 or 8 failures, and again, zero SMART status problems. One of the recent ones was a parity drive. I did have a spare new HGST NAS drive, so I used that this time for the parity rebuild.

 

Attached some diags if that helps.

tower-diagnostics-20181107-1713.zip

Link to comment

This type of issue can be difficult to diagnose. See if the failing disks have something in common, like the same HBA or miniSAS cable. If they use different controllers, it could, for example, be the power supply.
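One way to look for a common factor is to write down which components each disk sits behind and count the overlap among the failed disks. The sketch below does that; the disk-to-component table is entirely made up for illustration (it is not taken from the attached diagnostics), and `shared_components` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical layout: disk -> components in its data and power path.
# Replace with the real cabling of the server; these values are placeholders.
DISK_PATHS = {
    "disk4":  {"hba": "HBA-1", "cable": "SAS-A", "backplane": "row-2", "psu_cable": "molex-1"},
    "disk6":  {"hba": "HBA-2", "cable": "SAS-C", "backplane": "row-4", "psu_cable": "molex-1"},
    "disk16": {"hba": "HBA-1", "cable": "SAS-B", "backplane": "row-3", "psu_cable": "molex-1"},
    "disk9":  {"hba": "HBA-2", "cable": "SAS-C", "backplane": "row-4", "psu_cable": "molex-2"},
}


def shared_components(paths, failed):
    """Return the components that every failed disk has in common."""
    counts = Counter()
    for disk in failed:
        for kind, name in paths[disk].items():
            counts[(kind, name)] += 1
    # A component present in the path of all failed disks is a first suspect.
    return [comp for comp, n in counts.items() if n == len(failed)]


suspects = shared_components(DISK_PATHS, ["disk4", "disk6", "disk16"])
print(suspects)
```

If nothing is shared on the data side (HBA, cable, backplane), that points toward something common to all disks, such as power.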

P.S. When there are read errors on another disk during a rebuild, as happened during the disk4 rebuild in the log, and since there's no second parity, the rebuild will continue, but there will be some or a lot of corruption on the rebuilt disk.
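To see why a bad read on another disk corrupts the rebuilt one, here is a minimal sketch assuming simple XOR single parity, with a read error modelled as a block that comes back with a wrong byte. The block contents and function names are invented for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving, parity):
    """Reconstruct the single missing block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


# Three tiny "data disks" and their parity.
d1, d2, d3 = b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"
parity = xor_blocks([d1, d2, d3])

# Clean rebuild of d2: every other disk reads back correctly.
good = rebuild([d1, d3], parity)

# Rebuild of d2 while d3 returns one wrong byte.
bad = rebuild([d1, b"\xaa\x54"], parity)

print(good == d2, bad == d2)
```

With only one parity there is nothing to cross-check against, so the wrong byte from the other disk flows straight into the rebuilt data.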

Link to comment

In hindsight I should have stopped doing rebuilds as the data was fine, figured out the problem, then just rebuilt parity leaving the disk intact. Although the first one I did I had no reason to think I'd end up with these problems. 

I have an 850W power supply and according to the dashboard the UPS load is ~210W with everything spinning. I'll see if I have a spare PSU I can swap in. I also have a couple of extra SAS cables, so I can do some swapping to see. Since removing the 3rd HBA didn't stop the issues, even though I thought it might have started them, I will use that card to replace the other 2, one at a time, and do some read tests to see if any other drives "fail".

 

Any thoughts about changing that plan? I'll start tonight after work.

Link to comment

3 of the last failures were on the same backplane, although at least one of those drives was only there because it had been moved earlier. Either way, I have replaced the cable to that backplane and swapped its HBA for the new card, the one I had removed because I thought it was the culprit, only for the problems to continue afterwards. So there is now a new cable and card going to the one backplane row that had drive issues.

 

How can I test the drives, I'm guessing just some reads across each, to know if there are errors still without doing a drive rebuild right now?
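A plain read pass like the one described can be sketched as below: read the device from start to finish and note any offset where the kernel reports an I/O error. This is only an illustration, not an Unraid tool. `read_scan` is a made-up name, it assumes a Unix-like system (for `os.pread`), it needs root to open a raw device, and reads may be served from cache for data touched recently.

```python
import os


def read_scan(path, chunk_size=1024 * 1024):
    """Read `path` start to finish; return (bytes_read_ok, failed_offsets)."""
    ok_bytes = 0
    failed_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                got = len(os.pread(fd, want, offset))
            except OSError:
                # Record the failing chunk and skip past it.
                failed_offsets.append(offset)
                offset += want
                continue
            if got == 0:
                break  # unexpected end of device
            ok_bytes += got
            offset += got
    finally:
        os.close(fd)
    return ok_bytes, failed_offsets


# Example (read-only; /dev/sdX is a placeholder, not a real device name):
#   ok, bad = read_scan("/dev/sdX")
#   print(ok, "bytes read,", len(bad), "chunks failed")
```

The scan only opens the device read-only, so it does not change the array, but errors it triggers can still show up in the syslog.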

Link to comment

SMART test I forgot about, thanks. I did eventually think to do a non correcting parity test. Almost 4TB in and no errors. Might let it finish then rebuild the remaining failed disk. Seems either that cable or HBA card, or 3 slots on that backplane are the problem. 

 

I think what I will do is take a couple spare drives put them in those slots and do some pre-clears on them to see if there are any read errors to keep ruling out the exact culprit.

Link to comment
  • 2 weeks later...

I am sadly still having issues, here is what I've done and am still experiencing.

 

Replaced the cable to the backplane that had 3 drive issues, a 4th had no issues.

Removed the HBA the cable was plugged into and moved the cable to the expander.

I took one of the drives from that row that reported issues, and ran a pre-clear on it, resulting in no issues and a successful pre-clear and clean SMART status.

(It's been a couple weeks so not 100% sure on every detail)

Rebuilt the drive (disk6) no issues (I think).

Rebuilt disk16; near the start there were 768 errors on disk6, which had just been rebuilt fine.

 

Now, 768 is the exact number of errors on the past couple of failures as well, each time.

I even put the suspect HBA card back in, plugged one drive into it, and did a pre-clear, resulting in no issues and a clean SMART status.

 

So, there could have been a cable issue, that cable is out, I don't think there is an HBA issue.

 

Why am I getting 768 errors on rebuilds on multiple drives that seem to exhibit no issues from SMART and full pre-clear tests? I tested both drives for a full cycle. So currently, disk6 is in a failed state again. I'm not going to bother doing a rebuild, as I suspect that unless I do something different, I'll just end up with errors on disk16 when I rebuild disk6.

 

Thoughts?

Edited by GoChris
Link to comment
4 hours ago, John_M said:

850 watts should be enough power but does it have a single high current +12 volt rail? If you don't understand the question then what make and model is it?

Good question, seems it's not a single rail, so that might need some further investigation as to the power setup. It's an Antec HCP-850, High Current Pro.

http://www.jonnyguru.com/modules.php?name=NDReviews&op=Story&reid=215

Right now since the backplane takes molex, they are probably split off one cable. I will get another cable and split the power to the backplanes at least to another, I hope I have a cable. Or, get a single rail PSU ya? Wasn't a problem for a while though.

Link to comment
7 hours ago, GoChris said:

Good question, seems it's not a single rail, so that might need some further investigation as to the power setup. It's an Antec HCP-850, High Current Pro.

On the face of it, that looks like a pretty decent power supply. It seems rather expensive (at least, here in the UK, and in comparison with the Corsair AX and RM models I prefer), which ought to be an indication of quality, but isn't necessarily the case. However, the four +12 volt rails would exclude it from any shortlist of recommended parts for a file server. Long term, I would look to replacing it. Short term, I'd rearrange my cabling to balance the load across the four rails. Molex connectors are notoriously unreliable, tend to go high resistance and can easily be overloaded.
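As a rough planning aid for spreading the load, the sketch below assigns each backplane's spin-up current to whichever rail is currently least loaded, largest load first. Everything in it is a placeholder: the rail names, the drives-per-backplane split, and the 2 A per drive spin-up figure are assumptions for illustration, not specifications of this PSU or these drives.

```python
import heapq

ASSUMED_SPINUP_AMPS_PER_DRIVE = 2.0  # guess; check the drive datasheets

# Hypothetical number of drives on each backplane row.
DRIVES_PER_BACKPLANE = {"bp1": 4, "bp2": 4, "bp3": 4, "bp4": 4, "bp5": 3, "bp6": 2}


def balance(loads_amps, rails):
    """Greedy plan: put the largest load on the least-loaded rail."""
    heap = [(0.0, rail) for rail in rails]
    heapq.heapify(heap)
    plan = {rail: [] for rail in rails}
    for name, amps in sorted(loads_amps.items(), key=lambda kv: -kv[1]):
        total, rail = heapq.heappop(heap)
        plan[rail].append(name)
        heapq.heappush(heap, (total + amps, rail))
    totals = {rail: sum(loads_amps[n] for n in names) for rail, names in plan.items()}
    return plan, totals


loads = {bp: n * ASSUMED_SPINUP_AMPS_PER_DRIVE for bp, n in DRIVES_PER_BACKPLANE.items()}
plan, totals = balance(loads, ["12V1", "12V2", "12V3", "12V4"])
print(plan)
print(totals)
```

The resulting per-rail totals then need to be compared against each rail's rated limit and each cable's current capacity, and against whatever else (CPU, motherboard) already draws from those rails.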

Link to comment
37 minutes ago, jonathanm said:

Is there a diagram showing which rails feed which wires? Many times only one of the rails is wired to the drive connections anyway, so splitting between the rails requires a custom cutting and splicing rewiring job.

The photos on the Jonnyguru site that the OP referenced show the modular sockets as being labelled with the rails and the article says that you can plug the 5-pin cables into the 10-pin sockets so mixing and matching should be possible. The only inaccessible rail is +12V1, which is presumably dedicated to the non-modular motherboard cables. It looks doable, with a bit of thought and planning, but I'd still prefer a single-rail supply. That said, you still have to be careful not to exceed the current capacity of any single cable.

 

Edited by John_M
The +12V1 rail IS accessible too
Link to comment
30 minutes ago, John_M said:

The photos on the Jonnyguru site that the OP referenced show the modular sockets as being labelled with the rails and the article says that you can plug the 5-pin cables into the 10-pin sockets so mixing and matching should be possible. The only inaccessible rail is +12V1, which is presumably dedicated to the non-modular motherboard cables. It looks doable, with a bit of thought and planning, but I'd still prefer a single-rail supply. That said, you still have to be careful not to exceed the current capacity of any single cable.

Great to know! I will investigate and make sure that is the case, about the labeling, and that I still can find the extra cables.

 

Might be a good time to get a new PSU with black friday approaching quickly. My estimation puts it at most ~600W of draw, so perhaps I might upgrade to something larger if I go that route.

i7 CPU, ~21 HDDs, 2 SSD, 3 HBA, 120mm water cooler, 3 120mm case fans, 2 80mm case fans, 32GB ram. And temporarily the intel SAS expander.
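A back-of-the-envelope version of that estimate can be written out as below. The per-component wattages are rough guesses picked for illustration, not measurements of this system; the component counts follow the list above, with the water cooler counted together with the five case fans.

```python
# Assumed peak watts per unit; guesses only, adjust to real datasheet values.
ASSUMED_PEAK_WATTS = {
    "cpu": 95.0,
    "hdd_spinup": 20.0,
    "ssd": 3.0,
    "hba": 10.0,
    "fan_or_pump": 3.0,
    "board_and_ram": 40.0,
}

# Counts taken from the parts list: 21 HDDs, 2 SSDs, 3 HBAs, 5 fans + cooler.
COUNTS = {
    "cpu": 1,
    "hdd_spinup": 21,
    "ssd": 2,
    "hba": 3,
    "fan_or_pump": 6,
    "board_and_ram": 1,
}


def peak_draw(counts, watts):
    """Sum count * assumed watts over all component types."""
    return sum(n * watts[name] for name, n in counts.items())


def headroom(psu_watts, draw):
    """Watts left over on a PSU of the given rating."""
    return psu_watts - draw


total = peak_draw(COUNTS, ASSUMED_PEAK_WATTS)
print(total, headroom(850, total))
```

Note that the hard drives dominate the total when they all spin up at once, which is why the 12 V side matters more than the overall wattage rating.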

 

I still find it odd that it's 768 errors each time. This is on an older 3TB drive, I'm reluctant to replace it with a spare 8TB I have cause I don't need the extra space right now at all. I might have a 3TB I pulled earlier that could be fine I can try for the rebuild.

 

Link to comment
8 minutes ago, GoChris said:

Great to know! I will investigate and make sure that is the case, about the labeling, and that I still can find the extra cables.

I suspect you have the drive power cables (with black 5-pin modular plugs) plugged into the black 5-pin sockets on the PSU. That would put all the drives on the 12V1 rail. You can plug them into the red 10-pin sockets - using the correct half of the socket - to pick up the other rails, as labelled. The Jonnyguru article has all the information you need.

14 minutes ago, GoChris said:

 

Might be a good time to get a new PSU with black friday approaching quickly. My estimation puts it at most ~600W of draw, so perhaps I might upgrade to something larger if I go that route.

I'd look for something like the Corsair AX760 (or AX860, if you want a bit more headroom). The Corsair RM750x is an excellent choice too. There are plenty of other good brands too - some people rate Seasonic as the best but beware that they too make split-rail models.

Link to comment
21 hours ago, GoChris said:

Why am I getting 768 errors on rebuilds on multiple drives that seem to exhibit no issues from SMART and full pre-clear tests

I overlooked this, so it won't be a PSU or cable issue.

 

Problem cause still unknown.

 

Any cascade on expander?

Edited by Benson
Link to comment

No cascade on the expander nope. I also for a bit there didn't have the expander in but have it in now as I tried to rule out the newest HBA card being the problem, which doesn't seem to be the case.

 

I also realized these errors started at the same time I swapped out the i5 for the i7, same mobo and ram. Might mean nothing.

Link to comment

Confirmed that the backplanes are powered by two cables, not just split off one. So they're on two rails, which is probably okay. Rebooted, and have a new drive in "error", despite it showing zero errors, don't think there was even any reads.

 

I'm quite frustrated with this, not sure what to do. Swap the i5 cpu back in? Nothing else has really changed. Well when I checked the power, I removed a reverse sata->sas breakout cable and replaced it with a sas straight to the HBA card, so the only drives off the mobo sata are the 2 ssd cache drives.

 

Starting to pull my hair out.

Link to comment
On 11/24/2018 at 2:41 AM, Benson said:

Could you try the rebuild in maintenance mode? Or boot in safe mode and check whether anything is different.

 

Btw, two cables for almost all the disks is still not a good arrangement.

Okay, I have one of these as well I found in the basement, https://www.corsair.com/us/en/Power/txm-series-2017-config/p/CP-9020130-NA so I will swap that in, as I think it's a single rail and then try a rebuild, in maintenance mode.

Link to comment
