Help identifying possible drive issue [Solved]



*NEW INFORMATION AT BOTTOM OF POST*

 

I recently made some substantial hardware changes to my second UnRAID backup machine and I am now having issues with a new disk I have introduced. My array consists of a mix of 2TB disks and a single 4TB parity drive.

 

Old specs:

Mobo: ASROCK 880GMLE-FX

CPU: AMD FX-6300 Black Edition

RAM: 8GB (2x 4GB) Corsair Ballistix DDR3

Storage Cards: SuperMicro AOC-SASLP-MV8

Total Drives: 16 x 2TB (1 parity + 15 data)

Drives were all mounted in CoolerMaster 4x3 bay expanders where each drive had an individual power and SATA connection.

 

PCI/PCIE Layout:

1x PCIE x1<----- IOCrest IO-PCE9215-4I

1x PCIE x16 <---- SuperMicro Card

2x PCI <----- Graphics Card

 

 

My new specs:

Mobo: ASUS M5A97 R2.0

CPU: AMD FX-6300 Black Edition

RAM: 16GB (4x 4GB) Corsair Ballistix DDR3

Storage Cards: SuperMicro AOC-SASLP-MV8 (8 drives)

2x SYBA SI-PEX40064 (aka IOCrest IO-PCE9215-4I), 7 drives across the two cards

The other six drives are connected directly to the motherboard.

Total Drives: 19 x 2TB + 1 x 4TB (1x 4TB parity + 19x 2TB data)

These are all in NORCO SS-500 5-bay SATA/SAS hot-swap rack modules, with individual SATA connections per drive and shared power from 2x Molex connectors per bay.

 

PCI/PCIE Layout:

 

1 x PCIe 2.0 x16 (blue)  <---- Empty

1 x PCIe 2.0 x16 (x4 mode, black)  <---- SuperMicro Card

2 x PCIe 2.0 x1  <----- IOCrest IO-PCE9215-4I cards

2 x PCI <----- One Empty, one PCI Nvidia GTX200 series Graphics Card

 

 

EDIT 2: Going to disable virtualization in the BIOS, as I never plan on using it with this machine and I spent all day reading about Marvell-based storage controllers and the issues some users have discovered with virtualization enabled. I also plan to do a more balanced layout for the disks to better distribute bandwidth for optimum speeds:

New PCI/PCIE Layout:

 

1 x PCIe 2.0 x16 (blue)  <---- SuperMicro Card          x6 disks

1 x PCIe 2.0 x16 (x4 mode, black)  <---- SuperMicro Card          x6 disks

1 x PCIe 2.0 x1  <-----  Empty

1 x PCIe 2.0 x1  <----- IOCrest IO-PCE9215-4I card      x2 disks (last 6 on Mobo).

2 x PCI <----- One Empty, one PCI Nvidia GTX200 series Graphics Card

 

So after my rebuild I got my array all situated and everything was working fine (besides speed issues, which are an issue for another post) with a 2TB parity drive and 18 2TB data drives. This past weekend I decided to swap in a 4TB Seagate NAS HDD that I had sitting around as a cold spare for my other UnRAID server. This would allow me to replace disks in the old system as they die with newer 4TB disks to expand my backup capacity. This cold spare was pre-cleared with the pre-clear script, had two full rounds run on it, and reported no problems.

 

NOTE: I did NOT put the 4TB disk in the same bay as the old parity drive. I figured I could just leave the old one in place as a hot spare. I will try swapping their bays if the community recommends it. In addition, I have 7 of this same model of drive in my other server.

 

Now this is where things seem to go bad fast. I started the parity rebuild and let it sit, as it said it was going to take about 24 hours to complete. I checked on it before I went to sleep and it was still running fine. In the morning I woke up, checked my notifications, and saw a notice for the array stating that the parity build had completed, followed immediately by a failed health report from UnRAID. I managed to get the WebUI to work long enough to see that the parity disk, disks 1-4, disks 11-12, and disks 15-16 were all reporting some crazy number of errors in the tens of millions. I tried to view drive SMART information but nothing appeared to be working quite like it should. I tried to telnet and SSH into the machine: connection refused.

I used the WebUI to stop the array and it managed to do that before showing me this:

 

0gql5o.png

 

 

At this point the WebUI was just stuck on "waiting for available socket" in Chrome. I waited a couple of hours to see if it would start responding before I ended up having to do a hard shutdown.

 

Once I got the machine rebooted, all the drives showed up fine and the array started normally and wanted to do a parity check. I stopped the check because I wanted to look at the drives first. Of all the drives that supposedly had errors, the only one that seemed to show any sign of actual problems was the 4TB drive (see attached SMART report from the drive). I ran both a short and an extended SMART test and both passed without error:

 

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed without error      00%      1987        -

# 2  Short offline      Completed without error      00%      1980        -

 

Here are the particular values (these are much higher than they were the first time as I have done subsequent work and testing):

1 Raw read error rate 0x000f 080 064 044 Pre-fail Always Never 108284144

7 Seek error rate 0x000f 081 060 045 Pre-fail Always Never 121797733
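
 

For reference, the self-test log and attribute table above can also be pulled from the console with smartctl if the GUI is acting up (the /dev/sdX device name below is just a placeholder for the actual disk):

smartctl -l selftest /dev/sdX   # SMART self-test history (the two tests above)
smartctl -A /dev/sdX            # vendor attribute table, including IDs 1 and 7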

 

I decided to run a non-correcting parity check and see what happened. The check completed and listed the exact same number of parity errors that I saw previously and this time around no disks dropped and no "disk errors" were reported (or whatever the error column on the main screen represents).

 

I decided to run a correcting parity check this time just to test, because at this point I am prepared to swap back to my old 2TB parity drive, so I don't care if I hose the parity on the 4TB disk.

I let it run for about 8 hours and in that time it reported an additional 102,861,929 errors.

 

Normally I would just write this off as a bad drive and cut my losses. What has me curious is that none of the usual indicators of a failing drive are present, such as pending or re-allocated sectors, unrecoverable sectors, etc. I don't think it would be power, as these bays share a common power backplane across the five drives, so I assume a power issue would manifest itself across all the disks in that bay in some way, shape, or form.

 

So my question(s) to the community are:

-Is this a sign of a bad drive or possibly some sort of cable issue?

-Should I immediately discontinue use of the suspect drive for the sake of restoring parity protection to my array?

-Or do you all think I should try swapping cables, bays, etc before I write this drive off as toast?

 

 

EDIT: After writing this massive post I decided to look at my other server, which has 7 of these drives in it and which I have had no issues with. Well, lo and behold, literally every single ST4000VN000-1H4168 drive I have is reporting ludicrous numbers of read and seek errors in SMART. I never noticed any of these errors before because my parity drive in that machine is a 4TB HGST NAS drive, which doesn't report these errors.

 

Now the question becomes, can I somehow get UnRAID to ignore these two SMART error values? Is that even recommended, or should I just get a drive that doesn't report these spurious errors? Does anyone else have experience with these drives and this issue?

void-diagnostics-20170124-1414.zip

Untitled.png

void-smart-20170124-1446.zip

Link to comment

Do the missing drives have anything in common, like a controller or backplane?

See my edit in the OP. I would have to check my drive serial numbers against my array layout when I get home to be sure, but I'm fairly certain they are spread across multiple bays and controllers. I will come back and update this post once I get home to check.

Link to comment

Seagate disks report the real error rate. Other manufacturers don't, for fear of scaring the users or to make their products look better - you choose which you want to believe. Either way, ignore them. Your 4 TB Seagate SMART report looks good. I'd check cables and the seating of controllers in PCIe slots. Also power distribution.
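
 

For what it's worth, a commonly cited interpretation (an assumption on my part, not something from Seagate documentation) is that those raw values pack an operation count into the low 32 bits and the actual error count into the high 16 bits, so a quick shell sanity check on your reported seek value would look like this:

raw=121797733                                # Seek_Error_Rate raw value from your report
echo "errors:     $(( raw >> 32 ))"          # high bits = actual errors (0 here)
echo "operations: $(( raw & 0xFFFFFFFF ))"   # low 32 bits = total seeks

Under that reading, zero actual errors over roughly 122 million seeks - another reason not to worry about it.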

 

Link to comment

Any time multiple drives drop out at the same time, it's my experience that ALL of the drives are fine, and a common controller has crashed or failed (or lost power). I'm not sure I have EVER seen a different cause. I suppose it's possible a power outage or nearby lightning strike could damage multiple drives, if they were all being written to at that moment. But even that wouldn't likely cause most or all of the drives to drop.

Link to comment

Seagate disks report the real error rate. Other manufacturers don't, for fear of scaring the users or to make their products look better - you choose which you want to believe. Either way, ignore them. Your 4 TB Seagate SMART report looks good. I'd check cables and the seating of controllers in PCIe slots. Also power distribution.

After seeing all of my Seagate drives across both servers reporting these "errors", I agree that I should be able to ignore them. I personally like to be made aware any time a genuine SMART error appears, so I am glad it was brought to my attention.

My question is: if this is a known thing Seagate drives do, why does UnRAID report these as errors for a parity disk but completely ignore them when they are data disks?

 

Do the missing drives have anything in common, like a controller or backplane?

Any time multiple drives drop out at the same time, it's my experience that ALL of the drives are fine, and a common controller has crashed or failed (or lost power). I'm not sure I have EVER seen a different cause. I suppose it's possible a power outage or nearby lightning strike could damage multiple drives, if they were all being written to at that moment. But even that wouldn't likely cause most or all of the drives to drop.

Turns out they do. Here is my up-to-date server layout, hastily annotated in Paint. Two ports on the motherboard and all of the ports on my SuperMicro AOC-SASLP-MV8 dropped during that first parity build I described.

Screenshot_2017_01_24_18_02_21.png

 

In case you can't understand my paint markup:

-Drive 1 (my old parity disk) is just chilling as a hot spare and was not part of the array at all when this happened.

-Drives 2-5 are on one of the two breakout cables from my SuperMicro AOC-SASLP-MV8 and they all dropped.

-Drives 16-19 make up the other breakout cable from my SuperMicro card.

-Drive 20 is the 4TB new parity disk I am having issues with and is connected directly to the motherboard.

 

Thanks so far for your help, we are getting this narrowed down and now have some suspects.

 

A few weeks ago, when I first got all my "new" parts in, I did some playing around with dual parity (2x 2TB disks) and had a couple of rebuilds and parity checks under my belt with this setup, with no issues for the few weeks I had it set up like that. The only things I have changed since were removing dual parity, re-assigning one disk as data and the other as a hot spare, and trying to move to this 4TB parity disk. I haven't touched any of the hardware inside since I first put it all back together.

 

I am inclined to point the finger at these Norco bays to start, as they are one of the few components (besides some new SATA cables) in this rebuild that weren't in use in the old, flawless build I had going. They were bought used on eBay for a sweet price ($40 x 4 plus ~$20 for shipping). Supposedly (this is eBay, after all) they were all pulled from working servers that were being decommissioned and were said to be fully functional besides some cosmetic stuff. They all appear to work just fine in the limited testing I did (plug a drive in, make sure it powers on and mounts, and that I can read/write a couple of gigs of data to it).

 

I have never had any issues with this SuperMicro card before this changeover. The same can be said for my power supply (a Seasonic X-series 650 or 750, can't remember which), and it should be more than enough to power 20 drives, the basic computer components (CPU, RAM, etc.), and an old GTX 200 series PCI graphics card for troubleshooting when the WebUI is down.

 

I will try re-seating expansion cards, SATA cables, and power connections, in that order, to see if I can't find the root cause. Does anyone have suggestions for a good way to QUICKLY test for the issues I am experiencing? This is my older system and a lot of these drives are 5+ years young, so I would prefer to avoid subjecting my disks to multiple parity checks, rebuilds, etc. unless absolutely necessary.

Link to comment

@trurl: The diagnostic dump attached to my OP should not contain the first incident where all of those drives dropped (the one I posted a screenshot of). I had to hard-reboot the server after that first incident, so there wouldn't have been a syslog or anything else to show for it. I should have tried to get a dump while the WebUI was still working, but alas I did not.

None of my other subsequent parity checks or rebuilds on the 4TB disk had a "catastrophic" failure like that first time did. No disk errors have since been reported on the "main" tab. I see different error numbers on my non-correcting check vs my correcting check (almost 5x as many during correcting).

 

Like I mentioned in the OP after the hard reboot all of the disks re-appeared and the array started fine. I now only see parity errors during parity operations. None of the other disks have dropped again.

 

I wanted to expand on my post above yours (trying to provide as much info as possible) and mention that when my array had the 2x 2TB parity drives I had no issues making a full backup of my main array (20TB+ of data) via Samba, plus 4TB or so from USB disks, all without a single controller or disk error.

 

Sent from my Nexus 6P using Tapatalk

 

Link to comment

My question is: if this is a known thing Seagate drives do, why does UnRAID report these as errors for a parity disk but completely ignore them when they are data disks?

 

It doesn't. The errors that unRAID reports have nothing to do with those SMART parameters. Those raw error rates are caused by the fact that the drive is trying to read tiny magnetic fluctuations close to the limits of what is physically possible. It's a tiny signal amongst a lot of noise and it needs advanced signal processing and error correction to get a result. The raw error rate is very high on all hard drives; it's just that Seagate reports it while others do not. Just ignore it - unRAID does. The errors in the unRAID GUI are because some of your disks dropped off-line. If a disk is off-line then an attempt to read it will fail. You were running a parity check when the disks dropped off-line, which involves reading every byte from every disk, so that's a lot of errors.

 

Now the question becomes, can I somehow get unraid to ignore these two smart error values?

 

It already does ignore SMART parameters 1 and 7. See for yourself: Settings -> Disk Settings then scroll down to Global SMART Settings and see that it only monitors parameters 5, 187, 197 and 198 by default.
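
 

If you want to eyeball just those monitored attributes from the command line, something along these lines works (the device name is only an example):

smartctl -A /dev/sdX | awk '$1==5 || $1==187 || $1==197 || $1==198'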

 

Link to comment

My question is: if this is a known thing Seagate drives do, why does UnRAID report these as errors for a parity disk but completely ignore them when they are data disks?

 

It doesn't. The errors that unRAID reports have nothing to do with those SMART parameters. Those raw error rates are caused by the fact that the drive is trying to read tiny magnetic fluctuations close to the limits of what is physically possible. It's a tiny signal amongst a lot of noise and it needs advanced signal processing and error correction to get a result. The raw error rate is very high on all hard drives; it's just that Seagate reports it while others do not. Just ignore it - unRAID does. The errors in the unRAID GUI are because some of your disks dropped off-line. If a disk is off-line then an attempt to read it will fail. You were running a parity check when the disks dropped off-line, which involves reading every byte from every disk, so that's a lot of errors.

 

Now the question becomes, can I somehow get unraid to ignore these two smart error values?

 

It already does ignore SMART parameters 1 and 7. See for yourself: Settings -> Disk Settings then scroll down to Global SMART Settings and see that it only monitors parameters 5, 187, 197 and 198 by default.

That makes sense, thanks for the explanation. So you are thinking this is a cable/power/controller issue?

 

I forgot to mention that the SuperMicro card is plugged into one of my two PCIe 2.0 x16 slots (I updated the OP with more details). It shouldn't make a difference but I want to make sure I share as much info as I can.

 

What would be my ideal testing procedure moving forward? Like I said, I don't really want to mess around with a bunch of parity builds and checks, as I am worried about beating on the array too much while it is unprotected. Is the suspect most likely:

 

-Power connections?

-SATA connections?

-Could it be that two of the Norco SS-500s I got on eBay are damaged?

 

I'm thinking of swapping the bays of the old 2TB parity drive and the current 4TB, doing a parity build onto the 2TB, and seeing if it reports any errors. If the build and a subsequent check don't report any errors, I can assume it isn't the bay or cable that the 4TB was plugged into.

 

Or should I first attempt to re-seat the expansion card, check cables, power, etc. before moving on to parity rebuilds and checks on the 4TB?

Link to comment

I'd do the easy things first.

 

Re-seat the PCIe cards - make sure they're sitting square in the slots. Many times I've had to bend the mounting bracket slightly in order to be able to fit the screw without it trying to lift one end of the card out of the slot.

 

Check the data cables - most SATA cables are horrible. The stiff red ones without latches are the worst and they are easily knocked out of fitting squarely. SAS splitter cables are usually lighter and more flexible but sometimes the SAS end doesn't fit quite properly in the socket on the card. But they are all easily re-seated.

 

Power. Your PSU itself should be fine - it's a good brand and should be powerful enough, single +12V rail. But check any Molex splitters you might be using and where you plug into the drive cages.

 

Drive cages - could possibly be damaged. It's another set of connectors for the power and data signals to go through. You'll need to assess these for yourself - maybe swap drives around if a pattern emerges, though the current pattern points very strongly towards the SAS card and its cables.

 

Put the old 2 TB disk back in the parity slot if you want to - it will take less time to build parity onto it than the 4 TB disk - and hopefully all the disks will be available. If they are not then more investigation will be needed. If parity builds successfully then a parity check would be a good test and zero errors would be a good result. Any problems - grab diagnostics and post.
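
 

If you want a quick, low-impact sanity check before committing to another rebuild, you could also kick off short SMART self-tests on all the drives and review the results a few minutes later - a rough sketch, assuming the drives are /dev/sdb through /dev/sdu (adjust for your system):

for dev in /dev/sd[b-u]; do
  smartctl -t short "$dev"      # each short test runs on the drive itself, a couple of minutes
done
# ...wait a few minutes, then review:
for dev in /dev/sd[b-u]; do
  echo "== $dev =="
  smartctl -l selftest "$dev" | head -n 10
done

It won't catch a flaky controller or cable the way a full parity check will, but it's quick and doesn't hammer the array.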

 

Link to comment

Awesome thank you for your assistance and detailed response!

I'd do the easy things first.

 

Re-seat the PCIe cards - make sure they're sitting square in the slots. Many times I've had to bend the mounting bracket slightly in order to be able to fit the screw without it trying to lift one end of the card out of the slot.

I'll be sure to take a good hard look at the way they are oriented and make sure the screws aren't causing the cards to lift out of the socket.

 

Check the data cables - most SATA cables are horrible. The stiff red ones without latches are the worst and they are easily knocked out of fitting squarely. SAS splitter cables are usually lighter and more flexible but sometimes the SAS end doesn't fit quite properly in the socket on the card. But they are all easily re-seated.

I will be sure to double check all the connections.

 

Power. Your PSU itself should be fine - it's a good brand and should be powerful enough, single +12V rail. But check any Molex splitters you might be using and where you plug into the drive cages.

I actually have two 1x SATA to 2x Molex splitters, installed because the power supply only came with 4 Molex connectors in total. These are connected to the first (top) bay and the bay below it. If, after re-seating the cards and SATA connectors, I still have trouble, I will try swapping them and see if it makes a difference.

 

Drive cages - could possibly be damaged. It's another set of connectors for the power and data signals to go through. You'll need to assess these for yourself - maybe swap drives around if a pattern emerges, though the current pattern points very strongly towards the SAS card and its cables.

Yeah, I am really hoping it isn't the bays; they were such a steal on eBay and the seller had nothing but five-star reviews.

 

Put the old 2 TB disk back in the parity slot if you want to - it will take less time to build parity onto it than the 4 TB disk - and hopefully all the disks will be available. If they are not then more investigation will be needed. If parity builds successfully then a parity check would be a good test and zero errors would be a good result. Any problems - grab diagnostics and post.

All the disks are available. I imagine they all dropped off like that because the 8-port expansion card had some kind of error. Before I stopped the array that first time, all the disks that dropped off showed the same number of reported errors. So my hypothesis is that the SuperMicro card dropped out and caused all 8 disks to stop reading, which caused the parity disk to also record those errors?

 

I am using tail -f /var/log/syslog > /mnt/cache/.tmp/syslog.txt to copy the syslog to a disk, so if there is a hard reboot or some other kind of critical failure I should have at least most of my system log to post.
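
 

One tweak I might make (just a sketch, assuming the cache path already exists): run it detached so the copy keeps going even if my console session dies:

mkdir -p /mnt/cache/.tmp
nohup tail -f /var/log/syslog >> /mnt/cache/.tmp/syslog.txt 2>&1 &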

 

Anyways the server is powered off for now so I will wait and see if anyone has any other suggestions when I get off work tomorrow.

Link to comment

OK, I re-seated the SuperMicro card (and bent the bracket somewhat to keep the card flush in the socket), re-seated the breakout cables and power connectors, and swapped the bays of the old and new parity drives. Just for my sanity's sake, I disabled all the virtualization options I could find in the BIOS as well.

 

I started a correcting parity check onto the 4TB disk this morning and so far I am at 47% with 46,250,954 parity sync errors. Once this finishes I will run a non-correcting check and see if it reports any new errors.

 

This is in the syslog:

Jan 25 06:34:45 VOID kernel: md: recovery thread: stopped logging

I assume this just means the syslog decided to stop monitoring the parity write since there are so many "errors"?

Link to comment

Now at 75% with 78,077,167 sync errors.

 

Something I have noticed with my last two parity operations is that the check starts out at around 40MB/s, and then at some point between 50% and 75% (I'm not sure exactly when, as I can't watch it very closely while at work) the speed shoots up to what I used to see with my old setup, around 120MB/s.

 

In the syslog I see that the drives all spun down and stayed down.

Jan 25 09:49:42 VOID kernel: mdcmd (56): spindown 1

Jan 25 09:49:43 VOID kernel: mdcmd (57): spindown 2

Jan 25 09:49:45 VOID kernel: mdcmd (58): spindown 3

Jan 25 09:49:46 VOID kernel: mdcmd (59): spindown 4

Jan 25 09:49:48 VOID kernel: mdcmd (60): spindown 5

Jan 25 09:49:49 VOID kernel: mdcmd (61): spindown 6

Jan 25 09:49:49 VOID kernel: mdcmd (62): spindown 7

Jan 25 09:49:50 VOID kernel: mdcmd (63): spindown 8

Jan 25 09:49:50 VOID kernel: mdcmd (64): spindown 9

Jan 25 09:49:51 VOID kernel: mdcmd (65): spindown 10

Jan 25 09:49:52 VOID kernel: mdcmd (66): spindown 11

Jan 25 09:49:53 VOID kernel: mdcmd (67): spindown 12

Jan 25 09:49:53 VOID kernel: mdcmd (68): spindown 13

Jan 25 09:49:53 VOID kernel: mdcmd (69): spindown 14

Jan 25 09:49:54 VOID kernel: mdcmd (70): spindown 15

Jan 25 09:49:54 VOID kernel: mdcmd (71): spindown 16

Jan 25 09:49:55 VOID kernel: mdcmd (72): spindown 17

Jan 25 09:49:55 VOID kernel: mdcmd (73): spindown 18

 

I tried searching the forum for answers but wasn't sure how to word it to find information from people whose parity drives are larger than their largest data drive. Is this because my parity disk is larger than my data disks, and parity is already fully calculated for the smaller disks, so it is doing a faux parity calculation for the rest of the 4TB drive? Or is this pre-emptive spin-down before the check is complete a sign of trouble?

Link to comment

I tried searching the forum for answers but wasn't sure how to word it to find information from people whose parity drives are larger than their largest data drive. Is this because my parity disk is larger than my data disks, and parity is already fully calculated for the smaller disks, so it is doing a faux parity calculation for the rest of the 4TB drive? Or is this pre-emptive spin-down before the check is complete a sign of trouble?

When the parity check has gone beyond the size of a particular drive, it is no longer used since its contribution to the parity calculation is zero.
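
A toy illustration of the idea (not unRAID's actual code): parity is a bytewise XOR across the data disks, and a disk that has no data at a given offset contributes zero, so it drops out of the calculation and can spin down:

d1=0xA5; d2=0x3C                                      # bytes from two data disks at some offset
d3=0x00                                               # a smaller disk past its end contributes nothing
printf 'parity byte: 0x%02X\n' $(( d1 ^ d2 ^ d3 ))    # same result with or without d3
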
Link to comment

I tried searching the forum for answers but wasn't sure how to word it to find information from people whose parity drives are larger than their largest data drive. Is this because my parity disk is larger than my data disks, and parity is already fully calculated for the smaller disks, so it is doing a faux parity calculation for the rest of the 4TB drive? Or is this pre-emptive spin-down before the check is complete a sign of trouble?

When the parity check has gone beyond the size of a particular drive, it is no longer used since its contribution to the parity calculation is zero.

OK, that is kind of what I figured. The correcting check finished with no lost disks or disk errors. Running the non-correcting check now.

Link to comment
