Badams

2 Drives acting strange.. Possibly dying?


Hey All,

 

I think I might be on the event horizon of a massive hard drive failure (massive for me, anyway).

 

My unRAID box has four 3TB HDDs in it (one parity), and since the weekend I've been seeing disks 2 and 3 acting strange, almost as though they just decide to go AWOL.

 

I got an alert last night, after starting a parity scan, that the array had two disks with read errors while the scan was running. The parity scan finished with 514,358,631 errors, similar to an earlier scan that had reached 3,926,338 errors before I stopped it and rebooted the server.

 

When I rebooted the server, the two drives that had previously been showing no file system and/or no temperature in the SMART status reconnected and worked fine for another 12 or so hours.

 

WHAT CHANGED WHEN THE ERRORS STARTED:

Over the weekend I added a spare HDD I had lying around from an old computer I no longer use. It's a WD Green drive. The issues appear to have started when I was trying to run a clear on it (NOTE: I would have done a pre-clear using Unassigned Devices, but it wasn't letting me. The bubble was showing grey?).

I then decided to remove the drive from the array until I could get to the server and physically pull it. Removing it from the array didn't stop the issues, however, as the most recent scan (the one with the 514,358,631 errors) ran after that.

 

The drive has now been physically removed, but I have put the array into maintenance mode to stop it from being mounted.

 

I'm not sure how to proceed and am open to any and all suggestions.

 

Please find attached the two diagnostics dumps: one from when the WD drive was in, and another from after it was removed and the server rebooted.

AfterWDRem_blackbox-diagnostics-20180515-1044.zip

BeforeWDRem_blackbox-diagnostics-20180515-1028.zip


Usually I expect to see someone with bad connections to good disks, but in your case it looks like the only good disk is parity. Do you have backups?

16 minutes ago, trurl said:

Usually I expect to see someone with bad connections to good disks, but in your case it looks like the only good disk is parity. Do you have backups?

 

I do, luckily. It's a little dated, but it won't be hard to re-download what I need.

 

What makes you think that the other disks are dead? Was it something in the logs?


Because they have SMART issues, including two with a "SMART failing now" attribute. It doesn't help that they are the infamous ST3000DM001.

 

Also check your cooling; all of them have overheated in the past.
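
(For anyone wanting to check those attributes themselves: unRAID ships smartmontools, so something along these lines from the console should show them; /dev/sde here is just an example device, substitute your own.)

smartctl -A /dev/sde    # attribute table; the WHEN_FAILED column reads FAILING_NOW for a failing attribute
smartctl -H /dev/sde    # overall SMART health self-assessment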

11 hours ago, Badams said:

I got an alert last night

Was this an email from unRAID Notifications? Do you not have Notifications configured to email you on SMART issues? You should have already known about a bad disk before it became multiple bad disks. Doesn't the Dashboard have warning indicators on each of them?

1 hour ago, trurl said:

Was this an email from unRAID Notifications? Do you not have Notifications configured to email you on SMART issues? You should have already known about a bad disk before it became multiple bad disks. Doesn't the Dashboard have warning indicators on each of them?

 

I do have alerts enabled, but the SMART issues were there well before I put the drives into my new server, which is when I first moved across to unRAID. That was 8 months ago, and it has been fine ever since. But yes, I should have acted earlier; with a baby on the way, though, I couldn't exactly get any new drives. :-(

 

7 hours ago, johnnie.black said:

Also check your cooling; all of them have overheated in the past.

 

That was a massive heat spike on a heatwave day about 7 months ago when the air conditioning was dead. The temperature didn't stay that high for long before I acted and jerry-rigged some extra cooling for it.

 

 

---

 

 

Moving forward, I have a plan and need to know how likely it is to succeed, or if it's even possible:

 

  1. I purchase an 8TB WD Red drive and put it into the array for use WITHOUT parity protection (is this even possible with the current array setup?).
  2. I move data from the current drives and/or backups onto the 8TB drive.
  3. I purchase a second 8TB WD Red drive and this becomes the new parity drive.
  4. I purchase a third 8TB WD Red drive for future storage.

Thanks heaps for your advice so far, guys.

33 minutes ago, Badams said:

the SMART issues were there well before I put the drives into my new server

 

Rebuilding a disk requires that parity plus all the other disks be good. Starting with multiple known problem disks is almost certain to end this way. I hope you do have good backups of anything important and irreplaceable.

 

28 minutes ago, Badams said:
  1. I purchase an 8TB WD Red drive and put it into the array for use WITHOUT parity protection (is this even possible with the current array setup?).
  2. I move data from the current drives and/or backups onto the 8TB drive.
  3. I purchase a second 8TB WD Red drive and this becomes the new parity drive.
  4. I purchase a third 8TB WD Red drive for future storage.

 

Can you purchase 2 of them at the same time? Then you could start fresh and have parity.

 

I would probably just do a new array, with or without parity, only new disks, and use Unassigned Devices to copy whatever you can from the old disks and backups.
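
(For the copying step, Unassigned Devices normally mounts disks under /mnt/disks, so it usually comes down to an rsync from there into a user share. A rough sketch with placeholder names, where "old_disk2" and "Media" are examples rather than your actual mount point or share:)

rsync -av --progress /mnt/disks/old_disk2/ /mnt/user/Media/    # copy everything readable, preserving attributes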

On 5/16/2018 at 12:57 AM, trurl said:

 

Rebuilding a disk requires that parity plus all the other disks be good. Starting with multiple known problem disks is almost certain to end this way. I hope you do have good backups of anything important and irreplaceable.

 

This morning I had an epiphany.

 

Let's forget about those SMART statuses for now... I know they CAN show that a disk is dying, but I've had the same errors for 8 months and haven't had an issue. I'm not saying with 100% definitive proof that this isn't the problem, but I'm also not saying it is... We'll call these two failing disks sdg and sde.

 

After putting a newish disk into the server and trying to run a pre-clear and then a rebuild, with neither finishing (even with it just sitting there getting ready), I noticed that the temperature for that drive became a * and that got me thinking. I checked the SMART status for the newish drive (let's call this disk sdf) and it's fine. Not one SMART error. So why is this one having an issue too? Surely I can't have three drives of varying ages and brands failing at THE EXACT SAME TIME.

 

OK, so my motherboard (Gigabyte AB-350N) only has 4 SATA ports, which means I had to purchase an expansion card (https://www.umart.com.au/Skymaster-PCIe-4-Ports-SATA-III-6G-Card_34030G.html) to run more than 4 drives... The epiphany I had was: what if the 'failing' drives are all plugged into the expansion card?

So I set out to find a way to check that without opening the server up (I'm at work, 70 km away from it, at the moment and wanted to test my theory).

 

So this is what I did.

 

Step 1. 

Take note of the failing drives. In my case they were sdg, sde and sdf.

Step 2.

Run the following command to get a list of the current HDDs:

ls -alt /sys/block/sd*

Step 3.

From the output, look at the last PCI-style address in each drive's path (the segment just before the ataN or usbN part); that is the controller the drive hangs off.

In my case it is 0000:09:00.0 and 0000:01:00.1 (see below)

lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sda -> ../devices/pci0000:00/0000:00:07.1/0000:0a:00.3/usb4/4-3/4-3:1.0/host0/target0:0:0/0:0:0:0/block/sda/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdb -> ../devices/pci0000:00/0000:00:01.3/0000:01:00.1/ata1/host1/target1:0:0/1:0:0:0/block/sdb/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdc -> ../devices/pci0000:00/0000:00:01.3/0000:01:00.1/ata2/host2/target2:0:0/2:0:0:0/block/sdc/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdd -> ../devices/pci0000:00/0000:00:01.3/0000:01:00.1/ata6/host6/target6:0:0/6:0:0:0/block/sdd/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sde -> ../devices/pci0000:00/0000:00:03.1/0000:09:00.0/ata9/host9/target9:0:0/9:0:0:0/block/sde/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdf -> ../devices/pci0000:00/0000:00:03.1/0000:09:00.0/ata10/host10/target10:0:0/10:0:0:0/block/sdf/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdg -> ../devices/pci0000:00/0000:00:03.1/0000:09:00.0/ata11/host11/target11:0:0/11:0:0:0/block/sdg/

This shows which controller each drive is connected to.
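
(A slightly quicker way to get the same mapping, assuming readlink and GNU grep are available as they are on a stock unRAID install, is to resolve each link and keep the last PCI-style address in the path, which is the controller:)

for d in /sys/block/sd?; do
    # prints e.g. /sys/block/sde -> 0000:09:00.0
    echo "$d -> $(readlink -f "$d" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | tail -1)"
done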

 

Step 4.a

Now take those values and drop the leading "0000:" (the PCI domain). For me this gives 09:00.0 and 01:00.1.

 

Step 4.b

Run the following commands (substituting the values you took in step 4.a):

lspci | grep 09:00
lspci | grep 01:00

And I got the following result:

09:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller (rev 11)
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller (rev 02)
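
(Side note: lspci can also be pointed at a single slot with -s, which avoids the grep accidentally matching anything else:)

lspci -s 09:00.0
lspci -s 01:00.1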

So the drives with the issues are all connected to the PCIe SATA expansion card.

 

Would I be safe in assuming that the issue is possibly with the expansion card and not the drives themselves?

 

When I get home, I will be moving ALL the drives to the SATA ports on the mobo to get it back up and running.

If I purchase a new expansion card (I'm looking at this: https://www.startech.com/au/Cards-Adapters/HDD-Controllers/SATA-Cards/4-Port-PCI-Express-SATA-6Gbps-RAID-Controller-Card~PEXSAT34RH), should I be able to just plug it in and away it goes, or will I need to make some config changes in the OS?

 

Thanks heaps for your help so far, guys.



I haven't looked at your diagnostics and I've only skimmed through the earlier posts in this thread but I can tell you that Marvell controllers are proving to be problematic for many people running 64-bit Linux operating systems, such as unRAID, and especially if they have IOMMU enabled. The StarTech card you mention is based on the same Marvell 9230 chip, so not a good choice for a replacement. Cheap 2-port ASM1061 or 1062-based cards work reliably, as do more expensive LSI-based 8-port SAS cards.

35 minutes ago, John_M said:

I haven't looked at your diagnostics and I've only skimmed through the earlier posts in this thread but I can tell you that Marvell controllers are proving to be problematic for many people running 64-bit Linux operating systems, such as unRAID, and especially if they have IOMMU enabled.

 

Oh crud... Well, I guess that makes sense.

 

Is there a way to disable IOMMU?

1 minute ago, Badams said:

Is there a way to disable IOMMU?

 

In the BIOS.
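
(If you want to confirm from the console whether the IOMMU is actually active before and after changing the BIOS setting, something like the line below should do it. On AMD boards the kernel boot option amd_iommu=off, added to the append line in syslinux.cfg, should also disable it, though the BIOS switch is the cleaner route.)

dmesg | grep -iE 'iommu|amd-vi'    # AMD-Vi / IOMMU messages appear when it is enabled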


Even if it is a controller issue, you should replace the disks. What I should have said above is that every bit of parity plus every bit of all the other disks must be read reliably in order to reconstruct a missing disk. Disks that are "mostly OK" will not be able to reconstruct a disk that fails.

9 minutes ago, trurl said:

Even if it is a controller issue, you should replace the disks.

Agreed, the controller is very likely a problem. Marvell controllers in general can be, and the 9230 seems to be the worst offender; the 9215 and 9235 seem to work better but are still not recommended. Even so, I would replace those disks sooner rather than later, especially the ones with a failing End-to-End SMART attribute.

