2 Drives acting strange.. Possibly dying?



Hey All,

 

I think I might be on the event horizon of a massive hard drive failure (massive for me, anyway).

 

My unRAID box has four 3TB HDDs in it (one parity), and since the weekend I've been seeing disks 2 and 3 acting strange, almost as though they just decide to go AWOL.

 

I got an alert last night, while a parity scan was running, that the array has 2 disks with read errors. The parity scan finished with 514,358,631 errors, similar to an earlier scan that was at 3,926,338 errors when I stopped it and rebooted the server.

 

When I reboot the server, the 2 drives that were previously showing no file system and/or no temperature in their SMART status come back and work fine for another 12 or so hours.

 

WHAT CHANGED WHEN THE ERRORS STARTED:

Over the weekend I added a spare HDD to the server that I had lying around from an old computer I was no longer using. It's a WD Green drive. The issues appear to have started when I tried to run a clear on it (NOTE: I would have done a preclear using Unassigned Devices, but it wasn't letting me; the bubble was showing grey?).

I then decided to remove the drive from the array until I could get to the server and physically pull it. Removing it from the array didn't stop the issues, however, as the most recent scan (the one with 514,358,631 errors) still produced them.

 

The drive has now been physically removed, and I have put the array into maintenance mode to stop the disks from being mounted.

 

I'm not sure how to proceed and am open to any and all suggestions.

 

Please find attached the two diagnostics dumps: one from when the WD drive was in, and another from after it was removed and the server rebooted.

AfterWDRem_blackbox-diagnostics-20180515-1044.zip

BeforeWDRem_blackbox-diagnostics-20180515-1028.zip

Link to comment
16 minutes ago, trurl said:

Usually I expect to see someone with bad connections to good disks, but in your case it looks like the only good disk is parity. Do you have backups?

 

I do, luckily. It is a little dated, but it won't be hard to re-download what I need.

 

What makes you think that the other disks are dead? Was it something in the logs?

Link to comment
11 hours ago, Badams said:

I got an alert last night

Was this an email from unRAID Notifications? Do you not have Notifications configured to email you on SMART issues? You should have already known about a bad disk before it became multiple bad disks. Doesn't the Dashboard have warning indicators on each of them?

Link to comment
1 hour ago, trurl said:

Was this an email from unRAID Notifications? Do you not have Notifications configured to email you on SMART issues? You should have already known about a bad disk before it became multiple bad disks. Doesn't the Dashboard have warning indicators on each of them?

 

I do have alerts enabled, but the SMART issues were there well before I put the drives into my new server; I only moved across to unRAID when I built that server, about 8 months ago, and it has been fine ever since. But yes, I should have acted earlier; with a baby on the way, though, I couldn't exactly go out and buy new drives. :-(

 

7 hours ago, johnnie.black said:

Also check your cooling, all of them have overheat in the past.

 

That was a massive heat spike from a heatwave day about 7 months ago when the air conditioning was dead. The temperatures didn't stay that high for long before I acted and jerry-rigged some extra cooling for it.

 

 

---

 

 

Moving forward, I have a plan and need to know how likely it is to succeed, and whether it's even possible.

 

  1. I purchase an 8TB WD Red drive and put it into the array WITHOUT parity protection (is this even possible with the current array setup?).
  2. I move data from the current drives and/or my backups to the 8TB drive.
  3. I purchase a second 8TB WD Red drive and it becomes the new parity drive.
  4. I purchase a third 8TB WD Red drive for future storage.

Thanks heaps for your advice so far guys.

Link to comment
33 minutes ago, Badams said:

the SMART issues that appeared were there well before I put the drives into my new server

 

Rebuilding a disk requires that parity plus all the other disks be good. Starting with multiple known problem disks is almost certain to end this way. I hope you do have good backups of anything important and irreplaceable.

 

28 minutes ago, Badams said:
  1. I purchase an 8TB WD Red drive and put it into the array WITHOUT parity protection (is this even possible with the current array setup?).
  2. I move data from the current drives and/or my backups to the 8TB drive.
  3. I purchase a second 8TB WD Red drive and it becomes the new parity drive.
  4. I purchase a third 8TB WD Red drive for future storage.

 

Can you purchase 2 of them at the same time? Then you could start fresh and have parity.

 

I would probably just build a new array, with or without parity, using only new disks, and use Unassigned Devices to copy whatever you can from the old disks and backups.
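If you go that route, the copying itself is easy to do from the console once Unassigned Devices has mounted an old disk. Just a rough sketch, with the mount point /mnt/disks/olddisk1 and the share name 'media' as placeholders for whatever yours are actually called:

# copy everything from the old disk into the new array share, preserving attributes
rsync -avh --progress /mnt/disks/olddisk1/ /mnt/user/media/

Run it once per old disk and check the output before wiping anything.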

Link to comment
On 5/16/2018 at 12:57 AM, trurl said:

 

Rebuilding a disk requires that parity plus all the other disks be good. Starting with multiple known problem disks is almost certain to end this way. I hope you do have good backups of anything important and irreplaceable.

 

This morning I had an epiphany.

 

Let's forget about those SMART statuses for now... I know they CAN show that a disk is dying, but I've had the same error for 8 months without an issue. I'm not saying with 100% certainty that this isn't the problem, but I'm also not saying it is... We'll call the two failing disks sdg and sde.

 

After putting a newish disk into the server and trying to run a preclear and then a rebuild, with neither finishing (it just sat there getting ready), I noticed that the drive's temperature became a * and that got me thinking. I checked the SMART status for the newish drive (let's call this disk sdf) and it's fine, not one SMART error. So why is this one having an issue too? Surely I can't have three drives of varying ages and brands failing at THE EXACT SAME TIME.

 

OK, so my motherboard (Gigabyte AB-350N) only has 4 SATA ports, which means I had to purchase an expansion card (https://www.umart.com.au/Skymaster-PCIe-4-Ports-SATA-III-6G-Card_34030G.html) to connect more than 4 drives... The epiphany I had was: what if the 'failing' drives are all plugged into the expansion card?

So I set out to work out how I could check that without opening the server up (I'm at work, 70 km away from it, and wanted to test my theory).

 

So this is what I did.

 

Step 1. 

Take note of the failing drives. In my case they were sdg, sde and sdf.

Step 2.

Run the following command to get a list of the current HDDs:

ls -alt /sys/block/sd*

Step 3.

From the output, take note of the last PCI address in the path for each drive, i.e. the segment just before the ataN (or usbN) part.

In my case they are 0000:09:00.0 and 0000:01:00.1 (see below)

lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sda -> ../devices/pci0000:00/0000:00:07.1/0000:0a:00.3/usb4/4-3/4-3:1.0/host0/target0:0:0/0:0:0:0/block/sda/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdb -> ../devices/pci0000:00/0000:00:01.3/0000:01:00.1/ata1/host1/target1:0:0/1:0:0:0/block/sdb/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdc -> ../devices/pci0000:00/0000:00:01.3/0000:01:00.1/ata2/host2/target2:0:0/2:0:0:0/block/sdc/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdd -> ../devices/pci0000:00/0000:00:01.3/0000:01:00.1/ata6/host6/target6:0:0/6:0:0:0/block/sdd/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sde -> ../devices/pci0000:00/0000:00:03.1/0000:09:00.0/ata9/host9/target9:0:0/9:0:0:0/block/sde/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdf -> ../devices/pci0000:00/0000:00:03.1/0000:09:00.0/ata10/host10/target10:0:0/10:0:0:0/block/sdf/
lrwxrwxrwx 1 root root 0 May 17 08:11 /sys/block/sdg -> ../devices/pci0000:00/0000:00:03.1/0000:09:00.0/ata11/host11/target11:0:0/11:0:0:0/block/sdg/

This shows which controller each drive is connected to.

 

Step 4.a

Now take those values and drop the leading 0000: domain prefix. For me that leaves 09:00.0 and 01:00.1.

 

Step 4.b

Run the following commands (substituting the values you noted in step 4.a):

lspci | grep 09:00
lspci | grep 01:00

And I got the following result:

09:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller (rev 11)
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller (rev 02)

So the drives with the issues are all connected to the PCIe SATA expansion card.
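For anyone who wants to do the same check in one hit, steps 2 to 4 can be rolled into a single loop. This is only a rough sketch using the same sysfs links and lspci output as above:

# for each disk, resolve its sysfs link and print the controller it hangs off
for dev in /sys/block/sd?; do
    pci=$(readlink -f "$dev" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9]' | tail -1)
    echo "$(basename "$dev") -> $(lspci -s "$pci")"
done

On my box sde, sdf and sdg all come back with the Marvell 88SE9230 line, matching the manual steps above.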

 

Would I be safe to assume that the issue is possibly with the expansion card and not the drives themselves?

 

When I get home, I will be moving ALL the drives to the SATA ports on the mobo to get it back up and running.

If I purchase a new expansion card (I'm looking at this: https://www.startech.com/au/Cards-Adapters/HDD-Controllers/SATA-Cards/4-Port-PCI-Express-SATA-6Gbps-RAID-Controller-Card~PEXSAT34RH), should I be able to just plug it in and away it goes, or will I need to make some config changes in the OS?

 

Thanks heaps for your help so far, by the way, guys.

Link to comment

I haven't looked at your diagnostics and I've only skimmed through the earlier posts in this thread but I can tell you that Marvell controllers are proving to be problematic for many people running 64-bit Linux operating systems, such as unRAID, and especially if they have IOMMU enabled. The StarTech card you mention is based on the same Marvell 9230 chip, so not a good choice for a replacement. Cheap 2-port ASM1061 or 1062-based cards work reliably, as do more expensive LSI-based 8-port SAS cards.
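If you want to confirm whether the IOMMU is actually active, the kernel log will tell you. A quick diagnostic check from the console (it doesn't change anything):

# look for AMD-Vi / IOMMU initialisation messages in the kernel log
dmesg | grep -iE 'amd-vi|iommu'

If AMD-Vi lines show up there, the IOMMU is enabled and the Marvell issues described above are more likely to bite.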

Link to comment
35 minutes ago, John_M said:

I haven't looked at your diagnostics and I've only skimmed through the earlier posts in this thread but I can tell you that Marvell controllers are proving to be problematic for many people running 64-bit Linux operating systems, such as unRAID, and especially if they have IOMMU enabled.

 

Oh crud... Well, I guess that makes sense.

 

Is there a way to disable IOMMU?

Link to comment

Even if it is a controller issue you should replace the disks. What I should have said above is that every bit of parity plus every bit of all other disks must be read reliably in order to reliably reconstruct a missing disk. Disks that are "mostly OK" will not be able to reconstruct a disk that fails.
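To make that concrete, unRAID's single parity is a bitwise XOR across the data disks, so reconstructing a missing disk means XOR-ing parity with every surviving disk. A toy example, with single bytes standing in for whole disks:

# parity P = D1 ^ D2 ^ D3; the "missing" D2 is recovered as P ^ D1 ^ D3,
# which is only correct if every bit of P, D1 and D3 reads back cleanly
echo $(( (0xA5 ^ 0x3C ^ 0x0F) ^ 0xA5 ^ 0x0F ))   # prints 60, i.e. 0x3C, the missing byte

One bad bit on any of the surviving disks silently corrupts the corresponding bit of the rebuilt disk.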

Link to comment
9 minutes ago, trurl said:

Even if it is a controller issue you should replace the disks.

Agree, the controller is very likely a problem. Marvell controllers in general can be, and the 9230 seems to be the worst offender; the 9215 and 9235 seem to work better but are still not recommended. I would also replace those disks sooner rather than later, especially the ones with a failing end-to-end SMART attribute.
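For what it's worth, that attribute is easy to keep an eye on from the console with smartctl. A quick example, replacing sdX with the actual device (the attribute name can vary slightly between vendors, but it's usually 184 End-to-End_Error):

# list SMART attributes and pick out the end-to-end error counter
smartctl -A /dev/sdX | grep -i end-to-end

A non-zero raw value, or FAILING_NOW in the WHEN_FAILED column, means the drive should go.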

Link to comment
