
Drive write errors on 3 different drives in less than 2 weeks



Not sure if it's a cable issue, a power issue, or what, but I recently added 4x https://www.overclockers.co.uk/showproduct.php?prodid=CA-055-XG to my system, and since then I've had drive errors on 3 different drives. When the first happened I tried to rebuild from parity and got write errors again, so I replaced it with a new drive. For the second one I shut down, pulled the drive out, re-seated it, rebuilt from parity, and it was all fine. Today yet another drive is red-balling =/ Trying to look at a syslog just showed:

 

Jul  8 04:40:02 Tower syslogd 1.4.1: restart.
Jul  8 05:52:01 Tower kernel: mdcmd (30006): spindown 0
Jul  8 05:52:03 Tower kernel: mdcmd (30007): spindown 3
Jul  8 05:52:04 Tower kernel: mdcmd (30008): spindown 4
Jul  8 10:04:36 Tower kernel: mdcmd (30009): spindown 0
Jul  8 10:04:38 Tower kernel: mdcmd (30010): spindown 1
Jul  8 10:04:40 Tower kernel: mdcmd (30011): spindown 3
Jul  8 10:04:41 Tower kernel: mdcmd (30012): spindown 4
Jul  8 14:33:59 Tower kernel: mdcmd (30013): spindown 1
Jul  8 14:34:01 Tower kernel: mdcmd (30014): spindown 3
Jul  8 14:34:02 Tower kernel: mdcmd (30015): spindown 4

 

So I stopped the array and the drive didn't show up. I shut down, tried pulling and re-seating it, started it back up, stopped the array, and the disk was present again. I un-assigned the drive (as it's showing as disabled anyway due to write errors), started the array, stopped the array, re-assigned the drive, and started the array; the drive shows as having invalid data, so I am doing yet another rebuild.

 

What should I be looking for in the syslog?

 

I have no idea what is going wrong, but it seems strange it all started happening after installing those cages.
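For reference, the lines worth hunting for in a syslog like this are ATA exceptions, link resets, timeouts, and md write errors rather than the routine spindown entries. Here's a minimal sketch of the kind of grep to run; the log lines below are invented stand-in data, and on the real server you would point it at /var/log/syslog instead:

```shell
# Invented sample syslog showing what a failing-drive event typically looks like.
cat > /tmp/sample_syslog <<'EOF'
Jul  8 05:52:01 Tower kernel: mdcmd (30007): spindown 3
Jul  8 06:10:12 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
Jul  8 06:10:13 Tower kernel: ata5: hard resetting link
Jul  8 06:10:45 Tower kernel: md: disk2 write error
EOF

# Pull out the interesting lines, ignoring the harmless spindown entries.
grep -Ei 'error|exception|reset|timeout' /tmp/sample_syslog
```

If the grep comes back empty (as it would on the excerpt above with only spindown lines), the failure was likely logged before the last syslog rotation or reboot.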


That WAS the entire syslog =/ Obviously, since rebooting I have a new syslog with a lot more in it, though I don't know how useful that is. Some further information:

 

I've attached the drive list. Parity and cache are attached to the motherboard; drives 1-4 are attached to an AOC-SASLP-MV8.

 

Parity, cache, and drive 1 are in the first enclosure; drives 2, 3, and 4 are in the second.

 

Drive 4 failed first, then drive 1, now drive 2.

[Attachment: drives.png]


That is one very nice-looking drive cage! It has just what I have been looking for: an easy-to-clean lint screen! :-)

 

I think I am leaning toward a bad connection (or connections) to your drives, probably at the connectors in your nice new cages. With the RoHS initiatives, the old chemicals and lead that did such a great job of producing high-quality electronics are just not there any more. This has created two possibilities I can think of that may be an issue in your situation:

 

1.  Contaminants were not fully cleaned off the backplane PCB or connectors after assembly, or were flushed INTO the connectors during cleaning.

 

    This has been a real problem that can sometimes be seen by inspection with a good light. Surfaces should look clean with no streaking, and metal surfaces should be nice and shiny.

 

    This can be taken care of with some careful cleaning using a good-quality rubbing or denatured alcohol. Use a small 'acid' brush with short bristles for the best cleaning action. Clean carefully so as not to disturb any stickers or markings placed on the board or other assemblies, to avoid voiding any warranty.

 

    Another possible, less desirable solution would be to relocate any contaminants on the connections by repeatedly plugging and unplugging each connector. This allows the contact pressure to clean the contacting surfaces and restore a proper electrical connection. The downside is that, in time, the contaminants still present may migrate and cause problems again.

 

2.  Bad solder connections, due to the less forgiving acceptable temperature range of lead-free solder.

 

    This is not usually as much of a problem in a high-volume, controlled production environment, but connectors are often MANUALLY soldered in place after all other components have been wave-soldered. Again, a good light shining on the suspect areas should reveal whether there is a problem. Look for nice shiny solder with no irregular surfaces or variations in colour. Note that it is unlikely to look as nice as old leaded circuit boards did...

 

    If you see bad solder joints (AKA cold solder joints), you could try to fix them if you are good with a soldering iron... however, this will likely void any warranty you may have.


OMG! The rebuild worked (as far as I know), but when I got downstairs the web interface was unresponsive, so I rebooted the server and went to work. I just got home and looked at it: the drive that was rebuilt is green, but the port that had the original failure, the one I replaced the drive on, is now showing as red-balled >:(


*sigh* It might not have been the enclosures that were the issue. It was playing up again today, so I rebooted it, and now it won't boot; I just get the attached image. So it looks like the issue might be with the SASLP-MV8. I've tried re-seating it, putting it in a different slot, and re-seating the cables, but it just won't boot =/

[Attachment: saslp.png]


I have seen that behaviour with a bad cable (or one that is badly seated). It seems to happen when the system detects that a device appears to be present but then cannot get a sensible response from it. Having said that, there could well be other causes, such as a power problem, that might give similar symptoms.


I think in this particular case the issue may have been the drive on that port. Annoyingly, it was the WD Green drive I bought to replace a different drive. With the drive in, the system was making odd noises and wouldn't boot; with it removed, it booted fine.



I have found I have had a very high 'early life' failure rate with WD Green 3TB drives. Out of the 6 I bought, 3 failed early on, and then so did one of the replacement drives I got through the RMA process.

 

Another thing I noticed is that they seem to run significantly hotter (5-10 degrees C) under load than my 3TB Seagate drives, despite the WD drives being 5900 RPM and the Seagate ones 7200 RPM.
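For anyone wanting to compare temperatures the same way, SMART attribute 194 (Temperature_Celsius) reports the current reading in the last column of `smartctl -A` output. A minimal sketch of extracting it, using an invented stand-in line; on a live system with smartmontools installed you would run `smartctl -A /dev/sdX` for each drive instead:

```shell
# Invented sample of one smartctl -A attribute row (stand-in data).
cat > /tmp/smart_sample <<'EOF'
194 Temperature_Celsius     0x0022   118   098   000    Old_age   Always       -       32
EOF

# Attribute ID is in field 1; the current raw value (degrees C) is the last field.
awk '$1 == 194 {print $NF}' /tmp/smart_sample
```

Running that across all drives after a parity check gives a fair under-load comparison.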


Archived

This topic is now archived and is closed to further replies.
