Parity check interrupted by missing drives?



I recently added a couple of new drives to my array.  Since then, the new drives have been a bit troublesome: sometimes they do not show up in the array when first starting.  A quick reboot has always solved the problem.  This led me to believe that the drives had not finished spinning up in time to be detected by unRAID (this logic is likely flawed).

 

Yesterday, I started my weekly parity check and went about my business.  This morning, I took a look at the array's status to find that the two drives had disappeared again, with unRAID assigning them a RED status.  Without thinking, I clicked the reboot button on the web page.  I say "without thinking" as I should have copied the syslog first.  I just assumed that the drives had not been detected during the boot process (WRONG!).  After rebooting, one of the drives came back (went green).  The other still showed RED.  I rebooted again, and the drive stayed RED.  To troubleshoot, I checked all cable connections, verified SMART on the affected drive, and ran REISERFSCK on it.  No errors or corruption were reported.
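For reference, here is roughly what I ran from the console (a sketch only - the device names are examples, so substitute the actual device and slot for your system):

  # full SMART report for the suspect drive (replace /dev/sdX with the real device)
  smartctl -a /dev/sdX

  # read-only filesystem check on the unRAID array device for that slot
  # (/dev/md12 would be disk 12; run it while the disk is not otherwise in use)
  reiserfsck --check /dev/md12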

 

I suspect something happened during the parity check, as there was no summary showing.  When the two drives disappeared, the parity check must have stopped.  So the question is how to recover.  Is the parity data correct, or is the data on the affected drive correct?  If the parity data is correct, then I should execute the "Make unRAID Trust the Parity Drive..." routine.  If, however, the data drive is correct, then how does one force the parity drive to rebuild?

 

The affected system consists of the following:

 

unRAID 4.5b6 (stock) with CACHE-DIRS v 1.5

Intel D975XBX2, Intel E2180, 4GB RAM, Norco 4020 case with backplane, Corsair 620HX Power Supply, 2x Adaptec 1430SA Controllers

Parity Drive - WD20EADS

Cache Drive - Maxtor 250GB

Data Drives (13 drives) - Mix of WD10EACS, WD10EADS and WD20EADS

 

The affected drive is Drive 12 (WD10EADS)

 

Thanks and regards,  Peter

 

p.s. Happy Memorial Day!

 

edited to add additional hardware info


Once unRAID turns a disk red, it does not turn back to green by itself.  It is a bit surprising that you had two red drives and one of them turned green after a reboot, but thinking it through, I can see how that might happen.  The first disk it disabled, though, will stay disabled whether or not you fix the underlying problem.

 

If you are sure the disk is back online, I'd suggest trying the "Trust My Parity" procedure, which should get you up and running again.
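From memory, the gist of it on 4.x is below, but treat this as a rough sketch and follow the wiki write-up for the exact steps, since getting it wrong can cost data:

  # from the unRAID console, with the array STOPPED:
  initconfig        # resets the array configuration (super.dat)

  # then, in the web GUI: reassign every drive to its exact previous slot,
  # follow the wiki's steps to mark parity as already valid, start the array,
  # and run a parity CHECK (not a rebuild) as a sanity test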

 

Once you do, I'd try to figure out what is causing these particular drives to get dropped.  It could be an intermittent power connection, a bad or loose cable, or a bad backplane.

 

I had a problem with drives not getting recognized (by BIOS) on boot and determined that the problem was that my machine was booting too fast.  I turned OFF quickboot and never had the problem again.  YMMV


I used to have an nForce 4 board that, among other problems, would drop a disk controller along with all of its attached drives.  A reboot would bring them all back, but I would often have to rebuild parity.  In fact, that is a primary reason I so welcomed the Trust My Array procedure, and one of its stated use cases is based on my struggles.

 

Joe has had problems in the past with a defective power splitter that, I believe, would cause the temporary loss of the connected drives.  I would determine what is common to these two drives: disk controller? power cables? splitter? etc.  You did not mention which disk controllers are involved, besides the onboard ICH one.


bjp999 and RobJ,

 

The same two disks have turned red a couple of times (with a "missing" message), and every time a reboot has returned them to normal (green).  But as you've correctly identified, the first disk (disk 12) is now stuck on red.  Both drives, as well as the cache drive, are attached to the same Adaptec 1430SA (one of two in the system).  I don't suspect the controller itself, as the cache drive has never turned red.  Actually, can the cache drive even get a red status?
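For what it's worth, I confirmed which drives hang off which controller with something like this (a sketch; the by-path listing assumes your build exposes it):

  # map each SATA device back to its controller and port
  ls -l /dev/disk/by-path/

  # list the disk controllers themselves
  lspci | grep -i -e sata -e raid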

 

I think it is a cabling problem, for the following reason: the locking cables I use (from Monoprice) do not fit on the Adaptec because the ports are stacked.  To make them fit, I had to remove the metal locking mechanism.  Even so, the connectors are thick enough that they do not sit as well as I would like (they are kind of wedged in).  It could also be the backplane on the Norco, as these are the only two drives on it.  There is also only a single power cable for that row of drives, so I will have to check that out as well.  Lastly, I will change fast boot to normal, though the boot times will be very long.  I typically turn on the array only when watching or adding content.

 

As both of you recommend the Trust My Parity / Trust My Array procedure, I will give that a go after redoing the mechanical bits to try to eliminate the problem.

 

Thanks Gents!

 

Regards,  Peter


I moved the drives to another backplane and power connection on the Norco as well as used different SATA connections.  After running several parity checks, the problem has not resurfaced.  When I have time, I will try to figure out what the failure point was (SATA, backplane or power connection).

 

Regards,  Peter


Glad you are up and running again.

 

It is so easy to have a card not seat exactly right, a cable jiggle loose in the SATA port, a cable fail, or a power connection be a little loose - so many things can cause an intermittent problem during the VERY intensive I/O of a parity check.  The chances increase as the number of drives, controllers, cables, power splitters, etc. grows on a larger array.  You did just the right thing in trying different cables / ports / power connectors to bypass the problem, and it seems you are stable again.  This means that the drives themselves are likely fine.

Mucking further may not be such a good idea.  Unless you have a bad part that misbehaves consistently, my experience with these types of intermittent problems is that you're more likely to cause a different problem while rummaging through the case than to find the original one.  I'd probably run a few more tests to convince myself that all is stable for now.  Then, next time you do maintenance on your array (add or replace a disk), you'll have to rerun your stability tests and address any intermittent problems you may have caused.  That's life with a big array IMO.
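If you want a cheap stress test beyond repeated parity checks, a sequential read pass over every drive generates similar sustained I/O.  A rough sketch (the device range is an example - adjust it to match your drives):

  # read the first ~10GB of each drive and flag any that error out
  for d in /dev/sd[b-o]; do
      echo "reading $d"
      dd if=$d of=/dev/null bs=1M count=10000 || echo "PROBLEM on $d"
  done

  # watch the syslog in another session for resets or dropped drives
  tail -f /var/log/syslog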

 

For whatever reason, my biggest problems have been with SATA cables.  After having several problems with them, I bought a bunch of locking cables.  Not sure if it is coincidence or just dumb luck, but I have not had an intermittent issue since.  (knock on wood)

 

Enjoy your array!


At some point I am going to have to "muck" with it, as I am at the limit for drives and will be using the last backplane.  I am fairly confident that the issue is the connection at the Adaptec 1430SA.  Those darned stacked connectors without latching support really irk me.  I miss the Promise PCI cards that had four separate connectors supporting latching cables.  I'm keeping my eye out for a better PCI-E card...

 

Thanks and regards,  Peter

