Errors on rebuild.

August 20, 2013

Hi,

I have just upgraded my software and hardware at the same time (http://lime-technology.com/forum/index.php?topic=28716).

After issues with the new Unraid not liking my old network card, I have now moved to the new hardware. I checked that the thumb drive booted and could see the controllers etc., and all was fine until I moved the disks over to the new system. Three of the four disks (parity, disk 2 and disk 3) moved without issue; the fourth (disk 1) decided to let out the magic smoke and burned a hole in a chip on the PCB.

Not an issue, I thought, as the array was still seen fine, just missing one disk. I put a replacement disk in and started the rebuild, which stopped within 30 seconds with errors (128 of them) on disks 1 and 2. Disk 2 is running the SeaTools hard drive diagnostic as I type, but no errors as yet.

Any ideas or pointers?

August 20, 2013

Hi,

the 4th (disk 1) decided to let out the magic smoke and burned a hole in a chip on the pcb.

What does this mean? If you want help, you have to write in clear English exactly what the problem you encountered was. (Did you really have smoke rising out of your server and find, upon examining the hardware, that you have a destroyed chip on your motherboard?)

A full syslog would also help the experts who are not clairvoyant to assist you...

August 20, 2013

Author

Sorry for the lack of a syslog; the system is at home and I'm at work.

Yes, I did have smoke from the server. I checked the hard drives and found a very burnt-looking chip on the controller board of one of them. This drive is dead.

Over lunch I went home and swapped the drive I was rebuilding onto for another new one, which seems to have done the trick. The array is rebuilding, but at less than 2 MB/sec.

I don't have the full syslog here, but it's full of the errors below:

Aug 20 15:53:52 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Aug 20 15:53:52 Tower kernel: ata3.00: configured for UDMA/33

Aug 20 15:53:52 Tower kernel: ata3: EH complete

Aug 20 15:54:26 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Aug 20 15:54:26 Tower kernel: ata1.00: failed command: WRITE DMA EXT

Aug 20 15:54:26 Tower kernel: ata1.00: cmd 35/00:00:c7:5c:37/00:04:00:00:00/e0 tag 0 dma 524288 out

Aug 20 15:54:26 Tower kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

Aug 20 15:54:26 Tower kernel: ata1.00: status: { DRDY }

Aug 20 15:54:26 Tower kernel: ata1: hard resetting link

All the drives are SATA II (3 Gbps) and the controller cards are Adaptec 1420SA.

Bad cables or drive caddies?

Hope this helps clear it up.

Rick

August 20, 2013

Frank must not have heard of that comical way of referring to a dead drive! I suspect someone somewhere did a comic strip or skit on a simpleton explaining a dead drive - "here is a drive that works fine, uh oh now smoke is coming out of the drive, now the drive doesn't work, so in conclusion - obviously the drive would work if it had not lost its magic smoke!". Its current usage doesn't actually have to have visible smoke, just involve a dead drive.

We really do need a full syslog. That 2MB/s sounds bad and the syslog excerpt shows trouble, but an excerpt like that is rarely useful without the complete syslog, because almost always the only important error is the very first one. All others are probably side effects of the initial issue. And we cannot tell which drive or drives are having issues without a lot more info, all or most of which is in the syslog.

August 20, 2013

FYI, if it is important enough you might be able to salvage that fried drive. Buy the same make and model, swap the controller board, pull the data, swap the board back, profit. I did this to an old 80GB NTFS drive back in 2006 after a lightning strike and it worked like a charm. I had to use a data recovery tool but I was able to pull the files. I'm assuming there is an appropriate RFS tool for similar data recovery. But the main thing is getting the platters spinning and the heads moving.

August 20, 2013

PS: worst case, you've bought a new drive. Swapping the controller board back and forth should not compromise the new drive.

August 20, 2013

Author

I hadn't thought of somebody not having heard of the magic smoke. (http://en.wikipedia.org/wiki/Magic_smoke)

I did ponder trying to fix it with the magic smoke kit but nobody had stock! (https://www.sparkfun.com/products/retired/10622).

All joking aside, the failed drive was about 18 months old and I couldn't locate a replacement controller board from our stock (I work in a computer store).

I have attached the syslog, the current rebuild rate is:

Total size: 2 TB

Current position: 7.47 GB (0%)

Estimated speed: 1.2 MB/sec

Estimated finish: 27783 minutes

Thanks Rick

syslog.zip

August 20, 2013

You should not be getting those errors in the syslog - on a well behaved system there would be none at all. I strongly suspect that it is either a cabling problem, or possibly a power supply that is under-rated.

When the system is working correctly I would expect the rebuild speed to be at something like 40+MB/sec, and on a good system more than that (although it can vary with drive performance). Your low speeds will be because of the drives going through continual resets.

August 20, 2013

Sorry for the lack of a syslog, the system is at home and im at work.

Yes i did have smoke from the server. I checked the hard drives and have found a very burnt looking chip on its controller board. This drive is dead.

Hope this helps clear it up.

Rick

OK, I admit that I had never heard the term 'Magic Smoke' before. However, it is somewhat surprising that an 18-month-old drive would fail in this manner, with real smoke! Especially if it happened immediately when you first applied power to it. I would almost suspect that 12 volts ended up on a bus where only 5V should have been.

How much of the server upgrade is new gear and how much is 'recycled' from your old server?

I would look very carefully at any new drive trays and power connectors on any new power supply.

August 20, 2013

I would abort the rebuild, as it is clearly not working correctly, is likely to take a year to finish, and I wouldn't trust the result. DMA reads from the parity drive and DMA writes to Disk 1 are both repeatedly failing, with timeouts and resets. It always seems to recover, but the kernel has slowed both drives to their slowest speed, UDMA/33, and when you add the timeouts and reset times... There are no other errors directly implicating power or cables etc., but I'd say there is an incompatibility or misconfiguration of VMware somewhere. I've no experience to help you further with that; perhaps the VMware gurus can help. The fact that both drives are behaving exactly the same would indicate that it isn't the drives themselves at fault.

August 21, 2013

Author

OK, I have it. It's one of those things that wakes you up at 3am and you just have to work on!

First of all I thought it might be my removable caddies, so out came the hard drives, they got plugged directly into the controllers, and the rebuild was restarted. No change: rebuilding at under 1 MB/sec and errors in the logs.

In my sleepy state I had a crawl through the syslogs and noticed that the same ports (ata1 and ata3) were listed as resetting time and time again. Then it hit me: with 6 ports (3 per controller card) cabled and only 4 hard drives, what were the chances of plugging the same drives into the same cables when I removed the caddies?

I checked the first logs and the current ones to see which drives were plugged into ata1 and ata3: different drives from before, BUT the same errors, and they were also the only 2 drives plugged into that card.
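The port-by-port check described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the thread: the function name `count_port_errors` and the choice of which messages to count ("hard resetting link" and "failed command") are assumptions, and the sample lines are copied from the excerpt posted earlier.

```python
import re
from collections import Counter

# Matches kernel messages such as "ata1: hard resetting link" or
# "ata1.00: failed command: WRITE DMA EXT" and captures the port name.
PORT_EVENT = re.compile(
    r"kernel: (ata\d+)(?:\.\d+)?: (hard resetting link|failed command)"
)

def count_port_errors(lines):
    """Count reset and failed-command events per ata port (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = PORT_EVENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines taken from the syslog excerpt in this thread.
sample = [
    "Aug 20 15:53:52 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    "Aug 20 15:54:26 Tower kernel: ata1.00: failed command: WRITE DMA EXT",
    "Aug 20 15:54:26 Tower kernel: ata1: hard resetting link",
]

for port, hits in count_port_errors(sample).most_common():
    print(port, hits)
```

Running the same function over the full syslog (passing an open file object works, since it iterates line by line) should make it obvious when only one or two ports account for nearly all of the resets.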

A quick re-cable later, and this morning, after less sleep than I would have liked:

Total size: 2 TB

Current position: 276.38 GB (14%)

Estimated speed: 19.53 MB/sec

Estimated finish: 1471 minutes

It's not the 40+ MB/sec that itimpi suggested, but it's a lot better and in line with what I was getting on my old server. Once I have a full rebuild I might have to look at what might be slowing it down.
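As a sanity check on the figures above, the estimated finish time can be reproduced from the reported size, position and speed. This is a rough sketch that assumes decimal units (2 TB = 2000 GB, 1 GB = 1000 MB); the helper name `eta_minutes` is made up for illustration, and how Unraid itself computes the estimate is not stated in the thread.

```python
def eta_minutes(total_gb, position_gb, speed_mb_per_s):
    """Minutes left at a constant speed, assuming 1 GB = 1000 MB."""
    remaining_mb = (total_gb - position_gb) * 1000
    return remaining_mb / speed_mb_per_s / 60

# Figures after the re-cable: 276.38 GB done at 19.53 MB/sec.
print(round(eta_minutes(2000, 276.38, 19.53)))

# Figures from the earlier, failing rebuild: 7.47 GB done at 1.2 MB/sec.
print(round(eta_minutes(2000, 7.47, 1.2)))
```

The first call gives about 1471 minutes, matching the reported estimate. The second gives roughly 27,700 minutes, close to the reported 27783; the small difference is consistent with the displayed speed being rounded to one decimal place.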

How much of the server upgrade is new gear and how much is 'recycled' from your old server?

The only hardware recycled from the old server in this case was the removable drive caddies, the thumb drive (Unraid) and the drives. The case, motherboard, some memory and PSU (XFX 750W gaming PSU) were from my old "main" PC. The new hardware is some memory, the CPU (i7-3770) and the hard drive controller cards (Adaptec 1420SA x2).

I include the current syslog for 2 reasons: first, so somebody with more experience in these things can confirm the issue is resolved, and second, in case anyone can see anywhere in it where the performance is being limited.

Thank you all for your help.

syslog.zip

August 21, 2013

It's not the 40+ MB/sec that itimpi suggested, but it's a lot better and in line with what I was getting on my old server. Once I have a full rebuild I might have to look at what might be slowing it down.

hard drive controller cards (Adaptec 1420SA x2)

First place to look for the slowdown is probably right there. Maybe I'm not finding the right card description by googling, but what I found suggests that each controller card is only good for 1 modern drive before being speed limited by the PCI bus.

August 21, 2013

What model MB? All of the PCI slots share capacity on most MB. This results in enough capacity for one drive across all PCI slots. Those cards are PCI-X; if the MB has PCI-X slots then each slot will support four drives, and if they are PCI slots then they are all limited to a single drive.

Slot type     Disks per slot
PCI           1 (shared across all PCI slots)
PCI-X         6
PCI-e x1      1-2
PCI-e x4      5-8
PCI-e x8      10-16
PCI-e x16     20-32

Double these numbers for PCI-e 2.0.
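A rough way to see where disk counts like these come from is to divide a bus's theoretical peak bandwidth by the number of drives active at once. The sketch below is illustrative only: the bandwidth figures are published theoretical peaks rather than measurements, real throughput is noticeably lower, and it assumes every active drive shares the same bus. The names `BUS_PEAK_MB_S` and `per_drive_ceiling` are made up for this example.

```python
# Theoretical peak bandwidth in MB/s (nominal bus figures, not measurements).
BUS_PEAK_MB_S = {
    "PCI 32-bit/33 MHz": 133,
    "PCI-X 64-bit/133 MHz": 1066,
    "PCIe 1.0 x1": 250,
    "PCIe 1.0 x4": 1000,
}

def per_drive_ceiling(bus, active_drives):
    """Upper bound on MB/s per drive when all active drives share one bus."""
    return BUS_PEAK_MB_S[bus] / active_drives

# A rebuild touches every drive at once; with 4 drives on plain PCI
# the theoretical ceiling per drive is already low.
print(per_drive_ceiling("PCI 32-bit/33 MHz", 4))
```

With 4 drives on a shared 32-bit/33 MHz PCI bus this gives 33.25 MB/s per drive before any overhead, which is why a rebuild or parity check is hit much harder than a normal read from a single drive.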

August 21, 2013

Author

You're correct that these are PCI-X cards, though they are in normal PCI slots, which I knew would limit them to a point.

During the rebuild it's accessing all 4 drives at once, which I knew would cause a hold-up. But I would have thought that once the rebuild is done and it's running under normal conditions (90% read only), it should be a lot faster unless it is being written to (unless I get a cache drive?).

Could somebody recommend a card that does not cost the earth? I have PCI-e slots free to use.

Rick

August 21, 2013

I include the current syslog for 2 reasons, first so somebody with more experience in these things can confirm the issue is resolved and second if you can see anywhere in them that the performance is being limited.

Great! Once you moved everything off that one 1420SA, there are no issues at all showing. All drives were configured successfully at their top speeds.

So I have to agree with the others that the PCI bus has become your bottleneck, a serious one. I do agree with your reasoning that it only affects certain operations, like rebuilds and parity checks or syncs. Normal read performance should be fine, so long as you don't need simultaneous access.

August 23, 2013

Author

Not exactly the correct place for the question, but is there a list of controllers supported at the moment? I checked the wiki and it seemed very out of date.

I was looking at http://www.lsi.com/downloads/Public/MegaRAID%20SAS/MegaRAID%20SAS%208408E/MR_SAS_8408E_PB-Final.pdf

Is it Unraid compatible? And though it's not listed, I'm guessing it will take 3 TB drives?

August 23, 2013

There's a sticky in the hardware section of the forum.
