
preclear failed (syslog attached)



I had a preclear failure and I'm not sure how to interpret it or what to do next. Unfortunately, due to a screw-up while copying the screen report from the failure, I can't post it, but it was pointing to a problem with the MBR. I am attaching a zip of the log from the Syslog page in unMENU. It is reporting a whole bunch of errors on ata1. The drive was connected via SAS1 on a Supermicro AOC-SASLP-MV8, so I presume ata1 and SAS1 refer to the same interface.

 

I ran a SMART status report from unMENU and this is what I got:

----

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

----

 

I ran a short SMART self-test and got this:

---

smartctl -t short -d ata /dev/sdb 2>&1

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

 

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

 


---
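If it's useful, I can also try what the message suggests and re-run with the permissive flag. I haven't done this yet, so this is just what I'd plan to run:

smartctl -d ata -T permissive -a /dev/sdb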

 

syslog111107-417pm.zip


Basically, your disk stopped responding to commands. The SMART report is unable to communicate with the disk; it does not even find that /dev/sdb is a valid device.

 

Either you have a bad disk, or a bad connection to the disk (either power or data cable), or a bad drive tray, or a bad disk controller port, or a loose connection, or a defective/intermittent cable, or a power supply unable to properly feed all your drives.
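A quick way to see whether the kernel still sees the drive at all (adjust the device name if it has changed after a reboot):

ls -l /dev/sdb
tail -n 100 /var/log/syslog | grep -iE 'ata1|sdb'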

 

 


That narrows it down  :)

 

I rebooted the system and ran a SMART status report on the drive; I'm attaching it. The preclear failed in the post-read, so I guess it shows as precleared, though I won't use it until it passes an entire preclear cycle.

 

This drive was being precleared because of a network failure that left it with a corrupted file system (see this thread: http://lime-technology.com/forum/index.php?topic=16310.0).

 

Also today, during a copying operation, I had another failure where the PC stalled, indicating that "the network name is no longer available." I mention this because the preclear failure and both data copying problems occurred on drives connected to the same SAS port and cable from the Supermicro card. The other SAS port and cable have not had any problems, at least so far. Nor has any drive connected to the motherboard SATA ports.

 

I've got enough open slots to operate without the four slots connected to that SAS port, at least to test things a bit. But to do that I'd need to move drives around and re-assign them. I have no parity drive as of yet. Should I simply stop the array, unassign all the drives, move them around, and then reassign them? Will everything just come back together from an unRAID point of view? That would at least allow me to see if replacing the card and cable might solve the problem.

 

Thanks.

 

Smart_Status_sdb.txt


That narrows it down  :)

I know... but I probably left out a few possibilities...  ;)

 

I rebooted the system and ran a SMART status report on the drive; I'm attaching it. The preclear failed in the post-read, so I guess it shows as precleared, though I won't use it until it passes an entire preclear cycle.

If it failed in the post-read, it is probably NOT cleared.  You can check with the

preclear_disk.sh -t /dev/sdX

command.
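For your drive that would be:

preclear_disk.sh -t /dev/sdb

It should report whether the preclear signature is actually present on the disk.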

This drive was being precleared because of a network failure that left it with a corrupted file system (see this thread: http://lime-technology.com/forum/index.php?topic=16310.0).

 

Also today, during a copying operation, I had another failure where the PC stalled, indicating that "the network name is no longer available." I mention this because the preclear failure and both data copying problems occurred on drives connected to the same SAS port and cable from the Supermicro card. The other SAS port and cable have not had any problems, at least so far. Nor has any drive connected to the motherboard SATA ports.

 

I've got enough open slots to operate without the four slots connected to that SAS port, at least to test things a bit. But to do that I'd need to move drives around and re-assign them. I have no parity drive as of yet. Should I simply stop the array, unassign all the drives, move them around, and then reassign them? Will everything just come back together from an unRAID point of view? That would at least allow me to see if replacing the card and cable might solve the problem.

 

Thanks.

 

 

I've posted many times before about the risks inherent in loading an array with no parity disk assigned. Your problems sound far worse than I described earlier. You have random failures on the network and on disks. You have no way to detect data corruption unless you are using checksums when transferring the data and then verifying the data after clearing the disk buffer cache. (An immediate verify is nearly useless, since the data will probably be read from the RAM buffer cache, not from the physical disk.)
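If you do want to verify a batch of copies end-to-end, one rough approach (all paths here are only placeholders) is:

cd /path/to/originals && md5sum * > /tmp/source.md5   # checksum the source files
sync                                                  # flush pending writes
echo 3 > /proc/sys/vm/drop_caches                     # drop the RAM buffer cache so the next reads come from the physical disks
cd /mnt/disk1/copies && md5sum -c /tmp/source.md5     # re-read the copies from disk and compare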

Joe,

 

I've taken some time to consider what you said and to develop a plan of action. I'd appreciate it if you could comment on my approach.

 

First, here’s a listing of the hardware on my system:

 

PSU: Corsair CMPSU-750HX

Motherboard: Gigabyte GA-MA785G-UD3H (this board has an onboard Realtek 8111C NIC. I don't know if this is a problem, since I've seen some discussion about flaky network interfaces with Realtek NICs. I didn't notice any problems in the log, but then I don't know what to look for)

CPU: AMD Athlon II X2 240 Regor 2.8 GHz

RAM: A-DATA 2x1GB 240-Pin DDR2 SDRAM DDR2 800

Case: Centurion RC-590-KKN1-GP

Cages: 3 Athena Power BP-SATA3051B

SATA Card: Supermicro AOC-SASLP-MV8 8-Port SAS/SATA (plugged into PCIX4_1)

SATA Card: Syba (Monoprice) SATA2 Serial ATA PCI-e Controller Card (Silicon Image SIL3132) (plugged into PCIEX1_1)

SAS Cable: 2 LSI/3Ware CBL-SFF8087OCF Discrete Forward Breakout Cable

 

5 WD20EARS 2 TB Drives (4 currently in the array)

2 WD15EADS 1.5 TB Drives (2 currently in the array)

1 WD20EARS 2 TB Drive (not currently mounted or precleared)

1 WD15EADS 1.5 TB Drive (not currently mounted or precleared)

1 Hitachi Deskstar 7K3000 (mounted and precleared but not in array, intended for the parity disk)

1 WD640AAKS (mounted and precleared but not in array, intended for cache disk)

 

To recap, I had two instances where a copy operation from my PC to my unRAID server stalled and wound up leaving some sort of "ghost" directories on the unRAID server. In one case an rm -rf command removed the directory; in the other it wouldn't. These occurred on separate drives, but both were connected to SAS1 on the Supermicro card. I pulled the drive whose ghost directory couldn't be removed, copied the valid data on it back to my PC, removed it from the array, and, while leaving it in the same slot connected to SAS1, attempted to preclear the drive again. It failed.

 

That's where we were as of my last post. Next, I removed the drive that wouldn't preclear from the tray attached to SAS1 and attempted to clear it in a tray connected to SAS2, which hadn't given me any errors up to that point. If it cleared, I would have guessed that the problem was most likely either the Supermicro card's SAS1 port, the cable connecting to that port, or the drive itself.

 

I started the preclear on an SAS2 port and watched the log (I wish I had saved it, but I didn't); at any rate, it started throwing errors almost immediately. So I cancelled the operation and moved the drive to a tray connected to one of the motherboard SATA ports. The disk ran a full preclear cycle and precleared successfully on that port.

 

Next I ran a file system check (using reiserfsck --check /dev/mdX) on all six of the drives currently in the array to see whether any of them had corrupted file structures. All six disks indicated no problems (report attached). There was one disk where reiserfsck reported "Data block pointers 305974565 (87 of them are zero)"; on all the other disks it said "(0 of them are zero)". I don't know if this is a problem, but I note it because it appeared on only one disk. If that's not a problem, then reiserfsck is reporting that all is well.
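For anyone following along, the per-disk check looks roughly like this, repeated for md1 through md6 (saving the output to the flash drive is optional):

reiserfsck --check /dev/md1 2>&1 | tee -a /boot/Reiserfsck.txt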

 

That is the current status of the system. Here is my plan for going forward.

 

Since I’ve now confirmed problems on both SAS1 and SAS2, at least as far as running preclear operations, I’ve ordered a new Supermicro card. It is due to arrive this evening. I’m going to replace my old Supermicro card with this new one. Once I get it installed and working:

 

- I'm going to run preclear operations on two new drives I have on hand, putting one in a tray attached to SAS1 and one in a tray attached to SAS2 (see the sketch after this list).

- Assuming this goes well, I’ll add the Hitachi Deskstar drive to the array as a parity drive.

- Once that operation is complete, I’ll add the precleared drives to the array and continue loading data from my PC to the array.

- If and when that process completes successfully, I’ll add a cache drive.
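Roughly what I expect to run for those two preclears, each in its own telnet session (the device letters are only placeholders until I see how the drives enumerate):

preclear_disk.sh /dev/sdX    # new drive in a tray on SAS1
preclear_disk.sh /dev/sdY    # new drive in a tray on SAS2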

 

If any of this fails, then it's back to the drawing board.

 

Of course, it could be some other component in the system causing problems, but this should at least narrow things down. Some sources of trouble seem unlikely. It's probably not the SAS breakout cables, since I observed preclear failures on both cables. It's probably not SATA power, since I have functioning drives in all three cages and power is provided on a cage-wide basis. It seems unlikely that it's a tray, since I've had problems in more than one tray.

 

As I indicated above, I have a Realtek NIC on the server (RTL8111C). I also have a Realtek NIC on the PC side (RTL8111D). Could one or both of these be a source of the problem? I sometimes see network stalls on the PC side, but those involve going out to the internet. I've always assumed that was due to "hiccups" on the ISP side, but perhaps it is some sort of timeout on my PC's network card. Out of an abundance of caution I've ordered an Intel PCI NIC that I plan to put in the PC, though I could also put it in the unRAID server if the RTL8111C is judged to be problematic with unRAID. (I'm attaching a syslog from a fresh boot of the system, which might help indicate whether there IS some sort of network problem, though I don't see one.)
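As a quick server-side sanity check, I figure I can at least look at the interface error counters on the onboard Realtek (assuming the interface is eth0):

ifconfig eth0    # look at the RX/TX "errors" and "dropped" counters in the output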

 

I'd appreciate some feedback specifically on these NICs: whether people think they are problematic, and whether there is a simple way to test them. I'd also appreciate any comments or suggestions on my plan of attack for ridding the system of the problems I've been having.

 

Thanks

syslog-2011-11-09.txt

Reiserfsck.txt


