2 failed drives



Need some help here. I am running version 4.5.6.

Two days ago a 2TB drive (1736 in slot 9) spun down by itself and showed a blinking green ball next to it. I rebooted the server, and after the reboot the drive had a red ball next to it. I got a new drive, precleared it on a different server, then replaced the bad drive and started the rebuild. One day into the rebuild I noticed the speed was 14KB/s, barely crawling. The drive in slot 2 had numerous errors. I refreshed the page and the server became unresponsive, so I hard rebooted it.

This time another drive (PHBV in slot 7) turned red and was reported as missing. I powered the server down, re-seated the cables and powered the server on. No change.

I took both drives out and put them into my other server. The PHBV is completely dead; perhaps the controller board died. The 1736 I can mount, but a short SMART test shows 1700 sectors pending reallocation.

I decided to put the 1736 back into the original server and try to rebuild the PHBV drive, but now slot 9 wants the 1007 drive, which is the replacement that I bought for the 1736 in slot 9.

I attached the Disk Status page. Any advice on how I can rebuild the PHBV drive now and then rebuild the 1736 drive? Probably wishful thinking. In that case, how can I copy the data from this drive? If I am able to mount it, I should be able to copy. I just can't figure out how. As for the other drive, perhaps the controller got fried and I may be able to revive it...

Did not capture the system log originally, unfortunately.

Link to comment


 

Quite a mess.  There may be a way out, though.  

 

Before beginning, though, I would caution you that failures of this type are not common.  I would suspect a common cause, like a bad PSU, that could physically damage your drives.  You might have just been unlucky, but I would try to rule out hardware problems that could cause further drive failures.

 

The following procedure will work if one drive has failed but all of the other drives in the array are intact and consistent with each other, and unRAID is telling you that one or more disks are misassigned (like your disk #9).

 

If you did any writing to your array since trying to rebuild 1736, then the drive may not rebuild perfectly, but you may still get a partial recovery.  You may have to run reiserfsck on the rebuilt disk AFTER recovery if the drive initially reports errors.

 

You will need a good 2T drive to rebuild onto.  It needs to be installed in the computer.

 

The first thing you need to do is back up the config directory on your flash drive.  You want to preserve that in case you need to get back to this point.
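If you would rather script that backup than copy it by hand, here is a minimal sketch. It assumes the flash drive is mounted at /boot, so the configuration sits in /boot/config (check this on your own server), and that Python is available on whichever machine does the copy. The function name and the destination folder are made up for illustration.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_dir, backup_root):
    """Copy the whole config directory into a new timestamped folder."""
    src = Path(config_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"config directory not found: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"config-backup-{stamp}"
    # copytree refuses to write into an existing folder, so an earlier
    # backup is never overwritten by accident.
    shutil.copytree(src, dest)
    return dest
```

A call such as `backup_config("/boot/config", "/boot/backups")` would leave a dated copy next to the original; both paths are assumptions, so adjust them to where your flash drive is really mounted.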

 

Next, press the restore button (I think 4.5.6 still had the restore button; if not, go to a telnet prompt and enter the command "initconfig", then answer "Yes").  To activate the restore button, I think there is a checkbox you need to click first.

 

Ok - so now if you refresh your Web GUI, all of the drives should be BLUE.  Slot 7 should be empty.

 

Go to the devices page and assign your NEW 2T disk to slot 7.

 

The main page will now show all disks blue, including slot 7.  IF ANY DISKS ARE MISSING, DO NOT PROCEED.

 

Now, go to a telnet prompt and enter this command:

 

mdcmd set invalidslot 7

 

This tells unRAID to rebuild the disk in slot 7 when the array starts.  Normally unRAID will "rebuild" the disk in slot 0 (i.e., build parity) when you start a newly configured array.  This command tells it that parity is good and to "rebuild" disk7 instead.

 

After entering the command you should get a response similar to this ...

 

cmdOper=set

cmdResult=ok

 

Now go back to the Web GUI, refresh one more time (I don't think it will be any different - but take this final opportunity to make sure that everything is ready to rebuild disk7).

 

Press the start button.

 

As you refresh the browser, you should see the write count on disk7 increasing, and the read counts on the other disks increasing.  If not, stop the array immediately.

 

Now pray that your array holds together long enough to rebuild the disk!

 

Good luck!!!

Link to comment

bjp999, thank you so much for the detailed response. I followed the steps and so far so good... I think.

The writes on drive 7 are increasing, and the reads on all other drives are increasing. Though, I do see some minor write increases on other drives as well. The speed started off at 16,000KB/s, then went down to 200KB/s and then ramped back up to 19,500KB/s... I am letting it run, and let's see what happens.

Link to comment

The speed dropped to 2,00KB/s... then back up to 6,000KB/s... keeps jumping up and down every time I refresh the

browser. There is one error on disk 1.

 

As I said in my prior post, there may be something more going on here than just a couple of failed disks.  It could be your PSU flaking out, a disk controller, or the motherboard.  I am a bit worried for you.  Keeping my fingers crossed.

Link to comment

I bet you are right about the hardware. But I have no one but myself to blame. I have been having spates of problems with this server from time to time, like once a year, but then it stabilizes and I keep delaying the hardware upgrade. A while ago, after some troubleshooting, I discovered that my mobo (P5BVMDO) all of a sudden refused to work with the RAM that was installed. Eventually I replaced the RAM but left only one stick, as with two sticks the system would not boot. As soon as I recover the data I will upgrade the hardware. Perhaps I should have done that before following the steps that you outlined.

Link to comment


 

You will be able to try this with another motherboard and controllers.  What worries me is that you are seeing large numbers of reallocated sectors.  The drive itself does reallocations; the computer and controllers are not involved.  Usually this is a sure sign of a bad drive.  But if this is happening on multiple drives, then things like bad PSUs and bad controllers come to mind.

 

You can always stop the process and try again once you've upgraded your hardware.

Link to comment

Good to know that I can repeat the process with the new hardware.

As far as the reallocated sectors go, I checked every drive and I see 5 drives with reallocated sectors, ranging from 1 to 5. The drive that failed first, which I checked later, had 1700 sectors pending reallocation but no reallocated sectors.

Link to comment

The rebuild has not been going too well guys. Seems like Disk 1 is also having problems.

I captured current syslog and it's pretty much all red. Can someone please take a look

and give me a general idea what seems to be the problem?

 

 

So, I have new hardware on the way. Should I just stop now and perform the rebuild

on the new hardware or let this one finish? I am afraid though that something else

will break by the time it's done.

 

The speed goes up to 14000-23000KB/s briefly but stays below 500KB/s most of the time. Overnight the progress went from 32.3% to 39% today. At this rate it will take it a month to complete...
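For a rough feel of the remaining time, the two percentages can be turned into an estimate. This is only a sketch: the 12 hours standing in for "overnight" is a guess, and the speed is jumping around so much that the real figure could be very different.

```python
def hours_remaining(start_pct, end_pct, elapsed_hours):
    """Estimate hours left, assuming the rebuild keeps its recent average pace."""
    gained = end_pct - start_pct
    if gained <= 0 or elapsed_hours <= 0:
        raise ValueError("need positive progress over a positive time span")
    rate = gained / elapsed_hours  # percentage points per hour
    return (100.0 - end_pct) / rate


# 32.3% and 39% are the figures above; 12 hours is an assumed value for "overnight".
eta = hours_remaining(32.3, 39.0, 12.0)
print(f"about {eta:.0f} hours ({eta / 24:.1f} days) to go")
```

With those assumed inputs the estimate comes out in days rather than weeks, but it only holds if the disk errors do not slow things down further.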

 

 

Feb 9 07:14:52 Tower kernel: ata11: hard resetting link

Feb 9 07:14:53 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 07:14:53 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 07:14:53 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 07:14:53 Tower kernel: ata11: EH complete

Feb 9 07:15:23 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 07:15:23 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 07:15:23 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 07:15:23 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 07:15:23 Tower kernel: ata11.00: status: { DRDY }

Feb 9 07:15:23 Tower kernel: ata11: hard resetting link

Feb 9 07:15:25 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 07:15:25 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 07:15:25 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 07:15:25 Tower kernel: ata11: EH complete

Feb 9 07:15:55 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 07:15:55 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 07:15:55 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 07:15:55 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 07:15:55 Tower kernel: ata11.00: status: { DRDY }

Feb 9 07:15:55 Tower kernel: ata11: hard resetting link

Feb 9 07:15:56 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 07:15:56 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 07:15:56 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 07:15:56 Tower kernel: ata11: EH complete

Feb 9 07:16:27 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 07:16:27 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 07:16:27 Tower kernel: ata11.00: cmd 25/00:00:27:d9:0c/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 07:16:27 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 07:16:27 Tower kernel: ata11.00: status: { DRDY }

Feb 9 07:16:27 Tower kernel: ata11: hard resetting link

 

 

..........................

 

Feb 9 08:01:16 Tower kernel: handle_stripe read error: 1529839464/1, count: 1

Feb 9 08:01:16 Tower kernel: md: disk1 read error

Feb 9 08:01:16 Tower kernel: handle_stripe read error: 1529839472/1, count: 1

Feb 9 08:02:02 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:02:02 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:02:02 Tower kernel: ata11.00: cmd 25/00:00:b7:9e:2f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:02:02 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:02:02 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:02:02 Tower kernel: ata11: hard resetting link

Feb 9 08:02:02 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:02:02 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:02:02 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:02:02 Tower kernel: ata11: EH complete

Feb 9 08:04:26 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:04:26 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:04:26 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:04:26 Tower kernel: ata11.00: cmd 25/00:00:57:02:31/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:04:26 Tower kernel: res 51/40:2f:21:03:31/00:03:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:04:26 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:04:26 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:04:26 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:04:26 Tower kernel: ata11: EH complete

Feb 9 08:10:53 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:10:53 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:10:53 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:10:53 Tower kernel: ata11.00: cmd 25/00:00:af:15:79/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:10:53 Tower kernel: res 51/40:7f:26:19:79/00:00:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:10:53 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:10:53 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:10:53 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:10:53 Tower kernel: ata11: EH complete

Feb 9 08:19:04 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:19:04 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:19:04 Tower kernel: ata11.00: cmd 25/00:00:27:04:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:19:04 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:19:04 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:19:04 Tower kernel: ata11: hard resetting link

Feb 9 08:19:05 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:19:05 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:19:05 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:19:05 Tower kernel: ata11: EH complete

Feb 9 08:19:36 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:19:36 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:19:36 Tower kernel: ata11.00: cmd 25/00:00:27:04:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:19:36 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:19:36 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:19:36 Tower kernel: ata11: hard resetting link

Feb 9 08:19:36 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:19:36 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:19:36 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:19:36 Tower kernel: ata11: EH complete

Feb 9 08:20:52 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:20:52 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:20:52 Tower kernel: ata11.00: cmd 25/00:00:27:0d:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:20:52 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:20:52 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:20:52 Tower kernel: ata11: hard resetting link

Feb 9 08:20:53 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:20:53 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:20:53 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:20:53 Tower kernel: ata11: EH complete

Feb 9 08:21:23 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:21:23 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:21:23 Tower kernel: ata11.00: cmd 25/00:00:27:0d:9f/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:21:23 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:21:23 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:21:23 Tower kernel: ata11: hard resetting link

Feb 9 08:21:24 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:21:24 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:21:24 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:21:24 Tower kernel: ata11: EH complete

Feb 9 08:23:58 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:23:58 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:23:58 Tower kernel: ata11.00: cmd 25/00:00:27:02:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:23:58 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Feb 9 08:23:58 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:23:58 Tower kernel: ata11: hard resetting link

Feb 9 08:23:59 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:23:59 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:23:59 Tower kernel: ata11.00: device reported invalid CHS sector 0

Feb 9 08:23:59 Tower kernel: ata11: EH complete

Feb 9 08:25:03 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:25:03 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:25:03 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:25:03 Tower kernel: ata11.00: cmd 25/00:00:27:22:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:25:03 Tower kernel: res 51/40:ff:26:23:bf/00:02:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:25:03 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:25:03 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:25:03 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:25:03 Tower kernel: ata11: EH complete

Feb 9 08:25:20 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:25:20 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:25:20 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:25:20 Tower kernel: ata11.00: cmd 25/00:00:27:22:bf/00:04:5b:00:00/e0 tag 0 dma 524288 in

Feb 9 08:25:20 Tower kernel: res 51/40:9f:78:22:bf/00:03:5b:00:00/e0 Emask 0x9 (media error)

Feb 9 08:25:20 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:25:20 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:25:20 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:25:20 Tower kernel: ata11: EH complete

Feb 9 08:32:12 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:32:12 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:32:12 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:32:12 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:32:12 Tower kernel: res 51/40:27:7f:cf:2d/00:01:5c:00:00/e0 Emask 0x9 (media error)

Feb 9 08:32:12 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:32:12 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:32:12 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:32:12 Tower kernel: ata11: EH complete

Feb 9 08:32:35 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Feb 9 08:32:35 Tower kernel: ata11.00: irq_stat 0x40000001

Feb 9 08:32:35 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:32:35 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:32:35 Tower kernel: res 51/40:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x9 (media error)

Feb 9 08:32:35 Tower kernel: ata11.00: status: { DRDY ERR }

Feb 9 08:32:35 Tower kernel: ata11.00: error: { UNC }

Feb 9 08:32:35 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:32:35 Tower kernel: ata11: EH complete

Feb 9 08:33:05 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:33:05 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:33:05 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:33:05 Tower kernel: res 40/00:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x4 (timeout)

Feb 9 08:33:05 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:33:05 Tower kernel: ata11: hard resetting link

Feb 9 08:33:06 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:33:06 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:33:06 Tower kernel: ata11: EH complete

Feb 9 08:33:36 Tower kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Feb 9 08:33:36 Tower kernel: ata11.00: failed command: READ DMA EXT

Feb 9 08:33:36 Tower kernel: ata11.00: cmd 25/00:a8:07:cd:2d/00:03:5c:00:00/e0 tag 0 dma 479232 in

Feb 9 08:33:36 Tower kernel: res 40/00:37:75:cf:2d/00:01:5c:00:00/e0 Emask 0x4 (timeout)

Feb 9 08:33:36 Tower kernel: ata11.00: status: { DRDY }

Feb 9 08:33:36 Tower kernel: ata11: hard resetting link

Feb 9 08:33:37 Tower kernel: ata11: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

Feb 9 08:33:37 Tower kernel: ata11.00: configured for UDMA/33

Feb 9 08:33:37 Tower kernel: ata11: EH complete

 

Total lines: 3000
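With 3000 lines of this, a tally is easier to read than the raw log. Below is a small sketch that counts failed-command results per ATA port and splits timeouts from media errors. It assumes the syslog lines look like the excerpt above, where the "res" line follows the line naming the port; the function name is made up.

```python
import re
from collections import Counter

PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: ")
RESULT = re.compile(r"kernel:\s+res .* Emask 0x[0-9a-f]+ \(([^)]+)\)")


def summarize(lines):
    """Count failed-command results, keyed by (port, reason)."""
    counts = Counter()
    port = "unknown"
    for line in lines:
        m = PORT.search(line)
        if m:
            port = m.group(1)  # remember which port the following lines belong to
            continue
        m = RESULT.search(line)
        if m:
            counts[(port, m.group(1))] += 1
    return counts


sample = """\
Feb 9 08:02:02 Tower kernel: ata11.00: failed command: READ DMA EXT
Feb 9 08:02:02 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 9 08:04:26 Tower kernel: ata11.00: failed command: READ DMA EXT
Feb 9 08:04:26 Tower kernel: res 51/40:2f:21:03:31/00:03:5b:00:00/e0 Emask 0x9 (media error)
""".splitlines()

for (port, reason), n in sorted(summarize(sample).items()):
    print(port, reason, n)
```

Feeding it the full syslog (for example `summarize(open("syslog.txt"))`, with a file name of your choosing) would show whether every error sits on ata11 or whether other ports are affected too.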

Link to comment

Can you pull a SMART report on disk1?  Are the errors on disk1 continuing to climb by the second, or were there a bunch of errors and now the count is not increasing?  Normally errors in the error column indicate an unreadable sector on a disk.  What unRAID would normally do is read all of the other disks in the array to reconstruct the block, and write the block back to the target disk.  Of course we aren't seeing write activity to disk1, so I am assuming it is not handling these errors (which is a good thing in this situation, I believe).  Some of the syslog errors look like bad cable issues, but I am assuming that cabling is not the problem.  It could be BOTH a failing disk and a bad cable, though.  Just don't know.
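To see which sectors are unreadable, the hex string on the "cmd" and "res" lines can be decoded. This sketch relies on the libata taskfile layout as I understand it (command or status, feature or error, sector count, three low LBA bytes, then the high-order bytes, then the device byte), so treat the decoding as an assumption. One sanity check from the posted log: the decoded sector count times 512 matches the "dma 524288" on the same line.

```python
import re

# Byte order assumed for the taskfile string:
#   CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
# (on a "res" line the first two bytes are status and error instead).
TASKFILE = re.compile(r"\b(?:cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_taskfile(line):
    """Return (lba, sector_count) from a libata 'cmd' or 'res' line (48-bit LBA)."""
    m = TASKFILE.search(line)
    if not m:
        raise ValueError("no taskfile found in line")
    b = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    _, _, nsect, lbal, lbam, lbah, _, hob_nsect, hob_lbal, hob_lbam, hob_lbah, _ = b
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return lba, hob_nsect << 8 | nsect


cmd = "Feb 9 08:04:26 Tower kernel: ata11.00: cmd 25/00:00:57:02:31/00:04:5b:00:00/e0 tag 0 dma 524288 in"
res = "Feb 9 08:04:26 Tower kernel: res 51/40:2f:21:03:31/00:03:5b:00:00/e0 Emask 0x9 (media error)"

start, count = decode_taskfile(cmd)
bad_lba, _ = decode_taskfile(res)
print(f"read of {count} sectors from LBA {start}; drive reported an error at LBA {bad_lba}")
```

Under that reading, the failing LBA falls inside the range the read asked for, which fits a genuinely unreadable sector rather than a cabling glitch for that particular error.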

 

Can you access disk7?  (I'm not sure, when doing this type of operation, if the disk being reconstructed can be accessed, but I think it probably can.)  If so, you might want to look at some of the earliest written files on the disk and see if they look good.  The new disk7 seems pretty happy.

 

If the errors are more or less continuous, I would consider stopping it.  But, on the other hand, if the errors are sporadic, you might let it keep going.  This may be your only chance to reconstruct what you can.  Hard to advise without being there and observing.  Maybe Joe L., RobJ, or one of the other experts here would have some other suggestion.

Link to comment

The errors next to disk 1 are increasing. Initially the count went to 27000 and then stopped. Later, when I came home, the count was 47K. It was about the same this morning, and then it increased. Currently the count is at 49,600. So, I would say it is sporadic.

 

Here is the smart test report on Disk 1:

 

 

 

Statistics for /dev/sdl 00R_WD-WCAVY0252674

 

smartctl -a -d ata /dev/sdl

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen

Home page is http://smartmontools.sourceforge.net/

 

=== START OF INFORMATION SECTION ===

Device Model:     WDC WD20EADS-00R6B0

Firmware Version: 01.00A01

User Capacity:    2,000,398,934,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:   8

ATA Standard is:  Exact ATA specification draft version not indicated

Local Time is:    Wed Feb  9 09:57:47 2011 EST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x85) Offline data collection activity

was aborted by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  41) The self-test routine was interrupted

by the host with a hard or soft reset.

Total time to complete Offline

data collection: (40800) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (   2) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (   5) minutes.

SCT capabilities:       (0x303f) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

 1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       204446

 3 Spin_Up_Time            0x0027   149   148   021    Pre-fail  Always       -       9541

 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       75

 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0

 9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12088

10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0

11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0

12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53

192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7

193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67

194 Temperature_Celsius     0x0022   127   114   000    Old_age   Always       -       25

196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0

197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1374

198 Offline_Uncorrectable   0x0030   199   196   000    Old_age   Offline      -       394

199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0

200 Multi_Zone_Error_Rate   0x0008   025   001   000    Old_age   Offline      -       35182

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline       Interrupted (host reset)      90%     12088         -

# 2  Short offline       Interrupted (host reset)      90%     12085         -

# 3  Short offline       Completed: read failure       10%     12023         461689921

# 4  Short offline       Completed: read failure       10%     12023         551849266

# 5  Short offline       Completed without error       00%      6976         -

# 6  Short offline       Completed without error       00%      5520         -

# 7  Short offline       Completed without error       00%      4787         -

# 8  Short offline       Completed without error       00%      4761         -

# 9  Short offline       Completed without error       00%      4711         -

#10  Short offline       Completed without error       00%      4710         -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

   1        0        0  Not_testing

   2        0        0  Not_testing

   3        0        0  Not_testing

   4        0        0  Not_testing

   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
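To pick the attributes that matter out of a report like this one, a few lines of Python are enough. The sketch assumes the ten-column attribute table printed by smartctl above and watches only IDs 5, 197 and 198 (reallocated, pending and offline-uncorrectable sectors); the function name is made up.

```python
import re

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ATTR = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)
WATCH = {5, 197, 198}


def worrying_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in report.splitlines():
        m = ATTR.match(line)
        if m and int(m.group(1)) in WATCH and int(m.group(6)) > 0:
            found[m.group(2)] = int(m.group(6))
    return found


report = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1374
198 Offline_Uncorrectable   0x0030   199   196   000    Old_age   Offline      -       394
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""
print(worrying_attributes(report))
```

For this disk it flags the 1374 pending sectors and the 394 offline-uncorrectable ones, even though the overall health line still says PASSED.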

 

 

Link to comment

 


 

 

 

Yes, last night I was able to access Disk 7. Though it was extremely slow. After clicking on the drive it took several minutes

before the directory under the drive appeared.

Link to comment

Would it maybe make sense to stop the rebuild, upgrade the hardware and then simply copy the files off Disk 1 and Disk 7 to another server? I suppose at this point I don't care how clumsy or time-consuming the process will be; the only thing I want is to not lose any, or much, of my data. Desperately need some expert advice.

Link to comment


 

Disk1 is a "real" disk.  You can try to copy its contents to another drive on a workstation (I would not recommend writing anything to the array right now).
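When copying from a disk with bad sectors, an ordinary copy tends to stop at the first read error. A sketch of a copy that keeps going and notes what it could not read is below. It assumes the failing disk is already mounted read-only somewhere and that Python is available on that machine; the function name and any paths you pass in are your own.

```python
import os
import shutil


def salvage_copy(src_root, dst_root):
    """Copy every file it can from src_root to dst_root; return the paths that failed."""
    failed = []

    def note_dir_error(exc):
        failed.append(exc.filename)  # a directory that could not be listed

    for dirpath, _dirnames, filenames in os.walk(src_root, onerror=note_dir_error):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError:
                failed.append(src)  # read error: record it and move on
    return failed
```

A file that fails partway may leave a truncated copy at the destination, so anything in the returned list should be treated as lost or retried, not trusted.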

 

You could remove disk1 from the server, mount it in an external USB enclosure, and attach it to a workstation computer.  There is a reiserfs program you can run that will give you read-only access to the disk.  If you fear your unRAID computer is failing, this would be a way to take the unRAID server out of the equation.  If the copy works smoothly, you may be justified in doubting the server hardware.  But if it is slow and problematic, you have confirmed the disk really is bad.

 

Disk7 is impossible to rebuild without disk1 in the array.  If you were able to get disk1 working well in the step above, it would mean that your unRAID server is suspect and you should upgrade it.  Once the server is rebuilt, you can retry the disk rebuild.
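The reason is easiest to see with a toy example. With a single parity disk, parity is the XOR of all the data disks, so one missing disk can be recomputed from parity plus every other disk, but two missing disks cannot. The sketch below uses made-up two-byte "disks" and is only an illustration of the idea, not unRAID's actual driver code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny made-up data disks and the parity computed from them.
disk1, disk2, disk7 = b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"
parity = xor_blocks([disk1, disk2, disk7])

# With only disk7 missing, parity plus every surviving disk gives it back.
assert xor_blocks([parity, disk1, disk2]) == disk7

# If disk1 is unreadable as well, what is left only yields disk1 XOR disk7,
# which cannot be split back into the two separate disks.
assert xor_blocks([parity, disk2]) == xor_blocks([disk1, disk7])
```

Every sector of disk1 that cannot be read therefore leaves a matching hole in the rebuilt disk7.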

 

But if disk1 is bad, there is not much I know to do.  Perhaps you could send it to some recovery service that could create an image copy of the disk.

 

Is the physical disk that used to be in slot7 still around?  If so, you might have more luck recovering from that than recovering from the array with the failing disk1.

Link to comment

 


 

I have a very strong suspicion that, like you said before, it may be PSU related. The disks are dropping like flies. There must be something in the hardware that is causing this epidemic of failures. I did not mention this before, but during one of the reboots the other night another two drives went missing. I powered down the server, re-seated the power and SATA cables and powered it up. The drives, thank God, came back up. It finally dawned on me that I should not take any more chances, so I stopped the rebuild and powered down the server. I don't rule out other hardware problems either, especially since I know I've had memory/mobo issues before. So, I decided to rebuild the entire machine. It's been in my plans anyway. The new hardware is on the way, so I should have the new build next week.

 

As of right now, my plan is as follows:

 

1. Take out disks 1 and 9 and copy the files off them to my second unRAID machine. I am looking for a way to do it via some kind of GUI, as my command-line skills are lacking and I don't trust them 100%. Your suggestion to copy the files to a Windows machine sounds like something I'd like to try, but I am not sure I want to do that, as I have two empty 2TB drives in my second unRAID.

2. Once the new machine is ready I will put disks 1 and 9 back in and try to rebuild disk 7, or maybe I can extract files from the 'virtual' disk 7. Yes, I still have the physical disk 7, but it is bricked.

3. If step 2 fails, I will back up the files from another 1TB disk that is the same model with the same firmware, and swap its controller board onto the dead drive. Perhaps I can revive it.

4. If that fails as well... well, I hope I can somehow figure out what was on it and get that content back. The good thing is that all of my critical files are backed up on the second server.

 

Question, suppose I can copy files off Disks 1 and 9, is there an efficient way to tell if all files are there

and if there are any corruptions? Or should I just compare the 'used' space in the original config with the

one on the new drive, and then go check each individual file manually?

 

Thanks for your support bjp999.

Link to comment

If the files are zip files, or rar files, you can test them.  Or if you have a backup, you can compare them.  Or if you ran an md5 checksum, you can run another md5 checksum and compare them.  Or if you ran PAR2 on a directory, you can verify with quickpar.
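For zip files the test can be scripted with Python's standard zipfile module, which re-reads every member and checks its CRC. This is just a sketch (the function name is made up), and it does not cover rar files, which need a separate tool.

```python
import zipfile


def zip_is_intact(path):
    """Return True if every member of the archive passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first bad member, or None if all are fine.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False
```

Running it over each recovered archive gives a quick yes or no per file, though a read error from the disk itself would still surface as an exception.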

 

If it is a movie, you can watch it.  If it is a song, you can listen to it.  If it is a program, you can run it.

 

You get the idea.

 

There is nothing built into unRAID that provides the ability to verify that a file has not been corrupted.
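If you do have a known-good copy to compare against, such as the backup on a second server, the md5 comparison can be done for a whole tree at once. A minimal sketch using only the Python standard library follows; the function names are made up, and it can only tell you that two copies differ, not which one is right.

```python
import hashlib
import os


def md5_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large media files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def compare(original, copy):
    """Return (missing, changed) relative paths when comparing two manifests."""
    missing = sorted(set(original) - set(copy))
    changed = sorted(p for p in original if p in copy and original[p] != copy[p])
    return missing, changed
```

Building one manifest from the backup and one from the recovered disk, then calling `compare` on them, lists the files that are absent or whose contents no longer match.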

Link to comment
