[SOLVED] Failed disk?

June 23, 2013

Hi guys,

Yesterday I was copying some files when we had a few power flickers. Other devices didn't even turn off, and my unRAID server is on a UPS (which is why I didn't think anything of it), but a short time later I noticed the server had shut down...

Turning it on, it wouldn't boot: nothing on the monitor... I shut the whole thing completely off, no power, and after a while turned it back on and it did boot; however, one of the disks had a red ball.

I tried powering off and on again, no deal, same thing.

I tried rebuilding the drive following the procedure stop/remove/start/stop/add/start + rebuild, but no luck: after a certain number of write errors it just stops the process by itself and starts the array unprotected, leaving the disk disabled...

Anything else I could try before crossing off the drive and going out to buy a replacement..?

June 23, 2013

Post a picture of the Web GUI to show the full array's status.

But from your outline of the situation, it does sound like it's time for a new drive.

Did you re-seat the SATA cables on the failed drive? That's always a good thing to try ... and if it's in a hot-swap cage, you may want to try a direct SATA connection instead.

June 23, 2013

Author

Didn't try reseating anything, since nothing moved, but I suppose I could try. The thing that worries me is that it's actually giving errors, though.

Attached is the screenshot you asked me for.

June 23, 2013

You've got good parity, and the drive clearly is having issues => so I'd simply get a new 1TB drive and replace it.

June 23, 2013

You've got good parity, and the drive clearly is having issues => so I'd simply get a new 1TB drive and replace it.

Without knowing WHY the writes to the drive are failing, this advice is a bit premature in my opinion.

If you can attach a copy of the syslog to your next post, analysis will be possible. It could easily be that a SATA connector or a power connector to that drive is loose or intermittent, or it could even be a drive tray or backplane. Yes, it could be a failing disk, but the odds are just as good that it is a loose cable. Since the loss of power would have let the server cool down, it is easy to see how thermal expansion when restarting the server could cause an intermittent connection on one of the cables. You owe it to yourself to stop the array, shut the server down, and re-seat the cables to the failing disk.

If you've got a spare drive, sure, you can replace it. But since you've not even attempted to re-seat the connectors to the failing drive or get a SMART report from it, replacing it now is the equivalent of purchasing a new car simply because your old car won't start, without even having a mechanic look under the hood to see if the battery cable is loose.

June 23, 2013

I DID say to try reseating the cables, and if it was connected through a hot-swap tray to try a direct connection BEFORE replacing the drive.

I suppose saying all that AND mentioning a new drive in the same post might be considered a bit premature.

June 23, 2013

Author

Thanks for the help guys, here I'm attaching the syslog to this post.

I will try re-seating meanwhile.

syslog-2013-06-23.zip

June 23, 2013

Post a SMART report for disk2.

June 24, 2013

This is the very first time I've ever seen messages describing the battery being disconnected in the UPS. (All I've ever seen in the past on other system logs are messages like these from my server this afternoon):

Jun 22 16:56:05 Tower apcupsd[1495]: Power failure.
Jun 22 16:56:06 Tower apcupsd[1495]: Power is back. UPS running on mains.

The disk failed to respond to a "write" command within seconds of your UPS battery being disconnected. I suspect the power failure caused a power spike and might have been the root cause of the write failure to the disk. With a proper (and consistent) power supply voltage, the disk might be fine. It failed with a "Drive Ready" error within seconds of the power hiccup. Since I've seen several "battery disconnected/reattached" message pairs in your syslog, I suspect your UPS may not be reliable at this time.

Jun 23 12:45:13 Tower apcupsd[1671]: Battery disconnected.
Jun 23 12:45:19 Tower apcupsd[1671]: Battery reattached.
Jun 23 12:46:00 Tower kernel: ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 23 12:46:00 Tower kernel: ata5.01: BMDMA stat 0x65
Jun 23 12:46:00 Tower kernel: ata5.01: failed command: WRITE DMA
Jun 23 12:46:00 Tower kernel: ata5.01: cmd ca/00:00:bf:9b:f1/00:00:00:00:00/f0 tag 0 dma 131072 out
Jun 23 12:46:00 Tower kernel: res 51/04:30:b8:93:f1/00:00:00:00:00/f0 Emask 0x1 (device error)
Jun 23 12:46:00 Tower kernel: ata5.01: status: { DRDY ERR }
Jun 23 12:46:00 Tower kernel: ata5.01: error: { ABRT }
Jun 23 12:46:00 Tower kernel: ata5.01: failed to read native max address (err_mask=0x1)
Jun 23 12:46:00 Tower kernel: ata5.01: HPA support seems broken, skipping HPA handling
Jun 23 12:46:00 Tower kernel: ata5.00: configured for UDMA/133
Jun 23 12:46:00 Tower kernel: ata5.01: configured for UDMA/133 (device error ignored)
Jun 23 12:46:00 Tower kernel: ata5: EH complete
Jun 23 12:46:00 Tower kernel: ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 23 12:46:00 Tower kernel: ata5.01: BMDMA stat 0x65
Jun 23 12:46:00 Tower kernel: ata5.01: failed command: WRITE DMA
Jun 23 12:46:00 Tower kernel: ata5.01: cmd ca/00:00:bf:9b:f1/00:00:00:00:00/f0 tag 0 dma 131072 out
Jun 23 12:46:00 Tower kernel: res 61/04:00:bf:9b:f1/00:00:00:00:00/f0 Emask 0x1 (device error)
Jun 23 12:46:00 Tower kernel: ata5.01: status: { DRDY DF ERR }

Once the drive failed to respond, you can ignore the subsequent errors in the log; they occur because the drive stopped responding to commands.
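For anyone who wants to repeat this kind of correlation on their own syslog, here is a rough Python sketch. It assumes classic syslog lines shaped like the ones quoted above; the function names, the 60-second window, and the `year` parameter (these timestamps carry no year) are illustrative choices, not anything unRAID or apcupsd provides.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as
# "Jun 23 12:45:13 Tower apcupsd[1671]: Battery disconnected."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+([^:\[\s]+)(?:\[\d+\])?:\s*(.*)$"
)


def parse_line(line, year=2013):
    """Return (timestamp, program, message), or None for lines that do not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, program, message = match.groups()
    # The timestamps carry no year, so the caller supplies one (an assumption).
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, program, message


def failures_after_battery_events(lines, window_seconds=60, year=2013):
    """Pair each kernel 'failed command' line with the latest preceding
    apcupsd 'Battery disconnected' event that falls inside the time window."""
    last_disconnect = None
    hits = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, program, message = parsed
        if program == "apcupsd" and "Battery disconnected" in message:
            last_disconnect = when
        elif program == "kernel" and "failed command" in message and last_disconnect:
            delay = (when - last_disconnect).total_seconds()
            if 0 <= delay <= window_seconds:
                hits.append((delay, message))
    return hits


sample = [
    "Jun 23 12:45:13 Tower apcupsd[1671]: Battery disconnected.",
    "Jun 23 12:45:19 Tower apcupsd[1671]: Battery reattached.",
    "Jun 23 12:46:00 Tower kernel: ata5.01: failed command: WRITE DMA",
]
for delay, message in failures_after_battery_events(sample):
    print(f"{delay:.0f}s after battery disconnect: {message}")
```

A match only shows that the two events were close in time; it does not prove the UPS caused the write failure.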

June 24, 2013

Author

Hmm.. Yes, I need to replace the battery in my UPS; it has already been warning me that it needs to be changed.

Since it is still providing power (albeit not backup power), I didn't think it would have anything to do with this...

So would you suggest that, until I replace the battery, I just plug the server directly into an outlet and try the rebuild process again?

June 24, 2013

If you didn't have any power failures during the rebuild, it shouldn't matter whether you were plugged in to the UPS or directly to the outlet; but it certainly wouldn't hurt to plug directly into an outlet and try one more time before you replace the drive.

... and get a new battery !! I presume you've learned not to ignore messages about that.

June 24, 2013

Author

Ok, so I opened up the case, reseated everything (even though it all looked fine) but still the same thing...

*BUT* I did observe something different that points me to the PSU...

Having left the computer off all night, when I turned it on it booted up just fine (with the same array status of course); however, after a power down and restart, the machine would not boot... everything turns on but it does not boot (not even a BIOS logo...). I tried disconnecting one of the PSU cables that powers 2 of the drives, and it boots up just fine, it seems...

I measured the voltages on that cable and they seem fine (+12 and +5), but with everything plugged in it's just not booting (unless I unplug the computer PSU and wait a while, and then it does...).

The PSU is an OCZ Silent x treme 600W, and I currently have 5 drives. Do you think the PSU could be the culprit, or something on one of the (now) disconnected drives?

June 25, 2013

Author

Ok, update and more info..

It seems my observation from before was just a one-time thing... I tried again leaving one of those 2 drives disconnected: nothing. Then the other: it booted. But then with both disconnected again, it didn't boot.. so now I'm clueless again... Could it be the CPU? Or something on the mobo now? Really confused..

Anyway, I shut the whole thing down again, with no power, and waited a bit. Turned everything back on (all drives connected), it booted up, and the array started but unprotected, with the red-ball disk. Did the rebuild process, and this time it got to about 20% and failed again due to errors.

I tried to get a SMART report but can't: after the rebuild fails and the disk is disabled, all I get when I ask for a SMART status is this:

Statistics for /dev/sdc 00Y_WD-WCAV55557259

smartctl -a -d ata /dev/sdc

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

So would this be the indicator that for sure the drive is f*cked up? Attaching the latest syslog that has a lot of errors in it....

syslog-2013-06-24.txt

June 25, 2013

No, not necessarily. It could be that your disk controller does not need/like the "-d ata" option.

Try just

smartctl -a /dev/sdc

Joe L.
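The retry suggested above can be scripted. What follows is only a sketch: it uses the flags that appear in this thread (`-a`, `-d ata`, and the `-T permissive` option that smartctl itself suggests), and the helper names are made up for illustration. Whether any of these calls returns useful data depends on the controller and on whether the drive still answers at all.

```python
import subprocess


def smartctl_commands(device):
    """Candidate invocations, simplest first."""
    return [
        ["smartctl", "-a", device],                      # let smartctl pick the device type
        ["smartctl", "-a", "-d", "ata", device],         # force ATA, as in the failing call
        ["smartctl", "-a", "-T", "permissive", device],  # last resort named in the error text
    ]


def identity_failed(output):
    """True if the output contains the identity error quoted in the thread."""
    return "Device Read Identity Failed" in output


def run_smartctl(cmd):
    """Run one command and return its stdout. smartctl's exit status is a
    bitmask that can be non-zero even on success, so it is not checked here."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def first_usable_report(device, run=run_smartctl):
    """Try each candidate in order and return (command, output) for the first
    one that gets past device identification, or (None, "") if none does."""
    for cmd in smartctl_commands(device):
        output = run(cmd)
        if not identity_failed(output):
            return cmd, output
    return None, ""
```

Calling `first_usable_report("/dev/sdc")` on the server would try the three forms in turn; the `run` parameter exists only so the logic can be exercised without a real disk.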

June 25, 2013

It certainly wouldn't hurt to try a different high-quality power supply.

Unstable/intermittent power issues can be very hard to isolate, and can cause unpredictable problems. The easiest way to eliminate that as a potential cause is to use another PSU ... but be sure to use a high-quality single-rail unit [Seasonic X series or Corsair HX series are good choices].

June 25, 2013

It could be the PSU. It has four 18A rails. The disks have less than 18A available.

June 25, 2013

Author

Hmm.. I thought that would be more than enough for the number of disks I have... Power wise I was told about 30W peak per disk. How much current do these suckers consume???

June 25, 2013

Your PSU has plenty of power ... but it's distributed over 4 different 18A rails. If you have all 5 disks on the same rail, that could be causing unstable power.

That's why I suggested you try a high-quality SINGLE-RAIL power supply to eliminate that as a possibility.

June 25, 2013

30W at 12V is 2.5A per disk, which is 12.5A for 5 disks. If the rail is also used for the MB there is not enough power. Less than 1/4 of the PSU's power is available for the drives. The PSU is great for a gaming or parallel computation machine. It is not appropriate for a file server.
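A quick check of that arithmetic, as a sketch only. It assumes the 30W peak figure quoted in the thread is drawn entirely from the +12V line (in practice part of a drive's draw is on +5V), and takes the 5 drives and the 18A per-rail figure from the posts above. The function name is made up for illustration.

```python
def peak_current_amps(watts_per_disk, disks, volts=12.0):
    """Total peak current if every disk draws its full wattage at the given voltage."""
    return watts_per_disk * disks / volts


RAIL_LIMIT_AMPS = 18.0  # per-rail figure quoted in the thread

per_disk = peak_current_amps(30, 1)    # 2.5 A for one disk
five_disks = peak_current_amps(30, 5)  # 12.5 A (six disks would give 15 A)
headroom = RAIL_LIMIT_AMPS - five_disks

print(f"per disk: {per_disk} A, five disks: {five_disks} A, headroom: {headroom} A")
```

On these assumptions the drives alone fit under 18A, so the margin left for anything else sharing that rail is what matters.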

June 25, 2013

Author

Hmm, sorry, I'm trying to understand, but I don't see it... What's more, after reading about the differences between multi-rail and single-rail, it doesn't even seem to matter:

http://www.overclock.net/t/761202/single-rail-vs-multi-rail-explained

Also, I didn't mention it but my setup has been running just fine for over 2 years...

June 26, 2013

The article is quite clear. I suggest you read it again. Here is the part that the author suggests re-reading:

Multiple rail: each trace is monitored separately, so if, say, one trace goes over 25A the power supply will shut down.

Single rail: all traces are monitored all together, so if the total current going through the +12V outputs goes over, say, 60A, the power supply will shut down. Alternatively, no OCP may be present at all on the +12V rail.

Your multi-rail unit will shut down when 18A is reached on a rail. All of the hard drives (+ possibly the MB) are on a single 18A rail. PSU performance degrades over time. If the PSU was running slightly over spec, it may have worked for 2 years only to fail now.
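The distinction in the quoted article can be put as a toy model. This is a simplification that assumes over-current protection is a plain threshold check, and the load figures below are hypothetical examples, not measurements from this server.

```python
def multi_rail_trips(rail_loads_amps, per_rail_limit_amps):
    """Multi-rail: each rail is monitored separately, so one overloaded rail is enough."""
    return any(load > per_rail_limit_amps for load in rail_loads_amps)


def single_rail_trips(rail_loads_amps, total_limit_amps):
    """Single-rail: only the combined +12V current is compared against the limit."""
    return sum(rail_loads_amps) > total_limit_amps


# Hypothetical distribution: drives and motherboard share the first rail.
loads = [20.0, 5.0, 5.0, 5.0]

print(multi_rail_trips(loads, per_rail_limit_amps=18.0))  # True: the 20 A rail exceeds 18 A
print(single_rail_trips(loads, total_limit_amps=60.0))    # False: 35 A total is under 60 A
```

The same total load passes on a single-rail design and trips on a multi-rail one, which is why the way the drives are spread across the rails matters more than the PSU's headline wattage.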

June 26, 2013

Author

Right, that I get, but my psu does not shut down..

June 26, 2013

Right, that I get, but my psu does not shut down..

How do you know? Do you have an oscilloscope on the 12V line to watch for momentary shutdowns?

June 26, 2013

Your PSU may not have OCP as described in the article; it may simply limit the amperage. I'm certain that 18A is not enough to run your system. You've been lucky up 'til now.

June 26, 2013

Author

Right, that I get, but my psu does not shut down..
How do you know? Do you have an oscilloscope on the 12V line to watch for momentary shutdowns?

Huh? What do you mean? Like micro failures which would cause the disk (and that specific disk...) to lose power, and yet not enough to power down the system..?

For the record, I did put a multimeter on the 12V and 5V lines during POST, and there was no change in the voltage values.
