[SOLVED] Failed disk?


Hi guys,

 

Yesterday I was copying some files when we had a few power flickers. Other devices didn't even turn off, and my unRAID server is on a UPS (which is why I didn't think much of it), but a short time later I noticed the server had shut down....

 

When I turned it on it wouldn't boot, with nothing on the monitor... I shut the whole thing completely off, with no power, and after a while turned it back on. It did boot, but one of the disks had a red ball.

 

I tried powering off and on again, no deal, same thing.

 

I tried rebuilding the drive following the stop/remove/start/stop/add/start + rebuild procedure, but no luck: after a certain number of write errors it just stops the process by itself and starts the array unprotected, leaving the disk disabled... :(

 

Anything else I could try before crossing off the drive and going out to buy a replacement..? :S

 


Post a picture of the Web GUI to show the full array's status.

 

But from your outline of the situation, it does sound like it's time for a new drive.

 

Did you re-seat the SATA cables on the failed drive?  That's always a good thing to try ... and if it's in a hot-swap cage, you may want to try a direct SATA connection instead.

 

  • Author

Didn't try reseating anything, since nothing moved, but I suppose I could try. The thing that worries me is it's actually giving errors though :(

 

Attached is the screenshot you asked me for.

Unraid_Error.png.f219d364b97ca639392d7cb66de6d48d.png

You've got good parity, and the drive clearly is having issues => so I'd simply get a new 1TB drive and replace it.

 


Without knowing WHY the writes to the drive are failing, this advice is a bit premature in my opinion.

 

If you can attach a copy of the syslog to your next post, analysis will be possible.  It could easily be that a SATA connector or a power connector to that drive is loose or intermittent, or it could even be a drive tray or backplane.  Yes, it could be a failing disk, but the odds are just as good that it is a loose cable.  Since the loss of power would have let the server cool down, it is easy to see how thermal expansion when restarting the server could cause an intermittent connection on one of the cables.  You owe it to yourself to stop the array, shut the server down and re-seat the cables to the failing disk.

 

If you've got a spare drive, sure, you can replace it. But since you've not even attempted to re-seat the connectors to the failing drive, or get a SMART report from it, replacing it now is the equivalent of purchasing a new car simply because your old car won't start, without even having a mechanic look under the hood to see if the battery cable is loose.

I DID say to try reseating the cables, and if it was connected through a hot-swap tray to try a direct connection BEFORE replacing the drive  :)

 

I suppose saying all that AND mentioning a new drive in the same post might be considered a bit premature.

 

  • Author

Thanks for the help guys, here I'm attaching the syslog to this post.

 

I will try re-seating meanwhile.

syslog-2013-06-23.zip

This is the very first time I've ever seen messages describing the battery being disconnected in the UPS.  (All I've ever seen in the past on other system logs are messages like these from my server this afternoon):

Jun 22 16:56:05 Tower apcupsd[1495]: Power failure.

Jun 22 16:56:06 Tower apcupsd[1495]: Power is back. UPS running on mains.

 

 

The disk failed to respond to a "write" command within seconds of your UPS battery being disconnected.  I suspect the power failure caused a power spike and might have been the root cause of the write failure to the disk.  With a proper (and consistent) power supply voltage, the disk might be fine.  It failed with a "Drive Ready" error within seconds of the power hiccup.  Since I've seen several "battery disconnected/reattached" message pairs in your syslog, I suspect your UPS may not be reliable at this time.

 

Jun 23 12:45:13 Tower apcupsd[1671]: Battery disconnected.

Jun 23 12:45:19 Tower apcupsd[1671]: Battery reattached.

Jun 23 12:46:00 Tower kernel: ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Jun 23 12:46:00 Tower kernel: ata5.01: BMDMA stat 0x65

Jun 23 12:46:00 Tower kernel: ata5.01: failed command: WRITE DMA

Jun 23 12:46:00 Tower kernel: ata5.01: cmd ca/00:00:bf:9b:f1/00:00:00:00:00/f0 tag 0 dma 131072 out

Jun 23 12:46:00 Tower kernel:          res 51/04:30:b8:93:f1/00:00:00:00:00/f0 Emask 0x1 (device error)

Jun 23 12:46:00 Tower kernel: ata5.01: status: { DRDY ERR }

Jun 23 12:46:00 Tower kernel: ata5.01: error: { ABRT }

Jun 23 12:46:00 Tower kernel: ata5.01: failed to read native max address (err_mask=0x1)

Jun 23 12:46:00 Tower kernel: ata5.01: HPA support seems broken, skipping HPA handling

Jun 23 12:46:00 Tower kernel: ata5.00: configured for UDMA/133

Jun 23 12:46:00 Tower kernel: ata5.01: configured for UDMA/133 (device error ignored)

Jun 23 12:46:00 Tower kernel: ata5: EH complete

Jun 23 12:46:00 Tower kernel: ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0

Jun 23 12:46:00 Tower kernel: ata5.01: BMDMA stat 0x65

Jun 23 12:46:00 Tower kernel: ata5.01: failed command: WRITE DMA

Jun 23 12:46:00 Tower kernel: ata5.01: cmd ca/00:00:bf:9b:f1/00:00:00:00:00/f0 tag 0 dma 131072 out

Jun 23 12:46:00 Tower kernel:          res 61/04:00:bf:9b:f1/00:00:00:00:00/f0 Emask 0x1 (device error)

Jun 23 12:46:00 Tower kernel: ata5.01: status: { DRDY DF ERR }

 

Once the drive failed to respond, you can ignore the subsequent errors in the log; they are there because the drive stopped responding to commands.
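For anyone who wants to check a syslog for the same pattern, here is a minimal Python sketch of the idea above: find ATA "failed command" lines that follow an apcupsd battery event within a short window. The function names, the 60-second window and the hard-coded year are assumptions made for illustration (classic syslog timestamps carry no year); the sample lines are copied from the log excerpt in this thread.

```python
import re
from datetime import datetime, timedelta

# A few lines copied from the syslog excerpt quoted in this thread.
SAMPLE_LOG = """\
Jun 23 12:45:13 Tower apcupsd[1671]: Battery disconnected.
Jun 23 12:45:19 Tower apcupsd[1671]: Battery reattached.
Jun 23 12:46:00 Tower kernel: ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 23 12:46:00 Tower kernel: ata5.01: failed command: WRITE DMA
Jun 23 12:46:00 Tower kernel: ata5.01: status: { DRDY ERR }
"""

# timestamp, host, program name (optional [pid]), message
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+([^:\[]+)(?:\[\d+\])?:\s+(.*)$"
)


def parse_line(line, year=2013):
    """Split one syslog line into (timestamp, program, message), or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    # Assumption: the year is supplied by the caller, since syslog omits it.
    stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def failures_after_battery_events(log_text, window=timedelta(seconds=60)):
    """Return (delay, message) for ATA failures shortly after a battery event."""
    last_event = None
    hits = []
    for line in log_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, program, message = parsed
        if program == "apcupsd" and message.startswith("Battery"):
            last_event = stamp
        elif program == "kernel" and "failed command" in message:
            if last_event is not None and stamp - last_event <= window:
                hits.append((stamp - last_event, message))
    return hits


if __name__ == "__main__":
    for delay, message in failures_after_battery_events(SAMPLE_LOG):
        print(f"{delay.total_seconds():.0f}s after battery event: {message}")
```

On the sample lines this reports the WRITE DMA failure 41 seconds after the "Battery reattached" message. A match like this only shows that the two events were close in time; it does not prove the UPS caused the disk error.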

 

  • Author

Hmm.. Yes, I need to replace the battery in my UPS, as it has been warning already it needs to be changed.

 

Since it is still providing power (albeit not backup power) I didn't think it would have anything to do with it...

 

So would you suggest that, until I replace the battery, I just plug the server directly into an outlet and try the rebuild process again?

If you didn't have any power failures during the rebuild, it shouldn't matter whether you were plugged in to the UPS or directly to the outlet;  but it certainly wouldn't hurt to plug directly into an outlet and try one more time before you replace the drive.

 

... and get a new battery !!  I presume you've learned not to ignore messages about that  :)

  • Author

 

Ok, so I opened up the case, reseated everything (even though it all looked fine) but still the same thing...

 

*BUT* I did observe something different that points me to the PSU...

 

Having left the computer off all night, when I turned it on it booted up just fine (with the same array status of course). However, after a power down and restart, the machine would not boot... everything turns on but it does not boot (not even a BIOS logo...). I tried disconnecting one of the PSU cables that powers 2 of the drives, and it seems to boot up just fine...

 

I measured power on that cable and seems fine (+12 and +5) but with all plugged in it's just not booting (unless I unplug the computer PSU and wait a while, and then it does...).

 

The PSU is an OCZ Silent x treme 600W, and I currently have 5 drives.  Do you think the PSU could be the culprit or something on one of the (now)disconnected drives?

 

  • Author

Ok, update and more info..

 

Seems that my earlier observation was just a one-off... I tried again leaving one of those 2 drives disconnected: nothing. With the other one disconnected it booted, but then with both disconnected again it didn't boot.. so now I'm clueless again... could it be the CPU? Or something on the mobo now? Really confused..

 

Anyway, I shut the whole thing down again, cut the power and waited a bit. Turned everything back on (all drives connected), it booted up, and the array started but unprotected, with the red ball disk. Did the rebuild process, and this time it got to about 20% and failed again due to errors :(

 

I tried to get a SMART report but can't: after the rebuild fails and the disk is disabled, all I get when I ask for a SMART status is this:

 

 

Statistics for /dev/sdc 00Y_WD-WCAV55557259

smartctl -a -d ata /dev/sdc

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

 

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

 

So would this be the indicator that for sure the drive is f*cked up? Attaching the latest syslog that has a lot of errors in it....

syslog-2013-06-24.txt


No, not necessarily. It could be that your disk controller does not need/like the "-d ata" option.

Try just

smartctl -a /dev/sdc

 

Joe L.
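If you end up retrying this a few times with different options, a small Python wrapper can keep the variants straight. This is only a sketch: the helper names are made up for illustration, and the only smartctl options it uses are the ones that already appear in this thread (`-a`, `-d ata`, and the `-T permissive` hint from smartctl's own error message).

```python
import shutil
import subprocess


def smartctl_command(device, device_type=None, permissive=False):
    """Build the smartctl argument list for a full report on `device`."""
    argv = ["smartctl", "-a"]
    if device_type is not None:
        # e.g. "ata", as in the failing command shown earlier in the thread
        argv += ["-d", device_type]
    if permissive:
        # suggested by smartctl itself when a mandatory SMART command fails
        argv += ["-T", "permissive"]
    argv.append(device)
    return argv


def run_smartctl(device, **options):
    """Run smartctl if it is installed; return its output, or None if missing."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smartctl_command(device, **options), capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # First the plain form suggested above, then the original "-d ata" form.
    print(" ".join(smartctl_command("/dev/sdc")))
    print(" ".join(smartctl_command("/dev/sdc", device_type="ata")))
```

Running it only prints the two command lines; call `run_smartctl("/dev/sdc")` on the server itself (as root) to actually query the drive.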

 

It certainly wouldn't hurt to try a different high-quality power supply.

 

Unstable/intermittent power issues can be very hard to isolate, and can cause unpredictable problems.  The easiest way to eliminate that as a potential cause is to use another PSU ... but be sure to use a high-quality single-rail unit [Seasonic X series or Corsair HX series are good choices].

 

 


 

It could be the PSU. It has 4 18A rails. The disks have less than 18A available.

  • Author

 


 

Hmm..  I thought that would be more than enough for the number of disks I have... Power-wise I was told about 30W peak per disk. How much current do these suckers consume???

Your PSU has plenty of power ... but it's distributed over 4 different 18A rails.  If you have all 5 disks on the same rail, that could be causing unstable power.

 

That's why I suggested you try a high-quality SINGLE-RAIL power supply to eliminate that as a possibility.

 

 


 

30W at 12V is 2.5A per disk, so 12.5A for your 5 disks (15A for 6). If the rail is also used for the MB there is not enough power. Less than 1/4 of the power is available for the drives. The PSU is great for a gaming or parallel computation machine. It is not appropriate for a file server.
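The arithmetic is easy to redo for your own drive count. Below is a rough Python sketch, assuming (as a simplification) that the whole 30W peak is drawn from the +12V rail; in practice drives also draw from +5V, so treat this as an upper-bound estimate. The function names and the 6A motherboard figure are hypothetical, chosen only to show how the headroom disappears.

```python
def disk_current_amps(peak_watts, volts=12.0):
    """Current drawn by one disk, assuming all of its power comes from one rail."""
    return peak_watts / volts


def rail_headroom_amps(rail_limit_amps, disks, peak_watts_per_disk,
                       other_load_amps=0.0):
    """Amps left on the rail after the disks and any other load (negative = over)."""
    total = disks * disk_current_amps(peak_watts_per_disk) + other_load_amps
    return rail_limit_amps - total


if __name__ == "__main__":
    # Figures from this thread: 18A rail, 5 disks, about 30W peak each.
    print(disk_current_amps(30))                 # amps per disk
    print(rail_headroom_amps(18, 5, 30))         # disks alone on the rail
    # Hypothetical: the motherboard pulls another 6A from the same rail.
    print(rail_headroom_amps(18, 5, 30, other_load_amps=6.0))
```

With 5 disks alone the rail has 5.5A to spare; add the hypothetical 6A from the motherboard and it is 0.5A over the 18A limit.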

The article is quite clear. I suggest you read it again. Here is the part that the author suggests re-reading:

Multiple rail: each trace is monitored separately, so if, say, one trace goes over 25A the power supply will shut down.

 

Single rail: all traces are monitored all together, so if the total current going through the +12V outputs goes over, say, 60A, the power supply will shut down. Alternatively, no OCP may be present at all on the +12V rail.

 

Your multi-rail unit will shut down when 18A is reached on a rail. All of the hard drives (+ possibly the MB) are on a single 18A rail. PSU performance degrades over time. If the PSU was slightly over spec then it may have worked for 2 years only to fail now.
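To make the difference between the two protection schemes concrete, here is a toy Python model of the over-current check described in the quoted article. It is a sketch only: real OCP circuits have tolerances and trip delays, the 60A total is the article's own "say, 60A" example, and the load distribution is invented for illustration.

```python
def multi_rail_trips(rail_loads_amps, per_rail_limit_amps):
    """Multi-rail OCP: each rail is monitored separately; any one can trip it."""
    return any(load > per_rail_limit_amps for load in rail_loads_amps)


def single_rail_trips(rail_loads_amps, total_limit_amps):
    """Single-rail OCP: only the combined +12V current is monitored."""
    return sum(rail_loads_amps) > total_limit_amps


if __name__ == "__main__":
    # Hypothetical load: drives plus motherboard on one rail, the rest lightly used.
    loads = [19.0, 5.0, 5.0, 5.0]
    print(multi_rail_trips(loads, per_rail_limit_amps=18.0))   # one rail over 18A
    print(single_rail_trips(loads, total_limit_amps=60.0))     # 34A total
```

The same 34A total trips the multi-rail check (one rail is at 19A) but not the single-rail one, which is the argument being made above for a single-rail unit in a server where all the drives hang off one cable.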

  • Author


 

Right, that I get, but my psu does not shut down..

How do you know? Do you have an oscilloscope on the 12V line to watch for momentary shutdowns?

Your PSU may not have OCP as described in the article; it may simply limit the amperage. I'm certain that 18A is not enough to run your system. You've been lucky up 'til now.

  • Author


 

Huh? What do you mean? Like micro failures which would cause the disk (and that specific disk...) to lose power, and yet not enough to power down the system..?

For the record, I did put a multimeter on the 12V and 5V cables during POST, and there was no change in the voltage values.

Archived

This topic is now archived and is closed to further replies.
