Disk showing as disabled

January 14, 2009

One of my disks is now showing a red dot and appears as disabled. I saw some errors on the management page before I rebooted, but it is now showing 0.

I appear to be able to read/write to the disk, but I assume the disk is now defective? Is there any way to find out what the errors are, i.e. what do I put in the RMA form to WD?

January 14, 2009

Read and follow: http://lime-technology.com/wiki/index.php?title=Troubleshooting

The errors on the page would go away after a reboot, since that is transient data, not smartctl info.

Take some smartctl dumps and post the results.

January 14, 2009

Also see here: http://lime-technology.com/wiki/index.php?title=FAQ#What_does_the_Red_Ball_mean.3F

Your failed drive might really be defective, or it could just as easily be a bad or loose cable. The SMART tests will let you know more. If the smartctl command complains about a missing library, that library will need to be installed. (It was accidentally left out of the last few unRAID releases.) See this post for how to get and install the missing library.

A copy of the SMART status report probably has enough information to RMA your drive.

If it appears that the drive is OK, and it was just a defective or loose cable, you can get the disk back online by following the steps here in the wiki.

Joe L.

January 15, 2009

Author

Ahh, the missing package explains why the buttons on the disk management screen of unMenu didn't do anything!

Here is the smartctl output, in which I can't see any errors:

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD5000AAKB-00YSA0
Serial Number:    WD-WCAS87014936
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jan 15 09:59:41 2009 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (13200) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 154) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x203f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   209   173   021    Pre-fail  Always       -       4550
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       737
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3083
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       283
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       88
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       737
194 Temperature_Celsius     0x0022   143   112   000    Old_age   Always       -       7
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
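
For anyone who would rather not eyeball the attribute table, here is a minimal Python sketch that parses rows in the layout shown above and flags anything that looks worth a second look. The watch list and the flagging rules are assumptions chosen for illustration, not an official smartmontools rule set, and the function names are hypothetical.

```python
import re

# Attributes watched in this sketch. The list is an assumption for
# illustration, not something defined by smartmontools or unRAID.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One row of the "Vendor Specific SMART Attributes" table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(report):
    """Return {attribute_name: fields} for every table row found in the text."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[m.group(2)] = {
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "when_failed": m.group(8),
                "raw": int(m.group(9)),
            }
    return attrs


def suspicious(attrs):
    """Names of attributes at or below threshold, marked failed, or watched with a nonzero raw value."""
    flagged = []
    for name, a in attrs.items():
        at_threshold = a["thresh"] > 0 and a["value"] <= a["thresh"]
        nonzero_watched = name in WATCHED and a["raw"] > 0
        if at_threshold or nonzero_watched or a["when_failed"] != "-":
            flagged.append(name)
    return sorted(flagged)


# Three rows copied from the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)) or "nothing flagged")
```

An empty result only means none of these simple rules fired; it does not rule out a cabling or controller problem.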

January 15, 2009

Author

Having said this, the disk is now showing as "not installed".

Unfortunately I'm away from the server, so I can't confirm whether the disk is dead or not. However, I have tried rebooting several times (remotely).

January 15, 2009

I'd verify the cabling to the drive. For it to work one minute and not the next is a strong indication of a loose connector or a defective cable affected by vibration, temperature expansion, phase-of-the-moon, etc.

Joe L.

Only kidding about phase-of-the-moon... odds are good that that would only affect you if you are a werewolf. (Full Moon was on Monday, I think)

January 15, 2009

I had an experience recently with a SATA socket on the motherboard. It caused some flakiness. The slightest jiggle of a cable would cause that connection to drop. I installed a locking SATA cable and that straightened it out.

January 15, 2009

I think it's not just an isolated case, because I have started to experience red balls too since I updated to 4.4.2. In the syslog it looks like this:

Jan 16 00:43:13 Tower kernel: ata5: hard resetting link
Jan 16 00:43:13 Tower kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 16 00:43:13 Tower kernel: ata5.00: configured for UDMA/33
Jan 16 00:43:13 Tower kernel: ata5: EH complete
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] Write Protect is off
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 16 00:43:13 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x5f SErr 0x400100 action 0x6 frozen
Jan 16 00:43:13 Tower kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Jan 16 00:43:13 Tower kernel: ata5: SError: { UnrecovData Handshk }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/f8:00:ff:d8:bd/00:00:0a:00:00/40 tag 0 ncq 126976 in
Jan 16 00:43:13 Tower kernel:          res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/08:08:9f:00:14/00:00:2e:00:00/40 tag 1 ncq 4096 in
Jan 16 00:43:13 Tower kernel:          res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/08:10:e7:34:2d/00:00:0d:00:00/40 tag 2 ncq 4096 in
Jan 16 00:43:13 Tower kernel:          res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/08:18:bf:28:00/00:00:00:00:00/40 tag 3 ncq 4096 in
Jan 16 00:43:13 Tower kernel:          res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 61/30:20:4f:d6:bd/02:00:0a:00:00/40 tag 4 ncq 286720 out
Jan 16 00:43:13 Tower kernel:          res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/80:30:7f:d8:bd/00:00:0a:00:00/40 tag 6 ncq 65536 in
Jan 16 00:43:13 Tower kernel:          res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }

It results either in an ATA bus error or in timeouts and other issues that eventually cause a red ball on a random disk. It never happened when I was using an older distro (I guess something is wrong with the current kernel and the ICH9R controller), though I'm thinking of re-wiring the whole server. The cables attached to my hot-spare bays don't seem to be well made. Can you advise which cables are the best? (The shorter the better.)
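
To see whether errors like these cluster on one port or are spread across several, a rough Python sketch such as the following can tally them per ATA port. It assumes syslog lines shaped like the excerpt above; the function name and the summary layout are hypothetical, and it only counts lines, it does not diagnose the cause.

```python
import re

# Matches kernel lines such as "... kernel: ata5.00: exception Emask ..."
# and captures the port ("ata5") and the rest of the message.
LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")
SERROR_FLAGS = re.compile(r"\{([^}]*)\}")


def summarize(syslog_text):
    """Count link resets and exceptions per ATA port and collect SError flags."""
    summary = {}
    for line in syslog_text.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        entry = summary.setdefault(
            port, {"resets": 0, "exceptions": 0, "serror": set()}
        )
        if "hard resetting link" in msg:
            entry["resets"] += 1
        elif msg.startswith("exception"):
            entry["exceptions"] += 1
        elif msg.startswith("SError:"):
            flags = SERROR_FLAGS.search(msg)
            if flags:
                entry["serror"].update(flags.group(1).split())
    return summary


# Lines copied from the excerpt above.
SAMPLE = """\
Jan 16 00:43:13 Tower kernel: ata5: hard resetting link
Jan 16 00:43:13 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x5f SErr 0x400100 action 0x6 frozen
Jan 16 00:43:13 Tower kernel: ata5: SError: { UnrecovData Handshk }
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
"""

if __name__ == "__main__":
    for port, entry in sorted(summarize(SAMPLE).items()):
        print(port, entry["resets"], entry["exceptions"], sorted(entry["serror"]))
```

If the counts pile up on a single port, swapping that port's cable is a cheaper first test than changing the unRAID release.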

January 16, 2009

Author

Hi all,

I got home and checked the cabling to the drive. It looked OK, but I re-secured the power/IDE leads. Now the drive is showing in unRaid as "Disabled Disk replaced" and asking me whether to "....bring the array on-line, start Data-Rebuild, and then expand the file system (if possible)."

OR

Restore will initialize the stored array configuration; all drives will appear as New, but data disk contents are not affected.

Now surely the disk that it is seeing as replaced, is actually the original disk with all the data still present?

What should I do? Should I allow unraid to rebuild the drive, or just bring it online and should be able to see all the data?

January 16, 2009

Best is this procedure: http://lime-technology.com/wiki/index.php?title=Make_unRAID_Trust_the_Parity_Drive%2C_Avoid_Rebuilding_Parity_Unnecessarily

Joe L.

January 16, 2009

Author

Thanks Joe, but does that procedure cause me any issues if the drive actually is dead/dying?

At this point in time I have no idea whether the drive has all my data there or not, so I guess I should let unraid rebuild it?

January 16, 2009

It puts the array in a state where it assumes all is well, but you pressed the parity check button.

If you do not trust the drive, then it is probably best to let unRAID rebuild it.

Before you do, if you have room on other disks, copy any critical files from the "failed" drive to another disk. (Think of it as similar to using both a belt and suspenders.)

Joe L.

January 16, 2009

That non-destructive parity check feature would come in very handy in situations like this.

January 16, 2009

Author

I think perhaps I'm not explaining the situation properly, or I'm misunderstanding how unraid works.

Currently I believe the parity is valid and the parity disk itself is fine.
A data disk (disk 9) went offline and is now seen by unraid as new, when in reality it probably isn't new and probably still contains my data.

I can't start the array to check the data on disk9, as unraid wants me to either rebuild, or reset. If I could bring the array online, with Disk9 left alone - probably everything would be as it was prior to the disk showing as unavailable.

I think the safest option is to allow unraid to rebuild disk9, even though it is probably not necessary.

Does that make sense?

January 16, 2009

I think I understand.

What has happened is that unRAID, at some point in the past, tried to write to that disk and got an error. A "real" write error is quite serious and likely means the disk is bad, but several things could happen to make it appear to unRAID that a write error occurred when, from the disk's perspective, a write error didn't really occur. For example, a bad or loose data or power cable.

But it doesn't matter to unRAID. Once it sees a write error it jumps into action and removes the physical drive from the array. But through the magic of unRAID, it simulates that drive. Anything you write to the drive goes to the simulated disk, not to the physical disk. I'm not sure if you wrote anything to the simulated disk or not. (I think you understand this but for the benefit of others, unRAID will NEVER put that disk back into service without user action.)
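
As a toy illustration of how a missing disk can be simulated, here is a minimal Python sketch. It assumes a single-parity scheme in which the parity disk holds the byte-wise XOR of all data disks; the disk names and contents are made up, and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def simulate(missing, disks, parity):
    """Reconstruct the missing disk's contents from the surviving disks plus parity."""
    survivors = [blk for name, blk in disks.items() if name != missing]
    return xor_blocks(survivors + [parity])


# Hypothetical four-byte "disks" for illustration only.
data_disks = {
    "disk1": b"\x10\x20\x30\x40",
    "disk2": b"\x0f\x0e\x0d\x0c",
    "disk9": b"DATA",
}
parity = xor_blocks(list(data_disks.values()))

if __name__ == "__main__":
    # disk9 is treated as absent; its contents come back from the others plus parity.
    print(simulate("disk9", data_disks, parity))
```

The same calculation, run across every block, is what a rebuild onto a replacement disk amounts to under this assumption, which is why a read error on any other disk during the rebuild matters.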

So now you think you've fixed whatever the problem was and feel pretty confident that the disk is really okay and you'd like to bring that disk back online.

So there have been 2 alternatives discussed:

1 - Let unRAID rebuild the disk back onto the original disk. This is a tried and true process, but nothing is without risk. And the risk here is that, while rebuilding, unRAID hits a read error on one of the other disks. If this occurs, your reconstructed disk will have an error on it. (This is pretty unlikely if you have been doing routine parity checks). Even less likely (hardly worth mentioning actually), but if a disk were to fail during the process you might be left with a major headache trying to recover data from 2 disks.

2 - Run the "trust my parity" procedure. This is much less "invasive" and does not cause any updates to your original disk. What it does cause is a parity check. If the simulated disk and the actual disks are very different (i.e., you've done some writing to the simulated disk), all of those updates will be lost. These will be seen as parity sync errors and "corrected" to make parity accurate with the physical disk contents. At the end you're back to the state when the drive initially failed. (Updates to other disks are not lost or affected). A substantial problem could occur, however, if you got part way through the "trust" process and the disk experienced a failure (i.e., your corrective actions were not successful after all). Your parity updates would be half baked, and the ability to accurately simulate potentially lost.

I would go with #2 if I was darn sure that I didn't write anything to the simulated disk.

There are other options / variations ...

3 - Get ANOTHER ("fresh") disk and rebuild onto that, saving the original disk in a safe place. Then, if anything goes bad in the drive rebuild, you still have the original and have a second "at bat" to recover the data.

3b - You could then, if you wanted to, remove the fresh disk and put the original back in, and do another rebuild. You'd then have your fresh disk as a backup in case anything went wrong. BTW, this is for anal retentives only (like me)!

4 - You could take the original disk out of the machine and install it into another unRAID server. (Or copy your config directory to a backup (to restore later) and reinitialize the array for testing purposes, installing just that one disk without parity. Make sure you install it in a data slot and not the parity slot.) You could then take a look around on that disk and make sure it seems to be okay. You could run a long smartctl test. Once you are confident in the drive, you could power down (safely) the unRAID server, restore your backup config directory (by putting the flash into a Windows machine), reboot unRAID, and run procedure #2 with more confidence.

Joe L.'s advice to save critical info (and new data if you know of something specific that you wrote while in simulated mode) from the disk is very appropriate. I would call it using a safety harness, more than a belt and suspenders. Do that before you do anything.

Lots of options. Not sure if this helps or confuses. Please post back if I can help clarify.

January 16, 2009

Important to remember is a write to the physical disk failed... regardless of the reason. This almost guarantees the physical disk and the "virtual" one simulated by use of parity and the remaining disks differ.

Now, if the write failure occurred when you were storing a critical document, or an irreplaceable picture, then the odds are you want the unRAID server to rebuild the drive it had previously taken out of service, since the physical drive does not have the data written (or at least, not all of it)

Now, the "write" to the failed disk could have been something far less important, it could have been a timestamp being updated when the drive was first being mounted as you started the array. In that case, the file-system may need checking, but the data is OK.

Unless you know exactly when the "write" failure occurred, and exactly what was affected, it is best to consider the physical disk to be incorrect and the "virtual" disk correct.
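
A small Python sketch of why the two can differ, and what each recovery route ends up with. It assumes single XOR parity and assumes that parity is updated for a write even when the write to the data disk itself fails; both are simplifications for illustration rather than a description of unRAID internals, and all names and contents are hypothetical.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)


def failed_write_scenario(other, physical_before, intended):
    """Return (physical, virtual) contents after a write that reached parity but not the disk."""
    parity = xor_blocks(other, intended)   # parity reflects the intended write
    physical = physical_before             # the disk itself never received it
    virtual = xor_blocks(other, parity)    # what the simulated disk presents
    return physical, virtual


def rebuild(other, parity):
    """Rebuild: the disk is overwritten with the virtual (parity-derived) contents."""
    return xor_blocks(other, parity)


def trust_physical(other, physical):
    """Trust-the-disk route: parity is recomputed to match the physical contents."""
    return xor_blocks(other, physical)


if __name__ == "__main__":
    other = b"\x01\x02\x03\x04"            # combined contents of the healthy disks
    physical, virtual = failed_write_scenario(other, b"old!", b"new!")
    print(physical, virtual)               # the two copies now disagree

    parity = xor_blocks(other, b"new!")
    print(rebuild(other, parity))          # rebuild keeps the newer write

    corrected = trust_physical(other, physical)
    print(xor_blocks(other, corrected))    # corrected parity describes the older contents
```

Under these assumptions a rebuild preserves whatever was written after the failure, while correcting parity against the physical disk discards it, which is the reason for treating the virtual disk as the correct one when in doubt.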

bjp999's idea of restoring to a different disk is a great idea. So would be copying all the files from the "virtual" disk to other physical disks (if you have the free space in your array). Either would give you a way to compare the files later for differences. If you did either, you could ask unRAID to trust that the parity is correct.

Joe L.

January 16, 2009

Author

Thank you all for your replies, you are being incredibly helpful.

One bit I still don't understand though.. how do I copy the data off the drive before I do anything? I can't start the array, therefore cannot access the disk.

I'm leaning toward doing a rebuild, because I have no idea what state the disk is in. Also I assume doing the rebuild will stress the drive and therefore potentially test whether it is actually defective or not?

January 16, 2009

You should be able to unassign the disk (on the devices page) and unRAID should allow you to start the array "one disk short".

January 16, 2009

Because you had rebooted, unRAID has forgotten the serial number of the drive that "was" in the slot that had failed. It now sees a "new" disk, so it is offering to rebuild onto it.

To prevent that, but to still be able to start the array, go to the devices page and un-assign the defective drive. You will then be able to go back to the main page and start the array AND still be able to get to the data on the "virtual" drive. Un-assigning a drive is exactly the same as having one fail.

Whatever you do, DO NOT PRESS THE RESTORE BUTTON and then start the array with the disk unassigned. Use the "Start" button to start it back up.

Joe L.

January 16, 2009

Author

Thanks all, I've kicked off a rebuild.

I have a daft question now. I use the powerdown script to shut down at 11pm. How do I cancel it? (I don't want to power down if the rebuild is still running!)

I've tried atq to list the pending at jobs, but it doesn't return anything.. (Go script uses: echo "powerdown" | at 23:00)

January 16, 2009

If it does not list the job, odds are it does not have one scheduled.

You could try atrm -V 1

Your job should be the only one scheduled. (if it exists)

January 16, 2009

Author

It says "cannot find job id 1", which is weird, because the server shuts down every day at 11pm with no problems. It did so last night too.

Would parts of the go script be missed due to the problems I've been having?

January 16, 2009

Can't answer that question... I don't know how it is set up when the array is off-line.

If you think it might still execute, type

which powerdown

It will tell you where the powerdown command lives.

Then, either edit it and add

exit

as the top line

or rename the whole command so it cannot be found by the original name of "powerdown"

January 17, 2009

Author

Thanks all, I allowed unraid to rebuild the disk and all appears normal again.

January 17, 2009

Excellent!
