January 14, 2009
One of my disks is now showing a red dot and is listed as disabled. I saw some errors on the management page before I rebooted, but it now shows 0. I appear to be able to read and write to the disk; however, I assume the disk is now defective? Is there any way to find out what the errors are, i.e. what do I put in the RMA form to WD?
January 14, 2009
Read and follow: http://lime-technology.com/wiki/index.php?title=Troubleshooting
The errors on the page would go away after a reboot, since that is transient data, not smartctl info. Take some smartctl dumps and post the results.
January 14, 2009
Also see here: http://lime-technology.com/wiki/index.php?title=FAQ#What_does_the_Red_Ball_mean.3F
Your failed drive might really be defective, or it could just as easily be a bad or loose cable. The SMART tests will let you know more. If the smartctl command complains about a missing library, it will need to be installed. (The library was accidentally left out of the last few unRAID releases.) See this post for how to get and install the missing library. A copy of the SMART status report probably has enough information to RMA your drive. If it appears as if the drive is OK, and it was just a defective or loose cable, you can get the disk back online by following the steps here in the wiki.
Joe L.
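For anyone following along, taking a "smartctl dump" could look roughly like the sketch below. It only wraps `smartctl -a`, which prints the drive identity, health status, attribute table and logs. The device name `/dev/sdd`, the output path and the `SMARTCTL` override variable are assumptions for illustration, not something stated in this thread.

```shell
#!/bin/sh
# Save a full SMART report for one drive so it can be posted or attached to an RMA.
# SMARTCTL can be overridden (for example for a dry run); it defaults to smartctl.
SMARTCTL="${SMARTCTL:-smartctl}"

smart_dump() {
    dev="$1"   # block device to query, e.g. /dev/sdd (assumed name)
    out="$2"   # file that receives the report
    # -a prints identity, health status, attributes and the error/self-test logs
    "$SMARTCTL" -a "$dev" > "$out" 2>&1
}

# Example call (device and path are placeholders):
# smart_dump /dev/sdd /tmp/smart_sdd.txt
```

Writing the report to a file rather than the screen makes it easier to paste into a forum post or an RMA form later.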
January 15, 2009, Author
Ahh, the missing package explains why the buttons on the disk management screen of unMenu didn't do anything! Here is the smartctl output, which I can't see any errors in:

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD5000AAKB-00YSA0
Serial Number:    WD-WCAS87014936
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jan 15 09:59:41 2009 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82)
    Offline data collection activity was completed without error.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (13200) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 154) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x203f)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0003 209   173   021    Pre-fail Always  -           4550
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           737
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000e 200   200   051    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 096   096   000    Old_age  Always  -           3083
 10 Spin_Retry_Count        0x0012 100   100   051    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0012 100   100   051    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           283
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           88
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           737
194 Temperature_Celsius     0x0022 143   112   000    Old_age  Always  -           7
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0012 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   051    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
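A report like the one above can be skimmed mechanically. The following is a minimal sketch, not an official check: it reads `smartctl` attribute rows on stdin and flags a row when the normalized VALUE is at or below a non-zero THRESH, or when the raw count is non-zero for IDs 5, 197, 198 or 199. The function name `check_smart_attrs` and the choice of those four IDs are assumptions made for illustration.

```shell
#!/bin/sh
# Flag SMART attributes worth a closer look in smartctl output read from stdin.
check_smart_attrs() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1; name = $2
            value = $4 + 0; thresh = $6 + 0; raw = $10 + 0
            # Normalized value at or below a non-zero threshold
            if (thresh > 0 && value <= thresh)
                print "FAILING " name " value=" value " thresh=" thresh
            # Non-zero raw counts on the sector and CRC attributes (assumed watch list)
            else if ((id == 5 || id == 197 || id == 198 || id == 199) && raw > 0)
                print "WATCH " name " raw=" raw
        }
    '
}

# Example (device name is a placeholder):
# smartctl -A /dev/sdd | check_smart_attrs
```

With the attribute values posted above, every normalized value is above its threshold and the raw counts for IDs 5, 197, 198 and 199 are 0, so this sketch would print nothing. That agrees with the poster not seeing any errors, and with the later suggestion that cabling is the more likely culprit.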
January 15, 2009, Author
Having said this, the disk is now showing as "not installed". Unfortunately I'm away from the server, so I can't confirm whether the disk is dead or not. However, I have tried rebooting several times (remotely).
January 15, 2009
I'd verify the cabling to the drive. For it to work one minute and not the next is a strong indication of a loose connector or a defective cable affected by vibration, temperature expansion, phase-of-the-moon, etc.
Joe L.
Only kidding about phase-of-the-moon... odds are good that that would only affect you if you are a werewolf. (Full Moon was on Monday, I think)
January 15, 2009
I had an experience recently with a SATA socket on the motherboard. It caused some flakiness. The slightest jiggle of a cable would cause that connection to drop. I installed a locking SATA cable and that straightened it out.
January 15, 2009
I don't think it's just an isolated case, because I have started to experience red balls too since I updated to 4.4.2. In the syslog it looks like this:

Jan 16 00:43:13 Tower kernel: ata5: hard resetting link
Jan 16 00:43:13 Tower kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan 16 00:43:13 Tower kernel: ata5.00: configured for UDMA/33
Jan 16 00:43:13 Tower kernel: ata5: EH complete
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] 976773168 512-byte hardware sectors (500108 MB)
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] Write Protect is off
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Jan 16 00:43:13 Tower kernel: sd 5:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 16 00:43:13 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x5f SErr 0x400100 action 0x6 frozen
Jan 16 00:43:13 Tower kernel: ata5.00: irq_stat 0x08000000, interface fatal error
Jan 16 00:43:13 Tower kernel: ata5: SError: { UnrecovData Handshk }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/f8:00:ff:d8:bd/00:00:0a:00:00/40 tag 0 ncq 126976 in
Jan 16 00:43:13 Tower kernel: res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/08:08:9f:00:14/00:00:2e:00:00/40 tag 1 ncq 4096 in
Jan 16 00:43:13 Tower kernel: res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/08:10:e7:34:2d/00:00:0d:00:00/40 tag 2 ncq 4096 in
Jan 16 00:43:13 Tower kernel: res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/08:18:bf:28:00/00:00:00:00:00/40 tag 3 ncq 4096 in
Jan 16 00:43:13 Tower kernel: res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 61/30:20:4f:d6:bd/02:00:0a:00:00/40 tag 4 ncq 286720 out
Jan 16 00:43:13 Tower kernel: res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }
Jan 16 00:43:13 Tower kernel: ata5.00: cmd 60/80:30:7f:d8:bd/00:00:0a:00:00/40 tag 6 ncq 65536 in
Jan 16 00:43:13 Tower kernel: res 40/00:34:7f:d8:bd/00:00:0a:00:00/40 Emask 0x10 (ATA bus error)
Jan 16 00:43:13 Tower kernel: ata5.00: status: { DRDY }

It results either in ATA bus errors or in timeouts and other issues that eventually cause a red ball on a random disk. This never happened when I was using an older release (I guess something is wrong with the current kernel and the ICH9R controller), though I'm thinking of re-wiring the whole server. The cables attached to my hot-spare bays don't seem to be well made. Can you advise which cables are the best? (The shorter the better.)
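To see whether such errors cluster on one port (which would point at one cable or bay) or are spread across ports, the syslog can be tallied. This is a rough sketch under assumptions: the function name `count_bus_errors` is made up, and it relies on the layout shown above, where the `res ... (ATA bus error)` line carries no port name and follows an `ataN.NN:` line that does.

```shell
#!/bin/sh
# Count "ATA bus error" results per ATA port in syslog text read from stdin.
count_bus_errors() {
    awk '
        BEGIN { port = "unknown" }
        # Remember the port named on the most recent ataN: or ataN.NN: line
        match($0, /ata[0-9]+(\.[0-9]+)?:/) { port = substr($0, RSTART, RLENGTH - 1) }
        # The "res" line itself has no port, so charge it to the remembered one
        /ATA bus error/ { n[port]++ }
        END { for (p in n) print p, n[p] }
    '
}

# Example (the syslog path is an assumption):
# count_bus_errors < /var/log/syslog
```

Run over the excerpt above, it should report `ata5.00 6`, since all six bus-error results follow commands issued to that one device. When several ports appear, the order of the output lines is not defined.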
January 16, 2009, Author
Hi all, I got home now and checked the cabling to the drive. It looked OK, but I re-secured the power/IDE leads. Now the drive is showing in unRAID as "Disabled Disk replaced" and asking me whether to "....bring the array on-line, start Data-Rebuild, and then expand the file system (if possible)." OR "Restore will initialize the stored array configuration; all drives will appear as New, but data disk contents are not affected."
Now surely the disk that it is seeing as replaced is actually the original disk with all the data still present? What should I do? Should I allow unRAID to rebuild the drive, or just bring it online, in which case I should be able to see all the data?
January 16, 2009
Best is this procedure: http://lime-technology.com/wiki/index.php?title=Make_unRAID_Trust_the_Parity_Drive%2C_Avoid_Rebuilding_Parity_Unnecessarily
Joe L.
January 16, 2009, Author
Thanks Joe, but does that procedure cause me any issues if the drive actually is dead/dying? At this point in time I have no idea whether the drive has all my data there or not, so I guess I should let unRAID rebuild it?
January 16, 2009
It puts the array in a state where it assumes all is well, as if you had simply pressed the parity check button. If you do not trust the drive, then it is probably best to let unRAID rebuild it. Before you do, if you have room on other disks, copy any critical files from the "failed" drive to another. (Think of it as similar to using both a belt and suspenders.)
Joe L.
January 16, 2009
That non-destructive parity check feature would come in very handy in situations like this.
January 16, 2009, Author
I think perhaps I'm not explaining the situation properly, or I'm misunderstanding how unRAID works. Currently I believe the parity is valid and the parity disk itself is fine. A data disk (disk 9) went offline and is now seen by unRAID as new, when in reality it probably isn't new and probably still contains my data. I can't start the array to check the data on disk 9, as unRAID wants me to either rebuild or reset. If I could bring the array online with disk 9 left alone, probably everything would be as it was prior to the disk showing as unavailable. I think the safest option is to allow unRAID to rebuild disk 9, even though it is probably not necessary. Does that make sense?
January 16, 2009
I think I understand. What has happened is that unRAID, at some point in the past, tried to write to that disk and got an error. A "real" write error is quite serious and likely means the disk is bad, but several things can make it appear to unRAID that a write error occurred when, from the disk's perspective, it didn't. For example, a bad or loose data or power cable. But it doesn't matter to unRAID. Once it sees a write error it jumps into action and removes the physical drive from the array. But through the magic of unRAID, it simulates that drive. Anything you write to the drive goes to the simulated disk, not to the physical disk. I'm not sure if you wrote anything to the simulated disk or not. (I think you understand this, but for the benefit of others: unRAID will NEVER put that disk back into service without user action.)

So now you think you've fixed whatever the problem was, feel pretty confident that the disk is really okay, and would like to bring that disk back online. So far 2 alternatives have been discussed:

1 - Let unRAID rebuild the disk back onto the original disk. This is a tried and true process, but nothing is without risk. The risk here is that, while rebuilding, unRAID hits a read error on one of the other disks. If this occurs, your reconstructed disk will have an error on it. (This is pretty unlikely if you have been doing routine parity checks.) Even less likely (hardly worth mentioning, actually): if a disk were to fail during the process, you might be left with a major headache trying to recover data from 2 disks.

2 - Run the "trust my parity" procedure. This is much less "invasive" and does not cause any updates to your original disk. What it does cause is a parity check. If the simulated disk and the actual disk are very different (i.e., you've done some writing to the simulated disk), all of those updates will be lost. These will be seen as parity sync errors and "corrected" to make parity accurate with the physical disk contents. At the end you're back to the state when the drive initially failed. (Updates to other disks are not lost or affected.) A substantial problem could occur, however, if you got part way through the "trust" process and the disk experienced a failure (i.e., your corrective actions were not successful after all). Your parity updates would be half baked, and the ability to accurately simulate the disk potentially lost.

I would go with #2 if I was darn sure that I didn't write anything to the simulated disk.

There are other options / variations ...

3 - Get ANOTHER ("fresh") disk and rebuild onto that, saving the original disk in a safe place. Then, if anything goes bad in the drive rebuild, you still have the original and have a second "at bat" to recover the data.

3b - You could then, if you wanted to, remove the fresh disk, put the original back in, and do another rebuild. You'd then have your fresh disk as a backup in case anything went wrong. BTW, this is for anal retentives only (like me)!

4 - You could take the original disk out of the machine and install it into another unRAID server. (Or copy your config directory to a backup (to restore later) and reinitialize the array for testing purposes, installing just that one disk without parity. Make sure you install it in a data slot and not the parity slot.) You could then take a look around on that disk and make sure it seems to be okay. You could run a long smartctl test. Once you are confident in the drive, you could power down the unRAID server (safely), restore your backup config directory (by putting the flash into a Windows machine), reboot unRAID, and run procedure #2 with more confidence.

Joe L.'s advice to save critical info from the disk (and new data, if you know of something specific that you wrote while in simulated mode) is very appropriate. I would call it using a safety harness, more than a belt and suspenders. Do that before you do anything.

Lots of options. Not sure if this helps or confuses. Please post back if I can help clarify.
January 16, 2009
The important thing to remember is that a write to the physical disk failed, regardless of the reason. This almost guarantees that the physical disk and the "virtual" one, simulated by use of parity and the remaining disks, differ.

Now, if the write failure occurred when you were storing a critical document or an irreplaceable picture, then the odds are you want the unRAID server to rebuild the drive it had previously taken out of service, since the physical drive does not have the data written (or at least, not all of it).

On the other hand, the "write" to the failed disk could have been something far less important. It could have been a timestamp being updated when the drive was first being mounted as you started the array. In that case, the file system may need checking, but the data is OK.

Unless you know exactly when the "write" failure occurred, and exactly what was affected, it is best to consider the physical disk to be incorrect and the "virtual" disk correct.

bjp999's idea of restoring to a different disk is a great idea. So would be copying all the files from the "virtual" disk to other physical disks (if you have the free space in your array). Either would give you a way to compare the files later for differences. If you did either, you could ask unRAID to trust that the parity is correct.
Joe L.
January 16, 2009, Author
Thank you all for your replies, you are being incredibly helpful. One bit I still don't understand, though: how do I copy the data off the drive before I do anything? I can't start the array, and therefore cannot access the disk. I'm leaning toward doing a rebuild, because I have no idea what state the disk is in. Also, I assume doing the rebuild will stress the drive and therefore potentially test whether it is actually defective or not?
January 16, 2009
You should be able to unassign the disk (on the devices page) and unRAID should allow you to start the array "one disk short".
January 16, 2009
Because you had rebooted, unRAID has forgotten the serial number of the drive that "was" in the slot that had failed. It now sees a "new" disk, so it is offering to rebuild onto it. To prevent that, but still be able to start the array, go to the devices page and un-assign the defective drive. You will then be able to go back to the main page and start the array AND still be able to get to the data on the "virtual" drive. Un-assigning a drive is exactly the same as having one fail. Whatever you do, DO NOT PRESS THE RESTORE BUTTON and then start the array with the disk unassigned. Use the "Start" button to start it back up.
Joe L.
January 16, 2009, Author
Thanks all, I've kicked off a rebuild. I have a daft question now: I use the powerdown script to shut down at 11pm. How do I cancel it? (I don't want to power down if the rebuild is still running!) I've tried atq to list the pending at jobs, but it doesn't return anything.
(Go script uses: echo "powerdown" | at 23:00)
January 16, 2009
If it does not list the job, odds are it does not have one scheduled. You could try
atrm -V 1
Your job should be the only one scheduled (if it exists).
January 16, 2009, Author
It says "cannot find job id 1", which is weird, cos the server shuts down every day at 11pm with no problems. It did so last night fine too. Would parts of the go script be missed due to the problems I've been having?
January 16, 2009
Can't answer that question... I don't know how it is set up when the array is off-line. If you think it might still execute, type
which powerdown
It will tell you where the powerdown command lives. Then either edit it and add
exit
as the top line, or rename the whole command so it cannot be found by its original name of "powerdown".
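Rather than guessing the job number, the pending jobs can be looked up and inspected first. The sketch below is only one way to do it and rests on assumptions: the helper names `at_job_ids` and `cancel_powerdown_jobs` are invented, and it assumes `atq` prints the job ID in the first column and that it is run as the same user that queued the job. It uses `at -c` to print a job's script and `atrm` to remove it.

```shell
#!/bin/sh
# Print the job IDs (first column) from atq output read on stdin.
at_job_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Remove every pending at job whose script mentions "powerdown".
cancel_powerdown_jobs() {
    for id in $(atq | at_job_ids); do
        # at -c prints the queued job's script to stdout
        if at -c "$id" | grep -q powerdown; then
            atrm "$id" && echo "removed at job $id"
        fi
    done
}

# Example:
# cancel_powerdown_jobs
```

If `atq` prints nothing at all, the loop simply does nothing, which matches the earlier remark that no job may be scheduled in the first place.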
January 17, 2009, Author
Thanks all, I allowed unRAID to rebuild the disk and all appears normal again.
This topic is now archived and is closed to further replies.