skyhawk Posted October 20, 2015

So, I've been away for 4 months for work. Luckily I return in 2 weeks. I tried my best to add redundancy to the server before I left, and it turns out that 1 more backup drive would have been a good option... or adding a 5 in 3 hot swap cage.

Here's the rundown:

Disk 6: 2 weeks ago I started to get emails about disk 6 having 1 Offline Uncorrectable error. Looked it up and the forum said 1 is OK, more is bad. As of Thursday night, this number is up to 6. Time to move. This is my original parity drive from 6-7 years ago when I started with unRAID, so she's old.

Thursday night I took disk 6 out of my shared folders so the mover would stop adding to it, and used MC to move the data over to disk 5. Everything was going great at first (moved about 300 GB successfully), but then I saw the errors adding up on Disk 5 (1536 read errors on the unRAID dashboard). Then it said Disk offline, data emulated (or something to that effect). So I stopped the array. The server is still powered up. I tried starting the array again, but now Disk 5 shows nothing and the hard drive isn't selectable.

So... is it possible that a SATA cable just happened to come loose after all this time? Or is Disk 5 possibly failing as well? I know this is hard to diagnose over the internet. The drive is only 6 months old and has been used successfully since then. It went through 2(?) preclears before installing, with no issues.

What would you do in this situation? Options that I can think of:

(NOTE: with Disk 5 and Disk 6 BOTH out of the array, I only have 3TB free across the other drives, which isn't enough to move the contents of both of these drives onto the remaining drives.)

- Shut down the server, wait 2 weeks, check it out then. (Problem: wife's at home and she won't have access to her TV shows.)
- Turn off parity check and just run with disk 5 offline, move disk 6 data to another disk. (Turn off parity check so parity is still valid when I get there, so I have more options in case disk 5 is DOA.) Hmm, on second thought this won't work if any changes are made in the meantime.
- Order a new drive and have wifey install it in the one and only hot swap bay that I have... but it has no fan and would run at 50-53 degrees C until I get home. Then just rebuild parity with disk 5 missing and copy over disk 6.
- Offer a South Florida unRAID member money/beer/etc. to check it out, or pay an IT guy to go out.
- ???

Disk 6 SMART attached. I can't access disk 5, so I don't have its SMART report.

Attachments: Disk_6_Smart_Test_1016.txt, syslog.zip
trurl Posted October 20, 2015

Complete diagnostics would be better. Go to Tools - Diagnostics, get the zip file, and post it. That would allow us to look at some other things, including your other drives' SMART reports.

Do you have backups? If not, then if it were me I would probably shut it down until I could work on it myself. Or at least quit writing to it. You don't really have any parity protection unless all disks are good.

Do you use docker? There is lots of stuff in your syslog about the loop device. Also, your syslog rotated, so it is incomplete. Complete diagnostics would give us the rest of your syslogs too.
skyhawk Posted October 20, 2015

Diagnostics attached here... too large to attach on the forum:
https://drive.google.com/file/d/0B1uaV-iMhp2lTjdiRHBRMVAxOEE/view?usp=sharing

No other backups. It's 95% TV and movies, not essential but a pain in the ass to lose.

I have docker running the usual suspects... SAB, CP, Sonarr, not much else really.

Sometime in the next year or 2 I plan to build a second server for off-site redundancy, but it's not in the budget yet. Worst case, I can shut down and my wife will just buy episodes from Amazon Prime. Idiot tax for not getting a hot swap bay.

...and... THANKS!
garycase Posted October 20, 2015

With no backups (that's another story), you should always have at least one spare drive handy for failures. Then you could have IMMEDIATELY replaced disk #6 when it started to show errors.

Of course without a hot-swap bay, trying to talk your wife through that process might have been interesting ... although if you kept the spare drive installed and connected to a controller, you could have done the replacement yourself by simply stopping the array; unassigning the bad drive; starting the array (so it was marked as missing); stopping the array and assigning your spare; and then starting the array (which would start the rebuild). Of course after you had done that, you should immediately order another spare.

At this point, if you don't want to lose more data than you likely already have, I'd (a) shut down the array and leave it off until you're home; and (b) order a couple of spare drives so you'll have them in hand when you arrive home.
skyhawk Posted October 20, 2015

I replaced the parity and had 3TB of extra space and 3TB of data, so I thought I was covered. 100% extra. But most of my drives are smaller and older, so I figured two 2TB drives was good overhead. Never expected both could have issues simultaneously. Figured an old small one would die first. But live and learn.

I see most drives are built to handle 55 degrees C. So I doubt a new drive would be too negatively affected if wifey put it in the fanless hot swap bay. But... opinions? I won't have a lot of free time when I get home... moving.
skyhawk Posted October 21, 2015

(See below.)

So, I'm going to shut down until I get home.

Any input on the best route for recovery?

Check the cables on disk 5. If the drive is dead, remove it from the array and rebuild parity with the drive missing. Immediately thereafter, replace disk 6 (or just migrate the data, as my replacement will be a 5TB drive).

If disk 5 was just a loose cable, run a parity check, and then replace disk 6. Or skip the parity check, remove disk 6, and rebuild.

Thanks.
skyhawk Posted October 31, 2015

UPDATE: I'm back in town. (New update at bottom.)

OK, I re-seated the cables and Disk 5 was visible. Tried to bring it online but got an Offline error. Power is working, as the drive is spinning. I replaced the SATA cable, restarted, and I can access disk 5 again. See SMART report (attached).

However, there is still a red X next to the drive and it says Disabled, data emulated. When I boot the server, I can select the disk from the drop-down list. Once I try to start the array, it goes offline. If I stop the array, it shows No Device and it's missing from the pull-down list until the next restart.

So, I think the drive is working. SMART looks good to me (but I might be wrong). Does this mean the drive has issues, or do I have to do something to force unRAID to see the disk as valid again?

Keep in mind that Disk 6 is also throwing errors. A new 5TB drive arrived today, but it still needs to be precleared. But I don't want to risk Disk 6 dying, or I lose some data.

DISK 5

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" HDD DT01ACA...
Device Model:     TOSHIBA DT01ACA200
Serial Number:    Y4B5UKNTS
LU WWN Device Id: 5 000039 ffaded4d9
Firmware Version: MX4OABB0
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Sat Oct 31 18:08:40 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (14535) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:    (   1) minutes.
Extended self-test routine recommended polling time: ( 243) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   139   139   054    Pre-fail  Offline      -       71
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       96 (Average 131)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       698
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   094   094   067    Pre-fail  Always       -       6
  8 Seek_Time_Performance   0x0005   124   124   020    Pre-fail  Offline      -       33
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5266
 10 Spin_Retry_Count        0x0013   090   090   060    Pre-fail  Always       -       131072
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       699
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       701
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 24/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
DISK 6

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4709370
LU WWN Device Id: 5 0014ee 002c2a55a
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Oct 31 18:22:15 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (40860) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine recommended polling time:      (   2) minutes.
Extended self-test routine recommended polling time:   ( 394) minutes.
Conveyance self-test routine recommended polling time: (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       3910
  3 Spin_Up_Time            0x0027   203   164   021    Pre-fail  Always       -       4841
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3430
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       35794
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       57
193 Load_Cycle_Count        0x0032   119   119   000    Old_age   Always       -       243719
194 Temperature_Celsius     0x0022   115   096   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   199   000    Old_age   Offline      -       488

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     35631         346803864
# 2  Extended offline    Completed: read failure       90%     35579         346803867
# 3  Extended offline    Completed: read failure       90%     35297         346803865
# 4  Extended offline    Completed without error       00%     30589         -
# 5  Short offline       Completed without error       00%     30538         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
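For anyone who wants to pick the worrying numbers out of a report like the Disk 6 one without reading the whole table, here is a minimal Python sketch. It is not part of unRAID or smartmontools. The function name `flag_attributes`, the watch list, and the sample text (a few rows copied from the Disk 6 report) are assumptions made for illustration. It only parses text in the attribute-table layout shown in this post and flags nonzero raw values for reallocated, pending, and offline-uncorrectable sectors.

```python
import re

# Attribute IDs chosen for this sketch (an assumed watch list, not an official one).
WATCHED = {5, 197, 198}

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the leading integer of the raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(report: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # skip headers and anything that is not an attribute row
        attr_id, name, raw = int(match.group(1)), match.group(2), int(match.group(3))
        if attr_id in WATCHED and raw > 0:
            flagged[name] = raw
    return flagged


# Rows copied from the Disk 6 report in this post.
DISK6_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(flag_attributes(DISK6_SAMPLE))
```

On the sample rows this flags `Current_Pending_Sector` (7) and `Offline_Uncorrectable` (6), the same two values the thread keeps coming back to. The overall "PASSED" line is not consulted at all, which matches how the report is read here: Disk 6 reports PASSED while still showing pending sectors.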
trurl Posted October 31, 2015

Disk5 SMART is OK, but unRAID will not actually use that disk until it is rebuilt. See this recent post about how unRAID emulates a drive.

Disk6 SMART shows pending sectors, which is bad and might cause an issue rebuilding Disk5, but at this point you don't really have a choice.

Do you have 2 spare disks? It would be best if you could rebuild disk5 to a new disk, and it will be even more important after you get disk5 green again to rebuild disk6 to a new drive, since disk6 needs to be precleared to see if you can resolve the pending sectors.
skyhawk Posted October 31, 2015

Thanks for the reply. I can easily get a second drive... it would offer better fault tolerance anyway. I feel like I'm wasting time and energy with smaller (500GB, 1.5TB, 1TB) disks when I can just get another 5TB, have a ton of extra storage, and more fault tolerance.

What's the best route to rebuild disk 5?
- Preclear the new drive, remove disk 5, replace with new drive, rebuild parity?

I wasn't aware that a preclear might resolve the pending sectors issue. That's cool.
trurl Posted October 31, 2015

Quoting skyhawk: "What's the best route to rebuild disk 5? - Preclear the new drive, remove disk 5, replace with new drive, rebuild parity?"

Not rebuild parity, rebuild disk5.
skyhawk Posted October 31, 2015

Quoting trurl: "Not rebuild parity, rebuild disk5."

Lol, that's what I meant... rebuild the 'missing' disk onto the new disk using the parity's data. Sorry, been a crazy week. Just got home and needed a hobby for the afternoon!
trurl Posted October 31, 2015

Quoting skyhawk: "Lol, that's what I meant... rebuild the 'missing' disk onto the new disk using the parity's data."

Even that's not really quite right though. The parity doesn't have the disk's data. Parity plus ALL the other disks allow the drive's data to be calculated.
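To make that last point concrete, here is a small Python sketch of single-parity reconstruction. It assumes the parity is a plain bytewise XOR across the data disks, and it uses tiny made-up byte strings in place of real drives. The names (`xor_blocks`, `disk1`, `disk2`, `disk5`) are illustrative only and this is not unRAID's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny stand-ins for data disks (made-up contents).
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\x0f\x0e\x0d\x0c"
disk5 = b"\xaa\xbb\xcc\xdd"

# Parity is computed from ALL data disks.
parity = xor_blocks([disk1, disk2, disk5])

# disk5 drops out: its contents are recomputed from parity plus every other disk.
rebuilt = xor_blocks([parity, disk1, disk2])

print(rebuilt == disk5)   # the missing disk is recovered
print(parity == disk5)    # parity on its own is not a copy of disk5
```

The parity drive holds no file data of its own. Leave any one of the surviving disks out of the last XOR and the result is no longer `disk5`, which is why every remaining disk has to be readable during a rebuild.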
skyhawk Posted November 14, 2015

Unfortunately, another update.

So, after being out of town, I had the pleasure of moving and had 8 days to do so... fun!

Before I left, I ran preclear on the new disk (5TB... 72 hours later), took the old disk 5 out of the array, and restored the data. Everything worked great. Since my tower was full, I had the new disk sitting on a flat surface with a fan on it. So, after everything was complete, I decided I needed to take the bad drive out of the computer and put the new drive in its place. That way, I could deal with formatting and testing the 'bad' drive after the move, and the new drive was secure.

Powered down, replaced the drive. Within 15 minutes of restarting... Bam. New disk is Invalid, contents emulated.

I'm wondering if my 4-port SATA card is bad. That's the only thing that both drives have in common (I replaced the SATA cable, and the power cable is the same, but that seems less likely).

Here is the card that I'm using and have been for the past 4 years in unRAID:
http://www.monoprice.com/product?c_id=104&cp_id=10407&cs_id=1040702&p_id=2667&seq=1&format=2

Thoughts? If I should replace it, what's a good budget suggestion? 4 ports minimum, more than 4 would be great. $100 max if possible.

Seeing as I had the exact same issue a few weeks ago, it's unlikely that I'm having a drive issue. Is there a way for me to get unRAID to accept this disk as good again so that I don't have to preclear another drive (72 hours) and rebuild (36 hours)? Or would this be a bad idea?

Thanks.
trurl Posted November 14, 2015

Did you ever do anything about the pending sectors on disk6?
skyhawk Posted November 14, 2015

Not yet. Disk 5 was up less than a day and I was packing the house. I was going to fix disk 6 once I moved. I'm here now with no furniture, but a server and internet, lol.

I did back up all of disk 6 to an external hard drive, so the data is safe.
trurl Posted November 14, 2015

Quoting skyhawk: "I did back up all of disk 6 to an external hard drive, so the data is safe."

Disk6 data may be safe, but what about disk5? One of the reasons it is recommended to deal with pending sectors is because they can cause a rebuild of another drive to fail.

If there are any files on any disk that are irreplaceable and you don't have them backed up, do so.

Another diagnostic would let us see the SMART for the new disk. Assuming it is OK you can rebuild onto it again. But you need to address the issue that is causing this if it is not the disk.
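Why pending sectors on disk6 threaten a rebuild of disk5 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not unRAID's rebuild logic: each disk is a short list of integer "sectors", single parity is a plain XOR, and an unreadable (pending) sector is represented as `None`. The function name `rebuild_sectors` and all values are made up for illustration.

```python
def rebuild_sectors(parity, survivors):
    """Rebuild a missing disk sector by sector.

    parity    -- list of parity sectors (ints)
    survivors -- list of surviving disks, each a list of ints or None
                 (None stands for a sector that cannot be read)
    Returns a list where a sector is None if it could not be recomputed.
    """
    rebuilt = []
    for idx, parity_sector in enumerate(parity):
        inputs = [parity_sector] + [disk[idx] for disk in survivors]
        if any(sector is None for sector in inputs):
            rebuilt.append(None)  # one unreadable input means no answer here
            continue
        value = 0
        for sector in inputs:
            value ^= sector
        rebuilt.append(value)
    return rebuilt


# Made-up contents while everything was still healthy.
disk1 = [0x11, 0x22, 0x33, 0x44]
disk6_healthy = [0x0A, 0x0B, 0x0C, 0x0D]
disk5 = [0xA1, 0xB2, 0xC3, 0xD4]
parity = [a ^ b ^ c for a, b, c in zip(disk1, disk6_healthy, disk5)]

# Later: disk5 is missing and disk6 has one sector it can no longer read.
disk6_now = [0x0A, None, 0x0C, 0x0D]

result = rebuild_sectors(parity, [disk1, disk6_now])
print(result == [0xA1, None, 0xC3, 0xD4])  # sector 1 of disk5 is lost
```

In this model the bad sector on disk6 costs one sector of disk5 as well, even though disk5's replacement is brand new. That is the reason for dealing with disk6 before relying on another rebuild.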
skyhawk Posted November 14, 2015

New diagnostics here:
https://drive.google.com/file/d/0B1uaV-iMhp2lY2xBODgxbDVpVFU/view?usp=sharing

The current disk 5 is the brand new 5TB. It passed preclear (1 pass, didn't have time for more).

Assuming that it is not the disk (2 identical failures would be very unlikely), I'm leaning towards the SATA card, as it was a cheap card and it's 4 years old. And it's the only common denominator between the 2.

The new disk 5 was working fine (played a dozen movies and other files to test that it was working... all OK) until I moved it and switched it to the SATA card (same port as the old disk 5). The 2 other drives on the SATA card are working fine (the 3rd isn't hooked up).
skyhawk Posted November 21, 2015

Bump. Thanks.
skyhawk Posted November 24, 2015

Since this seems to be a pain in the butt to fix, I'm going with the following solution unless other advice is offered today.

Backing up my 4.5TB of data to a new 5TB external. Then I'm just going to force the invalid drive back to working. If I lose anything, I have it all backed up. I'll then preclear the drives with issues and see how they are working and if the errors clear. Also buying a new 4-port SATA card, since I'm convinced one port is bad.

After that, I'll keep the new 5TB as a hot spare, since I see the value in that now. I always have drives sitting around, but didn't remember you can't replace a problem drive with a smaller one.

Suggestions welcome on the SATA card. Leaning towards this: SYBA SI-PEX40064 PCI-Express 2.0 Low Profile Ready SATA III (6.0 Gb/s) Controller Card. $22 at Newegg. Hopefully I see faster speeds, since my current card is SATA I.