Two drive issues and I'm out of town (I'm back now)


skyhawk


So, I've been away for 4 months for work. Luckily I return in 2 weeks. I tried my best to add redundancy to the server before I left, and it turns out that 1 more backup drive would have been a good option... or adding a 5-in-3 hot swap cage  >:(

 

Here's the rundown:

Disk 6: 2 weeks ago I started to get emails about disk 6 having 1 Offline Uncorrectable error. I looked it up and the forum said 1 is OK, more is bad. As of Thursday night, this number is up to 6. Time to move. This is my original parity drive from 6-7 years ago when I started with unRAID, so she's old.

 

Thursday night I took disk 6 out of my shared folders so the mover would stop adding to it, and used MC to move the data over to disk 5. Everything was going great at first (moved about 300 GB successfully), but then I saw the errors adding up on Disk 5 (1536 read errors on the unRAID dashboard). Then it said Disk offline, data emulated (or something to that effect).

 

So, I took the array offline. It's still powered up but I stopped the array. I tried starting the array, but now Disk 5 shows nothing and the hard drive isn't selectable.

 

So... is it possible that a SATA cable just happened to come loose after all this time? Or is Disk 5 possibly failing as well? I know this is hard to diagnose over the internet. The drive is only 6 months old and has been used successfully since then. I ran 2 (I think) preclears before installing it, with no issues.

 

What would you do in this situation? Options that I can think of (NOTE: with Disk 5 and Disk 6 BOTH out of the array, I only have 3TB between the other drives, which isn't enough to move the contents of both over to the remaining drives):

- Shut down the server, wait 2 weeks, check it out then. (Problem: wife's at home and she won't have access to her TV shows.)

- Turn off the parity check and just run with disk 5 offline, move disk 6 data to another disk. (Turn off the parity check so parity is still valid when I get there, so I have more options in case disk 5 is DOA.) Hmm, on second thought this won't work if any changes are made in the meantime.

- Order a new drive and have wifey install it in the one and only hot swap bay that I have... but it has no fan and the drive would run at 50-53 degrees C until I get home. Then just rebuild parity with disk 5 missing and copy over disk 6.

- Offer a South Florida unRAID member money/beer/etc. to check it out, or pay an IT guy to go out.

- ? ? ? ? ? ?

 

Disk 6 SMART report attached. I can't access disk 5, so I don't have its SMART report.

 

Capture.PNG

Disk_6_Smart_Test_1016.txt

syslog.zip

Link to comment

Complete diagnostics would be better. Go to Tools - Diagnostics, get the zip file, and post it. That would allow us to look at some other things, including your other drives' SMART.

 

Do you have backups? If not then if it were me I would probably shut it down until I could work on it myself. Or at least quit writing to it. You don't really have any parity protection unless all disks are good.

 

Do you use docker? Lots of stuff in your syslog about the loop device. Also, your syslog rotated, so it is incomplete. Complete diagnostics would give us the rest of your syslogs too.

Link to comment

Diagnostics attached here... too large to attach on the forum

https://drive.google.com/file/d/0B1uaV-iMhp2lTjdiRHBRMVAxOEE/view?usp=sharing

 

No other backups. It's 95% TV and movies, not essential but a pain in the ass to lose. I have docker running the usual suspects... SAB, CP, Sonarr, not much else really. Sometime in the next year or 2 I plan to build a second server for off-site redundancy, but it's not in the budget yet.

 

Worst case, I can shut down and my wife will just buy episodes from Amazon Prime. Idiot tax for not getting a hot swap bay.

 

...and... THANKS!

Link to comment

With no backups (that's another story), you should always have at least one spare drive handy for failures.

 

Then you could have IMMEDIATELY replaced disk #6 when it started to show errors.    Of course without a hot-swap bay, trying to talk your wife through that process might have been interesting  :)    ... although if you kept the spare drive installed and connected to a controller, you could have done the replacement yourself by simply stopping the array; unassigning the bad drive; starting the array (so it was marked as missing);  stopping the array and assigning your spare; and then starting the array (which would start the rebuild).

 

Of course after you had done that, you should immediately order another spare.

 

At this point, if you don't want to lose more data than you likely already have, I'd (a) shut down the array and leave it off until you're home;  and (b) order a couple of spare drives so you'll have them in hand when you arrive home.

 

Link to comment

I replaced the parity and had 3TB extra space and 3TB data, so I thought I was covered. 100% extra. But most of my drives are smaller and older, so I figured two 2TB drives was good overhead. Never expected both could have issues simultaneously. Figured an old small one would die first. But live and learn.

 

I see most drives are built to handle 55 degrees C, so I doubt a new drive would be too negatively affected if wifey put it in the fanless hot swap bay. But... opinions? I won't have a lot of free time when I get home... moving.

Link to comment

see below

 

So, I'm going to shut down until I get home

 

Any input on the best route for recovery?

 

Check cables on disk 5. If the drive is dead, remove it from the array and rebuild parity with the drive missing. Immediately thereafter, replace disk 6 (or just migrate the data, as my replacement will be a 5TB drive).

 

If disk 5 was just a loose cable, run parity, and then replace disk 6. Or skip parity and remove 6 and rebuild.

 

Thanks.

Link to comment
  • 2 weeks later...

UPDATE: I'm back in town. New update at bottom.

OK, re-seated the cables and Disk 5 was visible. Tried to bring it online but got an Offline error. Power is working, as the drive is spinning. I replaced the SATA cable, restarted, and I can access disk 5 again. See SMART report (attached).

 

However, there is still a red X next to the drive and it says Disabled, data emulated. When I boot the server, I can select the disk from the drop-down list. Once I try to start the array, it goes offline. If I stop the array, it shows No Device and it's missing from the drop-down list until the next restart.

 

So, I think the drive is working. SMART looks good to me (but I might be wrong). So, does this mean the drive has issues, or do I have to do something to force unRAID to see the disk as valid again?

 

Keep in mind that Disk 6 is also throwing errors. A new 5TB drive arrived today, but it still needs to be precleared. I don't want to risk Disk 6 dying, or I'm losing some data.

 

DISK 5

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" HDD DT01ACA...
Device Model:     TOSHIBA DT01ACA200
Serial Number:    Y4B5UKNTS
LU WWN Device Id: 5 000039 ffaded4d9
Firmware Version: MX4OABB0
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Sat Oct 31 18:08:40 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(14535) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 243) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   139   139   054    Pre-fail  Offline      -       71
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       96 (Average 131)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       698
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   094   094   067    Pre-fail  Always       -       6
  8 Seek_Time_Performance   0x0005   124   124   020    Pre-fail  Offline      -       33
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5266
 10 Spin_Retry_Count        0x0013   090   090   060    Pre-fail  Always       -       131072
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       699
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       701
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 24/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

DISK 6

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.0.4-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4709370
LU WWN Device Id: 5 0014ee 002c2a55a
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sat Oct 31 18:22:15 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(40860) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 394) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       3910
  3 Spin_Up_Time            0x0027   203   164   021    Pre-fail  Always       -       4841
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3430
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       35794
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       57
193 Load_Cycle_Count        0x0032   119   119   000    Old_age   Always       -       243719
194 Temperature_Celsius     0x0022   115   096   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   199   000    Old_age   Offline      -       488

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     35631         346803864
# 2  Extended offline    Completed: read failure       90%     35579         346803867
# 3  Extended offline    Completed: read failure       90%     35297         346803865
# 4  Extended offline    Completed without error       00%     30589         -
# 5  Short offline       Completed without error       00%     30538         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

Disk5 SMART is OK, but unRAID will not actually use that disk until it is rebuilt. See this recent post about how unRAID emulates a drive.

 

Disk6 SMART shows pending sectors which is bad and might cause an issue rebuilding Disk5, but at this point you don't really have a choice.
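
For anyone who wants to pick the worrying counters out of a saved SMART report without reading the whole table, here is a rough Python sketch. The watch list (reallocated, pending and offline-uncorrectable sectors) and the function name flag_attributes are my own choices for illustration, not anything unRAID itself uses, and the sample rows are copied from the Disk 6 report posted above.

```python
import re

# Attributes whose non-zero raw value usually deserves attention.
# This watch list is an assumption for illustration, not an unRAID rule.
WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def flag_attributes(report: str) -> dict[str, int]:
    """Return {attribute name: raw value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # not an attribute row
        name, raw = match.group(2), int(match.group(3))
        if name in WATCH and raw > 0:
            flagged[name] = raw
    return flagged


# Rows taken from the Disk 6 report in this thread.
DISK6_SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(flag_attributes(DISK6_SAMPLE))
```

On those rows it reports 7 pending sectors and 6 offline uncorrectable, which are the two numbers that matter for Disk 6 here.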

 

Do you have 2 spare disks? It would be best if you could rebuild disk5 to a new disk, and it will be even more important after you get disk5 green again to rebuild disk6 to a new drive, since disk6 needs to be precleared to see if you can resolve the pending sectors.

Link to comment

Thanks for the reply. I can easily get a second drive... it would offer better fault tolerance anyway. I feel like I'm wasting time and energy with smaller (500GB, 1.5TB, 1TB) disks when I can just get another 5TB, have a ton of extra storage and more fault tolerance.

 

What's the best route to rebuild disk 5?

- preclear the new drive, remove disk 5, replace with new drive, rebuild parity?

 

I wasn't aware that a preclear might resolve the pending sectors issue. That's cool.

 

Link to comment

not rebuild parity, rebuild disk5.
Link to comment

lol that's what I meant... rebuild the 'missing' disk onto the new disk using the parity's data. Sorry, been a crazy week. Just got home and needed a hobby for the afternoon!

Link to comment

Even that's not really quite right though. The parity doesn't have the disk's data. Parity plus ALL the other disks allow the drive's data to be calculated.
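
To make that concrete, here is a toy Python sketch. It assumes single parity is a plain bytewise XOR across the data disks (my understanding of how unRAID's single parity works); the 4-byte "disks" and the helper name xor_blocks are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy 4-byte "disks"; the contents are invented for this example.
disks = {
    "disk4": bytes([0x10, 0x20, 0x30, 0x40]),
    "disk5": bytes([0xDE, 0xAD, 0xBE, 0xEF]),
    "disk6": bytes([0x01, 0x02, 0x03, 0x04]),
}
parity = xor_blocks(list(disks.values()))

# Disk 5 drops out: rebuild it from parity plus EVERY surviving data disk.
survivors = [data for name, data in disks.items() if name != "disk5"]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == disks["disk5"]

# If a survivor returns a wrong byte (think: an unreadable sector on disk 6),
# the rebuilt disk 5 comes out wrong at that same position.
damaged_disk6 = bytes([0x01, 0x02, 0xFF, 0x04])
bad_rebuild = xor_blocks([disks["disk4"], damaged_disk6, parity])
assert bad_rebuild != disks["disk5"]
```

The second half is why the pending sectors on disk 6 matter for rebuilding disk 5: the rebuild is only as good as what every other disk returns.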
Link to comment
  • 2 weeks later...

Unfortunately, another update

 

So, after being out of town, I had the pleasure of moving and had 8 days to do so.. fun!

 

Before I left, I ran preclear on the new disk (5TB... 72 hours later), took the old disk 5 out of the array and restored the data. Everything worked great. Since my tower was full, I had the new disk sitting on a flat surface with a fan on it. So, after everything was complete, I decided I needed to take the bad drive out of the computer and put the new drive in its place. That way, I could deal with formatting and testing the 'bad' drive after the move, and the new drive was secure.

 

Powered down, replaced the drive. Within 15 mins of restarting... Bam. New disk is Invalid, contents emulated.

 

I'm wondering if my 4-port SATA card is bad. That's the only thing that both drives have in common (I replaced the SATA cable; the power cable is the same, but that seems less likely). Here is the card that I'm using and have been for the past 4 years in unRAID (http://www.monoprice.com/product?c_id=104&cp_id=10407&cs_id=1040702&p_id=2667&seq=1&format=2)

 

Thoughts?

 

If I should replace it, what's a good budget suggestion? 4 ports minimum; more than 4 would be great. $100 max if possible.

 

Seeing as I had the exact same issue a few weeks ago, it's unlikely that I'm having a drive issue. Is there a way for me to get unRAID to accept this disk as good again so that I don't have to preclear another drive (72 hours) and rebuild (36 hours)? Or would this be a bad idea?

 

thanks

Link to comment

Not yet. Disk 5 was up less than a day and I was packing the house. I was going to fix disk 6 once I moved. I'm here now with no furniture, but a server and internet lol.

 

I did back up all of disk 6 to an external hard drive, so the data is safe.

Disk6 data may be safe, but what about disk5? One of the reasons it is recommended to deal with pending sectors is because they can cause a rebuild of another drive to fail. If there are any files on any disk that are irreplaceable and you don't have them backed up, do so.

 

Another diagnostic would let us see the SMART for the new disk. Assuming it is OK you can rebuild onto it again. But you need to address the issue that is causing this if it is not the disk.

Link to comment

new diagnostics here:

https://drive.google.com/file/d/0B1uaV-iMhp2lY2xBODgxbDVpVFU/view?usp=sharing

 

The current disk 5 is the brand new 5TB. It passed preclear (1 pass, didn't have time for more).

 

Assuming that it is not the disk (2 identical failures would be very unlikely), I'm leaning towards the SATA card, as it was a cheap card and it's 4 years old. And it's the only common denominator between the 2. The new disk 5 was working fine (played a dozen movies and other files to test that it was working... all OK) until I moved it and switched it to the SATA card (same port) as the old disk 5. The 2 other drives on the SATA card are working fine (the 3rd port isn't hooked up).

Link to comment

Since this seems to be a pain in the butt to fix, I'm going with the following solution unless other advice is offered today.

 

Backing up my 4.5TB of data to a new 5TB external. Then I'm just going to force the invalid drive back to working. If I lose anything, I have it all backed up. I'll then preclear the drives with issues and see how they are working and whether the errors clear. Also buying a new 4-port SATA card, since I'm convinced one port is bad.

 

After, I'll keep the new 5TB as a hot spare since I see the value in that now. I always have drives sitting around but didn't remember you can't replace a problem drive with a smaller one.
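
A quick Python sketch of that spare-sizing rule, for picking a spare out of the drawer. The "not smaller than the failed disk" part is from this thread; the "not larger than parity" part is general unRAID behaviour as I understand it and is not stated above. The function name can_replace and the 5TB parity size are hypothetical; only the 2TB failed disk matches the drives discussed here.

```python
def can_replace(failed_tb: float, replacement_tb: float, parity_tb: float) -> bool:
    """True if the replacement can stand in for the failed data disk.

    Assumed rules: the replacement must be at least as large as the disk it
    replaces, and no data disk may be larger than the parity disk.
    """
    return failed_tb <= replacement_tb <= parity_tb


# 2TB failed disk as in this thread; the 5TB parity size is a made-up example.
print(can_replace(failed_tb=2.0, replacement_tb=1.5, parity_tb=5.0))  # too small
print(can_replace(failed_tb=2.0, replacement_tb=5.0, parity_tb=5.0))  # fits
print(can_replace(failed_tb=2.0, replacement_tb=5.0, parity_tb=3.0))  # larger than parity
```

So a pile of old 1TB and 1.5TB drives is no help as spares for a 2TB disk, which is the trap described above.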

 

Suggestions welcome on a SATA card. Leaning towards this: SYBA SI-PEX40064 PCI-Express 2.0 Low Profile Ready SATA III (6.0 Gb/s) Controller Card. $22 at Newegg.

 

Hopefully I see faster speeds, since my current card is SATA I.

Link to comment
