[SOLVED] Dual red balls of death...



So,

 

I'm trying to deal with an iffy one.

 

I've got two failed data drives.

 

One of the two seems to be okay-ish, or so says smartctl (I'll post the report at the bottom of this post). This is the one that red balled first - I'd suspect ButterfingersMcIdiot here hit a cable on the way to figuring out which fan was making a weird noise. Of course, the array was happily writing some truly trivial stuff right then.

 

A 2nd one was throwing write errors. So I figured I might as well try to copy everything off both drives to an external one. This was the kiss of death for drive 2. I tried reviving it by switching slots and everything, to no avail - kaput.

 

So now I have two dead drives, one of which is _really_ dead, and the other of which smartctl says should be able to limp along, at least for now.

 

The completely dead drive has been replaced with a fresh, identically-sized one.

 

Now...

 

 

Anybody know what the procedure on 5.04 would be to lose as little data as possible?

 

 

Also, here's smartctl -a for the zombie drive.
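(Pulled from the console with plain old smartctl - sdX below is a placeholder for whatever device disk11 maps to on your box:)

smartctl -a /dev/sdX    # full report: identity, attributes, error log, self-test log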

 

 

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(30360) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 347) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       8400
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2323
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37442
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       173
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       70
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2252
194 Temperature_Celsius     0x0022   110   101   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   001   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 14017 hours (584 days + 1 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 17 97 00 e0  Error: UNC 8 sectors at LBA = 0x00009717 = 38679

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 17 97 00 e0 08  16d+13:07:43.844  READ DMA
  e5 00 00 00 00 00 40 08  16d+13:07:42.851  CHECK POWER MODE
  e5 00 00 00 00 00 40 08  16d+13:07:32.819  CHECK POWER MODE
  e5 00 00 00 00 00 40 08  16d+13:07:22.802  CHECK POWER MODE
  e5 00 00 00 00 00 40 08  16d+13:07:12.782  CHECK POWER MODE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     37431         -
# 2  Short offline       Completed without error       00%     37426         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


unRAID redballs (disables) a drive when a write to it fails. After the drive is disabled, unRAID will not actually use the disk anymore until it is rebuilt. Instead it emulates the drive.

 

When you read from an emulated drive, unRAID actually reads ALL of the other data drives, plus the parity drive, and calculates the data of the emulated drive. The disabled drive is not read.

 

When you write to an emulated drive, unRAID actually reads ALL of the other data drives, plus the parity drive, calculates the data of the emulated drive, calculates what change would happen to parity as a result of the write to the emulated drive, and then writes parity. The disabled drive is not written.
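To make that concrete, here's a toy single-parity example - just the XOR arithmetic, nothing unRAID-specific, and the byte values are made up. Parity is the XOR of all the data disks, so any one missing disk's data is the XOR of parity with all the surviving disks:

# parity of three data bytes
printf '0x%02X\n' $(( 0xA5 ^ 0x0F ^ 0x3C ))    # -> 0x96
# "emulate" the missing middle byte from parity plus the survivors
printf '0x%02X\n' $(( 0x96 ^ 0xA5 ^ 0x3C ))    # -> 0x0F, the missing byte

That's also why a second failed disk is fatal to a clean rebuild: with two unknowns, one parity equation isn't enough.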

 

Sounds like you are saying disk11 redballed first, possibly due to a loose connection, and then instead of rebuilding it, you did some additional reading/writing on the array. Note that disk11 would have been emulated and not actually used during this.

 

Then disk7 actually failed.

 

Are you absolutely sure disk7 is dead? The only possibility I see for a completely successful rebuild is if you can get unRAID to greenball the original disk7. Then you would only have the original redball and you could rebuild it.
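(If you want to double-check before writing it off, something like this from the unRAID console would tell you whether the drive even identifies and reads - sdX is a placeholder for wherever the old disk7 shows up:)

smartctl -i /dev/sdX                          # does the drive identify itself at all?
dd if=/dev/sdX of=/dev/null bs=1M count=10    # can the first few MB be read without errors?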

 

Before we continue speculating about what else could be tried to recover at least some of the data from those disks, can you confirm that disk11 was the 1st disk and disk7 the 2nd in your original post?


 

Before we continue speculating about what else could be tried to recover at least some of the data from those disks, can you confirm that disk11 was the 1st disk and disk7 the 2nd in your original post?

 

 

You're right: disk11 was the 1st disk and disk7 was the 2nd. disk7 was the one throwing errors.

 

I've tried putting disk7 into an external enclosure; it seems to whirr up when connected over FireWire to my Mac, but then I run into OS X not having ReiserFS support (and, of course, no SMART testing either).

 

I've also tried connecting it via eSATA to the unRAID box (if only to minimise the odds of wonky SATA cables or a power-supply gremlin), and that, well, didn't work.

 

There shouldn't be any recently-written (i.e., last 6 months or so) irreplaceable data of any importance to me on either of the drives... but I'd rather get as much as possible back.


Disk11 must have "thrown" a write error also, or it would not have been disabled.

 

Been a while since I have used v5. Possibly you could just forget about disk7, New Config without it, and then see if disk11 needs any filesystem repair.
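Something like this from the console, read-only first (sdX1 being whichever device/partition disk11 maps to - double-check on the Main page before running anything):

reiserfsck --check /dev/sdX1    # read-only consistency check; writes nothing to the disk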

 

Another possibility might be to New Config and Trust Parity with a new drive in disk7. Then stop, unassign disk7, start without it, stop again, reassign disk7 so unRAID rebuilds it. This method is probably going to result in some filesystem corruption on the rebuilt disk in addition to the filesystem corruption that might already be on disk11, but perhaps something could be recovered from disk7 that way.
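(And if the old disk7 will spin up at all, it might be worth imaging it outside the array first - a rough sketch with GNU ddrescue, untested here, device and paths are placeholders:)

ddrescue -f -n /dev/sdX /mnt/somewhere/disk7.img /mnt/somewhere/disk7.map     # quick pass, skip the bad areas
ddrescue -f -r3 /dev/sdX /mnt/somewhere/disk7.img /mnt/somewhere/disk7.map    # then retry the bad spots a few times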

 

Another thing I have seen mentioned in some older threads is the invalidslot command. Don't know if that still applies or not.

 

I don't have any personal experience with this other than participating in some forum threads and having some pretty good understanding of how unRAID parity arrays work.

 

Hopefully someone else will chime in.


Thank you so much for the pointers!!!

 

 

Another possibility might be to New Config and Trust Parity with a new drive in disk7. Then stop, unassign disk7, start without it, stop again, reassign disk7 so unRAID rebuilds it. This method is probably going to result in some filesystem corruption on the rebuilt disk in addition to the filesystem corruption that might already be on disk11, but perhaps something could be recovered from disk7 that way.

 

 

I'll give this a go and see what happens... Am I right in my understanding (and assuming disk11 is reasonably OK) that, since parity is essentially shot anyway because of the two missing disks, no further damage can be done to it with the blank drive?


What do you mean, parity is essentially shot? That procedure relies on parity and all the other disks to rebuild. Parity may not be perfect, and the other disks may not be perfect, but they are all you have to rebuild that disk with, so the rebuild may not be perfect either - but maybe something can be recovered with other tools after the rebuild.

 

During the 1st step, where you New Config with Trust Parity and then start, if it starts a parity check, stop it. The actual rebuild part of the procedure will not write to parity or any other drives except the drive to be rebuilt.

 

Better yet, after you unassign the disk and start the array without it, you could see if unRAID is able to emulate a filesystem for the disk. Possibly not.

 

I would wait a while to see if anyone else has a better idea. If you do decide to proceed, you should probably have some more detailed instructions and feedback as you go.

 

I will probably be unavailable for some time, but maybe someone else can help. Be patient.


 

What do you mean, parity is essentially shot? That procedure relies on parity and all the other disks to rebuild. Parity may not be perfect, and the other disks may not be perfect, but they are all you have to rebuild that disk with, so the rebuild may not be perfect either - but maybe something can be recovered with other tools after the rebuild.

 

 

By "parity being shot", I was assuming that both disks7 and 11 were screwed up, and that that had somewhat messed with parity by ricochet.

 

My concern was that, by declaring the configuration trusted, the parity rebuild might assume disk7, which is blank, is correct, and possibly try to correct "errors" on good drives as a result. It's quite reassuring to hear I was wrong in my understanding.

 

I did run a reiserfsck on disk11, which came back good. The unknown, of course, is exactly how long it was out of commission for (something that'd affect the rebuild, if I'm not mistaken).

 

 

 


root@Tower:/# reiserfsck --check /dev/sdi1
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to [email protected], **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/sdi1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Fri Oct 30 21:39:51 2015
###########
Replaying journal: Trans replayed: mountid 81, transid 154065, desc 2632, len 1, commit 2634, next trans offset 2617
Trans replayed: mountid 81, transid 154066, desc 2635, len 1, commit 2637, next trans offset 2620
Trans replayed: mountid 81, transid 154067, desc 2638, len 2, commit 2641, next trans offset 2624
Trans replayed: mountid 81, transid 154068, desc 2642, len 26, commit 2669, next trans offset 2652
Trans replayed: mountid 81, transid 154069, desc 2670, len 4, commit 2675, next trans offset 2658
Trans replayed: mountid 81, transid 154070, desc 2676, len 2, commit 2679, next trans offset 2662
Replaying journal: Done.                                                        
Reiserfs journal '/dev/sdi1' in blocks [18..8211]: 6 transactions replayed
Checking internal tree.. finished                                
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                                       
No corruptions found
There are on the filesystem:
Leaves 435114
Internal nodes 2724
Directories 12919
Other files 44325
Data block pointers 437408310 (0 of them are zero)
Safe links 0
###########
reiserfsck finished at Fri Oct 30 22:49:41 2015
###########


The reiserfsck of disk11 is encouraging.

 

Take a screenshot of your Main page so you will know which disk goes where, since New Config is going to clear all the assignments and you need to put them back exactly.

 

We'll have to work through this a little bit with the webUI as a guide, as I don't remember all the details exactly.

 

Go to Tools - New Config and it will let you define a new disk configuration. The reason we need to do this is to get unRAID to use the disabled disk11, and so we can avoid having it rebuild parity against the blank disk.

 

After you assign the drives, including the new disk, there will be a checkbox telling it to not rebuild parity (or trust parity, or something like that), and you should check the box.

 

When you start the array, it may start to check parity anyway. Stop it as quickly as you can.

 

After you have stopped the array, unassign the new disk and start the array again. At this point, you may be able to see if unRAID will let you browse the emulated disk. Possibly it won't really be able to mount it if it is corrupt.
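(A quick way to check from the console, assuming the emulated disk still appears at its usual mount point:)

ls /mnt/disk7    # if the emulated filesystem mounted, this lists its top-level folders; errors point to corruption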

 

Just go this far and let's see what happens. Ask questions and post screenshots as you go as needed.

 

 


So,

 

Swapped the drive out, rebuilt it from parity.

 

As (somewhat) expected, I couldn't ls it:

 

root@Tower:/mnt/disk7/Downloads# ls
/bin/ls: cannot access TV: Permission denied
/bin/ls: cannot access Music: Permission denied
/bin/ls: cannot access Movies: Permission denied
(...)

 

fsck'd it:

 

root@Tower:/dev# reiserfsck --check /dev/sdn1
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to [email protected], **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/sdn1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Sat Oct 31 19:00:00 2015
###########
Replaying journal: Trans replayed: mountid 253, transid 852825, desc 1903, len 1, commit 1905, next trans offset 1888
Trans replayed: mountid 253, transid 852826, desc 1906, len 1, commit 1908, next trans offset 1891
Trans replayed: mountid 253, transid 852827, desc 1909, len 9, commit 1919, next trans offset 1902
Trans replayed: mountid 253, transid 852828, desc 1920, len 17, commit 1938, next trans offset 1921
Replaying journal: Done.                                                        
Reiserfs journal '/dev/sdn1' in blocks [18..8211]: 4 transactions replayed
Checking internal tree.. \/  1 (of  20|/  1 (of 121//119 (of 119|block 121044994: The level of the node (0) is not correct, (1) expected
the problem in the internal node occured (121044994), whole subtree is skipped
/  2 (of 121/block 136686277: The level of the node (0) is not correct, (2) expected
the problem in the internal node occured (136686277), whole subtree is skipped
/  2 (of  20-block 165446653: The level of the node (0) is not correct, (3) expected
the problem in the internal node occured (165446653), whole subtree is skipped
finished     
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
Bad nodes were found, Semantic pass skipped
3 found corruptions can be fixed only when running with --rebuild-tree
###########
reiserfsck finished at Sat Oct 31 19:02:06 2015

 

And now, let's wait a few hours to see if anything can be salvaged.

 


root@Tower:/dev# reiserfsck --scan-whole-partition --rebuild-tree /dev/sdn1
reiserfsck 3.6.21 (2009 www.namesys.com)

*************************************************************
** Do not  run  the  program  with  --rebuild-tree  unless **
** something is broken and MAKE A BACKUP  before using it. **
** If you have bad sectors on a drive  it is usually a bad **
** idea to continue using it. Then you probably should get **
** a working hard drive, copy the file system from the bad **
** drive  to the good one -- dd_rescue is  a good tool for **
** that -- and only then run this program.                 **
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to [email protected], **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will rebuild the filesystem (/dev/sdn1) tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
Replaying journal: Done.
Reiserfs journal '/dev/sdn1' in blocks [18..8211]: 0 transactions replayed
###########
reiserfsck --rebuild-tree started at Sat Oct 31 19:09:08 2015
###########

Pass 0:
####### Pass 0 #######
The whole partition (488378638 blocks) is to be scanned
Skipping 23115 blocks (super block, journal, bitmaps) 488355523 blocks will be read


Ok...

 

So, the results of the check:

 

Pass 1 (will try to insert 437476 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100%                          left 0, 65 /sec
Flushing..finished
        437476 leaves read
                437438 inserted
                38 not inserted
        non-unique pointers in indirect items (zeroed) 1
####### Pass 2 #######
Pass 2:
0%....20%....40%....60%....80%....100%                           left 0, 0 /sec
Flushing..finished
        Leaves inserted item by item 38
Pass 3 (semantic):
####### Pass 3 #########

<snip>

DataFlushing..finished                                                                                
        Files found: 9738
        Directories found: 1754
        Others: 1
        Broken (of files/symlinks/others): 2
        Names pointing to nowhere (removed): 4
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Looking for lost files:1 /sec
rewrite_file: 2 items of file [3336 26] moved to [3336 132]
Flushing..finished
        Objects without names 1
        Empty lost dirs removed 1
        Files linked to /lost+found 1
        Objects having used objectids: 1
                files fixed 1
Pass 4 - finished done 434801, 39 /sec
        Deleted unreachable items 3381
Flushing..finished
Syncing..finished

 

The array is stopped right now.

 

Am I right to assume that the next thing to do is essentially to unassign parity, start the array, stop it, reassign parity, and start again so parity gets rebuilt based on the rebuilt drive?
