
Started large copy to tower; server locked up; hard reset now files are gone


Vocatus


Please help, I stand to lose months of critical work if I can't recover these files.

 

I was moving a number of large VM disk files to the tower. About 80% of the way through it crashed (hard lock). I hard reset it, and it came back up just fine. However, the directory I was moving everything to is empty. Additionally, the cache drive appears as unformatted.

 

I'm on version 5.0.5. Please help, how do I recover the data? I assume/hope it's sitting on the cache drive...

Link to comment

When you say moving to tower, do you mean they were on some other machine, and instead of just copying them, you moved them so you no longer have the originals? If something is important you should have multiple copies. unRAID is not a backup solution unless you use it as an additional copy of your files. You should never have only one copy of anything important.

 

What are you seeing that makes you think the directory is empty? Are you looking at it from the unRAID command line, Windows Explorer, or what? Are you looking at a disk share or a user share?

 

A syslog and maybe a screenshot might better demonstrate what you are working with.

Link to comment

What format is the cache drive meant to be?  The fact that it is showing as unformatted suggests that some sort of file system corruption has happened so that unRAID can no longer mount it (do not format it unless you want to lose your data).  That would explain why your files are not showing up as they were almost certainly written to the cache drive.  Depending on what format it is meant to be the recovery tools will be different.

Link to comment

Hi trurl,

 

Yes, I was moving the files to the tower. The reason I was moving instead of copying was I needed to temporarily clear local disk space for a different operation, and do not have enough space to hold three copies (originally one copy was on workstation, one copy on tower array).

 

When I browse to the directory in Windows 7 file explorer I do not see any of the files I moved prior to the crash sitting there - the directory is empty.

 

The cache drive was in whatever filesystem Unraid defaults to; ReiserFS I believe? And yes, it's showing as unformatted and that's new. I have not hit "format" yet.

Link to comment

What do I need to do to recover the files from the cache drive?

 

Thanks for helping with this, seriously.

You first need to look up the device name for the cache drive in the GUI.  Also check that the GUI shows reiserfs as the format for the cache (other formats require different actions).  Then, from a telnet/console session, run a command of the form

reiserfsck --check /dev/sd?1
where sd? is the device name you looked up.  Do not forget the 1 on the end, which specifies the partition.  If there is any corruption, that command will report it along with the recommended action to fix it.
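
As a minimal sketch of the step above, assuming the GUI shows the cache as sdf (the device that appears later in this thread), a small helper can build the command so the trailing partition number is not forgotten. The function name reiserfs_check_cmd is made up for illustration; it only prints the command and runs nothing.

```shell
# Hypothetical helper: build the check command for a cache device.
# It only prints the command so it can be reviewed and then run by hand.
reiserfs_check_cmd() {
    dev="$1"    # device name as shown in the GUI, e.g. sdf (no /dev/ prefix, no partition number)
    case "$dev" in
        sd[a-z]|sd[a-z][a-z]) ;;    # accept whole-disk names only (sdX or sdXY)
        *) echo "unexpected device name: $dev" >&2; return 1 ;;
    esac
    # Append partition number 1. --check reports corruption but does not repair it.
    echo "reiserfsck --check /dev/${dev}1"
}

reiserfs_check_cmd sdf
```

Passing a name that already carries a partition number (sdf1) is rejected, which guards against ending up with /dev/sdf11.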
Link to comment

OK, I followed your instructions and this was the result:

 

Trans replayed: mountid 21, transid 161379, desc 4709, len 72, commit 4782, next trans offset 4765
<40 lines of output omitted>
Trans replayed: mountid 21, transid 161380, desc 4783, len 561, commit 5345, next trans offset 5328
Replaying journal: Done.
Reiserfs journal '/dev/sdf1' in blocks [18..8211]: 40 transactions replayed

Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree:
finished
No corruptions found
There are on the filesystem:
        Leaves 35883
        Internal nodes 217
        Directories 41
        Other files 811
        Data block pointers 36156914 (495824 of them are zero)
        Safe links 0
###########
reiserfsck finished at Thu Oct 16 12:32:32 2014
###########

 

Should I reboot the server now? In the GUI the cache drive still shows "unformatted"

 

EDIT: Stopped and restarted the array, and all files are back.

 

Thank you for the help, you saved me months of work. I learned my lesson: for critical files, copy THEN delete rather than just move.
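
The "copy THEN delete" idea can be sketched as a small shell function. This is only an illustration: safe_move is a made-up name, not an unRAID tool, and the paths in the usage comment are placeholders.

```shell
# Copy first, verify, and only then delete the source.
# safe_move is a hypothetical helper name.
safe_move() {
    src="$1"
    dst="$2"
    cp -p "$src" "$dst" || return 1    # copy, preserving timestamps and mode
    if cmp -s "$src" "$dst"; then      # byte-for-byte comparison of source and copy
        rm "$src"                      # delete the original only after the copy verifies
    else
        echo "verification failed, keeping $src" >&2
        return 1
    fi
}

# Usage (placeholder paths):
#   safe_move /path/to/disk.vmdk /mnt/user/share/disk.vmdk
```

Note that the comparison may be answered from the page cache rather than the disk, so a passing check is not proof the data survived on the destination drive. It does protect against an interrupted or truncated copy, which is the failure that happened here.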

Link to comment

Well, some more bad news. I ran the mover script to get everything off the cache drive and onto the array, and it failed. When re-running reiserfsck, I get this message:

 

###########
reiserfsck --check started at Thu Oct 16 13:55:21 2014
###########
Replaying journal: Trans replayed: mountid 22, transid 161381, desc 5346, len 1, commit 5348, next trans offset 5331
Trans replayed: mountid 22, transid 161382, desc 5349, len 1, commit 5351, next trans offset 5334
Trans replayed: mountid 22, transid 161383, desc 5352, len 10, commit 5363, next trans offset 5346

The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (5358): (Input/output error).

Aborted (core dumped)

 

Anything else I can try?

Link to comment


Can you post a smart report of the cache drive?

I had something like this once and I had to use ddrescue to copy a bad drive to another drive in order to repair it.
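
For reference, a rough sketch of what a ddrescue clone could look like. The exact command used back then is not given here, so this follows the common two-pass pattern for GNU ddrescue. The helper name ddrescue_plan, the device names, and the mapfile path are placeholders; the function only prints the commands.

```shell
# Hypothetical helper: print a two-pass GNU ddrescue plan for review (nothing is executed).
# src = failing drive, dst = replacement of equal or larger size,
# map = mapfile, which should live on a third, healthy disk.
ddrescue_plan() {
    src="$1"; dst="$2"; map="$3"
    if [ "$src" = "$dst" ]; then
        echo "source and destination must differ" >&2
        return 1
    fi
    # Pass 1: copy everything that reads cleanly, skipping the slow scraping phase (-n).
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: return to the bad areas with direct disc access (-d) and up to 3 retry passes (-r3).
    echo "ddrescue -f -d -r3 $src $dst $map"
}

ddrescue_plan /dev/sdX /dev/sdY /path/to/rescue.map
```

The -f flag is needed because the output is a block device, and it will overwrite the destination, so the source and destination names are worth checking twice. Any filesystem repair would then be run against the clone rather than the failing drive.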

 

 

Link to comment

Sure thing.

 

root@TheBrain:~# smartctl --all /dev/sdf
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3500830AS
Serial Number:    6QG0G750
Firmware Version: 3.AAE
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Fri Oct 17 07:13:52 2014 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 163) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   088   006    Pre-fail  Always       -       229052527
  3 Spin_Up_Time            0x0003   098   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1948
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       531914233
  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       35522
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       161
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   044   045    Old_age   Always   In_the_past 32 (Min/Max 32/32)
194 Temperature_Celsius     0x0022   032   056   000    Old_age   Always       -       32 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   062   054   000    Old_age   Always       -       243699885
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
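
In a report like the one above, the attributes most directly tied to unreadable sectors are IDs 5, 187, 197 and 198. A small filter can pull out just those raw values. This is a sketch only; smart_sector_summary is a made-up helper name, and it assumes the standard ten-column attribute table that smartctl prints.

```shell
# Hypothetical helper: read smartctl attribute output on stdin and print the
# sector-related attributes (IDs 5, 187, 197, 198) as NAME=RAW_VALUE.
# NF >= 10 skips shorter lines from other sections that happen to start with 5.
smart_sector_summary() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) { print $2 "=" $10 }'
}

# Usage:
#   smartctl -A /dev/sdf | smart_sector_summary
```

For the report above, all four raw values are 0.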

 

I would carefully check SATA/power cabling as well.  It is possible that a connection is loose which is causing the drive to drop offline for some reason.

 

Thanks, I'll check this when I get home today.

Link to comment

Nothing jumps out at me in the SMART report. Can you post the syslog from the time of the bread failure?

i.e.

bread: Cannot read the block (5358): (Input/output error).

 

As a confidence test, I would suggest running a long SMART self-test (smartctl -t long) on the drive.

Make sure you turn off spin down on the drive before issuing the test and do not access it until the test is done.

Upon initiating the test it will provide an ETA of how long it will take.
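
A sketch of the commands involved, assuming the drive is still /dev/sdf as in the report earlier in the thread. The helper name smart_long_test_plan is made up, and it only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print the commands for an extended SMART self-test.
smart_long_test_plan() {
    dev="$1"                            # whole-disk device, e.g. /dev/sdf
    echo "smartctl -t long $dev"        # starts the extended self-test and reports an estimated completion time
    echo "smartctl -l selftest $dev"    # run after that time has passed to read the self-test log
}

smart_long_test_plan /dev/sdf
```

The report above lists 163 minutes as the recommended polling time for the extended test on this drive, so that is roughly how long to leave it alone.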

 

 

It could be a cabling issue. Vibration can cause intermittent connection issues.

 

Link to comment

Well, I managed to get everything off the cache drive by rebooting multiple times over a few days. I noticed after a reboot the drive would work for about an hour before crashing again, so I used that to get everything off slowly.

 

The drive is very old and I suspect failing, so now that everything is recovered I'm replacing it.

 

Thank you again for the help.

Link to comment

Archived

This topic is now archived and is closed to further replies.
