
Another red-balled disk - superblock corruption?



Another disk came up red-balled tonight. Earlier, the server wouldn't shut down, so I had to force it off and reboot. The SMART report came back as PASSED, though there do seem to be some pending sectors. However, when I ran reiserfsck --check against the disk, I got this error:

 

reiserfs_open: the reiserfs superblock cannot be found on /dev/sdg.
Failed to open the filesystem.

If the partition table has not been changed, and the partition is
valid  and  it really  contains  a reiserfs  partition,  then the
superblock  is corrupted and you need to run this utility with
--rebuild-sb.

 

I just want to be sure this is a safe procedure before I proceed. After fixing the superblock, I plan to rebuild the disk. Are there any other things I should be doing first? I'm as sure as I can be that the data on the disk is OK.

SMART_11.12.11.txt

syslog-20111012-215229.txt.zip


Well, this doesn't look good:

 

###########
reiserfsck --check started at Wed Oct 12 23:00:00 2011
###########
Replaying journal: Trans replayed: mountid 152, transid 47341, desc 5188, len 1, commit 5190, next trans offset 5173
Trans replayed: mountid 152, transid 47342, desc 5191, len 1, commit 5193, next trans offset 5176
Replaying journal: Done.                                                        
Reiserfs journal '/dev/sdg1' in blocks [18..8211]: 2 transactions replayed
Checking internal tree.. \/  6 (of  16//131 (of 170|/ 39 (of 170\
The problem has occurred looks like a hardware problem. If you have
bad blocks, we advise you to get a new hard drive, because once you
get one bad block  that the disk  drive internals  cannot hide from
your sight,the chances of getting more are generally said to become
much higher  (precise statistics are unknown to us), and  this disk
drive is probably not expensive enough  for you to you to risk your
time and  data on it.  If you don't want to follow that follow that
advice then  if you have just a few bad blocks,  try writing to the
bad blocks  and see if the drive remaps  the bad blocks (that means
it takes a block  it has  in reserve  and allocates  it for use for
of that block number).  If it cannot remap the block,  use badblock
option (-B) with  reiserfs utils to handle this block correctly.

bread: Cannot read the block (263314195): (Input/output error).

Aborted

 

And now I can't get a SMART report:

 

root@Tower:~# smartctl  -a  -d  ata  /dev/sdg
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


A bit of an update: I restarted the array and was able to get a SMART report from that disk. I ran the long SMART test overnight, and it aborted about 10% of the way through with many read errors. I am currently copying all data off the drive, with the plan to remove it from the array, then re-add it and rebuild it. Does that seem like a good strategy, or is there something I should be doing first?

 

SMART report below

SMART_101113.txt


I would not add the drive back into the array yet. First, I would run a preclear cycle on it and see whether it passes.

The read errors are not expected, but if your spin-down timer spins a drive down, the long test will abort.

Make sure you disable any spin down timer when doing a "long" test.

I thought I had done so, but now I'm second-guessing myself. I assume the individual disk spin-down settings override the general array spin-down setting, correct? I had the bad disk set to never spin down, and the array was left at a 30-minute spin-down. I hope that was right.


I hope you have the array running with the failed disk being simulated, and that you're copying that simulated data. It's quite pointless to copy the data off the failed disk directly, since you won't know which data failed to copy and which files have been corrupted by the bad sectors.

 

Peter

 

 

Yes, the array is started, and I'm using rsync to copy all files from /mnt/disk3 (the red-balled one) to /mnt/disk10. Please tell me this is using the simulated data and I haven't been wasting my time all day. It seems like it must be, since the UnRAID Main page shows only a handful of reads on disk3 compared with well over 2 million writes to disk10.


Archived

This topic is now archived and is closed to further replies.
