disk problem - need help please!



I was previously running rc4 but had been having problems with the server dropping off the network every so often and needing to be restarted.  This became more frequent, so I ran a memory test overnight.  It did not reveal any problems, so I decided to upgrade to rc12a.  I previously had some plugins installed and did not remember the exact details, so I decided to start with a fresh install.  This all seemed to run fine: this morning I assigned the disks as they had been, everything looked good, so I brought the array online and started a parity check.

 

I just got home from work and realized that one share that spans several disks, including disk 1, is missing.  When I open the share for disk 1 it shows no files!  The parity check is at 87% and shows 470258523 errors for disk 1!!!

 

Please help, is there any way I can recover the contents of disk 1?

 

The syslog is 35 MB as a txt file and zips down to 635 KB.  What is the best way to share it?

 


Did you do a parity check BEFORE you upgraded, so you know your parity is good?

 

And do all of the other disks show zero errors?

 

If so, you can try 2 things:

 

(1)  Reseat the cables for Disk 1 and try the parity check again (or just try to access it and see if you can read the contents okay).    The errors you're seeing look like Reiser file system errors -- you need to run a reiserfsck check on it ... you need to wait for a Linux guru to advise you on how best to run this.  [Or read the details here and be VERY careful:  http://lime-technology.com/wiki/index.php/Check_Disk_Filesystems ]

 

(2)  You should be able to read the contents of Disk 1 even if it's not able to provide that data => the fault tolerance will "take over" and the data will be reconstructed from your other disks and the parity drive ... IF the parity is good (thus my first question).

 

In any event, if you don't have backups of your data (you SHOULD, of course), then you should definitely back up what's on Disk 1 NOW before proceeding.

 


... just to be clear, I'd back up the contents of Disk 1 BEFORE running the reiserfsck tool on it.  While reiserfsck shouldn't cause any data loss, I'm always VERY careful before doing anything that may alter the contents of a disk that contains data you need to recover.

 

 


thanks for your reply.

 

yes, parity was valid before the upgrade and all the other discs have 0 errors.

currently the disk 1 share shows no files on the disk....  so I can't backup the data....

 

What I don't understand is why the disk hasn't red balled if it's a loose cable etc......  I will power down and try reseating the cables after the parity check finishes.

 

I will need to wait for a  Linux guru to chime in as I have little experience....  :(

 

 


Whoa -- are you saying that Disk1 still shows as green ... AND that if you access \\Tower\Disk1 it shows as empty?? 

 

If that's the case, I'd do NOTHING until you either hear from a Linux guru (hopefully Joe L will chime in);  or you can send a note to Limetech with a pointer to this thread and a copy of your syslog.

 


 

Wow!  That's a BIG log.  I'm not sure what the attachment size limit is here -- have you tried attaching the .Zip file?    If that's still too large, wait and see what sections of the log Joe L (or Limetech) need to see;  then just save that section and .zip it.    But first, see if you can simply attach it to a reply -- it won't let you if it's too large; and I assume that if that's an issue, it will also tell you what the max size is  :)


it zips down to 600kb with max rar compression.  still too large to attach here.   

 

 

wow.....  the parity check finished with 470258525 errors on disc 1.......  but now the files are back!!!!!!!!!!!!  can anyone explain what happened?


It could still be green balled because the errors column in the webui counts read errors, not write errors; the drive will only go red ball on write errors. It sounds like the disk is in a bad way to me. Please post a SMART report.

 


 

 


Things keep getting weirder.... after a restart disk 1 is redballed......  but I am able to restart the array and access the files on disk 1.  I have attached the current syslog as it is smaller......    So...... is the parity still valid and disc 1 is being emulated?  I don't remember how to run a SMART report without Simple Features... can somebody give me step by step instructions....  sorry

syslog2.txt


Yes, it will be emulated. You can get a SMART report by running the following from the command line (using PuTTY):

 

smartctl -a -d ata /dev/xxx >/boot/smart.txt

 

where xxx is the device in question, e.g. sda; then attach smart.txt here.

 

If your parity is valid, and it looks like it is, then the worst case scenario is to rip out disk1, replace it and let parity rebuild it. But first a SMART report is required to confirm the disk is faulty; it could be PSU related.


here is the smart report.

 

I'm thinking maybe I bumped a cable when removing the thumb drive.....      I don't understand why the drive didn't redball during the parity check...  but if I am currently running under emulation for disk 1, then that means the parity is still correct even though I was getting errors for disk 1 and there appeared to be writes to the parity during the parity check........

smart.txt


"... disk 1 is redballed......  but I am able to restart the array and access the files on disk 1 " ==>  That's what fault-tolerance is all about  :)

 

When you read Disk 1, you're not actually using it at all.  ALL of the other disks, plus the parity disk, are being read, and the data is being reconstructed.    In RAID lingo, this is called "at risk" ... since you're running in a state where any more failures will result in data loss.
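To make the emulation concrete, here is a minimal Python sketch of single-parity reconstruction. It assumes parity is simply the bytewise XOR of all the data disks; the disk contents and the function names (`xor_blocks`, `compute_parity`, `emulate_disk`) are made up for illustration and are not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def compute_parity(data_disks):
    """Parity is the XOR of every data disk (assumed single-parity scheme)."""
    return xor_blocks(data_disks)

def emulate_disk(missing_index, data_disks, parity):
    """Rebuild one disk from all the OTHER disks plus parity.

    The missing disk itself is never read, which is why a red-balled
    disk can still appear to serve its files.
    """
    survivors = [d for i, d in enumerate(data_disks) if i != missing_index]
    return xor_blocks(survivors + [parity])

# Three tiny made-up "disks" of three bytes each
disks = [b"\x10\x20\x30", b"\x0f\x0e\x0d", b"\xaa\xbb\xcc"]
parity = compute_parity(disks)

rebuilt = emulate_disk(0, disks, parity)
print(rebuilt == disks[0])  # True
```

Note that the rebuild only works while every other disk and the parity disk read back correctly, which is the "at risk" state described above.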

 

Note that if there were writes to the parity drive during the parity check, your data MAY not actually be correct.  When a correcting parity check is done, UnRAID always assumes that any error is on the parity drive, and corrects the parity drive based on that assumption.
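As a rough illustration of why that matters, the Python sketch below simulates a correcting check when one data disk returns wrong data. It assumes XOR parity and the behaviour described above (every mismatch is "fixed" by overwriting parity); the all-zero read from disk 0 is a hypothetical fault chosen for the example, not something taken from this thread's syslog.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def correcting_check(disk_reads, parity):
    """Recompute parity from what the data disks returned.

    Returns (new_parity, mismatch_count). Every mismatch is assumed to be
    a parity error, so parity is overwritten with the recomputed value.
    """
    expected = xor_blocks(disk_reads)
    mismatches = sum(1 for e, p in zip(expected, parity) if e != p)
    return expected, mismatches

good = [b"\x10\x20", b"\x0f\x0e", b"\xaa\xbb"]
parity = xor_blocks(good)  # valid parity before the check

# Hypothetical fault: disk 0 returns zeros instead of its real contents
bad_reads = [b"\x00\x00"] + good[1:]
new_parity, errors = correcting_check(bad_reads, parity)

# Emulating disk 0 from the "corrected" parity now yields the bad data
emulated = xor_blocks(good[1:] + [new_parity])
print(errors, emulated == good[0], emulated == b"\x00\x00")  # 2 False True
```

If the check had not written to parity, emulating disk 0 would still return the original bytes.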

 

Nevertheless, at this point, there's not much else you can do except hope that parity is in fact correct.

 

At this point, the issue could be (a) cables (re-seat them);  (b) power (not likely, since it was fine until now, but always possible);  or (c) a failing/failed disk.    The SMART test says it Passed; but shows a lot of read errors, which may simply be due to cable seating issues.

 

I'd reseat the cables;  boot again; and try a rebuild of Disk 1.    If it doesn't work with that disk, you'll need to replace it.

 


Just to reiterate, the errors column only counts READ errors, not write errors. When you click on parity check it reads the source drive, reads the parity disk, and compares them; in your situation it was unable to read the source disk and thus logged each failure as an error.

 

Write errors are shown as a disk having a red ball, which is what eventually happened with this disk, so it is now failing both reads and writes. Looking at your SMART report, you have a large number of pending sectors, which is not a good sign:-

 

197 Current_Pending_Sector  0x0032  196  196  000    Old_age  Always      -      1446

 

I've also noticed that the drive heads have emergency retracted 102 times because power loss was detected by the drive:-

 

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      102

 

You will need to check the power supply (and cabling); this could be an intermittent connection, a bad drive tray, a bad power supply splitter/connector, or just a power supply too small for the total number of drives. If power issues are ruled out then I would say remove the drive, replace it (with a pre-cleared drive) and let parity rebuild.
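If you want to pull those counters out of a saved report without reading it by eye, something like the following Python sketch could work. It assumes the usual ten-column attribute table that `smartctl -a` prints; the helper names and the choice of which attributes to watch are my own, picked to match the ones discussed in this thread.

```python
def parse_smart_attributes(text):
    """Map attribute name -> integer raw value for attribute-table rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # skip rows whose raw value is not a plain integer
    return attrs

def flag_nonzero(attrs, watched):
    """Return the watched attribute names whose raw value is above zero."""
    return sorted(name for name in watched if attrs.get(name, 0) > 0)

# The two lines quoted above from the disk 1 report
sample = """\
197 Current_Pending_Sector  0x0032  196  196  000    Old_age  Always      -      1446
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      102
"""

# Illustrative watch list, not an official set of failure criteria
WATCHED = ["Current_Pending_Sector", "Reallocated_Sector_Ct", "Power-Off_Retract_Count"]

attrs = parse_smart_attributes(sample)
print(attrs)
print(flag_nonzero(attrs, WATCHED))
```

To run it against a real report, read the file saved by the smartctl command earlier in the thread (e.g. `open("/boot/smart.txt").read()`) and pass that text in instead of `sample`.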

 

edit - did you do a parity check with the option "Correct any Parity-Check errors by writing the Parity disk with corrected parity." ticked when you saw the errors for disk1?


OK so it's starting to look like a straightforward disc failure.... I just still don't understand why disc1 didn't red ball when the errors began.

 

The current pending sectors are recent, as I had looked at SMART reports not too long ago and no discs had pending sectors.  The power off retract count does look high, but this is one of my oldest drives, and as well as power outages this wouldn't be the first time I have had issues with loose cables.

 

As for the parity check, I did not change any options.  Are you saying that by default the parity check does not write corrections to the parity drive?  If it doesn't, that is good news and hopefully my data is still all intact.

 

 


It seems I have bigger problems.  I purchased a new 3 TB HD to replace the failed 2 TB drive.  After a successful preclear I rebuilt the drive.  Things seemed fine right up until the drive finished rebuilding, and then all of a sudden drive 6 showed errors and was red balled.  I have no idea if this means the rebuilt data is full of errors????

 

An initial smart report resulted in the following:

 

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

 

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

 

It seems that the drive had gone 'offline'.  After a reboot a smart report results in:

 

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   205   165   021    Pre-fail  Always       -       4733
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1851
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16655
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       130
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       93
193 Load_Cycle_Count        0x0032   179   179   000    Old_age   Always       -       63201
194 Temperature_Celsius     0x0022   124   105   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       9

 

 

 

No sectors pending, but I have never seen Multi_Zone_Error_Rate errors before... what do they mean?

 

I reseated the cables, unassigned the disc, rebooted and began a rebuild, but almost straight away it red balled and went offline again.......

 

 

 

This is getting really frustrating, going on a week and I still can't use my server :(  Thanks for all your help so far, but I really need some more!

 

 


I think the issue is more related to power than the data cables.

 

According to this line in the SMART report, the disk heads have been retracted in response to power loss 93 times.  That suggests either a loose power connection, a bad power splitter, an overloaded power supply, or a bad drive tray (although it could also be the disk itself, I suppose).

192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      93

 

Joe L.


Prior to my current problems, which started when I upgraded to RC12a, I was running RC4 with Simple Features.  Sometimes the server became unresponsive and I had to reboot it.  At first this happened every few weeks, and by the end up to a couple of times a day, which led to me upgrading.  Could my restarting the server have caused the power off retract count?


Yes, that could be an alternative explanation, if you did it a lot.

As for the power supply, I have a Seasonic S12II 430W.  I am currently running 10 WD green discs with an Atom board + controller card.  This might sound like a small power supply, but according to the power supply thread it should have no problem powering this hardware, and I chose it so that it would still be efficient when idling.  If there is a power problem then I would suspect the SATA power splitters........

 

I really have no idea how to proceed

 

 


As Joe mentioned, the power retract count indicates a lot of head retractions due to power failures.  This count does NOT increment when you do a normal shutdown ... so simply restarting the server wouldn't cause it unless it was "hung" and you did a forced power-down (holding the power button or simply unplugging the PC).

 

While I agree your Seasonic SHOULD have sufficient power, the issue certainly seems like it may be power-related.  I've mentioned a couple of times to check all the cables; so I assume you've already done that.    At this point trying a "heftier" power supply would be a good idea.
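For a rough sanity check of whether a supply should cope, the arithmetic is just the worst case on the 12 V rail when every drive spins up at once. The Python sketch below shows that sum; only the drive count (10) comes from this thread. The per-drive spin-up current, the other 12 V load and the rail rating are placeholder numbers, not figures from any datasheet, so substitute the values printed on your own drives and PSU label.

```python
def spinup_headroom(n_drives, spinup_amps_each, other_12v_amps, rail_amps):
    """Remaining 12 V current (amps) if all drives spin up simultaneously.

    A negative result means the assumed rail rating would be exceeded.
    """
    peak = n_drives * spinup_amps_each + other_12v_amps
    return rail_amps - peak

# n_drives is from the post; every other value is a hypothetical placeholder
headroom = spinup_headroom(
    n_drives=10,
    spinup_amps_each=2.0,   # placeholder: check the drive datasheet
    other_12v_amps=3.0,     # placeholder: board, controller card, fans
    rail_amps=30.0,         # placeholder: check the PSU label
)
print(f"12 V headroom at spin-up: {headroom:.1f} A")  # 12 V headroom at spin-up: 7.0 A
```

A positive number on paper still says nothing about a loose connector or a bad splitter, so it complements rather than replaces checking the cabling.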

 

There IS a way to check the likelihood of this being a power issue without getting another PSU;  but it would require that you "run at risk" ... and you COULD lose data if any disk has actually failed.    That would be to simply remove your parity disk from the array; then create a smaller array configuration with just a few disks (e.g. 4 disks).  Then add the parity disk back to the array and compute parity (so now this small array is protected);  then run a parity check.    If the array is running fine at that point, your problem is almost certainly power.    But if anything goes wrong before you have a protected array, any failed disk can NOT be rebuilt ... so any data that can't be read from that disk is lost.

 

i.e. it's "safer" to simply buy a nice new 650W power supply :-)

 


It's been a while since I researched PSUs, but I do recall that a single 12 V rail was important.  Is this PSU a good replacement?

 

http://www.pccasegear.com/index.php?main_page=product_info&cPath=15_354&products_id=21713

 

No, I wouldn't buy a CX series unit.  The Corsair AX & HX units are superb;  the TX units are very good and are probably all you need to buy;  but the low-end CX units aren't nearly as good as their big brothers.

 

I'd buy a Corsair TX or HX, or a Seasonic M12 or X series unit.

 

~ 600 to 650W is a good range.

 

