Imba

Failing drive and ungodly long parity-sync


I just recently had to replace a dead flash drive for the server. Right now it's in parity-sync, but the estimated time has gone from about 1000 minutes to 316303.2 minutes. I noticed a drive has over 6k errors and needs to be replaced, but can I do this before the sync is done? Is there anything I can do, or do I have to wait it out?
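
For a sense of scale, the two estimates can be converted to hours and days. A quick Python sketch (the helper name is made up for illustration; the two numbers are the ones quoted in this post):

```python
def minutes_to_days(minutes: float) -> float:
    """Convert a duration in minutes to days (1440 minutes per day)."""
    return minutes / (60 * 24)

original_estimate = 1000.0    # minutes, as first reported by the sync
current_estimate = 316303.2   # minutes, as reported now

print(f"original: {original_estimate / 60:.1f} hours")
print(f"current:  {minutes_to_days(current_estimate):.1f} days")
```

In other words, the estimate went from well under a day to several months, which points at read errors slowing the sync rather than at a merely slow array.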

 

UnRaid: Ver 4.7

Unraid 4.7.png


Moved to Legacy Support.

 

How is it that you are just now coming to the forum with your first post and it's about a very very old version of unRAID? Many of us haven't worked with that version, and I just barely remember it myself.

 

Stop the parity sync until we can get a better idea of what the problem is.

 

Unfortunately, getting useful diagnostics from that old version is a lot more trouble than we have to go to on the latest versions.

 

We need the syslog and SMART report for that drive giving errors. Even better would be syslog and SMART report for all drives. If you were on V6 you could get all of this in a nice zip to post for us, but instead you will have to get each separately. If you want you could zip them yourself and then you would only need to attach one thing to your next post.

 

See here:

 

https://lime-technology.com/forums/topic/9277-how-to-report-a-defect-and-capture-syslog-and-smart-reports/

 

 


Be sure to read the first several posts at that link I gave so you know how to get syslog and SMART reports.

 

Did you have to go into the case to replace the flash drive? Sometimes people will disturb the disk connections if they open the case.

 

It would also be useful if you could tell us a little about your hardware. It would be nice if you can easily upgrade to V6 after we get this problem squared away.


Well I've never had a problem with 4.7 so I didn't think there was a reason to upgrade, although I have seen the nifty things that the new versions offer. Plus I thought you had to pay for another license. I've attached a zip file with all the smart reports for each drive and the system log.

 

As for the hardware:

MB: ASRock FM2A85X Extreme6

CPU: AMD A4-5300

Mem: 1GB DDR3-1066

Controllers: Adaptec 1430SA x 2

 

Is there anything else you need?

Many thanks!

 

 

 

HIVE Syslog and Smart Reports.zip


I'm still using my USB and license that I got when I had 4.7.  Upgrades are free and so far it doesn't look like that will change any time soon.  But I would probably pay for upgrades if I was stuck on a version that only supports 2TB drives like 4.7.  And then again when the VM manager and Docker were added.


You have multiple disks with issues. Unfortunately, the syslog has rotated and is only showing all the recent errors, but none of the old information that would make it possible for me to identify each disk by their assigned slot. I can see the serial numbers in the SMART though and that will be enough. You could perhaps get older syslogs from /var/log/syslog.1, /var/log/syslog.2, etc. but it's probably not necessary. The latest unRAID makes all this much easier.
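
If you do want to dig through the rotated logs, a small Python sketch like the following could search them once they are copied to a PC. It assumes the serial number appears somewhere in the log text, which may not be the case; the function name and the search pattern are illustrative only.

```python
import glob

def search_logs(pattern, needle):
    """Yield (path, line) for every line containing `needle` in files matching `pattern`."""
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            for line in handle:
                if needle in line:
                    yield path, line.rstrip("\n")

# Hypothetical usage, with the logs copied into a local "logs" folder:
#   for path, line in search_logs("logs/syslog*", "MN1220F326RLAD"):
#       print(f"{path}: {line}")
```

The glob pattern matches both the current `syslog` and the rotated `syslog.1`, `syslog.2`, and so on.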

 

Also, the latest unRAID helps you keep track of impending issues by notifying you immediately by email or other agent when, for example, a disk's SMART report begins to show problems. We may have problems saving all your data in this current state since you have multiple unreliable disks, and parity plus all other disks must be read reliably in order to rebuild any disk.
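
To make that last point concrete, here is a toy Python sketch of single-parity reconstruction, assuming plain XOR parity across the data disks. The byte values and helper name are made up for illustration, and this is not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each reduced to a 4-byte block.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0A, 0x0B, 0x0C, 0x0D])
disk3 = bytes([0xFF, 0x00, 0xFF, 0x00])

parity = xor_blocks([disk1, disk2, disk3])

# Rebuilding disk2 needs parity plus every other data disk.
rebuilt = xor_blocks([parity, disk1, disk3])
print(rebuilt == disk2)

# If disk3 returns one wrong byte during the rebuild,
# the rebuilt disk2 is wrong at that same position.
bad_disk3 = bytes([0xFF, 0x00, 0xFE, 0x00])
corrupt = xor_blocks([parity, disk1, bad_disk3])
print(corrupt == disk2)
```

The rebuild has no way to tell which input was wrong, so a bad read on any surviving disk ends up as bad data on the rebuilt one.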

 

The disk whose SMART report you labeled as having errors is actually FAILING NOW and must be replaced immediately. Unfortunately, you also have 2 other disks with pending sectors, so they can't really be trusted to accurately rebuild the failing disk. Those disks should also be replaced ASAP, but of course you can only rebuild one at a time, and the FAILING NOW disk takes priority. I guess we will have to start there and hope for the best. If you wind up with a corrupt rebuild, we can possibly repair the filesystem and save most things.

 

Why did you decide to do a parity sync anyway? That probably has corrupted parity somewhat, which of course also makes an accurate rebuild unlikely.

 

One last comment, you don't really have enough RAM for an upgrade to V6. I haven't checked the specs for the other hardware.

 

Let us know if you need more details about how to proceed with rebuilding the failing disk to a new disk.

 

Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN1220F326RLAD
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1975
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2392
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       31

Device Model:     WDC WD20EARS-22MVWB0
Serial Number:    WD-WCAZA3935061
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       17
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA5742681
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
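
The attribute rows above follow the usual `smartctl -A` column layout (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value). A minimal Python sketch for picking out the worrying rows, assuming that layout; the function name and the set of attributes to watch are illustrative choices, not an official rule:

```python
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

def flag_smart_lines(text):
    """Return (name, raw_value, when_failed) for watched attributes with a nonzero raw value."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, when_failed, raw = fields[1], fields[8], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged.append((name, int(raw), when_failed))
    return flagged

# The Hitachi rows quoted in this post.
sample = """\
Device Model:     Hitachi HDS723020BLA642
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1975
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2392
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       31
"""

for name, raw, when_failed in flag_smart_lines(sample):
    print(name, raw, when_failed)
```

A nonzero pending-sector count on its own is a warning sign rather than proof of failure, while `FAILING_NOW` in the when-failed column means the normalized value has dropped to or below the drive's own threshold.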

 


Another approach would be to create a new array with only the good disks, sync parity, then see if you can mount the bad disks outside the array (another thing that is much simpler in V6) and try to copy their contents. That has the advantage of getting the good disks protected, but it means you can't really rebuild any of the bad disks and will just have to hope you can read them well enough to get something off them.

 

Do you have any backups?


You should run an extended test on those WD disks with pending sectors. They can sometimes show false positives, i.e., the disks may be fine for now. The Hitachi is definitely failing.


Sigh, well if I replace the failing disk how would I go about it? Would it be the same steps to replace the other two drives?

 

Also, I didn't actually start the parity sync; it did it on its own when I replaced the USB and started it back up.


Parity isn't valid. The best way forward is to do as trurl suggested: update unRAID, then do a new config and copy the data from the failing disk(s).

4 hours ago, Imba said:

I didn't actually start the parity sync; it did it on its own when I replaced the USB and started it back up.

 

It must have seen super.dat that was copied without the array stopped and assumed unclean shutdown. Latest unRAID does a non-correcting parity check on unclean shutdown.

 

Can you add RAM? Maybe V6 with only the NAS functionality would work with 1GB, but it would be very tight.

 

Here is a link to the upgrading wiki:

 

https://lime-technology.com/wiki/index.php/Upgrading_to_UnRAID_v6


It was doing a parity sync, not a check, so something more serious happened, and with all the errors on disk2 it won't be valid anymore.


Older versions of unRAID were more interested in doing a corrective parity sync. Didn't all versions before version 6 default to having the 'correcting' checkbox set even if someone wanted to manually start a parity scan?

18 hours ago, pwm said:

Older versions of unRAID were more interested in doing a corrective parity sync.

I believe you mean check; a sync always writes.

 

18 hours ago, pwm said:

Didn't all versions before version 6 default to have the 'correcting' checkbox set even if someone wanted to manually start a parity scan?

It still does. You need to uncheck the "write corrections to parity" box before starting a non-correcting manual check, though newer releases do default to a non-correcting check after an unclean shutdown.


unRAID really should stay away from writing corrections unless the user more or less forces that operation. In case there is something wrong, the user should be given the full set of options of what steps to try to recover - which means the most recent parity must be left intact.


OK, so I'm confused as to what steps I should be taking. I can replace the failing hard drive and more than likely add more RAM, but I don't understand how to go about all this.

Upgrade before anything?

Replace drive first?

How to get info from the failing drives?

 

I'm sorry, unRAID seems to be beyond the limits of my usual comprehension.

On 5/18/2018 at 11:07 PM, trurl said:

Do you have any backups?

 

This is probably the first thing to consider. If you have any important and irreplaceable files that you don't have backed up then try to copy them from the server to your PC.


I don't have many overly important files on the server, but there are some things that I would like to save, of course. So should I just start the server again, stop the sync, and try to pull from the failing drive? Or do I have to pull files in general (e.g. from shares as opposed to the drive that is failing)?


Did you run an extended test on the drives with pending sectors to confirm if they are failing or not?

 

On 5/19/2018 at 7:41 AM, johnnie.black said:

You should run an extended test on those WD disks with pending sectors. They can sometimes show false positives, i.e., the disks may be fine for now. The Hitachi is definitely failing.

 


Hmm, how do I do that?


On the main page, click on the disk, scroll down to the Self Test section, then click Start on "SMART extended self-test".


Sigh, well it looks like I don't have that option.


I guess I'll have to use the command line, should I run the short or long test?


Long test - only that one will scan the surface.
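
For reference, a rough Python sketch of the two `smartctl` invocations involved: `-t long` starts the extended self-test and `-l selftest` prints the self-test log afterwards. The device path `/dev/sdX` is a placeholder and the helper names are made up. On the server console itself you would simply type the printed command lines; the Python wrapper is only useful on a machine that has Python installed.

```python
import shlex
import subprocess

def long_test_command(device):
    """Build the smartctl command that starts an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]

def selftest_log_command(device):
    """Build the smartctl command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]

def run(command):
    """Run a command and return its text output (needs root and smartmontools installed)."""
    return subprocess.run(command, capture_output=True, text=True, check=False).stdout

device = "/dev/sdX"  # placeholder: substitute the real device node for the disk

# Print the command lines rather than executing them.
print(shlex.join(long_test_command(device)))
print(shlex.join(selftest_log_command(device)))
```

The long test runs inside the drive in the background and can take several hours on a 2TB disk, so the log is worth checking only after the estimated completion time that `smartctl` reports when the test starts.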
