Failing drive and ungodly long parity-sync

Imba · May 18, 2018

I recently had to replace a dead flash drive for the server. It's now running a parity sync, but the estimated time has gone from about 1000 minutes to 316303.2 minutes. I noticed a drive has over 6k errors and needs to be replaced, but can I do that before the sync is done? Is there anything I can do, or do I have to wait it out?
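
For a sense of scale, the two estimates can be converted to days. This is only a small Python sketch of the arithmetic; the two minute figures are the ones quoted in the post, and everything else is illustrative.

```python
# Convert the parity-sync estimates quoted above from minutes to days.
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def minutes_to_days(minutes: float) -> float:
    """Return how many 24-hour days the given number of minutes spans."""
    return minutes / MINUTES_PER_HOUR / HOURS_PER_DAY


initial_estimate_min = 1000.0     # "about 1000 minutes" from the post
current_estimate_min = 316303.2   # estimate reported once the errors started

print(f"initial: {minutes_to_days(initial_estimate_min):.1f} days")
print(f"current: {minutes_to_days(current_estimate_min):.1f} days")
```

The second figure works out to roughly 220 days, which points to a disk that is barely readable rather than a sync that is merely slow.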

UnRaid: Ver 4.7

trurl · May 18, 2018

Moved to Legacy Support.

How is it that you are just now coming to the forum with your first post and it's about a very very old version of unRAID? Many of us haven't worked with that version, and I just barely remember it myself.

Stop the parity sync until we can get a better idea of what the problem is.

Unfortunately, getting useful diagnostics from that old version is a lot more trouble than we have to go to on the latest versions.

We need the syslog and SMART report for that drive giving errors. Even better would be syslog and SMART report for all drives. If you were on V6 you could get all of this in a nice zip to post for us, but instead you will have to get each separately. If you want you could zip them yourself and then you would only need to attach one thing to your next post.

See here:

https://lime-technology.com/forums/topic/9277-how-to-report-a-defect-and-capture-syslog-and-smart-reports/

trurl · May 18, 2018

Be sure to read the first several posts at that link I gave so you know how to get syslog and SMART reports.

Did you have to go into the case to replace the flash drive? Sometimes people will disturb the disk connections if they open the case.

It would also be useful if you could tell us a little about your hardware. It would be nice if you could easily upgrade to V6 after we get this problem squared away.

Imba · May 18, 2018

Well, I've never had a problem with 4.7, so I didn't think there was a reason to upgrade, although I have seen the nifty things that the new versions offer. Plus I thought you had to pay for another license. I've attached a zip file with the SMART reports for each drive and the system log.

As for the hardware:

MB: ASRock FM2A85X Extreme6

CPU: AMD A4-5300

Mem: 1GB DDR3-1066

Controllers: Adaptec 1430SA x 2

Is there anything else you need?

Many thanks!

HIVE Syslog and Smart Reports.zip

BobPhoenix · May 19, 2018

I'm still using my USB and license that I got when I had 4.7. Upgrades are free and so far it doesn't look like that will change any time soon. But I would probably pay for upgrades if I was stuck on a version that only supports 2TB drives like 4.7. And then again when the VM manager and Docker were added.

trurl · May 19, 2018

You have multiple disks with issues. Unfortunately, the syslog has rotated and is only showing the recent errors, with none of the older information that would let me identify each disk by its assigned slot. I can see the serial numbers in the SMART reports though, and that will be enough. You could perhaps get older syslogs from /var/log/syslog.1, /var/log/syslog.2, etc. but it's probably not necessary. The latest unRAID makes all this much easier.

The latest unRAID also helps you keep track of impending issues by notifying you immediately, by email or another agent, when for example a disk's SMART report begins to show problems. We may have trouble saving all your data in its current state, since you have multiple unreliable disks, and parity plus all other disks must be read reliably in order to rebuild any disk.
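
To illustrate why every other disk has to read reliably, here is a minimal Python sketch. It assumes single-disk parity computed as a byte-wise XOR across the data disks, which is how unRAID's parity is usually described; the disk contents and variable names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, one tiny block each (illustrative values).
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x0A, 0x0B, 0x0C, 0x0D])
disk3 = bytes([0xFF, 0x00, 0xFF, 0x00])

parity = xor_blocks([disk1, disk2, disk3])

# Rebuild disk2 from parity plus every other data disk.
rebuilt_ok = xor_blocks([parity, disk1, disk3])

# Same rebuild, but disk3 hands back one wrong byte, standing in for a
# sector that can no longer be read correctly.
disk3_bad = bytes([0xFF, 0x00, 0xFE, 0x00])
rebuilt_bad = xor_blocks([parity, disk1, disk3_bad])

print(rebuilt_ok == disk2)    # True
print(rebuilt_bad == disk2)   # False
```

One bad byte on any surviving disk lands in the same position on the rebuilt disk, so a second unreliable disk undermines the rebuild of the first.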

The disk whose SMART report you labeled as having errors is actually FAILING NOW and must be replaced immediately. Unfortunately, you also have 2 other disks with pending sectors, so they can't really be trusted to accurately rebuild the failing disk. Those disks should be replaced too, ASAP, but of course you can only rebuild one at a time and the FAILING NOW disk takes priority. I guess we will have to start there and hope for the best. If you wind up with a corrupt rebuild, we can possibly repair the filesystem and save most things.

Why did you decide to do a parity sync anyway? That probably has corrupted parity somewhat, which of course also makes an accurate rebuild unlikely.

One last comment, you don't really have enough RAM for an upgrade to V6. I haven't checked the specs for the other hardware.

Let us know if you need more details about how to proceed with rebuilding the failing disk to a new disk.

Device Model:     Hitachi HDS723020BLA642
Serial Number:    MN1220F326RLAD
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1975
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2392
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       31

Device Model:     WDC WD20EARS-22MVWB0
Serial Number:    WD-WCAZA3935061
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       1
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       17
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1

Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA5742681
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
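
The way these rows are read can be sketched in Python. This is an illustrative parser, not an unRAID or smartmontools tool: it assumes the usual ten-column smartctl attribute layout shown above, and the function names and the set of watched attribute IDs are my own choices. The sample rows are the Hitachi's.

```python
# Attribute rows copied from the Hitachi report above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1975
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2392
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       31
"""

# IDs whose raw counts are worth flagging when non-zero (an assumption
# made for this sketch): reallocated, pending, offline uncorrectable.
WATCHED_IDS = {5, 197, 198}


def parse_smart_rows(text):
    """Split smartctl attribute rows into dicts, skipping anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or unrelated text
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9],
        })
    return rows


def smart_warnings(rows):
    """Return one human-readable note per attribute that looks bad."""
    notes = []
    for row in rows:
        if row["when_failed"] == "FAILING_NOW":
            notes.append(
                f"{row['name']}: FAILING_NOW "
                f"(value {row['value']} <= threshold {row['thresh']})"
            )
        elif row["id"] in WATCHED_IDS and int(row["raw"]) > 0:
            notes.append(f"{row['name']}: raw count {row['raw']}")
    return notes


for note in smart_warnings(parse_smart_rows(SMART_ROWS)):
    print(note)
```

The FAILING_NOW flag means the normalized value has dropped to or below the drive's own threshold, which is a stronger signal than a non-zero raw count on its own.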

trurl · May 19, 2018

Another approach would be to create a new array with only the good disks, sync parity, then see if you can mount the bad disks outside the array (another thing that is much simpler in V6) and try to copy their contents. That has the advantage of getting the good disks protected, but it means you can't really rebuild any of the bad disks and will just have to hope you can read them well enough to get something off them.

Do you have any backups?

JorgeB · May 19, 2018

You should run an extended test on those WD disks with pending sectors. They can sometimes show false positives, i.e., the disks may be fine for now. The Hitachi is definitely failing.

Imba · May 19, 2018

Sigh, well if I replace the failing disk how would I go about it? Would it be the same steps to replace the other two drives?

Also, I didn't actually start the parity sync. It started on its own when I replaced the USB and started the server back up.

JorgeB · May 19, 2018

Parity isn't valid. The best way forward is to do as trurl suggested: update unRAID, then do a new config and copy the data from the failing disk(s).

trurl · May 19, 2018

4 hours ago, Imba said:

I didn't actually start the parity sync. It started on its own when I replaced the USB and started the server back up.

It must have seen super.dat that was copied without the array stopped and assumed unclean shutdown. Latest unRAID does a non-correcting parity check on unclean shutdown.

Can you add RAM? Maybe V6 with only the NAS functionality would work with 1GB, but it would be very tight.

Here is a link to the upgrading wiki:

https://lime-technology.com/wiki/index.php/Upgrading_to_UnRAID_v6

JorgeB · May 19, 2018

It was doing a parity sync, not a check, so something more serious happened, and with all the errors on disk2 parity won't be valid anymore.

pwm · May 19, 2018

Older versions of unRAID were more inclined to do a corrective parity sync. Didn't all versions before version 6 default to having the 'correcting' checkbox set even if someone wanted to manually start a parity scan?

JorgeB · May 20, 2018

18 hours ago, pwm said:

Older versions of unRAID were more inclined to do a corrective parity sync.

I believe you mean check; a sync always writes.

18 hours ago, pwm said:

Didn't all versions before version 6 default to having the 'correcting' checkbox set even if someone wanted to manually start a parity scan?

It still does. You need to uncheck the "write corrections to parity" box before starting a manual non-correcting check, though on newer releases it does default to non-correcting after an unclean shutdown.

pwm · May 20, 2018

unRAID really should stay away from writing corrections unless the user more or less forces that operation. In case there is something wrong, the user should be given the full set of options of what steps to try to recover - which means the most recent parity must be left intact.
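
A toy Python model of that argument, assuming XOR parity over two data disks. The `parity_check` function and its `correct` flag are hypothetical stand-ins for the "write corrections to parity" option, not unRAID code.

```python
def xor_parity(blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def parity_check(data_reads, parity, correct):
    """Compare stored parity with parity computed from what the disks return.

    Returns (mismatch_count, parity_after_check). With correct=True the
    stored parity is replaced by the computed one, as a correcting check would.
    """
    computed = xor_parity(data_reads)
    mismatches = sum(1 for a, b in zip(computed, parity) if a != b)
    return mismatches, (computed if correct else parity)


good_disk1 = bytes([1, 2, 3, 4])
good_disk2 = bytes([5, 6, 7, 8])
parity = xor_parity([good_disk1, good_disk2])

# During the check, disk2 returns one wrong byte.
bad_read_disk2 = bytes([5, 6, 0, 8])

errors, kept = parity_check([good_disk1, bad_read_disk2], parity, correct=False)
_, overwritten = parity_check([good_disk1, bad_read_disk2], parity, correct=True)

# Try to rebuild disk2 from disk1 plus whichever parity survived the check.
print(errors)                                                # 1
print(xor_parity([good_disk1, kept]) == good_disk2)          # True
print(xor_parity([good_disk1, overwritten]) == good_disk2)   # False
```

In this model the non-correcting check reports the mismatch and leaves parity usable for a rebuild, while the correcting check bakes the bad read into parity.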

Imba · May 20, 2018

Ok so I'm confused as to what steps I should be taking, I can replace the failing hard drive and more than likely add more RAM. But I don't understand how to go about all this.

Upgrade before anything?

Replace drive first?

How to get info from the failing drives?

I'm sorry UnRAID seems to be beyond the limits of my usual comprehension.

trurl · May 21, 2018

On 5/18/2018 at 11:07 PM, trurl said:

Do you have any backups?

This is probably the first thing to consider. If you have any important and irreplaceable files that you don't have backed up then try to copy them from the server to your PC.

Imba · May 29, 2018

I don't have many overly important files on the server, but there are some things that I would like to save, of course. So should I just start the server again, stop the sync, and try to pull files from the failing drive? Or do I have to pull files in general (e.g. from shares as opposed to the drive that is failing)?

JorgeB · May 29, 2018

Did you run an extended test on the drives with pending sectors to confirm if they are failing or not?

On 5/19/2018 at 7:41 AM, johnnie.black said:

You should run an extended test on those WD disks with pending sectors. They can sometimes show false positives, i.e., the disks may be fine for now. The Hitachi is definitely failing.

Imba · May 29, 2018

Hmm, how do I do that?

JorgeB · May 29, 2018

On the Main page, click on the disk, scroll down to the Self-Test section, then click Start on "SMART extended self-test".

Imba · June 7, 2018

Sigh, well it looks like I don't have that option.

trurl · June 7, 2018

52 minutes ago, Imba said:

Sigh, well it looks like I don't have that option.

I think you would have to do that from the command line in V5. Or if you have unMenu maybe it would have something for this.

I found this by searching the wiki:

https://lime-technology.com/wiki/Console_commands_for_hard_drives

Imba · June 7, 2018

I guess I'll have to use the command line, should I run the short or long test?

pwm · June 7, 2018

Long test - only that one will scan the surface.
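
For reference, the long test maps onto smartctl roughly as follows. This Python sketch only assembles and prints the commands; `/dev/sdX` is a placeholder for the real device node, and the helper name is made up. A stock unRAID 4.7 console is unlikely to have Python, so in practice the three printed commands would just be typed at the console.

```python
def smart_commands(device):
    """Return smartctl invocations for starting and reviewing a long self-test."""
    return {
        # Kick off the extended (long) self-test; it runs inside the drive.
        "start_long_test": ["smartctl", "-t", "long", device],
        # Show the self-test log to see progress and the final result.
        "selftest_log": ["smartctl", "-l", "selftest", device],
        # Full SMART report, including the attribute table.
        "full_report": ["smartctl", "-a", device],
    }


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute the device being tested.
    for name, command in smart_commands("/dev/sdX").items():
        print(f"{name}: {' '.join(command)}")
```

The test can take several hours on a 2TB drive, and the drive should stay spun up until the self-test log shows a result.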
