Errors on disk 1 + sync errors during parity scan

DanTheMan827 · November 18, 2016

So, disk 1 in my array appears to be toast... that's a given...

The thing is, I have scheduled parity scans and it has detected and written 30 sync errors...

Does that mean that at least part of the parity data is no longer valid?

I know the best option now would be to get a replacement drive ASAP but I'm not sure what to do at this point...

The errors are pending / reallocated sectors so yeah...

The parity sync is still running but at this point it's past the 3TB of the failing drive...

Of course this couldn't have happened NEXT week... that would have been too convenient...

I should also mention, I don't have a spare drive on hand...

JorgeB · November 18, 2016

The thing is, I have scheduled parity scans and it has detected and written 30 sync errors...

Does that mean that at least part of the parity data is no longer valid?

If you did a correcting parity check, then yeah, there will probably be some corrupt files on the rebuilt disk. This is why my scheduled parity checks are always non-correcting.

JonathanM · November 18, 2016

If you did a correcting parity check, then yeah, there will probably be some corrupt files on the rebuilt disk. This is why my scheduled parity checks are always non-correcting.

SHHHH!! Don't let Tom hear you say that.

DanTheMan827 · November 18, 2016

When it says 30 sync errors, does that mean that 30 sectors were invalid or is it some other kind of size?

Is there any way to find out what files would be affected?

Well... the good thing is that I've been running CrashPlan so all of my most important stuff is backed up...

The only things I might lose are backups of another functional computer and some other non-critical stuff... (probably a few ripped movies...)

If I try copying a corrupted file from the disk share will unraid report a disk read error or attempt to reconstruct from the (probably invalid) parity?

Is there any way to find out what files have the bad sectors? (probably not but it doesn't hurt to ask...)

JorgeB · November 18, 2016

SHHHH!! Don't let Tom hear you say that.

Yes, according to Tom this should not happen, but my comments are based on a test I made where I believe I proved this can happen.

JorgeB · November 18, 2016

Is there any way to find out what files would be affected?

Only after the rebuild and if you have checksums.
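For later readers: the checksums have to exist before the disk fails, otherwise there is nothing to compare the rebuilt files against. A minimal sketch of building such a manifest, assuming Python 3 on a machine that can see the disk share. The names `build_manifest` and `changed_files` are made up for illustration and are not part of unRAID or any plugin.

```python
import hashlib
import os
import sys

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large media files do not fill RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest

def changed_files(before, after):
    """Paths present in both manifests whose digests differ."""
    return sorted(p for p in before if p in after and before[p] != after[p])

if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (hypothetical script name): python manifest.py /path/to/share
    for path, digest in sorted(build_manifest(sys.argv[1]).items()):
        print(f"{digest}  {path}")
```

Run once while the disk is healthy and again after the rebuild; `changed_files` then lists the files whose contents no longer match.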

DanTheMan827 · November 18, 2016

Couple more questions... (I don't have checksums... :'()

I know it'd be risky but could I pull drive 1 and continue to run the array unprotected? (to avoid any more damage to the parity)

The way I see it is I could do it one of two ways...

Scan the drive in another computer to find unreadable files, or rebuild the array using whatever parity data is available and then copy the contents off the old drive while skipping unreadable files (to let me know what is corrupt)

Am I right in thinking this or is there some other way to go about it?

Also, any tools I should use? linux isn't really my forte so I'm learning as I go...

Also, does unraid log the position that was synced during the check?

Couldn't that info be used to find which file(s) are currently occupying that region of the disk (from another computer)

In theory everything that wasn't synced to the parity should still be valid right?

JorgeB · November 18, 2016

Parity won't be more damaged, just don't run another check and replace that disk as soon as possible, you can then compare files between the rebuilt and the old disk.

Also, with some files, like movies, a little corruption may not even be noticeable, just a little glitch during playback.

DanTheMan827 · November 18, 2016

So I could continue to use the array normally until the replacement?

Or should I drop the failed disk out in the meantime?

How would I compare the data between the drive and array when I replace it? recursive md5sum?

Would that work if it was unable to read the data due to bad sectors?

Sorry if I'm sounding very "noob" right about now... I just don't want to risk a 3TB drive worth of data that is almost full...

And where do I change parity sync on power loss to parity check only? that's actually what it's running from... normally I had a sync run Monday at 4AM... (this is the first time one has ever written anything)

I guess once everything is normal again I should probably re-format my disks one-by-one to btrfs (for the hashing) huh... or is there some way to convert XFS?

DanTheMan827 · November 18, 2016

You know... I just realized... it says "Parity-Check in progress" does this mean that it didn't actually write any parity?

Is that the default on unclean shutdown, to check instead of sync?

Would it say "Parity-Sync" if it were writing new parity?

JorgeB · November 18, 2016

So I could continue to use the array normally until the replacement?

Avoid using disk1, reads or writes, or it may get disabled.

Or should I drop the failed disk out in the meantime?

If you need to access disk1 contents probably best.

How would I compare the data between the drive and array when I replace it? recursive md5sum?

With a binary file compare utility, you can also copy all the files you can from the old disk, all files successfully copied should be OK, you should be able to copy most of them.

Would that work if it was unable to read the data due to bad sectors?

For those you'd use the rebuilt files, hopefully there are only a few, and those would be the only suspect ones.

And where do I change parity sync on power loss to parity check only? that's actually what it's running from... normally I had a sync run Monday at 4AM... (this is the first time one has ever written anything)

You can't, only the scheduled check can be changed. If this check was after an unclean shutdown, it's possible that some or all of the sync errors are from that. Upload your diagnostics if you haven't rebooted yet.

I guess once everything is normal again I should probably re-format my disks one-by-one to btrfs (for the hashing) huh... or is there some way to convert XFS?

I prefer XFS with the checksum plugin.
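The binary compare suggested above can be sketched roughly as follows, assuming the old disk is mounted read-only somewhere and the rebuilt disk is reachable as a second directory tree. The function names are illustrative only. A read that hits a bad sector raises `OSError` in Python, so those files end up in the third list instead of aborting the whole run; files missing from the old disk land there too.

```python
import os

def files_equal(path_a, path_b, chunk_size=1024 * 1024):
    """Byte-for-byte comparison of two files, read in chunks."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True

def compare_trees(old_root, new_root):
    """Compare each file under new_root with its counterpart under old_root.

    Returns (identical, different, unreadable) as sorted lists of
    relative paths. "unreadable" covers read errors and missing files.
    """
    identical, different, unreadable = [], [], []
    for folder, _dirs, files in os.walk(new_root):
        for name in files:
            new_path = os.path.join(folder, name)
            rel = os.path.relpath(new_path, new_root)
            try:
                same = files_equal(os.path.join(old_root, rel), new_path)
            except OSError:
                unreadable.append(rel)
                continue
            (identical if same else different).append(rel)
    return sorted(identical), sorted(different), sorted(unreadable)
```

Following the advice above, anything in the "unreadable" list is where the rebuilt copy has to be used, and those are the suspect files.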

trurl · November 18, 2016

I prefer to use the term "parity sync" to mean a complete parity build, which just calculates parity from all the data disks and writes it without checking it, and the term "parity check" to mean comparing existing parity with the calculated parity. A correcting parity check will check and rewrite any parity that doesn't match the calculation. You can make the monthly parity check non-correcting, but I think the parity check you get from an unclean shutdown is always a correcting one.
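To make the three terms concrete, here is a toy model of single XOR parity in Python. It is only a sketch: each "sector" is one integer, and the failing disk is modelled as returning wrong data. Whether real unRAID behaves this way on read errors is exactly what is debated in this thread, so treat the last part as an illustration of the concern, not of the actual driver.

```python
from functools import reduce

def parity_sync(data_disks):
    """Build parity from scratch: XOR of the same sector on every disk."""
    return [reduce(lambda a, b: a ^ b, column) for column in zip(*data_disks)]

def parity_check(data_disks, parity, correct=False):
    """Count sectors where stored parity differs from calculated parity.

    With correct=True the stored parity is overwritten on mismatch.
    """
    errors = 0
    for i, expected in enumerate(parity_sync(data_disks)):
        if parity[i] != expected:
            errors += 1
            if correct:
                parity[i] = expected
    return errors

def rebuild(other_disks, parity):
    """Reconstruct a missing disk from parity plus the remaining disks."""
    return parity_sync(other_disks + [parity])

disk1, disk2 = [1, 2, 3], [4, 5, 6]
parity = parity_sync([disk1, disk2])

bad_read = [1, 99, 3]  # disk1 as seen through a bad sector (toy assumption)

safe = list(parity)
parity_check([bad_read, disk2], safe)                # non-correcting
print(rebuild([disk2], safe) == disk1)               # True: parity untouched

risky = list(parity)
parity_check([bad_read, disk2], risky, correct=True)  # correcting
print(rebuild([disk2], risky) == disk1)               # False: bad data baked in
```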

DanTheMan827 · November 18, 2016

I prefer to use the term "parity sync" to mean a complete parity build, which just calculates parity from all the data disks and writes it without checking it, and the term "parity check" to mean comparing existing parity with the calculated parity. A correcting parity check will check and rewrite any parity that doesn't match the calculation. You can make the monthly parity check non-correcting, but I think the parity check you get from an unclean shutdown is always a correcting one.

I'm really hoping it was a non-correcting one...

In my push notifications the scheduled sync says "Parity sync: started" where this one it said "Parity check started" so hopefully it's non-correcting...

So I could continue to use the array normally until the replacement?

Avoid using disk1, reads or writes, or it may get disabled.

Or should I drop the failed disk out in the meantime?

If you need to access disk1 contents probably best.

How would I compare the data between the drive and array when I replace it? recursive md5sum?

With a binary file compare utility, you can also copy all the files you can from the old disk, all files successfully copied should be OK, you should be able to copy most of them.

Would that work if it was unable to read the data due to bad sectors?

For those you'd use the rebuilt files, hopefully there are only a few, and those would be the only suspect ones.

And where do I change parity sync on power loss to parity check only? that's actually what it's running from... normally I had a sync run Monday at 4AM... (this is the first time one has ever written anything)

You can't, only the scheduled check can be changed. If this check was after an unclean shutdown, it's possible that some or all of the sync errors are from that. Upload your diagnostics if you haven't rebooted yet.

I guess once everything is normal again I should probably re-format my disks one-by-one to btrfs (for the hashing) huh... or is there some way to convert XFS?

I prefer XFS with the checksum plugin.

Yeah, I kind of need disk1 since that has the docker data... (cache is only used for a VM)

It was caused by a host hard-lock when passing through an expansion PCIe USB 3 controller to the VM (not the unraid controller)

Is the power loss parity scan just a check instead of a sync? if that's the case I should be fine since the drive hadn't reported any errors during my last scheduled sync operation

I'm pretty sure the drive is toast though... http://i.imgur.com/ev1Xq1Z.png

JorgeB · November 18, 2016

Post your diagnostics

DanTheMan827 · November 18, 2016

Post your diagnostics

Really hope I can just buy a replacement drive and drop it into the array...

partially corrupted parity would not be nice...

unraid-diagnostics-20161118-1520_anonymized.zip

JorgeB · November 18, 2016

Well, this is interesting, it's running a non-correcting check after an unclean shutdown:

Nov 18 00:36:32 unraid kernel: mdcmd (42): check nocorrect

I believe this is new behavior, lucky you, because the sync errors are from the disk1 read errors.
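For anyone wanting to look for the same line in their own syslog, a small sketch in Python. It assumes the line format shown above; only the `check nocorrect` form appears in this thread, so the sketch just reports whatever arguments were logged and does not guess what a correcting check looks like. `check_commands` is an illustrative name.

```python
import re

# Matches lines like:
#   Nov 18 00:36:32 unraid kernel: mdcmd (42): check nocorrect
MDCMD = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+kernel: "
    r"mdcmd \(\d+\): (?P<cmd>\S+)\s*(?P<args>.*)$"
)

def check_commands(lines):
    """Return (timestamp, [arguments]) for every logged 'check' command."""
    found = []
    for line in lines:
        match = MDCMD.match(line.strip())
        if match and match.group("cmd") == "check":
            found.append((match.group("when"), match.group("args").split()))
    return found

sample = ["Nov 18 00:36:32 unraid kernel: mdcmd (42): check nocorrect"]
for when, args in check_commands(sample):
    mode = "non-correcting" if "nocorrect" in args else f"see arguments {args}"
    print(when, mode)
```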

DanTheMan827 · November 18, 2016

Well, this is interesting, it's running a non-correcting check after an unclean shutdown:
Nov 18 00:36:32 unraid kernel: mdcmd (42): check nocorrect
I believe this is new behavior, lucky you, because the sync errors are from the disk1 read errors.

definitely lucky!

I do have re-construct write on though... how would that work if a file were written to any drive position that shared the same as the bad sectors on disk1?

Would it drop the drive with the read error or something else?

Looks like a file checksum may still be in my future though even after I replace the disk...

JorgeB · November 18, 2016

If this was a correcting check, parity would be corrupted, eg:

Nov 18 11:37:25 unraid kernel: md: disk1 read error, sector=5481664448
Nov 18 11:37:25 unraid kernel: md: disk1 read error, sector=5481664456
Nov 18 11:37:25 unraid kernel: md: disk1 read error, sector=5481664464
Nov 18 11:37:25 unraid kernel: md: disk1 read error, sector=5481664472
Nov 18 11:37:39 unraid kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 18 11:37:39 unraid kernel: ata2.00: irq_stat 0x40000001
Nov 18 11:37:39 unraid kernel: ata2.00: failed command: READ DMA EXT
Nov 18 11:37:39 unraid kernel: ata2.00: cmd 25/00:70:20:90:bb/00:02:46:01:00/e0 tag 4 dma 319488 in
Nov 18 11:37:39 unraid kernel: res 51/40:38:58:90:bb/00:02:46:01:00/06 Emask 0x9 (media error)
Nov 18 11:37:39 unraid kernel: ata2.00: status: { DRDY ERR }
Nov 18 11:37:39 unraid kernel: ata2.00: error: { UNC }
Nov 18 11:37:39 unraid kernel: ata2.00: configured for UDMA/133
Nov 18 11:37:39 unraid kernel: sd 4:0:0:0: [sdc] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Nov 18 11:37:39 unraid kernel: sd 4:0:0:0: [sdc] tag#4 Sense Key : 0x3 [current] [descriptor]
Nov 18 11:37:39 unraid kernel: sd 4:0:0:0: [sdc] tag#4 ASC=0x11 ASCQ=0x4
Nov 18 11:37:39 unraid kernel: sd 4:0:0:0: [sdc] tag#4 CDB: opcode=0x88 88 00 00 00 00 01 46 bb 90 20 00 00 02 70 00 00
Nov 18 11:37:39 unraid kernel: blk_update_request: I/O error, dev sdc, sector 5481664600
Nov 18 11:37:39 unraid kernel: md: recovery thread: P incorrect, sector=5481664528
Nov 18 11:37:39 unraid kernel: md: disk1 read error, sector=5481664536
Nov 18 11:37:39 unraid kernel: md: disk1 read error, sector=5481664544
Nov 18 11:37:39 unraid kernel: ata2: EH complete
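Log lines like these also answer the earlier question about whether the position is recorded. A rough Python sketch for pulling the sectors out of a syslog and checking whether each parity mismatch sits close to a read error. It assumes the two message formats shown above and 512-byte sector units. The 64-sector window is an arbitrary choice for illustration, not something unRAID defines.

```python
import re

READ_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")
PARITY_MISMATCH = re.compile(r"md: recovery thread: P incorrect, sector=(\d+)")
SECTOR_BYTES = 512  # assumption: sectors are logged in 512-byte units

def summarize(lines):
    """Collect read-error sectors per disk and parity-mismatch sectors."""
    read_errors = {}  # disk name -> list of sector numbers
    mismatches = []
    for line in lines:
        match = READ_ERROR.search(line)
        if match:
            read_errors.setdefault(match.group(1), []).append(int(match.group(2)))
            continue
        match = PARITY_MISMATCH.search(line)
        if match:
            mismatches.append(int(match.group(1)))
    return read_errors, mismatches

def near_read_error(sector, error_sectors, window=64):
    """True if a mismatch lies within `window` sectors of any read error."""
    return any(abs(sector - s) <= window for s in error_sectors)

def report(lines):
    """Print each parity mismatch with its byte offset and proximity flag."""
    read_errors, mismatches = summarize(lines)
    all_errors = [s for sectors in read_errors.values() for s in sectors]
    for sector in mismatches:
        offset = sector * SECTOR_BYTES
        flag = near_read_error(sector, all_errors)
        print(f"P incorrect at sector {sector} (byte {offset}), near read error: {flag}")
```

Feeding `report` the lines of the syslog from the diagnostics zip would show how many of the 30 sync errors line up with disk1 read errors. Mapping a byte offset back to a file needs filesystem-specific tools and is not attempted here.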

JorgeB · November 18, 2016

Cancel the parity check, there's no point in continuing, and replace that disk ASAP.

DanTheMan827 · November 18, 2016

Cancel the parity check, there's no point in continuing, and replace that disk ASAP.

I plan on picking up a replacement from best buy after work but in the meantime should I stop the array and remove the failed disk from it through the web interface?

That would still let the array be used but be very slow correct?

JorgeB · November 18, 2016

With a single parity I would shut it down, but if you want to keep using it and read/write from disk1 then it's probably best to disable it, as long as all other disks are OK it will be slower and unprotected against another failure but usable.

Squid · November 18, 2016

Well, this is interesting, it's running a non-correcting check after an unclean shutdown:
Nov 18 00:36:32 unraid kernel: mdcmd (42): check nocorrect
I believe this is new behavior, lucky you, because the sync errors are from the disk1 read errors.

I asked Tom about this a long time ago, and he stated that regardless of the correct / nocorrect option, any read error is going to wind up automatically attempting to recreate that particular bit on the offending disk. Failure to recreate winds up dropping the disk.

DanTheMan827 · November 18, 2016

With a single parity I would shut it down, but if you want to keep using it and read/write from disk1 then it's probably best to disable it, as long as all other disks are OK it will be slower and unprotected against another failure but usable.

To rebuild, would I then just swap the drives and then assign the replacement?

JorgeB · November 18, 2016

Well, this is interesting, it's running a non-correcting check after an unclean shutdown:
Nov 18 00:36:32 unraid kernel: mdcmd (42): check nocorrect
I believe this is new behavior, lucky you, because the sync errors are from the disk1 read errors.
I asked Tom about this a long time ago, and he stated that regardless of the correct / nocorrect option, any read error is going to wind up automatically attempting to recreate that particular bit on the offending disk. Failure to recreate winds up dropping the disk.

Yes, but the issue here is whether parity is being incorrectly updated, and it isn't, because this is a non-correcting check.

JorgeB · November 18, 2016

With a single parity I would shut it down, but if you want to keep using it and read/write from disk1 then it's probably best to disable it, as long as all other disks are OK it will be slower and unprotected against another failure but usable.

To rebuild, would I then just swap the drives and then assign the replacement?

Yes
