Monthly 5 Parity Errors



Hi All...

 

Ver: 6.1.9

 

Each month I have the UnRAID server run a check-and-correct on parity. It seems that every month it finds 5 errors and fixes them. If I run another check straight afterwards there are no errors, so the fix appears to stick. But then the next month, 5 errors again.

 

Anyway, here is the part of my syslog that shows the 5 errors during the check. They all seem to involve ATA7 (?).

 

Can someone please advise on what I should look at, based on what the log is saying?

 

5 data drives and a cache drive (and parity of course). One add-on PCIe SATA controller. Server diagnostics ZIP attached if needed.

 

Thank you.

 

Oct  1 00:00:01 Server logger: mover started
Oct  1 00:00:01 Server kernel: mdcmd (41): check 
Oct  1 00:00:01 Server kernel: md: recovery thread woken up ...
Oct  1 00:00:01 Server kernel: md: recovery thread checking parity...
Oct  1 00:01:21 Server sSMTP[28530]: SSL connection using ECDHE-RSA-AES256-GCM-SHA384
Oct  1 00:01:23 Server sSMTP[28530]: Authorization failed (535 Incorrect authentication data)
Oct  1 00:23:22 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  1 00:23:22 Server kernel: ata7.00: failed command: SMART
Oct  1 00:23:22 Server kernel: ata7.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 1 pio 512 in
Oct  1 00:23:22 Server kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  1 00:23:22 Server kernel: ata7.00: status: { DRDY }
Oct  1 00:23:22 Server kernel: ata7: hard resetting link
Oct  1 00:23:22 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  1 00:23:27 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct  1 00:23:27 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  1 00:23:27 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct  1 00:23:27 Server kernel: ata7: hard resetting link
Oct  1 00:23:27 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  1 00:23:27 Server kernel: ata7.00: configured for UDMA/133
Oct  1 00:23:27 Server kernel: ata7: EH complete
Oct  1 00:38:45 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  1 00:38:45 Server kernel: ata7.00: failed command: IDENTIFY DEVICE
Oct  1 00:38:45 Server kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 6 pio 512 in
Oct  1 00:38:45 Server kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  1 00:38:45 Server kernel: ata7.00: status: { DRDY }
Oct  1 00:38:45 Server kernel: ata7: hard resetting link
Oct  1 00:38:45 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  1 00:38:50 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct  1 00:38:50 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  1 00:38:50 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct  1 00:38:50 Server kernel: ata7: hard resetting link
Oct  1 00:38:50 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  1 00:38:50 Server kernel: ata7.00: configured for UDMA/133
Oct  1 00:38:50 Server kernel: ata7: EH complete
Oct  1 00:54:22 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  1 00:54:22 Server kernel: ata7.00: failed command: IDENTIFY DEVICE
Oct  1 00:54:22 Server kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 21 pio 512 in
Oct  1 00:54:22 Server kernel:         res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)
Oct  1 00:54:22 Server kernel: ata7.00: status: { DRDY }
Oct  1 00:54:22 Server kernel: ata7: hard resetting link
Oct  1 00:54:22 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  1 00:54:27 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct  1 00:54:27 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  1 00:54:27 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct  1 00:54:27 Server kernel: ata7: hard resetting link
Oct  1 00:54:27 Server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  1 00:54:37 Server kernel: ata7.00: qc timeout (cmd 0xec)
Oct  1 00:54:37 Server kernel: ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  1 00:54:37 Server kernel: ata7.00: revalidation failed (errno=-5)
Oct  1 00:54:37 Server kernel: ata7: limiting SATA link speed to 3.0 Gbps
Oct  1 00:54:37 Server kernel: ata7.00: limiting speed to UDMA/133:PIO3
Oct  1 00:54:37 Server kernel: ata7: hard resetting link
Oct  1 00:54:38 Server kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct  1 00:54:38 Server kernel: ata7.00: configured for UDMA/133
Oct  1 00:54:38 Server kernel: ata7: EH complete
Oct  1 03:26:23 Server kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  1 03:26:23 Server kernel: ata7.00: failed command: IDENTIFY DEVICE
Oct  1 03:26:23 Server kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 30 pio 512 in
Oct  1 03:26:23 Server kernel:         res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 (timeout)
Oct  1 03:26:23 Server kernel: ata7.00: status: { DRDY }
Oct  1 03:26:23 Server kernel: ata7: hard resetting link
Oct  1 03:26:23 Server kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Oct  1 03:26:23 Server kernel: ata7.00: configured for UDMA/133
Oct  1 03:26:23 Server kernel: ata7: EH complete
Oct  1 03:37:36 Server kernel: md: correcting parity, sector=3519069768
Oct  1 03:37:36 Server kernel: md: correcting parity, sector=3519069776
Oct  1 03:37:36 Server kernel: md: correcting parity, sector=3519069784
Oct  1 03:37:36 Server kernel: md: correcting parity, sector=3519069792
Oct  1 03:37:36 Server kernel: md: correcting parity, sector=3519069800

 

PCI Devices...

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation H77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)

SCSI Devices...

[0:0:0:0]    disk    Sony     Storage Media    0100  /dev/sda 
[2:0:0:0]    disk    ATA      ST4000VN000-1H41 SC43  /dev/sdb 
[3:0:0:0]    disk    ATA      ST4000VN000-1H41 SC43  /dev/sdc 
[4:0:0:0]    disk    ATA      ST4000VN000-1H41 SC43  /dev/sdd 
[5:0:0:0]    disk    ATA      ST4000VN000-1H41 SC43  /dev/sde 
[6:0:0:0]    disk    ATA      WDC WD30EZRX-00D 0A80  /dev/sdf 
[7:0:0:0]    disk    ATA      KINGSTON SV300S3 BBF0  /dev/sdg 
[8:0:0:0]    disk    ATA      ST4000VN000-1H41 SC46  /dev/sdh 

 

 

server-diagnostics-20161001-1035.zip

Link to comment

This problem seems to be resurfacing lately.

 

In my case, on my 6.1.9 machine, I also get 5 corrected errors *always* on the exact same sectors, but only under one specific circumstance.

 

The circumstance is that it's the first parity check after a reboot. Subsequent parity checks without a reboot do not pick up any errors.

 

This implies to me that it's something in the BIOS/HBAs that is changing the values (HPA? But no additional drives have an HPA partition on them that didn't already exist previously).

 

I just haven't had time to properly diagnose where the issue actually is (though I don't believe that in my case it's unRAID's fault). And since, AFAIK, no files are corrupted by the 5 sectors changing at boot time, it hasn't been a priority for me. All my MD5s check out fine, which implies (though doesn't prove) that the problem is on the parity disk itself.
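In case it's useful to anyone, the MD5 checking is nothing fancy - roughly along these lines from the console, where the share name and manifest location are just placeholders rather than my actual paths:

# build a manifest of every file on a share (share path and manifest path are placeholders)
find /mnt/user/Media -type f -exec md5sum {} + > /boot/checksums/media.md5

# later, re-verify and print only the files whose checksums no longer match
md5sum -c --quiet /boot/checksums/media.md5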

Link to comment

The ATA7 errors are from your cache SSD, so they are unrelated to the parity sync errors.

 

I would start by switching your parity disk from the Marvell controller to an unused onboard port, leaving only the cache disk on the Marvell, and see if the issue goes away.

 

I will try that. A question in regard to that... Will I need to do anything in UnRAID for this switch of controllers for a drive? Does it care, or does it just look at the drive itself for its configuration?

 

Thank you.

Link to comment

The ATA7 errors are from your cache SSD, so they are unrelated to the parity sync errors.

 

I would start by switching your parity disk from the Marvell controller to an unused onboard port, leaving only the cache disk on the Marvell, and see if the issue goes away.

 

Hi... Just wanted to mention that I made this change, and we will see what happens.

 

BTW... You mentioned the ATA7 errors were on my cache drive. I do not think the drive is failing, but I'm not really sure what the error is saying. Cable issue? Drive? Something else? (Or are these errors of no real concern? Docker is using the cache drive.)

 

Thank you.

Link to comment

BTW... You mentioned the ATA7 errors were on my cache drive. I do not think the drive is failing, but I'm not really sure what the error is saying. Cable issue? Drive? Something else? (Or are these errors of no real concern? Docker is using the cache drive.)

 

Those were timeout errors, and it then recovered, but naturally it's not good. Some Marvell controllers have been having issues with unRAID v6, and yours is one of those. If your parity issues are resolved you should replace it, which should also take care of the SSD errors. In the meantime you could replace the SATA cable to rule it out.
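It's also worth checking the drive's own SMART data from the console to help rule the disk in or out - a minimal example, assuming the Kingston SSD is still /dev/sdg as shown in your SCSI listing (the device letter can change between boots):

smartctl -a /dev/sdg          # full SMART attribute and error-log report
smartctl -t short /dev/sdg    # optional short self-test; re-run the report a few minutes later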

Link to comment
  • 7 months later...

I'm also having the same problem - in my logs, I see the exact same 5 sectors:

 

May  6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069768
May  6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069776
May  6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069784
May  6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069792
May  6 05:33:01 nasserv kernel: md: recovery thread: PQ corrected, sector=3519069800

 

However, I do not have any ATA link errors.

 

Is there a way to tell which drive those sectors belong to? Attached are my diagnostic logs.

nasserv-diagnostics-20170506-1423-2.zip

Link to comment
Just now, johnnie.black said:

If you run another check without rebooting do you still get the same errors?

 

No, it's fine afterwards - this seems to be a problem with the Marvell chipset... I see that others with a SuperMicro SAS2LP have the problem. I have a Highpoint RocketRaid 2720SGL, which has the same chipset.

 

I'm going to try turning off IOMMU / VT-d and see if this fixes things.
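If there's no BIOS option for it on this board, the fallback I'm considering is disabling the IOMMU at the kernel level from the boot flash instead - something like the following in /boot/syslinux/syslinux.cfg, assuming the stock unRAID boot entry (untested on my part, just a sketch):

label unRAID OS
  menu default
  kernel /bzimage
  append intel_iommu=off initrd=/bzroot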

Link to comment
1 minute ago, coolspot said:

 

No, it's fine afterwards - this seems to be a problem with the Marvell chipset... I see that others with a SuperMicro SAS2LP have the problem. I have a Highpoint RocketRaid 2720SGL, which has the same chipset.

 

I'm going to try turning off IOMMU / VT-d and see if this fixes things.

 

@Squid had the idea that these 5 repeating sectors after a reboot could be related to int13h being enabled; disable it in the SAS2LP BIOS and see if it helps.

Link to comment
5 minutes ago, coolspot said:

 

No, it's fine afterwards - this seems to be a problem with the Marvell chipset... I see that others with a SuperMicro SAS2LP have the problem. I have a Highpoint RocketRaid 2720SGL, which has the same chipset.

 

I'm going to try turning off IOMMU / VT-d and see if this fixes things.

I just have no idea if the RocketRaid 2720SGL supports disabling int13h in its BIOS. But if it does, you would need to keep VT-d / IOMMU enabled in the computer's BIOS and just disable int13h in the HBA's BIOS.

 

It's only a theory. Not tested at all, but perfectly safe to do. (Enabling int13h only matters if you need to boot from drives attached to the HBA.)

Link to comment
  • 2 weeks later...

OK - So it's not just me then....  (I am relieved, kind of.)

 

I also have the SAS2LP in my main server, and every so often (like after I just upgraded to unRAID version 6.3.4) I see the same 5 errors when I do a parity check.

 

The problem that I have is that the errors are real and are not on the parity drive.  I have a number of files which consistently get trampled on when this happens.  How do I know this?  I use an rsync-based script to do a binary compare of every file on my server against the backup server.  This doesn't take as long as one might think because I am able to run multiple compares in parallel (one instance per drive), so it takes about 12 hours or so.  The backup server does not use the SAS2LP controller and never fails its parity checks.  After I find the 5-error failure on the main server, I can restore the corrupted files from the backup server.  I then run a correcting parity check on the main server so that parity is correct for the re-written data.  I then run another non-correcting parity check, which completes without error, and then another binary compare between the servers with no errors found.  Everything is then OK and remains OK until the next reboot.
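For context, the compare boils down to something like this (the disk share names, backup hostname and log location here are placeholders, not my actual script):

# dry-run rsync with checksum comparison, one background instance per data disk;
# any file whose checksum differs from the backup copy gets itemized in a per-disk log
for d in disk1 disk2 disk3 disk4 disk5; do
  rsync -rn --checksum --itemize-changes /mnt/$d/ backup:/mnt/$d/ > /boot/compare-$d.log &
done
wait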

 

So - is there anything that I can add to the discussion or do to help with diagnosis?  

 

Or is my only safe option to junk the SAS2LP and go back to the slower SASLP, which has never shown the problem?

 

 

Link to comment
4 minutes ago, S80_UK said:

So - is there anything that I can add to the discussion or do to help with diagnosis?

 

Try this:

 

On 06/05/2017 at 7:37 PM, johnnie.black said:

@Squid had the idea that these 5 repeating sectors after a reboot could be related to int13h being enabled; disable it in the SAS2LP BIOS and see if it helps.

 

Link to comment
11 minutes ago, S80_UK said:

The problem that I have is that the errors are real and are not on the parity drive.  I have a number of files which consistently get trampled on when this happens.

 

Forgot to add that if this also happens to the other users with the same issue, it's much more serious than I thought. Is it always the same files or different ones?

Link to comment

Many thanks for chiming in.  So far, it's always the same files.  One obvious one is a Blu-ray rip - Cast Away; it's around 38GB, so it's an easy target.  When I compared the backup against the main server copy I could see that quite a few sectors are getting overwritten - it may be possible for me to see if it's the same corruption each time, but it will take a while.  There are also a couple of Windows 10 automatic backup database files that get screwed up.  On the plus side, I can at least see where the errors are because I do have a 100% backup.

 

I shall try disabling the interrupt in the BIOS when the machine has finished the current round of checking (not sure if it's currently enabled or not).  I will also need to restore the good files and then recheck everything before I can do a fresh power cycle, so it may be a while before I come back with any more details.

Link to comment

Update:  

 

So I restored the corrupted files.  I used the reconstruct-write method so that parity was forced to be correct regardless of its previous state (it just saved a bit of time).  I then ran a parity check without error.  I then did a full compare against the backup server - no errors.  Then I did another parity check - still no errors.  So far, so good.

 

Next I rebooted the server to look at the interrupt option on the SAS2LP controller - it was enabled, so I disabled it.  Then I powered off, rebooted, and checked that the setting had been retained.  It had.  Power up - all looked OK.  Read and wrote a few files from my main PC.  OK.  Then last night I ran another parity check.

 

Guess what..?  5 errors.  Back again.  Interrupt off.  Damn!

 

May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565768
May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565776
May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565784
May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565792
May 22 02:48:13 Tower-V6 kernel: md: recovery thread: P incorrect, sector=1565565800

 

So now I am going to run another file compare against the backup to see what changed.  My guess is that it will be the same files as before.  Watch this space.

 

 

Link to comment

Yes - I may have had some other activity going on, so now I want to try to narrow it down further.  I just restarted from power off (int13h still disabled) and then started the file compare against the backup.  My plan is to complete that, then fix any broken files, then check parity (should be OK), then compare files (should still be OK), then shut down completely.  Follow that with a power up and another parity check (which might show the errors again).  I will try to avoid writing to any shares for the duration - shares that get written to by PC Windows backups have been set to read-only for now.

 

A full file compare takes about 14 hours.  A parity check takes about 7.5 hours.  So a couple more days.  I could shorten this by only checking the files that I suspect may get damaged, but I'd like to know that nothing else is affected.  Or I could do the full binary compare only after I have observed the parity errors.

Link to comment

A further update... With the SAS2LP interrupt off:

 

Restarted.

Fixed the broken files from the backup with reconstruct write set to On, to force parity reconstruction for those sectors regardless of the previous parity data.

Switched reconstruct back to Auto (which is my normal setting).

Ran a complete parity check (was all OK)

Checked files that were previously getting corrupted were still OK by reading off the server and then doing a binary compare.

Shut down the server completely.

Power up.

Captured diagnostics just in case (attached if interested).

Ran another complete parity check.  (Good news - It was still all OK.)

 

So for the time being I shall do a parity check every few days.  I normally leave the server running 24/7, but I shall try an occasional restart just to see if I can force the error to come back.

 

But if this has solved it, then we are left with the conclusion that with the interrupt enabled, there can be a small number of destructive writes which need not be confined to the parity drive, but which only show up during a parity check.  The prevailing view is normally that if there is the odd parity error then a parity correction is all that is required, but that was not the case here.  At least I have my backups, and there is no SAS2LP in that machine, just the slower SASLP version.  The attached file diffs.zip contains the report from a command-line file compare of the backup version of the affected MKV file against the corrupted version.  I doubt that it's that useful, but it shows that most of the overwritten data was replaced with zeroes.
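For what it's worth, the report came from a byte-by-byte compare roughly along these lines (the paths are placeholders for my actual mount points, and this is only a sketch of the approach, not the exact command):

# cmp -l lists every differing byte as: offset, octal value in the backup, octal value in the corrupted copy
cmp -l "/mnt/backup/Movies/Cast Away.mkv" "/mnt/disk1/Movies/Cast Away.mkv" > diffs.txt

# long runs of "0" in the last column are the regions that were overwritten with zeroes
head -n 40 diffs.txt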

 

One passing thought - would a second parity drive have helped to catch this sooner, or have given more clarity as to where the error lay?  Or might it just have made life more complicated?

tower-v6-diagnostics-20170523-1012.zip

diffs.zip

Link to comment
2 minutes ago, S80_UK said:

But if this has solved it, then we are left with the conclusion that with the interrupt enabled, there can be a small number of destructive writes which need not be confined to the parity drive, but which only show up during a parity check.  The prevailing view is normally that if there is the odd parity error then a parity correction is all that is required, but that was not the case here.  At least I have my backups, and there is no SAS2LP in that machine, just the slower SASLP version.  The attached file diffs.zip contains the report from a command-line file compare.  I doubt that it's that useful, but it shows that most of the overwritten data was replaced with zeroes.

 

Thanks for testing and keep us updated, but the fact that these writes corrupt files is troubling. Also, if parity is on the SAS2LP, it's probably getting corrupted as well. I never had this problem with mine, although I always had VT-d disabled on the servers where they were used, and I just recently checked and int13h is also disabled on both - I must have disabled it for another reason when I got them.

 

5 minutes ago, S80_UK said:

One passing thought - would a second parity drive have helped to catch this sooner, or have given more clarity as to where the error lay?  Or might it just have made life more complicated?

 

Not sure, but probably not - it depends on where each disk is connected.

Link to comment
