[SOLVED] Filesystem errors during Parity Sync



On 6/4/2020 at 7:47 PM, johnsanc said:

the sector value increments by 8 on every line.

That's normal: parity is checked one standard 4k Linux block at a time, and each 4096-byte block spans 8 sectors on standard 512e drives (4096 / 512 = 8), so the logged sector value advances by 8 per line.

 

On 6/4/2020 at 7:47 PM, johnsanc said:

I can deduce is that XFS repairs can somehow invalidate parity.

That seems very unlikely to me, if not impossible (assuming xfs_repair was run correctly).

 

On 6/4/2020 at 7:47 PM, johnsanc said:

I will let this finish and run another check without rebooting.

Yes, please do that and post the diags.

Link to comment

Well, here is an update so far... The data disks are done with the parity check, but it's currently checking nothing, because my parity disks are 12TB and my largest data drive is only 10TB.

 

It looks like I had an IO_PAGE_FAULT error and then, a few minutes later, some XFS metadata errors, first on disk10 (which was still parity checking) and then later on disk9 (which was already done checking). I can still access those disks and they are not emulated.

 

Looking back at my old logs, the same thing happened before the last time I got XFS errors in the log. In all cases the IO_PAGE_FAULT came from my "ASM1062 Serial ATA Controller", which is onboard. Also, not sure if it's related, but I noticed in the logs that the XFS issues appeared shortly after 5:00 AM in both this run on 6/5 and the one on 6/2 (within one minute of each other).

 

So should I continue with another check?

Or should I try to do an XFS repair on the two disks that have issues?

Try to copy data and reformat those drives?

Something else to try to fix whatever the controller issue is?

Upgrade to 6.9 beta for better support for X570?

 

Any guidance on next steps is appreciated as always.

 

UPDATE:

This seems very much related to issues I was having before. X570 woes.

I also noticed that I forgot to add "iommu=pt avic=1" to my syslinux.cfg for Unraid GUI mode, which I am currently using.
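For reference, once those parameters are added the GUI mode boot entry in syslinux.cfg would end up looking roughly like this (a sketch based on the stock entry; the exact label and paths may differ on your flash drive):

label Unraid OS GUI Mode
  kernel /bzimage
  append iommu=pt avic=1 initrd=/bzroot,/bzroot-gui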

tower-diagnostics-20200605-0905.zip

 

Edited by johnsanc
Link to comment

I stopped the remainder of the check, upgraded to 6.9-beta1, rebooted, and did an XFS check on disks 9 and 10. Nothing seemed to indicate any issues as far as I can tell. I am now running another correcting parity check.

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 8
        - agno = 13
        - agno = 15
        - agno = 4
        - agno = 0
        - agno = 2
        - agno = 7
        - agno = 6
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 14
        - agno = 3
        - agno = 17
        - agno = 19
        - agno = 16
        - agno = 18
        - agno = 5
        - agno = 9
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
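For anyone following along, the output above is from a check-only run; the equivalent command line would look something like this, run from the console with the array started in maintenance mode (md9 is just an example device, substitute the disk number):

xfs_repair -n /dev/md9

The -n (no modify) flag only reports problems without changing anything, which is why phase 5 and the final filesystem flush were skipped in the output.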

 

Link to comment
9 hours ago, johnsanc said:

In all cases the IO_PAGE_FAULT came from my "ASM1062 Serial ATA Controller" which is onboard.

Good catch, but that alone can't tell whether the problem comes from the SATA controller itself or from the X570 platform. Since I moved to SAS + expander I no longer use AHCI, and I've had no trouble moving between different platforms. As you have so many disks and already have a SAS controller, you should invest in a SAS + expander solution. The graph below also shows some bottlenecking (a flat ceiling in sections 1 and 2 while all disks are being accessed).

 

[attached image: disk speed graph showing a flat ceiling in sections 1 and 2]

 

 

I also noticed you have 64GB of memory set to 2666. When I ran an X370 board with 64GB, it was only stable at 2400, and for safety I run it at 1866. Please check that too.

 

You have the "Enhanced Log" plugin installed; do you proactively monitor those critical error messages?
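For example, something as simple as an occasional grep of the syslog would surface them (just an illustration, adjust the pattern to taste):

grep -iE "io_page_fault|xfs.*(error|corrupt)" /var/log/syslog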

Edited by Benson
Link to comment

I do have a SAS+Expander as well. 

LSI LSI00301 (9207-8i) + Intel RES2SV240NC

 

Interesting point about the memory - it is ECC and straight from my motherboard's QVL for RAM.

Since I just upgraded to 6.9-beta1, I will let this parity check complete and monitor the logs for any more similar errors before I attempt to change any other settings.

Edited by johnsanc
Link to comment

Another quick update: my parity check started firing off corrections at about the 9.25TB mark, which is right about where I started getting the IO_PAGE_FAULT error the other day during my parity check.

 

So, after this ordeal I am left with a couple of takeaways:

  1. It's possible for Unraid to write bad parity, and nothing in the Web UI indicates that anything went wrong; you only find out if you look at the syslog.
  2. The "bad parity writing" issue starts with the lovely AMD IO_PAGE_FAULT error. In my case there were a few XFS errors after it and my log was not flooded... but parity was indeed incorrect for every sector from that point on.

So, although I think I have recovered from this, it's a bit concerning that this scenario can apparently write bad parity without the user knowing. It could leave someone with a completely unprotected array, and they would not even know it until their next parity check.

Link to comment

It is always possible to end up with "bad" parity when there are hardware issues. As long as you can now run two (or more) consecutive parity checks without errors you should be fine. Also make sure there are no more controller-related errors; that's not an Unraid problem, it's a Ryzen problem (likely with older kernels only).

Link to comment

Yep, I'm going to kick off another parity check to make sure there are zero errors.

 

It's not an Unraid problem per se, but doesn't the behavior above indicate that Unraid does not re-read the sync correction it just made to ensure it's valid?

If not, it would be nice to have a "Parity Check with Validation" option.
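Conceptually the validation would just be a read-back-and-compare after each correction is written. A rough illustration of the idea for a single 4k block, with made-up device and block numbers (this is not how Unraid actually does it):

# write the corrected block, then read it back bypassing the page cache and compare
dd if=corrected.bin of=/dev/sdX bs=4096 seek=12345 count=1 conv=notrunc,fsync
dd if=/dev/sdX of=readback.bin bs=4096 skip=12345 count=1 iflag=direct
cmp corrected.bin readback.bin && echo verified || echo verify-failed

The direct read is the important part; without it you would mostly be re-reading your own write from the page cache rather than from the disk.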

Link to comment
16 hours ago, johnsanc said:

but doesn't the behavior above indicate that Unraid does not re-read the sync correction it just made to ensure its valid?

It doesn't; it assumes the write was successful if the kernel doesn't spit out any errors, and there should be an error if the correction wasn't correctly written. There was recently a curious case, also due to a controller problem:

 

Link to comment
  • JorgeB changed the title to [SOLVED] Filesystem errors during Parity Sync
