[SOLVED] Filesystem errors during Parity Sync



Posted (edited)

I am in the process of upgrading my parity drives to a larger capacity and I ran into an issue. Here is the series of events:

 

  1. I shut down the array and turned off the server to replace Parity #1.
  2. Started the server and let the parity sync. This was successful.
  3. I stopped the array and attempted to swap the old Parity #2 with a new disk while the server was still on. Unraid recognized the new drive and things seemed just fine. When I attempted to start the array for the next parity sync I got read and write errors on disk7 and disk15. I know I was not writing any files during this time, so I'm not sure where the reads/writes were coming from.
  4. I tried to stop the array and parity sync as soon as I noticed the errors, but the array did not seem to stop, so I chose to shut down, which worked.
  5. When the server came back up the parity sync for Parity #2 started automatically.
  6. I stopped the parity sync because disk7 and disk15 were disabled.
  7. I tried to start the array in maintenance mode to do an XFS check on the two disks, however it couldn't do anything. I assume that's because I didn't have valid dual parity at the time, since I was in the middle of a parity sync.
  8. I checked the SMART reports on both disks and things looked fine.
  9. At this point I figured I could either try to put my old parity drive back in and restore both disks, or move forward. Since the parity sync had already started I assumed my Parity #1 may have been damaged anyway, and I no longer had my old Parity #1 disk since I pre-cleared it after the successful Parity #1 sync.
  10. I chose to do a new config, and trusted parity since I knew the disks weren't dying and I wasn't writing any files to the array at the time.
  11. I confirmed I could access the contents on disk7 and disk15 without issue. Note that I did not attempt an XFS check at this time since I could access the files. In hindsight maybe I should have. (?)
  12. I then stopped the array to un-assign Parity #2 (which I know was not good parity since I stopped it earlier).
  13. I reassigned Parity #2 to start the sync.
  14. Things seemed to go well for several hours until I mistakenly tried to build hashes with the File Integrity Plugin.
  15. Shortly thereafter I started to receive XFS errors in the syslog, and disk7 and disk15 became inaccessible via Midnight Commander but still showed green in the Unraid UI, and the parity sync kept going.
  16. The syslog is now showing errors about filesystem size every second, I assume because disk7 and disk15 are inaccessible due to FS errors.
  17. Currently Parity Sync is still running with about 50% left to go. All disks are still marked green and there is nothing in the Unraid UI that would indicate there are any issues. There are some writes to Parity #1 during the sync (about 14k so far), which I expected given the circumstances.

 

So my question is - where do I go from here?

Do I let the parity sync complete then try to do an XFS repair?

If the drives are inaccessible due to FS errors, is it normal that they are still green? It just seems odd that everything in the UI looks fine, but there are definitely issues with the array in its current state.

 

XFS Errors:

May 30 14:19:15 Tower kernel: XFS (md15): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x224b896e0 len 8 error 117
May 30 14:19:15 Tower kernel: XFS (md15): Metadata corruption detected at xfs_da3_node_read_verify+0x106/0x138 [xfs], xfs_da3_node block 0x224b896e0 
May 30 14:19:15 Tower kernel: XFS (md15): Unmount and run xfs_repair
May 30 14:19:15 Tower kernel: XFS (md15): First 128 bytes of corrupted metadata buffer:
May 30 14:19:15 Tower kernel: 000000009be5df51: 8e 32 b3 8e 85 31 07 74 b0 e4 d5 75 30 df b5 66  .2...1.t...u0..f
May 30 14:19:15 Tower kernel: 000000001ca01a9b: 10 00 1f 09 0a b0 b0 0c 2d b0 9e 23 c6 27 21 fb  ........-..#.'!.
May 30 14:19:15 Tower kernel: 00000000c4409e1b: ac 65 36 7c 92 bd df 0b d6 f3 31 66 a3 28 9b 49  .e6|......1f.(.I
May 30 14:19:15 Tower kernel: 000000006ce0d9d1: db c3 35 b2 99 2c bc 00 d0 c4 87 c2 4f 13 29 1c  ..5..,......O.).
May 30 14:19:15 Tower kernel: 000000004b95f201: 10 f6 fe 58 69 df bf f5 0a 0c e8 86 36 0d 84 34  ...Xi.......6..4
May 30 14:19:15 Tower kernel: 0000000093039ca2: dc 14 0a f4 53 d3 06 a2 1c 48 40 8b 02 ec 13 20  ....S....H@.... 
May 30 14:19:15 Tower kernel: 00000000e1ecbb2f: 08 50 9f d6 f0 b8 66 a2 18 00 6f e3 f0 80 e1 e8  .P....f...o.....
May 30 14:19:15 Tower kernel: 0000000097002a7b: 60 3d 53 f8 d7 18 88 02 0a 10 d7 e1 b4 02 f3 15  `=S.............
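
For reference, the check the kernel is asking for would look roughly like this from the console, with the array started in maintenance mode (the /dev/md15 device matches the md15 in the log above and is only an example - substitute the disk in question):

xfs_repair -n /dev/md15    # check-only pass, reports problems without changing anything
xfs_repair /dev/md15       # actual repair, run only after reviewing the -n output
xfs_repair -L /dev/md15    # last resort if it refuses to run because of a dirty log; -L zeroes the log and can lose recent metadata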

 

FS Size Errors:

May 30 14:37:26 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/appdata
May 30 14:37:26 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/domains
May 30 14:37:26 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/iTunes
May 30 14:37:26 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/system
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Applications
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Backup
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Concert Videos
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Downloads
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/LaunchBox
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Movies
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Music
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/Standup Comedy
May 30 14:37:27 Tower emhttpd: error: get_fs_sizes, 6412: Input/output error (5): statfs: /mnt/user/TV Shows
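
I assume these show up on every share because the user shares overlay all of the data disks, so one corrupt disk filesystem is enough to make statfs fail on every share path. If it helps narrow things down, something like the following should show whether the individual disk mounts are the ones erroring (paths are just examples):

df -h /mnt/disk7 /mnt/disk15    # the individual disk mounts; an I/O error here points at the corrupt filesystem
stat -f /mnt/user/Movies        # the same statfs call emhttpd is making, run by hand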

 

In the future I certainly won't try swapping disks again with the server still on and only the array stopped. Powering off is apparently much safer.

Also, the issue above felt like being between a rock and a hard place, and I'm sure I could have taken better steps - but it's in the past and I'm just trying to save as much data as I can and eventually get things back into proper working order.

 

I can provide diagnostics if needed, but figured I would lay out the chain of events first since you don't run into this every day.

Edited by johnsanc
Posted (edited)

If you hot-plug disks while the array is stopped, it shouldn't cause a problem. Was the new Parity #2 really a new disk?

 

 

"14. Things seemed to go well for several hours until I mistakenly tried to build hashes with the File Integrity Plugin"

I don't think that was a wrong step in itself, but with a corrupt FS on disk7 and disk15, the hash results being written back to those disks is where the trouble would start.

 

 

24 minutes ago, johnsanc said:

If the drives are inaccessible due to FS errors, is it normal that they are still green?

From what I've heard, Unraid will only drop a disk on a write error.

 

 

26 minutes ago, johnsanc said:

Powering off is apparently much safer.

Yes. It would be even safer in maintenance mode, but that's not really practical.

 

 

Edited by Benson
Posted (edited)

Parity #2 was a brand new pre-cleared disk.

 

Disk 7 said it had an up-to-date build in the File Integrity screen, but I'm not sure that was accurate, considering I later tried to build hashes for another disk that also said its build was up-to-date (and that was already done calculating parity), and it was definitely writing to the disk...

 

There is nothing specific in the syslog about writes via the File Integrity plugin and it did not generate any error logs. I just assume there was an attempt to write a hash to extended attributes which triggered the XFS errors. Or perhaps even the reads could trigger the XFS errors.

Edited by johnsanc
Posted
1 minute ago, johnsanc said:

Or perhaps even the reads could trigger the XFS errors.

Agree.

 

The issue seems related to a hardware problem during the hot-plug rather than a software issue.

Posted (edited)

Yeah, most likely. At this point I'm just wondering if I should let the parity sync finish, or if I should stop it, do another new config, and try an XFS repair. I suppose there's no further damage that can be done by allowing the parity sync to continue, but I'm not sure if it's a waste of time - for example, if I'll need to do a complete new parity sync again anyway.

 

Do you think another parity sync would be required?

Or would an XFS repair after parity is done syncing suffice?

Edited by johnsanc
Posted (edited)
37 minutes ago, johnsanc said:

I suppose there's no further damage that can be done by allowing the parity sync to continue

Agreed. It's hard to say whether to stop the sync or not, but I would prefer to let it continue.

 

The next steps should be: repair the filesystem, then sync again, then run a hash check for any corrupt files.

 

In fact, you have me a bit worried, because tomorrow is my monthly parity check day. I have an external enclosure that is not part of the array; its disks are 6TB and a disk check takes about 12 hours. My parity is 10TB and a check takes about 20 hours, so there is an 8-hour difference.

In the past, I would start the parity check, and after 8 hours power on the external enclosure and check those disks, so all the checking tasks finish at almost the same time.

 

Yes, something like hot-plugging while the array is in a mounted state.

Edited by Benson
Posted (edited)

Thanks, that's what I was leaning toward.

Is another sync necessary after an XFS repair?

Wouldn't a parity check basically do the same thing and write any corrections to both parity drives?

I'm a bit hesitant to do a full parity sync on both parity drives at once.

 

That reminds me, I should disable my monthly parity check this time around just in case.

 

You also reminded me of a detail I left out. I had a USB external disk preclearing the entire time. One of the reasons I wanted to swap with the server still on was so I could let the preclear continue. I didn't realize that you could resume a preclear after a reboot, nor did I realize that stopping the array also stops a preclear on an unassigned device (which you can resume).

 

Edited by johnsanc
Posted
3 minutes ago, johnsanc said:

Is another sync necessary after an XFS repair?

Yes, because parity integrity is always important; otherwise it loses its purpose.

 

5 minutes ago, johnsanc said:

Wouldn't a parity check basically do the same thing and write any corrections to both parity drives?

Yes, with correction it is effectively the same as a full write (full sync), and it may cause less write wear on the disks. But the log may be flooded with errors.

 

Posted
10 hours ago, johnsanc said:

Is another sync necessary after an XFS repair?

If the filesystem check/repair is done correctly (using the GUI or CLI) parity will be updated and kept in sync.
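
In other words, roughly (device names here are only illustrative):

xfs_repair /dev/md7     # through the md device: repair writes go through the parity layer, so parity stays in sync
xfs_repair /dev/sdX1    # against the raw partition: writes bypass parity and would invalidate it - avoid this for array disks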

Posted

As of this morning I was able to do an XFS repair on the two disks and I can access the files again. I did see that there were writes to parity when I made the XFS repair. I am doing a parity check now just to make sure everything is in good working order before I start doing any new file writes to the array.

 

This was a bit of a scare, but Unraid did the right thing to minimize data loss at every step.

Also, there hasn't been a single XFS issue I haven't been able to recover from over the years. Shows how mature that file system is.

 

Thanks all for the help and guidance!

Posted (edited)

Quick update - More weirdness occurring. During my parity check Parity #1 has basically non-stop sync errors after the 2TB mark, almost as if the parity alignment was thrown off after my 2 TB disks.

 

I was watching my disk usage as parity check crossed over the 2TB mark and I noticed slow reads on only one of my 2TB drives for a few minutes after the other 2TB drives stopped reading. Even weirder is that Parity #2 has hardly any writes, implying there are far fewer sync issues with Parity #2.

 

There is nothing in the syslog indicating any errors.

 

Any ideas what could cause this?

Shouldn't my parity sync of Parity #2 also have corrected Parity #1? (as outlined here:)

I just can't think of a reason why Parity #2 would appear good, but Parity #1 would not. If there were any parity issues I would have expected both parity drives to need updating.

 

Update: 107,625,991 sync errors and counting. I have no idea how Parity #1 could get THAT out of whack. After some quick math, it seems as if about every sector after the 2TB mark is triggering a sync correction. I understand Parity #2 could do that if I rearranged disks, but I am at a loss to understand what could possibly cause this aside from some sort of software "hiccup" that throws off the "P" calculation but not the "Q". @johnnie.black - I certainly welcome your opinions on this one. My google-fu may be weak, but I scoured the forums and could not find anything similar to this behavior.

 

For now I guess I'll let this run its course and run another Parity check after this completes.

 

Just for fun, you can see at around 18:00 when the 2TB mark was passed, then minutes after, a massive slowdown to about 65 MB/s as corrections to Parity #1 are written. It was during that spike that I saw reads to one of my 2TB drives, which I don't think should have been occurring since the parity check should have been past that point.

image.thumb.png.4d278427b7bb31f02540ca9d62d4fd09.png

Edited by johnsanc
Posted
11 hours ago, johnsanc said:

could possibly cause that aside from some sort of a software "hiccup" that throws off the "P" calculation but not the "Q".

Very unlikely that would happen; the most likely explanation is that parity wasn't valid to begin with. Run another check now (non-correcting); if there are more sync errors you likely have a hardware problem.

Posted (edited)

I'll run another non-correcting parity check tomorrow night if this completes by then. As of about 5:30 AM on the nose I am now getting writes to Parity #2 for what looks like every sector. This is exactly when my SSD TRIM schedule is set to run... is there any way that could be related?

 

The only thing I can think of that might have caused this is that I had to reboot with no array configuration at all before the XFS repair, because for some reason I still couldn't select disk7 or disk15 after doing a New Config. Once everything was booted back up I manually reassigned my disks and confirmed I could still access everything.

 

Either way, I still see zero errors in the syslog unless they are being suppressed due to the P+Q corrections.

Have you ever seen hardware errors that resulted in zero errors in the syslog?

 

Given the odd chain of events outlined in my first post, I guess maybe it can be chalked up to that. I'll just let this complete and then take it from there; as long as I'm not seeing errors in the syslog and I can still access my files, I should just let this run its course. It will be particularly interesting to see what happens after the 10TB mark, since my new parity drives are 12TB but none of my data drives are that large yet.

Edited by johnsanc
Posted
35 minutes ago, johnsanc said:

As of about 5:30 AM on the nose I am now getting writes to Parity #2 for what looks like every sector. This is exactly when my SSD TRIM schedule is set to run... is there any way that could be related?

Can't see how, but if there are writes to parity2 there should be the same ones to parity1. You can't look at the number-of-writes stats to compare, since they can vary wildly from device to device even when doing the same activity, but you can look at the speed stats; those are reliable.

Posted (edited)

Speed stats are dead even to both Parity drives.

I'm hoping this check gets things back in order. I'm just racking my brain as to how the parity could be so incorrect, considering I haven't written anything to the array aside from a few hashes from the File Integrity plugin and an XFS repair. It feels like my parity should have been mostly correct before, and now this correcting check is actually breaking the parity. The only other explanations I can think of are:

  1. Previous parity syncs were bad - but that doesn't really make sense, since my initial build of Parity #1 was as clean as possible and there were no writes that I can recall when building Parity #2 (aside from those outlined in my first post)
  2. A software bug somewhere, since there are zero hardware-related errors in the logs as far as I can tell

 

It's an eerie coincidence that the sync corrections both align to the minute with other events that occurred:

  • Parity #1 sync corrections: at the border of the 2TB drives, where there appeared to be extra reads to one drive that shouldn't have been happening. This was also at 18:00 on the nose, but I don't have anything scheduled at that specific time as far as I know.
  • Parity #2 sync corrections: exactly at 5:30 AM, which coincides with when SSD TRIM was executed.

 

I suppose I will know much more after the 10 TB mark, and especially after I try a non-correcting parity check after this completes.

image.thumb.png.ec1b2dbc65770a3cf1b6920667026d07.png

Edited by johnsanc
Posted (edited)

The parity check crossed over the 10 TB mark, which is the size of my largest data drive right now. There are no more writes to either parity drive, which is what I would have expected.

 

Once this completes in a few hours I will try a non-correcting parity check.

 

UPDATE: After the correcting parity check completed I immediately put the array into maintenance mode and did an XFS check on all of my disks. Two other disks had XFS errors. All 4 disks that had errors over the course of this ordeal are on my ASMedia controller. My parity disks are NOT on that controller. The XFS repair attempt caused the disks to be kicked from the array. I did another new config and manually changed all disks back to "XFS" (no idea why Unraid did not remember that setting...).

 

After powering down and reseating all drives on that controller I booted back up and ran another XFS check on all drives - no issues this time. I finally started up the array in normal mode and a non-correcting parity check is currently running. A few sync errors have been detected, but I expected those due to the XFS repair that ran while the disks were kicked from the array. I plan on letting this check run past the 2 TB mark since that's where things went awry last time... more to come.

Edited by johnsanc
Posted (edited)

OK, another update. My non-correcting parity check went over the 2TB mark with only a few sync errors. I stopped this check and restarted with a correcting parity check. Now, again deep into the check, after about 5TB or so I see every sector triggering a parity correction. Again there are ZERO issues in the syslog.

 

Does anyone have an explanation for this? It's concerning that I keep getting these sync corrections but there is no info in the syslog that would suggest any issues whatsoever.

 

I did notice that the parity corrections seem to coincide with network activity, which happened to be a download from my Nextcloud.

Considering there are no errors in the syslog, it almost seems like the parity check/sync is very fragile and can be easily thrown off. Either that or there is something weird going on with my SATA controller and it's returning incorrect data that goes unreported in the syslog.

 

This one is a mystery and any theories around what is happening are appreciated.

 

image.thumb.png.6e3a458e6c1d325b93143206076b797c.png

Edited by johnsanc
Posted
8 hours ago, johnsanc said:

My non-correcting parity check went over the 2TB mark with only a few sync errors.

The only acceptable result is 0 errors.

 

8 hours ago, johnsanc said:

Considering there are no errors in the syslog it almost seems like the parity check/sync is very fragile and can be easily thrown off.

That's pretty much impossible (without errors in the syslog); either you have a hardware problem or something is accessing a disk outside of parity, e.g. by accessing /dev/sdX instead of /dev/mdX.
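
One way to check for the latter while the array is running might be something like this (sdg/sdg1 are just example device names):

lsof /dev/sdg /dev/sdg1    # lists any process that has the raw disk or its partition open
fuser -v /dev/sdg1         # alternative view of the same thing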

Posted

Perhaps this is more of an academic or mathematical question, but what can cause a parity check to start doing parity corrections to every sector?
That's what I don't understand.

 

Take this chain of events:

  • Do a correcting parity check, with no syslog errors - Everything seems good at this point.
  • Mount array in maintenance mode to do XFS checks
  • Two disks get kicked from the array during the XFS repair
  • Do a New Config and Trust Parity
  • Reboot the server to reseat all drives and check cable connections
  • Start array and do a correcting parity check

In this case the only writes would have been during the XFS repair, and I would have expected a few parity corrections because of it. How is it possible that the parity could have multiple terabytes of sync errors? It's simply not possible unless there is something else going on with how parity works that I haven't seen explained anywhere. There must be something that can invalidate parity past a certain point instead of simply on a sector-by-sector basis.

 

The only other explanation is a hardware error that goes completely unreported in the syslog.

Posted
14 minutes ago, johnsanc said:

Perhaps this is more of an academic or mathematical question: but what can cause a parity check to start doing parity corrections to every sector?

When you did the New Config did you use the option to keep current assignments? Just asking, as the parity2 calculation uses the disk slot number as one of its inputs, so moving any drive to a different slot would invalidate parity2 and could thus cause this sort of effect. Parity1 is not affected by which slot drives are in, so it is not affected if you re-order drives.
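
Roughly, and assuming Unraid's dual parity follows the usual RAID-6 style P+Q construction (which is where the slot dependence comes from):

P = D1 xor D2 xor ... xor Dn                              (no slot numbers involved, so re-ordering drives does not change P)
Q = g^1*D1 xor g^2*D2 xor ... xor g^n*Dn over GF(2^8)     (each disk's coefficient g^slot depends on its slot, so moving a drive changes Q)

where Dn is the data block from the disk in slot n and g is a fixed generator.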

Posted (edited)

Yes all disks are in the same position every time. In this case there are both P and Q sync corrections, both starting at the same place.

In my earlier chain of events the P corrections started happening first then P+Q corrections a few TB later.

 

It really doesn't make sense to me given my current understanding of how parity works.

 

I do not have the diagnostics for the most recent run, but a few posts back I provided two diagnostics that also should have exhibited the same excessive sync correction behavior.

 

I stopped my correcting parity check yesterday in the middle of the flood of sync corrections. I rechecked cables and rebooted, and the correcting parity check is running again. So far, 0 sync corrections. In a few hours it should reach the point where the sync corrections started yesterday. I am really curious to see what happens with the rest of this run. If I see that there is a range where excessive sync corrections happen and are then resolved, I think that suggests something went awry and Unraid was making sync corrections yesterday that it shouldn't have.

Edited by johnsanc
Posted (edited)

I've attached the syslog saved by my syslog server over the past couple of days. I've also attached my diagnostics as of this afternoon, which show the parity corrections.

 

As you can see below and in the logs attached, the sector value increments by 8 on every line.

Jun  4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152832
Jun  4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152840
Jun  4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152848
Jun  4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152856
Jun  4 13:18:43 Tower kernel: md: recovery thread: PQ corrected, sector=12860152864
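
Assuming 512-byte sectors, that increment of 8 is 8 x 512 = 4096 bytes, so each line is one consecutive 4 KiB block of parity being corrected - literally every block in that region, which works out to roughly 268 million corrections per TiB.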

 

To summarize my various parity checks from the last few days starting with the parity check I let make millions of corrections:

  • Check #1 - 05/31 11:32:34 --> 06/02 15:20:39
    • Correcting
    • I let it finish
    • Constant sync corrections after 2 TB. It was during this check that I had two disks that were still green, but were not accessible due to XFS errors. It was after this check that I did the XFS check/repairs on the two disks that were problematic.
  • Check #2 - 06/02 17:43:21 --> 06/03 00:16:56
    • Non-correcting
    • Canceled after 2 TB mark
    • There were no sync errors past the 2 TB mark like there were before, which led me to believe that my last parity corrections were written correctly. There were some sync errors initially, which I expected due to the writes from the XFS repairs after the disks were kicked from the array.
  • Check #3 - 06/03 00:17:45 --> 06/03 18:25:37
    • Correcting
    • Stopped early
    • Constant sync corrections after about 5 TB. After stopping I did XFS checks on all disks. Two different disks required repairs. I also reseated my expander card and checked disk connections.
  • Check #4 - 06/03 19:55:33 --> CURRENTLY RUNNING
    • Correcting
    • Constant sync corrections after about 6 TB. I assume the corrections from Check #3 were good, since the only parity corrections in this check came after the point where I stopped Check #3.

 

Based on this, so far the only thing I can deduce is that XFS repairs can somehow invalidate parity... or that my Check #1 wrote 100% incorrect parity after a certain point while the disks were green but the files were inaccessible due to XFS errors.

 

I will let this finish and run another check without rebooting. It'll take a while if the parity corrections continue at ~50 MB/s for the next 3 TB or so.
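
(Rough math: 3 TB at ~50 MB/s is about 3,000,000 MB / 50 MB/s = 60,000 seconds, so roughly another 17 hours.)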

 

syslog.log

tower-diagnostics-20200604-1346.zip

Edited by johnsanc
