
Some shares drop soon after array starts


Wolfe


I noticed a few days ago that 4 of my 12 shares were dropping occasionally - always the same 4 shares and always simultaneously.  At first it might be hours before they dropped but today it only takes a minute or less.  They always reappear after stopping and starting the array.

 

This is Unraid 6.9.2 with all plugins and dockers updated.

 

I read through a number of threads about shares dropping and tried some of the suggestions.

 

SYMPTOMS:
• Disk 2 says "too many files" when browsed from Main or Shares.
• Disk 2 is red in Midnight Commander, which reports "cannot read directory contents".
• Can't write to any of the 12 shares, regardless of permissions.
• Time Machine backups stopped working on 2 different MacBooks, each with its own Unraid Time Machine share. I think they stopped around the time the delay before the shares dropped got down to about 1 minute. Prior to that I could keep them making progress by stopping and restarting the array.
• Share settings: no common pattern for the 4 shares that go offline. This includes the 2 Time Machine shares, which both use Disk 2 only: one goes offline and the other does not.

STEPS TAKEN:
• Rebooting. Didn't help.
• 8+ hours researching, including this thread: https://forums.unraid.net/topic/61879-unraid-shares-disappear/
• Stopped all Dockers.
• Uninstalled Krusader (deprecated anyway, and it wasn't running).
• Ran Fix Common Problems. No useful findings.
• SMART short test completed without error (see the smartctl sketch after this list).
• SMART extended self-test ended with "Interrupted (host reset)", presumably because the shares dropped.
• Removed Disk 2 physically. The same 4 shares still dropped offline about 1 minute after starting the array. (I did understand that Disk 2 would need to be reformatted and re-added because I started the array while it was disconnected, but with reason to suspect Disk 2 and with 2 parity disks, I wasn't concerned.)
• Started a preclear of Disk 2; it won't be finished for a while.
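
For reference, these are the kinds of SMART commands involved, run from the Unraid terminal. This is only a sketch; /dev/sdX is a placeholder for the actual drive device.

smartctl -t short /dev/sdX   # start a short self-test
smartctl -t long /dev/sdX    # start an extended self-test (takes hours and aborts if the host resets)
smartctl -a /dev/sdX         # review attributes and the self-test log afterwards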

 

I've uploaded my diagnostics, taken shortly after the shares dropped and while Disk 2 is still being precleared.
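
(For anyone following along: the diagnostics zip can also be generated from the Unraid terminal. A sketch, assuming the stock command; I believe the zip lands in /boot/logs on the flash drive.)

diagnostics   # writes a diagnostics zip (e.g. yountnas-diagnostics-YYYYMMDD-HHMM.zip) to /boot/logs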

 

All suggestions are welcome.

Thanks in advance!

 

 

 

yountnas-diagnostics-20210911-1438, complete.zip

Edited by Wolfe
Replacing partial uploaded diagnostics with full file.
Link to comment

I read through that section of the manual. 

 

When I click on "Check", nothing happens. The xfs_repair status remains "Not available", even minutes later after a refresh.

 

Details:

  • Since it's an XFS format, I started the array in maintenance mode. 
  • I tried stopping the array and starting in maintenance mode a second time with the same results.
  • Twice while looking at the Main tab I noticed that STOP was greyed out with a message of "Disabled, a BTRFS operation is running". That disappeared after a few seconds.
  • Disk 2 has the red X and is shown as "not installed". Nothing indicates to me that it is being emulated, and no reads or writes are displayed... I assume that's normal because the array is in maintenance mode, but I could be wrong. (A terminal alternative to the Check button is sketched after this list.)
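
In case the web UI Check button keeps doing nothing, here is a terminal sketch of the same read-only check. It assumes Disk 2 maps to /dev/md2 on 6.9.x and that the array is started in maintenance mode.

xfs_repair -n /dev/md2   # -n = no modify; report problems without changing anything
# If this also hangs or errors out, the problem is not just the web UI.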

yountnas-diagnostics-20210911-1759.zip

Edited by Wolfe
Added diagnostic logs, more details.
Link to comment

BTW, thanks for including the parity.check.tuning plugin in your signature.  Nice utility. 

 

I'm particularly pleased to see "Resume parity checks on next array start".  Because of the current issue I'm troubleshooting, I haven't been able to complete a parity check since it began, and one is long overdue!

Link to comment

Hi trurl,

I removed Disk 2 from the array by disconnecting it (while the computer was powered down) and then restarting the array. I had some reason to suspect it was the cause of the problem, I wasn't aware of the file system check (xfs_repair) option at the time, and I knew I had 2 parity disks to rebuild the array with after the test.

 

The physical disk in question is ST10000NM0086-2AA101_ZA28QW6X

 

For the 1st set of diagnostics, Disk 2 was still disconnected.  For the 2nd set, it was connected, but as an unassigned device undergoing a preclear (which is still underway).

 

The manual says "If the drive is marked as disable and being emulated then the check is run against the emulated drive and not the physical drive."

 

When I try to run the check, however, nothing happens (as noted in my reply to itimpi).

Link to comment
5 hours ago, itimpi said:

If the preclear is still running this could be interfering with the file system check I think :( 

 

Even though Unraid doesn't know that the unassigned device used to be an array device?  Interesting.

 

I'll wait until the preclear is complete and then try again.  Pre-read is at 80% after 14.5 hours, so it'll be a while!  Let me know if there's anything I can look into in the meantime.  Otherwise I'll just wait it out.

 

There's also the parity check happening, but that'll be done first: just 1 hour remaining.

 

Parity.check.tuning has already helped with this: after your first reply, it allowed me to restart the preclear after I restarted the array (and the PC) to see if that helped get the xfs_repair check going.  Although the reboot didn't fix the xfs_repair, at least I didn't lose hours of preclear progress!

Link to comment

63 hours later the preclear is complete.  Disk healthy, as anticipated.  I added it back to the array and started the array in maintenance mode.  Without starting a parity check I went right to xfs_repair.

 

This time the xfs_repair did start.

Two issues to address.

 

First, at the bottom of the screen, it says the following:

Quote

Array Started  •Fatal error: Cannot redeclare _() (previously declared in /usr/local/emhttp/plugins/parity.check.tuning/Legacy.php:6) in /usr/local/emhttp/plugins/dynamix/include/Translations.php on line 19

 

Second, the xfs_repair failed within 20 seconds:

I think this is the problem to address: "Inode allocation btrees are too corrupted, skipping phases 6 and 7".

 

I'll have some time to look into this in a few hours.

 

Here is the full report:

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
agi_freecount 62, counted 63 in ag 1
sb_ifree 260, counted 261
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
imap claims in-use inode 3575280897 is free, would correct imap
        - agno = 2
imap claims a free inode 5655149344 is in use, would correct imap and clear inode
imap claims a free inode 5655149345 is in use, would correct imap and clear inode
imap claims a free inode 5655149346 is in use, would correct imap and clear inode
imap claims a free inode 5655149347 is in use, would correct imap and clear inode
imap claims a free inode 5655149348 is in use, would correct imap and clear inode
imap claims a free inode 5655149349 is in use, would correct imap and clear inode
imap claims a free inode 5655149350 is in use, would correct imap and clear inode
imap claims a free inode 5655149351 is in use, would correct imap and clear inode
imap claims a free inode 5655149352 is in use, would correct imap and clear inode
imap claims a free inode 5655149353 is in use, would correct imap and clear inode
imap claims a free inode 5655149354 is in use, would correct imap and clear inode
imap claims a free inode 5655149355 is in use, would correct imap and clear inode
imap claims a free inode 5655149356 is in use, would correct imap and clear inode
imap claims a free inode 5655149357 is in use, would correct imap and clear inode
imap claims a free inode 5655149358 is in use, would correct imap and clear inode
imap claims a free inode 5655149359 is in use, would correct imap and clear inode
imap claims a free inode 5655149360 is in use, would correct imap and clear inode
imap claims a free inode 5655149361 is in use, would correct imap and clear inode
imap claims a free inode 5655149362 is in use, would correct imap and clear inode
imap claims a free inode 5655149363 is in use, would correct imap and clear inode
imap claims a free inode 5655149364 is in use, would correct imap and clear inode
imap claims a free inode 5655149365 is in use, would correct imap and clear inode
imap claims a free inode 5655149366 is in use, would correct imap and clear inode
imap claims a free inode 5655149367 is in use, would correct imap and clear inode
imap claims a free inode 5655149368 is in use, would correct imap and clear inode
imap claims a free inode 5655149369 is in use, would correct imap and clear inode
imap claims a free inode 5655149370 is in use, would correct imap and clear inode
imap claims a free inode 5655149371 is in use, would correct imap and clear inode
imap claims a free inode 5655149372 is in use, would correct imap and clear inode
imap claims a free inode 5655149373 is in use, would correct imap and clear inode
imap claims a free inode 5655149374 is in use, would correct imap and clear inode
imap claims a free inode 5655149375 is in use, would correct imap and clear inode
        - agno = 3
        - agno = 4
imap claims a free inode 9710099999 is in use, would correct imap and clear inode
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 2
        - agno = 0
        - agno = 3
        - agno = 4
        - agno = 6
        - agno = 7
        - agno = 5
entry "3751" at block 41 offset 640 in directory inode 5263491317 references free inode 5655149344
	would clear inode number in entry at offset 640...
entry "3752" at block 41 offset 656 in directory inode 5263491317 references free inode 5655149345
	would clear inode number in entry at offset 656...
entry "3753" at block 41 offset 672 in directory inode 5263491317 references free inode 5655149346
	would clear inode number in entry at offset 672...
entry "3754" at block 41 offset 688 in directory inode 5263491317 references free inode 5655149347
	would clear inode number in entry at offset 688...
entry "3755" at block 41 offset 704 in directory inode 5263491317 references free inode 5655149348
	would clear inode number in entry at offset 704...
entry "3756" at block 41 offset 720 in directory inode 5263491317 references free inode 5655149349
	would clear inode number in entry at offset 720...
entry "3757" at block 41 offset 736 in directory inode 5263491317 references free inode 5655149350
	would clear inode number in entry at offset 736...
entry "3758" at block 41 offset 752 in directory inode 5263491317 references free inode 5655149351
	would clear inode number in entry at offset 752...
entry "3759" at block 41 offset 768 in directory inode 5263491317 references free inode 5655149352
	would clear inode number in entry at offset 768...
entry "375a" at block 41 offset 784 in directory inode 5263491317 references free inode 5655149353
	would clear inode number in entry at offset 784...
entry "375b" at block 41 offset 800 in directory inode 5263491317 references free inode 5655149354
	would clear inode number in entry at offset 800...
entry "375c" at block 41 offset 816 in directory inode 5263491317 references free inode 5655149355
	would clear inode number in entry at offset 816...
entry "375d" at block 41 offset 832 in directory inode 5263491317 references free inode 5655149356
	would clear inode number in entry at offset 832...
entry "375e" at block 41 offset 848 in directory inode 5263491317 references free inode 5655149357
	would clear inode number in entry at offset 848...
entry "375f" at block 41 offset 864 in directory inode 5263491317 references free inode 5655149358
	would clear inode number in entry at offset 864...
entry "3760" at block 41 offset 880 in directory inode 5263491317 references free inode 5655149359
	would clear inode number in entry at offset 880...
entry "3761" at block 41 offset 896 in directory inode 5263491317 references free inode 5655149360
	would clear inode number in entry at offset 896...
entry "3762" at block 41 offset 912 in directory inode 5263491317 references free inode 5655149361
	would clear inode number in entry at offset 912...
entry "3763" at block 41 offset 928 in directory inode 5263491317 references free inode 5655149362
	would clear inode number in entry at offset 928...
entry "3764" at block 41 offset 944 in directory inode 5263491317 references free inode 5655149363
	would clear inode number in entry at offset 944...
entry "3765" at block 41 offset 960 in directory inode 5263491317 references free inode 5655149364
	would clear inode number in entry at offset 960...
entry "3766" at block 41 offset 976 in directory inode 5263491317 references free inode 5655149365
	would clear inode number in entry at offset 976...
entry "3767" at block 41 offset 992 in directory inode 5263491317 references free inode 5655149366
	would clear inode number in entry at offset 992...
entry "3768" at block 41 offset 1008 in directory inode 5263491317 references free inode 5655149367
	would clear inode number in entry at offset 1008...
entry "3769" at block 41 offset 1024 in directory inode 5263491317 references free inode 5655149368
	would clear inode number in entry at offset 1024...
entry "376a" at block 41 offset 1040 in directory inode 5263491317 references free inode 5655149369
	would clear inode number in entry at offset 1040...
entry "376b" at block 41 offset 1056 in directory inode 5263491317 references free inode 5655149370
	would clear inode number in entry at offset 1056...
entry "376c" at block 41 offset 1072 in directory inode 5263491317 references free inode 5655149371
	would clear inode number in entry at offset 1072...
entry "376d" at block 41 offset 1088 in directory inode 5263491317 references free inode 5655149372
	would clear inode number in entry at offset 1088...
entry "376e" at block 41 offset 1104 in directory inode 5263491317 references free inode 5655149373
	would clear inode number in entry at offset 1104...
entry "376f" at block 41 offset 1120 in directory inode 5263491317 references free inode 5655149374
	would clear inode number in entry at offset 1120...
entry "3770" at block 41 offset 1136 in directory inode 5263491317 references free inode 5655149375
	would clear inode number in entry at offset 1136...
        - agno = 8
entry "1CE0AA6D-6A65-460A-B9BC-06BDC8EE0C2B-7e97da036096de81d094354d3606ab98.lrprev" in shortform directory 9710099998 references free inode 9710099999
would have junked entry "1CE0AA6D-6A65-460A-B9BC-06BDC8EE0C2B-7e97da036096de81d094354d3606ab98.lrprev" in directory inode 9710099998
would have corrected i8 count in directory 9710099998 from 2 to 1
        - agno = 9
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

 

Link to comment

Yes, that was clear to me, though my understanding is that it's good to run it with -n first and ask any questions about the results.

 

This "too corrupted" message makes me think it won't run properly even without the -n but I haven't had time to research it yet.

 

"Inode allocation btrees are too corrupted, skipping phases 6 and 7".

Link to comment

Done.

As near as I can tell xfs_repair was successful.  I didn't copy the results of the first repair, but I did run it twice more to be sure.  The subsequent runs didn't seem to find any problems.

 

The parity check and rebuild of precleared physical disk 2 completed successfully *before* I ran xfs_repair.

 

Shares are no longer dropping within 60 seconds.  I'll want to see them stay online for a few days to be more certain, but that problem appears to be solved.

 

My TimeMachine backups still aren't working, but I'll troubleshoot that next and post here or in a new thread if necessary.

 

I was able to write to one of the shares that previously wouldn't let me, despite permissions.  No doubt the xfs_repair is responsible for that fix.

 

One open issue, which I only noticed a day or two ago but which could have been there from the moment I installed parity.check.tuning, is the following fatal error at the bottom of my screen.  I assume it's unrelated to my share-dropping issue, and I'm fine moving it to another thread.

 

 Array Started•Fatal error: Cannot redeclare _() (previously declared in /usr/local/emhttp/plugins/parity.check.tuning/Legacy.php:6) in /usr/local/emhttp/plugins/dynamix/include/Translations.php on line 19

 

Anything else I should do to confirm that my Unraid is now healthy?

Link to comment

4 of my shares say "some or all files unprotected" and have since before I posted.
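
My understanding is that the "unprotected" flag usually just means a share still has files sitting on the cache pool, which isn't covered by parity. A quick way to see what's there (assuming the standard /mnt/cache mount; the share name is a placeholder):

ls -la /mnt/cache/            # top-level share folders still on the cache
du -sh /mnt/cache/ShareName   # how much of one share remains cached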

 

Just clicked on Move to start the Mover (I have one cache disk).  Nothing appeared to happen except the parity.check.tuning message went away... I suspect that's because the error I quoted above is only happening on parity checks, so I'll run a parity check again.

 

How do I troubleshoot the Mover, or is it normal that I haven't seen anything yet (2 minutes later)?
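
Here's a terminal sketch of what I plan to try next for the Mover (the script path is my assumption about where Unraid keeps it; mover logging has to be enabled in Settings for anything to show up in the syslog):

/usr/local/sbin/mover                      # invoke the mover directly
tail -f /var/log/syslog | grep -i mover    # watch for mover activity while it runs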

Link to comment
23 minutes ago, Wolfe said:

4 of my shares say "some or all files unprotected" and have since before I posted.

 

Just clicked on Move to start the Mover (I have one cache disk).  Nothing appeared to happen except the parity.check.tuning message went away... I suspect that's because the error I quoted above is only happening on parity checks, so I'll run a parity check again.

 

How do I troubleshoot the Mover, or is it normal that I haven't seen anything yet (2 minutes later)?

Attach diagnostics next post

Link to comment

Looks like a parity.check.tuning update I just applied took care of that issue.

 

Mover now appears to be working (waiting to see if it completes without errors).  Its problem was that, as part of my troubleshooting, I had told my Time Machine backups to stop using the cache while there was still Time Machine data on the cache. They're now back to using the cache and working fine on one MacBook but not the other.  I'll dig into that later.

 

It looks like my various issues are close to fully resolved, thanks to you fine folks.  I'll post again if more questions arise or if I conclude that everything is all wrapped up.

 

Any point in running xfs_repair on my other drives as a precautionary check, despite no obvious symptoms?  I'm not aware of a downside other than the obvious one: the array would be in maintenance mode while I ran those checks.

 

Any other due diligence/checkups I should work on?

Link to comment
  • 2 weeks later...

The share dropping appears to have stopped.

I have run xfs_repair on my three data disks and it looks to me like it's not finding any problems.

 

I still have some issues.

  • Most relevant to this thread, Mover does not appear to be working.  When I click it, nothing appears to happen and data remains on my cache disk.
  • My Time Machine backups are also not working.  They start "preparing backup" and stop without an error message.  I've spent maybe 5 hours researching that with no luck so far.

 

Is it best to explore these here or start new threads?

Link to comment

The share dropping issue in the OP appears to have been resolved, as noted in my post last night.

 

The two current issues (Mover and Time Machine) may be related to issues with shares.

 

There are 2 Time Machine shares.  Neither is working now, but both were working at the time of the OP, although the length of time they would run kept getting shorter as the share-dropping issue got worse.  xfs_repair fixed the share dropping, but the Time Machine backups still aren't working at all.  The shares involved are the following:

  • TimeMachineJWY-UnRaid
  • TimeMachineSAC-UnRaid

Both were set up to use the cache.  As an experiment, I changed the TimeMachineJWY share to no longer use the cache.  It still can't get past "preparing files", suggesting that the cache and mover aren't involved in the Time Machine problem.
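
Something I plan to check on the SMB side is whether the Time Machine shares are actually being exported with Samba's vfs_fruit Time Machine support. A sketch only (the generated config path is my assumption about where Unraid writes its Samba share config; the share name is from the list above):

testparm -s 2>/dev/null | grep -B2 -A8 "TimeMachineJWY-UnRaid"   # effective Samba settings for that share
grep -i "time machine" /etc/samba/smb-shares.conf                # look for fruit:time machine = yes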

 

The MacBook using the TimeMachineSAC share does write some data to the cache despite also failing at the "preparing files" stage.  Running Mover manually does remove the "some or all files unprotected" flag from that share.

 

The "some or all files unprotected" flag remains on the appdata, system and cache shares.

 

I started running Parity Check Tuning (fully updated) partway through this thread, but I doubt it's related to the current issues.

 

Mover was set to every 3 hours with no logging.  I have now enabled mover logging, rebooted my Unraid computer, tried Time Machine on both machines, run Mover, and generated new logs.  Same behavior as described above.
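
With mover logging now on, the entries should show up in the syslog. A sketch of how I'm checking (standard Unraid log location, as far as I know):

grep -i mover /var/log/syslog | tail -n 50   # most recent mover entries now that logging is enabled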

 

yountnas-diagnostics-20210925-1454.zip

Link to comment
