[6.11.5] Unable To Stop Array To Fix Disabled Disks


elogg
Solved by trurl


Have 2 disabled disks, but am unable to stop the array to address them. Clicking "Stop" and then "Proceed" outputs the following message but seems to have no other effect.

 

nginx: 2023/01/11 22:43:30 [error] 8673#8673: *4428546 connect() to unix:/var/run/emhttpd.socket failed (11: Resource temporarily unavailable) while connecting to upstream, client: 10.20.0.86, server: , request: "POST /update.htm HTTP/2.0", upstream: "http://unix:/var/run/emhttpd.socket:/update.htm", host: "10.20.0.10:4443", referrer: "https://10.20.0.10:4443/Main"

 

So leading up to this... I have 2 cache SSDs in btrfs single mode. I attempted to upgrade the smaller of the drives with a larger one, but it turned out to be defective. Should have tested it first, but it's going back for RMA now. I put the old SSD back in, and the cache pool was rebalanced when the array started back up. Sometime during that process, 2 disks from the array were disabled for read issues, which I'm assuming are cabling issues as they're basically new. I've stopped all Docker containers and shut down the VMs. I'm unsure what could be preventing the array from stopping, as I'm not seeing any disk or share unmount failures or anything.
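
If this happens again, a rough way to see what is keeping the array busy (a generic Linux sketch, not an official Unraid procedure; disk numbers are placeholders) is to watch syslog while clicking Stop and look for open files on the array mounts:

tail -f /var/log/syslog           # watch for unmount attempts while clicking Stop
fuser -mv /mnt/disk1 /mnt/user    # list processes holding files open on these mounts
lsof /mnt/disk1                   # naming a mount point lists all open files on that filesystem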

diagnostics-20230111-2251.zip


Disks 1 and 4 disconnected and reconnected as different devices, so they were disabled. Disks look OK. Emulated disks mounted.

 

I don't see anything trying to unmount in syslog. There are lots of log entries from this plugin:

ca.mover.tuning.plg - 2022.04.13  (Up to date)

but after that finishes, not much in syslog. Nothing about trying to stop or unmount anything.

 

Not sure what this is about, near the end of syslog:

Jan 11 22:42:32 Superbob  sshd[18993]: Accepted password for elliottlawson from 10.20.0.86 port 50735 ssh2

Normally only the root user has access to the command line. Have you done something "custom" with that?
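
For anyone checking the same thing, the accounts with a usable login shell can be listed from /etc/passwd (a generic sketch; on a stock Unraid box only root should show up):

awk -F: '$7 !~ /nologin|false/ {print $1, $7}' /etc/passwd   # username and shell for login-capable accounts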

Quote

Normally only the root user has access to the command line. Have you done something "custom" with that?

I believe this is from an SSH plugin I set up a few years ago…

 

Any thoughts on why disks 1 & 4 re-registered differently? They were not touched at all during the cache swap. I do see now that some of the drive letters changed around, but I was under the impression that didn't really matter, as the disk identifier is used for matching.
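
That impression is correct: Unraid matches array slots by the drive identifier (model and serial number), not by the sdX letter, which the kernel hands out in whatever order devices happen to enumerate. A quick way to see the stable identifiers next to the current letters (a generic sketch):

ls -l /dev/disk/by-id/ | grep -v part   # serial-based names -> current sdX assignments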

 

I would really like to get a rebuild started, since the array is unprotected with 2 disks disabled. If there is no way of getting insight into what is causing the UI not to stop the array, is there an alternative safe method of stopping it?


Docker manager, VM manager, and mover logging have been disabled. 👍 Appreciate the advice on just going ahead with rebooting. The UI-initiated reboot was successful, and the array stop/start functionality is now working!

 

Unfortunately, the rabbit hole is getting deeper...

Stopped the array, shut down the server, and swapped the disks out (just in case the rebuild has issues). Set both disk slots to "no disk" and started the array, then stopped it, and formatted the two new disks via Unassigned Devices. Set the disks to their appropriate slots and started the array.

 

This is where things went sideways... disk 4 started rebuilding, but disk 1 showed as unmountable? The odd thing is the filesystem check in UD showed no issues after they were formatted.

[Screenshot: Main page showing disk 4 rebuilding and disk 1 unmountable]

 

Canceled the rebuild, stopped the array, and restarted in maintenance mode to run a filesystem status check, which is reporting metadata issues.

[Screenshot: filesystem check output reporting metadata issues]

 

 

Not really sure what the best next step should be. Hoping to avoid any data loss at this point. Is it possible disk 1 is actually faulty, or could there be something else at play?

diagnostics-20230115-1551.zip

10 minutes ago, elogg said:

formatted the two new disks via Unassigned Devices

Totally pointless, but harmless if it was done before assigning them to the array. A rebuild completely overwrites the disk; it doesn't matter at all if the disk was newly formatted to an empty filesystem, cleared, completely full of pron, whatever.

 

It always makes me nervous when anybody talks about formatting in a thread where they are replacing disks. I think many don't actually know what format does. Format means "write an empty filesystem (of whatever type) to this disk". That is what it has always meant in every OS you have ever used.

 

Fortunately, you didn't format a disk in the array. Many have made that mistake, thinking they can rebuild. Unraid treats that write operation just as it does any other, by updating parity. So after formatting a disk in the array, the only thing it can rebuild is an empty filesystem.
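
For illustration, a format on an XFS disk boils down to a single command (placeholder device shown; never point this at an array member):

mkfs.xfs -f /dev/sdX1   # writes fresh, empty filesystem metadata; prior contents become orphaned
                        # on an assigned array disk, Unraid updates parity for these writes too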

 

15 minutes ago, elogg said:

disk 1 showed as unmountable? The odd thing is the filesystem check in UD showed no issues after they were formatted.

Disk 1 is an array disk. If the disk was disabled/emulated/rebuilding and unmountable, that has absolutely nothing to do with the physical disk you pointlessly formatted in UD. Good thing too, since your data is on that emulated/unmountable disk.

 

Did you run the filesystem check from the webUI? Post the complete output from that.
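
The webUI check corresponds to running xfs_repair in no-modify mode against the md device (a sketch assuming disk 1 maps to /dev/md1 on 6.11; the md device must be used so parity stays in sync):

xfs_repair -nv /dev/md1   # -n: report problems only, change nothing; -v: verbose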


Yes, the redundant format was done prior to adding the disks to the array. I learned that lesson the hard way a few years back; now I only ever format in the array when expanding. Makes sense that the rebuild handles overwriting too.

 

Ah, so the original filesystem is recorded (assuming this is how emulation works?), and if something is off, it will carry over to the new disk?

 

Assuming the next step is performing a repair?

 

The last check was run from the webUI in maintenance mode (xfs). Here's the full output from a run with the -nv flags:

Phase 1 - find and verify superblock...
        - block cache size set to 1432488 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1859593 tail block 1859593
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x43d440, xfs_agf block 0x36949ffe1/0x200
agf has bad CRC for ag 15
Metadata CRC error detected at 0x44108d, xfs_bnobt block 0x36949ffe8/0x1000
btree block 15/1 is suspect, error -74
Metadata CRC error detected at 0x44108d, xfs_cntbt block 0x36949fff0/0x1000
btree block 15/2 is suspect, error -74
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
data fork in ino 16106127516 claims free block 2027533582
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (15,14267749-14267758) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 9
        - agno = 4
        - agno = 15
        - agno = 7
        - agno = 1
        - agno = 8
        - agno = 10
        - agno = 6
        - agno = 12
        - agno = 11
        - agno = 5
        - agno = 13
        - agno = 14
        - agno = 3
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sun Jan 15 19:29:52 2023

Phase		Start		End		Duration
Phase 1:	01/15 19:29:21	01/15 19:29:21
Phase 2:	01/15 19:29:21	01/15 19:29:22	1 second
Phase 3:	01/15 19:29:22	01/15 19:29:48	26 seconds
Phase 4:	01/15 19:29:48	01/15 19:29:48
Phase 5:	Skipped
Phase 6:	01/15 19:29:48	01/15 19:29:52	4 seconds
Phase 7:	01/15 19:29:52	01/15 19:29:52

Total run time: 31 seconds

 

 

1 hour ago, elogg said:

original filesystem is recorded (assuming this is how emulation works?)

The data for the missing disk can be accessed from the parity calculation by reading all other disks.

 

https://wiki.unraid.net/Manual/Overview#Parity-Protected_Array

 

If the emulated disk is unmountable, the rebuild will be unmountable. We try to repair the unmountable filesystem before rebuilding, especially if rebuilding to the same disk. It's not as important if rebuilding to a new disk, since the original disk won't be overwritten.
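
For intuition, single parity is just a bytewise XOR across the data disks, so any one missing disk can be re-derived from parity plus the survivors (the second parity disk uses a different calculation, but the idea is the same). A toy sketch with made-up byte values:

d1=0xA5; d2=0x3C; d3=0x0F   # pretend data bytes at the same offset on three disks
p=$(( d1 ^ d2 ^ d3 ))       # parity byte maintained during normal writes
rec=$(( p ^ d1 ^ d3 ))      # disk 2 goes missing: re-derive its byte from the rest
printf 'parity=0x%02X recovered=0x%02X\n' "$p" "$rec"   # recovered byte equals d2 (0x3C)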


Appreciate the explanation. 👍 

 

I am rebuilding to different disks (of the same sizes). I've gotten in the habit of holding the originals aside, just to be safe.

 

So running xfs_repair without any flags appears to have cleared up the filesystem. It didn't request a run with -L.
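
For reference, the repair described here corresponds to dropping the -n flag from the earlier check (same assumed device; -L is only needed if xfs_repair refuses to run because of a dirty log):

xfs_repair -v /dev/md1   # writes fixes; run in maintenance mode so nothing else touches the disk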

 

How do you recommend I proceed? Despite having dual parity, I'm now thinking I might prefer to rebuild one disk at a time, if it's not too late…

Since I canceled the first rebuild attempt, do I need to do the unassign, start array, stop, reassign, and start rebuild dance? I'm not sure if just exiting maintenance mode and starting the array will trigger a rebuild.

 

Output of a run with the -nv flags now shows:

Phase 1 - find and verify superblock...
        - block cache size set to 1432488 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1859593 tail block 1859593
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 6
        - agno = 4
        - agno = 10
        - agno = 15
        - agno = 9
        - agno = 5
        - agno = 11
        - agno = 13
        - agno = 12
        - agno = 14
        - agno = 7
        - agno = 8
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 24
        - agno = 25
        - agno = 26
        - agno = 27
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sun Jan 15 22:06:40 2023

Phase		Start		End		Duration
Phase 1:	01/15 22:06:10	01/15 22:06:11	1 second
Phase 2:	01/15 22:06:11	01/15 22:06:11
Phase 3:	01/15 22:06:11	01/15 22:06:37	26 seconds
Phase 4:	01/15 22:06:37	01/15 22:06:37
Phase 5:	Skipped
Phase 6:	01/15 22:06:37	01/15 22:06:40	3 seconds
Phase 7:	01/15 22:06:40	01/15 22:06:40

Total run time: 30 seconds

 

 


Have you checked your lost+found share? Surprised to see that it is on many different disks. That means you repaired many different disks at some point (or perhaps you redistributed that share to many different disks?).

 

lost+found is where filesystem repair puts things it couldn't figure out.
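
A quick way to see what, if anything, a repair orphaned (a generic sketch; recovered items are named by inode number, so they usually need manual identification):

ls -la /mnt/disk*/lost+found 2>/dev/null         # per-disk view of recovered items
find /mnt/user/lost+found -type f | head -n 20   # sample of recovered files across the share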
