[solved] Unraid 6.9.2 Disks keep detaching


openam
Solved by JorgeB

Disk 6 became detached a couple of days ago. I forgot to grab diagnostics before that, but I was able to get it back by disabling Docker, removing disk 6 from the array, starting the array, stopping the array, and then re-adding disk 6. When I restarted the array, it rebuilt disk 6 from parity.

 

I re-enabled Docker the next day, and my applications seemed to be working fine. Then I got home from work and noticed I had several errors:
- Alert [PALAZZO] - Disk 5 in error state (disk dsbl)
- Warning [PALAZZO] - Cache pool BTRFS missing device(s)
- Warning [PALAZZO] - array has errors
    Array has 3 disks with read errors

 

After stopping the array this time, disks 3, 5, and 6 all showed as missing, so I needed to reboot to get SMART reports on them. The attached zip includes diagnostics.zip, SMART reports for disks 3, 5, and 6, and some screenshots. I rebooted a couple of times, and disk 5 still shows as "Device is disabled, contents emulated".

 

Should I update before trying to rebuild the array this time?

What should I do in the future to stop the disks from getting detached?

Server Logs and Screenshots.zip


I just ran the check on disk 5; here's the entry from the syslog:

 

Oct 18 21:20:00 palazzo ool www[20601]: /usr/local/emhttp/plugins/dynamix/scripts/xfs_check 'start' '/dev/md5' 'WDC_WD80EMAZ-00WJTA0_1EHSHHLN' '-n'

 

Here's the output from the UI:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 257216, counted 257217 in ag 0
agi_freecount 95, counted 98 in ag 0 finobt
sb_ifree 684, counted 778
sb_fdblocks 899221751, counted 901244387
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
imap claims a free inode 161821052 is in use, would correct imap and clear inode
imap claims a free inode 161821053 is in use, would correct imap and clear inode
imap claims a free inode 161821054 is in use, would correct imap and clear inode
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 1
        - agno = 7
        - agno = 5
        - agno = 3
        - agno = 2
        - agno = 6
entry "The.redacted.filename" at block 11 offset 3336 in directory inode 4316428006 references free inode 161821052
	would clear inode number in entry at offset 3336...
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

 

Does this mean I should re-run it without the `-n` flag so that it can try to fix the issues?
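
For reference, a rough sketch of what the equivalent command-line invocations would look like with the array in maintenance mode, assuming disk 5 maps to /dev/md5 as in the syslog entry above:

xfs_repair -n /dev/md5    # read-only check (what the UI ran with '-n'): reports problems, changes nothing
xfs_repair /dev/md5       # actual repair (no '-n'): run only against the md device so parity stays in sync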

  • openam changed the title to Unraid 6.9.2 Disks keep detaching

I ended up running the check again with the `-nv` option, because the docs indicated that I should. The docs say it should recommend an action:

 

Quote

If however issues were found, the display of results will indicate the recommended action to take. Typically, that will involve repeating the command with a specific option, clearly stated, which you will type into the options box (including any hyphens, usually 2 leading hyphens).

 

Here is the output of the `-nv` check; it still doesn't appear to include a recommended command.

 

Phase 1 - find and verify superblock...
        - block cache size set to 699552 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1078458 tail block 1078430
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 257216, counted 257217 in ag 0
agi_freecount 95, counted 98 in ag 0 finobt
sb_ifree 684, counted 778
sb_fdblocks 899221751, counted 901244387
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
imap claims a free inode 161821052 is in use, would correct imap and clear inode
imap claims a free inode 161821053 is in use, would correct imap and clear inode
imap claims a free inode 161821054 is in use, would correct imap and clear inode
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 7
        - agno = 6
        - agno = 4
        - agno = 2
        - agno = 5
        - agno = 1
entry "The.redacted.filename" at block 11 offset 3336 in directory inode 4316428006 references free inode 161821052
	would clear inode number in entry at offset 3336...
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Oct 18 22:28:36 2023

Phase		Start		End		Duration
Phase 1:	10/18 22:25:38	10/18 22:25:38
Phase 2:	10/18 22:25:38	10/18 22:25:38
Phase 3:	10/18 22:25:38	10/18 22:28:36	2 minutes, 58 seconds
Phase 4:	10/18 22:28:36	10/18 22:28:36
Phase 5:	Skipped
Phase 6:	Skipped
Phase 7:	Skipped

Total run time: 2 minutes, 58 seconds

 

Oct 17 07:01:12 palazzo kernel: mpt2sas_cm0: SAS host is non-operational !!!!

 

Issues with the HBA. Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot.

 

Checking the filesystem won't re-enable a disk; you need to rebuild it. If rebuilding on top of the same disk, make sure the emulated disk is mounting and its contents look correct.
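
A quick way to sanity-check the emulated disk from a terminal, sketched on the assumption that the disk in question is disk 5 (array disks mount at /mnt/diskN once the array is started):

ls -al /mnt/disk5    # the emulated disk should mount here and show the expected top-level folders
df -h /mnt/disk5     # used/free space should roughly match what the physical disk held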


Thanks for your reply. I re-seated the HBA card and all the SATA connectors when I rebooted initially, so that the disks would come back up and I could get the SMART reports. My motherboard only has one PCIe slot, so I just re-seated the card. Maybe I should double-check the disk 5 connections; it might be the one disk I didn't re-seat. That requires pulling the case apart.

 

Is there anything I should do about the `Inode allocation btrees are too corrupted` error on disk 5, or does rebuilding it on top correct all of those issues?

 

It sounds like the next steps are:

  1. Restart the array and let it emulate disk 5
  2. Check the contents of disk 5 to make sure they look alright
  3. Stop the array
  4. Remove disk 5 from the array
  5. Restart the array without disk 5
  6. Stop the array
  7. Re-add disk 5 to the array
  8. Restart it to have it start rebuilding

Steps 3-6 are how I originally had it rebuild disk 6. Is that the correct procedure?

Was there anything else I should do with xfs_repair before, after, or in the middle of those steps?

34 minutes ago, openam said:

Is there anything I should do about the `Inode allocation btrees are too corrupted` error on disk 5, or does rebuilding it on top correct all of those issues?

Yes, run xfs_repair without -n or nothing will be done.

 

Steps look good, but run xfs_repair first and check the contents after; also look for a lost+found folder.
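
After the repair, a hedged pair of commands to look for recovered orphans, again assuming disk 5:

ls -ld /mnt/disk5/lost+found         # only exists if xfs_repair had to orphan anything
find /mnt/disk5/lost+found | head    # peek at what ended up there, if the folder exists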

 

 


I just tried running xfs_repair without the `-n` flag and got the following error:

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

How do I go about mounting the drive? Should I start the array in normal (non-maintenance) mode?

 

It looks like it's sdj. Should I just SSH in and run `mount /dev/sdj /mnt`? Actually, it looks like drives are normally mounted under /mnt/disks. What would the mount command be, what's the command to unmount, and when do I run that? Does it just need to be mounted for a short time?
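
For what it's worth, a rough sketch of the log-replay route, assuming disk 5 is /dev/md5 and the array is started in maintenance mode. Mounting the md device rather than /dev/sdj keeps parity in sync, and /mnt/temp is just a hypothetical scratch mount point:

mkdir -p /mnt/temp
mount /dev/md5 /mnt/temp    # a clean mount replays the XFS journal
umount /mnt/temp            # unmount again before re-running the repair
xfs_repair /dev/md5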


It looks like it finished with the xfs_repair.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
agf_freeblks 257216, counted 257217 in ag 0
agi_freecount 95, counted 98 in ag 0 finobt
sb_ifree 684, counted 778
sb_fdblocks 899221751, counted 901244387
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
imap claims a free inode 161821052 is in use, correcting imap and clearing inode
cleared inode 161821052
imap claims a free inode 161821053 is in use, correcting imap and clearing inode
cleared inode 161821053
imap claims a free inode 161821054 is in use, correcting imap and clearing inode
cleared inode 161821054
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 6
        - agno = 2
        - agno = 5
        - agno = 1
        - agno = 7
        - agno = 3
entry "The.redacted.filename" at block 11 offset 3336 in directory inode 4316428006 references free inode 161821052
	clearing inode number in entry at offset 3336...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
rebuilding directory inode 4316428006
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
resetting inode 4316428006 nlinks from 644 to 643
Maximum metadata LSN (1:1078445) is ahead of log (1:2).
Format log to cycle 4.
done

 

I started the array, and it looks like things are there. I don't see a `lost+found` folder when looking through the Unraid UI, nor do I see it while SSH'd in and running `ls -al /mnt/disk5`. Is there some other place the lost+found folder is supposed to be located? The log indicated "moving disconnected inodes to lost+found", but I can't see anything like that. Shall I just continue with steps 3-8 from my earlier post?

 

Quote
  1. Restart the array and let it emulate disk 5
  2. Check the contents of disk 5 to make sure they look alright
  3. Stop the array
  4. Remove disk 5 from the array
  5. Restart the array without disk 5
  6. Stop the array
  7. Re-add disk 5 to the array
  8. Restart it to have it start rebuilding

 

7 minutes ago, openam said:

I started the array, and it looks like things are there. I don't see a `lost+found` folder when looking through the Unraid UI, nor do I see it while SSH'd in and running `ls -al /mnt/disk5`. Is there some other place the lost+found folder is supposed to be located?

If you have no lost+found folder on the drive being repaired, then that is a good sign and suggests there were no files or folders for which the repair process could not find the corresponding directory entries.


The server is having a hard time stopping the array. I'm seeing the following in the syslog:

Oct 19 12:57:01 palazzo emhttpd: shcmd (8162): umount /mnt/disk5
Oct 19 12:57:01 palazzo root: umount: /mnt/disk5: target is busy.
Oct 19 12:57:01 palazzo emhttpd: shcmd (8162): exit status: 32
Oct 19 12:57:01 palazzo emhttpd: Retry unmounting disk share(s)...
Oct 19 12:57:06 palazzo emhttpd: Unmounting disks...
Oct 19 12:57:06 palazzo emhttpd: shcmd (8163): umount /mnt/disk5
Oct 19 12:57:06 palazzo root: umount: /mnt/disk5: target is busy.
Oct 19 12:57:06 palazzo emhttpd: shcmd (8163): exit status: 32
Oct 19 12:57:06 palazzo emhttpd: Retry unmounting disk share(s)...

 

Trying to figure out what was causing it shows this:

root@palazzo:/mnt# fuser -vc /mnt/disk5/
                     USER        PID ACCESS COMMAND
/mnt/disk5:          root     kernel mount /mnt/disk5
                     root      31637 ..c.. bash
root@palazzo:/mnt#

 

Should I just kill PID 31637? I think I was SSH'd in there looking at it when I started the shutdown, but cd'ing out and disconnecting the SSH session didn't resolve it.
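
A hedged sketch of clearing the busy mount, assuming PID 31637 is the leftover bash session shown by fuser above:

fuser -vc /mnt/disk5    # confirm what is still holding the mount open
kill 31637              # send SIGTERM to the lingering shell
kill -9 31637           # only if it refuses to exit
# emhttpd keeps retrying the unmount, so the array stop should then proceed on its own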


It appears that my disk 3 is having seek issues, and the rebuild is going to take forever. I'm guessing I'll just need to wait this out, and likely replace disk 3 when it's done. Is there any way to figure out what is slowing this all down? Would pause / resume help at all?
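
One hedged way to dig into what's slowing the rebuild, assuming disk 3 shows up as /dev/sdg (substitute the real device letter from the Main page):

smartctl -a /dev/sdg | grep -iE 'seek_error|pending|reallocated|udma_crc'    # error counters that often explain slow reads
grep -i 'sdg' /var/log/syslog | tail -n 20                                   # look for resets or link errors during the rebuild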

 

(two screenshots of the rebuild progress, 2023-10-20)


Well, my UPS died, and the system is powered off. It wasn't even a power failure from the grid; the UPS is just non-responsive at this point.

(╯°□°)╯︵ ┻━┻

 

Guess I'll be going to the store tomorrow morning to get a new one, and starting over...


I shut down the server, pulled that disk, and re-seated it again. It came back up, and I was able to get the SMART info for it. The UI is showing the disk as disabled.

 


 

Is there a way to re-enable that disk so I can continue trying to sync disk 5, or am I just going to have to start over?

That disk 6 is in an iStarUSA BPN-DE350SS-BLUE enclosure. I've ordered a new case that'll fit the drives without this extra housing unit.

When you say power/connection problem, could it have to do with an under powered PSU? I do just have a CORSAIR CX Series CX500 PSU in there.

palazzo-diagnostics-20231024-0743.zip palazzo-smart-20231024-0742.zip

  • Solution
15 minutes ago, openam said:

When you say power/connection problem, could it have to do with an under powered PSU? I do just have a CORSAIR CX Series CX500 PSU in there.

It could be a failing PSU; if not, it should have enough power for those disks. It could also just be a bad connection with the power or SATA cable, or, if you are using power splitters, those could be the cause.

 

16 minutes ago, openam said:

Is there a way to re-enable that disk so I can continue trying to sync disk 5, or am I just going to have to start over?

You can force-enable disk 6 to see if disk 5 can still be emulated:

 

- Tools -> New Config -> Retain current configuration: All -> Apply
- Check all assignments and assign any missing disk(s) if needed
- IMPORTANT - Check both "parity is already valid" and "maintenance mode" and start the array (the GUI will still show that data on the parity disk(s) will be overwritten; this is normal, since it doesn't account for the checkbox, and nothing will be overwritten as long as the box is checked)
- Stop array
- Unassign disk5
- Start array (in normal mode now) and post new diags

  • openam changed the title to [solved] Unraid 6.9.2 Disks keep detaching
