openam Posted October 19, 2023 (edited)

I had disk 6 become detached a couple of days ago. I forgot to grab diagnostics before that, but I was able to get it back by disabling Docker, removing disk 6 from the array, starting the array, stopping the array, and then re-adding disk 6. When I restarted the array it rebuilt disk 6 from parity. I re-enabled Docker the next day, and my applications seemed to be working fine. Then I got home from work and noticed I had several errors:

- Alert [PALAZZO] - Disk 5 in error state (disk dsbl)
- Warning [PALAZZO] - Cache pool BTRFS missing device(s)
- Warning [PALAZZO] - array has errors: Array has 3 disks with read errors

After stopping the array this time, disks 3, 5, and 6 all showed as missing, so I needed to reboot to get SMART reports on them. The attached zip includes diagnostics.zip, SMART reports for disks 3, 5, and 6, and some screenshots. I rebooted a couple of times, and disk 5 still shows as "Device is disabled, contents emulated".

Should I update before trying to rebuild the array this time? What should I do in the future to stop having the disks get detached?

Server Logs and Screenshots.zip
openam Posted October 19, 2023

I just ran the check on disk 5. Here's the entry from syslog:

```
Oct 18 21:20:00 palazzo ool www[20601]: /usr/local/emhttp/plugins/dynamix/scripts/xfs_check 'start' '/dev/md5' 'WDC_WD80EMAZ-00WJTA0_1EHSHHLN' '-n'
```

Here's the output from the UI:

```
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used.  Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 257216, counted 257217 in ag 0
agi_freecount 95, counted 98 in ag 0 finobt
sb_ifree 684, counted 778
sb_fdblocks 899221751, counted 901244387
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
imap claims a free inode 161821052 is in use, would correct imap and clear inode
imap claims a free inode 161821053 is in use, would correct imap and clear inode
imap claims a free inode 161821054 is in use, would correct imap and clear inode
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 1
        - agno = 7
        - agno = 5
        - agno = 3
        - agno = 2
        - agno = 6
entry "The.redacted.filename" at block 11 offset 3336 in directory inode 4316428006 references free inode 161821052
        would clear inode number in entry at offset 3336...
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.
```

Does this mean I should re-run without the `-n` flag, so that it can try to fix things?
openam Posted October 19, 2023

I ended up running the check again with the `-nv` option, because the docs indicated that I should. The docs say it should recommend an action:

Quote: "If however issues were found, the display of results will indicate the recommended action to take. Typically, that will involve repeating the command with a specific option, clearly stated, which you will type into the options box (including any hyphens, usually 2 leading hyphens)."

Here is the output of the `-nv` check, and it still doesn't appear to include a recommended command:

```
Phase 1 - find and verify superblock...
        - block cache size set to 699552 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1078458 tail block 1078430
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used.  Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 257216, counted 257217 in ag 0
agi_freecount 95, counted 98 in ag 0 finobt
sb_ifree 684, counted 778
sb_fdblocks 899221751, counted 901244387
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
imap claims a free inode 161821052 is in use, would correct imap and clear inode
imap claims a free inode 161821053 is in use, would correct imap and clear inode
imap claims a free inode 161821054 is in use, would correct imap and clear inode
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 7
        - agno = 6
        - agno = 4
        - agno = 2
        - agno = 5
        - agno = 1
entry "The.redacted.filename" at block 11 offset 3336 in directory inode 4316428006 references free inode 161821052
        would clear inode number in entry at offset 3336...
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Wed Oct 18 22:28:36 2023

Phase           Start           End             Duration
Phase 1:        10/18 22:25:38  10/18 22:25:38
Phase 2:        10/18 22:25:38  10/18 22:25:38
Phase 3:        10/18 22:25:38  10/18 22:28:36  2 minutes, 58 seconds
Phase 4:        10/18 22:28:36  10/18 22:28:36
Phase 5:        Skipped
Phase 6:        Skipped
Phase 7:        Skipped

Total run time: 2 minutes, 58 seconds
```
JorgeB Posted October 19, 2023

```
Oct 17 07:01:12 palazzo kernel: mpt2sas_cm0: SAS host is non-operational !!!!
```

That points to issues with the HBA. Make sure it's well seated and sufficiently cooled; you can also try a different PCIe slot.

Checking the filesystem won't re-enable a disk; you need to rebuild it. If rebuilding on top, make sure the emulated disk is mounting and its contents look correct.
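As a side note for anyone triaging the same symptom: the HBA dropout leaves a distinctive line in the syslog, so a quick grep can confirm whether (and how often) the controller fell over. Below is a minimal sketch; the sample log is fabricated for illustration, and on a real server you would point grep at /var/log/syslog or the syslog inside a diagnostics zip.

```shell
# Count HBA dropout messages. The sample syslog below is made up for
# illustration; substitute the real log path on an actual server.
cat > /tmp/sample-syslog <<'EOF'
Oct 17 07:01:12 palazzo kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Oct 17 07:01:42 palazzo kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Oct 17 07:05:00 palazzo emhttpd: Retry unmounting disk share(s)...
EOF
grep -c 'SAS host is non-operational' /tmp/sample-syslog   # prints 2 here
```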
openam Posted October 19, 2023

Thanks for your reply. I re-seated the HBA card and all the SATA connectors when I rebooted it initially so that the disks would come back up and I could get the SMART reports. My mobo only has one PCIe slot, so I just re-seated the card. Maybe I should double-check the disk 5 connections; it might be the one disk I didn't re-seat. It requires me to pull the case apart.

Is there anything I should do about the `Inode allocation btrees are too corrupted` error on disk 5, or does rebuilding it on top correct all those issues? It sounds like the next steps are:

1. Restart the array and let it emulate disk 5
2. Check the contents of disk 5 to make sure they look alright
3. Stop the array
4. Remove disk 5 from the array
5. Restart the array without disk 5
6. Stop the array
7. Re-add disk 5 to the array
8. Restart it to have it start rebuilding

Steps 3-8 are how I originally had it rebuild disk 6. Is that the correct procedure? Is there anything else I should do with xfs_repair before, after, or in the middle of those steps?
JorgeB Posted October 19, 2023

34 minutes ago, openam said: "Is there anything I should do about the `Inode allocation btrees are too corrupted` error on disk 5, or does rebuilding it on top correct all those issues?"

Yes, run xfs_repair without -n or nothing will be done. The steps look good, but run xfs_repair first and check contents after; also look for a lost+found folder.
openam Posted October 19, 2023 (edited)

I just tried running xfs_repair without the `-n` flag and got the following error:

```
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed.  Mount the filesystem to replay the log, and unmount it before re-running xfs_repair.  If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair.  Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.
```

How do I go about mounting the drive? Should I start the array not in maintenance mode? It looks like it's sdj; should I just SSH in and run `mount /dev/sdj /mnt`? Actually, it looks like they are normally mounted to /mnt/disks. What would the command be, what's the command to unmount, and when do I run that? Does it just need to be mounted for a short time?
itimpi Posted October 19, 2023

This is a very common message. You need to rerun the repair adding the -L option.
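To summarize the progression of commands discussed so far (a sketch only; /dev/md5 is taken from the syslog entry earlier in the thread, and the commands are written to a file for review rather than executed, so nothing gets touched by accident):

```shell
# Write the repair plan out for review instead of running it.
# On a real Unraid box these would run from the GUI's filesystem-check
# options box or a console, with the array started in maintenance mode.
cat > /tmp/repair-plan <<'EOF'
xfs_repair -n /dev/md5    # read-only check: report problems, change nothing
xfs_repair /dev/md5       # real repair; refuses if the journal needs replaying
xfs_repair -L /dev/md5    # last resort: zero the journal, then repair
EOF
cat /tmp/repair-plan
```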
openam Posted October 19, 2023 Author Share Posted October 19, 2023 It looks like it finished with the xfs_repair. Phase 1 - find and verify superblock... Phase 2 - using internal log - zero log... ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used. - scan filesystem freespace and inode maps... agf_freeblks 257216, counted 257217 in ag 0 agi_freecount 95, counted 98 in ag 0 finobt sb_ifree 684, counted 778 sb_fdblocks 899221751, counted 901244387 - found root inode chunk Phase 3 - for each AG... - scan and clear agi unlinked lists... - process known inodes and perform inode discovery... - agno = 0 imap claims a free inode 161821052 is in use, correcting imap and clearing inode cleared inode 161821052 imap claims a free inode 161821053 is in use, correcting imap and clearing inode cleared inode 161821053 imap claims a free inode 161821054 is in use, correcting imap and clearing inode cleared inode 161821054 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - process newly discovered inodes... Phase 4 - check for duplicate blocks... - setting up duplicate extent list... - check for inodes claiming duplicate blocks... - agno = 0 - agno = 4 - agno = 6 - agno = 2 - agno = 5 - agno = 1 - agno = 7 - agno = 3 entry "The.redacted.filename" at block 11 offset 3336 in directory inode 4316428006 references free inode 161821052 clearing inode number in entry at offset 3336... Phase 5 - rebuild AG headers and trees... - reset superblock... Phase 6 - check inode connectivity... - resetting contents of realtime bitmap and summary inodes - traversing filesystem ... rebuilding directory inode 4316428006 - traversal finished ... - moving disconnected inodes to lost+found ... Phase 7 - verify and correct link counts... resetting inode 4316428006 nlinks from 644 to 643 Maximum metadata LSN (1:1078445) is ahead of log (1:2). Format log to cycle 4. 
done I started the array, and it looks like things are there. I don't see a `lost+found` folder looking through the unRAID ui, nor do I see it while ssh'd in, and running `ls -al /mnt/disk5`. Is there some some other place the lost+found folder is supposed to be located? The logs indicated "moving disconnected inodes to lost+found" but I can't see anything like that. Shall I just continue with steps 3-8 in my earlier post? Quote Restart the array and let it emulate disk 5 Check the contents of disk 5 to make sure they look alright Stop the array Remove disk 5 from the array Restart the array without disk 5 Stop the array Re-add disk 5 to the array Restart it to have it start rebuilding Quote Link to comment
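For reference, if xfs_repair does recover orphaned files, the lost+found folder lands at the root of the repaired disk (e.g. /mnt/disk5/lost+found). A quick way to check every array disk at once is a shallow find; sketched below against a throwaway directory tree, since the /mnt/disk* mounts only exist on the server:

```shell
# Simulate the /mnt/disk* layout with a temp tree; on a real server you
# would run: find /mnt/disk* -maxdepth 1 -type d -name 'lost+found'
mkdir -p /tmp/mnt/disk1 /tmp/mnt/disk3 /tmp/mnt/disk5/lost+found
find /tmp/mnt -maxdepth 2 -type d -name 'lost+found'
```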
itimpi Posted October 19, 2023

7 minutes ago, openam said: "It looks like it finished with the xfs_repair. [...] I don't see a `lost+found` folder looking through the Unraid UI, nor do I see one while SSH'd in running `ls -al /mnt/disk5`."

If you have no lost+found folder on the drive being repaired, that is a good sign: it suggests there were no files or folders for which the repair process could not find the corresponding directory entries.
JorgeB Posted October 19, 2023

8 minutes ago, openam said: "I don't see a `lost+found` folder"

That's good news; the "moving disconnected inodes to lost+found" line is always there.

8 minutes ago, openam said: "Shall I just continue with steps 3-8 in my earlier post?"

Yep.
openam Posted October 19, 2023

It's having a hard time stopping the array. I'm seeing the following in the syslog:

```
Oct 19 12:57:01 palazzo emhttpd: shcmd (8162): umount /mnt/disk5
Oct 19 12:57:01 palazzo root: umount: /mnt/disk5: target is busy.
Oct 19 12:57:01 palazzo emhttpd: shcmd (8162): exit status: 32
Oct 19 12:57:01 palazzo emhttpd: Retry unmounting disk share(s)...
Oct 19 12:57:06 palazzo emhttpd: Unmounting disks...
Oct 19 12:57:06 palazzo emhttpd: shcmd (8163): umount /mnt/disk5
Oct 19 12:57:06 palazzo root: umount: /mnt/disk5: target is busy.
Oct 19 12:57:06 palazzo emhttpd: shcmd (8163): exit status: 32
Oct 19 12:57:06 palazzo emhttpd: Retry unmounting disk share(s)...
```

Trying to figure out what was causing it shows this:

```
root@palazzo:/mnt# fuser -vc /mnt/disk5/
                     USER        PID ACCESS COMMAND
/mnt/disk5:          root     kernel mount /mnt/disk5
                     root      31637 ..c.. bash
root@palazzo:/mnt#
```

Should I just kill the 31637 PID? I think I was SSH'd in there looking at it when I started the shutdown, but cd'ing out and disconnecting the SSH session didn't resolve it.
openam Posted October 19, 2023

I just ran `kill 31637` and it let the array stop. Going to continue on...
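For the record, the "target is busy" situation can also be inspected without fuser: /proc exposes each process's working directory, which is exactly what was pinning /mnt/disk5 here. A self-contained simulation (using a temp directory in place of the real mount point, and a background shell in place of the stray SSH session):

```shell
# A background shell parks its working directory inside the "mount
# point", standing in for the stray SSH session from the post above.
mkdir -p /tmp/busy_mount
( cd /tmp/busy_mount && sleep 30 ) &
holder=$!
sleep 1
# Walk /proc for processes whose cwd is under the mount point --
# roughly the information `fuser -vc /mnt/disk5` printed.
: > /tmp/busy_report
for p in /proc/[0-9]*; do
  cwd=$(readlink "$p/cwd" 2>/dev/null) || continue
  case "$cwd" in
    /tmp/busy_mount*) echo "busy: pid ${p#/proc/}" >> /tmp/busy_report ;;
  esac
done
cat /tmp/busy_report
kill "$holder" 2>/dev/null   # the equivalent of `kill 31637` in the thread
```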
openam Posted October 20, 2023

It appears that my disk 3 is having seek issues, and the rebuild is going to take forever. I'm guessing I'll just need to wait this out and likely replace disk 3 when it's done. Is there any way to figure out what is slowing this all down? Would pause/resume help at all?
openam Posted October 20, 2023 Author Share Posted October 20, 2023 palazzo-diagnostics-20231020-1504.zip Quote Link to comment
JorgeB Posted October 21, 2023

No disk errors logged in the syslog so far, so for now I would let it run.
openam Posted October 22, 2023 (edited)

Well, my UPS died, and the system is powered off. It wasn't even a power failure from the grid; the UPS is just non-responsive at this point. (╯°□°)╯︵ ┻━┻

Guess I'll be going to the store tomorrow morning to get a new one, and starting over...
openam Posted October 23, 2023

Well, I got a new UPS yesterday, got it charged up, and started running the sync again, and I just had another disk (disk 6) throw an error. Not sure what to do at this point.

palazzo-diagnostics-20231023-1509.zip
JorgeB Posted October 24, 2023

Looks more like a power/connection problem, but the disk dropped, so there's no SMART report available.
openam Posted October 24, 2023 (edited)

I shut down the server, pulled that disk, and re-seated it again. It came back up, and I was able to get the SMART info for it. The UI is showing the disk as disabled. Is there a way to re-enable that disk so I can continue trying to sync disk 5, or am I just going to have to start over?

That disk 6 is in an iStarUSA BPN-DE350SS-BLUE enclosure. I've ordered a new case that will fit the drives without this extra housing.

When you say power/connection problem, could it have to do with an underpowered PSU? I just have a Corsair CX Series CX500 in there.

palazzo-diagnostics-20231024-0743.zip
palazzo-smart-20231024-0742.zip
JorgeB Posted October 24, 2023 (marked as Solution)

15 minutes ago, openam said: "When you say power/connection problem, could it have to do with an underpowered PSU? I just have a Corsair CX Series CX500 in there."

Could be a failing PSU; if not, it should have enough power for those disks. It could also just be a bad connection with the power or SATA cable, and if you are using power splitters, that can also be the cause.

16 minutes ago, openam said: "Is there a way to re-enable that disk so I can continue trying to sync disk 5, or am I just going to have to start over?"

You can force-enable disk 6 to see if disk 5 can still be emulated:

- Tools -> New Config -> Retain current configuration: All -> Apply
- Check all assignments and assign any missing disk(s) if needed
- IMPORTANT: check both "parity is already valid" and "maintenance mode" and start the array (note that the GUI will still show that data on the parity disk(s) will be overwritten; this is normal, as it doesn't account for the checkbox, but nothing will be overwritten as long as it's checked)
- Stop the array
- Unassign disk 5
- Start the array (in normal mode now) and post new diags
openam Posted October 24, 2023

Man, there are some scary prompts going through that.

palazzo-diagnostics-20231024-1257.zip
JorgeB Posted October 25, 2023

Emulated disk 5 is mounting; assuming the contents look correct, you can try rebuilding again.
openam Posted October 26, 2023

Awesome, thank you! I'm back up and running.