MatrixMJK Posted December 20, 2022

Hello, I have replaced two drives in my system to expand the array size. Last week I replaced one 8TB drive with a 10TB drive and completed the rebuild successfully. Then a day or so ago I replaced another drive with a new 14TB drive and started the rebuild. (Just a note that I followed the instructions for replacing with a larger drive at https://wiki.unraid.net/Manual/Storage_Management#Replacing_a_disk_to_increase_capacity.) Some time in the past 12-15 hours, one parity drive logged a bunch of errors and the drive I had originally replaced dropped out as well. The rebuild thinks it is still running and is reading from the remaining drives, but no writes are happening. I have attached a screenshot of what the array looks like and a diagnostics report from the same time. Let me know what action to take to save what I can. I don't think the parity drive or the other recently rebuilt drive are bad at this point, but of course that could be the case. Thanks for the assistance.

tower-diagnostics-20221220-1513.zip
trurl Posted December 21, 2022

Parity2, disk2 and disk10 all disconnected. Disk2 became disabled. Emulated disk2 and disk10 mount and have plenty of contents. SMART for all 3 disks looks fine; I didn't check the others. Do any of your other disks show SMART warnings on the Dashboard?

I notice you have a RAID controller. RAID controllers are NOT recommended with Unraid for many reasons. But it looks like all the affected disks are on the LSI controller. Probably you disturbed connections when you replaced disk10.

Since you have dual parity you can rebuild both disk2 and disk10. Shut down and check connections, SATA and power, both ends, including splitters. You will have to start the rebuilds over.
MatrixMJK Posted December 21, 2022

Great, thanks for the quick info! I thought that was the case but wanted to hear it from calmer heads. The RAID card was for moving some data off of other drives and was never used for pool/array drives; I do know better than that. All array drives are on the LSI controller.

I have re-seated all power and SATA cables and powered up the server. Drive 10 shows a yellow triangle and should be rebuilt when I start the array. Drive 2 shows as disabled, and I guess I will need to remove it from the array and re-add it. But for now I will let drive 10 rebuild first, then take care of drive 2. Let me know if that is a sane course of action. Thanks again for the help.

Matt
trurl Posted December 21, 2022

2 minutes ago, MatrixMJK said: Let me know if that is a sane course of action.

Yes, that will be fine, maybe even safer, since it will leave the other disk alone for now and we can see if things are working well.
trurl Posted December 21, 2022

I will check back in the morning.
MatrixMJK Posted December 21, 2022

Just a quick update: after starting the array, both drives 10 and 2 show as "Unmountable: wrong or no file system", but it looks to be rebuilding (reading from all other drives except 2 and 10) and writing to drive 10 as I would expect. I'll let it proceed and wait till morning to see if I should stop it and format the drive before starting the rebuild again.
JorgeB Posted December 21, 2022

Format is never part of a rebuild. Post new diags.
trurl Posted December 21, 2022

7 hours ago, MatrixMJK said: format the drive before starting the rebuild

Format is a write operation. Unraid treats this write operation just as it does any other: by updating parity so it stays in sync. After formatting a disk in the array, parity agrees the disk has been formatted. Then the only thing it can rebuild is a formatted disk.

5 hours ago, JorgeB said: Format is never part of a rebuild, post new diags
trurl Posted December 21, 2022

7 hours ago, MatrixMJK said: I'll let it proceed and wait till morning

If the filesystems are unmountable, something must be going wrong, since they were OK in the earlier diagnostics. Probably you still have connection problems, and that is making it impossible to accurately emulate the disks.
MatrixMJK Posted December 21, 2022

OK, I have attached current diags (while it thinks the rebuild is running) and a screenshot. I can shut the machine down and replace the SAS and power cables again if that's what you think I should do.

tower-diagnostics-20221221-1216.zip
trurl Posted December 21, 2022

I see you have the ZFS plugin installed. How are you using that?
trurl Posted December 21, 2022

Filesystem corruption on disks 2 and 10 as soon as they tried to mount on startup. It's not clear there are any connection issues. I don't see anything in syslog to indicate a rebuild going on; maybe I missed it because syslog is flooded with docker network entries for some reason. Post a screenshot of Main - Array Operation.
MatrixMJK Posted December 21, 2022

Not currently using the ZFS plugin. I'll remove it, and stop the docker service as well, if that declutters the output. Here is the Array Operation section:
MatrixMJK Posted December 21, 2022

Here is a diag with docker disabled for a while; it may be easier to look at.

tower-diagnostics-20221221-1539 stopped docker - ZFS removed but not rebooted yet.zip
trurl Posted December 22, 2022

9 hours ago, trurl said: anything in syslog to indicate a rebuild

Here it is:

Dec 20 22:00:04 Tower kernel: md: recovery thread: recon D10 ...

No I/O errors during the rebuild; might as well let it complete and try to repair the filesystems when the disk10 rebuild is done. We can try to repair the emulated filesystem on the other disk before rebuilding it.
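If it helps, a quick way to pull those recovery-thread lines out of a flooded syslog is a grep like the one below. The pattern is just an assumption based on the "md: recovery thread" line quoted above, and the path assumes the live log on the server; point it at the syslog.txt inside a diagnostics zip instead if you're reading one of those.

```shell
# Filter md rebuild messages out of a busy syslog.
# /var/log/syslog is the live log on Unraid; substitute the extracted
# syslog.txt from a diagnostics zip when working from one of those.
grep -E "md: (recovery|sync) thread" /var/log/syslog
```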
MatrixMJK Posted December 22, 2022

I did shut the machine down to change the SAS and power cables to see if it changed anything, but it did not. I restarted the rebuild a bit ago. Also, the ZFS plugin is uninstalled and the docker service is stopped. Another diag in case you want to check anything since the reboot. I will let the rebuild run now; lost some time doing the reboot. I'll update as I see anything. Thanks!

tower-diagnostics-20221221-2222.zip
JorgeB Posted December 22, 2022

Could have checked the filesystem before restarting the rebuild, but since it's been a few hours, might as well let it finish now.
MatrixMJK Posted December 31, 2022 (edited)

OK, the drives did not really rebuild. I guessed that would be the outcome from the indicators. I ordered two new HDs, shut the system down, replaced the two 'unmountable' drives with the new ones, and powered up the server. After starting up I assigned the two new drives to the drive #2 and #10 slots (for the missing ones) and started the array. Those two slots still show 'unmountable'. Should I format (I know, not from the main Unraid menu) or take other action to prepare the two new drives and restart the rebuild? I have always pre-cleared drives before using them in the array, so I'm not sure if the format step is needed since these were not pre-cleared. Attached is a new diag with the new drives after starting the array.

Thanks, Matt

tower-diagnostics-20221230-1856 two new drives.zip

Edited December 31, 2022 by MatrixMJK
JorgeB Posted December 31, 2022

Format is never part of a rebuild; check the filesystem on both disks.
trurl Posted December 31, 2022

10 hours ago, MatrixMJK said: set the two new drives to the drive #2 and #10 spots (for the missing ones) and started the array

So it is now rebuilding to the new disks. It's not clear there was anything wrong with the original disks you already rebuilt. Those were unmountable as expected, since you were rebuilding unmountable filesystems instead of

On 12/22/2022 at 3:42 AM, JorgeB said: checked filesystem before re-starting the rebuild

If you had waited for advice, we would have told you to check the filesystem on those original rebuilt disks. Now, I guess you can check the filesystem on the new disks when they finish rebuilding the unmountable filesystems. (Technically, you could check the filesystem on the disks while they rebuild, but let's just keep it simple.)
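For reference, the check itself is usually run from the console with the array started in Maintenance mode. This is only a sketch: it assumes disk2 and disk10 show up as /dev/md2 and /dev/md10, the usual device names for array disks on Unraid of this era, and the Check option on each drive's GUI page runs the same tool.

```shell
# Array must be started in Maintenance mode (Main -> Array Operation)
# so the md devices exist but nothing is mounted.

# Read-only check of the emulated/rebuilt disk2 (-n = no modify):
xfs_repair -n /dev/md2

# And the same for disk10:
xfs_repair -n /dev/md10
```

Running with -n first lets you see what would be changed before committing to a real repair.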
MatrixMJK Posted December 31, 2022

9 hours ago, JorgeB said: Format is never part of a rebuild

As I said, I know a format is not part of the recovery; I was asking if any action needed to happen to the new drives to get them to stop showing as unmountable.

6 hours ago, trurl said: Not clear there was anything wrong with the original disks you already rebuilt.

I replaced them to make sure it was not a drive failure, since they were brand new and had not been pre-cleared. I have placed the two previous drives in another UnRAID system, and after trying to mount them it said they were successfully repaired. So once the new drives are done rebuilding I will try to check the filesystem on them. Since it was still showing "Unmountable" I was worried it was a hardware problem.
trurl Posted December 31, 2022

32 minutes ago, MatrixMJK said: any action needed to happen to the new drives to get them to not show as mountable.

For a drive to show as mountable, it needs to contain a mountable filesystem. If you wanted to test a new disk, you could have done a preclear, but that is unrelated to whether it becomes mountable or not (assuming it isn't a bad disk). Formatting it outside the array or in another system just writes an empty filesystem (of whatever type, such as NTFS or XFS) to it. That is all Format means. Irrelevant, since it is going to be completely overwritten by the rebuild anyway.

Rebuild overwrites the entire disk with the contents of the emulated disk. Those contents ARE the filesystem. If the emulated disk contains an unmountable filesystem, the rebuilt disk will too. If the emulated disk is a mountable filesystem, the rebuilt disk will be a mountable filesystem (assuming everything works as intended, of course). It doesn't matter at all whether the rebuilding disk was previously clear, freshly formatted in another system, completely full of porn, whatever. It is completely overwritten by the rebuild. If you are rebuilding an unmountable filesystem, you will have an unmountable filesystem that needs repair to (hopefully) make it mountable.

The reason we prefer to repair the emulated filesystem before rebuilding on top of the original disk is to avoid overwriting that original disk with a filesystem that needs repair. When rebuilding to a new disk instead of on top of the original, it isn't as important, since you still have the contents of the original to try to recover something from.
trurl Posted December 31, 2022

5 minutes ago, trurl said: Formatting it outside the array

And formatting a disk in the array updates parity just like any other write operation does. So then all that can be rebuilt is an empty filesystem. You apparently knew NOT to do that, but maybe not exactly why.
MatrixMJK Posted December 31, 2022

Thank you. I'm sorry I was not clear about that. In my experience with multi-drive filesystems, if rebuilding does not fix the filesystem then we look to hardware failures, so I was doing what I could to rule out drive and cable failures. In my troubleshooting I lost sight of the fact that in UnRAID each drive has its own filesystem and may be repairable individually. I should have paid more attention to that fact. Sorry to have led myself down so many rabbit holes. I am guessing that once this rebuild is complete, a filesystem check/repair will be all it needs.
MatrixMJK Posted January 2, 2023

OK, the rebuild finished, still showing 'Unmountable' on drives 2 and 10 as expected. Below is the output from the xfs_repair commands. I already ran repair and it fixed the superblock on each drive, but now each drive suggests flushing/clearing the log. Just want to make sure that is the logical next step for the drives. I also attached a new diag.

Drive 2:

Phase 1 - find and verify superblock...
        - block cache size set to 4461768 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1569181 tail block 1569177
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 1515438556, counted 1517878137
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0 - agno = 1 - agno = 2 - agno = 8 - agno = 3 - agno = 5 - agno = 6 - agno = 7 - agno = 9 - agno = 10 - agno = 12 - agno = 11 - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0 - agno = 1 - agno = 2 - agno = 3 - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Mon Jan 2 00:09:41 2023

Phase      Start           End             Duration
Phase 1:   01/02 00:09:37  01/02 00:09:37
Phase 2:   01/02 00:09:37  01/02 00:09:37
Phase 3:   01/02 00:09:37  01/02 00:09:40  3 seconds
Phase 4:   01/02 00:09:40  01/02 00:09:40
Phase 5:   Skipped
Phase 6:   01/02 00:09:40  01/02 00:09:41  1 second
Phase 7:   01/02 00:09:41  01/02 00:09:41

Total run time: 4 seconds

Drive 10:

Phase 1 - find and verify superblock...
        - block cache size set to 4461496 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2923446 tail block 2923158
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_icount 1139648, counted 1139776
sb_ifree 7136, counted 7106
sb_fdblocks 1532612926, counted 1493815556
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0 - agno = 1 - agno = 2 - agno = 3
data fork in ino 6867620687 claims free block 858452628
data fork in ino 6867620690 claims free block 858454913
data fork in ino 6867620692 claims free block 858452629
data fork in ino 6867620693 claims free block 858452630
data fork in ino 6867620694 claims free block 858452531
data fork in ino 6867620695 claims free block 858452532
data fork in ino 6867620696 claims free block 858452533
data fork in ino 6867620698 claims free block 858452709
imap claims a free inode 6867620699 is in use, would correct imap and clear inode
imap claims a free inode 6867620700 is in use, would correct imap and clear inode
imap claims a free inode 6867620701 is in use, would correct imap and clear inode
imap claims a free inode 6867620702 is in use, would correct imap and clear inode
imap claims a free inode 6867620703 is in use, would correct imap and clear inode
data fork in ino 6867620704 claims free block 858454946
data fork in ino 6867620705 claims free block 858456138
data fork in ino 6867620706 claims free block 858456141
data fork in ino 6867620707 claims free block 858456144
data fork in ino 6867620708 claims free block 858456147
data fork in ino 6867620709 claims free block 858456150
data fork in ino 6867620710 claims free block 858456153
data fork in ino 6867620711 claims free block 858456156
data fork in ino 6867620712 claims free block 858456159
data fork in ino 6867620716 claims free block 858456162
data fork in ino 6867620717 claims free block 858456165
data fork in ino 6867620718 claims free block 858452453
imap claims a free inode 6867620720 is in use, would correct imap and clear inode
data fork in ino 6867620721 claims free block 858456174
data fork in ino 6867620722 claims free block 858452213
        - agno = 4 - agno = 5
data fork in ino 10955870368 claims free block 1369483800
data fork in ino 10955870374 claims free block 1369483813
data fork in ino 10955870376 claims free block 1369483801
data fork in ino 10955870377 claims free block 1369483802
data fork in ino 10955870378 claims free block 1369483803
data fork in ino 10955870379 claims free block 1369483804
data fork in ino 10955870382 claims free block 1369483790
data fork in ino 10955870383 claims free block 1369483805
data fork in ino 10955870384 claims free block 1369483806
data fork in ino 10955870387 claims free block 1369483841
data fork in ino 10955870388 claims free block 1369483856
data fork in ino 10955870389 claims free block 1369483859
data fork in ino 10955870390 claims free block 1369483862
data fork in ino 10955870391 claims free block 1369483865
data fork in ino 10955870392 claims free block 1369483870
data fork in ino 10955870393 claims free block 1369483873
data fork in ino 10955870394 claims free block 1369483876
data fork in ino 10955870395 claims free block 1369483879
data fork in ino 10955870397 claims free block 1369483882
data fork in ino 10955870398 claims free block 1369483885
data fork in ino 10955870399 claims free block 1369483809
        - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0 - agno = 2 - agno = 5 - agno = 7 - agno = 12 - agno = 4 - agno = 6 - agno = 1 - agno = 8 - agno = 9 - agno = 11 - agno = 10 - agno = 3
entry "SABnzbd_nzf_shh61478" at block 0 offset 424 in directory inode 6867620687 references free inode 6867620699
        would clear inode number in entry at offset 424...
entry "SABnzbd_nzf_zeq12jhb" at block 0 offset 456 in directory inode 6867620687 references free inode 6867620700
        would clear inode number in entry at offset 456...
entry "SABnzbd_nzf_f6_nso0u" at block 0 offset 488 in directory inode 6867620687 references free inode 6867620701
        would clear inode number in entry at offset 488...
entry "SABnzbd_nzf__vf54y6w" at block 0 offset 520 in directory inode 6867620687 references free inode 6867620702
        would clear inode number in entry at offset 520...
entry "SABnzbd_nzf_91fq5y_c" at block 0 offset 552 in directory inode 6867620687 references free inode 6867620703
        would clear inode number in entry at offset 552...
entry "SABnzbd_nzf_p31093y_" at block 0 offset 1000 in directory inode 6867620687 references free inode 6867620720
        would clear inode number in entry at offset 1000...
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0 - agno = 1 - agno = 2 - agno = 3
entry "SABnzbd_nzf_shh61478" in directory inode 6867620687 points to free inode 6867620699, would junk entry
entry "SABnzbd_nzf_zeq12jhb" in directory inode 6867620687 points to free inode 6867620700, would junk entry
entry "SABnzbd_nzf_f6_nso0u" in directory inode 6867620687 points to free inode 6867620701, would junk entry
entry "SABnzbd_nzf__vf54y6w" in directory inode 6867620687 points to free inode 6867620702, would junk entry
entry "SABnzbd_nzf_91fq5y_c" in directory inode 6867620687 points to free inode 6867620703, would junk entry
entry "SABnzbd_nzf_p31093y_" in directory inode 6867620687 points to free inode 6867620720, would junk entry
bad hash table for directory inode 6867620687 (no data entry): would rebuild
would rebuild directory inode 6867620687
        - agno = 4 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Mon Jan 2 00:16:02 2023

Phase      Start           End             Duration
Phase 1:   01/02 00:13:54  01/02 00:13:54
Phase 2:   01/02 00:13:54  01/02 00:13:55  1 second
Phase 3:   01/02 00:13:55  01/02 00:15:00  1 minute, 5 seconds
Phase 4:   01/02 00:15:00  01/02 00:15:01  1 second
Phase 5:   Skipped
Phase 6:   01/02 00:15:01  01/02 00:16:02  1 minute, 1 second
Phase 7:   01/02 00:16:02  01/02 00:16:02

Total run time: 2 minutes, 8 seconds

tower-diagnostics-20230102-0017.zip
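The ALERT at the top of both logs is expected when running with -n: in read-only mode xfs_repair cannot replay the journal, so some of the reported inconsistencies may be spurious. The usual next step, sketched here with the same assumed /dev/mdX device names and the array started in Maintenance mode, is to run the repair without -n so it can handle the log itself, and only reach for -L if xfs_repair refuses to proceed:

```shell
# With the array started in Maintenance mode:

# Real repair run; without -n, xfs_repair deals with the dirty log itself.
xfs_repair /dev/md2
xfs_repair /dev/md10

# Last resort only, if xfs_repair says the log cannot be replayed:
# -L zeroes the log and may lose the most recent metadata changes.
# xfs_repair -L /dev/md2
```

Afterwards, restart the array in normal mode and see whether the disks mount; anything the repair could not place ends up in a lost+found directory at the top level of the disk.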