
Two drives have errors during an in-progress drive rebuild




Hello,

 

I have replaced two drives in my system to expand the array size. Last week I replaced one 10TB drive with an 8TB drive and completed that drive rebuild successfully. Then, a day or so ago, I replaced another drive with another new 14TB drive and started its rebuild. (Just a note that I followed the instructions for replacing with a larger drive at https://wiki.unraid.net/Manual/Storage_Management#Replacing_a_disk_to_increase_capacity.)

 

Sometime in the past 12-15 hours, one parity drive logged a bunch of errors and the drive I had originally replaced dropped out as well. The rebuild thinks it is still running and is reading from the remaining drives, but no writes are happening.

 

I have enclosed a screenshot of what the array looks like and a diagnostic report from the same time.

 

Let me know what action to take to save what I can. I don't think the parity drive or the other recently rebuilt drive is bad at this point, but of course that could be the case.

 

Thanks for the assistance.

 

 

2022-12-20 15_16_22-Tower_Main.png

tower-diagnostics-20221220-1513.zip

Link to comment

Parity2, disk2, and disk10 all disconnected. Disk2 became disabled. Emulated disk2 and disk10 mount and have plenty of contents. SMART for all 3 disks looks fine; I didn't check the others. Do any of your other disks show SMART warnings on the Dashboard?

 

I notice you have a RAID controller. RAID controllers are NOT recommended with Unraid for many reasons.

 

But it looks like all the affected disks are on the LSI controller. You probably disturbed the connections when you replaced disk10.

 

Since you have dual parity you can rebuild both disk2 and disk10.

 

Shut down and check all connections, SATA and power, at both ends, including any splitters. You will have to start the rebuilds over.
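
(If you want to sanity-check things from the console after reseating, a rough sketch; /dev/sdb is just a placeholder for whichever device an affected disk comes back as:)

dmesg -T | grep -iE 'reset|link (up|down)|I/O error' | tail -n 50   # recent link resets / I/O errors in the kernel log
smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect|crc'  # the usual SMART attributes worth a look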

 

Link to comment

Great, thanks for the quick info!

 

I thought that was the case but wanted confirmation from calmer heads. The RAID card was only ever used for moving some data off of other drives and never for pool/array drives; I do know better than that. All pool drives are on the LSI controller.

 

I have re-seated all power and SATA cables and powered up the server.

Drive 10 shows as a yellow triangle and should be rebuilt when I start the array.

Drive 2 shows as disabled, and I guess I will need to remove it from the pool and re-add it. But for now I will let drive 10 rebuild first, then take care of drive 2.

Let me know if that is a sane course of action.

 

Thanks again for the help.

 

Matt

Link to comment

Just a quick update:

 

After starting the array, both drives 10 and 2 show as "Unmountable: wrong or no file system", but it looks to be rebuilding (reading from all other drives except 2 and 10) and writing to drive 10 as I would expect. I'll let it proceed and wait until morning to see if I should stop it and format the drive before starting the rebuild again.

 

Link to comment
7 hours ago, MatrixMJK said:

format the drive before starting the rebuild

Format is a write operation. Unraid treats this write operation just as it does any other, by updating parity so it stays in sync. After formatting a disk in the array, parity agrees the disk has been formatted. Then the only thing it can rebuild is a formatted disk.
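
(A toy way to see it: single parity is just XOR across the data disks, so every write, including the writes a format performs, is folded into parity on the spot. The byte values below are made up, nothing Unraid-specific:)

old_data=0x3C; new_data=0x00; old_parity=0xA5        # made-up example bytes
new_parity=$(( old_parity ^ old_data ^ new_data ))   # parity is updated for every write, format included
printf 'new parity byte: 0x%02X\n' "$new_parity"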

 

5 hours ago, JorgeB said:

Format is never part of a rebuild; post new diags.

 

Link to comment

Filesystem corruption showed up on disks 2 and 10 as soon as they tried to mount on startup. It's not clear there are any connection issues.

 

I don't see anything in syslog to indicate a rebuild going on; maybe I missed it because syslog is flooded with docker network entries for some reason.
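
(To cut through the noise, something like this should work, assuming the live log is at /var/log/syslog as usual:)

grep -i 'md: recovery' /var/log/syslog                      # rebuild / parity-sync activity from the md driver
grep -ivE 'docker|veth|br-' /var/log/syslog | tail -n 100   # recent entries with the docker network chatter filtered out (pattern is a guess at what's flooding it)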

 

Post a screenshot of Main - Array Operation.

Link to comment
9 hours ago, trurl said:

anything in syslog to indicate a rebuild

Here it is

Dec 20 22:00:04 Tower kernel: md: recovery thread: recon D10 ...

 

No I/O errors during the rebuild, so you might as well let it complete and try to repair the filesystems when the disk10 rebuild is done. We can try to repair the emulated filesystem on the other disk before rebuilding it.

Link to comment

I did shut the machine down to change the SAS and power cables to see if it changed anything, but it did not.

 

I restarted the rebuild a bit ago. Also the ZFS plugin is uninstalled and the docker service is stopped.

 

Another diag in case you want to check anything since the reboot.

 

I will let the rebuild run now; I lost some time doing the reboot.

 

I'll update as I see anything.

 

Thanks!

tower-diagnostics-20221221-2222.zip

Link to comment
  • 2 weeks later...

OK, the drives did not really rebuild. I guessed that would be the outcome from the indicators.

 

I ordered two new HDDs, shut the system down, replaced the two 'unmountable' drives with the new ones, and powered up the server. After starting up I assigned the two new drives to the disk 2 and disk 10 slots (for the missing ones) and started the array. Those two slots still show 'unmountable'.

 

Should I format (I know, not from the 'Main' Unraid menu) or take some other action to prepare the two new drives and restart the rebuild?

 

I have always precleared drives before using them in the array, so I'm not sure if a format step is needed since these were not precleared.

 

Attached is a new diag with the new drives after starting the array.

 

Thanks,

 

Matt

tower-diagnostics-20221230-1856 two new drives.zip

Link to comment
10 hours ago, MatrixMJK said:

set the two new drives to the drive #2 and #10 spots (for the missing ones) and started the array

So it is now rebuilding to new disks.

 

Not clear there was anything wrong with the original disks you already rebuilt. Those were unmountable as expected, since you were rebuilding unmountable filesystems instead of having

On 12/22/2022 at 3:42 AM, JorgeB said:

checked filesystem before re-starting the rebuild

 

If you had waited for advice, we would have told you to check the filesystem on those original rebuilt disks.

 

Now, I guess you can check the filesystem on those new disks when they finish rebuilding the unmountable filesystems. (Technically, you could check the filesystem on the disks while they rebuild, but let's just keep it simple.)

Link to comment
9 hours ago, JorgeB said:

Format is never part of a rebuild

As I said, I know a format is not part of the recovery; I was asking if any action needs to happen to the new drives to get them to not show as unmountable.

 

6 hours ago, trurl said:

Not clear there was anything wrong with the original disks you already rebuilt.

I replaced them to make sure it was not a drive failure, since they were brand new and had not been precleared.

 

I placed the two previous drives in another Unraid system, and after trying to mount them it said they were successfully repaired. So once the new drives are done rebuilding I will try to check the filesystem on them.

 

Since it was still showing "Unmountable" I was worried it was a hardware problem.

Link to comment
32 minutes ago, MatrixMJK said:

any action needs to happen to the new drives to get them to not show as unmountable.

For a drive to show as mountable, it needs to contain a mountable filesystem.

 

If you wanted to test a new disk, you could have done preclear. But that is unrelated to whether it becomes mountable or not (assuming it isn't a bad disk).

 

Formatting it outside the array or in another system just writes an empty filesystem (of whatever type, such as NTFS or XFS) to it. That is all Format means. It's irrelevant since it is going to be completely overwritten by the rebuild anyway.
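
(Roughly what a format boils down to under the hood; a sketch only, with /dev/sdX1 standing in for whatever partition is being formatted:)

wipefs -a /dev/sdX1     # destructive: clears any existing filesystem signatures
mkfs.xfs -f /dev/sdX1   # writes a brand-new, empty XFS filesystem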

 

Rebuild overwrites the entire disk with the contents of the emulated disk. Those contents ARE the filesystem. If the emulated disk contains an unmountable filesystem, the rebuilt disk will too. If the emulated disk contains a mountable filesystem, the rebuilt disk will be a mountable filesystem (assuming everything works as intended, of course).

 

Doesn't matter at all whether the rebuilding disk was previously clear, freshly formatted in another system, completely full of porn, whatever. It is completely overwritten by rebuild. If you are rebuilding an unmountable filesystem, you will have an unmountable filesystem that needs repair to (hopefully) make it mountable.

 

The reason we prefer to repair the emulated filesystem before rebuilding on top of the original disk is to avoid overwriting that original disk with a filesystem that needs repair. When rebuilding to a new disk instead of on top of the original, it isn't as important since you still have the contents of the original to try to recover something from.

 

 

Link to comment

Thank you. I'm sorry I was not clear about that. 

 

In my experience with multi-drive filesystems, if rebuilding does not fix the filesystem then we look to hardware failures, so I was doing what I could to rule out drive and cable failures.

 

In my troubleshooting I lost sight of the fact that in Unraid each drive has its own filesystem and may be repairable individually. I should have paid more attention to that.

 

Sorry to have led myself down too many rabbit holes. I am guessing that once this rebuild is complete, a filesystem check/repair will be all it needs.

Link to comment

OK, the rebuild finished, still showing 'Unmountable' on drives 2 and 10 as expected. Below is the output from the xfs_repair check commands. The check found and verified the superblock on each drive, but each one now suggests replaying/clearing the log. I just want to make sure that is the logical next step for the drives (see the sketch after the output below). I also attached a new diag.

 

Drive 2:

Phase 1 - find and verify superblock...
        - block cache size set to 4461768 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 1569181 tail block 1569177
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 1515438556, counted 1517878137
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 8
        - agno = 3
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 9
        - agno = 10
        - agno = 12
        - agno = 11
        - agno = 4
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Jan  2 00:09:41 2023

Phase		Start		End		Duration
Phase 1:	01/02 00:09:37	01/02 00:09:37
Phase 2:	01/02 00:09:37	01/02 00:09:37
Phase 3:	01/02 00:09:37	01/02 00:09:40	3 seconds
Phase 4:	01/02 00:09:40	01/02 00:09:40
Phase 5:	Skipped
Phase 6:	01/02 00:09:40	01/02 00:09:41	1 second
Phase 7:	01/02 00:09:41	01/02 00:09:41

Total run time: 4 seconds

 

Drive 10:

Phase 1 - find and verify superblock...
        - block cache size set to 4461496 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 2923446 tail block 2923158
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_icount 1139648, counted 1139776
sb_ifree 7136, counted 7106
sb_fdblocks 1532612926, counted 1493815556
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
data fork in ino 6867620687 claims free block 858452628
data fork in ino 6867620690 claims free block 858454913
data fork in ino 6867620692 claims free block 858452629
data fork in ino 6867620693 claims free block 858452630
data fork in ino 6867620694 claims free block 858452531
data fork in ino 6867620695 claims free block 858452532
data fork in ino 6867620696 claims free block 858452533
data fork in ino 6867620698 claims free block 858452709
imap claims a free inode 6867620699 is in use, would correct imap and clear inode
imap claims a free inode 6867620700 is in use, would correct imap and clear inode
imap claims a free inode 6867620701 is in use, would correct imap and clear inode
imap claims a free inode 6867620702 is in use, would correct imap and clear inode
imap claims a free inode 6867620703 is in use, would correct imap and clear inode
data fork in ino 6867620704 claims free block 858454946
data fork in ino 6867620705 claims free block 858456138
data fork in ino 6867620706 claims free block 858456141
data fork in ino 6867620707 claims free block 858456144
data fork in ino 6867620708 claims free block 858456147
data fork in ino 6867620709 claims free block 858456150
data fork in ino 6867620710 claims free block 858456153
data fork in ino 6867620711 claims free block 858456156
data fork in ino 6867620712 claims free block 858456159
data fork in ino 6867620716 claims free block 858456162
data fork in ino 6867620717 claims free block 858456165
data fork in ino 6867620718 claims free block 858452453
imap claims a free inode 6867620720 is in use, would correct imap and clear inode
data fork in ino 6867620721 claims free block 858456174
data fork in ino 6867620722 claims free block 858452213
        - agno = 4
        - agno = 5
data fork in ino 10955870368 claims free block 1369483800
data fork in ino 10955870374 claims free block 1369483813
data fork in ino 10955870376 claims free block 1369483801
data fork in ino 10955870377 claims free block 1369483802
data fork in ino 10955870378 claims free block 1369483803
data fork in ino 10955870379 claims free block 1369483804
data fork in ino 10955870382 claims free block 1369483790
data fork in ino 10955870383 claims free block 1369483805
data fork in ino 10955870384 claims free block 1369483806
data fork in ino 10955870387 claims free block 1369483841
data fork in ino 10955870388 claims free block 1369483856
data fork in ino 10955870389 claims free block 1369483859
data fork in ino 10955870390 claims free block 1369483862
data fork in ino 10955870391 claims free block 1369483865
data fork in ino 10955870392 claims free block 1369483870
data fork in ino 10955870393 claims free block 1369483873
data fork in ino 10955870394 claims free block 1369483876
data fork in ino 10955870395 claims free block 1369483879
data fork in ino 10955870397 claims free block 1369483882
data fork in ino 10955870398 claims free block 1369483885
data fork in ino 10955870399 claims free block 1369483809
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 7
        - agno = 12
        - agno = 4
        - agno = 6
        - agno = 1
        - agno = 8
        - agno = 9
        - agno = 11
        - agno = 10
        - agno = 3
entry "SABnzbd_nzf_shh61478" at block 0 offset 424 in directory inode 6867620687 references free inode 6867620699
	would clear inode number in entry at offset 424...
entry "SABnzbd_nzf_zeq12jhb" at block 0 offset 456 in directory inode 6867620687 references free inode 6867620700
	would clear inode number in entry at offset 456...
entry "SABnzbd_nzf_f6_nso0u" at block 0 offset 488 in directory inode 6867620687 references free inode 6867620701
	would clear inode number in entry at offset 488...
entry "SABnzbd_nzf__vf54y6w" at block 0 offset 520 in directory inode 6867620687 references free inode 6867620702
	would clear inode number in entry at offset 520...
entry "SABnzbd_nzf_91fq5y_c" at block 0 offset 552 in directory inode 6867620687 references free inode 6867620703
	would clear inode number in entry at offset 552...
entry "SABnzbd_nzf_p31093y_" at block 0 offset 1000 in directory inode 6867620687 references free inode 6867620720
	would clear inode number in entry at offset 1000...
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
entry "SABnzbd_nzf_shh61478" in directory inode 6867620687 points to free inode 6867620699, would junk entry
entry "SABnzbd_nzf_zeq12jhb" in directory inode 6867620687 points to free inode 6867620700, would junk entry
entry "SABnzbd_nzf_f6_nso0u" in directory inode 6867620687 points to free inode 6867620701, would junk entry
entry "SABnzbd_nzf__vf54y6w" in directory inode 6867620687 points to free inode 6867620702, would junk entry
entry "SABnzbd_nzf_91fq5y_c" in directory inode 6867620687 points to free inode 6867620703, would junk entry
entry "SABnzbd_nzf_p31093y_" in directory inode 6867620687 points to free inode 6867620720, would junk entry
bad hash table for directory inode 6867620687 (no data entry): would rebuild
would rebuild directory inode 6867620687
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Jan  2 00:16:02 2023

Phase		Start		End		Duration
Phase 1:	01/02 00:13:54	01/02 00:13:54
Phase 2:	01/02 00:13:54	01/02 00:13:55	1 second
Phase 3:	01/02 00:13:55	01/02 00:15:00	1 minute, 5 seconds
Phase 4:	01/02 00:15:00	01/02 00:15:01	1 second
Phase 5:	Skipped
Phase 6:	01/02 00:15:01	01/02 00:16:02	1 minute, 1 second
Phase 7:	01/02 00:16:02	01/02 00:16:02

Total run time: 2 minutes, 8 seconds
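
(For what it's worth, those ALERT lines only appear because both runs above used -n, which checks without touching anything, including the log. A hedged sketch of the usual next step, assuming the array is started in Maintenance mode and that disk 2 maps to /dev/md2 on this release; device naming varies by Unraid version, and the important part is using the md device so parity stays in sync:)

xfs_repair -n /dev/md2   # check only, changes nothing; this is what produced the output above
xfs_repair /dev/md2      # real repair; if the log is dirty it will tell you to mount the disk to replay it, or to use -L
xfs_repair -L /dev/md2   # last resort: zeroes the log, which can discard the most recent metadata changes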

 

2023-01-02 00_18_54-Tower_Main.png

tower-diagnostics-20230102-0017.zip

Link to comment
