Read Errors While Rebuilding to Upgrade a Disk



I bought a 4TB disk to replace the oldest and smallest disk in my array.

I did this by unassigning the old drive, shutting down, physically replacing the disk, assigning the new disk to that slot, and then initiating a rebuild.

While this rebuild was happening, I got read errors on another 4TB disk. The rebuild process is now stuck at 99.9%.

That disk had thrown read errors before, but I foolishly ignored them, since SMART showed no bad attributes, extended self-tests passed, and I replaced the corrupted files afterwards.
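(For reference, checking SMART from the command line looks roughly like this; /dev/sdX is a placeholder for the actual device:)

smartctl -a /dev/sdX           # full SMART report: attributes, error log, self-test log
smartctl -t long /dev/sdX      # start an extended (long) self-test
smartctl -l selftest /dev/sdX  # read the self-test results once it finishes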

 

The way I see it, there are two possibilities:

1. Revert my replacement back to the 1TB disk, and then use the new disk to replace the disk with errors.

2. Accept the loss of some data and recover it from another source. I don't know yet what data is affected, though.

 

I wouldn't know how to do either of those things, though.

Also, is it even necessary for the rebuild to go past 1TB?

 

(Out of desperation, I reset the stats at some point to see if that might do something.)


bigbong-diagnostics-20220625-0642.zip


Then I think your best bet is to use old disk2 and force Unraid to rebuild disk4. Parity still won't be 100% in sync, so some filesystem corruption is possible. You can do that by:

 

- Tools -> New Config -> Retain current configuration: All -> Apply
- Check all assignments and assign any missing disk(s) if needed, including old disk2 and new disk4; the replacement disk4 must be the same size as or larger than the old one
- IMPORTANT - check both "Parity is already valid" and "Maintenance mode", then start the array (the GUI will still warn that data on the parity disk(s) will be overwritten; this is normal, as the warning doesn't account for the checkbox, and nothing will be overwritten as long as it's checked)
- Stop the array
- Unassign disk4
- Start the array (in normal mode now); ideally the emulated disk4 will now mount and its contents will look correct, and if it doesn't, you should run a filesystem check on the emulated disk (see the sketch after this list)
- If the emulated disk mounts and the contents look correct, stop the array
- Re-assign disk4 and start the array to begin the rebuild
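A read-only check along these lines will show whether the emulated filesystem is repairable without changing anything (a minimal sketch, assuming disk4 is XFS and the array is started in maintenance mode; -n means no modifications are made):

xfs_repair -n /dev/md4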

 

For now keep old disk4 intact in case it's needed.


I shut down and restarted to swap the drives back, forgetting that the array would start normally on boot. As a result, a number of writes were made even though I didn't change any data myself (docker containers, perhaps?).

I proceeded with the process anyway. All of the data contents appear good.

 

With disk 4 unassigned from the array, its contents do not appear within the pool.

Additionally, old disk 4 now shows as "Unmountable", although it can still be mounted and read with the Unassigned Devices plugin.
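(Mounting it manually works as well; a minimal sketch, where /dev/sdX1 and the mount point are placeholders for the actual partition and directory:)

mkdir -p /mnt/old_disk4
mount -o ro /dev/sdX1 /mnt/old_disk4   # read-only, so nothing on the old disk gets changed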

 

I don't know if Unraid can rebuild disk 4 to a replacement in this state, although I suppose a rebuild would have a chance of recovering the errored-out data, as opposed to simply copying old disk 4 to the replacement.

6 minutes ago, 404fox said:

When unassigned from the array, the contents of disk 4 do not appear within the pool.

Do you mean the emulated disk4 is unmountable? If yes:

1 hour ago, JorgeB said:

ideally the emulated disk4 will now mount and its contents will look correct, and if it doesn't, you should run a filesystem check on the emulated disk

 

If it's not that, post new diags.


No signs of a valid filesystem were found on the emulated disk4, which is very strange assuming parity was valid. Try running a manual check to see if a backup superblock can be found, though I'm not very optimistic. With the array started in maintenance mode, type:

 

xfs_repair -v /dev/md4

 

 

root@BigBong:~# xfs_repair -v /dev/md4
Phase 1 - find and verify superblock...
bad primary superblock - inconsistent filesystem geometry information !!!

attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 735560 entries
sb realtime bitmap inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap inode pointer to 129
sb realtime summary inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary inode pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 3035810 tail block 3035810
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 1504704
sb_ifree 0, counted 5195
sb_fdblocks 976277679, counted 40592957
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Note - stripe unit (0) and width (0) were copied from a backup superblock.
Please reset with mount -o sunit=<value>,swidth=<value> if necessary

        XFS_REPAIR Summary    Sun Jun 26 07:52:29 2022

Phase           Start           End             Duration
Phase 1:        06/26 07:48:35  06/26 07:48:36  1 second
Phase 2:        06/26 07:48:36  06/26 07:48:38  2 seconds
Phase 3:        06/26 07:48:38  06/26 07:50:43  2 minutes, 5 seconds
Phase 4:        06/26 07:50:43  06/26 07:50:44  1 second
Phase 5:        06/26 07:50:44  06/26 07:50:46  2 seconds
Phase 6:        06/26 07:50:46  06/26 07:52:28  1 minute, 42 seconds
Phase 7:        06/26 07:52:28  06/26 07:52:28

Total run time: 3 minutes, 53 seconds
done

 

Here is the result. It worked, and the emulated contents can be read without issue.

Should I go ahead with rebuilding to the replacement disk?

