doron Posted July 21, 2022

Hi folks, I'm recovering from a mishap with two red-x'ed HDDs. (Thank God and @limetech for dual parity!!)

Two parity drives of 12TB; one of them is red-x'ed. Five data drives of 4TB and one of 12TB; one of the 4TB drives was red-x'ed.

Started a data rebuild for the 4TB drive. Parity2 is still disabled. The rebuild seems to have covered the entire 4TB of the target drive, but the drive's icon is still orange and the "rebuild" continues (just reading, not writing anything). It is now at 117% and happily chewing on. It appears to want to go all the way to 12TB (shudder), even though the rebuilt data drive is only 4TB. I've never seen this before. Is this a known issue? Surely it's not how it's supposed to work...

If I stop the array and try to mark the disk as "parity presumed valid" (I guess I'd need a New Config for that?) - would that work?

Any feedback would be appreciated. This seems to be odd behavior - new to me - and since I'm now with no protection (remember, two red x's), I don't want to break anything. Unraid 6.10.2.
trurl Posted July 21, 2022

6 minutes ago, doron said:
> Two parity drives of 12TB. One of them is red-x'ed

I assume it is rebuilding the disabled parity drive, unless you have unassigned it.
doron Posted July 21, 2022 (Author)

1 minute ago, trurl said:
> I assume it is rebuilding the disabled parity drive unless you have unassigned it.

Thanks. That drive is disabled, and in fact nothing is being written to it - so it's certainly not really being rebuilt.

But wait, you may be on to something here. When the data rebuild started, parity2 was not disabled. I started rebuilding both, assuming the double issue was a controller / cabling mishap (which I still think it might have been, but that's beside the point). Parity2 red-x'ed again during the rebuild. So we might be looking at a corner-case bug, where the data rebuild is not aware that its second target is disabled, and thinks it needs to complete the full run. Plausible?

Anyway, the situation now is that I believe disk3 is already fully rebuilt, and parity2 is not being rebuilt, so the process is chewing in vain. What would be the least-risky procedure to bring the array back to being protected? I thought of:

- Stop the rebuild
- Bring down the array
- New Config without parity2, with "parity assumed valid"
- When all is green, insert parity2 and rebuild it

Is this a good way? Is there a better one (e.g. can I tell emhttp now that disk3 is valid, in spite of the allegedly incomplete rebuild)?
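The hypothesized corner case can be illustrated with a toy sketch. This is NOT Unraid's actual md driver code - just a made-up model of the suspected logic, with hypothetical helper names:

```python
# Toy model of the suspected bug: the recovery span is sized once, at
# start, from ALL rebuild targets (4TB disk3 plus 12TB parity2), and
# never recomputed when one target is disabled mid-run.
# NOT Unraid's real code - an illustration only.

def rebuild_span_tb(targets):
    """The recovery thread runs to the size of the largest target."""
    return max(size for _, size in targets)

targets_at_start = [("disk3", 4), ("parity2", 12)]
span = rebuild_span_tb(targets_at_start)   # 12 TB: sized for parity2

# parity2 is then disabled mid-run, but the span is never shrunk, so
# the thread keeps reading past disk3's end: 117% of 4TB is ~4.68TB
# into the 12TB span, with nothing left to write.
active_targets = [("disk3", 4)]
needed = rebuild_span_tb(active_targets)   # 4 TB is all that's needed
print(span, needed)
```

On this model, the progress display (computed against the 4TB disk) would only stop at 300%, which matches the "wants to go all the way to 12TB" observation.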
trurl Posted July 21, 2022

Attach diagnostics to your NEXT post in this thread.
trurl Posted July 21, 2022

Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...

So it is indeed rebuilding parity2.
doron Posted July 21, 2022 (Author)

9 minutes ago, trurl said:
> Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...
> So it is indeed rebuilding parity2

Yes, that's what it started doing, but Q is DSBL right now, so it just reads stuff and writes nothing. Recommendation? (a) Just wait for it to finish doing nothing for the next ~20 hours and then rebuild Q, or (b) do a New Config assuming D3 is indeed good vis-a-vis P (taking Q out of the game)?

BTW, I do think this is a bug worth addressing. But currently I'm focused on the operational questions.
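For context on why D3 can be fully rebuilt even while Q is disabled: Unraid's P parity is a plain bytewise XOR across the data disks, so any single data disk is recoverable from P plus the surviving disks alone; Q (a Reed-Solomon-style second syndrome, not shown here) is only needed for a second simultaneous failure. A minimal sketch of the XOR reconstruction, with toy byte strings standing in for disks:

```python
from functools import reduce

# Toy "disks" of 2 bytes each. P parity is the bytewise XOR of all of
# them: p[i] = disks[0][i] ^ disks[1][i] ^ disks[2][i].
disks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*disks))

# Lose disks[1]; rebuild it by XOR-ing P with the surviving disks.
rebuilt = bytes(a ^ b ^ c for a, b, c in zip(p, disks[0], disks[2]))
assert rebuilt == disks[1]
print(rebuilt.hex())  # 1020
```

The same property is why the rebuild only needs to read P and the healthy data drives - it never has to touch the disabled Q at all.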
trurl Posted July 21, 2022

Diagnostics just show it as invalid, not disabled. It will be invalid until the rebuild finishes. Post a screenshot of Main - Array Devices.
doron Posted July 21, 2022 (Author)

13 minutes ago, trurl said:
> Diagnostics just shows it invalid, not disabled. It will be invalid until rebuild finishes.

D3 is the one that's invalid (as you say, this is expected). Q is disabled (DISK_DSBL).

13 minutes ago, trurl said:
> Post a screenshot of Main - Array Devices.

Here. Thanks.
trurl Posted July 21, 2022

Just above the View column is a toggle to switch from write speed to total writes. I never care about speed, so I always leave it the other way. I wonder what yours shows if you toggle it.

23 minutes ago, doron said:
> D3 is the one that's invalid. Q is disabled (DISK_DSBL)

That is indeed what the diagnostics show. I have finally had some time to look further into the syslog. Maybe if you had posted diagnostics to the forum, others would have already seen this:

Jul 21 10:46:08 Tower kernel: md: import_slot: 3 empty
Jul 21 10:46:08 Tower kernel: md: import_slot: 29 empty
Jul 21 10:47:40 Tower kernel: md: import_slot: 3 replaced
Jul 21 10:47:40 Tower kernel: md: import_slot: 29 replaced
Jul 21 10:47:54 Tower kernel: mdcmd (35): start RECON_DISK

These just show emulated disk3 mounting:

Jul 21 10:48:09 Tower emhttpd: shcmd (558): mount -t xfs -o noatime,nouuid /dev/mapper/md3 /mnt/disk3
Jul 21 10:48:09 Tower kernel: XFS (dm-2): Mounting V5 Filesystem
Jul 21 10:48:10 Tower emhttpd: shcmd (559): xfs_growfs /mnt/disk3
Jul 21 10:48:10 Tower kernel: XFS (dm-2): Ending clean mount
Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...

Then we get this, and more like it, which disabled parity2 again:

Jul 21 10:49:01 Tower kernel: md: disk29 write error, sector=482368

43 minutes ago, doron said:
> currently I'm with the operational questions

An even more important question is why the parity2 write failed.
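Anyone digging through their own diagnostics can pull out these md events with a simple filter. The sketch below runs against a small sample mirroring the lines above; against a real diagnostics zip you would read the extracted syslog file instead:

```python
import re

# Sample syslog excerpt standing in for the syslog in a diagnostics zip.
sample = """\
Jul 21 10:47:54 Tower kernel: mdcmd (35): start RECON_DISK
Jul 21 10:48:10 Tower emhttpd: shcmd (559): xfs_growfs /mnt/disk3
Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...
Jul 21 10:49:01 Tower kernel: md: disk29 write error, sector=482368
"""

# Keep md recovery-thread lines, per-disk read/write errors, and RECON
# commands; drop everything else (mounts, growfs, etc.).
pattern = re.compile(r"md: recovery|md: disk\d+ (write|read) error|RECON_DISK")
hits = [line for line in sample.splitlines() if pattern.search(line)]
for line in hits:
    print(line)
```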
I must confess I don't know much about SAS drives and how to interpret their SMART reports, but these other entries in the diagnostics make me suspect a connection or controller issue:

Jul 21 11:09:59 Tower kernel: sd 4:0:5:0: device_block, handle(0x000c)
Jul 21 11:10:01 Tower kernel: sd 4:0:5:0: device_unblock and setting to running, handle(0x000c)
Jul 21 11:11:50 Tower kernel: sd 4:0:5:0: Power-on or device reset occurred

[4:0:5:0] disk HGST HUH721212AL4200 A3D0 /dev/sdl /dev/sg12

Perhaps others on the forum would have more to say if you had posted diagnostics.
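As a hedged pointer for chasing the connection theory (not something from the thread itself): for SAS drives, `smartctl -x /dev/sdl` from smartmontools prints SCSI error-counter pages rather than ATA-style attributes, and repeated kernel block/unblock/reset events clustered on one SCSI address usually implicate cabling or the controller rather than the drive. Counting those events is a few lines, sketched here against the sample excerpt above:

```python
import re
from collections import Counter

# Sample kernel-log excerpt of the reset events quoted above.
sample = """\
Jul 21 11:09:59 Tower kernel: sd 4:0:5:0: device_block, handle(0x000c)
Jul 21 11:10:01 Tower kernel: sd 4:0:5:0: device_unblock and setting to running, handle(0x000c)
Jul 21 11:11:50 Tower kernel: sd 4:0:5:0: Power-on or device reset occurred
"""

# Tally block/unblock/reset events per SCSI address; a pile-up on a
# single address points at the path to that drive, not the platters.
events = Counter(
    m.group(1)
    for line in sample.splitlines()
    if (m := re.search(
        r"sd (\d+:\d+:\d+:\d+): (device_block|device_unblock|.*device reset)",
        line,
    ))
)
print(events)  # Counter({'4:0:5:0': 3})
```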
doron Posted July 21, 2022 (Author)

Thanks @trurl. Much appreciated, and apologies for that. Your findings are aligned with the initial description (BTW, being the author of the SAS Spindown plugin, I have some mileage of intimacy with these drives 🙂).

Trying to prioritize the burning situation here, I'd like to put aside, for the moment, the reason for parity2's failure (I'll deal with it later) and focus on the operational, Unraid-specific questions:

- At this state, am I right in assuming disk3 is indeed fully rebuilt?
- Will stopping the futile "rebuild" process right now, doing a New Config without parity2 and with "parity assumed valid", return the array to a stable, protected state? (Then I will deal with parity2 and probably rebuild it, but I will have a protected array.)

Thanks!
trurl Posted July 21, 2022

My guess is disk3 has been rebuilt, but if you stop it now it will probably want to rebuild both again. You can do what you suggest and check if all looks good. If not, there are ways to force it to rebuild disk3 again.
trurl Posted July 21, 2022

You should unassign parity2 until you are ready to rebuild it.