doron Posted July 21, 2022

Hi folks, I'm recovering from a mishap with two red-x'ed HDDs. (Thank God and @limetech for dual parity!!)

Two parity drives of 12TB; one of them is red-x'ed. Five data drives of 4TB and one of 12TB; one of the 4TB drives was red-x'ed.

Started a data rebuild for the 4TB drive. Parity2 is still disabled. The rebuild seems to have covered the entire 4TB of the target drive, but the drive's icon is still orange and the "rebuild" continues (just reading, not writing anything). It is now at 117% and happily chewing on. It appears to want to go all the way to 12TB (shudder), even though the rebuilt data drive is only 4TB. I've never seen this before. Is this a known issue? Surely it's not how it's supposed to work...

If I stop the array and try to mark the disk as "parity presumed valid" (I guess I'd need a New Config for that?) - would that work?

Any feedback would be appreciated. This seems to be odd behavior - new to me - and since I'm now with no protection (remember, two red x's), I don't want to break anything. Unraid 6.10.2.
trurl Posted July 21, 2022

6 minutes ago, doron said:
> Two parity drives of 12TB. One of them is red-x'ed

I assume it is rebuilding the disabled parity drive, unless you have unassigned it.
doron Posted July 21, 2022 (Author)

1 minute ago, trurl said:
> I assume it is rebuilding the disabled parity drive unless you have unassigned it.

Thanks. That drive is disabled, and in fact nothing is being written to it - so it's certainly not really being rebuilt.

But wait, you may be on to something here. When the data rebuild started, parity2 was not disabled. I started rebuilding both, assuming the double issue was a controller / cabling mishap (which I still think it might have been, but that's beside the point). Parity2 red-x'ed again during the rebuild. So we might be looking at a corner-case bug, where the data rebuild is not aware that its second target is disabled, and thinks it needs to complete the full run. Plausible?

Anyway, the situation now is that I believe disk3 is already fully rebuilt, and parity2 is not being rebuilt, so the process is chewing in vain. What would be the least-risky procedure to bring the array back to being protected? I thought of:

- Stop the rebuild
- Bring down the array
- New Config without parity2, with "parity assumed valid"
- When all is green, insert parity2 and rebuild it

Is this a good way? Is there a better one (e.g. can I tell emhttp now that disk3 is valid, in spite of the allegedly incomplete rebuild)?
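The hypothesized corner case can be illustrated with a toy sketch. This is NOT Unraid's actual md driver code - just a made-up model of the suspected logic, with hypothetical helper names:

```python
# Toy model of the suspected bug: the recovery span is sized once, at
# start, from ALL rebuild targets (4TB disk3 plus 12TB parity2), and
# never recomputed when one target is disabled mid-run.
# NOT Unraid's real code - an illustration only.

def rebuild_span_tb(targets):
    """The recovery thread runs to the size of the largest target."""
    return max(size for _, size in targets)

targets_at_start = [("disk3", 4), ("parity2", 12)]
span = rebuild_span_tb(targets_at_start)   # 12 TB: sized for parity2

# parity2 is then disabled mid-run, but the span is never shrunk, so
# the thread keeps reading past disk3's end: 117% of 4TB is ~4.68TB
# into the 12TB span, with nothing left to write.
active_targets = [("disk3", 4)]
needed = rebuild_span_tb(active_targets)   # 4 TB is all that's needed
print(span, needed)
```

On this model, the progress display (computed against the 4TB disk) would only stop at 300%, which matches the "wants to go all the way to 12TB" observation.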
trurl Posted July 21, 2022

Attach diagnostics to your NEXT post in this thread.
trurl Posted July 21, 2022

Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...

So it is indeed rebuilding parity2.
doron Posted July 21, 2022 (Author)

9 minutes ago, trurl said:
> Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...
> So it is indeed rebuilding parity2

Yes, that's what it started doing, but Q is DSBL right now, so it just reads stuff and writes nothing. Recommendation? (a) Just wait for it to finish doing nothing for the next ~20 hours and then rebuild Q, or (b) do a New Config assuming D3 is indeed good vis-a-vis P (taking Q out of the game)?

BTW, I do think this is a bug worth addressing. But currently I'm focused on the operational questions.
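For context on why D3 can be fully rebuilt even while Q is disabled: Unraid's P parity is a plain bytewise XOR across the data disks, so any single data disk is recoverable from P plus the surviving disks alone; Q (a Reed-Solomon-style second syndrome, not shown here) is only needed for a second simultaneous failure. A minimal sketch of the XOR reconstruction, with toy byte strings standing in for disks:

```python
from functools import reduce

# Toy "disks" of 2 bytes each. P parity is the bytewise XOR of all of
# them: p[i] = disks[0][i] ^ disks[1][i] ^ disks[2][i].
disks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*disks))

# Lose disks[1]; rebuild it by XOR-ing P with the surviving disks.
rebuilt = bytes(a ^ b ^ c for a, b, c in zip(p, disks[0], disks[2]))
assert rebuilt == disks[1]
print(rebuilt.hex())  # 1020
```

The same property is why the rebuild only needs to read P and the healthy data drives - it never has to touch the disabled Q at all.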
trurl Posted July 21, 2022

Diagnostics just show it as invalid, not disabled. It will be invalid until the rebuild finishes. Post a screenshot of Main - Array Devices.
doron Posted July 21, 2022 (Author)

13 minutes ago, trurl said:
> Diagnostics just shows it invalid, not disabled. It will be invalid until rebuild finishes.

D3 is the one that's invalid (as you say, this is expected). Q is disabled (DISK_DSBL).

13 minutes ago, trurl said:
> Post a screenshot of Main - Array Devices.

Here. Thanks.
trurl Posted July 21, 2022

Just above the View column is a toggle to switch from write speed to total writes. I never care about speed, so I always leave it the other way. I wonder what yours shows if you toggle it.

23 minutes ago, doron said:
> D3 is the one that's invalid. Q is disabled (DISK_DSBL)

That is indeed what the diagnostics show. I have finally had some time to look further into the syslog. Maybe if you had posted diagnostics to the forum, others would have already seen this:

Jul 21 10:46:08 Tower kernel: md: import_slot: 3 empty
Jul 21 10:46:08 Tower kernel: md: import_slot: 29 empty
Jul 21 10:47:40 Tower kernel: md: import_slot: 3 replaced
Jul 21 10:47:40 Tower kernel: md: import_slot: 29 replaced
Jul 21 10:47:54 Tower kernel: mdcmd (35): start RECON_DISK

These just show emulated disk3 mounting:

Jul 21 10:48:09 Tower emhttpd: shcmd (558): mount -t xfs -o noatime,nouuid /dev/mapper/md3 /mnt/disk3
Jul 21 10:48:09 Tower kernel: XFS (dm-2): Mounting V5 Filesystem
Jul 21 10:48:10 Tower emhttpd: shcmd (559): xfs_growfs /mnt/disk3
Jul 21 10:48:10 Tower kernel: XFS (dm-2): Ending clean mount
Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...

Then we get this, and more like it, which disabled parity2 again:

Jul 21 10:49:01 Tower kernel: md: disk29 write error, sector=482368

43 minutes ago, doron said:
> currently I'm with the operational questions

An even more important question is why the parity2 write failed.
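Anyone digging through their own diagnostics can pull out these md events with a simple filter. The sketch below runs against a small sample mirroring the lines above; against a real diagnostics zip you would read the extracted syslog file instead:

```python
import re

# Sample syslog excerpt standing in for the syslog in a diagnostics zip.
sample = """\
Jul 21 10:47:54 Tower kernel: mdcmd (35): start RECON_DISK
Jul 21 10:48:10 Tower emhttpd: shcmd (559): xfs_growfs /mnt/disk3
Jul 21 10:48:16 Tower kernel: md: recovery thread: recon D3 Q ...
Jul 21 10:49:01 Tower kernel: md: disk29 write error, sector=482368
"""

# Keep md recovery-thread lines, per-disk read/write errors, and RECON
# commands; drop everything else (mounts, growfs, etc.).
pattern = re.compile(r"md: recovery|md: disk\d+ (write|read) error|RECON_DISK")
hits = [line for line in sample.splitlines() if pattern.search(line)]
for line in hits:
    print(line)
```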
I must confess I don't know much about SAS drives and how to interpret their SMART reports, but these other entries in the diagnostics make me suspect a connection or controller issue:

Jul 21 11:09:59 Tower kernel: sd 4:0:5:0: device_block, handle(0x000c)
Jul 21 11:10:01 Tower kernel: sd 4:0:5:0: device_unblock and setting to running, handle(0x000c)
Jul 21 11:11:50 Tower kernel: sd 4:0:5:0: Power-on or device reset occurred

[4:0:5:0] disk HGST HUH721212AL4200 A3D0 /dev/sdl /dev/sg12

Perhaps others on the forum would have more to say if you had posted diagnostics.
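As a hedged pointer for chasing the connection theory (not something from the thread itself): for SAS drives, `smartctl -x /dev/sdl` from smartmontools prints SCSI error-counter pages rather than ATA-style attributes, and repeated kernel block/unblock/reset events clustered on one SCSI address usually implicate cabling or the controller rather than the drive. Counting those events is a few lines, sketched here against the sample excerpt above:

```python
import re
from collections import Counter

# Sample kernel-log excerpt of the reset events quoted above.
sample = """\
Jul 21 11:09:59 Tower kernel: sd 4:0:5:0: device_block, handle(0x000c)
Jul 21 11:10:01 Tower kernel: sd 4:0:5:0: device_unblock and setting to running, handle(0x000c)
Jul 21 11:11:50 Tower kernel: sd 4:0:5:0: Power-on or device reset occurred
"""

# Tally block/unblock/reset events per SCSI address; a pile-up on a
# single address points at the path to that drive, not the platters.
events = Counter(
    m.group(1)
    for line in sample.splitlines()
    if (m := re.search(
        r"sd (\d+:\d+:\d+:\d+): (device_block|device_unblock|.*device reset)",
        line,
    ))
)
print(events)  # Counter({'4:0:5:0': 3})
```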
doron Posted July 21, 2022 (Author)

Thanks @trurl. Much appreciated, and apologies for that. Your findings are aligned with the initial description (BTW, being the author of the SAS Spindown plugin, I have some mileage of intimacy with these drives 🙂).

Trying to prioritize the burning situation here, I'd like to put aside, for the moment, the reason for parity2's failure (I'll deal with it later) and focus on the operational, Unraid-specific questions:

- At this state, am I right in assuming disk3 is indeed fully rebuilt?
- Will stopping the futile "rebuild" process right now, doing a New Config without parity2 and with "parity assumed valid", return the array to a stable, protected state? (Then I will deal with parity2 and probably rebuild it, but I will have a protected array.)

Thanks!
trurl Posted July 21, 2022

My guess is disk3 has been rebuilt, but if you stop it now it will probably want to rebuild both again. You can do what you suggest and check if all looks good. If not, there are ways to force it to rebuild disk3 again.
trurl Posted July 21, 2022

You should unassign parity2 until you are ready to rebuild it.