• [6.9.2] Ironwolf Drive Disablement and Dual Parity Rebuild Hangs


    Pauven
    • Minor

    As a long time Unraid user (over a decade now, and loving it!), I rarely have issues (glossing right over those Ryzen teething issues).  It is with that perspective that I want to report that there are major issues with 6.9.2.

     

    I'd been hanging on to 6.8.3, avoiding the 6.9.x series as the bug reports seemed scary.  I read up on 6.9.2 and finally decided that with two dot.dot patches it was time to try it.  My main concern was that my two 8 TB Seagate Ironwolf drives might experience this issue: 

     

     

    I had a series of unfortunate events that makes it extremely difficult to figure out what transpired, and in what order, so I'll just lay it all out.  I'd been running 6.9.2 for almost a week, and I felt I was in the clear.  I hadn't noticed any drives going offline.

     

    Two nights ago (4/27), somehow my power strip turned off - either circuit protection kicked in, or possibly a dog stepped on the power button.  Regardless, I didn't discover this before my UPS was depleted and the server shut itself down.

     

    Yesterday, after getting the server started up again, I was surprised to see my two Ironwolf drives had the red X's next to them, indicating they were disabled.  I troubleshot this for a while, finding nothing in the logs, so it's possible that a Mover I kicked off manually yesterday (which would have been writing to these two drives) caused them to go offline on spin-up (according to the issue linked above), but that the subsequent power failure caused me to lose the logs of this event. [NOTE: I've since discovered that the automatic powerdown from the UPS failure was forced, which triggered diagnostics, and those logs were not lost after all - diagnostics attached!!!]

     

    I was concerned that the Mover task had only written the latest data to the simulated array, so a rebuild seemed the right path forward to ensure I didn't lose any data.  I had to jump through hoops to get Unraid to attempt to rebuild parity to these two drives - apparently you have to un-select them, start/stop the array, then re-select them, before Unraid will give the option to rebuild.  Just a critique from a long-time user, this was not obvious and seems like there should be a button to force a drive back into the array without all these obstacles.  Anyways, now to the real troubles.  Luckily, I only have two Ironwolf drives, and with my dual parity (thanks LimeTech!!!), this was a recoverable situation.

     

    The rebuild only made it to about 46 GB before stopping.  It appeared that Unraid thought the rebuild was still progressing, but obviously it was stalled.  I quickly scanned through the log, finding no errors but lots of warnings related to the swapper being tainted.  At this point, I discovered that even though the GUI was responsive (nice work GUI gang!), the underlying system was pretty much hung.  I couldn't pause or cancel the data rebuild, and I couldn't powerdown or reboot, either through the GUI or through the command line.  Issuing a command in the terminal would hang the terminal.  Through the console I issued a powerdown, and after a while it said it was doing it forcefully, but it hung on collecting diagnostics.  I finally resorted to the 10-second power button press to force the server off (and diagnostics are missing).

     

    I decided that the issue could be those two Ironwolf drives, and since I had two brand new Exos drives of the same capacity, I swapped those in and started the data rebuild with those instead.  I tried this twice, and the rebuild never made it further than about 1% (an ominous 66.6 GB was the max rebuilt). 

     

    At this point, I really didn't know if I had an actual hardware failure (the power strip issue was still in my thoughts), or software issue, but with a dual-drive failure and a fully unprotected 87 TB array, I felt more pressure to quickly resolve the issue rather than gather more diagnostics (sorry not sorry). So I rolled back to 6.8.3 (so glad I made that flash backup, really wish there was a restore function), and started the data rebuild again last night.

     

    This morning, the rebuild is still running great after 11 hours.  It's at 63% complete, and should wrap up in about 6.5 hours based on history.  So something changed between 6.8.3 and 6.9.2 that is causing this specific scenario to fail.
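    As a rough sanity check on that estimate, here is a minimal sketch of a plain linear extrapolation from the numbers above (11 hours elapsed, 63% complete).  The function name is hypothetical, and it assumes the rest of the rebuild proceeds at the average rate so far, which is not necessarily how Unraid computes its own estimate.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Estimate time left, assuming the average rate so far continues."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total duration at the average rate observed so far.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Figures from the post: 11 hours in, 63% complete.
eta = remaining_hours(11, 0.63)
print(f"{eta:.1f} hours remaining")
```

    The result is in line with the roughly 6.5 hours quoted above, although real rebuilds tend to slow down toward the inner tracks of the disks, so a linear estimate is only approximate.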

     

    I know a dual-drive rebuild is a pretty rare event, and I don't know if it has received adequate testing on 6.9.x.  While the Seagate Ironwolf drive issue is bad enough, that's a known issue with multiple topics and possible workarounds.  But the complete inability to rebuild data to two drives simultaneously seems like a new and very big issue, and this issue persisted even after removing the Ironwolf drives.

     

    I will tentatively offer that I may have done a single drive rebuild, upgrading a drive from 3TB to an 8TB Ironwolf, on 6.9.2.  Honestly, I can't recall now if I did this before upgrading to 6.9.2 or after, but I'm pretty sure it was after.  So on my system, I believe I was able to perform a single drive rebuild, and only the dual-drive rebuild was failing. 

     

    I know we always get in trouble for not including Diagnostics, so I am including a few files: 

     

    The 20210427-2133 diagnostics are from the forced powerdown two nights ago, on 6.9.2, when the UPS ran out of juice, and before I discovered that the two Ironwolf drives were disabled.  Note, they might be disabled already in these diags, no idea of what to look for in there.

     

    The 20210420-1613 diagnostics are from 6.8.3, the day before I upgraded to 6.9.2.  I think I hit the diagnostics button by accident.  Figured it won't hurt to include them.

     

    And finally, the 20210429-0923 diagnostics are from right now, after downgrading to 6.8.3, and with the rebuild still in progress.

     

    Paul

    tower-diagnostics-20210427-2133.zip tower-diagnostics-20210429-0923.zip tower-diagnostics-20210420-1613.zip




    User Feedback

    Recommended Comments

    Pretty sure that won't be a general problem, but I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

     

    Quote

    really wish there was a restore function

    There already is one:

     

    imagem.thumb.png.6a1e9d83bdca178b529359a3c1f8a544.png

     

    Link to comment
    26 minutes ago, JorgeB said:

    There already is one:

     

    Not the restore Unraid version feature (which I used) but rather a restore flash drive from backup.  I had to manually copy some config files from the flash drive backup to get 6.8.3 working correctly.  It took me a while to figure out which files needed restoring.  Some type of automation here would have been nice.  Really cool if it was integrated into the restore Unraid version feature - it could prompt to optionally restore certain files from an existing flash drive backup.

     

     

    28 minutes ago, JorgeB said:

    I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

     

    That could certainly be the issue.  But no way I'm going back to 6.9.2 on my production server to gather diags once it fails.  I'm still 4 hours away from a full recovery, and I'm not into S&M.  I know it's my personal perspective, but I feel that if 6.9.x has issues as bad as this, it shouldn't be considered "stable".  I wasn't gearing up for a testing run, I was upgrading my production server to a "stable" dot-dot-two release, with a reasonable expectation that the kinks were worked out, and with no awareness that I could be signing up for data loss.  I was completely unprepared to deal with these issues, and my main goal was simply surviving.

    Link to comment
    2 hours ago, Pauven said:

    but rather a restore flash drive from backup. 

    Ahh, OK.

     

    3 hours ago, JorgeB said:

    Pretty sure that won't be a general problem

    Just did a dual disk rebuild on my work server using v6.9.2 without issues, so it confirms it's not a general problem, I suspect it's what I wrote above.

    Link to comment



