• [6.9.2] Ironwolf Drive Disablement and Dual Parity Rebuild Hangs


    Pauven
    • Minor

    As a long-time Unraid user (over a decade now, and loving it!), I rarely have issues (glossing right over those Ryzen teething issues).  It is with that perspective that I want to report that there are major issues with 6.9.2.

     

    I'd been hanging on to 6.8.3, avoiding the 6.9.x series as the bug reports seemed scary.  I read up on 6.9.2 and finally decided that with two dot-dot patches it was time to try it.  My main concern was that my two 8 TB Seagate Ironwolf drives might experience this issue: 

     

     

    I had a series of unfortunate events that makes it extremely difficult to figure out what transpired, and in what order, so I'll just lay it all out.  I'd been running 6.9.2 for almost a week, and I felt I was in the clear.  I hadn't noticed any drives going offline.

     

    Two nights ago (4/27), somehow my power strip turned off - either circuit protection kicked in, or possibly a dog stepped on the power button.  Regardless, I didn't discover this before my UPS was depleted and the server shut itself down.

     

    Yesterday, after getting the server started up again, I was surprised to see my two Ironwolf drives had the red X's next to them, indicating they were disabled.  I troubleshot this for a while, finding nothing in the logs, so it's possible that a Mover I kicked off manually yesterday (which would have been writing to these two drives) caused them to go offline on spin-up (according to the issue linked above), but that the subsequent power failure caused me to lose the logs of this event. [NOTE: I've since discovered that the automatic powerdown from the UPS failure was forced, which triggered diagnostics, so those logs weren't lost after all - diagnostics attached!!!]

     

    I was concerned that the Mover task had only written the latest data to the simulated array, so a rebuild seemed the right path forward to ensure I didn't lose any data.  I had to jump through hoops to get Unraid to attempt to rebuild parity to these two drives - apparently you have to un-select them, start/stop the array, then re-select them, before Unraid will give the option to rebuild.  Just a critique from a long-time user: this was not obvious, and it seems like there should be a button to force a drive back into the array without all these obstacles.  Anyways, now to the real troubles.  Luckily, I only have two Ironwolf drives, and with my dual parity (thanks LimeTech!!!), this was a recoverable situation.

     

    The rebuild only made it to about 46 GB before stopping.  It appeared that Unraid thought the rebuild was still progressing, but obviously it was stalled.  I quickly scanned through the log, finding no errors but lots of warnings related to the swapper being tainted.  At this point, I discovered that even though the GUI was responsive (nice work GUI gang!), the underlying system was pretty much hung.  I couldn't pause or cancel the data rebuild, and I couldn't powerdown or reboot - not through the GUI, and not through the command line.  Issuing a command in the terminal would hang the terminal.  Through the console I issued a powerdown, and after a while it said it was doing it forcefully, but it hung on collecting diagnostics.  I finally resorted to the 10-second power button press to force the server off (so those diagnostics are missing).

     

    I decided that the issue could be those two Ironwolf drives, and since I had two brand new Exos drives of the same capacity, I swapped those in and started the data rebuild with those instead.  I tried this twice, and the rebuild never made it further than about 1% (an ominous 66.6 GB was the max rebuilt). 

     

    At this point, I really didn't know if I had an actual hardware failure (the power strip issue was still in my thoughts), or software issue, but with a dual-drive failure and a fully unprotected 87 TB array, I felt more pressure to quickly resolve the issue rather than gather more diagnostics (sorry not sorry). So I rolled back to 6.8.3 (so glad I made that flash backup, really wish there was a restore function), and started the data rebuild again last night.

     

    This morning, the rebuild is still running great after 11 hours.  It's at 63% complete, and should wrap up in about 6.5 hours based on history.  So something changed between 6.8.3 and 6.9.2 that is causing this specific scenario to fail.
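A quick sanity check on that ETA, assuming a roughly constant rebuild rate (the numbers are just the ones from this morning):

```shell
# Rough ETA check: elapsed hours and percent complete from this
# morning's rebuild; assumes a roughly constant rebuild rate.
elapsed=11
pct=63
awk -v e="$elapsed" -v p="$pct" \
  'BEGIN { printf "%.1f hours remaining\n", e / (p / 100) - e }'
# prints: 6.5 hours remaining
```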

     

    I know a dual-drive rebuild is a pretty rare event, and I don't know if it has received adequate testing on 6.9.x.  While the Seagate Ironwolf drive issue is bad enough, that's a known issue with multiple topics and possible workarounds.  But the complete inability to rebuild data to two drives simultaneously seems like a new and very big issue, and this issue persisted even after removing the Ironwolf drives.

     

    I will tentatively offer that I may have done a single drive rebuild, upgrading a drive from 3TB to an 8TB Ironwolf, on 6.9.2.  Honestly, I can't recall now if I did this before upgrading to 6.9.2 or after, but I'm pretty sure it was after.  So on my system, I believe I was able to perform a single drive rebuild, and only the dual-drive rebuild was failing. 

     

    I know we always get in trouble for not including Diagnostics, so I am including a few files: 

     

    The 20210427-2133 diagnostics are from the forced powerdown two nights ago, on 6.9.2, when the UPS ran out of juice, and before I discovered that the two Ironwolf drives were disabled.  Note, they might be disabled already in these diags - no idea what to look for in there.

     

    The 20210420-1613 diagnostics are from 6.8.3, the day before I upgraded to 6.9.2.  I think I hit the diagnostics button by accident.  Figured it wouldn't hurt to include them.

     

    And finally, the 20210429-0923 diagnostics are from right now, after downgrading to 6.8.3, with the rebuild still in progress.

     

    Paul

    tower-diagnostics-20210427-2133.zip tower-diagnostics-20210429-0923.zip tower-diagnostics-20210420-1613.zip




    User Feedback

    Recommended Comments

    Pretty sure that won't be a general problem, but I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

     

    Quote

    really wish there was a restore function

    There already is one:

     

    [screenshot]

     

    26 minutes ago, JorgeB said:

    There already is one:

     

    Not the restore Unraid version feature (which I used) but rather a restore flash drive from backup.  I had to manually copy some config files from the flash drive backup to get 6.8.3 working correctly, and it took me a while to figure out which files needed restoring.  Some type of automation here would have been nice.  It would be really cool if it were integrated into the restore Unraid version feature - it could prompt to optionally restore certain files from an existing flash drive backup.

     

     

    28 minutes ago, JorgeB said:

    I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

     

    That could certainly be the issue.  But there's no way I'm going back to 6.9.2 on my production server to gather diags once it fails.  I'm still 4 hours away from a full recovery, and I'm not into S&M.  I know it's my personal perspective, but I feel that if 6.9.x has issues as bad as this, it shouldn't be considered "stable".  I wasn't gearing up for a testing run, I was upgrading my production server to a "stable" dot-dot-two release, with a reasonable expectation that the kinks were worked out, and with no awareness that I could be signing up for data loss.  I was completely unprepared to deal with these issues, and my main goal was simply surviving.

    2 hours ago, Pauven said:

    but rather a restore flash drive from backup. 

    Ahh, OK.

     

    3 hours ago, JorgeB said:

    Pretty sure that won't be a general problem

    Just did a dual-disk rebuild on my work server using v6.9.2 without issues, which confirms it's not a general problem; I suspect it's what I wrote above.


    This appears to still be an issue for me.  Need help to move forward.

     

    Quick recap:  Last year I upgraded to 6.9.2 and had issues with Seagate IronWolf (actually Exos) drives, plus the issue described here.  I thought it was all related.  I ended up rolling back to 6.8.3, and the issues went away.

     

    A little over a month ago, 6.8.3 stopped working correctly for me, I believe due to an incompatible Unassigned Devices update.  About a week ago I decided to apply the Seagate drive fix (disabling EPC) and try upgrading to 6.9.2 again.  Everything seemed successful: multiple spin-ups/spin-downs, a record-fast parity check, and perfectly working GUI, Dockers and VMs, so I thought I was in the clear.
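For reference, the Seagate fix I mean is the EPC workaround discussed in the drive-disablement topics, applied with Seagate's openSeaChest tools.  A sketch of it - the device path here is just an example (check yours with --scan first), and it's shown as a dry run:

```shell
# Sketch of the commonly cited Seagate EPC workaround, using Seagate's
# openSeaChest tools.  /dev/sg5 is an example device path - list your
# drives first with:  openSeaChest_PowerControl --scan
DEV="/dev/sg5"   # example path, not necessarily yours

# Shown as a dry run; remove the echo to actually apply the change.
echo "openSeaChest_PowerControl -d $DEV --EPCfeature disable"
```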

     

    Which brings us to today.  Being the 1st of the month, the parity check kicked off at 2am.  When I woke up this morning, I found that parity check progress was stalled at 0.1% after 6+ hours, and several hours later I can confirm it's not moving.  In general, the GUI feels responsive, letting me browse around, but I noticed that the Dashboard presents no data, the drive temps don't appear to be updating, and the CPU/MB temp and fan speeds are wrong and frozen.

     

    I connected to the Terminal and ran an mdcmd status to see if the parity check was actually running, but the mdResyncPos is frozen at 9283712.  Best I can tell, it seems like Unraid is frozen, even though the GUI isn't hung.
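To make that kind of stall easier to spot, two mdResyncPos samples taken a few minutes apart can be compared.  A sketch, using captured status text for illustration (on a live box the samples would come from mdcmd status):

```shell
# Sketch: detect a stalled parity check/rebuild by comparing two
# mdResyncPos samples taken a few minutes apart.  On a live server the
# samples would come from `mdcmd status`; captured text is used here.
get_pos() {
  # extract the numeric mdResyncPos value from a status dump
  printf '%s\n' "$1" | grep -o 'mdResyncPos=[0-9]*' | cut -d= -f2
}

sample1="mdState=STARTED mdResyncPos=9283712"   # first sample
sample2="mdState=STARTED mdResyncPos=9283712"   # a few minutes later

if [ "$(get_pos "$sample1")" = "$(get_pos "$sample2")" ]; then
  echo "stalled at $(get_pos "$sample1")"
else
  echo "progressing"
fi
# prints: stalled at 9283712
```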

     

    First things first, I decided to run diagnostics.  An hour later, it still reads "Starting diagnostics collection...".

     

     

    On 4/29/2021 at 11:14 AM, JorgeB said:

    I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

     

    JorgeB is right.  I checked Unraid's System Log, and it is full of Call Trace errors:

     

    Apr  1 10:53:07 Tower nginx: 2022/04/01 10:53:07 [error] 9804#9804: *2157788 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.1.218, server: , request: "GET /Dashboard HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "tower", referrer: "http://tower/Main"
    Apr  1 10:53:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 10:53:19 Tower kernel: Call Trace:
    Apr  1 10:56:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 10:56:19 Tower kernel: Call Trace:
    Apr  1 10:59:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 10:59:19 Tower kernel: Call Trace:
    Apr  1 11:02:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 11:02:19 Tower kernel: Call Trace:
    Apr  1 11:05:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 11:05:19 Tower kernel: Call Trace:
    Apr  1 11:08:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 11:08:19 Tower kernel: Call Trace:

     

    Since running Diagnostics didn't work, I'm not sure what my next step should be.  Do I need to gather more info, or is the issue already confirmed as a Ryzen-related Linux kernel issue?  Are there any solutions?


    I've been searching the forum, trying to see if any other users have the same issue.  I do see plenty of call trace reports, but so far none have matched mine.

     

    My log just keeps repeating the same info over and over.  What I posted above was just the errors, here's the full detail for a complete error segment:

     

    Apr  1 13:50:20 Tower kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    Apr  1 13:50:20 Tower kernel: rcu: 	10-....: (2 GPs behind) idle=61e/1/0x4000000000000002 softirq=118394498/118394499 fqs=10418875 
    Apr  1 13:50:20 Tower kernel: 	(detected by 8, t=42541182 jiffies, g=291127741, q=55535900)
    Apr  1 13:50:20 Tower kernel: Sending NMI from CPU 8 to CPUs 10:
    Apr  1 13:50:20 Tower kernel: NMI backtrace for cpu 10
    Apr  1 13:50:20 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
    Apr  1 13:50:20 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Professional Gaming, BIOS P4.80 07/18/2018
    Apr  1 13:50:20 Tower kernel: RIP: 0010:mvs_slot_complete+0x31/0x45f [mvsas]
    Apr  1 13:50:20 Tower kernel: Code: 00 00 41 56 41 55 41 54 55 53 89 c3 48 6b cb 58 48 83 ec 18 89 44 24 10 83 c8 ff 89 74 24 14 4c 8d 34 0f 4d 8b be 08 fd 00 00 <4d> 85 ff 0f 84 16 04 00 00 49 83 bf e8 00 00 00 00 0f 84 08 04 00
    Apr  1 13:50:20 Tower kernel: RSP: 0018:ffffc900003c0e78 EFLAGS: 00000286
    Apr  1 13:50:20 Tower kernel: RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
    Apr  1 13:50:20 Tower kernel: RDX: 0000000000000000 RSI: 0000000000010000 RDI: ffff888138a80000
    Apr  1 13:50:20 Tower kernel: RBP: ffff888138a80000 R08: 0000000000000001 R09: ffffffffa02eda65
    Apr  1 13:50:20 Tower kernel: R10: 00000000d007f000 R11: ffff8881049a9800 R12: 0000000000000000
    Apr  1 13:50:20 Tower kernel: R13: 0000000000000000 R14: ffff888138a80000 R15: 0000000000000000
    Apr  1 13:50:20 Tower kernel: FS:  0000000000000000(0000) GS:ffff888fdee80000(0000) knlGS:0000000000000000
    Apr  1 13:50:20 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr  1 13:50:20 Tower kernel: CR2: 00000000002a925a CR3: 0000000281a36000 CR4: 00000000003506e0
    Apr  1 13:50:20 Tower kernel: Call Trace:
    Apr  1 13:50:20 Tower kernel: <IRQ>
    Apr  1 13:50:20 Tower kernel: mvs_int_rx+0x85/0xf1 [mvsas]
    Apr  1 13:50:20 Tower kernel: mvs_int_full+0x1e/0xa4 [mvsas]
    Apr  1 13:50:20 Tower kernel: mvs_94xx_isr+0x4d/0x60 [mvsas]
    Apr  1 13:50:20 Tower kernel: mvs_tasklet+0x87/0xa8 [mvsas]
    Apr  1 13:50:20 Tower kernel: tasklet_action_common.isra.0+0x66/0xa3
    Apr  1 13:50:20 Tower kernel: __do_softirq+0xc4/0x1c2
    Apr  1 13:50:20 Tower kernel: asm_call_irq_on_stack+0x12/0x20
    Apr  1 13:50:20 Tower kernel: </IRQ>
    Apr  1 13:50:20 Tower kernel: do_softirq_own_stack+0x2c/0x39
    Apr  1 13:50:20 Tower kernel: __irq_exit_rcu+0x45/0x80
    Apr  1 13:50:20 Tower kernel: common_interrupt+0x119/0x12e
    Apr  1 13:50:20 Tower kernel: asm_common_interrupt+0x1e/0x40
    Apr  1 13:50:20 Tower kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
    Apr  1 13:50:20 Tower kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
    Apr  1 13:50:20 Tower kernel: RSP: 0018:ffffc9000016fea0 EFLAGS: 00000246
    Apr  1 13:50:20 Tower kernel: RAX: ffff888fdeea2380 RBX: 0000000000000002 RCX: 000000000000001f
    Apr  1 13:50:20 Tower kernel: RDX: 0000000000000000 RSI: 00000000238d7f23 RDI: 0000000000000000
    Apr  1 13:50:20 Tower kernel: RBP: ffff888105d0d800 R08: 00028e38c0a3fe38 R09: 00028e3ab9ddf5c0
    Apr  1 13:50:20 Tower kernel: R10: 0000000000000045 R11: 071c71c71c71c71c R12: 00028e38c0a3fe38
    Apr  1 13:50:20 Tower kernel: R13: ffffffff820c8c40 R14: 0000000000000002 R15: 0000000000000000
    Apr  1 13:50:20 Tower kernel: cpuidle_enter_state+0x101/0x1c4
    Apr  1 13:50:20 Tower kernel: cpuidle_enter+0x25/0x31
    Apr  1 13:50:20 Tower kernel: do_idle+0x1a6/0x214
    Apr  1 13:50:20 Tower kernel: cpu_startup_entry+0x18/0x1a
    Apr  1 13:50:20 Tower kernel: secondary_startup_64_no_verify+0xb0/0xbb

     

    Comparing with others, and taking a closer look at the output in my log, I'm noticing a few too many [mvsas] related entries.  That's for my Marvell-based Highpoint 2760A 24-port SAS controller.
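A quick way to quantify "a few too many" is to tally the bracketed module names in the trace lines.  A sketch, run against a captured excerpt (on the server the input would be the syslog itself):

```shell
# Sketch: tally bracketed kernel-module names in call-trace lines to
# see which driver dominates.  A captured excerpt stands in for the
# real syslog here.
log='kernel: mvs_int_rx+0x85/0xf1 [mvsas]
kernel: mvs_int_full+0x1e/0xa4 [mvsas]
kernel: __do_softirq+0xc4/0x1c2'

# count occurrences of each [module] tag, most frequent first
printf '%s\n' "$log" | grep -o '\[[a-z0-9_]*\]' | sort | uniq -c | sort -rn
```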

     

    For years Fix Common Problems has been warning me about my Marvell-based controller, but I ignored those warnings because I'd never had any issues with it since I bought it in 2013.  Almost 9 years of trouble-free operation, all the way through 6.8.3.

     

    Maybe I'm jumping to conclusions and the issue is something else.  Can anyone tell?

    4 minutes ago, Pauven said:

    That's for my Marvell-based Highpoint 2760A 24-port SAS controller.

    Yep, it's the same driver as the SASLP and SAS2LP, and it's known to be problematic.  I would recommend replacing it with an LSI if that's a possibility.


    Thanks JorgeB.  I've followed your advice and ripped out the Highpoint 2760A.  I installed a couple Dell H310's, combined with 8 SATA ports on my motherboard, to get back to 24 ports.

     

    So far it's been smooth sailing, but my Call Trace problems don't usually crop up for a couple weeks, so I'm not in the clear yet.  Fingers crossed.

    Edited by Pauven



