[6.9.2] Ironwolf Drive Disablement and Dual Parity Rebuild Hangs

JorgeB · April 29, 2021

Pretty sure that won't be a general problem, but I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

Quote

really wish there was a restore function

There already is one:

Pauven · April 29, 2021

26 minutes ago, JorgeB said:

There already is one:

Not the restore Unraid version feature (which I used) but rather a restore flash drive from backup. I had to manually copy some config files from the flash drive backup to get 6.8.3 working correctly. It took me a while to figure out which files needed restoring. Some type of automation here would have been nice. Really cool if it was integrated into the restore Unraid version feature - it could prompt to optionally restore certain files from an existing flash drive backup.

28 minutes ago, JorgeB said:

I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

That could certainly be the issue. But no way I'm going back to 6.9.2 on my production server to gather diags once it fails. I'm still 4 hours away from a full recovery, and I'm not into S&M. I know it's my personal perspective, but I feel that if 6.9.x has issues as bad as this, it shouldn't be considered "stable". I wasn't gearing up for a testing run. I was upgrading my production server to a "stable" dot-dot-two release, with a reasonable expectation that the kinks were worked out, and with no awareness that I could be signing up for data loss. I was completely unprepared to deal with these issues, and my main goal was simply surviving.

JorgeB · April 29, 2021

2 hours ago, Pauven said:

but rather a restore flash drive from backup.

Ahh, OK.

3 hours ago, JorgeB said:

Pretty sure that won't be a general problem

Just did a dual disk rebuild on my work server using v6.9.2 without issues, so that confirms it's not a general problem. I suspect it's what I wrote above.

Pauven · April 1, 2022

This appears to still be an issue for me. Need help to move forward.

Quick recap: Last year I upgraded to 6.9.2 and had issues with Seagate IronWolf (actually Exos) drives, plus the issue described here. I thought it was all related. I ended up rolling back to 6.8.3, and the issues went away.

A little over a month ago, 6.8.3 stopped working correctly for me, I believe due to an incompatible Unassigned Devices update. About a week ago I decided to apply the Seagate drive fix (disabling EPC) and try upgrading to 6.9.2 again. I thought everything was successful. After multiple spin-ups/spin-downs, a record-fast parity check, and a perfectly working GUI, Dockers and VMs, I thought I was in the clear.

Which brings us to today. Being the 1st of the month, the parity check kicked off at 2am. When I woke up this morning, I found that parity check progress was stalled at 0.1% after 6+ hours, and several hours later I can confirm it's not moving. In general, the GUI feels responsive, letting me browse around, but I noticed that the Dashboard presents no data, the drive temps don't appear to be updating, and the CPU/MB temp and fan speeds are wrong and frozen.

I connected to the Terminal and ran an mdcmd status to see if the parity check was actually running, but the mdResyncPos is frozen at 9283712. Best I can tell, it seems like Unraid is frozen, even though the GUI isn't hung.
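For anyone wanting to repeat that check without eyeballing the numbers, here is a minimal sketch in Python. It assumes the text printed by mdcmd status is a list of key=value lines, and that you have saved two snapshots of it a few minutes apart. The function names are made up for illustration, and the mdState line in the sample is an assumed placeholder; only the mdResyncPos value comes from the post above.

```python
def parse_mdcmd_status(text):
    """Parse key=value lines into a dict, ignoring lines without '='."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            status[key] = value
    return status


def resync_stalled(earlier, later):
    """Return True if mdResyncPos did not advance between two snapshots."""
    before = int(parse_mdcmd_status(earlier)["mdResyncPos"])
    after = int(parse_mdcmd_status(later)["mdResyncPos"])
    return after <= before


# Hypothetical snapshots taken a few minutes apart.
# Only mdResyncPos=9283712 is taken from the post; mdState is illustrative.
snapshot_a = "mdState=STARTED\nmdResyncPos=9283712\n"
snapshot_b = "mdState=STARTED\nmdResyncPos=9283712\n"

print(resync_stalled(snapshot_a, snapshot_b))
```

If the position is identical (or lower) in the later snapshot, the check is not progressing, which matches what was seen here.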

First things first, I decided to run diagnostics. An hour later, it still reads "Starting diagnostics collection...".

On 4/29/2021 at 11:14 AM, JorgeB said:

I've seen multiple Ryzen users with issues completing a parity check due to various call traces on v6.9.x, probably something to do with the new kernel and the Unraid driver, but without the diags from when it crashed it's just a guess.

JorgeB is right. I checked Unraid's System Log, and it is full of Call Trace errors:

Apr  1 10:53:07 Tower nginx: 2022/04/01 10:53:07 [error] 9804#9804: *2157788 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.1.218, server: , request: "GET /Dashboard HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "tower", referrer: "http://tower/Main"
Apr  1 10:53:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 10:53:19 Tower kernel: Call Trace:
Apr  1 10:56:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 10:56:19 Tower kernel: Call Trace:
Apr  1 10:59:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 10:59:19 Tower kernel: Call Trace:
Apr  1 11:02:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 11:02:19 Tower kernel: Call Trace:
Apr  1 11:05:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 11:05:19 Tower kernel: Call Trace:
Apr  1 11:08:19 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 11:08:19 Tower kernel: Call Trace:
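
The traces in the excerpt above arrive on a fixed schedule. A small Python sketch for measuring the gap between consecutive "Call Trace:" lines follows. It assumes the usual syslog timestamp layout shown in the excerpt, which carries no year, so the year parameter is a stand-in, and the function name is hypothetical.

```python
from datetime import datetime


def call_trace_intervals(log_text, year=2022):
    """Return the seconds between consecutive kernel 'Call Trace:' lines."""
    times = []
    for line in log_text.splitlines():
        if line.rstrip().endswith("kernel: Call Trace:"):
            # Timestamp is "Mon D HH:MM:SS"; split() absorbs the padded day.
            month, day, clock = line.split()[:3]
            stamp = f"{year} {month} {day} {clock}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Three lines copied from the log excerpt above.
sample = """Apr  1 10:53:19 Tower kernel: Call Trace:
Apr  1 10:56:19 Tower kernel: Call Trace:
Apr  1 10:59:19 Tower kernel: Call Trace:"""

print(call_trace_intervals(sample))  # [180, 180]
```

A constant 180-second gap suggests a periodic kernel warning repeating for the same stuck CPU rather than a series of unrelated crashes.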

Since running Diagnostics didn't work, I'm not sure what my next step should be. Do I need to gather more info, or is the issue already confirmed as a Ryzen-related Linux kernel issue? Are there any solutions?

Pauven · April 1, 2022

I've been searching the forum, trying to see if any other users have the same issue. I do see plenty of call trace reports, but so far none have matched mine.

My log just keeps repeating the same info over and over. What I posted above was just the errors; here's the full detail for a complete error segment:

Apr  1 13:50:20 Tower kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Apr  1 13:50:20 Tower kernel: rcu: 	10-....: (2 GPs behind) idle=61e/1/0x4000000000000002 softirq=118394498/118394499 fqs=10418875 
Apr  1 13:50:20 Tower kernel: 	(detected by 8, t=42541182 jiffies, g=291127741, q=55535900)
Apr  1 13:50:20 Tower kernel: Sending NMI from CPU 8 to CPUs 10:
Apr  1 13:50:20 Tower kernel: NMI backtrace for cpu 10
Apr  1 13:50:20 Tower kernel: CPU: 10 PID: 0 Comm: swapper/10 Tainted: G S                5.10.28-Unraid #1
Apr  1 13:50:20 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Professional Gaming, BIOS P4.80 07/18/2018
Apr  1 13:50:20 Tower kernel: RIP: 0010:mvs_slot_complete+0x31/0x45f [mvsas]
Apr  1 13:50:20 Tower kernel: Code: 00 00 41 56 41 55 41 54 55 53 89 c3 48 6b cb 58 48 83 ec 18 89 44 24 10 83 c8 ff 89 74 24 14 4c 8d 34 0f 4d 8b be 08 fd 00 00 <4d> 85 ff 0f 84 16 04 00 00 49 83 bf e8 00 00 00 00 0f 84 08 04 00
Apr  1 13:50:20 Tower kernel: RSP: 0018:ffffc900003c0e78 EFLAGS: 00000286
Apr  1 13:50:20 Tower kernel: RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
Apr  1 13:50:20 Tower kernel: RDX: 0000000000000000 RSI: 0000000000010000 RDI: ffff888138a80000
Apr  1 13:50:20 Tower kernel: RBP: ffff888138a80000 R08: 0000000000000001 R09: ffffffffa02eda65
Apr  1 13:50:20 Tower kernel: R10: 00000000d007f000 R11: ffff8881049a9800 R12: 0000000000000000
Apr  1 13:50:20 Tower kernel: R13: 0000000000000000 R14: ffff888138a80000 R15: 0000000000000000
Apr  1 13:50:20 Tower kernel: FS:  0000000000000000(0000) GS:ffff888fdee80000(0000) knlGS:0000000000000000
Apr  1 13:50:20 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  1 13:50:20 Tower kernel: CR2: 00000000002a925a CR3: 0000000281a36000 CR4: 00000000003506e0
Apr  1 13:50:20 Tower kernel: Call Trace:
Apr  1 13:50:20 Tower kernel: <IRQ>
Apr  1 13:50:20 Tower kernel: mvs_int_rx+0x85/0xf1 [mvsas]
Apr  1 13:50:20 Tower kernel: mvs_int_full+0x1e/0xa4 [mvsas]
Apr  1 13:50:20 Tower kernel: mvs_94xx_isr+0x4d/0x60 [mvsas]
Apr  1 13:50:20 Tower kernel: mvs_tasklet+0x87/0xa8 [mvsas]
Apr  1 13:50:20 Tower kernel: tasklet_action_common.isra.0+0x66/0xa3
Apr  1 13:50:20 Tower kernel: __do_softirq+0xc4/0x1c2
Apr  1 13:50:20 Tower kernel: asm_call_irq_on_stack+0x12/0x20
Apr  1 13:50:20 Tower kernel: </IRQ>
Apr  1 13:50:20 Tower kernel: do_softirq_own_stack+0x2c/0x39
Apr  1 13:50:20 Tower kernel: __irq_exit_rcu+0x45/0x80
Apr  1 13:50:20 Tower kernel: common_interrupt+0x119/0x12e
Apr  1 13:50:20 Tower kernel: asm_common_interrupt+0x1e/0x40
Apr  1 13:50:20 Tower kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
Apr  1 13:50:20 Tower kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
Apr  1 13:50:20 Tower kernel: RSP: 0018:ffffc9000016fea0 EFLAGS: 00000246
Apr  1 13:50:20 Tower kernel: RAX: ffff888fdeea2380 RBX: 0000000000000002 RCX: 000000000000001f
Apr  1 13:50:20 Tower kernel: RDX: 0000000000000000 RSI: 00000000238d7f23 RDI: 0000000000000000
Apr  1 13:50:20 Tower kernel: RBP: ffff888105d0d800 R08: 00028e38c0a3fe38 R09: 00028e3ab9ddf5c0
Apr  1 13:50:20 Tower kernel: R10: 0000000000000045 R11: 071c71c71c71c71c R12: 00028e38c0a3fe38
Apr  1 13:50:20 Tower kernel: R13: ffffffff820c8c40 R14: 0000000000000002 R15: 0000000000000000
Apr  1 13:50:20 Tower kernel: cpuidle_enter_state+0x101/0x1c4
Apr  1 13:50:20 Tower kernel: cpuidle_enter+0x25/0x31
Apr  1 13:50:20 Tower kernel: do_idle+0x1a6/0x214
Apr  1 13:50:20 Tower kernel: cpu_startup_entry+0x18/0x1a
Apr  1 13:50:20 Tower kernel: secondary_startup_64_no_verify+0xb0/0xbb

Comparing with others, and taking a closer look at the output in my log, I'm noticing a few too many [mvsas] related entries. That's the driver for my Marvell-based Highpoint 2760A 24-port SAS controller.
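
A rough way to put a number on that impression is to tally the module tags on the stack frames. The Python sketch below assumes frames look like symbol+0xOFF/0xLEN [module], as in the trace above; frames with no bracketed tag belong to the core kernel and are skipped. The function name and regex are illustrative, not part of any Unraid tool.

```python
import re
from collections import Counter

# Matches a frame such as "mvs_int_rx+0x85/0xf1 [mvsas]" and captures the module.
FRAME_MODULE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\s*$")


def modules_in_trace(log_text):
    """Count how many stack frames belong to each loadable kernel module."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME_MODULE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three frames copied from the trace above.
sample = """Apr  1 13:50:20 Tower kernel: mvs_int_rx+0x85/0xf1 [mvsas]
Apr  1 13:50:20 Tower kernel: mvs_tasklet+0x87/0xa8 [mvsas]
Apr  1 13:50:20 Tower kernel: __do_softirq+0xc4/0x1c2"""

print(modules_in_trace(sample))  # Counter({'mvsas': 2})
```

A module that dominates the interrupt portion of every trace is a reasonable first suspect, though the count alone does not prove the driver is at fault.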

For years Fix Common Problems has been warning me about my Marvell-based controller, but I ignored those warnings since I've never had any issues with it since I bought it in 2013. Almost 9 years of trouble-free operation all the way through 6.8.3.

Maybe I'm jumping to conclusions and the issue is something else. Can anyone tell?

JorgeB · April 1, 2022

4 minutes ago, Pauven said:

That's for my Marvell-based Highpoint 2760A 24-port SAS controller.

Yep, same driver as the SASLP and SAS2LP, and known to be problematic. I would recommend replacing it with an LSI if that's a possibility.

Pauven · April 9, 2022

Thanks JorgeB. I've followed your advice and ripped out the Highpoint 2760A. I installed a couple of Dell H310s, combined with 8 SATA ports on my motherboard, to get back to 24 ports.

So far it's been smooth sailing, but my Call Trace problems don't usually crop up for a couple weeks, so I'm not in the clear yet. Fingers crossed.

Edited April 9, 2022 by Pauven
