mattekure Posted November 13, 2019
I have had multiple hard crashes over the last 2 days. After rebooting, I am seeing parity errors, and my parity check speed is extraordinarily slow: it started at 150 MB/s and has now dropped to 4 MB/s. I have attached diagnostics, but I am at a loss.
tower-diagnostics-20191113-1312.zip
JorgeB Posted November 13, 2019
Unraid crashed during the parity check:
Nov 13 08:00:42 Tower kernel: RIP: 0010:handle_stripe+0x634/0x12ed [md_mod]
This is very unusual and could point to a hardware issue. Try running memtest.
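For anyone following along, here is a minimal sketch of how lines like the RIP entry quoted above could be pulled out of a saved syslog. The marker list and the function name `find_crash_lines` are assumptions made for illustration, not anything Unraid ships, and the second sample line is invented purely to show a non-matching entry.

```python
import re

# Illustrative markers only: they match the kinds of kernel lines quoted in this thread.
CRASH_MARKERS = re.compile(r"kernel: (BUG:|Oops:|RIP: 0010:|.*segfault at)")


def find_crash_lines(lines):
    """Return the syslog lines that contain one of the crash markers."""
    return [line.rstrip("\n") for line in lines if CRASH_MARKERS.search(line)]


sample = [
    # Copied from the post above.
    "Nov 13 08:00:42 Tower kernel: RIP: 0010:handle_stripe+0x634/0x12ed [md_mod]",
    # Hypothetical unrelated line, made up for this example.
    "Nov 13 08:00:43 Tower emhttpd: some unrelated message",
]

hits = find_crash_lines(sample)
for line in hits:
    print(line)
```

To run it against a real log, pass an open file object instead of `sample`; the location of the syslog inside the diagnostics zip is not stated in this thread, so the path is left to the reader.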
mattekure Posted November 13, 2019 Author
Ok, rebooting to run memtest now.
mattekure Posted November 13, 2019 Author
3 hrs in. 2 full passes of memtest86 have completed with 0 errors.
JorgeB Posted November 13, 2019
Running memtest for 24 hours is recommended; even then, the result is only conclusive if errors are detected. But you can try rebooting and running another parity check. If the same thing happens, it's likely hardware related, but possibly not bad RAM.
mattekure Posted November 13, 2019 Author
Ok. I'll leave it running overnight and check back tomorrow. Thank you for your assistance with this.
mattekure Posted November 14, 2019 Author
I let memtest86 run for about 18 hrs, then got up in the middle of the night and rebooted, starting a parity check. This time the parity check did not appear to slow down like it did last time. I did set up a remote syslog server, so I will have a copy if it does hard crash again. I'm not sure what else I can do until it happens again.
tower-diagnostics-20191114-0731.zip
John_M Posted November 14, 2019
You have a Marvell-based disk controller:
03:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller [1b4b:9235] (rev 11)
It's based on the 9235 chip, which seems much less troublesome than the RAID-enabled 9230, which has caused a lot of problems recently for a number of people. I've actually got the 9235 in one of my servers and it has never caused me any problems, but different people's experiences vary so much. Interestingly, the 9235 isn't included on the original list, while the 9230 is:
mattekure Posted November 14, 2019 Author
Yes, I purchased it several years ago, before reading about the issues it can cause. I have been using that card since 2017 without issue, but I realize the potential for problems now. I plan on replacing it with an LSI Logic SAS 9207-8i. I purchased one off eBay, but it was faulty, so I had to return it. I will purchase another and swap the card out.
mattekure Posted November 15, 2019 Author
It crashed again last night, sometime between 11 p.m. and 12 a.m. Unfortunately, my remote syslog didn't capture anything. Everything was going fine, with regular messages up to 11, then nothing after. Any thoughts as to what to test next?
John_M Posted November 15, 2019
10 minutes ago, mattekure said: Any thoughts as to what to test next?
Try not using the Marvell controller.
mattekure Posted November 15, 2019 Author
Not an option for right now. I have the replacement ordered, but it won't arrive until next week.
mattekure Posted November 15, 2019 Author
Parity check speed is dropping rapidly to nothing. I just noticed this in the log:
Nov 15 07:56:25 Tower kernel: nginx[8137]: segfault at 200000000010 ip 00000000004247a3 sp 00007ffd0e5d1f90 error 4 in nginx[420000+101000]
Nov 15 07:56:25 Tower kernel: Code: 1f 84 00 00 00 00 00 48 8b 7b 08 48 85 ff 74 05 e8 62 c6 ff ff 48 8b 1b 48 85 db 75 ea 48 8b 5d 10 eb 0b 0f 1f 40 00 48 89 dd <48> 8b 5b 10 48 89 ef e8 41 c6 ff ff 48 85 db 75 ec 48 83 c4 08 5b
Nov 15 07:56:25 Tower nginx: 2019/11/15 07:56:25 [alert] 8136#8136: worker process 8137 exited on signal 11
EDIT: I'm fairly sure this segfault caused the parity check speed to drop to nothing. I stopped and restarted the parity check, and speeds jumped right back up to 150 MB/s.
tower-diagnostics-20191115-1259.zip
Edited November 15, 2019 by mattekure
mattekure Posted November 15, 2019 Author
Ugh, this is getting frustrating. It crashed again only 10% of the way into a parity check. I had the GUI up on a monitor with the log showing, and nothing came up. The GUI is totally frozen, won't accept mouse/keyboard input, and the network is down. So I am down to some kind of hardware issue. Power? Mobo/CPU?
mattekure Posted November 15, 2019 Author
Nov 15 09:46:53 Tower kernel: BUG: unable to handle kernel paging request at 0000200000000010
Nov 15 09:46:53 Tower kernel: PGD 0 P4D 0
Nov 15 09:46:53 Tower kernel: Oops: 0000 [#1] SMP PTI
Nov 15 09:46:53 Tower kernel: CPU: 8 PID: 13121 Comm: sensors Tainted: P O 4.19.56-Unraid #1
Nov 15 09:46:53 Tower kernel: Hardware name: MSI MS-7885/X99A SLI PLUS(MS-7885), BIOS 1.E0 06/15/2018
Nov 15 09:46:53 Tower kernel: RIP: 0010:__lookup_mnt+0x3e/0x5a
Nov 15 09:46:53 Tower kernel: Code: 48 01 d0 48 89 c2 48 d3 ea 48 01 d0 48 8b 15 a5 c2 d4 00 23 05 b3 c2 d4 00 48 8d 04 c2 48 8b 10 31 c0 48 85 d2 74 1e 48 89 d0 <48> 8b 48 10 48 8d 51 20 48 39 d7 75 06 48 39 70 18 74 08 48 8b 00
Nov 15 09:46:53 Tower kernel: RSP: 0018:ffffc90020ae7c08 EFLAGS: 00010206
Nov 15 09:46:53 Tower kernel: RAX: 0000200000000000 RBX: ffffc90020ae7ca8 RCX: ffffa8889bcae180
Nov 15 09:46:53 Tower kernel: RDX: ffffa8889bcae1a0 RSI: ffff88889f0613c0 RDI: ffff88889bcae1a0
Nov 15 09:46:53 Tower kernel: RBP: ffffc90020ae7d60 R08: 61c8864680b583eb R09: 000000005eef0496
Nov 15 09:46:53 Tower kernel: R10: ffffc90020ae7c4c R11: fffffffffc5cec29 R12: ffffc90020ae7ca0
Nov 15 09:46:53 Tower kernel: R13: ffffc90020ae7c9c R14: 0000000000000001 R15: 0000000000200000
Nov 15 09:46:53 Tower kernel: FS: 00001494da9d3740(0000) GS:ffff88889fa00000(0000) knlGS:0000000000000000
Nov 15 09:46:53 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 15 09:46:53 Tower kernel: CR2: 0000200000000010 CR3: 0000000809ca4001 CR4: 00000000001606e0
Nov 15 09:46:53 Tower kernel: Call Trace:
Nov 15 09:46:53 Tower kernel: __follow_mount_rcu+0x56/0xc0
Nov 15 09:46:53 Tower kernel: lookup_fast+0xfa/0x27a
Nov 15 09:46:53 Tower kernel: walk_component+0xc2/0x249
Nov 15 09:46:53 Tower kernel: ? link_path_walk.part.8+0x1ed/0x42d
Nov 15 09:46:53 Tower kernel: path_lookupat.isra.10+0x12c/0x1e7
Nov 15 09:46:53 Tower kernel: filename_lookup.part.18+0x69/0xcc
Nov 15 09:46:53 Tower kernel: ? _cond_resched+0x1b/0x1e
Nov 15 09:46:53 Tower kernel: ? kmem_cache_alloc+0x30/0xf3
Nov 15 09:46:53 Tower kernel: ? getname_flags+0x44/0x14c
Nov 15 09:46:53 Tower kernel: user_statfs+0x3d/0x93
Nov 15 09:46:53 Tower kernel: __se_sys_statfs+0x20/0x4c
Nov 15 09:46:53 Tower kernel: ? handle_mm_fault+0x158/0x1a7
Nov 15 09:46:53 Tower kernel: ? __do_page_fault+0x379/0x40b
Nov 15 09:46:53 Tower kernel: do_syscall_64+0x57/0xf2
Nov 15 09:46:53 Tower kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 15 09:46:53 Tower kernel: RIP: 0033:0x1494dac27027
Nov 15 09:46:53 Tower kernel: Code: 44 00 00 48 8b 05 69 8e 0d 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 8e 0d 00 f7 d8 64 89 01 48
Nov 15 09:46:53 Tower kernel: RSP: 002b:00007ffeed1b9178 EFLAGS: 00000206 ORIG_RAX: 0000000000000089
Nov 15 09:46:53 Tower kernel: RAX: ffffffffffffffda RBX: 00007ffeed1b9498 RCX: 00001494dac27027
Nov 15 09:46:53 Tower kernel: RDX: 00001494dad03000 RSI: 00007ffeed1b9180 RDI: 00001494dad292a0
Nov 15 09:46:53 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Nov 15 09:46:53 Tower kernel: R10: 0000000000000005 R11: 0000000000000206 R12: 0000000000000000
Nov 15 09:46:53 Tower kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Nov 15 09:46:53 Tower kernel: Modules linked in: xfs md_mod nct6775 hwmon_vid nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel drm_kms_helper kvm drm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc mxm_wmi aesni_intel aes_x86_64 crypto_simd cryptd agpgart e1000e i2c_i801 i2c_core glue_helper intel_cstate syscopyarea sysfillrect sysimgblt fb_sys_fops intel_uncore ahci pcc_cpufreq libahci wmi intel_rapl_perf button
Nov 15 09:46:53 Tower kernel: CR2: 0000200000000010
Nov 15 09:46:53 Tower kernel: ---[ end trace 805b7d055c559bcf ]---
Nov 15 09:46:53 Tower kernel: RIP: 0010:__lookup_mnt+0x3e/0x5a
Nov 15 09:46:53 Tower kernel: Code: 48 01 d0 48 89 c2 48 d3 ea 48 01 d0 48 8b 15 a5 c2 d4 00 23 05 b3 c2 d4 00 48 8d 04 c2 48 8b 10 31 c0 48 85 d2 74 1e 48 89 d0 <48> 8b 48 10 48 8d 51 20 48 39 d7 75 06 48 39 70 18 74 08 48 8b 00
Nov 15 09:46:53 Tower kernel: RSP: 0018:ffffc90020ae7c08 EFLAGS: 00010206
Nov 15 09:46:53 Tower kernel: RAX: 0000200000000000 RBX: ffffc90020ae7ca8 RCX: ffffa8889bcae180
Nov 15 09:46:53 Tower kernel: RDX: ffffa8889bcae1a0 RSI: ffff88889f0613c0 RDI: ffff88889bcae1a0
Nov 15 09:46:53 Tower kernel: RBP: ffffc90020ae7d60 R08: 61c8864680b583eb R09: 000000005eef0496
Nov 15 09:46:53 Tower kernel: R10: ffffc90020ae7c4c R11: fffffffffc5cec29 R12: ffffc90020ae7ca0
Nov 15 09:46:53 Tower kernel: R13: ffffc90020ae7c9c R14: 0000000000000001 R15: 0000000000200000
Nov 15 09:46:53 Tower kernel: FS: 00001494da9d3740(0000) GS:ffff88889fa00000(0000) knlGS:0000000000000000
Nov 15 09:46:53 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 15 09:46:53 Tower kernel: CR2: 0000200000000010 CR3: 0000000809ca4001 CR4: 00000000001606e0
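As a side note on the numbers in these traces, the fault addresses can be compared bit by bit. The sketch below is only arithmetic on values copied from the posts in this thread (the nginx segfault address, and CR2, RAX, RDX and RDI from the kernel oops); the helper name `flipped_bits` is made up for illustration. It shows that the nginx fault address and the kernel CR2 are the same value, that RAX has only bit 45 set, and that RDX and RDI differ only in bit 45. A repeated single-bit pattern like this is often seen with hardware faults, but the arithmetic alone does not say which component is responsible.

```python
def flipped_bits(a, b):
    """Return the bit positions at which the integers a and b differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Values copied from the log excerpts in this thread.
nginx_fault = 0x200000000010       # "segfault at 200000000010"
kernel_cr2 = 0x0000200000000010    # CR2 in the kernel oops
rax = 0x0000200000000000           # RAX in the kernel oops
rdx = 0xFFFFA8889BCAE1A0           # RDX in the kernel oops
rdi = 0xFFFF88889BCAE1A0           # RDI in the kernel oops

print(nginx_fault == kernel_cr2)   # same faulting address in user space and kernel
print(hex(kernel_cr2 - rax))       # offset of the faulting access from RAX
print(flipped_bits(rax, 0))        # bits set in RAX compared with a NULL pointer
print(flipped_bits(rdx, rdi))      # bits in which RDX and RDI differ
```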
mattekure Posted November 16, 2019 Author
Well, the LSI 9207-8i shipped faster than expected. I've replaced the Marvell controller, reconnected everything and am running a parity check now.
mattekure Posted November 16, 2019 Author
Swapping out the Marvell for the 9207 didn't fix anything; it still crashed partway through the parity check.