Alex.vision Posted February 2, 2020

Hello fellow UnRaiders,

Issue
My machine locks up randomly and never runs for more than 2 or 3 days. I originally thought it was related to my pi-hole docker, since the problem seemed to occur more frequently when I was running docker. However, with docker disabled my machine still crashes. When it locks up I can't ping the system or reach any share. When I had the GUI enabled, it would be completely frozen at the login screen. A few times I had been logged into the system running htop full screen, and that was frozen too, requiring a hard reset. Stability is so bad that I can start the computer in the morning and within a few hours it has locked up.

Machine Specs
MB: ASUS Prime X470-Pro (BIOS version 5406)
CPU: AMD Ryzen 5 3600X (stock speeds, stock cooler)
RAM: G.SKILL Ripjaws V Series 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3600
PSU: CORSAIR RM Series, RM850, 850 Watt
HBA: AOC-SAS2LP-MV8 (I know it's a Marvell based card; I'm swapping it for another LSI I have)
External DAS card: LSI SAS 9212-4I4E Single, connected to a 16-bay DAS
Drives: 22 8TB data drives, 1 8TB parity and 1 8TB cache drive, mostly Seagate, some Western Digital
GPU: Gigabyte Video Card Graphics Cards GV-R523D3-1GL REV2.0

Attempted Fixes
Ran in safe mode with no plugins, docker disabled, and VMs disabled.
Disabled HT in the BIOS.
Disabled any extra motherboard ports, like serial and floppy support.
Ran without the array started.
Ran in GUI mode, GUI safe mode, no GUI, and headless.
Ran Fix Common Problems; other than the complaint about the Marvell card, all seems well. No obvious issues.
I haven't been able to run Memtest86 on this new board. For some reason it won't launch; the machine instantly reboots after I select it in the UnRaid boot menu.

Unraid Versions
I don't keep the best records of which versions work and which have problems. I can't remember what version I was on when this problem started after I built the thing. I do know I'm on 6.8.2 as of today; I was on 6.7.2 and worked my way up through each RC version to 6.8.2. I had crashes on each version, but I only went to 6.8.2 today, so it may or may not last. If (when) I crash on 6.8.2 I will post that syslog too.

Final Thoughts
I'm attaching two photos of what displays on my monitor when the system locks up. I'm also attaching two diagnostic files: one from just after rebooting from a lockup (6.8.1), and a second from after my update to 6.8.2. I'm not expecting either log to have any big revelations; I think the photos of the screen might lead me in a better direction, if I knew what they meant.

I'm at a complete loss here and would love it if someone could help me diagnose my problem.

Alex.Vision Log 6.8.1 post Crash 1.zip
Alex.Vision Log 6.8.2.zip
Squid Posted February 2, 2020

3 minutes ago, Alex.vision said:
I haven't been able to run Memtest86 on this new board, for some reason it won't launch, it instantly reboots after I select it in the UnRaid boot menu.

Memtest won't boot in UEFI boot mode. You have to change the BIOS boot order to boot from the USB in Legacy (normal) mode for memtest to work.
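For anyone unsure which mode their server is currently booted in, here is a minimal sketch in Python. It relies on the fact that a Linux kernel booted through UEFI exposes the directory /sys/firmware/efi, while a Legacy/CSM boot does not. The function name `booted_via_uefi` is made up for illustration, and Python may not be installed on a stock Unraid system, so treat this as an assumption-laden example rather than an official check.

```python
import os


def booted_via_uefi(efi_dir="/sys/firmware/efi"):
    """Return True if the running Linux system was booted in UEFI mode.

    The kernel creates /sys/firmware/efi only on UEFI boots; on a
    Legacy/CSM boot the directory is absent. `efi_dir` is a parameter
    only so the check can be exercised against another path.
    """
    return os.path.isdir(efi_dir)


if __name__ == "__main__":
    if booted_via_uefi():
        print("UEFI boot: switch the USB to Legacy boot before selecting Memtest")
    else:
        print("Legacy boot: the Memtest entry in the boot menu should start")
```

The same check can be done by hand by looking for that directory from a terminal.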
Alex.vision (Author) Posted February 2, 2020

Just now, Squid said:
Memtest won't boot via UEFI boot mode. You have to change the BIOS / boot order to boot via Legacy / normal from the USB for memtest to work.

Ah, OK. I'll reboot and start a Memtest cycle.
JorgeB Posted February 3, 2020

10 hours ago, Alex.vision said:
Ram: G.SKILL Ripjaws V Series 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3600

Don't overclock your RAM. It's known to cause stability issues with some Ryzen systems, even data corruption. Respect the max speed for your config:

[image: chart of officially supported RAM speeds by configuration]

Also look for "Power Supply Idle Control" (or similar) in the BIOS and set it to "typical current idle" (or similar).
Alex.vision (Author) Posted February 3, 2020

6 hours ago, johnnie.black said:
Don't overclock your RAM, it's known to cause stability issues with some Ryzen systems, even data corruption, respect max speed based on config: Also look for "Power Supply Idle Control" (or similar) on the BIOS and set it to "typical current idle" (or similar)

johnnie.black,

Thanks for the help. I'll change the RAM settings and look for the power setting when I get done with work. I would have done it this morning, but I wanted to let memtest run for a good 24 hours. Could you tell me where that RAM chart came from? I guess I thought that because it was 3600 speed RAM it wasn't really overclocked, even though I had to use XMP or DOCP. I think I remember the old days when it said 2667, 3200(OC), 3600(OC) on the box.

I'll do some googling on the "typical current idle" setting too. Thanks
JorgeB Posted February 3, 2020

4 minutes ago, Alex.vision said:
Could you tell me where that Ram chart came from?

Some website that was reviewing 3rd gen Ryzen; I don't remember exactly which, but those are the officially supported speeds. Anything above that is an overclock.
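One way to see what speed the DIMMs are actually running at is to read the "Configured Memory Speed" lines from `dmidecode -t memory` (older dmidecode releases label the field "Configured Clock Speed"). The Python sketch below parses that output and lists any DIMM running above a limit you supply. The function names are hypothetical, and `supported_max` is an assumption you have to fill in yourself from the supported-speed chart for your CPU and DIMM layout; the value in the usage note is only a placeholder.

```python
import re

# Field names as printed by `dmidecode -t memory`. Older dmidecode releases
# print "Configured Clock Speed" instead of "Configured Memory Speed".
# Empty slots report "Unknown" and are skipped because no digits follow.
CONFIGURED = re.compile(
    r"^\s*Configured (?:Memory|Clock) Speed:\s*(\d+)\s*(?:MT/s|MHz)",
    re.MULTILINE,
)


def configured_speeds(dmidecode_text):
    """Return the configured speed of every populated DIMM, in MT/s."""
    return [int(value) for value in CONFIGURED.findall(dmidecode_text)]


def overclocked_dimms(dmidecode_text, supported_max):
    """Return the configured speeds that exceed `supported_max`.

    `supported_max` is NOT known to this code; take it from the
    officially supported speed for your CPU and DIMM configuration.
    """
    return [s for s in configured_speeds(dmidecode_text) if s > supported_max]
```

Usage, assuming the dmidecode output was saved to a file first: `overclocked_dimms(open("memory.txt").read(), supported_max=3200)`, where 3200 is a placeholder to be replaced with the chart value for your setup. An empty list means no DIMM is configured above that limit.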
Alex.vision (Author) Posted February 6, 2020

Well, it has been a few days, and I have been walking on eggshells when it comes to running my server. It seems that either turning off the RAM overclock or changing the Power Supply Idle Control has remedied my random reboot issue. Thank you @johnnie.black and @Squid for the troubleshooting help. I have 2 days and 16 hours of uptime, which is way better than it has been in the past. I still have docker disabled, but I'll work things back in one step at a time. A functioning system minus a few features is way better than no system at all.

Thanks for the help!!
Alex.vision (Author) Posted March 2, 2020

OK, round two. I made the suggested changes from above, and I also swapped my Marvell based HBA for an LSI model. I also upgraded to version 6.8.2, with the same results. I had thought the problem might have been solved, but for the past week and a half my server has really struggled to maintain any uptime. It locks up constantly, but I think I finally managed to get a log file that might help.

In my last attempt to fix the problem myself, I erased my flash drive and started over, importing just a few settings from the old one. I thought maybe something on it was causing problems. Apparently not.

I'm uploading two log files, one from mid lockup and one from after I hard reset the system. I can see a bunch of information starting on line 2810, but I don't know enough about Linux to say whether it really is a bug or not. I really hope this is something I can fix.

alex.vision (crash) media-syslog-20200227-0511.zip
alex.vision (next boot) media-syslog-20200227-1636.zip
JorgeB Posted March 3, 2020

Difficult to say for sure, but the crash looks hardware related; at least it doesn't point to any specific driver/module.
Alex.vision (Author) Posted March 4, 2020

Hmm, OK. I have an identical system, so I can try swapping all my drives over to that one to narrow it down to a hardware fault or a hardware incompatibility, if that seems like a viable option.
Alex.vision (Author) Posted March 5, 2020

Before I change out all of the hardware to test the above hypothesis, I'm trying to duplicate the data to another server. I start the system, initiate a transfer, and see how much data can be pulled before it crashes. I just looked at my transfer and noticed it had stopped, so I logged into the web page, which displayed. I got three notifications about plugin updates, and when I clicked the Plugins tab, the system refreshed to a blank page with the header still showing and the Chrome busy icon in the tab. I quickly opened a terminal and tried to pull info from the syslog.

root@Media:~# tail -f /var/log/syslog
Mar 4 18:56:09 Media kernel: start_secondary+0x197/0x1b2
Mar 4 18:56:09 Media kernel: secondary_startup_64+0xa4/0xb0
Mar 4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
Mar 4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 00 00 08 01 00 00 01 00
Mar 4 18:56:32 Media kernel: print_req_error: I/O error, dev sda, sector 2049
Mar 4 18:56:32 Media kernel: Buffer I/O error on dev sda1, logical block 1, lost async page write
Mar 4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
Mar 4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 00 00 17 91 00 00 01 00
Mar 4 18:56:32 Media kernel: print_req_error: I/O error, dev sda, sector 6033
Mar 4 18:56:32 Media kernel: Buffer I/O error on dev sda1, logical block 3985, lost async page write
Mar 4 19:11:09 Media kernel: RDX: 0000000000000000 RSI: 0000000021bf5b50 RDI: 0000000000000000
Mar 4 19:11:09 Media kernel: RBP: 00000589298a6770 R08: 00000589298a6770 R09: 000000000000573e
Mar 4 19:11:09 Media kernel: R10: 000000008f3bab38 R11: 071c71c71c71c71c R12: 0000000000000002
Mar 4 19:11:09 Media kernel: R13: ffffffff81e5e2a0 R14: 0000000000000000 R15: ffffffff81e5e378
Mar 4 19:11:09 Media kernel: ? cpuidle_enter_state+0xbf/0x141
Mar 4 19:11:09 Media kernel: do_idle+0x17e/0x1fc
Mar 4 19:11:09 Media kernel: cpu_startup_entry+0x6a/0x6c
Mar 4 19:11:09 Media kernel: start_secondary+0x197/0x1b2
Mar 4 19:11:09 Media kernel: secondary_startup_64+0xa4/0xb0
Mar 4 19:13:09 Media login[15244]: ROOT LOGIN on '/dev/pts/1'
Mar 4 19:14:09 Media kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar 4 19:14:09 Media kernel: rcu: 4-...0: (0 ticks this GP) idle=cd2/1/0x4000000000000000 softirq=314663/314663 fqs=814935
Mar 4 19:14:09 Media kernel: rcu: 11-...0: (2 GPs behind) idle=98e/0/0x1 softirq=365175/365176 fqs=814935
Mar 4 19:14:09 Media kernel: rcu: (detected by 8, t=3300092 jiffies, g=1434025, q=713807)
Mar 4 19:14:09 Media kernel: Sending NMI from CPU 8 to CPUs 4:
Mar 4 19:14:09 Media kernel: NMI backtrace for cpu 4
Mar 4 19:14:09 Media kernel: CPU: 4 PID: 15614 Comm: unraidd7 Tainted: G O 4.19.98-Unraid #1
Mar 4 19:14:09 Media kernel: Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
Mar 4 19:14:09 Media kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x11e/0x171
Mar 4 19:14:09 Media kernel: Code: 48 03 04 cd 20 37 db 81 48 89 10 8b 42 08 85 c0 75 04 f3 90 eb f5 48 8b 0a 48 85 c9 74 c9 0f 0d 09 8b 07 66 85 c0 74 04 f3 90 <eb> f5 41 89 c0 66 45 31 c0 44 39 c6 74 0a 48 85 c9 c6 07 01 75 1b
Mar 4 19:14:09 Media kernel: RSP: 0018:ffffc9000bd4fe80 EFLAGS: 00000002
Mar 4 19:14:09 Media kernel: RAX: 0000000000140101 RBX: ffffc9000bd4fec0 RCX: 0000000000000000
Mar 4 19:14:09 Media kernel: RDX: ffff88840e720740 RSI: 0000000000140000 RDI: ffff88840bfe8498
Mar 4 19:14:09 Media kernel: RBP: ffff88840bfe8498 R08: 000000000000029c R09: 0000000000000000
Mar 4 19:14:09 Media kernel: R10: 0000000000000020 R11: ffff88840e71fb40 R12: 0000000000000246
Mar 4 19:14:09 Media kernel: R13: ffff88840bfe8498 R14: ffff8883d4c81b00 R15: ffffc9000bc0faf0
Mar 4 19:14:09 Media kernel: FS: 0000000000000000(0000) GS:ffff88840e700000(0000) knlGS:0000000000000000
Mar 4 19:14:09 Media kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 4 19:14:09 Media kernel: CR2: 00001460ca2b2000 CR3: 0000000001e0a000 CR4: 0000000000340ee0
Mar 4 19:14:09 Media kernel: Call Trace:
Mar 4 19:14:09 Media kernel: _raw_spin_lock_irqsave+0x29/0x31
Mar 4 19:14:09 Media kernel: prepare_to_wait_event+0x13/0xd2
Mar 4 19:14:09 Media kernel: md_thread+0x8f/0x115 [md_mod]
Mar 4 19:14:09 Media kernel: ? wait_woken+0x6a/0x6a
Mar 4 19:14:09 Media kernel: ? md_open+0x2c/0x2c [md_mod]
Mar 4 19:14:09 Media kernel: kthread+0x10c/0x114
Mar 4 19:14:09 Media kernel: ? kthread_park+0x89/0x89
Mar 4 19:14:09 Media kernel: ret_from_fork+0x22/0x40
Mar 4 19:14:09 Media kernel: Sending NMI from CPU 8 to CPUs 11:
Mar 4 19:14:09 Media kernel: NMI backtrace for cpu 11
Mar 4 19:14:09 Media kernel: CPU: 11 PID: 0 Comm: swapper/11 Tainted: G O 4.19.98-Unraid #1
Mar 4 19:14:09 Media kernel: Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
Mar 4 19:14:09 Media kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x171
Mar 4 19:14:09 Media kernel: Code: 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02 00 65 48 03 15 48 6d f8
Mar 4 19:14:09 Media kernel: RSP: 0018:ffff88840e8c3c80 EFLAGS: 00000002
Mar 4 19:14:09 Media kernel: RAX: 0000000000140101 RBX: ffff88840bfe8498 RCX: 0000000000000000
Mar 4 19:14:09 Media kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88840bfe8498
Mar 4 19:14:09 Media kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: ffff88840e8c3ca0
Mar 4 19:14:09 Media kernel: R10: 0000000000000000 R11: ffff88840e71fb40 R12: 0000000000000046
Mar 4 19:14:09 Media kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 4 19:14:09 Media kernel: FS: 0000000000000000(0000) GS:ffff88840e8c0000(0000) knlGS:0000000000000000
Mar 4 19:14:09 Media kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 4 19:14:09 Media kernel: CR2: 00001460ca290000 CR3: 0000000001e0a000 CR4: 0000000000340ee0
Mar 4 19:14:09 Media kernel: Call Trace:
Mar 4 19:14:09 Media kernel: <IRQ>
Mar 4 19:14:09 Media kernel: _raw_spin_lock_irqsave+0x29/0x31
Mar 4 19:14:09 Media kernel: __wake_up_common_lock+0x5b/0xcb
Mar 4 19:14:09 Media kernel: end_request+0x178/0x18e [md_mod]
Mar 4 19:14:09 Media kernel: blk_update_request+0x114/0x21e
Mar 4 19:14:09 Media kernel: scsi_end_request+0x29/0x203
Mar 4 19:14:09 Media kernel: scsi_io_completion+0x27c/0x4fa
Mar 4 19:14:09 Media kernel: blk_mq_complete_request+0xea/0xef
Mar 4 19:14:09 Media kernel: _scsih_io_done+0x6c5/0x6d7 [mpt3sas]
Mar 4 19:14:09 Media kernel: ? load_balance+0x124/0x713
Mar 4 19:14:09 Media kernel: ? __accumulate_pelt_segments+0x1d/0x2c
Mar 4 19:14:09 Media kernel: ? __update_load_avg_se+0xeb/0x19c
Mar 4 19:14:09 Media kernel: _base_interrupt+0x1aa/0xe0a [mpt3sas]
Mar 4 19:14:09 Media kernel: __handle_irq_event_percpu+0x36/0xcb
Mar 4 19:14:09 Media kernel: handle_irq_event_percpu+0x2c/0x6f
Mar 4 19:14:09 Media kernel: handle_irq_event+0x34/0x51
Mar 4 19:14:09 Media kernel: handle_edge_irq+0xfc/0x11f
Mar 4 19:14:09 Media kernel: handle_irq+0x1c/0x1f
Mar 4 19:14:09 Media kernel: do_IRQ+0x46/0xd0
Mar 4 19:14:09 Media kernel: common_interrupt+0xf/0xf
Mar 4 19:14:09 Media kernel: </IRQ>
Mar 4 19:14:09 Media kernel: RIP: 0010:cpuidle_enter_state+0xe8/0x141
Mar 4 19:14:09 Media kernel: Code: ff 45 84 f6 74 1d 9c 58 0f 1f 44 00 00 0f ba e0 09 73 09 0f 0b fa 66 0f 1f 44 00 00 31 ff e8 a8 8f bb ff fb 66 0f 1f 44 00 00 <48> 2b 2c 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 39 cd
Mar 4 19:14:09 Media kernel: RSP: 0018:ffffc90001a0be98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
Mar 4 19:14:09 Media kernel: RAX: ffff88840e8dfac0 RBX: ffff888408eb0c00 RCX: 000000000000001f
Mar 4 19:14:09 Media kernel: RDX: 0000000000000000 RSI: 0000000021bf5b50 RDI: 0000000000000000
Mar 4 19:14:09 Media kernel: RBP: 00000589298a6770 R08: 00000589298a6770 R09: 000000000000573e
Mar 4 19:14:09 Media kernel: R10: 000000008f3bab38 R11: 071c71c71c71c71c R12: 0000000000000002
Mar 4 19:14:09 Media kernel: R13: ffffffff81e5e2a0 R14: 0000000000000000 R15: ffffffff81e5e378
Mar 4 19:14:09 Media kernel: ? cpuidle_enter_state+0xbf/0x141
Mar 4 19:14:09 Media kernel: do_idle+0x17e/0x1fc
Mar 4 19:14:09 Media kernel: cpu_startup_entry+0x6a/0x6c
Mar 4 19:14:09 Media kernel: start_secondary+0x197/0x1b2
Mar 4 19:14:09 Media kernel: secondary_startup_64+0xa4/0xb0

I don't know if this helps narrow down the cause of the lockup. I checked, and I can still see the Overview page, but I have no SMB access. I took a picture of the Overview page and attached it. All of the live CPU stats are frozen. I also looked up the device listed in the errors above: sda1 is on sda, which is my flash drive.

I'm at a loss for now. Thanks for the help!!
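A syslog like the one above is easier to triage once the repeated events are counted. The following is a rough Python sketch, not a diagnostic tool: the regular expressions are based only on the message shapes visible in this excerpt, and the function name `summarize` and the result keys are made up for illustration. It counts block I/O errors per device, RCU stall reports, the CPUs that received an NMI backtrace, and the kernel modules named in stack frames.

```python
import re
from collections import Counter

# Patterns derived only from the log lines quoted in the post above.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector \d+")
RCU_STALL = re.compile(r"rcu: INFO: rcu_sched detected stalls")
NMI_CPU = re.compile(r"NMI backtrace for cpu (\d+)")
MODULE = re.compile(r"\[(\w+)\]\s*$")  # e.g. "md_thread+0x8f/0x115 [md_mod]"


def summarize(lines):
    """Summarize an iterable of syslog lines into a few simple counts."""
    io_errors = Counter()
    modules = Counter()
    cpus = set()
    stalls = 0
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            io_errors[match.group(1)] += 1
            continue
        if RCU_STALL.search(line):
            stalls += 1
            continue
        match = NMI_CPU.search(line)
        if match:
            cpus.add(int(match.group(1)))
            continue
        match = MODULE.search(line)
        if match:
            modules[match.group(1)] += 1
    return {
        "io_errors_by_device": dict(io_errors),
        "rcu_stall_reports": stalls,
        "cpus_with_nmi_backtrace": sorted(cpus),
        "modules_in_traces": dict(modules),
    }
```

Usage, assuming the syslog has been copied off the server: `summarize(open("syslog"))`. The counts only show where the kernel was when it stalled; they do not by themselves say whether the root cause is a driver, the flash drive, or the board/CPU.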
JorgeB Posted March 5, 2020

Errors on sda suggest possible flash issues, but it's unlikely that would make the server crash. This still looks like a hardware problem to me, possibly the board/CPU combo.
Alex.vision (Author) Posted March 5, 2020

9 hours ago, johnnie.black said:
Errors on sda suggests possible flash issues, but unlikely that would make the server crash, that still looks like a hardware problem to me, possibly board/cpu combo.

OK, I'll put this on hold for now. I'm going to be out of action for the next two weeks. When I get back I'll swap all the parts with my duplicate system and see if I get the same errors. Then hopefully I can narrow down the problem, one part at a time.

Thanks @johnnie.black for the assistance!!

-Alex