Random Lockups 6.8.1, 6.8.2



Hello fellow UnRaiders,

 

Issue  

My machine locks up randomly and never runs for more than 2 or 3 days. I originally thought it was related to my Pi-hole docker, since the problem seemed to occur more frequently when Docker was running. However, with that disabled my machine still crashes. When it locks up I can’t ping the system or reach any share. When I had the GUI enabled, it would be completely frozen at the login screen. A few times I was logged into the system and running htop full screen; that would also be frozen, requiring a hard reset. System stability is so bad that I can start the computer in the morning and within a few hours it has locked up.

 

Machine Specs 

MB: ASUS Prime X470-Pro (BIOS version 5406)
CPU: AMD Ryzen 5 3600X (stock speeds, stock cooler)
RAM: G.SKILL Ripjaws V Series 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3600
PSU: CORSAIR RM Series, RM850, 850 Watt
HBA: AOC-SAS2LP-MV8 (I know it’s a Marvell-based card; I’m swapping it for another LSI I have)
External DAS Card: LSI SAS 9212-4i4e, single connection to a 16-bay DAS
Drives: 22 8TB Data Drives, 1 8TB Parity and 1 8TB Cache Drive, mostly Seagate, some Western Digital. 

GPU: Gigabyte GV-R523D3-1GL Rev 2.0

 

Attempted Fixes 

Ran in safe mode with no plugins, Docker disabled, and VMs disabled. Disabled HT in the BIOS. Disabled any extra motherboard ports, like serial and floppy support. Ran without the array started. I have run in GUI mode, GUI safe mode, no GUI, and headless. I have run Fix Common Problems, and other than the complaint about the Marvell controller, all seems well. No obvious issues.

I haven't been able to run Memtest86 on this new board. For some reason it won't launch; the machine instantly reboots after I select it in the Unraid boot menu.

 

Unraid Versions 

I don’t keep the best records of which versions worked and which had problems, and I can't remember what version I was on when this problem started after I built the machine. I started on 6.7.2 and worked my way up through each RC to 6.8.2, which I am on as of today. I had crashes on every version along the way, but since I only moved to 6.8.2 today, it may or may not last. If (when) I crash on 6.8.2 I will post that syslog too.

 

Final Thoughts

I’m attaching two photos of what my monitor displays when the system locks up. I’m also attaching two diagnostic files: the first from just after rebooting from a lockup (6.8.1), the second from after my update to 6.8.2. I'm not expecting either log to hold any big revelations. I think the photos of the screen might point me in a better direction, if I knew what they meant. I'm at a complete loss here and would love it if someone could help me diagnose the problem.

IMG_4803 - Copy.jpg

IMG_5041 - Copy.jpg

Alex.Vision Log 6.8.1 post Crash 1.zip Alex.Vision Log 6.8.2.zip

3 minutes ago, Alex.vision said:

I haven't been able to run Memtest86 on this new board, for some reason it won't launch, it instantly reboots after I select it in the UnRaid boot menu. 

 

Memtest won't boot via UEFI boot mode.  You have to change the BIOS / boot order to boot via Legacy / normal from the USB for memtest to work.

10 hours ago, Alex.vision said:

Ram: G.SKILL Ripjaws V Series 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3600 

Don't overclock your RAM. It's known to cause stability issues with some Ryzen systems, even data corruption. Respect the maximum speed for your configuration:

2093328511_3rdgen.jpg.e6d1b55e26606a1fde5ee83dbf720848.jpg

 

Also look for "Power Supply Idle Control" (or similar) in the BIOS and set it to "typical current idle" (or similar).

6 hours ago, johnnie.black said:

Don't overclock your RAM, it's known to cause stability issues with some Ryzen system, even data corruption, respect max speed based on config:

Also look for "Power Supply Idle Control" (or similar) on the BIOS and set it to "typical current idle" (or similar)

johnnie.black

 

Thanks for the help. I'll change the RAM settings and look for the power setting when I get done with work. I would have done it this morning, but I wanted to let memtest run for a good 24 hours. Could you tell me where that RAM chart came from? I guess I thought that because it was 3600-speed RAM it wasn't really overclocked, even though I had to use XMP or DOCP. I think I remember the old days when the box said 2667, 3200(OC), 3600(OC).

 

I'll do some googling on the "typical current idle" too.

 

Thanks


Well, it has been a few days, and I have been walking on eggshells when it comes to running my server. Either turning off the RAM overclock or changing the Power Supply Idle Control seems to have remedied my random reboot issue. Thank you @johnnie.black and @Squid for the troubleshooting help. I have 2 days and 16 hours of uptime, which is way better than it has been in the past. I still have Docker disabled, but I'll work things back in one step at a time. A functioning system minus a few features is way better than no system at all.

 

Thanks for the help!!

  • Alex.vision changed the title to (Solved) Random Lockups 6.8.1
  • 4 weeks later...

OK, Round two.

 

I made the suggested changes from above, and I also swapped my Marvell-based HBA for an LSI model. I also upgraded to version 6.8.2, with the same results. I had thought the problem might have been solved, but for the past week and a half my server has really struggled to maintain any uptime. It locks up constantly, but I think I finally managed to get a log file that might help. In my last attempt to fix the problem myself, I erased my flash drive and started over, importing just a few settings from the old one. I thought maybe something on it was causing problems. Apparently not. I’m uploading two log files, one from mid-lockup and one from after I hard reset the system. I can see a bunch of information starting on line 2810, but I don’t know enough about Linux to say whether it really is a bug or not.
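For anyone opening the attached syslogs, one rough way to find where the interesting kernel output starts (such as the block near line 2810) is to scan for a few marker strings and print their line numbers. This is only a sketch: it assumes Python 3 on whatever machine the log has been copied to, and the marker list and the `find_markers` helper are my own picks for illustration, not part of any Unraid tool.

```python
# Assumption: these strings commonly introduce kernel stall reports,
# backtraces, and disk errors in a Linux syslog. Adjust as needed.
MARKERS = ("rcu: INFO:", "Call Trace:", "NMI backtrace", "I/O error")


def find_markers(lines, markers=MARKERS):
    """Return (line_number, marker) pairs, 1-based, one hit per line at most."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for marker in markers:
            if marker in line:
                hits.append((number, marker))
                break  # stop at the first marker found on this line
    return hits


# Small inline demonstration using lines in the same format as the syslog.
demo = [
    "Mar  4 19:13:09 Media login[15244]: ROOT LOGIN  on '/dev/pts/1'",
    "Mar  4 19:14:09 Media kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:",
    "Mar  4 19:14:09 Media kernel: Call Trace:",
]
for number, marker in find_markers(demo):
    print(f"line {number}: {marker}")
```

To run it against a real file, pass the open file object instead of the demo list, for example `find_markers(open("syslog.txt", errors="replace"))`, where `syslog.txt` is a placeholder name.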

 

I really hope this is something I can fix.

alex.vision (crash) media-syslog-20200227-0511.zip alex.vision (next boot) media-syslog-20200227-1636.zip

  • Alex.vision changed the title to Random Lockups 6.8.1, 6.8.2

Before I can change out all of the hardware to test the above hypothesis, I'm trying to duplicate the data to another server. I start the system, initiate a transfer, and see how much data can be pulled before it crashes. I just looked at my transfer and noticed it had stopped, so I logged into the web page, which displayed. I got three notifications about plugin updates, and when I clicked the Plugins tab, the page refreshed to a blank page with the header still showing and the Chrome busy icon in the tab. I quickly opened a terminal and tried to pull info from the syslog.

 

root@Media:~# tail -f /var/log/syslog
Mar  4 18:56:09 Media kernel: start_secondary+0x197/0x1b2
Mar  4 18:56:09 Media kernel: secondary_startup_64+0xa4/0xb0
Mar  4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
Mar  4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 00 00 08 01 00 00 01 00
Mar  4 18:56:32 Media kernel: print_req_error: I/O error, dev sda, sector 2049
Mar  4 18:56:32 Media kernel: Buffer I/O error on dev sda1, logical block 1, lost async page write
Mar  4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x00
Mar  4 18:56:32 Media kernel: sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x2a 2a 00 00 00 17 91 00 00 01 00
Mar  4 18:56:32 Media kernel: print_req_error: I/O error, dev sda, sector 6033
Mar  4 18:56:32 Media kernel: Buffer I/O error on dev sda1, logical block 3985, lost async page write

Mar  4 19:11:09 Media kernel: RDX: 0000000000000000 RSI: 0000000021bf5b50 RDI: 0000000000000000
Mar  4 19:11:09 Media kernel: RBP: 00000589298a6770 R08: 00000589298a6770 R09: 000000000000573e
Mar  4 19:11:09 Media kernel: R10: 000000008f3bab38 R11: 071c71c71c71c71c R12: 0000000000000002
Mar  4 19:11:09 Media kernel: R13: ffffffff81e5e2a0 R14: 0000000000000000 R15: ffffffff81e5e378
Mar  4 19:11:09 Media kernel: ? cpuidle_enter_state+0xbf/0x141
Mar  4 19:11:09 Media kernel: do_idle+0x17e/0x1fc
Mar  4 19:11:09 Media kernel: cpu_startup_entry+0x6a/0x6c
Mar  4 19:11:09 Media kernel: start_secondary+0x197/0x1b2
Mar  4 19:11:09 Media kernel: secondary_startup_64+0xa4/0xb0
Mar  4 19:13:09 Media login[15244]: ROOT LOGIN  on '/dev/pts/1'
Mar  4 19:14:09 Media kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar  4 19:14:09 Media kernel: rcu:      4-...0: (0 ticks this GP) idle=cd2/1/0x4000000000000000 softirq=314663/314663 fqs=814935 
Mar  4 19:14:09 Media kernel: rcu:      11-...0: (2 GPs behind) idle=98e/0/0x1 softirq=365175/365176 fqs=814935 
Mar  4 19:14:09 Media kernel: rcu:      (detected by 8, t=3300092 jiffies, g=1434025, q=713807)
Mar  4 19:14:09 Media kernel: Sending NMI from CPU 8 to CPUs 4:
Mar  4 19:14:09 Media kernel: NMI backtrace for cpu 4
Mar  4 19:14:09 Media kernel: CPU: 4 PID: 15614 Comm: unraidd7 Tainted: G           O      4.19.98-Unraid #1
Mar  4 19:14:09 Media kernel: Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
Mar  4 19:14:09 Media kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x11e/0x171
Mar  4 19:14:09 Media kernel: Code: 48 03 04 cd 20 37 db 81 48 89 10 8b 42 08 85 c0 75 04 f3 90 eb f5 48 8b 0a 48 85 c9 74 c9 0f 0d 09 8b 07 66 85 c0 74 04 f3 90 <eb> f5 41 89 c0 66 45 31 c0 44 39 c6 74 0a 48 85 c9 c6 07 01 75 1b
Mar  4 19:14:09 Media kernel: RSP: 0018:ffffc9000bd4fe80 EFLAGS: 00000002
Mar  4 19:14:09 Media kernel: RAX: 0000000000140101 RBX: ffffc9000bd4fec0 RCX: 0000000000000000
Mar  4 19:14:09 Media kernel: RDX: ffff88840e720740 RSI: 0000000000140000 RDI: ffff88840bfe8498
Mar  4 19:14:09 Media kernel: RBP: ffff88840bfe8498 R08: 000000000000029c R09: 0000000000000000
Mar  4 19:14:09 Media kernel: R10: 0000000000000020 R11: ffff88840e71fb40 R12: 0000000000000246
Mar  4 19:14:09 Media kernel: R13: ffff88840bfe8498 R14: ffff8883d4c81b00 R15: ffffc9000bc0faf0
Mar  4 19:14:09 Media kernel: FS:  0000000000000000(0000) GS:ffff88840e700000(0000) knlGS:0000000000000000
Mar  4 19:14:09 Media kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  4 19:14:09 Media kernel: CR2: 00001460ca2b2000 CR3: 0000000001e0a000 CR4: 0000000000340ee0
Mar  4 19:14:09 Media kernel: Call Trace:
Mar  4 19:14:09 Media kernel: _raw_spin_lock_irqsave+0x29/0x31
Mar  4 19:14:09 Media kernel: prepare_to_wait_event+0x13/0xd2
Mar  4 19:14:09 Media kernel: md_thread+0x8f/0x115 [md_mod]
Mar  4 19:14:09 Media kernel: ? wait_woken+0x6a/0x6a
Mar  4 19:14:09 Media kernel: ? md_open+0x2c/0x2c [md_mod]
Mar  4 19:14:09 Media kernel: kthread+0x10c/0x114
Mar  4 19:14:09 Media kernel: ? kthread_park+0x89/0x89
Mar  4 19:14:09 Media kernel: ret_from_fork+0x22/0x40
Mar  4 19:14:09 Media kernel: Sending NMI from CPU 8 to CPUs 11:
Mar  4 19:14:09 Media kernel: NMI backtrace for cpu 11
Mar  4 19:14:09 Media kernel: CPU: 11 PID: 0 Comm: swapper/11 Tainted: G           O      4.19.98-Unraid #1
Mar  4 19:14:09 Media kernel: Hardware name: System manufacturer System Product Name/PRIME X470-PRO, BIOS 5406 11/13/2019
Mar  4 19:14:09 Media kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x171
Mar  4 19:14:09 Media kernel: Code: 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02 00 65 48 03 15 48 6d f8
Mar  4 19:14:09 Media kernel: RSP: 0018:ffff88840e8c3c80 EFLAGS: 00000002
Mar  4 19:14:09 Media kernel: RAX: 0000000000140101 RBX: ffff88840bfe8498 RCX: 0000000000000000
Mar  4 19:14:09 Media kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff88840bfe8498
Mar  4 19:14:09 Media kernel: RBP: 0000000000000003 R08: 0000000000000000 R09: ffff88840e8c3ca0
Mar  4 19:14:09 Media kernel: R10: 0000000000000000 R11: ffff88840e71fb40 R12: 0000000000000046
Mar  4 19:14:09 Media kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar  4 19:14:09 Media kernel: FS:  0000000000000000(0000) GS:ffff88840e8c0000(0000) knlGS:0000000000000000
Mar  4 19:14:09 Media kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  4 19:14:09 Media kernel: CR2: 00001460ca290000 CR3: 0000000001e0a000 CR4: 0000000000340ee0
Mar  4 19:14:09 Media kernel: Call Trace:
Mar  4 19:14:09 Media kernel: <IRQ>
Mar  4 19:14:09 Media kernel: _raw_spin_lock_irqsave+0x29/0x31
Mar  4 19:14:09 Media kernel: __wake_up_common_lock+0x5b/0xcb
Mar  4 19:14:09 Media kernel: end_request+0x178/0x18e [md_mod]
Mar  4 19:14:09 Media kernel: blk_update_request+0x114/0x21e
Mar  4 19:14:09 Media kernel: scsi_end_request+0x29/0x203
Mar  4 19:14:09 Media kernel: scsi_io_completion+0x27c/0x4fa
Mar  4 19:14:09 Media kernel: blk_mq_complete_request+0xea/0xef
Mar  4 19:14:09 Media kernel: _scsih_io_done+0x6c5/0x6d7 [mpt3sas]
Mar  4 19:14:09 Media kernel: ? load_balance+0x124/0x713
Mar  4 19:14:09 Media kernel: ? __accumulate_pelt_segments+0x1d/0x2c
Mar  4 19:14:09 Media kernel: ? __update_load_avg_se+0xeb/0x19c
Mar  4 19:14:09 Media kernel: _base_interrupt+0x1aa/0xe0a [mpt3sas]
Mar  4 19:14:09 Media kernel: __handle_irq_event_percpu+0x36/0xcb
Mar  4 19:14:09 Media kernel: handle_irq_event_percpu+0x2c/0x6f
Mar  4 19:14:09 Media kernel: handle_irq_event+0x34/0x51
Mar  4 19:14:09 Media kernel: handle_edge_irq+0xfc/0x11f
Mar  4 19:14:09 Media kernel: handle_irq+0x1c/0x1f
Mar  4 19:14:09 Media kernel: do_IRQ+0x46/0xd0
Mar  4 19:14:09 Media kernel: common_interrupt+0xf/0xf
Mar  4 19:14:09 Media kernel: </IRQ>
Mar  4 19:14:09 Media kernel: RIP: 0010:cpuidle_enter_state+0xe8/0x141
Mar  4 19:14:09 Media kernel: Code: ff 45 84 f6 74 1d 9c 58 0f 1f 44 00 00 0f ba e0 09 73 09 0f 0b fa 66 0f 1f 44 00 00 31 ff e8 a8 8f bb ff fb 66 0f 1f 44 00 00 <48> 2b 2c 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 39 cd
Mar  4 19:14:09 Media kernel: RSP: 0018:ffffc90001a0be98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
Mar  4 19:14:09 Media kernel: RAX: ffff88840e8dfac0 RBX: ffff888408eb0c00 RCX: 000000000000001f
Mar  4 19:14:09 Media kernel: RDX: 0000000000000000 RSI: 0000000021bf5b50 RDI: 0000000000000000
Mar  4 19:14:09 Media kernel: RBP: 00000589298a6770 R08: 00000589298a6770 R09: 000000000000573e
Mar  4 19:14:09 Media kernel: R10: 000000008f3bab38 R11: 071c71c71c71c71c R12: 0000000000000002
Mar  4 19:14:09 Media kernel: R13: ffffffff81e5e2a0 R14: 0000000000000000 R15: ffffffff81e5e378
Mar  4 19:14:09 Media kernel: ? cpuidle_enter_state+0xbf/0x141
Mar  4 19:14:09 Media kernel: do_idle+0x17e/0x1fc
Mar  4 19:14:09 Media kernel: cpu_startup_entry+0x6a/0x6c
Mar  4 19:14:09 Media kernel: start_secondary+0x197/0x1b2
Mar  4 19:14:09 Media kernel: secondary_startup_64+0xa4/0xb0
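The stall report above is long, so here is a rough sketch of how it could be boiled down to "which CPUs stalled, what task was on each, and what function was it stuck in". It assumes Python 3 on a machine where the log has been saved. The regular expressions are written only against the lines quoted above (kernel 4.19.98-Unraid) and may not match other kernel versions; `summarize_stall` is a hypothetical helper name.

```python
import re

# Assumption: patterns match the message layout shown in this thread only.
STALL_CPU = re.compile(r"rcu:\s+(\d+)-\S+: \(")
CPU_TASK = re.compile(r"kernel: CPU: (\d+) PID: (\d+) Comm: (\S+)")
RIP = re.compile(r"kernel: RIP: 0010:(\w+)\+")


def summarize_stall(lines):
    """Return (stalled_cpus, per_cpu) where per_cpu maps CPU -> task and first RIP symbol."""
    stalled = []
    per_cpu = {}
    current = None  # CPU whose backtrace is currently being read
    for line in lines:
        match = STALL_CPU.search(line)
        if match:
            stalled.append(int(match.group(1)))
            continue
        match = CPU_TASK.search(line)
        if match:
            current = int(match.group(1))
            per_cpu[current] = {"task": match.group(3), "rip": None}
            continue
        match = RIP.search(line)
        # Keep only the first RIP per CPU: that is where the NMI caught it.
        if match and current is not None and per_cpu[current]["rip"] is None:
            per_cpu[current]["rip"] = match.group(1)
    return stalled, per_cpu
```

Fed the lines above, this reports CPUs 4 and 11, with tasks `unraidd7` and `swapper/11`, both caught in `native_queued_spin_lock_slowpath`. The two register dumps also show the same RDI value (ffff88840bfe8498), which might mean both CPUs were waiting on the same lock, but that is a guess from the dump rather than a confirmed diagnosis.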

 

I don't know if this helps narrow down the cause of the lockup. I checked, and I can still see the Overview page, but I have no SMB access. I took a picture of the Overview page and attached it. All of the live CPU stats are frozen. I also looked up the device listed in the errors above, sda1: sda is my flash drive. I'm at a loss for now.
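As a sanity check on the sda errors in the log, the CDB bytes can be decoded by hand. Opcode 0x2a is the SCSI WRITE(10) command, whose layout puts the logical block address in bytes 2-5 and the transfer length in bytes 7-8, both big-endian. The sketch below assumes Python 3 and uses a hypothetical helper name, `decode_write10`; it only handles this one command type.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Raises ValueError for anything that is not a
    10-byte CDB with opcode 0x2A.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks


# The two CDBs from the syslog excerpt.
for cdb in ("2a 00 00 00 08 01 00 00 01 00", "2a 00 00 00 17 91 00 00 01 00"):
    lba, blocks = decode_write10(cdb)
    print(f"WRITE(10) lba={lba} blocks={blocks}")
```

The decoded addresses, 2049 and 6033, match the "sector 2049" and "sector 6033" lines in the log, and each is exactly 2048 higher than the reported logical block on sda1 (1 and 3985). That is consistent with sda1 starting at sector 2048, assuming 512-byte blocks. So these look like single-sector writes to the flash drive failing, and as far as I know hostbyte=0x07 corresponds to DID_ERROR in the kernel's SCSI headers, a generic host-side error rather than a specific disk fault.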

 

Thanks for the help!!

One.png

9 hours ago, johnnie.black said:

Errors on sda suggests possible flash issues, but unlikely that would make the server crash, that still looks like a hardware problem to me, possibly board/cpu combo.

OK, I'll put this on hold for now. I'm going to be out of action for the next two weeks. When I get back I'll swap all the parts with my duplicate system and see if I get the same errors. Then hopefully I can narrow down the problem, one part at a time. Thanks @johnnie.black for the assistance!!

 

-Alex
