gray squirrel Posted April 25, 2020 (edited)
Over the past few weeks I have experienced two random unclean shutdowns. My IPMI log lists them as Critical Stop; OEM Event Data2 code = 69h; OEM Event Data3 code = 6Dh. Any pointers? I have attached the diagnostics and a snippet from the IPMI event log. Thanks in advance.
Edit: I didn't trigger the shutdowns.
Attachment: megatron-diagnostics-20200425-1400.zip
JorgeB Posted April 25, 2020
Best bet would probably be to email Asrock support to get a list of all the SEL codes, or at least have them tell you what those are about.
gray squirrel Posted May 19, 2020
It took a little time for Asrock to come back to me, and then I had to wait for another event while I was mirroring the logs to the USB. Asrock points to the OS, but I could use some help understanding the log. I was sitting next to the server at the time and didn't even realize it had rebooted until I tried to RDP into a VM for some testing I'm doing. IPMI has the lock-up at 2020-05-19 14:09:13. Below is a snippet of the log before the boot:
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered blocking state
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 13:52:58 Megatron kernel: device vnet1 entered promiscuous mode
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered blocking state
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered forwarding state
May 19 13:53:03 Megatron kernel: vfio_ecap_init: 0000:03:00.0 hiding ecap 0x1e@0x258
May 19 13:53:03 Megatron kernel: vfio_ecap_init: 0000:03:00.0 hiding ecap 0x19@0x900
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:08:58 Megatron kernel: device vnet1 left promiscuous mode
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
My guess is the rest is the boot log?
May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21
May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
May 19 14:11:37 Megatron kernel: Command line: BOOT_IMAGE=/bzimage isolcpus=0-3,16-19 initrd=/bzroot
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 19 14:11:37 Megatron kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
May 19 14:11:37 Megatron kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
May 19 14:11:37 Megatron kernel: BIOS-provided physical RAM map:
trurl Posted May 19, 2020
On 4/25/2020 at 9:21 AM, johnnie.black said: "Asrock support to get a list all the SEL codes"
6 minutes ago, gray squirrel said: "Asrock to come back to me .. asrock point to the OS"
Did they provide the requested information?
7 minutes ago, gray squirrel said: "mirroring the logs to the USB"
You should zip and post.
gray squirrel Posted May 19, 2020 (edited)
They asked if there was anything in the logs. It doesn't look like it to me, but I'm new to Linux. Is it safe to post the full syslog?
Frank1940 Posted May 19, 2020
Look for e-mail addresses and any directory/share names that you might not want known to the wider world. User names and passwords would be another couple of items to check, but I don't believe they are logged by Unraid.
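The check described above can be partly automated before a syslog is posted. The following is a minimal sketch, not an Unraid feature: the function names `redact_line` and `redact_text` and the placeholder strings are made up for illustration, and the e-mail pattern is deliberately simple, so a manual read-through is still worthwhile.

```python
import re

# Simple e-mail pattern (an assumption, not a full RFC 5322 matcher).
# It requires a dotted domain, so kernel banner text such as
# "root@Develop" is left alone.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_line(line, private_names=()):
    """Mask e-mail addresses and any caller-supplied private names."""
    line = EMAIL_RE.sub("<email>", line)
    for name in private_names:
        # Plain substring replacement; list share or directory names
        # you do not want to publish.
        line = line.replace(name, "<redacted>")
    return line


def redact_text(text, private_names=()):
    """Apply redact_line to every line of a syslog dump."""
    return "\n".join(redact_line(l, private_names) for l in text.splitlines())
```

A typical use would be reading the mirrored syslog into a string, passing it through `redact_text` with a list of share names, and attaching the result instead of the raw file.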
gray squirrel Posted May 19, 2020
See attached syslog.
gray squirrel Posted May 20, 2020
So Asrock has come back and recommends a fresh install with a different HD and RAM. The only thing I can think of that has changed recently is upgrading to 6.8.3.
gray squirrel Posted June 2, 2020
I am completely stuck on this. I can't see anything in the logs that gives any more indication of the fault, and Asrock's view is that it's the OS creating the issue. I had another random shutdown, and the last item in the log was one of the drives spinning down. I am over 250 hours into running memtest to see if it still happens. Any ideas for troubleshooting, or do I need to build a completely new config?
Frank1940 Posted June 2, 2020
As I look at the problem, it is not a shutdown but a reboot, as shown here:
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21
May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
May 19 14:11:37 Megatron kernel: Command line: BOOT_IMAGE=/bzimage isolcpus=0-3,16-19 initrd=/bzroot
The reason is that there are only about 2 minutes 37 seconds between the last logged syslog entry and the first syslog entry of the Unraid boot sequence. Do you have a pet or child who might have hit the reset button on the case?
Regarding the memtest, do you have ECC memory? (Standard memtest cannot test ECC memory.)
Another thing to look at is the power supply. They have been involved in several similar types of problems. (Modern MBs and PSs 'talk' to each other...)
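The gap reasoning above can be checked mechanically. This is a rough sketch under a few assumptions: the syslog uses the classic "Mon DD HH:MM:SS" prefix with no year (so a year is supplied by hand), month names parse in an English locale, and the helper names `parse_ts` and `find_gaps` are invented for this example.

```python
from datetime import datetime


def parse_ts(line, year=2020):
    """Parse the leading 'May 19 14:09:00' stamp of a syslog line.

    Classic syslog omits the year, so it is passed in (assumption).
    """
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def find_gaps(lines, min_gap_seconds=60):
    """Return (gap_seconds, line_before, line_after) for each quiet
    period of at least min_gap_seconds between consecutive lines."""
    gaps = []
    for before, after in zip(lines, lines[1:]):
        delta = (parse_ts(after) - parse_ts(before)).total_seconds()
        if delta >= min_gap_seconds:
            gaps.append((int(delta), before, after))
    return gaps


# Three lines taken from the log excerpt quoted in this thread.
SAMPLE = [
    "May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state",
    "May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none",
    "May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21",
]

for seconds, before, after in find_gaps(SAMPLE):
    print(f"{seconds} s gap before: {after[:60]}")
```

A gap on its own only shows that nothing was logged; it points to a reboot when the line after the gap is the start of a boot sequence, as it is in the excerpt above.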
gray squirrel Posted June 2, 2020
Thanks for the pointers. No pets or children have access to the server, as it's in my office. I even noticed it once as the fans spun up, but there was no beep from the motherboard (that I would associate with a restart or shutdown).
Memtest wasn't to check the RAM; it was more to have something running that would not decide to restart itself, as I was concerned about OS updates power cycling anyway. Is there a better testing methodology?
Unfortunately I don't have another PSU to test with. The current one is a Corsair HX unit that is about 8 years old. It's been running in this system for over a year with no problem.
Frank1940 Posted June 2, 2020
18 minutes ago, gray squirrel said: "i don't have another PSU to test with"
If worst came to worst, you could arrange to 'borrow' one from a vendor with a liberal return policy...
gray squirrel Posted June 23, 2020 (edited)
So I have been doing more troubleshooting, short of changing the RAM or using another PSU, as I don't have either on hand. The system is now on a UPS, and I have also tried reverting to 6.8.2, but with no luck. I had another reboot at around 6am this morning with nothing in the mirrored syslog. However, I am now getting detected hardware errors. Might this help explain the issue I have?
Edit: it looks like a memory issue. Is the log able to tell me if it's one stick or all of it?
Attachment: megatron-diagnostics-20200623-1131.zip
Frank1940 Posted June 23, 2020
Jun 23 07:10:30 Megatron kernel: mce: [Hardware Error]: Machine check events logged
Jun 23 07:10:30 Megatron kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 23 07:10:30 Megatron kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c000046000800c0
Notice the "0 Bank 8" notification. I saw some other Bank numbers also. I am no expert in this area, but I believe this does tell which stick the error is on, BUT you will need to do a bit of research in your MB manual or on the manufacturer's website to figure out which one it is pointing to.
You seem to have a lot of memory sticks in your server. This can cause problems in some cases if the MB is picky about memory. Are the sticks on the 'approved list' for this MB? Again, is this ECC memory?
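The 64-bit value at the end of the Machine Check Event line can be unpacked into its flag bits. The sketch below assumes the architectural IA32_MCi_STATUS layout from the Intel Software Developer's Manual (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 to 57, the MCA error code in bits 15:0, and the memory-controller code form `1MMM CCCC`). The function name `decode_mci_status` is made up here, and the corrected-error count field is only meaningful on CPUs that report it. As far as I know, the "Bank" number is a machine-check register bank inside the CPU, not a DIMM slot, so this decode alone does not identify a stick.

```python
# Memory-controller operation types (MMM field), per the assumed layout.
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into named fields (sketch)."""
    code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "miscv": bool((status >> 59) & 1),
        "addrv": bool((status >> 58) & 1),
        "pcc": bool((status >> 57) & 1),
        # Bits 52:38; assumes the CPU implements the counter.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": code,
    }
    # Memory-controller errors: 000F 0000 1MMM CCCC (F = filter bit).
    if (code & 0xEF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        fields["channel"] = None if channel == 0xF else channel
    return fields


# The status value quoted in the log line above.
for name, value in decode_mci_status(0x8C000046000800C0).items():
    print(f"{name}: {value}")
```

For locating the physical module, the EDAC driver messages and the board manual remain the better sources, as suggested above.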
gray squirrel Posted June 23, 2020
I only have 4 sticks of RAM (8GB each). My guess is that it is CPU 0 (the one on the left) and bank 8, the bottom-most slot below that socket.
Frank1940 Posted June 23, 2020 (edited)
You are not doing any overclocking on CPU or memory? (This is usually bad news with Unraid.)
IF you are going to replace memory, make sure to check whether your MB is picky about memory. IF so, follow the recommendations. And I would always plan to replace the pair involved, not just one stick.
gray squirrel Posted August 16, 2020
So I switched the RAM out for 4x2GB sticks of ECC I had. After almost 2 months of uptime, I think my RAM was at fault.
How do I check the old RAM? I had run extensive runs of memtest with no errors. Or do I just hit the bay and buy another 32GB? Although at that point I might as well swap the platform out for a second-gen Zen platform, to go back to an ATX board and lower power consumption.