
random unclean shut-downs



Over the past few weeks I have experienced two random unclean shutdowns.

 

My IPMI log lists them as Critical Stop; OEM Event Data2 code = 69h; OEM Event Data3 code = 6Dh.

 

Any pointers?

 

I have attached the diagnostics and a snippet from the IPMI event log.

 

Thanks in advance.

 

Edit: I didn't trigger the shutdown.

 

 

Shutdown 1.JPG

Shutdown 2.JPG

megatron-diagnostics-20200425-1400.zip

Edited by gray squirrel
clarification
Link to comment
  • 4 weeks later...

It took a little time for ASRock to come back to me, and then I had to wait for another event while I was mirroring the logs to the USB. ASRock points to the OS, but I could use some help understanding the log. I was sitting next to the server at the time and didn't even realize it had rebooted until I tried to RDP into a VM for some testing I'm doing.

 

IPMI has the lock-up at 2020-05-19 14:09:13.

 

Below is a snippet of the log before the boot:

 

May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered blocking state
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 13:52:58 Megatron kernel: device vnet1 entered promiscuous mode
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered blocking state
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered forwarding state
May 19 13:53:03 Megatron kernel: vfio_ecap_init: 0000:03:00.0 hiding ecap 0x1e@0x258
May 19 13:53:03 Megatron kernel: vfio_ecap_init: 0000:03:00.0 hiding ecap 0x19@0x900
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:08:58 Megatron kernel: device vnet1 left promiscuous mode
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none

 

My guess is the rest is the boot log?


May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21
May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
May 19 14:11:37 Megatron kernel: Command line: BOOT_IMAGE=/bzimage isolcpus=0-3,16-19 initrd=/bzroot
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 19 14:11:37 Megatron kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
May 19 14:11:37 Megatron kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
May 19 14:11:37 Megatron kernel: BIOS-provided physical RAM map:

 

 

Link to comment
On 4/25/2020 at 9:21 AM, johnnie.black said:

Asrock support to get a list all the SEL codes

6 minutes ago, gray squirrel said:

Asrock to come back to me .. asrock point to the OS

Did they provide the requested information?

 

7 minutes ago, gray squirrel said:

mirroring the logs to the USB

You should zip and post.

Link to comment
  • 2 weeks later...

I am completely stuck on this. I can't see anything in the logs that gives any more indication of the fault, and ASRock's view is that it's the OS creating the issue.

 

I had another random shutdown, and the last item in the log was one of the drives spinning down.

 

I am over 250 hours into running memtest to see if it still happens. Any ideas for troubleshooting, or do I need to build a completely new config?

 

 

Link to comment

As I read the problem, it is not a shutdown but a reboot, as shown here:

May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21
May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
May 19 14:11:37 Megatron kernel: Command line: BOOT_IMAGE=/bzimage isolcpus=0-3,16-19 initrd=/bzroot  

I say this because there are only about 2 minutes 37 seconds between the last logged syslog entry and the first syslog entry of the Unraid boot sequence.
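The gap quoted above can be checked mechanically. The following is only a minimal sketch: it assumes classic syslog timestamps (which omit the year, so 2020 is supplied by hand) and treats a `kernel: Linux version` line as the marker of a fresh boot. The function names `parse_ts` and `reboot_gaps` are made up for illustration.

```python
from datetime import datetime

def parse_ts(line, year=2020):
    # Classic syslog lines start with "Mon DD HH:MM:SS" and carry no year,
    # so the year is an assumption supplied by the caller.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def reboot_gaps(lines, year=2020):
    """For each kernel boot banner, return (last_entry, boot, gap).

    last_entry is the timestamp of the most recent line that is strictly
    older than the banner, i.e. the last thing logged before the reboot.
    """
    results = []
    for i, line in enumerate(lines):
        if "kernel: Linux version" not in line:
            continue
        boot = parse_ts(line, year)
        for earlier in reversed(lines[:i]):
            ts = parse_ts(earlier, year)
            if ts < boot:
                results.append((ts, boot, boot - ts))
                break
    return results

# Lines taken from the log excerpt quoted in this thread.
sample = [
    "May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state",
    "May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none",
    "May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21",
    "May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020",
]

for last_entry, boot, gap in reboot_gaps(sample):
    print(f"last entry {last_entry}, boot {boot}, gap {gap}")
```

A gap of only a few minutes with no shutdown messages in between is consistent with a reset rather than a clean shutdown. A plain "largest gap" search would be misleading here, because the same log also contains a quiet 15-minute stretch while the VM was simply running.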

 

Do you have a pet or child who might have hit the reset button on the case?  

 

Regarding the memtest, do you have ECC memory? (Standard memtest cannot test ECC memory.)

 

Another thing to look at is the power supply. PSUs have been involved in several similar types of problems. (Modern MBs and PSUs 'talk' to each other...)

Link to comment

Thanks for the pointers.

 

No pets or children have access to the server, as it's in my office. I even noticed it once as the fans spun up, but there was no beep from the motherboard (which I would associate with a restart or shutdown).

 

Memtest wasn't to check the RAM; it was more to have something running that would not decide to restart itself, as I was concerned about OS updates power cycling it anyway. Is there a better testing methodology?

 

Unfortunately I don't have another PSU to test with. The current one is a Corsair HX unit that is about 8 years old. It has been running in this system for over a year with no problems.

Link to comment
  • 3 weeks later...

So I have been doing more troubleshooting, short of changing the RAM or using another PSU, as I don't have either on hand.

 

The system is now on a UPS. I have also tried reverting to 6.8.2, but with no luck.

 

I had another reboot at around 6 am this morning, with nothing in the mirrored syslog.

 

However, I am now getting detected hardware errors. Might this help explain the issue I have?

 

Edit: it looks like a memory issue. Is the log able to tell me if it's one stick or all of it?
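One way to narrow this down, sketched here under assumptions: when an EDAC driver is loaded (the `EDAC sbridge` lines suggest it is), Linux usually exposes per-DIMM error counters under `/sys/devices/system/edac/mc/`. The file names used below (`dimm_label`, `dimm_ce_count`, `dimm_ue_count`) follow the kernel's EDAC sysfs layout as I understand it; whether they are present and meaningfully labelled on this board is not confirmed. The function name `read_dimm_counts` is made up for illustration.

```python
from pathlib import Path

def read_dimm_counts(root="/sys/devices/system/edac/mc"):
    """Collect corrected/uncorrected error counts per DIMM from EDAC sysfs.

    Returns an empty list if the directory tree is missing, for example
    when no EDAC driver is loaded.
    """
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue

        def read(name, base=dimm):
            f = base / name
            return f.read_text().strip() if f.is_file() else ""

        rows.append({
            "controller": dimm.parent.name,            # e.g. mc0
            "dimm": dimm.name,                         # e.g. dimm3
            "label": read("dimm_label"),               # slot label, if the driver sets one
            "corrected": int(read("dimm_ce_count") or 0),
            "uncorrected": int(read("dimm_ue_count") or 0),
        })
    return rows

if __name__ == "__main__":
    for row in read_dimm_counts():
        print(row)
```

If the corrected count keeps rising on a single entry, that points at one stick or slot; counts spread across all entries point more towards the board, CPU, or power. The labels only map to physical slots if the driver fills them in, so the MB manual may still be needed.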

megatron-diagnostics-20200623-1131.zip

Edited by gray squirrel
more detail
Link to comment
Jun 23 07:10:30 Megatron kernel: mce: [Hardware Error]: Machine check events logged
Jun 23 07:10:30 Megatron kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 23 07:10:30 Megatron kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c000046000800c0

Notice the "0 Bank 8" notification. I saw some other bank numbers as well. I am no expert in this area, but I believe this does tell which stick the error is on, BUT you will need to do a bit of research in your MB manual or on the manufacturer's website to figure out which one it is pointing to.

 

You seem to have a lot of memory sticks in your server. This can cause problems in some cases if the MB is picky about memory. Are the sticks on the 'approved list' for this MB? Again, is this ECC memory?
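For anyone wanting to read the hex value on that line, here is a rough sketch of decoding it. It assumes the value is an Intel `IA32_MCi_STATUS` register and uses the architectural bit positions from Intel's Software Developer's Manual as I understand them; the sub-field decoding of memory-controller error codes in particular should be checked against the manual for this CPU. As far as I know, "Bank 8" is the number of the machine-check register bank that reported the event, not a DIMM slot number. The function names are made up for illustration.

```python
def decode_mci_status(status):
    """Split a 64-bit machine-check status value into its architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a valid error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: False means the hardware corrected it
        "enabled": bit(60),          # EN
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV: an address was captured
        "context_corrupt": bit(57),  # PCC
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }

# Assumed encoding for memory-controller errors: 0000 0000 1MMM CCCC.
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}

def decode_memory_controller_code(mca_code):
    """Return (operation, channel) for a memory-controller MCA code, else None."""
    if (mca_code & 0xFF80) != 0x0080:
        return None
    operation = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
    channel = mca_code & 0xF   # 0xF would mean "channel not specified"
    return operation, channel

fields = decode_mci_status(0x8C000046000800C0)
print(fields)
print(decode_memory_controller_code(fields["mca_code"]))
```

Under those assumptions, the value from the log decodes as a valid, corrected (not uncorrected) error with a captured address, reported by the memory controller on channel 0. A corrected error would fit ECC memory doing its job, and the channel number narrows things down to the slots on that channel rather than to a single stick.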

Link to comment

You are not doing any overclocking on CPU or memory? (This is usually bad news with Unraid.)

 

If you are going to replace memory, make sure to check whether your MB is picky about memory. If so, follow the recommendations. And I would always plan to replace the pair involved, not just one stick.

Edited by Frank1940
Link to comment
  • 1 month later...

So I switched the RAM out for 4x2GB sticks of ECC I had. After almost 2 months of uptime, I think my RAM was at fault.

 

How do I check the old RAM? I had run extensive memtest passes with no errors.

 

Or do I just hit eBay and buy another 32GB? Although at that point I might as well swap the platform out for a second-gen Zen platform, to go back to an ATX board and lower power consumption.

Link to comment
