random unclean shut-downs

gray squirrel · April 25, 2020

Over the past few weeks I have experienced two random unclean shut-downs.

My IPMI log lists them as Critical Stop; OEM Event Data2 code = 69h; OEM Event Data3 code = 6Dh.

any pointers?

I have attached the diagnostics and a snippet from the IPMI event log.

thanks in advance.

Edit. I didn't trigger the shut-down.

megatron-diagnostics-20200425-1400.zip

Edited April 25, 2020 by gray squirrel
clarification

JorgeB · April 25, 2020

Best bet would probably be to email Asrock support to get a list of all the SEL codes, or at least to have them tell you what those are about.

gray squirrel · May 19, 2020

It took a little time for Asrock to come back to me, and I had to wait for another event while I was mirroring the logs to the USB. Asrock point to the OS, but I could use some help understanding the log. I was sitting next to the server at the time and I didn't even realize it had rebooted until I tried to RDP into a VM for some testing I'm doing.

IPMI has the lock up at 2020-05-19 14:09:13.

Below is a snippet of the log before the boot:

May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered blocking state
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 13:52:58 Megatron kernel: device vnet1 entered promiscuous mode
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered blocking state
May 19 13:52:58 Megatron kernel: br0: port 3(vnet1) entered forwarding state
May 19 13:53:03 Megatron kernel: vfio_ecap_init: 0000:03:00.0 hiding ecap 0x1e@0x258
May 19 13:53:03 Megatron kernel: vfio_ecap_init: 0000:03:00.0 hiding ecap 0x19@0x900
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:08:58 Megatron kernel: device vnet1 left promiscuous mode
May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none

My guess is the rest is the boot log?

May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21
May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
May 19 14:11:37 Megatron kernel: Command line: BOOT_IMAGE=/bzimage isolcpus=0-3,16-19 initrd=/bzroot
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
May 19 14:11:37 Megatron kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
May 19 14:11:37 Megatron kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
May 19 14:11:37 Megatron kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
May 19 14:11:37 Megatron kernel: BIOS-provided physical RAM map:

trurl · May 19, 2020

On 4/25/2020 at 9:21 AM, johnnie.black said:

Asrock support to get a list of all the SEL codes

6 minutes ago, gray squirrel said:

Asrock to come back to me .. asrock point to the OS

Did they provide the requested information?

7 minutes ago, gray squirrel said:

mirroring the logs to the USB

You should zip and post.

gray squirrel · May 19, 2020

They asked if there was anything in the logs. It doesn't look like it to me, but I'm new to Linux.

Is it safe to post the full syslog?

Edited May 19, 2020 by gray squirrel
typo

Frank1940 · May 19, 2020

Look for e-mail addresses and any possible directory/share names that you might not want 'known' to the larger world. User names and passwords would be another couple of items but I don't believe that they are logged by Unraid.
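A quick way to do that check before posting is a small script. This is only a sketch: the function name `find_sensitive` and the `extra_terms` list are made up for illustration, the e-mail pattern is deliberately simple, and you would fill in your own share or directory names.

```python
import re

# Simple e-mail pattern; it will miss unusual addresses and is only a first pass.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_sensitive(lines, extra_terms=()):
    """Return (line_number, text) for lines with an e-mail address or a listed term."""
    lowered_terms = [term.lower() for term in extra_terms]
    hits = []
    for number, line in enumerate(lines, start=1):
        has_email = EMAIL_RE.search(line) is not None
        has_term = any(term in line.lower() for term in lowered_terms)
        if has_email or has_term:
            hits.append((number, line.rstrip("\n")))
    return hits

if __name__ == "__main__":
    # "syslog" is the file name used in this thread; "private" is a placeholder term.
    with open("syslog", encoding="utf-8", errors="replace") as handle:
        for number, text in find_sensitive(handle, extra_terms=("private",)):
            print(f"{number}: {text}")
```

Anything it prints is a line worth reading by eye before you upload the file. Host-style strings such as `root@Develop` in the kernel banner are not matched, because they have no domain suffix.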

gray squirrel · May 19, 2020

see attached

syslog

gray squirrel · May 20, 2020

So Asrock have come back and recommend a fresh install with a different HD and RAM.

The only thing I can think of that has changed recently is upgrading to 6.8.3.

gray squirrel · June 2, 2020

I am completely stuck on this. I can't see anything in the logs that gives any more indication of the fault, and Asrock's view is that it's the OS creating the issue.

I had another random shutdown, and the last item in the log was one of the drives spinning down.

I am over 250 hours into running memtest to see if it still happens. Any ideas for troubleshooting, or do I need to build a completely new config?

Frank1940 · June 2, 2020

As I look at the problem, it is not a shutdown but a reboot as shown here:

May 19 14:08:58 Megatron kernel: br0: port 3(vnet1) entered disabled state
May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718, date = 2019-05-21
May 19 14:11:37 Megatron kernel: Linux version 4.19.107-Unraid (root@Develop) (gcc version 9.2.0 (GCC)) #1 SMP Thu Mar 5 13:55:57 PST 2020
May 19 14:11:37 Megatron kernel: Command line: BOOT_IMAGE=/bzimage isolcpus=0-3,16-19 initrd=/bzroot

I say this because there are only about 2 minutes 37 seconds between the last syslog entry before the event and the first syslog entry of the Unraid boot sequence.
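For anyone wanting to check that gap on their own log, here is a minimal sketch. It assumes the classic syslog timestamp prefix seen in this thread (`May 19 14:09:00`), and the function name `syslog_gap` is just illustrative. Syslog lines do not record the year, so it has to be passed in.

```python
from datetime import datetime, timedelta

def syslog_gap(last_line: str, first_boot_line: str, year: int = 2020) -> timedelta:
    """Return the time between two syslog lines that start with 'Mon DD HH:MM:SS'."""
    def stamp(line: str) -> datetime:
        # The first 15 characters hold the timestamp, e.g. "May 19 14:09:00".
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    return stamp(first_boot_line) - stamp(last_line)

last = "May 19 14:09:00 Megatron kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes"
boot = "May 19 14:11:37 Megatron kernel: microcode: microcode updated early to revision 0x718"
gap = syslog_gap(last, boot)
print(gap.total_seconds())  # 157.0 seconds, i.e. 2 minutes 37 seconds
```

A gap of a couple of minutes is about what a reset plus POST and boot takes, whereas a real shutdown would leave the box off until someone powered it back on.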

Do you have a pet or child who might have hit the reset button on the case?

Regarding the memtest, do you have ECC memory? (Standard memtest can not test ECC memory.)

Another thing to look at is the power supply. They have been involved in several similar types of problems. (Modern MBs and PSs 'talk' to each other...)

gray squirrel · June 2, 2020

Thanks for the pointers.

No pets or children have access to the server, as it's in my office. I even noticed it once as the fans spun up, but there was no beep from the motherboard (that I would associate with a restart or shutdown).

Memtest wasn't to check the RAM, more to just have something running that would not decide to restart itself; I was concerned about OS updates power cycling anyway. Is there a better testing methodology?

Unfortunately I don't have another PSU to test with. The current one is a Corsair HX unit that is about 8 years old. It's been running in this system for over a year with no problem.

Frank1940 · June 2, 2020

18 minutes ago, gray squirrel said:

i don't have another PSU to test with

If worst came to worst, you could arrange to 'borrow' one from a vendor with a liberal return policy...

gray squirrel · June 23, 2020

So I have been doing more troubleshooting, short of changing the RAM or using another PSU, as I don't have either on hand.

The system is now on a UPS. I have also tried reverting to 6.8.2, but with no luck.

I had another reboot at around 6am this morning with nothing in the mirrored syslog.

However, I am now getting detected hardware errors. Might this help explain the issue I have?

Edit: looks like a memory issue. Is the log able to tell me if it's one stick or all of it?

megatron-diagnostics-20200623-1131.zip

Edited June 23, 2020 by gray squirrel
more detail

Frank1940 · June 23, 2020

Jun 23 07:10:30 Megatron kernel: mce: [Hardware Error]: Machine check events logged
Jun 23 07:10:30 Megatron kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 23 07:10:30 Megatron kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c000046000800c0

Notice the 0 Bank 8 notification. I saw some other Bank numbers also. I am no expert in this area but I believe this does tell which stick the error is on BUT you will need to do a bit of research in your MB manual or the manufacturer's website to figure out which one it is pointing to.
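The hex value at the end of that line can also be pulled apart. The sketch below assumes it is an IA32_MCi_STATUS register value and uses the architectural bit layout from Intel's Software Developer's Manual as I understand it; the function and field names are my own. Note that "Bank 8" in an MCE line names a machine-check reporting bank inside the CPU, so it does not by itself name a DIMM slot, and the corrected-error count field is only meaningful on CPUs that implement it.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into a few architectural fields."""
    fields = {
        "valid": bool(status >> 63 & 1),        # bit 63: register holds a real error
        "overflow": bool(status >> 62 & 1),     # bit 62: an earlier error was overwritten
        "uncorrected": bool(status >> 61 & 1),  # bit 61: hardware could not correct it
        "misc_valid": bool(status >> 59 & 1),   # bit 59: MISC register has extra info
        "addr_valid": bool(status >> 58 & 1),   # bit 58: ADDR register holds an address
        "corrected_count": status >> 38 & 0x7FFF,  # bits 52:38
        "mca_error_code": status & 0xFFFF,         # bits 15:0
    }
    code = fields["mca_error_code"]
    # Memory-controller errors use the pattern 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        kinds = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}
        fields["memory_error_type"] = kinds.get(code >> 4 & 0x7, "reserved")
        channel = code & 0xF
        fields["channel"] = None if channel == 0xF else channel
    return fields

info = decode_mci_status(0x8C000046000800C0)
for name, value in info.items():
    print(f"{name}: {value}")
```

Read this way, the value from the log looks like a corrected (not uncorrected) memory error found during scrubbing on channel 0 of that memory controller. Mapping that channel to a physical slot still needs the MB manual, or the EDAC counters if your system exposes them.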

You seem to have a lot of memory sticks in your server. This can cause problems in some cases if the MB is picky about memory. Are the sticks on the 'approved list' for this MB? Again, is this ECC memory?

gray squirrel · June 23, 2020

I only have 4 sticks of RAM (8GB each).

My guess is that it is CPU 0 (the one on the left) and bank 8, the bottom-most slot below that socket.

Frank1940 · June 23, 2020

You are not doing any overclocking on CPU or memory??? (This is usually bad news with Unraid)

IF you are going to replace memory, make sure to check whether your MB is picky about memory. IF so, follow the recommendations. And I would always plan to replace the pair involved, not just one stick.

Edited June 23, 2020 by Frank1940

gray squirrel · August 16, 2020

So I switched the RAM out for 4X2GB sticks of ECC I had. After almost 2 months of uptime, I think my RAM was at fault.

How do I check the old RAM? I had done extensive runs of memtest with no errors.

Or do I just hit the bay and buy another 32GB? Although at that point I might as well swap the platform out for a second-gen Zen platform, to go back to an ATX board and lower power consumption.
