
Server Crashing. Need Help Finding Fault



Mar 22 02:06:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
Mar 22 02:08:05 Tower kernel: mce: [Hardware Error]: Machine check events logged

 

A Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit detects a hardware problem.

Your computer experienced a hardware error and the kernel logged an event in a buffer. You can use mcelog to log and view the machine check events. From mcelog manpage:

 

X86 CPUs report errors detected by the CPU as machine check events (MCEs). These can be data corruption detected in the CPU caches, in main memory by an integrated memory controller, data transfer errors on the front side bus or CPU interconnect or other internal errors. Possible causes can be cosmic radiation, instable power supplies, cooling problems, broken hardware, running systems out of specification, or bad luck.

 

Most errors can be corrected by the CPU by internal error correction mechanisms. Uncorrected errors cause machine check exceptions which may kill processes or panic the machine. A small number of corrected errors is usually not a cause for worry, but a large number can indicate future failure.

 

When a corrected or recovered error happens the x86 kernel writes a record describing the MCE into a internal ring buffer available through the /dev/mcelog device. mcelog retrieves errors from /dev/mcelog, decodes them into a human readable format and prints them on the standard output or optionally into the system log.

 

Unfortunately, unRaid doesn't include mcelog, so you can't see what was actually logged. You can, however, try

cat /dev/mcelog

from PuTTY or a console. I'm not sure this will work, since I don't have any MCE problems to test with. Note that if your crash is an outright reset or shutdown, this log will be cleared when the server starts back up.
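Without mcelog, a rough way to see how often these events show up is to count the "Machine check events logged" lines in a saved syslog. Below is a minimal Python sketch, assuming the syslog line format quoted at the top of this thread. The function name count_mce_events is made up for illustration, and this only counts the kernel notices; it does not decode what the hardware error actually was.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Mar 22 02:06:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
# (format assumed from the two lines quoted in this thread)
MCE_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:.*"
    r"mce: \[Hardware Error\]"
)

def count_mce_events(lines):
    """Return a Counter mapping 'Mon D' to the number of MCE notices seen that day."""
    per_day = Counter()
    for line in lines:
        m = MCE_LINE.match(line)
        if m:
            per_day[f"{m.group('month')} {int(m.group('day'))}"] += 1
    return per_day

# The two lines from the original post.
sample = [
    "Mar 22 02:06:01 Tower kernel: mce: [Hardware Error]: Machine check events logged",
    "Mar 22 02:08:05 Tower kernel: mce: [Hardware Error]: Machine check events logged",
]
print(count_mce_events(sample))
```

To run it against a real log, pass an open file object instead of the sample list. A count that keeps growing from week to week would fit the manpage's warning that a large number of corrected errors can indicate future failure.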

 

The first thing to do is clean all the dust bunnies (and dust) out of the system. Not only do they inhibit your cooling system, they can also conduct electricity.

 

Next, reseat all of your memory sticks and run a memtest.

 

Then post the complete specs of your hardware (including the model number of the power supply).

 


Those Machine Check Events have happened on and off for the last year. I think my machine might be crashing when Plex cleans up the media metadata once a week (it's stored on my cache). It also might have crashed the next day because the mover started. (I've had some time to think it through.)

 

Finally got some time up my sleeve. I'm going to give it a clean, and I'll check all my SATA cables while I'm at it.

 

Going to check if Mover got interrupted and fix the move if need be.

 

My cache drive is starting to make noise when I turn the server on. It is getting old now (4 years 9 months).

 

I am moving to unRaid 6 and getting some SSDs for my cache to minimize HDD thrashing. I hope that will fix it.

 

Thanks Squid for showing me where to find the MCE log. I'll definitely be keeping an eye on it.

 

I'll let you know how the upgrade/fix goes over the next week or so.

 

I'll update my sig soon with my PSU model number.

Link to comment

Had a power outage over the weekend, so I thought I would run a non-correcting parity check. I got a whole heap of errors (latest syslog attached).

The short SMART test for the drive is attached as well.

I have no more SATA ports available.

 

Is this drive failing?

 

Can I safely pull the data off the array to start from scratch?

 

Can I upgrade to unRaid 6 with these problems?

 

Also, is the cache drive failing?

 

Your replies are much appreciated.

 

 

syslog-2016-03-29.zip

smart_report_disk1_29-03-2016.txt

smart_report_cache_29-03-2016.txt


Disk1 needs replacing:

 

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       47

 

The cache disk looks OK, but you probably have a bad SATA cable. Try replacing it and monitor this value; an increase of 2 or more means there's still a problem:

 

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       9597
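The two attribute lines quoted in this post can also be checked mechanically. Here is a rough Python sketch, assuming the ten-column attribute table layout seen in the lines above. The helper names are illustrative only, and the "increase of 2 or more" threshold is simply the rule of thumb from this post, not an official limit.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value for rows of a SMART attribute table."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            raw[fields[1]] = int(fields[9])
    return raw

def crc_still_increasing(before, after, threshold=2):
    """Rule of thumb from this post: a rise of 2 or more suggests the cable problem persists."""
    return after - before >= threshold

# The two lines quoted in this post.
disk1 = "197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       47"
cache = "199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       9597"

print(parse_smart_attributes(disk1))
print(parse_smart_attributes(cache))
```

After swapping the cable, note the UDMA_CRC_Error_Count raw value (9597 here), read it again after a few days of use, and pass both numbers to crc_still_increasing. The raw count never resets, so only the change matters.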

  • 1 month later...

So after a month of my server sitting turned off waiting for the funds to replace disk1, I replaced it with a 3TB. The data rebuilt nicely. My server is up and running.

While my server was off for the month, my UPS was still on as I had my gigabit switch still plugged in. During this time I noticed my UPS was feeling rather hot. Anyway one night when shutting down my laptop I heard my UPS switch off and a few seconds later switch on. So I have since got my UPS checked out and it turns out my battery has fried itself.

I will be upgrading to unRaid 6 and getting the UPS communication sorted. Hopefully I will be able to catch this sort of problem in the future.

 

Thanks for the help I received.

 

 

