BillClinton Posted March 22, 2016
This is my first post and my first major problem. Over the last 2 weeks my server has crashed (dirty shutdown) 3 times. If anyone can help, it would be much appreciated. I have attached the tail of the syslog from before the last crash, and the boot syslog from after it. I have also attached SMART reports for my drives, taken before and after a short test. Thanks
Unraid_syslog_and_reports.zip
Squid Posted March 22, 2016
Mar 22 02:06:01 Tower kernel: mce: [Hardware Error]: Machine check events logged
Mar 22 02:08:05 Tower kernel: mce: [Hardware Error]: Machine check events logged
A Machine Check Exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit detects a hardware problem. Your computer experienced a hardware error and the kernel logged an event in a buffer. You can use mcelog to log and view the machine check events.
From the mcelog manpage:
"X86 CPUs report errors detected by the CPU as machine check events (MCEs). These can be data corruption detected in the CPU caches, in main memory by an integrated memory controller, data transfer errors on the front side bus or CPU interconnect or other internal errors. Possible causes can be cosmic radiation, instable power supplies, cooling problems, broken hardware, running systems out of specification, or bad luck. Most errors can be corrected by the CPU by internal error correction mechanisms. Uncorrected errors cause machine check exceptions which may kill processes or panic the machine. A small number of corrected errors is usually not a cause for worry, but a large number can indicate future failure. When a corrected or recovered error happens the x86 kernel writes a record describing the MCE into a internal ring buffer available through the /dev/mcelog device. mcelog retrieves errors from /dev/mcelog, decodes them into a human readable format and prints them on the standard output or optionally into the system log."
Unfortunately, unRaid doesn't include mcelog, so you can't see what was actually logged.
(But you can try cat /dev/mcelog from PuTTY or a console. I'm not sure whether this will work, since I don't have any MCE problems. Note that if your crash is an outright reset or shutdown, this log will be cleared when the system starts back up.)
First, clean all the dust bunnies (and dust) out of the system. Not only do they inhibit your cooling, they can also conduct electricity.
Next, reseat all of your memory sticks and run a memtest.
Then post the complete specs of your hardware (including the model number of the power supply).
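Since mcelog isn't available to decode the events, one thing that can still be done is to list when the kernel reported them, to see whether they line up with the crashes. A minimal Python sketch of that idea follows. It assumes the syslog lines look exactly like the two quoted earlier in this thread; the function name find_mce_events is made up for illustration, and it only finds the notices, it does not decode what the hardware error actually was.

```python
import re

# Matches the kernel's MCE notice in the form quoted in this thread, e.g.
# "Mar 22 02:06:01 Tower kernel: mce: [Hardware Error]: Machine check events logged"
# Assumption: other syslog layouts would need a different pattern.
MCE_PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: mce: \[Hardware Error\]"
)


def find_mce_events(lines):
    """Return the timestamp string of every MCE notice found in `lines`.

    `lines` can be any iterable of strings, such as an open syslog file.
    """
    stamps = []
    for line in lines:
        match = MCE_PATTERN.match(line)
        if match:
            stamps.append(match.group("stamp"))
    return stamps
```

To use it on a saved log, open the file and pass the file object in, for example find_mce_events(open("syslog")), where "syslog" stands for wherever your copy of the log is saved. Comparing the returned timestamps with the times of the dirty shutdowns would show whether the events cluster around the crashes or, as is also possible, turn up at unrelated times.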
BillClinton Posted March 23, 2016 Author
Those Machine Check Events have happened on and off for the last year. Having had some time to think it through, I think my machine might be crashing when Plex cleans up the media metadata once a week (it's stored on my cache). It might also have crashed the next day because the mover started.
I finally have some time up my sleeve, so I'm going to give it a clean. I'll check all my SATA cables while I'm at it. I'm also going to check whether the mover got interrupted and fix the move if need be.
My cache drive is starting to make noise when I turn the server on. It is getting old now (4 years 9 months). I am heading to Unraid 6 and will get some SSDs for my cache to minimize HDD thrashing. I hope that will fix it.
Thanks, Squid, for showing me where to find the MCE log. I'll definitely be keeping an eye on it. I'll let you know how the upgrade/fix goes over the next week or so. I'll update my sig soon with my PSU model number.
BillClinton Posted March 29, 2016 Author
I had a power outage over the weekend, so I thought I would do a non-correcting parity check. I got a whole heap of errors (latest syslog attached). The short SMART test results for the drives are attached as well.
I have no more SATA ports available. Is this drive failing? Can I safely pull the data off the array to start from scratch? Can I upgrade to Unraid 6 with these problems? Also, is the cache drive failing?
Your replies are much appreciated.
syslog-2016-03-29.zip smart_report_disk1_29-03-2016.txt smart_report_cache_29-03-2016.txt
JorgeB Posted March 29, 2016
Disk1 needs replacing:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 47
The cache disk looks OK, but you probably have a bad SATA cable. Try replacing it and monitor this value; an increase of 2 or more means there's still a problem:
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 9597
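The "monitor this value" advice can be checked by comparing two SMART reports taken some time apart. Below is a rough Python sketch of that comparison. It assumes the attribute lines have the ten-column layout of the two lines quoted in this post (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value). The function names are hypothetical, and treating any non-zero Current_Pending_Sector count as a warning is this sketch's own simplification of the call made here on a count of 47; the "increase of 2 or more" limit for UDMA_CRC_Error_Count is taken from the post.

```python
def parse_smart_attributes(report_text):
    """Map attribute name -> raw value for each attribute line in a SMART report.

    Assumes the ten-column attribute layout quoted in this thread; lines that
    do not fit it (headers, blank lines, non-numeric raw values) are skipped.
    """
    attrs = {}
    for line in report_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def check_disk(before, after, crc_increase_limit=2):
    """Compare two parsed reports and return a list of warning strings."""
    warnings = []

    # Assumption: any pending sectors at all are worth flagging.
    pending = after.get("Current_Pending_Sector", 0)
    if pending > 0:
        warnings.append(f"{pending} pending sectors: disk may need replacing")

    # From the post: a rise of 2 or more in the CRC count means the
    # cabling problem is still there.
    crc_delta = after.get("UDMA_CRC_Error_Count", 0) - before.get("UDMA_CRC_Error_Count", 0)
    if crc_delta >= crc_increase_limit:
        warnings.append(f"CRC error count rose by {crc_delta}: check the SATA cable")

    return warnings
```

The idea would be to save one report before swapping the cable and another after the server has run for a while, parse both with parse_smart_attributes, and pass the results to check_disk. The absolute CRC count (9597 here) is not compared against anything, because only growth after the cable swap matters.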
BillClinton Posted May 5, 2016 Author
After a month of my server sitting turned off while I waited for the funds to replace disk1, I replaced it with a 3TB drive. The data rebuilt nicely, and my server is up and running.
While the server was off for the month, my UPS was still on, as my gigabit switch was still plugged into it. During this time I noticed the UPS was feeling rather hot. Then one night, when shutting down my laptop, I heard the UPS switch off and, a few seconds later, switch back on. I have since had the UPS checked out, and it turns out the battery has fried itself.
I will be upgrading to Unraid 6 and getting the UPS communication sorted. Hopefully I will be able to catch this sort of problem in the future.
Thanks for the help I received.