
Unraid keeps on crashing


greennick


Install the Fix Common Problems plugin and turn on its troubleshooting mode. That mode periodically writes syslog updates to your flash drive. You could also connect a monitor to your server and see if there are any clues on the screen after the crash.
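
Once a syslog copy has been saved, a quick filter can pull out the lines most likely to matter. The following is a minimal Python sketch, not part of the plugin: the marker list is an illustrative assumption, and the path of the saved syslog is left to the caller because the thread does not state where the plugin writes it.

```python
from typing import Iterable, List, Sequence

# Illustrative substrings to flag (an assumption, not an exhaustive list).
MARKERS = ("[Hardware Error]", "EDAC", "Machine Check", "Call Trace")


def flag_lines(lines: Iterable[str], markers: Sequence[str] = MARKERS) -> List[str]:
    """Return the lines that contain any marker, compared case-insensitively."""
    lowered = [m.lower() for m in markers]
    flagged = []
    for line in lines:
        text = line.rstrip("\n")
        if any(m in text.lower() for m in lowered):
            flagged.append(text)
    return flagged


def flag_file(path: str) -> List[str]:
    """Scan a saved syslog file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return flag_lines(handle)
```

Calling `flag_file()` with the path of the saved syslog returns the matching lines in their original order, so the entries just before a crash stay easy to spot.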

 

Did the server hardware ever work without a problem? If so, what changed in the hardware or software setup?

 

You might also tell us what software and VMs you are using.

Link to comment

Thanks @Frank1940

 

I'll do that and report back.

 

I wasn't even running anything on it today and it still crashed. I usually run Sonarr, Radarr, Sab, Plex, and maybe Deluge. The server worked fine for months with no issues and I made no hardware changes, but I got all of the above working 3 weeks ago and downloaded a lot of files. It seems to have started crashing regularly around 10 days ago.

Link to comment

Did you add more memory about the time this all started happening?  

 

EDIT:  See below

Jul 15 13:32:22 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010091
Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: TSC 21390b306b4 
Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: ADDR 84d5030c0 
Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: MISC 14016b086 
Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1531632742 SOCKET 1 APIC 20
Jul 15 13:32:22 Tower kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x84d503 offset:0xc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:8 rank:0)
Jul 15 13:32:23 Tower root: error: /plugins/unassigned.devices/UnassignedDevices.php: wrong csrf_token
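
The final EDAC line in the log above names the affected memory location (memory controller, channel, slot) and marks the error as CE, the kernel's abbreviation for a corrected error. As a rough sketch, assuming syslog lines in exactly the format shown here, the following Python pulls those fields out and tallies events per DIMM label. The function names are hypothetical, and the regular expression has only been matched against the line quoted in this thread.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Matches summary lines such as:
#   EDAC MC1: 1 CE memory read error on <label> (channel:3 slot:0 ...
# The "EDAC sbridge MC1: ..." detail lines are deliberately not matched.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def parse_edac_line(line: str) -> Optional[dict]:
    """Return the parsed fields of an EDAC summary line, or None if no match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    return {
        "mc": int(match["mc"]),
        "count": int(match["count"]),
        "kind": match["kind"],  # CE = corrected, UE = uncorrected
        "label": match["label"],
        "channel": int(match["channel"]),
        "slot": int(match["slot"]),
    }


def tally_by_dimm(lines: Iterable[str]) -> Counter:
    """Sum reported error counts per (kind, DIMM label) pair."""
    totals: Counter = Counter()
    for line in lines:
        event = parse_edac_line(line)
        if event is not None:
            totals[(event["kind"], event["label"])] += event["count"]
    return totals
```

If the same label keeps turning up in the tally, that DIMM (or its slot) is the first thing to reseat or swap.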

 

Link to comment
21 hours ago, Frank1940 said:

Did you add more memory about the time this all started happening?  

 


It's had the same memory since I built it. Will roll back the version first, then check the memory if the problem persists.

 

Link to comment
22 hours ago, ashman70 said:

Did you update unRAID to a newer version by chance?

 

You might be on to something. I downloaded the upgrade to 6.5.3 but didn't reboot; however, it seems to have switched to that version when the server crashed. I'll roll back and report back on stability.

 

Thanks.

Link to comment

Yeah, I am not trying to suggest there is anything wrong with unRAID; however, I have personally found that at least one newer build behaved erratically on one of my servers, so I rolled back to the previous version that was stable for me. When I have time I will look into it some more, but I simply couldn't afford to have a server not functioning after an update and I didn't have time to investigate it further.

Link to comment
14 minutes ago, greennick said:

 

You might be on to something. I downloaded the upgrade to 6.5.3 but didn't reboot; however, it seems to have switched to that version when the server crashed. I'll roll back and report back on stability.

 

Thanks.

That means the crash happened with the old version, as the new one would only be used after a reboot.

Link to comment
9 hours ago, johnnie.black said:

That means the crash happened with the old version, as the new one would only be used after a reboot.

I know that, but still trying to eliminate possible causes of the continued crashes, figured it can't hurt!

 

It's been online for over 10 hours now, which is better than 6.5.3 was giving me. Fingers crossed now that I've fired up all the Dockers and started some downloads.

Link to comment

You might also want to check that the installed RAM is actually recommended by the MB manufacturer. Some MBs are picky about the RAM used. This issue often rears its ugly head when the number of RAM modules increases. (Probably because of the loading on the address busses.) Google can be your friend for this type of research.

 

Running Memtest is one of the options in the boot menu for unRAID...

Link to comment
4 minutes ago, Frank1940 said:

You might also want to check that the installed RAM is actually recommended by the MB manufacturer. Some MBs are picky about the RAM used. This issue often rears its ugly head when the number of RAM modules increases. (Probably because of the loading on the address busses.) Google can be your friend for this type of research.

 

Running Memtest is one of the options in the boot menu for unRAID...

 

I am hoping it isn't a RAM conflict. It is server RAM in a server board and it is to spec; that's as good as I could do. There isn't much RAM available that is recommended by Supermicro, as they want you to buy either their proprietary modules or from the small approved lines of their key suppliers.

 

I worked out how to run Memtest earlier; so far no errors at 43% in. TBH, I never even see the boot screen as I mainly run it headless! At the end of this, dunno if I would rather have a fault or not. At least I'd know the issue if there is a fault.

 

Thanks

Link to comment
On 7/19/2018 at 10:00 PM, johnnie.black said:

Memtest won't work on systems with ECC: if there are single-bit errors, they will be corrected and go undetected. You can try checking the board's SEL (system event log); there might be more info there.

 

Seems like you're right

20180722_193849.jpeg

Link to comment
On 7/19/2018 at 4:00 PM, johnnie.black said:

Memtest won't work on systems with ECC: if there are single-bit errors, they will be corrected and go undetected. You can try checking the board's SEL (system event log); there might be more info there.

 

The good thing with Memtest is that it will still stress-test the memory, so running Memtest before looking at the event log can help locate bad memory.
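
The same before-and-after idea can be applied from within Linux to any memory stress run that happens while the OS is up, since the kernel's EDAC driver keeps running totals of corrected (CE) and uncorrected (UE) errors. Here is a minimal Python sketch, assuming the EDAC driver is loaded (the `EDAC sbridge` lines earlier in the thread suggest it is) and that the counters sit under the usual Linux sysfs location `/sys/devices/system/edac/mc`. Whether a given unRAID build exposes that path has not been checked here, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional

Counts = Dict[str, Dict[str, Optional[int]]]


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> Counts:
    """Read ce_count / ue_count for every memory controller under root."""
    counts: Counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC sysfs not present; nothing to report
    for controller in sorted(base.glob("mc[0-9]*")):
        entry: Dict[str, Optional[int]] = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[controller.name] = entry
    return counts


def diff_counts(before: Counts, after: Counts) -> Dict[str, Dict[str, int]]:
    """Return how much each counter grew between two snapshots."""
    growth: Dict[str, Dict[str, int]] = {}
    for controller, entry in after.items():
        previous = before.get(controller, {})
        for name, value in entry.items():
            old = previous.get(name)
            if value is None or old is None:
                continue  # cannot compare without both readings
            growth.setdefault(controller, {})[name] = value - old
    return growth
```

Take one snapshot with `read_edac_counts()` before the stress run and one after it, then pass both to `diff_counts()`. A growing `ce_count` means errors were corrected silently during the run.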

Link to comment

Archived

This topic is now archived and is closed to further replies.
