greennick Posted July 15, 2018 Share Posted July 15, 2018 Not sure how much info you need to assist, so I'll do my best; let me know if there is anything else you need. Syslog attached following these instructions. My server: X9DRD-7LN4F 2x Xeon 2630L v2 3x 3TB HGST with one set to parity Samsung 850 500GB as cache 32GB (8x4GB) ECC RAM Any ideas on what I need to look at changing? Any diagnostics I should be running? syslog
Frank1940 Posted July 15, 2018 Share Posted July 15, 2018 Install the Fix Common Problems plugin and turn on its troubleshooting mode. That mode will write periodic syslog updates to your flash drive. You could also connect a monitor to your server and see if there are any clues on the screen after the crash. Did the server hardware ever work without a problem? If so, what changed in the hardware or software setup? You might also tell us what software and VMs you are using.
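For anyone reading along: the point of troubleshooting mode is that the normal syslog lives in RAM and is lost on a hard crash, so a copy has to be snapshotted somewhere persistent. This is only a rough sketch of that idea, not the plugin's actual code; on a real unRAID box the paths would be /var/log/syslog and a folder under /boot on the flash drive, but here temp paths are used so the sketch is self-contained:

```shell
# Illustrative only: snapshot the live syslog to persistent storage so it
# survives a crash. Real unRAID paths would be /var/log/syslog -> /boot/... ;
# temp stand-ins are used here so this runs anywhere.
SYSLOG=$(mktemp)            # stand-in for /var/log/syslog
DEST=$(mktemp -d)/logs      # stand-in for a folder on the flash drive
echo "Jul 15 13:32:22 Tower kernel: example entry" > "$SYSLOG"
mkdir -p "$DEST"
cp "$SYSLOG" "$DEST/syslog-snapshot.txt"
ls "$DEST"
```

Run something like this from cron every few minutes and the last snapshot before the crash is what you attach to the thread.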
greennick Posted July 15, 2018 Author Share Posted July 15, 2018 Thanks @Frank1940 I'll do that and report back. I wasn't even running anything on it today and it still crashed. I usually run Sonarr, Radarr, SABnzbd, Plex, and maybe Deluge. The server worked fine for months with no issues and I made no hardware changes, but I got all the above working 3 weeks ago and downloaded a lot of files. It seems to have started crashing regularly around 10 days ago.
ashman70 Posted July 15, 2018 Share Posted July 15, 2018 Did you update unRAID to a newer version by chance?
Frank1940 Posted July 15, 2018 Share Posted July 15, 2018 Did you add more memory about the time this all started happening? EDIT: See below Jul 15 13:32:22 Tower kernel: mce: [Hardware Error]: Machine check events logged Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010091 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: TSC 21390b306b4 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: ADDR 84d5030c0 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: MISC 14016b086 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1531632742 SOCKET 1 APIC 20 Jul 15 13:32:22 Tower kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x84d503 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:8 rank:0)
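The useful part of that EDAC line is the DIMM locator, which pinpoints the physical module: CPU_SrcID#1 is the second socket, Ha#0 is home agent 0, Chan#3 is memory channel 3, DIMM#0 is the first slot on that channel. A quick way to pull the locator out of a syslog (shown here against a copy of the line above, since this is just a sketch):

```shell
# Extract the EDAC DIMM locator from a syslog line. The locator names the
# physical module: socket (CPU_SrcID), home agent (Ha), channel (Chan), slot (DIMM).
LINE='Jul 15 13:32:22 Tower kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 rank:0)'
LOCATOR=$(echo "$LINE" | grep -o 'CPU_SrcID#[0-9]*_Ha#[0-9]*_Chan#[0-9]*_DIMM#[0-9]*')
echo "$LOCATOR"   # prints: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
```

Cross-reference that locator against the board manual's DIMM slot map to find which stick to reseat or swap.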
greennick Posted July 16, 2018 Author Share Posted July 16, 2018 21 hours ago, Frank1940 said: Did you add more memory about the time this all started happening? It's had the same memory since I built it. I will roll back the version first, then check the memory if the problem persists.
greennick Posted July 16, 2018 Author Share Posted July 16, 2018 22 hours ago, ashman70 said: Did you update unRAID to a newer version by chance? You might be on to something. I downloaded the upgrade to 6.5.3 but didn't reboot; however, it seems to have switched to that version when the server crashed. I'll roll back and report back on stability. Thanks.
ashman70 Posted July 16, 2018 Share Posted July 16, 2018 Yeah, I am not trying to suggest there is anything wrong with unRAID; however, I have personally found that at least one newer build behaved erratically on one of my servers, so I rolled back to the previous version that was stable for me. When I have time I will look into it some more, but I simply couldn't afford to have a server not functioning after an update and I didn't have time to investigate it further.
JorgeB Posted July 16, 2018 Share Posted July 16, 2018 14 minutes ago, greennick said: You might be on to something, I downloaded the upgrade to 6.5.3, but didn't reboot it, however it seems to have switched to that when the server crashed. That means the crash happened with the old version, as the new one would only be used after a reboot.
greennick Posted July 17, 2018 Author Share Posted July 17, 2018 9 hours ago, johnnie.black said: That means the crash happened with the old version, as the new one would only be used after a reboot. I know that, but still trying to eliminate possible causes of the continued crashes, figured it can't hurt! It's been online for over 10 hours now, which is better than 6.5.3 was giving me. Fingers crossed now that I've fired up all the Dockers and started some downloads.
greennick Posted July 18, 2018 Author Share Posted July 18, 2018 It was more stable yesterday, but still crashed overnight after a fair amount of downloading. I pulled the syslog again and it looks like my memory errors popped up again (attached). So I'll try to work out how to run memtest and report back. syslog 2018 07 18
Frank1940 Posted July 18, 2018 Share Posted July 18, 2018 You might also want to check that the installed RAM is actually recommended by the motherboard manufacturer. Some motherboards are picky about the RAM used. This issue often raises its ugly head when the number of RAM modules increases (probably because of the loading on the address buses). Google can be your friend for this type of research. Running memtest is one of the options in the boot menu for unRAID...
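Alongside memtest, on a running Linux box the kernel's EDAC driver keeps per-memory-controller counters of correctable errors under /sys/devices/system/edac/mc/mc*/ce_count, so you can watch the counts grow without waiting for a crash. A sketch of summing those counters follows; since a real EDAC sysfs tree won't exist everywhere, temp files stand in for it here, seeded with the one correctable error from the syslog:

```shell
# On a live system the counters are at /sys/devices/system/edac/mc/mc*/ce_count.
# Simulated with temp files here so the sketch runs anywhere.
EDAC=$(mktemp -d)                 # stand-in for /sys/devices/system/edac/mc
mkdir -p "$EDAC/mc0" "$EDAC/mc1"
echo 0 > "$EDAC/mc0/ce_count"     # memory controller 0: clean
echo 1 > "$EDAC/mc1/ce_count"     # memory controller 1: the one CE logged above
TOTAL=$(awk '{sum += $1} END {print sum}' "$EDAC"/mc*/ce_count)
echo "total correctable errors: $TOTAL"
```

A count that keeps climbing on the same controller is a strong hint about which bank of DIMMs to swap, even when memtest reports nothing.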
greennick Posted July 18, 2018 Author Share Posted July 18, 2018 4 minutes ago, Frank1940 said: You might also want to check to see that the installed RAM is actually recommended by the MB manufacturer... I am hoping it isn't a RAM conflict; it is server RAM to spec in a server board, which is as good as I could do. There isn't much RAM recommended by Supermicro, as they want you to buy proprietary modules or their small approved lines from key suppliers. I worked out how to run memtest earlier; no errors so far, 43% in. TBH, I never even see the boot screen as I mainly run it headless! At the end of this, I don't know whether I'd rather find a fault or not; at least if there is one, I'll know the issue. Thanks
Frank1940 Posted July 18, 2018 Share Posted July 18, 2018 Be sure to run it for at least 24 hours...
greennick Posted July 19, 2018 Author Share Posted July 19, 2018 No errors. Hmmm, will see how it runs overnight.
JorgeB Posted July 19, 2018 Share Posted July 19, 2018 Memtest won't work on systems with ECC: if there are single-bit errors, they will be corrected and go undetected. You can try checking the board's SEL (system event log); there might be more info there.
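On a Supermicro board like the X9DRD-7LN4F, the SEL lives in the BMC and can be read with ipmitool sel elist (run locally, or remotely with ipmitool -I lanplus -H <bmc-ip> -U <user> sel elist) or via the IPMI web interface. A sketch of filtering it down to memory events; the SEL entries below are made up for illustration, not from this server:

```shell
# Hypothetical SEL output; on the real board you would capture it with
# 'ipmitool sel elist'. The two entries here are invented examples.
SEL='   1 | 07/15/2018 | 13:32:22 | Memory #0x53 | Correctable ECC | Asserted
   2 | 07/15/2018 | 14:01:09 | Power Supply #0x51 | Presence detected | Asserted'

# Keep only the memory/ECC events and count them.
printf '%s\n' "$SEL" | grep -Ei 'ecc|memory'
MEMEVENTS=$(printf '%s\n' "$SEL" | grep -Eci 'ecc|memory')
echo "memory-related SEL entries: $MEMEVENTS"
```

Unlike memtest, the SEL records corrected ECC events as they happen, so it catches exactly the single-bit errors that memtest on an ECC system hides.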
greennick Posted July 22, 2018 Author Share Posted July 22, 2018 On 7/19/2018 at 10:00 PM, johnnie.black said: Memtest won't work on systems with ECC... Seems like you're right
pwm Posted July 22, 2018 Share Posted July 22, 2018 On 7/19/2018 at 4:00 PM, johnnie.black said: Memtest won't work on systems with ECC... The good thing about Memtest is that it will still stress-test the memory, so running Memtest before looking at the event log can help locate bad memory.
JorgeB Posted July 22, 2018 Share Posted July 22, 2018 And don't forget that single-bit errors are corrected, but if a multi-bit error occurs the server will halt/crash to prevent corruption. I'm not sure whether those would be logged in the SEL, but I would expect so; in any case, that DIMM should be replaced.