greennick Posted July 15, 2018 Share Posted July 15, 2018 Not sure how much info you need to assist, so I'll do my best; let me know if there is anything else you need. Syslog attached following these instructions. My server: X9DRD-7LN4F 2x Xeon 2630L v2 3x 3TB HGST with one set to parity Samsung 850 500GB as cache 32GB (8x4GB) ECC RAM Any ideas on what I need to look at changing? Any diagnostics I should be running? syslog
Frank1940 Posted July 15, 2018 Share Posted July 15, 2018 Install the Fix Common Problems plugin and turn on its troubleshooting mode. That mode will write periodic syslog updates to your flash drive. You could also connect a monitor to your server and see if there are any clues on the screen after the crash. Did the server hardware ever work without a problem? If so, what changed in the hardware or software setup? You might also tell us what software and VMs you are using.
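For anyone reading along: the point of troubleshooting mode is that the normal syslog lives in RAM and is lost on a hard crash, so a copy has to be snapshotted somewhere persistent. This is only a rough sketch of that idea, not the plugin's actual code; on a real unRAID box the paths would be /var/log/syslog and a folder under /boot on the flash drive, but here temp paths are used so the sketch is self-contained:

```shell
# Illustrative only: snapshot the live syslog to persistent storage so it
# survives a crash. Real unRAID paths would be /var/log/syslog -> /boot/... ;
# temp stand-ins are used here so this runs anywhere.
SYSLOG=$(mktemp)            # stand-in for /var/log/syslog
DEST=$(mktemp -d)/logs      # stand-in for a folder on the flash drive
echo "Jul 15 13:32:22 Tower kernel: example entry" > "$SYSLOG"
mkdir -p "$DEST"
cp "$SYSLOG" "$DEST/syslog-snapshot.txt"
ls "$DEST"
```

Run something like this from cron every few minutes and the last snapshot before the crash is what you attach to the thread.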
greennick Posted July 15, 2018 Author Share Posted July 15, 2018 Thanks @Frank1940 I'll do that and report back. I wasn't even running anything on it today and it still crashed. I usually run Sonarr, Radarr, SABnzbd, Plex, and maybe Deluge. The server worked fine for months with no issues and I made no hardware changes, but I got all the above working 3 weeks ago and downloaded a lot of files. It seems to have started crashing regularly around 10 days ago.
ashman70 Posted July 15, 2018 Share Posted July 15, 2018 Did you update unRAID to a newer version by chance?
Frank1940 Posted July 15, 2018 Share Posted July 15, 2018 Did you add more memory about the time this all started happening? EDIT: See below Jul 15 13:32:22 Tower kernel: mce: [Hardware Error]: Machine check events logged Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: CPU 6: Machine Check Event: 0 Bank 7: 8c00004000010091 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: TSC 21390b306b4 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: ADDR 84d5030c0 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: MISC 14016b086 Jul 15 13:32:22 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1531632742 SOCKET 1 APIC 20 Jul 15 13:32:22 Tower kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x84d503 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:8 rank:0)
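The useful part of that EDAC line is the DIMM locator, which pinpoints the physical module: CPU_SrcID#1 is the second socket, Ha#0 is home agent 0, Chan#3 is memory channel 3, DIMM#0 is the first slot on that channel. A quick way to pull the locator out of a syslog (shown here against a copy of the line above, since this is just a sketch):

```shell
# Extract the EDAC DIMM locator from a syslog line. The locator names the
# physical module: socket (CPU_SrcID), home agent (Ha), channel (Chan), slot (DIMM).
LINE='Jul 15 13:32:22 Tower kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 rank:0)'
LOCATOR=$(echo "$LINE" | grep -o 'CPU_SrcID#[0-9]*_Ha#[0-9]*_Chan#[0-9]*_DIMM#[0-9]*')
echo "$LOCATOR"   # prints: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
```

Cross-reference that locator against the board manual's DIMM slot map to find which stick to reseat or swap.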
greennick Posted July 16, 2018 Author Share Posted July 16, 2018 21 hours ago, Frank1940 said: Did you add more memory about the time this all started happening? It's had the same memory since I built it. I will roll back the version first, then check the memory if the problem persists.
greennick Posted July 16, 2018 Author Share Posted July 16, 2018 22 hours ago, ashman70 said: Did you update unRAID to a newer version by chance? You might be on to something. I downloaded the upgrade to 6.5.3 but didn't reboot; however, it seems to have switched to that version when the server crashed. I'll roll back and report back on stability. Thanks.
ashman70 Posted July 16, 2018 Share Posted July 16, 2018 Yeah, I am not trying to suggest there is anything wrong with unRAID; however, I have personally found that at least one newer build behaved erratically on one of my servers, so I rolled back to the previous version that was stable for me. When I have time I will look into it some more, but I simply couldn't afford to have a server not functioning after an update and I didn't have time to investigate it further.
JorgeB Posted July 16, 2018 Share Posted July 16, 2018 14 minutes ago, greennick said: You might be on to something, I downloaded the upgrade to 6.5.3, but didn't reboot it, however it seems to have switched to that when the server crashed. That means the crash happened with the old version, as the new one would only be used after a reboot.
greennick Posted July 17, 2018 Author Share Posted July 17, 2018 9 hours ago, johnnie.black said: That means the crash happened with the old version, as the new one would only be used after a reboot. I know that, but still trying to eliminate possible causes of the continued crashes, figured it can't hurt! It's been online for over 10 hours now, which is better than 6.5.3 was giving me. Fingers crossed now that I've fired up all the Dockers and started some downloads.
greennick Posted July 18, 2018 Author Share Posted July 18, 2018 It was more stable yesterday, but still crashed overnight after a fair amount of downloading. I pulled the syslog again and it looks like my memory errors popped up again (attached). So I'll try to work out how to run memtest and report back. syslog 2018 07 18
Frank1940 Posted July 18, 2018 Share Posted July 18, 2018 You might also want to check that the installed RAM is actually recommended by the motherboard manufacturer. Some motherboards are picky about the RAM used. This issue often raises its ugly head when the number of RAM modules increases (probably because of the loading on the address buses). Google can be your friend for this type of research. Running memtest is one of the options in the boot menu for unRAID...
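Alongside memtest, on a running Linux box the kernel's EDAC driver keeps per-memory-controller counters of correctable errors under /sys/devices/system/edac/mc/mc*/ce_count, so you can watch the counts grow without waiting for a crash. A sketch of summing those counters follows; since a real EDAC sysfs tree won't exist everywhere, temp files stand in for it here, seeded with the one correctable error from the syslog:

```shell
# On a live system the counters are at /sys/devices/system/edac/mc/mc*/ce_count.
# Simulated with temp files here so the sketch runs anywhere.
EDAC=$(mktemp -d)                 # stand-in for /sys/devices/system/edac/mc
mkdir -p "$EDAC/mc0" "$EDAC/mc1"
echo 0 > "$EDAC/mc0/ce_count"     # memory controller 0: clean
echo 1 > "$EDAC/mc1/ce_count"     # memory controller 1: the one CE logged above
TOTAL=$(awk '{sum += $1} END {print sum}' "$EDAC"/mc*/ce_count)
echo "total correctable errors: $TOTAL"
```

A count that keeps climbing on the same controller is a strong hint about which bank of DIMMs to swap, even when memtest reports nothing.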
greennick Posted July 18, 2018 Author Share Posted July 18, 2018 4 minutes ago, Frank1940 said: You might also want to check to see that the installed RAM is actually recommended by the MB manufacturer... I am hoping it isn't a RAM conflict; it is server RAM to spec in a server board, which is as good as I could do. There isn't much RAM recommended by Supermicro, as they want you to buy proprietary modules or their small approved lines from key suppliers. I worked out how to run memtest earlier; no errors so far, 43% in. TBH, I never even see the boot screen as I mainly run it headless! At the end of this, I don't know whether I'd rather find a fault or not; at least if there is one, I'll know the issue. Thanks
Frank1940 Posted July 18, 2018 Share Posted July 18, 2018 Be sure to run it for at least 24 hours...
greennick Posted July 19, 2018 Author Share Posted July 19, 2018 No errors. Hmmm, will see how it runs overnight.
JorgeB Posted July 19, 2018 Share Posted July 19, 2018 Memtest won't work on systems with ECC: if there are single-bit errors, they will be corrected and go undetected. You can try checking the board's SEL (system event log); there might be more info there.
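On a Supermicro board like the X9DRD-7LN4F, the SEL lives in the BMC and can be read with ipmitool sel elist (run locally, or remotely with ipmitool -I lanplus -H <bmc-ip> -U <user> sel elist) or via the IPMI web interface. A sketch of filtering it down to memory events; the SEL entries below are made up for illustration, not from this server:

```shell
# Hypothetical SEL output; on the real board you would capture it with
# 'ipmitool sel elist'. The two entries here are invented examples.
SEL='   1 | 07/15/2018 | 13:32:22 | Memory #0x53 | Correctable ECC | Asserted
   2 | 07/15/2018 | 14:01:09 | Power Supply #0x51 | Presence detected | Asserted'

# Keep only the memory/ECC events and count them.
printf '%s\n' "$SEL" | grep -Ei 'ecc|memory'
MEMEVENTS=$(printf '%s\n' "$SEL" | grep -Eci 'ecc|memory')
echo "memory-related SEL entries: $MEMEVENTS"
```

Unlike memtest, the SEL records corrected ECC events as they happen, so it catches exactly the single-bit errors that memtest on an ECC system hides.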
greennick Posted July 22, 2018 Author Share Posted July 22, 2018 On 7/19/2018 at 10:00 PM, johnnie.black said: Memtest won't work on systems with ECC... Seems like you're right
pwm Posted July 22, 2018 Share Posted July 22, 2018 On 7/19/2018 at 4:00 PM, johnnie.black said: Memtest won't work on systems with ECC... The good thing about Memtest is that it will still stress-test the memory, so running Memtest before looking at the event log can help locate bad memory.
JorgeB Posted July 22, 2018 Share Posted July 22, 2018 And don't forget that single-bit errors are corrected, but if a multi-bit error occurs the server will halt/crash to prevent corruption. I'm not sure whether those would be logged in the SEL, but I would expect so; in any case, that DIMM should be replaced.