deaerator Posted December 9, 2017 (edited)
I was told through the Unraid GUI that I might have hardware problems.
tower-diagnostics-20171209-1044.zip
System:
Model: Custom
M/B: ASRock - EP2C602-4L/D16
CPU: Intel® Xeon® CPU E5-2670 0 @ 2.60GHz
HVM: Enabled
IOMMU: Enabled
Cache: 512 kB, 2048 kB, 20480 kB
Memory: 128 GB (max. installable capacity 192 GB)
Network: eth0: 1000 Mb/s, full duplex, mtu 1500; eth1: not connected; eth2: not connected; eth3: not connected
Kernel: Linux 4.9.30-unRAID x86_64
OpenSSL: 1.0.2k
In the log I noticed there were errors regarding RAM scrubbing?
Frank1940 Posted December 9, 2017
You might want to give the Gurus a bit more information about what hardware might have a problem and, if that is not available, a list of the hardware that you have on the system.
Squid Posted December 9, 2017
1 hour ago, deaerator said: "In the log I noticed there were errors regarding RAM scrubbing?"
Searching through the syslog, there is no mention of "scrub".
1 hour ago, deaerator said: "I was told through the Unraid GUI that I might have hardware problems."
Any notification would have been through Fix Common Problems, and there is no record of FCP finding anything other than docker updates available. What is this notification that you were talking about?
deaerator Posted December 9, 2017
I will re-post a log when it shows up again. Last time the whole server crashed and I couldn't retrieve the log.
Squid Posted December 9, 2017
1 hour ago, deaerator said: "I will re-post a log when it shows up again. Last time the whole server crashed and I couldn't retrieve the log."
How often does the crash happen? If it's, say, daily, then put FCP into troubleshooting mode.
deaerator Posted December 9, 2017
It happens every 2-3 days. It gets very frustrating; sometimes I feel like junking it and buying new hardware.
deaerator Posted December 11, 2017
A new log has been posted. FCP reports new hardware errors.
tower-diagnostics-20171211-0710.zip
Squid Posted December 11, 2017
You have a bad memory DIMM.
deaerator Posted December 11, 2017
Thank you. How hard is that going to be to find?
Squid Posted December 11, 2017
Since you've got ECC, there may be clues in the BIOS (some sort of logging?) as to which DIMM is bad. Beyond that, you can always pull them out one at a time and see what happens.
deaerator Posted December 11, 2017
Nothing in the logs. Thank you for the information. It will be a long day testing 16 sticks of RAM with memtest.
JorgeB Posted December 11, 2017
ECC errors on Supermicro boards are logged in the board's event log; I would assume it would be the same with ASRock.
Frank1940 Posted December 11, 2017
6 minutes ago, deaerator said: "It will be a long day testing 16 sticks of RAM with memtest."
With 16 sticks of RAM, you might want to make sure that your RAM is on the approved list. (Sometimes the control line drivers can't maintain the required waveform, which can cause problems as the stick count goes up.)
deaerator Posted December 11, 2017
15 minutes ago, Frank1940 said: "With 16 sticks of RAM, you might want to make sure that your RAM is on the approved list. (Sometimes the control line drivers can't maintain the required waveform, which can cause problems as the stick count goes up.)"
Where do I find this approved list?
tdallen Posted December 11, 2017
The motherboard manufacturer's website, under the support section for the specific board.
deaerator Posted December 11, 2017
Good to know, thank you.
deaerator Posted December 11, 2017 (edited)
Now it feels like a buyer-beware kind of deal. I purchased the motherboard and RAM as a combo and assumed that they would work together. It turns out the RAM is not on the approved QVL, but it worked in my system for over a year without any major problems until now.
Squid Posted December 11, 2017 (edited)
5 minutes ago, deaerator said: "but it worked in my system for over a year without any major problems until now."
Memory can (and does) go bad. No different than anything else. Not being on the QVL doesn't necessarily mean that the memory won't work with the mobo. It simply means that if it's on the list, it will work (if it's not bad).
Frank1940 Posted December 11, 2017
You could try pulling a couple of sticks (perhaps from each CPU bank) and see if that clears up the problem. If it is a problem with the line drivers/RAM, that may fix it. (I am not certain from what you wrote whether you had any indication of a problem BEFORE you got the notice from Fix Common Problems. If you did not, the issue may always have been there and you just were not aware of it. With ECC, a single-bit read error should be detected and a reread should result, which may return the proper data. The system may be a bit slower, but it probably won't crash and burn.)
deaerator Posted December 11, 2017
I decided to go the slow and steady way and test each stick individually.
Squid Posted December 11, 2017
Part of your problem, though, is that if you can't find this:
1 hour ago, johnnie.black said: "ECC errors on Supermicro boards are logged in the board's event log; I would assume it would be the same with ASRock."
then, if the errors are being corrected properly, a memtest will pass with flying colors and you will have done a ton of work with no results.
deaerator Posted December 11, 2017 (edited)
I have seen zero errors in the event logs regarding memory problems.
Squid Posted December 11, 2017
Then you need to talk to ASRock regarding this:
Dec 11 04:40:08 Tower root: Hardware event. This is not a software error.
Dec 11 04:40:08 Tower root: MCE 0
Dec 11 04:40:08 Tower root: CPU 8 BANK 8
Dec 11 04:40:08 Tower root: MISC 1221020002000e8c ADDR 1e3872c000
Dec 11 04:40:08 Tower root: TIME 1512963915 Sun Dec 10 23:45:15 2017
Dec 11 04:40:08 Tower root: MCG status:
Dec 11 04:40:08 Tower root: MCi status:
Dec 11 04:40:08 Tower root: Corrected error
Dec 11 04:40:08 Tower root: MCi_MISC register valid
Dec 11 04:40:08 Tower root: MCi_ADDR register valid
Dec 11 04:40:08 Tower root: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
Dec 11 04:40:08 Tower root: Transaction: Memory scrubbing error
Dec 11 04:40:08 Tower root: MemCtrl: Corrected patrol scrub error
Dec 11 04:40:08 Tower root: STATUS 8c000047000800c0 MCGSTATUS 0
Dec 11 04:40:08 Tower root: MCGCAP 1000c14 APICID 20 SOCKETID 1
Dec 11 04:40:08 Tower root: CPUID Vendor Intel Family 6 Model 45
Memory scrubs are when the system attempts to read and write every RAM location during idle periods to see whether the RAM is good or bad.
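For anyone wanting to check a machine-check record like the one above by hand, here is a minimal Python sketch that pulls the STATUS, ADDR and SOCKETID values out of the log lines and decodes the STATUS word. It is not how mcelog itself does it. The bit positions (VAL 63, UC 61, corrected error count in bits 52:38, memory-controller error code pattern 0000 0000 1MMM CCCC) are assumed from the Intel SDM description of IA32_MCi_STATUS, and the names `decode_status` and `extract_fields` are made up for illustration.

```python
import re

# Flag bit positions in IA32_MCi_STATUS (assumed from the Intel SDM layout).
STATUS_FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
                "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory-controller error code: 0000 0000 1MMM CCCC.
MEM_TRANSACTIONS = {0: "generic", 1: "read", 2: "write",
                    3: "address/command", 4: "scrub"}


def decode_status(status_hex):
    """Decode a hex STATUS value into flags and memory-controller fields."""
    status = int(status_hex, 16)
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    code = status & 0xFFFF  # architectural MCA error code, bits 15:0
    decoded = {
        "flags": flags,
        "corrected": flags["VAL"] and not flags["UC"],
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "transaction": None,
        "channel": None,
    }
    if code >> 7 == 1:  # matches the memory-controller pattern
        decoded["transaction"] = MEM_TRANSACTIONS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        decoded["channel"] = None if channel == 0xF else channel  # 1111 = unspecified
    return decoded


def extract_fields(lines):
    """Collect the STATUS, ADDR and SOCKETID values from mcelog-style lines."""
    fields = {}
    for line in lines:
        for key in ("STATUS", "ADDR", "SOCKETID"):
            # \b keeps "STATUS" from matching inside "MCGSTATUS".
            match = re.search(r"\b" + key + r" ([0-9a-fA-F]+)\b", line)
            if match:
                fields[key] = match.group(1)
    return fields


log = [
    "Dec 11 04:40:08 Tower root: MISC 1221020002000e8c ADDR 1e3872c000",
    "Dec 11 04:40:08 Tower root: STATUS 8c000047000800c0 MCGSTATUS 0",
    "Dec 11 04:40:08 Tower root: MCGCAP 1000c14 APICID 20 SOCKETID 1",
]
fields = extract_fields(log)
result = decode_status(fields["STATUS"])
print(fields["SOCKETID"], result["corrected"], result["transaction"], result["channel"])
```

For the STATUS value in this log, the sketch reports a corrected error, a scrub transaction, and channel 0 on socket 1, which agrees with the decoded text mcelog printed. That narrows the search to the DIMMs on one channel of the second CPU, though mapping a channel to a physical slot still depends on the board's manual.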
deaerator Posted December 11, 2017
Just fired off a support ticket.
deaerator Posted December 12, 2017 (edited)
The ECC RAM works like it should, and no errors showed up during the memtest. I put Tower back together with 32 GB of RAM and brought it back online; something got seriously borked overnight. The parity drives are missing, there are call trace errors, and the cache drive disappeared.
tower-diagnostics-20171212-0533.zip