Reetik2000

  1. Title says it all. The BMC firmware used to be on 1.6-something. I tried upgrading to 1.86 in an attempt to fix another issue and ended up corrupting the BMC. I downloaded the firmware straight from the product page for the C2100. After some research, I found out there are multiple versions of the C2100 board. The model number I have for this server is FS12-TY, which from what I can tell is a C2100. The motherboard part number is 0PN94W. No idea what to do at this point. I have to flip the CMOS jumper while it's off, then plug in the server and flip the jumper back to default just to get it to actually turn on and POST. Otherwise, it gets power but the power button doesn't do anything. After a long time of just sitting there looking like it's not going to POST, it finally POSTs. In the BIOS, it shows something like WAITING FOR OEM for all the details. The BMC version shows as 0.0 and the BMC also shows as not responding. My question is, is this recoverable? Does anyone happen to have a working 0PN94W motherboard lying around and could back up their BMC firmware and send it to me?
  2. So it's been a couple weeks. I ended up getting a new set of RAM. I found a really good deal and I couldn't stop myself. I installed it in the same way by adding one stick to each CPU every 3 days until I got the whole set installed. I now have 32GB instead of 16GB and it's finally stable. Thanks for the help everyone!
  3. Thanks @testdasi, I've already tried swapping the CPUs and it still showed up in node 2. So far, it hasn't crashed yet with the two sticks of RAM. I was starting to lose hope that it was a RAM issue. Thanks for confirming.
  4. No overclocks, and all the BIOS settings are at defaults except for obvious things like the boot sequence. I checked the logs and they show two errors since I installed this second board. The first one shows the ECC error in DIMM 2A (CPU2), while the second one shows it in DIMM 1A (CPU2), which is strange since every time I've checked, it's always been node 2 like in the screenshot. I'll try to find a way to boot up my last board and pull the logs from there, since it's probably got at least 7 errors in its logs. Right now, I'm running it with only 2 sticks of RAM, one for each CPU, and just waiting for it to crash again. If it does, I'm gonna try it with a different 2 sticks. If it doesn't crash within the next few days, I'm planning to add 2 more sticks of RAM. Basically gonna repeat this process until I either have all the RAM back in there, which would rule out the RAM being the issue, or I'm left with a couple of sticks that seem to have caused the issue. I doubt it's the power supply but I haven't tested it yet. I don't have another one I could use. Are there any other ways I could go about testing it? I'll check the logs and post here next time it crashes.
  5. Naw, I changed the motherboard after this error started happening, and I tested with different RAM and with the CPUs swapped. I contacted the seller and they had an extra board that they sent over. Both boards are having the exact same error. I've already checked compatibility. I had to scroll and click through endless docs pages for it, but all the parts are on the compatibility list on the SuperMicro site.
  6. Hello, I've been having some trouble with these random ECC errors. It's always reporting the ECC error on Node 2 with all the other nodes showing an HT Sync error.
     Specs
     Motherboard: SuperMicro H8DGU-F
     CPU: Dual AMD Opteron 6278s
     RAM: 8x Micron 2GB DDR3 1333MHz ECC RAM (MT18JSF25672PDZ)
     Riser Card: SuperMicro RSC-R2UU-UA3E8+
     RAID Controller: 2x Dell PERC H310s
     Storage:
       12x WD RE4s (2 as parity drives)
       Samsung SSD 850 EVO 500GB (Primary Cache)
       SAMSUNG SSD SM841N 2.5 7mm 512GB (Secondary Cache)
     A picture of the error is attached. Memtest came back with no errors. I've updated the BIOS, shuffled the RAM around, swapped the CPU slots, and even swapped the motherboard. It still shows the exact same error: Node 2 with the uncorrectable ECC error and the other nodes with HT Sync errors. I'm currently testing the RAM sticks in pairs. The error shows up anywhere from a day to more than a week in, so it's gonna take a while. At this point, the only things left to test are the riser card, the RAID controllers, and all the storage. I don't see how any of those can be the cause of an ECC error but I don't know what else to do.
  7. Yeah, everything seems fine. I do get these random shutdowns every few weeks, it seems, sometime on Saturday nights or Sunday mornings. They seem to be graceful according to the logs, with all the drives spinning down, but I can't seem to figure out what's going on. I upgraded the CPUs a while back and it does seem a bit more stable, but this morning I noticed the first shutdown since the upgrade.
  8. I’m getting these MCE errors.
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: Machine check events logged
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: f20000c000020c0f
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: TSC 0
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: PROCESSOR 2:600f12 TIME 1581576322 SOCKET 0 APIC 0 microcode 600063e
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: Machine check events logged
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 4: f200009000020c0f
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: TSC 0
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: PROCESSOR 2:600f12 TIME 1581576322 SOCKET 0 APIC 8 microcode 600063e
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0 Bank 4: f200004000020c0f
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: TSC 0
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: PROCESSOR 2:600f12 TIME 1581576322 SOCKET 1 APIC 20 microcode 600063e
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: CPU 24: Machine Check: 0 Bank 4: f67720006c080813
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 3a13efc70
     Feb 13 00:46:08 RR-Tower kernel: mce: [Hardware Error]: PROCESSOR 2:600f12 TIME 1581576322 SOCKET 1 APIC 28 microcode 600063e
     My specs are below:
     Motherboard: Tyan S8236WGM3NR
     CPU: Two AMD Opteron 6278s
     RAM: 22 GB of ECC DDR3
     My BIOS is up to date. I’ve tried loading the failsafe and optimal default settings in the BIOS and nothing changes. I’m not sure what else to try. I’ve attached the full syslog. syslog.txt
  9. It's gonna be weird for labeling, since the default eth1 and eth2 ports are next to each other while the default eth0 is off to the side, but I'll try it. Still don't know what to do about eth1 being disabled on boot tho.
  10. So my motherboard (Tyan S8236GM3NR) has two onboard NICs: the single-port NIC that is eth0, and the dual-port NIC that has eth1 and eth2. In unraid, I have eth1 and eth2 set to an 802.3ad bond with bridging enabled and everything else left at default. It seems to work fine; the eth2 interface section shows it's part of bond1. I have LACP enabled for those ports on my network switch (Dell PowerConnect 6224). Everything seems to be connected and configured correctly. However, I still can't get any sort of connection, and upon restart, eth1 and eth2 are shut down. eth0, on the other hand, seems to be perfectly up and running. I have to go into the terminal and run
      ifconfig eth1 up
      ifconfig eth2 up
      ifconfig eth0 down
      every time. The port up button is always greyed out, even when the array is stopped. I'm trying to make bond1 (the eth1/eth2 bond) the default connection. eth0 is going to remain unplugged. I also can't make changes to the routing table after I bring the ports back up.
  11. Same issue here. SMART data shows up in the disk's settings and even shows an accurate temperature, but on the dashboard the disk still shows an asterisk (*). My polling rate is set to 90.
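
For the machine-check lines quoted in post 8, a small script can pull out the CPU, bank, and status word from each line and decode the top flag bits of the status. This is only a rough sketch: the helper names (`decode_status`, `parse_mce_lines`) are made up for illustration, and it decodes only the generic x86 flag bits 63 to 57 (VAL, OVER, UC, EN, MISCV, ADDRV, PCC). The lower bits are model-specific and are not interpreted here.

```python
import re

# Generic flag bits in the upper part of an x86 machine-check status word.
# Only these seven bits are decoded; everything below bit 57 is ignored.
STATUS_FLAGS = {
    63: "VAL",    # status register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Matches lines such as "... CPU 0: Machine Check: 0 Bank 4: f20000c000020c0f"
LINE_RE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})"
)


def decode_status(status_hex):
    """Return the names of the flag bits set in a 64-bit status word."""
    value = int(status_hex, 16)
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (value >> bit) & 1
    ]


def parse_mce_lines(text):
    """Return (cpu, bank, status, flags) for every machine-check line found."""
    return [
        (int(cpu), int(bank), status.lower(), decode_status(status))
        for cpu, bank, status in LINE_RE.findall(text)
    ]


if __name__ == "__main__":
    sample = (
        "kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: f20000c000020c0f\n"
        "kernel: mce: [Hardware Error]: CPU 24: Machine Check: 0 Bank 4: f67720006c080813\n"
    )
    for cpu, bank, status, flags in parse_mce_lines(sample):
        print(f"CPU {cpu} bank {bank} status {status}: {' '.join(flags)}")
```

Run against the lines from post 8, only the CPU 24 entry has ADDRV set, which matches it being the only entry logged with an ADDR value.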
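
For the eth1/eth2 problem in post 10, it can help to record which interfaces are actually down right after boot, before running the ifconfig commands by hand. The sketch below reads the `operstate` file that Linux exposes under `/sys/class/net/<interface>/`. The function names are hypothetical, the interface names are the ones from the post, and the `sysfs_root` parameter exists only so the code can be pointed at a test directory. It assumes a Linux system with sysfs mounted in the usual place.

```python
from pathlib import Path


def link_states(interfaces, sysfs_root="/sys/class/net"):
    """Map each interface name to the contents of its operstate file.

    Interfaces without a readable operstate file are reported as "missing".
    """
    states = {}
    for name in interfaces:
        path = Path(sysfs_root) / name / "operstate"
        try:
            states[name] = path.read_text().strip()
        except OSError:
            states[name] = "missing"
    return states


def interfaces_not_up(states):
    """Return the sorted names of interfaces whose state is anything but "up"."""
    return sorted(name for name, state in states.items() if state != "up")
```

Calling `interfaces_not_up(link_states(["eth0", "eth1", "eth2", "bond1"]))` right after a reboot would list the interfaces that still need to be brought up. It only reports state and does not change any configuration.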