Jomp Posted July 24, 2015

Some backstory: I was running 5.0.5 and had about 200 days of uptime when I noticed my server had dropped off the network during a file write. Telnet, http, all not working. I IPMI'd in and the screen was showing the typical Unraid login, but was totally unresponsive. I rebooted, ran a parity sync that showed no sync mismatches, and ran an md5 check of all my files; a week later it showed that everything hashed correctly. I figured it was an anomaly.

I went ahead and upgraded to 6.0.0 since the final had just come out. Everything was going well, and at about 30 days of uptime I decided to start converting drives from reiserfs to xfs, since reiserfs was doing the annoying "drives are almost full so I don't want to write large files to fill them totally up" thing. About halfway through converting my disks, again during a write, Unraid went unresponsive. Telnet, http, all not working. I IPMI'd in and the screen was again showing the typical Unraid login, but was totally unresponsive. I rebooted, ran a parity sync, got a bunch of sync errors, and now it's going to take another week to md5 hash everything. Annoying.

This is obviously becoming a worrisome trend. I ran memtest and everything was OK there. I run Unraid without any plugins or virtualization, totally stock. I'm currently running tail -f /var/log/syslog and waiting. Is all I can do wait for another freeze? Every crash takes a solid week to hash check everything, so I'd love some ideas to be a little bit more, uh, proactive. The common thread seems to be freezes during writes to the array. Any ideas? Thanks so much!

Specs:
CPU: Intel G3220
Motherboard: Supermicro X10SL7-F
RAM: 4GB ECC
Power Supply: Corsair RM650
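For anyone wanting to script the md5 check mentioned in the post above, here is a minimal sketch in Python. It is not the tool the poster used (the post doesn't say which one that was); the function names `md5_of_file`, `build_manifest` and `find_mismatches` are made up for illustration. The idea is to record a digest per file while the array is known good, then recompute and compare after a crash.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(root, manifest):
    """Return relative paths that are missing or whose digest no longer matches."""
    root = Path(root)
    bad = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or md5_of_file(target) != expected:
            bad.append(rel_path)
    return bad
```

Building the manifest per disk rather than for the whole array at once would let a post-crash check start with the disk that was being written to, though a full pass over every file will still take as long as reading every file takes.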
bonienl Posted July 24, 2015

When there is no response at all (no GUI, no telnet) the first suspect is the network connection. Do you have a fixed IP address or do you use DHCP? In the latter case it may lose or change its IP address over time. Did you check the status of the ethernet port itself when using IPMI? Port up and proper settings?
Jomp Posted July 24, 2015

Unraid has a fixed IP, as does IPMI, and they share the same port/cable. I'm able to connect to IPMI when the server freezes, so I don't think it's a network thing. The server runs headless, but I'm able to see Unraid's screen using the Java console redirect and it's frozen; my keyboard and the virtual keyboard don't do anything. Syslog attachment is incoming, but it obviously doesn't show what happened before the forced reboot.
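Since the syslog from before the forced reboot is gone, one option is to keep a running copy of it somewhere that survives a reboot. The sketch below is only an illustration of that idea in Python: it assumes the live log is lost on reboot (which is what the post above implies), the helper names `copy_new_bytes` and `follow` are hypothetical, and the destination path in the usage note is a placeholder. Each new chunk is appended and fsync'd so that as much as possible is on disk when the box hangs, though a hard lockup can still swallow the last few lines.

```python
import os
import time


def copy_new_bytes(source, dest, offset):
    """Append everything in source past offset to dest and force it to disk.

    Returns the new offset. If source is now shorter than offset
    (for example after log rotation), copying restarts from byte 0.
    """
    size = os.path.getsize(source)
    if size < offset:
        offset = 0
    if size == offset:
        return offset
    with open(source, "rb") as src, open(dest, "ab") as dst:
        src.seek(offset)
        data = src.read(size - offset)
        dst.write(data)
        dst.flush()
        os.fsync(dst.fileno())  # ask the OS to push the data to the device now
    return offset + len(data)


def follow(source, dest, interval=1.0, max_polls=None):
    """Poll source every interval seconds; max_polls=None means run until killed."""
    offset = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        offset = copy_new_bytes(source, dest, offset)
        polls += 1
        time.sleep(interval)
    return offset
```

It would be started with something like `follow("/var/log/syslog", "/path/on/persistent/storage/syslog-copy.txt")`, where the second path is a placeholder for wherever persistent storage is mounted. Polling once a second trades a little write traffic for a smaller window of lost lines.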
Jomp Posted July 25, 2015

Saw this in my Supermicro event log, perhaps the freezing is a RAM issue after all?

Event Log: 12 event entries

Event ID  Time Stamp           Sensor Name  Sensor Type  Description
1         2015/01/13 11:40:54  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
2         2015/01/13 11:40:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
3         2015/05/22 03:47:07  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
4         2015/05/22 03:47:07  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
5         2015/06/06 00:32:30  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
6         2015/06/06 00:32:30  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
7         2015/07/16 19:16:52  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
8         2015/07/16 19:16:52  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
9         2015/07/19 19:24:55  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
10        2015/07/19 19:24:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
11        2015/07/24 14:06:29  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
12        2015/07/24 14:06:29  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
Jomp Posted July 27, 2015

Ran memtest overnight. 17 passes, no errors. I think the next time it freezes I'll look in the log and see if there was an event. And so we wait, unless you guys can think of something else to do...
Jomp Posted July 31, 2015

Locked up hard again about 30 GB into a copy. So, it's happening more often now (only a few days of uptime this time). There is one more event in the Supermicro log, but it's from yesterday, so not from when it froze up.

Event Log: 14 event entries

Event ID  Time Stamp           Sensor Name  Sensor Type  Description
1         2015/01/13 11:40:54  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
2         2015/01/13 11:40:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
3         2015/05/22 03:47:07  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
4         2015/05/22 03:47:07  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
5         2015/06/06 00:32:30  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
6         2015/06/06 00:32:30  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
7         2015/07/16 19:16:52  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
8         2015/07/16 19:16:52  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
9         2015/07/19 19:24:55  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
10        2015/07/19 19:24:55  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
11        2015/07/24 14:06:29  OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
12        2015/07/24 14:06:29  OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
13        2015/07/30 1:30:44   OEM          Memory       Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
14        2015/07/30 1:30:44   OEM          Memory       Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted

I was running a tail in my browser and console, but it failed to capture anything unusual. I've attached it anyway. I think that points to a hardware problem, correct? I'm at a complete loss as to what to do. Basically since I've owned the server it's been 200 days uptime > freeze, 30 days uptime > freeze, 7 days uptime > freeze, and this last time 4 days uptime > freeze. So no issues until the last month and a half, and since then it's really had issues. Ouch.
Edit: I was just running memtest and another memory event appeared in the log. Does ECC RAM generate errors in memtest? I also thought ECC RAM was supposed to prevent crashes due to memory errors?
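To put a number on "it's happening more often", the ECC entries from the event log pasted in this thread can be turned into gaps between incident dates. This is just a quick Python sketch over the 14 pasted lines; `EVENT_LOG`, `parse_events`, `incident_dates` and `gaps_in_days` are names invented here, and the line format is assumed to be exactly what was pasted (ID, date, time, then the rest).

```python
import re
from datetime import datetime

# The 14 entries exactly as pasted from the event log in this thread.
EVENT_LOG = """\
1 2015/01/13 11:40:54 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
2 2015/01/13 11:40:55 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
3 2015/05/22 03:47:07 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
4 2015/05/22 03:47:07 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
5 2015/06/06 00:32:30 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
6 2015/06/06 00:32:30 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
7 2015/07/16 19:16:52 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
8 2015/07/16 19:16:52 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
9 2015/07/19 19:24:55 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
10 2015/07/19 19:24:55 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
11 2015/07/24 14:06:29 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
12 2015/07/24 14:06:29 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
13 2015/07/30 1:30:44 OEM Memory Correctable Memory ECC @ DIMMA1(CPU1) - Asserted
14 2015/07/30 1:30:44 OEM Memory Uncorrectable Memory ECC @ DIMMA1(CPU1) - Asserted
"""

# Event ID, date, time (hour may be one or two digits), then the remaining text.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d{4}/\d{2}/\d{2})\s+(\d{1,2}:\d{2}:\d{2})\s+(.*)$"
)


def parse_events(text):
    """Return (event_id, timestamp, description) for each line that matches."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{match.group(2)} {match.group(3)}", "%Y/%m/%d %H:%M:%S"
            )
            events.append((int(match.group(1)), stamp, match.group(4).strip()))
    return events


def incident_dates(events):
    """Collapse events to the distinct calendar dates on which they occurred."""
    return sorted({stamp.date() for _, stamp, _ in events})


def gaps_in_days(dates):
    """Days between each pair of consecutive incident dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


if __name__ == "__main__":
    print(gaps_in_days(incident_dates(parse_events(EVENT_LOG))))
```

On the pasted data this prints `[129, 15, 40, 3, 5, 6]`: the gaps between ECC incidents go from months to a few days, which lines up with the freezes getting closer together. That is a correlation in one log, not proof of the cause, but every incident is on the same DIMM slot (DIMMA1) and every correctable entry is paired with an uncorrectable one.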
Jomp Posted August 12, 2015

Updating in case somebody has/searches for this problem. I replaced the RAM and everything seems to be running smoothly again. So even though it was ECC RAM, the RAM going bad was still causing crashes.
cphillgraphics Posted January 9, 2018

Hey! I know you posted this years ago, but I just ran into a very similar issue and your post helped me resolve it, so thank you!
Archived
This topic is now archived and is closed to further replies.