husky55 Posted September 17, 2013

I have read the last 15 pages of this forum and did not find an answer. I have struggled with this problem for more than a week.

First off, I have an Unraid setup running very well as a media server with 5.14 beta. I have built another Unraid server just for testing the new 5.0 final. I have 3 x 2TB drives (parity, disk1 and disk2) in this server. This is just stock 5.0 final. All drives were precleared and the system boots up fine. Both the management utility and web control work fine. The parity check comes out perfect.

The problem is that about half an hour after a parity check, or after adding files to the drives, I can no longer access the server. By that, I mean no web access, no telnet and no console. Nothing. So I suspected a hardware problem, but after doing intensive diagnostics for 2-3 days I am sure that the hardware is OK. So I reloaded 5.0 final again: configuration valid, parity check with no errors, and again no access at all, even with keyboard, mouse and monitor attached. The screen just goes blank, no signal, after about 20 minutes. I checked the ECS MB and enabled all the wakeup functions. Same results.

As mentioned above, my original Unraid media server is working fine. The problem is only with the test server and 5.0 final. When loaded with the 5.14 beta, the test server runs perfectly. I have run the parity check over and over.

I assigned the server a static IP so I can find it. That was not a problem: right after loading 5.0 final, I can access the server, transfer files and stream. I did install unmenu in the 5.0 final. It was working fine. And then, after a while with no activity, everything stopped.
whiteatom Posted September 17, 2013

Hook up a screen and tail the syslog. It's the only way to diagnose an unresponsive system. Could be anything from hardware to a simple IP conflict in your LAN.
smakovits Posted September 17, 2013

I had a similar issue a while back: during normal use the system would go offline, and it took an external NIC to resolve it. But the NIC driver issue is supposed to be resolved. I am running RC16 and am back to using the on-board NIC without issue. I am simply noting this because I too became very frustrated hard resetting my production box, only to learn the issue was the NIC.
husky55 (Author) Posted September 17, 2013

Just some clarification: I assigned static IPs to both servers, outside the DHCP range. Same subnet, different IP addresses. There are 2 repeater bridges with ddwrt and an Asus RT-N66U as the main router. There was no conflict of IP and/or IRQ.

The main server is the HP Proliant in my signature. The test server has the ECS A885GM-A2 recommended by Raj in his various build prototypes. LAN is a gigabit LAN controller (Realtek 8111DL), 8 MB RAM, DVD burner.

The test server crashed repeatably (5 times) with 5.0 final, so this is not a random crash. On the other hand, with beta 14 there was no crash and no problem, and that was also repeated about 5 times.

I downloaded Patilan's syslog utility and will use it when I have some time.

I did use the beta 14 again on the test server, with absolutely no problem of any kind. Configuration valid. Parity no errors. All disks spinning green, no read or write errors. No IP conflict (I wish there was one so I could fix it, lol).

Could it be that 5.0 final does not like me?
smakovits Posted September 18, 2013

OK, but it still sounds random. Can you crash it on purpose with a specific action, or is the server randomly unavailable after a period of inactivity? Just curious, because that's how it was for me. Sometimes 20 minutes, other times 2-3 days.
husky55 (Author) Posted September 18, 2013

So I have been waiting for the 5.0 final to crash this morning. So far it hasn't. I ran Patilan's keeplogs.sh, so when it crashes I will hopefully be able to upload the syslog. I really don't know why this is all happening. Attached is the syslog I just downloaded (not crashed yet), so I don't think it tells the why and the how.

Attached: syslog-2013-09-18.txt
husky55 (Author) Posted September 18, 2013

OK, my test server finally crashed and Patilan's keeplogs worked perfectly. Here is the syslog after the crash. Can this be fixed?

Attached: syslog_2013-09-18_07.47.27.txt
dgaschk Posted September 18, 2013

The Seagate has a loose or bad SATA cable. Try a new cable.
husky55 (Author) Posted September 19, 2013

@dgaschk, thank you so much for the diagnosis. The SATA cable has latches, so it was not loose; probably bad. I have replaced it with a brand new SATA cable, also with latches. The Seagate is my disk2.

I looked at the syslog after the crash. I noticed that the last few lines seem to identify the problem, but I am not sure how to interpret them:

Sep 18 08:49:30 Tower2 kernel: mdcmd (16): spindown 0
Sep 18 08:49:30 Tower2 kernel: mdcmd (17): spindown 2
Sep 18 09:18:01 Tower2 kernel: mdcmd (18): spindown 1
Sep 18 09:18:03 Tower2 kernel: ata5: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
Sep 18 09:18:03 Tower2 kernel: ata5: irq_stat 0x00400000, PHY RDY changed
Sep 18 09:18:03 Tower2 kernel: ata5: SError: { RecovComm Persist PHYRdyChg 10B8B }
Sep 18 09:18:03 Tower2 kernel: ata5: hard resetting link
Sep 18 09:18:03 Tower2 kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
Sep 18 09:18:03 Tower2 kernel: ata3: irq_stat 0x00400000, PHY RDY changed
Sep 18 09:18:03 Tower2 kernel: ata3: SError: { RecovComm Persist PHYRdyChg 10B8B }

The problem occurred after spindown. I assumed that ata5 was the problem because its PHY RDY changed, and that it caused ata3 to freeze as well. Essentially, I am clueless.

Anyway, I will report whether replacing the SATA cable does the trick. Thanks again for your help.
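For anyone trying to read those lines: the SErr value on the "exception" line is a bitmask, and the names on the "SError:" line are its set bits spelled out. Below is a minimal Python sketch of that decoding. The function and table names are made up for illustration, and the table is deliberately partial: it only covers the four flags that appear in the log above, with bit positions as I understand them from the SATA SError register layout used by the Linux libata driver. Any other bit is reported generically rather than guessed at.

```python
import re

# Partial table (assumption: bit positions as used by Linux libata).
# Only the four flags that appear in the log excerpt are named here.
SERR_FLAGS = {
    1: "RecovComm",    # communication error that the link recovered from
    9: "Persist",      # persistent communication or data integrity error
    16: "PHYRdyChg",   # PHY ready state changed (link dropped or came back)
    19: "10B8B",       # 10b-to-8b decode error on the wire
}


def decode_serr(value):
    """Return the names of the bits set in an SErr value, lowest bit first."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Bits outside the partial table are shown as "bitN".
            names.append(SERR_FLAGS.get(bit, f"bit{bit}"))
    return names


def serr_from_line(line):
    """Extract the SErr hex value from a kernel 'exception' line, or None."""
    match = re.search(r"SErr (0x[0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


line = ("Sep 18 09:18:03 Tower2 kernel: ata5: exception Emask 0x10 "
        "SAct 0x0 SErr 0x90202 action 0xe frozen")
print(decode_serr(serr_from_line(line)))
# ['RecovComm', 'Persist', 'PHYRdyChg', '10B8B']
```

The decoded list matches the kernel's own "SError: { RecovComm Persist PHYRdyChg 10B8B }" line, so both ports reported the same link-level event (the PHY dropping out of the ready state) rather than a read or write error from a disk.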
husky55 (Author) Posted September 19, 2013

Wow, I was so sure that changing the SATA cable would do the trick. But it crashed again. Here is the new syslog after the new crash.

UPDATE: I did some more diagnostics. There is an extra SATA connector on the ECS A885GM-A2 (V1.1), so I moved the Seagate cable to that connector. 5.0 final boots up fine and, again, config and parity are valid. So I hit spin down and, exactly as I expected, the system crashed again. Thanks to the keeplogs utility, I was able to recover the syslog:

Sep 19 09:25:52 Tower2 emhttp_event: svcs_restarted
Sep 19 09:25:57 Tower2 kernel: mdcmd (15): nocheck
Sep 19 09:25:57 Tower2 kernel: md: md_do_sync: got signal, exit...
Sep 19 09:25:57 Tower2 kernel: md: recovery thread sync completion status: -4
Sep 19 09:27:34 Tower2 emhttp: Spinning down all drives...
Sep 19 09:27:34 Tower2 kernel: mdcmd (16): spindown 0
Sep 19 09:27:35 Tower2 kernel: mdcmd (17): spindown 1
Sep 19 09:27:35 Tower2 kernel: mdcmd (18): spindown 2
Sep 19 09:27:36 Tower2 kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
Sep 19 09:27:36 Tower2 kernel: ata2: irq_stat 0x00400000, PHY RDY changed
Sep 19 09:27:36 Tower2 kernel: ata2: SError: { RecovComm Persist PHYRdyChg 10B8B }

So to summarize: new SATA cable, new SATA connector, same problem. The system crashed when "spin down all drives" occurred.

So I plugged in the old beta 5.14 and went through the same boot up and spin down: no crash. From this, I can only conclude that 5.0 final is somehow not compatible with the ECS MB. I do not want to mess with my HP media server, so I will keep it on the beta Unraid.

Attached: syslog_2013-09-19_07.40.25.txt
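To check the "crashes right after spin down" pattern across a whole saved syslog rather than by eye, something like the following Python sketch could be used. It is only an illustration: the function names and regular expressions are my own, they assume the line format shown in the excerpt above (timestamp, host, "kernel:", message), and the year is assumed to be 2013 because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Patterns assume the syslog format shown in the excerpt above.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: (.*)$")
SPINDOWN_RE = re.compile(r"mdcmd \(\d+\): spindown (\d+)")
EXCEPTION_RE = re.compile(r"(ata\d+): exception ")


def parse_time(stamp, year=2013):
    """Parse a syslog timestamp; the year is supplied because syslog omits it."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def exceptions_after_spindown(lines):
    """Return (port, seconds since the latest spindown) for each ATA exception."""
    last_spindown = None
    results = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a kernel line in the expected format
        when = parse_time(match.group(1))
        message = match.group(2)
        if SPINDOWN_RE.search(message):
            last_spindown = when
            continue
        exc = EXCEPTION_RE.search(message)
        if exc and last_spindown is not None:
            delay = int((when - last_spindown).total_seconds())
            results.append((exc.group(1), delay))
    return results


LOG = """\
Sep 19 09:27:34 Tower2 emhttp: Spinning down all drives...
Sep 19 09:27:34 Tower2 kernel: mdcmd (16): spindown 0
Sep 19 09:27:35 Tower2 kernel: mdcmd (17): spindown 1
Sep 19 09:27:35 Tower2 kernel: mdcmd (18): spindown 2
Sep 19 09:27:36 Tower2 kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
""".splitlines()

print(exceptions_after_spindown(LOG))
# [('ata2', 1)]
```

In this excerpt the ata2 exception arrives one second after the last spindown command, which fits the observation that the freeze is triggered by spin down rather than by idle time as such.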
Barziya Posted September 19, 2013

husky55 wrote: "Wow, I was so sure that changing the sata cable would do the trick. But it crashed again. Here is the new syslog after the new crash."

It looks like a SATA controller problem. Both ata3 and ata5 (the WD and the Seagate) froze. What do they have in common? They seem to be on the same controller. They both started in 6Gbps mode. The Hitachi (ata4) started in 3Gbps mode.

That motherboard seems to have some embedded HP "Smart" Array SATA Controller. What are the BIOS settings for that controller? Make sure all "smartness" is disabled in BIOS: no RAID mode, no fancy stuff. The controller should be in JBOD mode, and SATA should be set to AHCI.

There's another thing in there, running in 1.5Gbps mode: some DVD drive (ata1). You should unplug that cable altogether, as it is useless anyway -- there are no DVD drivers in unRAID. Eliminate the possibility of this thing causing the crashes.

Also, I would strongly advise you to try the "mem=4095M" boot option, as discussed in this post: http://lime-technology.com/forum/index.php?topic=29224.msg263714#msg263714

Also, run a memtest. Make sure it passes without any errors. Let us know how it goes.
husky55 (Author) Posted September 19, 2013

Very perceptive, Patilan. I did install a DVD burner on ata1; I probably had an extra burner. I will remove it. The BIOS settings are OK: no fancy stuff, SATA was set to AHCI, no RAID at all. I will check again though. This is quite an education for me.

Did memtest already. No errors.

I just wonder why the beta 5 did not have any problem if it was a SATA controller issue. Is there something in 5 final which triggers the crash?

Let me see if removing the DVD burner will do the trick. Thanks again for your keeplogs utility. I would never have gotten this far without it.
Barziya Posted September 19, 2013

husky55 wrote: "So to summarize ... system crashed when spin down all drive occurred."

I've been saying for years that unRAID's spin down/up is buggy. (I use my own spindown script for that.) Just disable the spindown functionality in the webGui, and also disable the spinup groups.

If your crash is reproducible at will, then you should email Limetech with all the details. Also, PM a moderator to move this to the "issues" board.

husky55 wrote: "Just wonder why the beta 5 did not have any problem if it was the sata controller issue. Is there something in 5 final which triggers the crash."

Could be many things. It could be that the old kernel/driver didn't start the disks in 6Gbps mode and the new kernel does, and that revealed some hidden bug in that controller. Check with HP if there are any BIOS updates for this motherboard.
husky55 (Author) Posted September 19, 2013

This is not an HP server. I have 2 servers. The HP is my main server and I have not touched it. The test server has the ECS MB, which has the latest firmware.
husky55 (Author) Posted September 19, 2013

There are 5 SATA connectors at 6Gbps on the ECS A885GM-A2. I do not see any option to change that. I did remove the DVD burner and used the spin down and, again, the system crashed. So the DVD burner had nothing to do with it. I think I will just give up at this point. The HP uses an HP Proliant MB for Intel. Thanks again.
limetech Posted September 19, 2013

Barziya wrote: "I've been saying for years that unRAID's spin down/up is buggy."

And you've been wrong for years. Please don't spread misinformation.
limetech Posted September 19, 2013

These kinds of problems can be difficult to sort out. If you google "SError: { RecovComm Persist PHYRdyChg 10B8B }" you will see that, besides this forum, several other forums report the exact same problem. The fact that it works with a previous s/w version may or may not point to a software issue. Granted, usually it does, but sometimes the issue is also in the "working" rev and, because of timing, does not show up or shows up very rarely.
Barziya Posted September 19, 2013

Barziya wrote: "I've been saying for years that unRAID's spin down/up is buggy."

limetech wrote: "And you've been wrong for years. Please don't spread misinformation."

Well, I documented the case back then, and provided a reproducible example. I even attached an alternative spin daemon which wasn't causing a crash. (But you deleted that post, along with my whole account, remember?) So, that's not "misinformation".
husky55 (Author) Posted September 20, 2013

Assuming that the ECS (EliteGroup) A885GM-A2 (V1.1) SATA controller is somehow at fault, what MB would you recommend for a trouble-free Unraid 5.0 final build? I understand that there is always the possibility of problems/glitches, but I really do want to upgrade to 5.0 final as I know that Tom has been working on it for years.

Frankly, I love the 4.7. I had to upgrade because of the 3TB HD issue. All I ever wanted from Unraid was a reliable media server, although I realize that there are many who want security etc. My problem is that I am hesitant to commit to a beta version (which is serving me well on 2 servers) for the future, so I am willing to build a new server for the 5.0 final.