Very strange and frustrating problem...


I have read the last 15 pages of this forum and did not find an answer. I have struggled with this problem for more than a week. First off, I have an Unraid setup running very well as a media server with 5.0 beta 14.

 

I have built another Unraid server just for testing the new 5.0 final. I have 3 x 2TB drives (parity, disk1 and disk2) in this server. This is just stock 5.0 final. All drives were precleared and the system boots up fine. Both the management utility and web control work fine. Parity checks out perfectly.

 

The problem is that about half an hour after the parity check, or after adding files to the drives, I can no longer access the server. By that, I mean no web access, no telnet and no console. Nothing. So I suspected a hardware problem, but after running intensive diagnostics for 2-3 days I am sure that the hardware is OK. So I reloaded 5.0 final again: configuration valid, parity check with no errors, and again no access at all after about 20 minutes, even with keyboard, mouse and monitor attached. The screen is just blank, no signal.

 

I checked the ECS MB and enabled all the wakeup functions. Same results.

 

As mentioned above, my original Unraid media server is working fine. The problem is only with the test server and 5.0 final. When loaded with 5.0 beta 14, the test server runs perfectly. I have run the parity check over and over. I assigned the server a static IP so I can find it. That was not a problem: right after loading 5.0 final, I can access the server, transfer files and stream. I did install unmenu in the 5.0 final. It was working fine. And then after a while with no activity, everything stopped.

 

 

 

Link to comment

I had a similar issue a while back: the system went offline during normal use, and it took an external NIC to resolve it.  The NIC driver issue is supposed to be resolved, though.  I am running RC16 and am back to using the on-board NIC without issue.

 

I am simply noting this because I too became very frustrated hard resetting my production box, only to learn the issue was the NIC.

Link to comment

Just some clarification:

 

I assigned static IPs to both servers outside the DHCP range. Same subnet, different IP addresses. There are 2 repeater bridges running DD-WRT and an Asus RT-N66U as the main router. There was no IP or IRQ conflict.

 

The main server is the HP ProLiant in my signature. The test server has the ECS A885GM-A2 recommended by Raj in his various build prototypes. The LAN is a gigabit Realtek 8111DL controller, with 8 GB RAM and a DVD burner.

 

The test server crashed reproducibly (5 times) with 5.0 final, so this is not a random crash. On the other hand, with beta 14 there was no crash and no problem, and that was also repeated about 5 times.

 

I downloaded Patilan's syslog utility and will use it when I have some time. I did use beta 14 again on the test server, with absolutely no problem of any kind. Configuration valid. Parity no errors. All disks spinning and green, no read or write errors. No IP conflict (I wish there was one so I could fix it, lol).

 

Could it be that 5.0 final does not like me?  :D

Link to comment

@dgaschk, thank you so much for the diagnostic. The SATA cable has latches so it was not loose, probably bad. I have replaced it with a brand new SATA cable also with latches. The Seagate is my disk2. I looked at the syslog after the crash. Noticed that the last few lines seemed to identify the problem but I am not sure how to interpret them:

 

"Sep 18 08:49:30 Tower2 kernel: mdcmd (16): spindown 0

Sep 18 08:49:30 Tower2 kernel: mdcmd (17): spindown 2

Sep 18 09:18:01 Tower2 kernel: mdcmd (18): spindown 1

Sep 18 09:18:03 Tower2 kernel: ata5: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen

Sep 18 09:18:03 Tower2 kernel: ata5: irq_stat 0x00400000, PHY RDY changed

Sep 18 09:18:03 Tower2 kernel: ata5: SError: { RecovComm Persist PHYRdyChg 10B8B }

Sep 18 09:18:03 Tower2 kernel: ata5: hard resetting link

Sep 18 09:18:03 Tower2 kernel: ata3: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen

Sep 18 09:18:03 Tower2 kernel: ata3: irq_stat 0x00400000, PHY RDY changed

Sep 18 09:18:03 Tower2 kernel: ata3: SError: { RecovComm Persist PHYRdyChg 10B8B }"

 

The problem occurred after spindown. I assumed that ata5 was the problem because PHY RDY changed and caused ata3 to freeze as well.

Essentially, I am clueless.

 

Anyway, I will report if replacement of the SATA cable will do the trick. Thanks again for your help.
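
For anyone else trying to read those lines: the `SErr 0x90202` value is a bit mask, and the `SError: { ... }` line is the kernel's own decoding of it. Below is a minimal Python sketch of that decoding. The bit-to-name table is an assumption based on my understanding of the SATA SError register layout and the short names the Linux libata driver prints, so check it against the kernel source before relying on it. The function name `decode_serror` is made up for illustration.

```python
# Bit positions in the SATA SError register mapped to the short names
# that libata prints. This table is an assumption, not copied from the thread.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


print(decode_serror(0x90202))
# ['RecovComm', 'Persist', 'PHYRdyChg', '10B8B']
```

The four names match the `SError: { RecovComm Persist PHYRdyChg 10B8B }` line in the log. Roughly, they say that the link's PHY ready state changed (the link dropped or renegotiated) and that there were communication and 10b-to-8b decoding errors on the wire. That points at the link between controller and drive rather than at the data on the disk.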

 

Link to comment

Wow, I was so sure that changing the sata cable would do the trick. But it crashed again. Here is the new syslog after the new crash.

 

UPDATE:

 

I did some more diagnostics. There is an extra SATA connector on the ECS A885GM-A2 (V1.1), so I moved the Seagate cable to that connector. 5.0 final boots up fine; again, config and parity are valid. So I hit spin down and, exactly as I expected, the system crashed again. Thanks to the keeplogs utility, I was able to recover the syslog:

 

Sep 19 09:25:52 Tower2 emhttp_event: svcs_restarted

Sep 19 09:25:57 Tower2 kernel: mdcmd (15): nocheck

Sep 19 09:25:57 Tower2 kernel: md: md_do_sync: got signal, exit...

Sep 19 09:25:57 Tower2 kernel: md: recovery thread sync completion status: -4

Sep 19 09:27:34 Tower2 emhttp: Spinning down all drives...

Sep 19 09:27:34 Tower2 kernel: mdcmd (16): spindown 0

Sep 19 09:27:35 Tower2 kernel: mdcmd (17): spindown 1

Sep 19 09:27:35 Tower2 kernel: mdcmd (18): spindown 2

Sep 19 09:27:36 Tower2 kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen

Sep 19 09:27:36 Tower2 kernel: ata2: irq_stat 0x00400000, PHY RDY changed

Sep 19 09:27:36 Tower2 kernel: ata2: SError: { RecovComm Persist PHYRdyChg 10B8B }

 

So to summarize: new SATA cable, new SATA connector, same problem. The system crashed when all drives were spun down. So I plugged in the old 5.0 beta 14 and went through the same boot up and spin down: no crash. From this, I can only reach the conclusion that 5.0 final is somehow not compatible with the ECS MB. I do not want to mess with my HP media server, so I will keep it on the beta Unraid.
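
To check the "crash follows spin down" pattern without reading the whole syslog by eye, something like the following Python sketch could be used. It is only an illustration: the function name, the 5-second window and the default year are assumptions, and the regular expressions are fitted to the log lines quoted in this thread rather than to any documented format. The number after `spindown` is reported as-is (it appears to be the array slot, which is also an assumption).

```python
import re
from datetime import datetime, timedelta

TS_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")
SPINDOWN_RE = re.compile(r"mdcmd \(\d+\): spindown (\d+)")
ATA_EXC_RE = re.compile(r"(ata\d+): exception ")


def ata_exceptions_after_spindown(lines, window_seconds=5, year=2013):
    """Return (port, spindown_number, seconds_after) for each ATA exception
    logged within window_seconds of the most recent spindown command."""
    window = timedelta(seconds=window_seconds)
    last_spindown = None  # (timestamp, number passed to spindown)
    hits = []
    for line in lines:
        ts_match = TS_RE.match(line)
        if not ts_match:
            continue
        # Syslog timestamps carry no year, so one is supplied explicitly.
        ts = datetime.strptime(f"{year} {ts_match.group(1)}", "%Y %b %d %H:%M:%S")
        spindown = SPINDOWN_RE.search(line)
        if spindown:
            last_spindown = (ts, int(spindown.group(1)))
            continue
        exc = ATA_EXC_RE.search(line)
        if exc and last_spindown and ts - last_spindown[0] <= window:
            delta = (ts - last_spindown[0]).total_seconds()
            hits.append((exc.group(1), last_spindown[1], delta))
    return hits


# Lines copied from the syslog excerpt in this post.
sample = [
    "Sep 19 09:27:34 Tower2 emhttp: Spinning down all drives...",
    "Sep 19 09:27:34 Tower2 kernel: mdcmd (16): spindown 0",
    "Sep 19 09:27:35 Tower2 kernel: mdcmd (17): spindown 1",
    "Sep 19 09:27:35 Tower2 kernel: mdcmd (18): spindown 2",
    "Sep 19 09:27:36 Tower2 kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen",
    "Sep 19 09:27:36 Tower2 kernel: ata2: irq_stat 0x00400000, PHY RDY changed",
    "Sep 19 09:27:36 Tower2 kernel: ata2: SError: { RecovComm Persist PHYRdyChg 10B8B }",
]
print(ata_exceptions_after_spindown(sample))
# [('ata2', 2, 1.0)]
```

Run against the excerpt, it reports the ata2 exception one second after the last spindown command, which is the same pattern as in the first crash log.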

 

syslog_2013-09-19_07.40.25.txt

Link to comment

Quote: "Wow, I was so sure that changing the sata cable would do the trick. But it crashed again. Here is the new syslog after the new crash."

 

It looks like a SATA controller problem.

 

Both ata3 and ata5 (the WD and the Seagate) froze.  What do they have in common?  They seem to be on the same controller.  They both started in 6Gbps mode.  The Hitachi (ata4) started in 3Gbps mode.
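
The link speeds mentioned here come from the kernel's "SATA link up" messages in the syslog. A rough Python sketch for pulling them out and grouping ports by negotiated speed is below. It assumes the usual libata wording, `ataN: SATA link up X Gbps`, and the function name is invented for this example. Treat it as a starting point, not as a tested tool.

```python
import re
from collections import defaultdict

# Assumed message shape: "ata3: SATA link up 6.0 Gbps (SStatus ... SControl ...)"
LINK_UP_RE = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")


def ports_by_link_speed(lines):
    """Group ATA ports by the last link speed (in Gbps) each one negotiated."""
    last_speed = {}
    for line in lines:
        match = LINK_UP_RE.search(line)
        if match:
            # A port can renegotiate after a reset; keep the most recent speed.
            last_speed[match.group(1)] = float(match.group(2))
    grouped = defaultdict(list)
    for port, gbps in last_speed.items():
        grouped[gbps].append(port)
    return {gbps: sorted(ports) for gbps, ports in grouped.items()}
```

Calling `ports_by_link_speed(open("syslog.txt"))` on a saved syslog (the file name is a placeholder) returns a dictionary keyed by speed, which makes it easy to see whether the ports that freeze are the ones that came up at 6.0 Gbps.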

 

That motherboard seems to have some embedded HP "Smart" Array SATA Controller.  What are the BIOS settings for that controller?  Make sure all "smartness" is disabled in BIOS, no RAID-mode, no fancy stuff.  The controller should be in JBOD mode, and SATA should be set to AHCI.

 

There's another thing in there, running in 1.5Gbps mode, some DVD drive (ata1).  You should unplug that cable altogether, as it is useless anyway -- there are no DVD drivers in unRAID.  Eliminate the possibility of this thing causing the crashes.

 

Also, I would strongly advise you to try the "mem=4095M" boot option, as discussed in this post:

http://lime-technology.com/forum/index.php?topic=29224.msg263714#msg263714

 

Also, run a memtest.  Make sure it passes without any errors.

 

Let us know how it goes.

 

 

Link to comment

Very perceptive, Patilan. I did install a DVD burner on ata1, since I had an extra burner lying around. I will remove it. The BIOS settings are OK, no fancy stuff: SATA was set to AHCI, no RAID at all. I will check again though. This is quite an education for me. I did memtest already. No errors. I just wonder why the beta did not have any problem if it is a SATA controller issue. Is there something in 5.0 final that triggers the crash?

 

Let me see if removing the DVD burner will do the trick. Thanks again for your keeplogs utility. I would never have gotten this far without it.

Link to comment

Quote: "So to summarize ... the system crashed when all drives were spun down."

 

I've been saying for years that unRAID's spin down/up is buggy.  (I use my own spindown script for that).  Just disable the spindown functionality in the webGui, and also disable the spinup groups.

 

If your crash is reproducible at will, then you should email Limetech with all the details.  Also, PM a moderator to move this to the "issues" board.

 

Quote: "I just wonder why the beta did not have any problem if it is a SATA controller issue. Is there something in 5.0 final that triggers the crash?"

Could be many things.  Could be the old kernel/driver didn't start the disks in 6Gbps mode, and the new kernel does, and that revealed some hidden bug in that controller.  Check with HP if there are any BIOS updates for this motherboard.

 

 

Link to comment

There are 5 SATA connectors at 6Gbps on the ECS A885GM-A2. I do not see any option to change that. I did remove the DVD burner and used the spin down, and again the system crashed. So the DVD burner had nothing to do with it. I think I will just give up at this point. The HP server uses an HP ProLiant motherboard for Intel.

 

Thanks again.

Link to comment

These kinds of problems can be difficult to sort out.  If you google "SError: { RecovComm Persist PHYRdyChg 10B8B }" you will see that, besides this forum, several other forums report the exact same problem.  The fact that it works with a previous s/w version may or may not point to a software issue. Granted, usually it does, but sometimes the issue is also present in the "working" rev and, because of timing, does not show up or shows up very rarely.

Link to comment

Quote: "I've been saying for years that unRAID's spin down/up is buggy."

Quote: "And you've been wrong for years.  Please don't spread misinformation."

Well, I documented the case back then, and provided a reproducible example.  I even attached an alternative spin daemon which wasn't causing a crash.  (But you deleted that post, along with my whole account, remember?)  So, that's not "misinformation".

 

Link to comment

Assuming that the SATA controller on the ECS (EliteGroup) A885GM-A2 (V1.1) is somehow at fault, what MB would you recommend for a trouble-free Unraid 5.0 final build? I understand that there is always the possibility of problems or glitches, but I really do want to upgrade to 5.0 final, as I know that Tom has been working on it for years. Frankly, I love 4.7. I had to upgrade because of the 3TB HD issue. All I ever wanted from Unraid was a reliable media server, although I realize that there are many who want security etc. My problem is that I am hesitant to commit to a beta version (which is serving me well on 2 servers) for the future, so I am willing to build a new server for the 5.0 final.

 

Link to comment