[SOLVED] unRAID network blowing up



Does the system run a parity check right after a rebuild completes? I ask because the rebuild finished and the main page shows parity is OK, last checked this morning at roughly the same time the rebuild completed. I would have thought a check would run a little longer than that, unless of course it just counts the rebuild as a check.

To be safe, I am running a manual check, but I still wanted to ask because this was not what I was expecting. Not a huge deal, just one of those things that got me wondering.

Link to comment

The processes of reconstructing a data drive and initially calculating parity are absolutely identical, other than which disk is written and which set of disks is read. They apparently share the same "statistic" counter, so it is zeroed.

Joe L.
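
To illustrate why a rebuild and an initial parity sync can be the same process, here is a minimal sketch assuming plain single-disk XOR parity. The tiny four-byte "disks" and the helper name `xor_blocks` are made up for the example and are not taken from unRAID's code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical array: three tiny "data disks" of four bytes each.
data_disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# Initial parity sync: read every data disk, write the parity disk.
parity = xor_blocks(data_disks)

# Rebuild of data disk 1: read parity plus the surviving data disks,
# write the replacement disk. Same loop, different target disk.
rebuilt_disk1 = xor_blocks([data_disks[0], data_disks[2], parity])

print(rebuilt_disk1 == data_disks[1])  # True
```

Under that assumption the only thing that changes between the two operations is which disk receives the result, which fits the description above.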

Link to comment

OK, thanks Joe.

In the meantime, I started another parity check, and it shows 3 errors at 5%. I assume the check will correct these as it goes, or will I need to run another one once it completes to make sure everything is OK?

I just want to ensure data integrity before I swap the motherboard out, get rid of the HPA on my 2 drives, and then prepare for 4.7. I can't get there without removing the HPA, as I learned when I first tried upgrading. Slowly I will get there and be back online, and hopefully once this is all cleared up I can return to troubleshooting the network issue that started all this.

Link to comment

OK, so the parity check finished with just the 3 errors. I see a new check box (at least I think it is new) below spin down that says "Sync filesystem first". Do I need that to fix the parity errors, or does the parity check do that? I want to start the next check to make sure the file system is safe before swapping boards, but I don't want to do this wrong, as parity is one of the most important things in the unRAID system.
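
As a rough illustration of the question, here is a sketch of what a correcting parity check does in principle, again assuming simple XOR parity. Whether the check in unRAID writes corrections is not confirmed anywhere in this thread, and `parity_check` and its `correct` flag are hypothetical names.

```python
def parity_check(data_disks, parity, correct=True):
    """Count positions where stored parity disagrees with the data disks.

    With correct=True the parity buffer is fixed in place as errors are found.
    """
    errors = 0
    for pos in range(len(parity)):
        expected = 0
        for disk in data_disks:
            expected ^= disk[pos]
        if parity[pos] != expected:
            errors += 1
            if correct:
                parity[pos] = expected
    return errors


data_disks = [bytes([1, 2, 3, 4]), bytes([16, 32, 48, 64])]
# Position 2 is deliberately wrong: 3 ^ 48 is 51, not 0.
parity = bytearray([17, 34, 0, 68])

first_pass = parity_check(data_disks, parity, correct=True)
second_pass = parity_check(data_disks, parity, correct=True)
print(first_pass, second_pass)  # 1 0
```

If the check is of the correcting kind, a second pass that reports zero errors is the confirmation that the first pass repaired parity. If it only reports, the count would stay the same on the second pass.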

Link to comment
  • 2 weeks later...

OK, finally I have completed a bit of an overhaul.

I replaced the potentially failing drive, upgraded the motherboard BIOS to disable HPA, removed the HPA from 2 drives, upgraded to 4.7, and upgraded a 1 TB drive to 2 TB, and the system has been running stable. Finally it was time to plug the system back into the Trendnet switch to see if that is the source, so I plugged it back into the Trendnet and started a transfer.

After about 4 gigs transferred, the system locked up. So the issue was not related to the hardware failure or anything like that. Could this mean the Trendnet is broken? I have a PCH A-110 plugged in through the Trendnet and it seems to function fine.

So, to explain my configuration: my office PC connects to a D-Link switch, and from there I go to the Trendnet and the unRAID server. I just can't imagine the issue being the switch, because otherwise I would think it wouldn't work at all, but I am not sure. It also seems my server is good, because it functions when plugged into the D-Link switch.

Finally, I searched and found this thread: http://lime-technology.com/forum/index.php?topic=6893.0;wap2

but there never seems to be a solution, or his issue is completely different. I am not sure, but something is screwy. I guess the one test would be to plug both into the Trendnet and see what happens.

It is also crazy that the server actually locks up when this happens. I managed to run the tail command on the syslog, and the last thing on the screen is a time synchronization, so it reveals just about nothing.

Any thoughts?

Link to comment

Well, I can place the D-Link in the rack and attach to it, but I don't want that as a solution at the moment. That is my last resort. The Trendnet shows the link at gig speed: the light is green, and any non-gig device lights up orange.

Are we suspecting that the Trendnet is broken? I ask because it has a 5-year warranty, so it can easily be RMA'd.

The crazy thing to me, though, is why in the world a faulty switch would cause the entire unRAID system to lock up and stop responding. Even after plugging back into the D-Link, it is not accessible and nothing registers at the console. I have yet to hard reset the system in hopes of getting it back.

Link to comment

ata13=sdh and ata14=sdi both show errors in the syslog. You could remove both drives from the array and run initconfig. Disable parity as well for the test. See if the problem goes away. Then add the drives back one at a time to see when the problem returns. If the drives are not causing the problem, you can restore the array to its original condition, run initconfig, and rebuild parity.

EDIT: After the console becomes unresponsive when attached to the D-Link, does removing the network cable get the console to work?

Link to comment

I will see what it does with the cable unplugged.

I will also pull a new syslog, as the one attached is from when I was having issues with a drive. The mention of sdh and sdi tells me those are 2 newer drives and may be the ones attached to my 8-port SATA card. Is there anything special that should have been done when I added it? I know I just stuck it in and started using it.

Link to comment

One other thought I just had: could this be related to NIC speed settings? For some reason I want to say I may have things forced to gig. Recently I thought I heard that with the newer gig standards it is best to let it auto-negotiate. Any thoughts? Maybe the D-Link just handles it better than the Trendnet.

As another test, I just sent the same files that crashed the unRAID box to my WHS (home server) machine and it worked. That system auto-negotiates at gig and is connected to the Trendnet, so the data took the same path that locked up the unRAID box. So I am thinking it is something between the unRAID box and the Trendnet. I just hate the hard resets to get the system back.

I will reboot, post a new syslog, and also find the network settings.

Link to comment

The card is an AOC-SAT2-MV8.

I was looking for the NIC settings, but I cannot find them, so maybe they weren't changed, unless someone can point out a good place to look. Maybe they are in the kernel, outside of unRAID itself.

One thing I did find was talk about jumbo frames. Is it worth enabling them with the MTU set to 9000?
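
For a sense of scale on the jumbo frames question, here is a back-of-the-envelope sketch of how much of each frame is TCP payload at MTU 1500 versus 9000. It assumes standard Ethernet framing overhead and IPv4 plus TCP headers without options. The function name is made up for the example.

```python
# Per-frame bytes on the wire that are not part of the MTU:
# preamble + start delimiter (8), Ethernet header (14), FCS (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
# Bytes inside the MTU that are headers: IPv4 (20) + TCP (20), no options assumed.
IP_TCP_HEADERS = 20 + 20


def tcp_payload_efficiency(mtu):
    """Fraction of on-wire bytes that carry TCP payload for full-size frames."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETHERNET_OVERHEAD
    return payload / on_wire


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_payload_efficiency(mtu):.2%} of line rate is payload")
```

By this estimate the raw gain is only a few percent of line rate. The larger benefit people usually cite is fewer packets for the CPU to handle. Jumbo frames only work if every NIC and switch on the path accepts the larger frame size, so a mismatch can create new problems rather than fix a lockup.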

 

Link to comment

I found a different post talking about NIC settings and such, and it said to run some commands and post the results, so I figured it was worth doing before anyone asked. Although it was not the same issue, his end solution was a new switch, which I hope not to need...

Here are the outputs from the commands.

When on the D-Link, where things seem to function:

Tower login: root
Linux 2.6.32.9-unRAID.
root@Tower:~# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:24:1d:2c:a7:5f
          inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2220 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1615 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:257772 (251.7 KiB)  TX bytes:525824 (513.5 KiB)
          Interrupt:26 Base address:0x4000

root@Tower:~# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000033 (51)
        Link detected: yes
root@Tower:~# ping -c google.com
ping: bad number of packets to transmit.
root@Tower:~# ping -c 5 google.com
PING google.com (74.125.93.105) 56(84) bytes of data.
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=1 ttl=48 time=47.2 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=2 ttl=48 time=51.0 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=3 ttl=48 time=49.6 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=4 ttl=48 time=51.9 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=5 ttl=48 time=52.2 ms

--- google.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4042ms
rtt min/avg/max/mdev = 47.291/50.454/52.276/1.824 ms
root@Tower:~#

 

Now when plugged into the Trendnet switch:

root@Tower:~# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:24:1d:2c:a7:5f
          inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3037 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2129 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:338384 (330.4 KiB)  TX bytes:585960 (572.2 KiB)
          Interrupt:26 Base address:0x4000

root@Tower:~# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000033 (51)
        Link detected: yes
root@Tower:~# ping -c 5 google.com
PING google.com (74.125.93.105) 56(84) bytes of data.
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=1 ttl=48 time=52.7 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=2 ttl=48 time=50.3 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=3 ttl=48 time=47.7 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=4 ttl=48 time=48.1 ms
64 bytes from qw-in-f105.1e100.net (74.125.93.105): icmp_seq=5 ttl=48 time=51.8 ms

--- google.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4035ms
rtt min/avg/max/mdev = 47.701/50.176/52.759/2.000 ms
root@Tower:~#
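
Comparing two `ethtool` captures by eye is error-prone, so here is a small sketch that parses the text into a dictionary and reports any setting that differs. The function names are made up for the example, and the sample strings are abbreviated from the values posted in this thread.

```python
def parse_ethtool(text):
    """Parse `ethtool <iface>` text into a dict of setting name -> value."""
    settings = {}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Settings for"):
            continue
        if ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            settings[key] = value.strip()
        elif key is not None:
            # Continuation line, e.g. the 2nd and 3rd rows of link modes.
            settings[key] += " " + line
    return settings


def diff_settings(first, second):
    """Return {setting: (first_value, second_value)} for every mismatch."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n))
            for n in names if first.get(n) != second.get(n)}


# Abbreviated capture; the values are the ones posted in the thread.
ON_DLINK = """Settings for eth0:
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Link detected: yes
"""
ON_TRENDNET = ON_DLINK  # the two posted captures list identical values

print(diff_settings(parse_ethtool(ON_DLINK), parse_ethtool(ON_TRENDNET)))  # {}
```

The empty result matches what the two posted captures show: the link negotiates to 1000Mb/s full duplex with auto-negotiation on behind either switch, so a forced speed setting does not look like the difference between them.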

 

 

 

Link to comment

I just got a Trendnet 8-port Gig-E switch and I'm using it with unRAID on a Biostar MB. It seems to be working. I will be copying a DVD (7+ GB) to the server in about an hour, and I'll let you know how it goes.

You've tried 2 MBs that don't work with the Trendnet, right? Your Trendnet is broken; RMA it.

Link to comment

Oh hell, here is a good one.

I tore everything apart tonight and started the network over. To test the Trendnet, I plugged everything into it and did some test copies, and they all worked.

Next, I added my D-Link green 8-port switch and transferred across the 2 switches: success.

Then I added the wireless router, hard-wired into it, and copied across the D-Link to the Trendnet: success.

Then I added my 10/100 to the Trendnet. This is where the internet comes into the network and where the Dish Network satellite box connects (no unRAID traffic is really sent through it, but I wanted to note that it is part of the setup). Again, everything was good.

Finally, I switched the network cable back to the original and things still worked.

Each time I copied a 6 gig DVD structure folder and a 13 gig BD video file. I honestly have no clue. All the tests that were failing before succeeded tonight, and I have no idea how. My only guess is that there is a faulty cable somewhere, and once the data traverses 2 switches to unRAID it breaks. unRAID must be way more sensitive than Windows, because a transfer that failed and crashed the unRAID box still completed to my home server. That is why it really makes no sense: I have changed nothing at this point. It is all back to the way it was and, knock on wood, as of this moment it all works.

Now, who is to say it won't fail tomorrow, but still. A transfer that broke the unRAID box worked to WHS. This same transfer didn't break the unRAID box when both were plugged into the D-Link switch. Then, after unhooking and re-hooking everything, the same transfer that once broke the unRAID server now works. Go figure. I will leave it for a bit and see what happens after a few days.
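
One way to test the faulty-cable guess is to watch the interface error counters before and after a large transfer. The sketch below pulls them out of `ifconfig` text in the older format shown earlier in this thread. The function names are made up for the example, and the sample lines are the counters posted while on the Trendnet.

```python
import re

# Matches lines such as "RX packets:3037 errors:0 dropped:0 overruns:0 frame:0".
COUNTER_LINE = re.compile(r"^\s*(RX|TX) packets:(\d+)\s+(.*)$")


def interface_counters(ifconfig_text):
    """Return {"RX": {...}, "TX": {...}} with integer counters per direction."""
    counters = {}
    for line in ifconfig_text.splitlines():
        match = COUNTER_LINE.match(line)
        if not match:
            continue
        direction, packets, rest = match.groups()
        fields = {"packets": int(packets)}
        for name, value in re.findall(r"(\w+):(\d+)", rest):
            fields[name] = int(value)
        counters[direction] = fields
    return counters


def has_nonzero_error_counters(counters):
    """True if any counter other than the packet totals is above zero."""
    return any(value
               for fields in counters.values()
               for name, value in fields.items()
               if name != "packets")


SAMPLE = """
          RX packets:3037 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2129 errors:0 dropped:0 overruns:0 carrier:0
"""

counters = interface_counters(SAMPLE)
print(counters["RX"]["packets"], has_nonzero_error_counters(counters))  # 3037 False
```

The posted counters cover only a few thousand packets, so they say little about what happens during a multi-gigabyte copy. Capturing the same output right after a long transfer would be more telling: rising `errors`, `frame`, or `carrier` values would point toward cabling or a port.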

Link to comment
