Can not write to array - intermittently (SOLVED)



Hello,

 

I have 2 unRaid servers with 23 disks each. Both run unRaid 6.3.1 on Supermicro motherboards, though each uses a different model. While one server runs perfectly all of the time, the other gives me grief on a daily basis. I have searched the posts here for someone with the same problem, but I cannot find a situation identical to mine.

1. Sometimes the rogue server runs perfectly. I can read from it and write to it from my Windows 7 and Windows 10 machines with no issues.

2. Sometimes just a single disk will give me problems writing to it, though I don't seem to have any problems reading from it.

3. The web UI connects fine most of the time as "Tower1", but if it does not connect that way, then I simply put in the IP address (it is running a static IP) and I seem to be able to connect to it then.

4. Sometimes the server "refuses" any and all connections made in any way -  I can't connect via the web UI nor can I find it in my local LAN, nor can I access a mapped drive. If I just wait for 5 or 10 minutes, it will usually run fine, as if nothing was wrong.

5. Some disks will fill up only to a certain point, writing fine, and then they will refuse to accept any further files. I ALWAYS write directly to the disks, NEVER to the user shares.

6. I can get around problem #5 by writing the files to the cache disk, and then launching Midnight Commander in a Putty window. MC is always successful copying the files from the cache disk to the final destination array disk.

 

Like I said, I have 2 servers running identical versions of the software, setup identically, and Tower1 gives me a constant headache while Tower2 runs absolutely perfectly, so I don't understand what the problem is. Also, Tower1 runs fine and then starts refusing connections RANDOMLY whenever it wants to and nothing I can do will make it run fine again until IT decides it wants to behave - then it fixes itself until it launches its next tantrum. I have put up with this behavior for years with this machine, running lots of different version 5 software, and then version 6 software, so this is a problem that I don't believe is software dependent, though I could be wrong. Tower 2 has run perfectly regardless of which software version I have installed.

 

It is almost like Tower1 is a temperamental bitch with human qualities, acting whatever way she wants to act at the time.

 

I have attached my syslog for those who have a clue (I don't).

 

If there is any other information you want or need, I will try my best to provide it for you.

 

Thanks for listening!

OddwunnSyslog.txt


Edited by Oddwunn
problem solved
Link to comment

The following needs to be solved.

 

Do you have something else also named "Tower1"?

 

If you power off your Tower1 and wait a while - will your Windows machines be able to find another Tower1?

 

Does your DHCP server allow you to see log entries of the devices it has sent out IP numbers to?

Dec 31 02:34:25 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-73
Dec 31 02:34:25 Tower1 avahi-daemon[1741]: Registering new address record for 192.168.1.110 on eth0.IPv4.
Dec 31 02:34:26 Tower1 avahi-daemon[1741]: Server startup complete. Host name is Tower1-73.local. Local service cookie is 2216981387.
Dec 31 02:34:27 Tower1 avahi-daemon[1741]: Service "Tower1-73" (/services/ssh.service) successfully established.
Dec 31 02:34:27 Tower1 avahi-daemon[1741]: Service "Tower1-73" (/services/smb.service) successfully established.
Dec 31 02:34:27 Tower1 avahi-daemon[1741]: Service "Tower1-73" (/services/sftp-ssh.service) successfully established.
Dec 31 04:44:09 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-74
Dec 31 04:44:09 Tower1 avahi-daemon[1741]: Registering new address record for 192.168.1.110 on eth0.IPv4.
Dec 31 04:44:10 Tower1 avahi-daemon[1741]: Server startup complete. Host name is Tower1-74.local. Local service cookie is 2216981387.
Dec 31 04:44:11 Tower1 avahi-daemon[1741]: Service "Tower1-74" (/services/ssh.service) successfully established.
Dec 31 04:44:11 Tower1 avahi-daemon[1741]: Service "Tower1-74" (/services/smb.service) successfully established.
Dec 31 04:44:11 Tower1 avahi-daemon[1741]: Service "Tower1-74" (/services/sftp-ssh.service) successfully established.
Dec 31 07:00:13 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-75
Dec 31 07:00:13 Tower1 avahi-daemon[1741]: Registering new address record for 192.168.1.110 on eth0.IPv4.
Dec 31 07:00:14 Tower1 avahi-daemon[1741]: Server startup complete. Host name is Tower1-75.local. Local service cookie is 2216981387.
Dec 31 07:00:15 Tower1 avahi-daemon[1741]: Service "Tower1-75" (/services/ssh.service) successfully established.
Dec 31 07:00:15 Tower1 avahi-daemon[1741]: Service "Tower1-75" (/services/smb.service) successfully established.
Dec 31 07:00:15 Tower1 avahi-daemon[1741]: Service "Tower1-75" (/services/sftp-ssh.service) successfully established.
Dec 31 10:29:44 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-76
Dec 31 10:29:44 Tower1 avahi-daemon[1741]: Registering new address record for 192.168.1.110 on eth0.IPv4.
Dec 31 10:29:45 Tower1 avahi-daemon[1741]: Server startup complete. Host name is Tower1-76.local. Local service cookie is 2216981387.
Dec 31 10:29:46 Tower1 avahi-daemon[1741]: Service "Tower1-76" (/services/ssh.service) successfully established.
Dec 31 10:29:46 Tower1 avahi-daemon[1741]: Service "Tower1-76" (/services/smb.service) successfully established.
Dec 31 10:29:46 Tower1 avahi-daemon[1741]: Service "Tower1-76" (/services/sftp-ssh.service) successfully established.
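For anyone who wants to count these renames in a longer syslog, here is a minimal Python sketch. It assumes the log lines look exactly like the excerpt above (standard syslog prefix followed by the avahi-daemon message); the function name `find_name_conflicts` and the `sample` list are made up for illustration, and other avahi versions may word the message differently.

```python
import re

# Matches lines such as:
# "Dec 31 02:34:25 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-73"
CONFLICT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"avahi-daemon\[\d+\]:\s+Host name conflict, retrying with (?P<new_name>\S+)"
)


def find_name_conflicts(lines):
    """Return (timestamp, new_name) for every avahi host name conflict line."""
    hits = []
    for line in lines:
        match = CONFLICT_RE.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("new_name")))
    return hits


# Three lines copied from the excerpt above; only two are conflict lines.
sample = [
    "Dec 31 02:34:25 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-73",
    "Dec 31 02:34:25 Tower1 avahi-daemon[1741]: Registering new address record for 192.168.1.110 on eth0.IPv4.",
    "Dec 31 04:44:09 Tower1 avahi-daemon[1741]: Host name conflict, retrying with Tower1-74",
]

conflicts = find_name_conflicts(sample)
for stamp, new_name in conflicts:
    print(stamp, "->", new_name)
```

To run it against a real file, pass an open file object instead of `sample`. The excerpt above shows four renames in roughly eight hours, which is the pattern to look for.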


 

Link to comment
1 minute ago, johnnie.black said:

Description sounds like what happens when there is a duplicate IP address. Make sure you don't have any other device with the same IP, or change that IP; if that's not it, try a different NIC.

Yes - a device with static IP assignment that matches what the DHCP server wants to send out for this machine would also introduce magical networking problems.

Link to comment

Ok, I am certainly not very educated with network issues, DHCP, etc., but I have tried to resolve this problem by adding Tower1 and Tower2 (along with their corresponding static IPs and MAC addresses) to the router's DHCP reservations list and then rebooting the router. Is this the right way to proceed?

 

As far as having another device on the LAN named "Tower1", it is not from me, but I have all kinds of audio/video devices with names created by the companies that built them that must be trying to use Tower1 as well. I saw this in real time while going from page to page in my router's firmware. My router does not give me the names of the devices, and at first neither of my 2 Towers' static IPs appeared in its DHCP client list. Then, one time flipping back to the DHCP client list, I saw that the IP for my troublesome Tower was suddenly listed, but not using the MAC address that unRaid reports as belonging to my Tower, so I assume that you guys are right and that some other device is robbing the IP from my unRaid Tower1.

 

Do you think that the DHCP reservation list should work, or that I would be better off changing the static IPs of both Towers to IPs that are much farther away from the ones that are being handed out by the router?

 

Thanks, guys! I think you have found the problem and I will soon be going in the right direction...:)

Link to comment
Quote

Ok, I am certainly not very educated with network issues, DHCP, etc., but I have tried to resolve this problem by adding Tower1 and Tower2 (along with their corresponding static IPs and MAC addresses) to the router's DHCP reservations list and then rebooting the router. Is this the right way to proceed?
 ...
Do you think that the DHCP reservation list should work


If you assign static IPs to your servers, then you should exclude those addresses from the range DHCP hands out. DHCP reservations are meant for the case where you want the servers to use DHCP but still control manually which addresses they get. The former has the advantage that the servers remain reachable even if the DHCP server fails. The latter has the advantage of keeping all IP configuration in one place. Your choice; personally I prefer the latter.

In the log you posted, the server continuously selects a new name for itself, which looks more like a naming conflict than an IP conflict, but it repeats for every new name the server selects. This gives the impression that it conflicts with itself. Some home routers scan the network to learn which IPs in their DHCP range are already in use, in order to avoid conflicts due to misconfigurations. This could lead to such symptoms. Clean up the DHCP setup (static IPs OR reservations, but not both) and report whether the problem persists.
Link to comment
Quote

This gives the impression that it conflicts with itself. 

When you wrote that, I started thinking and realized that my MB had 2 RJ45 ports included on the back panel. As I remember, they failed, so I bought a PCI card with a single RJ45. When I read your statement above, I wondered if I had disabled the 2 onboard RJ45s in the mobo's BIOS. I just rebooted the server, entered the BIOS, and found that both RJ45 connectors had been disabled, but I have to wonder if they still could somehow be the source of my conflict. Possible?

 

Quote

 Clean up the DHCP setup (static IPs OR reservations, but not both)

 

Since I have reservations setup on my router, I can't leave the static IP setup in unRaid? That is, I should switch unRaid back to  "automatic"?

 

So, I should switch the server back to "automatic" in 2 places:

 

1. settings/IP address assignment = automatic

2. settings/DNS server assignment = automatic

 

Is this correct?

 

BTW, I ordered a new PCI NIC from Amazon - Intel PWLA8391GT PRO/1000 GT PCI Network Adapter

Is that a good choice?

 

Edited by Oddwunn
Link to comment

I have set up a DHCP reservation on my router, changed the IP to "automatic" in unRaid and rebooted both units, yet the problem persists. I can not find any other computer in my LAN named "Tower1". I have attached the latest syslog to this message in hopes that there will be more information there that might provide a clue as to what is going on. Like I mentioned, I ordered a new NIC to install in the server, so I will change that card as soon as it arrives on Friday.

 

The syslog seems to indicate some sort of error with the WebUI, yet the WebUI seems to be working perfectly. The problem is still the same - Tower1 simply chooses to refuse the connection at random times, for random amounts of time, on random disks within the array, yet ALWAYS writes to the cache disk without issue. Is there anything else I can try while waiting for the new NIC to arrive?

OddwunnSyslog2.txt

Link to comment

Ok, these are just ideas on how I would address this problem.

 

1. Log in to each machine via the unRAID GUI, check each machine's IP address, and record it.

2. Set each machine to a static ip address.

3. Login to router and set those two addresses as reserved

 

You can also verify the address of each machine by running ifconfig from the console.

 

The problem you're having with your second machine sounds like the issue I had when my Nest Thermostat and my server were attempting to share the same IP address.

 

When I had two servers set up, I had a bizarre problem as well using Tower. I ended up just addressing each machine by its IP and avoided using Tower. I also renamed the machines, one to Tower and the other to XEON, to avoid it. But make sure the IPs aren't conflicting.

Link to comment

According to your initial post, point 3, you don't have issues when connecting via IP, so I still think that the issues are less likely to be related to IP address conflicts than something else.

 

Please check the following points:

  1. On the array server having issues go to Settings > Network Settings and check the Network protocol it is running on the active interface (IPv4, IPv6, both?)
  2. When you can connect via name, do the following:
    1. open a CMD window on your Windows machine and type "arp -a", this will provide you with a list of Ethernet (MAC) addresses associated with each IP address. Write down the MAC address associated with your server.
    2. in the CMD window type "ping tower1" and note, to which address the echo request packets are being sent. Is it an IPv4 or an IPv6 address?
  3. When you cannot connect via name, do the same as above and compare the output to when it is working. If you get a different MAC address than before, you have an IP conflict. If you get a different IP address when doing a ping, you have a name conflict.
  4. Do you have SMB shares from one server mounted on the other?
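The comparison in steps 2 and 3 can be sketched in Python, assuming you saved the "arp -a" output from the Windows machine once while things worked and once while they did not. The layout parsed here is the usual Windows one (IP, MAC with dashes, type). The MAC addresses, the client address 192.168.1.20 and the helper names are invented for illustration; only 192.168.1.110 comes from the posted syslog.

```python
import re

# One data row of Windows "arp -a": IP address, MAC with dashes, entry type.
ARP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+\w+\s*$"
)


def parse_arp_table(text):
    """Map each IP address in the 'arp -a' output to its lower-cased MAC."""
    table = {}
    for line in text.splitlines():
        match = ARP_ROW.match(line)
        if match:
            table[match.group(1)] = match.group(2).lower()
    return table


def diagnose(good_ip, good_mac, bad_ip, bad_mac):
    """Apply the rule from step 3: the name resolving to a different IP means
    a name conflict, the same IP with a different MAC means an IP conflict."""
    if good_ip != bad_ip:
        return "name conflict"
    if good_mac != bad_mac:
        return "IP conflict"
    return "no difference"


# Hypothetical captures; the MAC addresses are made up.
WORKING = """
Interface: 192.168.1.20 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           00-11-22-33-44-55     dynamic
  192.168.1.110         aa-bb-cc-dd-ee-01     dynamic
"""
FAILING = """
Interface: 192.168.1.20 --- 0xb
  Internet Address      Physical Address      Type
  192.168.1.1           00-11-22-33-44-55     dynamic
  192.168.1.110         aa-bb-cc-dd-ee-02     dynamic
"""

server_ip = "192.168.1.110"  # address that "ping tower1" reported both times
good_mac = parse_arp_table(WORKING)[server_ip]
bad_mac = parse_arp_table(FAILING)[server_ip]
print(diagnose(server_ip, good_mac, server_ip, bad_mac))
```

If the ping itself goes to a different address in the failing case, pass that address as `bad_ip` and the function reports a name conflict instead.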
Edited by tstor
Link to comment
2 minutes ago, tstor said:

According to your initial post, point 3, you don't have issues when connecting via IP, so I still think that the issues are less likely to be related to IP address conflicts than something else.

 

The machine constantly picking a new name will break the host resolution in Windows, which means it doesn't matter if the machine keeps the same IP or not.

 

But if the unRAID server has hardcoded settings, then it will not try to change hostname and so will constantly announce itself using the proper name.

Link to comment
2 minutes ago, pwm said:

The machine constantly picking a new name will break the host resolution in Windows, which means it doesn't matter if the machine keeps the same IP or not.

 

Yes, but if everything else is fine, the server shouldn't continuously change its name.

Link to comment

Hi guys,

 

Thanks for the responses, though admittedly a lot of it is over my head. Point 3 of my original post reads:

3. The web UI connects fine most of the time as "Tower1", but if it does not connect that way, then I simply put in the IP address (it is running a static IP) and I seem to be able to connect to it then.

Over the last few days, I will amend that statement to now read "SOMETIMES connecting via the IP address works and sometimes it doesn't."

 

But the main issue here is not in accessing the rogue server via the webUI, as MOST of the time I have no problems connecting. The IMPORTANT, MAIN problem is in writing to the array disks. Writing to the array is a hit or miss proposition at best. I can read from it 100% of the time, but sometimes I can not write to it, sometimes I can write to it but it bails out part way, and sometimes I can not write to it at all. Since this is a media server with large MKV video files, I would be panicking big time if I could not read from the array reliably. My workaround solution to writing to the array is to write the files to the cache disk, open a Putty window, and then use Midnight Commander to write the files to the final destination array disk, a slow and cumbersome process at best.

 

To answer some of the recent questions:

 

1. Yes, I can control the range of IPs available to be handed out by DHCP. Should I limit the DHCP handout range and then assign a static IP outside of the range to the rogue server?

2. "On the array server having issues go to Settings > Network Settings and check the Network protocol it is running on the active interface (IPv4, IPv6, both?)"

 

I don't see anything even remotely close to that. I see one section labeled "Interface eth 0" that displays the MAC address, several IPs, and 2 settings which allow me to choose between static and automatic IP. There is nowhere there that provides me with values like "IPv4, IPv6, or both" and since there is only a single interface, there is no other interface to look at.

 

3. Tstor said, in an earlier post "Clean up the DHCP setup (static IPs OR reservations, but not both)", yet kizer said 

"2. Set each machine to a static ip address.

3. Login to router and set those two addresses as reserved"

 

Which one is correct?

 

4. Opening a command prompt and typing arp -a yields perfect results - right now.

Link to comment

Shrink the IP range owned by the DHCP server.

 

In the unRAID configuration - specify a static name + IP, where the IP is outside of the range owned by the DHCP server.

 

The above will most probably be enough to get the machine working as expected.
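As a quick sanity check before typing the address into unRAID, the "outside of the DHCP range" rule can be tested with Python's standard `ipaddress` module. This is only a sketch: the pool boundaries and the /24 subnet below are assumptions (the router's real range is never stated in this thread), and `in_dhcp_pool` is a made-up helper name. 192.168.1.110 is the address from the posted syslog.

```python
import ipaddress


def in_dhcp_pool(candidate, pool_start, pool_end):
    """Return True if candidate lies inside the DHCP pool (bounds inclusive)."""
    ip = ipaddress.ip_address(candidate)
    return ipaddress.ip_address(pool_start) <= ip <= ipaddress.ip_address(pool_end)


# Assumed router settings, for illustration only.
SUBNET = ipaddress.ip_network("192.168.1.0/24")
POOL_START = "192.168.1.100"
POOL_END = "192.168.1.199"

for candidate in ("192.168.1.110", "192.168.1.10"):
    on_lan = ipaddress.ip_address(candidate) in SUBNET
    clash = in_dhcp_pool(candidate, POOL_START, POOL_END)
    if on_lan and not clash:
        verdict = "ok for a static IP"
    else:
        verdict = "pick another address"
    print(candidate, verdict)
```

With these assumed numbers, 192.168.1.110 sits inside the pool, so the router could hand the same address to another device, while 192.168.1.10 is on the same subnet but outside the pool.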

Link to comment

With "static IPs OR reservations, but not both" I mean the same thing as pwm is stating above: Manually assigned IP addresses should be outside of the range used by the DHCP server. So I suggest you follow his advice and try this as a next step.

 

The question for IPv6 was my mistake, I forgot that you're running 6.3.1. As far as I remember, IPv6 support has only been added in 6.4. If you say that you only see "several IPs" I assume you are referring to the IPv4 address, gateway, DNS server and DNS server 2, correct?

 

I understand that your main issue from a user perspective is writing to the array, but there are more variables involved with that than for the web GUI, so getting web access stable will probably help for the other issues, too.

 

A few other questions:

  • Is the connection server - switch stable (look for errors in "netstat -i")?
  • Does your router perhaps have a name for the local network defined that ends in .local?

unRAID runs an avahi-daemon that announces available services using multicast DNS names ending in .local

This daemon is known to create name conflicts with itself under some circumstances, e.g. issues with link stability, unicast DNS names ending in .local and others.
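For the link-stability check, the error counters in "netstat -i" can be picked out with a few lines of Python. This sketch assumes the classic Linux net-tools table with an "Iface" header row and columns such as RX-ERR and TX-ERR; the column set varies between versions, so the code looks columns up by header name. The sample table and all of its numbers are invented.

```python
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def interface_errors(netstat_output):
    """Return {interface: {column: count}} for the error/drop/overrun columns."""
    header = None
    result = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields  # remember the column names
            continue
        if header is None or len(fields) != len(header):
            continue  # title line or a row that does not fit the table
        row = dict(zip(header, fields))
        result[row["Iface"]] = {
            col: int(row[col]) for col in ERROR_COLUMNS if col in row
        }
    return result


# Invented sample in the net-tools layout; the numbers are not from the thread.
SAMPLE = """\
Kernel Interface table
Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500 0   1234567     42      3      0   765432      0      0      0 BMRU
lo    65536 0      1024      0      0      0     1024      0      0      0 LRU
"""

errors = interface_errors(SAMPLE)
for iface, counters in errors.items():
    bad = {name: count for name, count in counters.items() if count}
    print(iface, bad if bad else "clean")
```

Counters that keep growing between two runs on the active interface would point at the cable, switch port or NIC rather than at the DHCP or name setup.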

 

Do you have Apple software (e.g iTunes, Safari) installed on one or more of your Windows machines?

Link to comment

With "static IPs OR reservations, but not both" I mean the same thing as pwm is stating above: Manually assigned IP addresses should be outside of the range used by the DHCP server. So I suggest you follow his advice and try this as a next step.

Ok, got it...I will do that next.

If you say that you only see "several IPs" I assume you are referring to the IPv4 address, gateway, DNS server and DNS server 2, correct?

Yes, that is correct.

I understand that your main issue from a user perspective is writing to the array, but there are more variables involved with that than for the web GUI, so getting web access stable will probably help for the other issues, too.

Ok...got it.

  • Is the connection server - switch stable (look for errors in "netstat -i")?
  • Does your router eventually have a name for the local network defined that ends in .local?

Do you have Apple software (e.g iTunes, Safari) installed on one or more of your Windows machines?

Nope...I don't own anything with Apple's name on it.

 

While I have been composing this reply, I made the changes the 2 of you suggested, and now the server has a distinctly different "feel" to it (for lack of a better word). The webUI responds very quickly now, where it once seemed lethargic even when it was working. I am testing it further by writing several large files to it in order to see how it responds over a longer period of time. I will report back here as soon as I have the results (some time tomorrow, as I am writing about 2TB of files.)

 

Once again, thank you for hanging with me with this problem, as I am very hopeful that the problem may be solved.

 
I have no idea where to find that information.
Link to comment
Quote

 


Shrink the IP range owned by the DHCP server.

 

In the unRAID configuration - specify a static name + IP, where the IP is outside of the range owned by the DHCP server.

 

 

It has been almost 24 hours since I followed the above advice, and so far everything is working properly. I am not technically educated enough to know why this solution works, but I am just thrilled to have things working properly with both servers.

 

Many thanks to everyone who contributed to this thread. You guys really know your stuff!!

Edited by Oddwunn
Link to comment
