unRAID server having severe DNS issues


Solved by autumnwalker


These messages come from a clock that is running far off the real time.

You need to stop the ntpd daemon (it won't sync if the current clock is more than a certain limit away from the real time), then issue "ntpdate <server>" (you can use any known NTP server, like time.windows.com).

ntpdate will sync to the received time without any limit.

Afterwards you can restart ntpd.

(If this happens again soon after, check the battery on your motherboard.)
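The stop / sync / restart sequence above could be wrapped roughly like this. It is only a sketch: the rc script path `/etc/rc.d/rc.ntpd` and the server `time.cloudflare.com` are taken from later posts in this thread, and the `run` and `resync_clock` helpers and the `DRY_RUN` switch are made up for illustration.

```shell
#!/bin/bash
# Sketch of the stop / step / restart sequence (assumed paths, see text).
RC_NTPD="${RC_NTPD:-/etc/rc.d/rc.ntpd}"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# resync_clock SERVER: stop ntpd, set the clock once with ntpdate,
# then bring ntpd back up. Stops early if a step fails.
resync_clock() {
    local server="${1:-time.cloudflare.com}"
    run "$RC_NTPD" stop || return 1
    run ntpdate "$server" || return 1
    run "$RC_NTPD" restart
}
```

Running `DRY_RUN=1 resync_clock time.cloudflare.com` only prints the three commands, which is a cheap way to check them before touching the clock.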

 

Edited by MAM59
Link to comment
4 hours ago, MAM59 said:

These messages come from a clock that is running far off the real time.

You need to stop the ntpd daemon (it won't sync if the current clock is more than a certain limit away from the real time), then issue "ntpdate <server>" (you can use any known NTP server, like time.windows.com).

ntpdate will sync to the received time without any limit.

Afterwards you can restart ntpd.

(If this happens again soon after, check the battery on your motherboard.)

 

I tried this. I believe the ntp errors occur because the ntp daemon isn't able to connect to the remote "clock" due to the DNS issue.
[Attached image: image.png]

Link to comment
5 minutes ago, Astatine said:

I believe the ntp errors occur because the ntp daemon isn't able to connect to the remote "clock" due to the DNS issue

Yeah, but those "a message from the past..." entries mean that the clock is far off. Of course, without LAN you cannot sync, but the error happened before that. There is still no clue in your logs about what is going on.

I meant you should sync the clock with ntpdate BEFORE the LAN is gone. Reboot, stop ntpd, use ntpdate, restart ntpd and wait.

We want to be sure that the box has the correct time when the error occurs.

Edited by MAM59
Link to comment
8 minutes ago, MAM59 said:

Yeah, but those "a message from the past..." entries mean that the clock is far off. Of course, without LAN you cannot sync, but the error happened before that. There is still no clue in your logs about what is going on.

I meant you should sync the clock with ntpdate BEFORE the LAN is gone. Reboot, stop ntpd, use ntpdate, restart ntpd and wait.

We want to be sure that the box has the correct time when the error occurs.

Okay. I'm checking what JorgeB suggested. Booted in safe mode. I'll observe the server for a while in safe mode and after ntpdate. 

 

Just bear with my stupid questions. I just want to understand as much as I can in case something like this happens in the future.

Link to comment

@JorgeB Okay, so I waited about an hour after booting into safe mode. The server lost connection again. I guess we can rule out a bad plugin as the cause of the issue.

@MAM59 I rebooted and used the following commands to resync ntp:

/etc/rc.d/rc.ntpd stop
ntpdate time.cloudflare.com
/etc/rc.d/rc.ntpd restart

Right after restarting the service, I checked the logs and found this:

Mar 14 16:33:36 astatine  ntpd[1153]: ntpd exiting on signal 1 (Hangup)
Mar 14 16:33:36 astatine  ntpd[1153]: 127.127.1.0 local addr 127.0.0.1 -> <null>
Mar 14 16:33:36 astatine  ntpd[1153]: 216.239.35.0 local addr 192.168.1.4 -> <null>
Mar 14 16:33:36 astatine  ntpd[1153]: 216.239.35.4 local addr 192.168.1.4 -> <null>
Mar 14 16:33:36 astatine  ntpd[1153]: 216.239.35.8 local addr 192.168.1.4 -> <null>
Mar 14 16:33:36 astatine  ntpd[1153]: 216.239.35.12 local addr 192.168.1.4 -> <null>
Mar 14 16:35:02 astatine  ntpd[18956]: ntpd 4.2.8p15@1.3728-o Fri Jun  3 04:17:10 UTC 2022 (1): Starting
Mar 14 16:35:02 astatine  ntpd[18956]: Command line: /usr/sbin/ntpd -g -u ntp:ntp
Mar 14 16:35:02 astatine  ntpd[18956]: ----------------------------------------------------
Mar 14 16:35:02 astatine  ntpd[18956]: ntp-4 is maintained by Network Time Foundation,
Mar 14 16:35:02 astatine  ntpd[18956]: Inc. (NTF), a non-profit 501(c)(3) public-benefit
Mar 14 16:35:02 astatine  ntpd[18956]: corporation.  Support and training for ntp-4 are
Mar 14 16:35:02 astatine  ntpd[18956]: available at https://www.nwtime.org/support
Mar 14 16:35:02 astatine  ntpd[18956]: ----------------------------------------------------
Mar 14 16:35:02 astatine  ntpd[18960]: proto: precision = 0.034 usec (-25)
Mar 14 16:35:02 astatine  ntpd[18960]: basedate set to 2022-05-22
Mar 14 16:35:02 astatine  ntpd[18960]: gps base set to 2022-05-22 (week 2211)
Mar 14 16:35:02 astatine  ntpd[18960]: Listen normally on 0 lo 127.0.0.1:123
Mar 14 16:35:02 astatine  ntpd[18960]: Listen normally on 1 br0 192.168.1.4:123
Mar 14 16:35:02 astatine  ntpd[18960]: Listen normally on 2 lo [::1]:123
Mar 14 16:35:02 astatine  ntpd[18960]: Listening on routing socket on fd #19 for interface updates
Mar 14 16:35:02 astatine  ntpd[18960]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Mar 14 16:35:02 astatine  ntpd[18960]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

kernel reports TIME_ERROR: 0x41: Clock Unsynchronized 

Edited by Astatine
Link to comment

So, I was looking online and found that the 'Clock Unsync' issue is sometimes caused by an incorrect BIOS time. So, I checked that and, lo and behold, the time is correct. As expected, the time is in UTC and it is correct. Which means we can rule out a faulty CMOS battery, right @MAM59?

Anyway, the good thing is that after my recent reboot I am no longer seeing any clock out of sync errors. However, the internet connection problem remains.

Edited by Astatine
Link to comment
6 hours ago, Astatine said:

rule out a faulty CMOS battery, right @MAM59?

Yeah, but I think the manual time correction was necessary because ntp alone would never have reached the real time. It only adjusts in minimal steps and calculates a correction factor for a (believed) unreliable CMOS clock, so it can take ages until it works.

ntpdate (once should be enough if the battery is ok) forces an update and stores the received time into the CMOS clock.

On the next boot, ntp should find an "almost correct" clock and keep it in sync with its mini-adjustments.

(But don't expect too much: the time problem is not what you are searching for, it is just an additional problem that came up because of your lack of internet connection. Anyway, fix it before it gets in your way too much.)

 

 

Edited by MAM59
Link to comment

Yeah, totally agreed. The time sync issue seems to have occurred due to the loss of connection and not the other way around. However, is there something else that comes to mind for diagnosing the real issue? I'm going to try booting into a live Ubuntu install later in the evening and observe whether something similar happens there too. In the meantime, I was also wondering if there is a way to do a fresh install of unRAID while keeping my config and settings intact. That way, if there is a corrupt file somewhere, it would be fixed. Thoughts?

Link to comment
31 minutes ago, Astatine said:

I was also wondering if there is a way to do a fresh install of unRAID while keeping my config and settings intact. That way, if there is a corrupt file somewhere, it would be fixed

The core Unraid files are unpacked into RAM from the archives on the flash drive every time that Unraid boots so in that sense they should not be corrupt.   It is possible that something in the config folder which holds all your settings/customisations is corrupt but if you do not keep that you have to redo your settings.  You can simply stop plugins being loaded by booting in Safe Mode, but any other change would require you to do something to your settings.

Link to comment
49 minutes ago, Astatine said:

is there something else that comes to mind for diagnosing the real issue?

Reading your diagnostics gives absolutely no clue about what is going on. Even the "loss of LAN" is not recognized, so I would go with JorgeB ("the error is outside of unraid").

If the link to the switch were lost, it would be noted. If you say "the others" can still communicate, it's also not the link to the router.

The only faint (VERY FAINT!) idea that comes to my mind is a blocking port. Try to disable "flow control" in the network settings (you need to use the "tips & tricks" plugin for this).

But another test you can do (and report the result here) is to wait until communication is gone. Then pull the ethernet plug, wait a few seconds, then put it back in again.

I'm curious to see if a brutal line cut will reset the error state...

 

Link to comment
On 3/15/2023 at 3:38 PM, itimpi said:

The core Unraid files are unpacked into RAM from the archives on the flash drive every time that Unraid boots so in that sense they should not be corrupt.   It is possible that something in the config folder which holds all your settings/customisations is corrupt but if you do not keep that you have to redo your settings.  You can simply stop plugins being loaded by booting in Safe Mode, but any other change would require you to do something to your settings.

I am open to setting up everything from scratch. How can I have a fresh start without using one of the limited USB key resets? I believe all the parity data, appdata, etc. will remain the same in the new install so long as I assign each drive back to the same purpose. As for the docker containers, I am willing to install them again and point the new containers to the old appdata to bring the server back to its latest state.

 

Is there a way to accomplish the above?

Link to comment
On 3/15/2023 at 4:00 PM, MAM59 said:

Reading your diagnostics gives absolutely no clue about what is going on. Even the "loss of LAN" is not recognized, so I would go with JorgeB ("the error is outside of unraid").

If the link to the switch were lost, it would be noted. If you say "the others" can still communicate, it's also not the link to the router.

The only faint (VERY FAINT!) idea that comes to my mind is a blocking port. Try to disable "flow control" in the network settings (you need to use the "tips & tricks" plugin for this).

But another test you can do (and report the result here) is to wait until communication is gone. Then pull the ethernet plug, wait a few seconds, then put it back in again.

I'm curious to see if a brutal line cut will reset the error state...

 

Well, with all due respect, I find it hard to believe that it is not an unRAID issue. I booted the server from a live Ubuntu USB and left it pinging google.com, cloudflare.com and unraid.net for more than a day. The packet loss was as follows: 0% for google.com, 0.00383547% for unraid.net and 0.00153467% for cloudflare.com. So, I believe this rules out the possibility of a hardware issue.

 

- Not a router issue, as all my other devices are working just fine connecting to the outside world

- Not a switch issue, as all the other devices connected to the switch are working fine + the unRAID server also gets a connection for 10-15 minutes after reboot and then loses DNS

- Not a hardware issue as booting into live ubuntu USB doesn't seem to have the same DNS resolution issue as the unRAID server

 

With this new information on hand, I can confidently say that it is an unRAID issue. Whether it originates from the OS or a corrupted config file is a different story.
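For anyone repeating the long-running ping test, the loss percentage can be recomputed from the summary line of each ping log. This is a rough sketch that assumes the iputils-style summary ("N packets transmitted, M received, ..."); the helper names `summary_counts` and `loss_percent` are made up here, and the attached logs may be formatted differently.

```shell
#!/bin/bash
# summary_counts: read ping output on stdin and print "TRANSMITTED RECEIVED".
# Assumes an iputils-style summary line.
summary_counts() {
    awk '/packets transmitted/ { print $1, $4 }'
}

# loss_percent TRANSMITTED RECEIVED: print packet loss as a percentage.
loss_percent() {
    awk -v tx="$1" -v rx="$2" 'BEGIN {
        if (tx <= 0) { print "n/a"; exit }
        printf "%.6f\n", (tx - rx) / tx * 100
    }'
}

# Example with made-up counts: 1 lost packet out of 100000 sent.
#   loss_percent 100000 99999
# Example against a saved log (the file name is an assumption):
#   loss_percent $(summary_counts < ping_unraid.txt)
```

At these magnitudes a handful of dropped packets over a full day is enough to produce the figures quoted, so the comparison between hosts matters more than the absolute numbers.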

 

ping_cloudflare.txt ping_google.txt ping_unraid.txt

Link to comment

Set up a new stock Unraid flash drive (no key needed), boot the server with it and test. If there are no issues, it suggests a problem with the current /config. You can then try to copy just the bare minimum from the old flash drive and reconfigure the rest of the server, or copy a few config files at a time and re-test to see if you can find the culprit.
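To decide which files to copy back first, it may help to list what actually differs between the old config folder and the fresh one. A minimal sketch, assuming both folders are reachable from one shell; the function name `changed_config_files` and the example paths are hypothetical.

```shell
#!/bin/bash
# changed_config_files OLD_DIR NEW_DIR
# Lists files that differ between the two trees, or exist on one side only.
changed_config_files() {
    local status=0
    # diff exits with 1 when differences are found; only >1 is a real error.
    diff -rq "$1" "$2" || status=$?
    [ "$status" -le 1 ]
}

# Example (paths are assumptions: old flash mounted somewhere, new flash at /boot):
#   changed_config_files /mnt/disks/oldflash/config /boot/config
```

Files reported as differing are the candidates to copy over a few at a time, re-testing after each batch.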

Link to comment
1 minute ago, JorgeB said:

Set up a new stock Unraid flash drive (no key needed), boot the server with it and test. If there are no issues, it suggests a problem with the current /config. You can then try to copy just the bare minimum from the old flash drive and reconfigure the rest of the server, or copy a few config files at a time and re-test to see if you can find the culprit.

Thanks. I'll get back to you once I've tried it.

Link to comment

I appear to be having the same symptoms. I'm not getting clock sync errors or anything. I can hit anything on my local LAN using IP or DNS entry (using pfSense DNS resolver), but anything that my local DNS cannot resolve (e.g. google.com) I get "Destination Host Unreachable". 

 

What's odd is I tried the same ping tests from inside a VM running on Unraid while Unraid is having these issues ... and the VM has no network problems at all. The VM is configured to use the same pfSense DNS resolver.
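One way to separate a name-resolution failure from a routing failure is to probe a public IP and a hostname independently. "Destination Host Unreachable" usually points at routing rather than at the resolver, so it seems worth checking both. The sketch below is an assumption-laden illustration: `1.1.1.1` and `google.com` are arbitrary targets, the function names are made up, and it assumes `ping` and `getent` are available on the host.

```shell
#!/bin/bash
# classify_failure IP_RC DNS_RC: map two probe exit codes (0 = success)
# to a rough diagnosis.
classify_failure() {
    local ip_rc="$1" dns_rc="$2"
    if [ "$ip_rc" -ne 0 ]; then
        echo "routing"   # a public IP is unreachable, so not (only) DNS
    elif [ "$dns_rc" -ne 0 ]; then
        echo "dns"       # IP traffic works, but the name lookup fails
    else
        echo "ok"
    fi
}

# probe: run both checks once and print the diagnosis.
probe() {
    local ip_rc=0 dns_rc=0
    ping -c 1 -W 2 1.1.1.1 >/dev/null 2>&1 || ip_rc=$?
    getent hosts google.com >/dev/null 2>&1 || dns_rc=$?
    classify_failure "$ip_rc" "$dns_rc"
}
```

Running `probe` on the Unraid host and again inside the VM while the problem is active would show whether the two really differ at the IP level or only at name lookup.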

 

I wasn't having this issue until after I upgraded to 6.11.5. I was previously on 6.10.3.

Link to comment

@autumnwalker Yes, I am using ipvlan + host access to custom networks because I want the prowlarr, jackett containers on Host IP to communicate with some containers that are on a different IP setup using br0 interface. I'll turn off the "host access to custom networks" and see if the issue gets fixed like it did for the folks you mentioned

 

Edited by Astatine
Link to comment
On 3/20/2023 at 1:15 PM, JorgeB said:

Set up a new stock Unraid flash drive (no key needed), boot the server with it and test. If there are no issues, it suggests a problem with the current /config. You can then try to copy just the bare minimum from the old flash drive and reconfigure the rest of the server, or copy a few config files at a time and re-test to see if you can find the culprit.

@JorgeB So, I did what you asked and it turns out the test unRAID flash drive has zero issues. And after going through what @autumnwalker mentioned above, I feel like it's a known issue with how docker handles networking that hasn't been fixed yet. I'll test this theory before anything else.

Link to comment
