DNS stops working
Needs testing, Low, Public, BUG

Description

Certain domains randomly stop resolving for about 5 minutes. There is no pattern in which domains are affected. It happens 3-4 times a day. It rights itself with no configuration changes, and nothing I've tried speeds that up, short of rebooting the router, which is only a short-term fix.

google.com

$ dig google.com

; <<>> DiG 9.11.3-1ubuntu1.5-Ubuntu <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 39675
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.            IN  A

;; AUTHORITY SECTION:
.           2647    IN  SOA a.root-servers.net. nstld.verisign-grs.com. 2019041000 1800 900 604800 86400

;; Query time: 2 msec
;; SERVER: 192.168.18.1#53(192.168.18.1)
;; WHEN: Wed Apr 10 21:59:38 AEST 2019
;; MSG SIZE  rcvd: 111

21:59

$  host google.com
Host google.com not found: 3(NXDOMAIN)
$  nslookup google.com
Server:     192.168.18.1
Address:    192.168.18.1#53

** server can't find google.com: NXDOMAIN

On router itself, resolution is fine.

vyos@graham-vyos:~$  host google.com
google.com has address 216.58.196.142
google.com has IPv6 address 2404:6800:4006:805::200e
google.com mail is handled by 10 aspmx.l.google.com.
google.com mail is handled by 40 alt3.aspmx.l.google.com.
google.com mail is handled by 50 alt4.aspmx.l.google.com.
google.com mail is handled by 20 alt1.aspmx.l.google.com.
google.com mail is handled by 30 alt2.aspmx.l.google.com.
vyos@graham-vyos:~$ sudo rec_control ping
pong

Running host google.com on the router makes no difference to resolving google.com inside my LAN.

router lan interface tcpdump port 53
around this timestamp I hit refresh repeatedly on google.com in chrome on linux (hostname blue-canoe, graham is the router running vyos).

21:59:16.745527 IP blue-canoe.40818 > graham.domain: 14122+ A? www.google.com. (32)
21:59:16.745597 IP graham.domain > blue-canoe.40818: 14122 NXDomain 0/1/0 (104)
21:59:16.747666 IP blue-canoe.54593 > graham.domain: 9066+ A? www.google.com.xxxxx.info. (47)
21:59:16.747730 IP graham.domain > blue-canoe.54593: 9066 NXDomain 0/1/0 (96)
21:59:16.764154 IP blue-canoe.45325 > graham.domain: 62526+ A? www.google.com. (32)
21:59:16.764229 IP graham.domain > blue-canoe.45325: 62526 NXDomain 0/1/0 (104)
21:59:16.765527 IP blue-canoe.44157 > graham.domain: 30752+ A? www.google.com.xxxxx.info. (47)
21:59:16.765591 IP graham.domain > blue-canoe.44157: 30752 NXDomain 0/1/0 (96)
21:59:16.767154 IP blue-canoe.35299 > graham.domain: 12206+ A? www.google.com. (32)
21:59:16.767227 IP graham.domain > blue-canoe.35299: 12206 NXDomain 0/1/0 (104)
21:59:16.768541 IP blue-canoe.37555 > graham.domain: 20245+ A? www.google.com.xxxxx.info. (47)
21:59:16.768604 IP graham.domain > blue-canoe.37555: 20245 NXDomain 0/1/0 (96)

21:59:16.770041 IP blue-canoe.39109 > graham.domain: 64731+ A? www.google.com. (32)
21:59:16.770112 IP graham.domain > blue-canoe.39109: 64731 NXDomain 0/1/0 (104)
21:59:16.771904 IP blue-canoe.42471 > graham.domain: 11000+ A? www.google.com.xxxxxxx.info. (47)
21:59:16.771966 IP graham.domain > blue-canoe.42471: 11000 NXDomain 0/1/0 (96)
21:59:16.803780 IP blue-canoe.40262 > graham.domain: 49111+ A? www.google.com. (32)
21:59:16.803853 IP graham.domain > blue-canoe.40262: 49111 NXDomain 0/1/0 (104)
21:59:16.806279 IP blue-canoe.49720 > graham.domain: 57719+ A? www.google.com.xxxxxxx.info. (47)
21:59:16.806341 IP graham.domain > blue-canoe.49720: 57719 NXDomain 0/1/0 (96)
21:59:16.835405 IP blue-canoe.41244 > graham.domain: 30213+ A? www.google.com. (32)
21:59:16.835477 IP graham.domain > blue-canoe.41244: 30213 NXDomain 0/1/0 (104)
21:59:16.836906 IP blue-canoe.53697 > graham.domain: 24971+ A? www.google.com.xxxxxxx.info. (47)
21:59:16.836975 IP graham.domain > blue-canoe.53697: 24971 NXDomain 0/1/0 (96)
21:59:17.197047 IP mediaroom-tv.13159 > graham.domain: 27754+ A? socialize.au1.gigya.com. (41)
21:59:17.197212 IP graham.domain > mediaroom-tv.13159: 27754 NXDomain 0/1/0 (113)
21:59:17.197668 IP mediaroom-tv.32640 > graham.domain: 2787+ A? socialize.au1.gigya.com.xxxxxxx.info. (56)
21:59:17.197846 IP graham.domain > mediaroom-tv.32640: 2787 NXDomain 0/1/0 (105)
21:59:17.432176 IP mediaroom-tv.31617 > graham.domain: 21253+ A? bcp.crwdcntrl.net. (35)
21:59:17.444639 IP graham.domain > mediaroom-tv.31617: 21253 NXDomain 2/1/0 CNAME td.crwdcntrl.net., CNAME nginx-bcp-stackA-1013960178.ap-southeast-2.elb.amazonaws.com. (195)
21:59:17.445439 IP mediaroom-tv.33475 > graham.domain: 22858+ A? bcp.crwdcntrl.net.xxxxxxx.info. (50)
21:59:17.445741 IP graham.domain > mediaroom-tv.33475: 22858 NXDomain 0/1/0 (99)
21:59:17.548682 IP blue-canoe.38113 > graham.domain: 32303+ A? www.google.com. (32)
21:59:17.548771 IP graham.domain > blue-canoe.38113: 32303 NXDomain 0/1/0 (104)
21:59:17.551055 IP blue-canoe.55358 > graham.domain: 25861+ A? www.google.com.xxxxxxx.info. (47)
21:59:17.551124 IP graham.domain > blue-canoe.55358: 25861 NXDomain 0/1/0 (96)
21:59:17.554305 IP blue-canoe.42845 > graham.domain: 2210+ A? www.google.com. (32)
21:59:17.554377 IP graham.domain > blue-canoe.42845: 2210 NXDomain 0/1/0 (104)
21:59:17.556555 IP blue-canoe.41995 > graham.domain: 6374+ A? www.google.com.xxxxxxx.info. (47)
21:59:17.556619 IP graham.domain > blue-canoe.41995: 6374 NXDomain 0/1/0 (96)
21:59:18.165826 IP blue-canoe.46368 > graham.domain: 56695+ A? www.google.com. (32)
21:59:18.165930 IP graham.domain > blue-canoe.46368: 56695 NXDomain 0/1/0 (104)
21:59:18.168701 IP blue-canoe.50915 > graham.domain: 33099+ A? www.google.com.xxxxxxx.info. (47)
21:59:18.168776 IP graham.domain > blue-canoe.50915: 33099 NXDomain 0/1/0 (96)
21:59:18.171836 IP blue-canoe.40443 > graham.domain: 28363+ A? www.google.com. (32)
21:59:18.171964 IP graham.domain > blue-canoe.40443: 28363 NXDomain 0/1/0 (104)
21:59:18.174825 IP blue-canoe.44161 > graham.domain: 18174+ A? www.google.com.xxxxxxx.info. (47)
21:59:18.174906 IP graham.domain > blue-canoe.44161: 18174 NXDomain 0/1/0 (96)
21:59:18.232833 IP blue-canoe.37112 > graham.domain: 42181+ A? www.google.com. (32)
21:59:18.232918 IP graham.domain > blue-canoe.37112: 42181 NXDomain 0/1/0 (104)
21:59:18.234203 IP blue-canoe.54379 > graham.domain: 45708+ A? www.google.com.xxxxxxx.info. (47)
21:59:18.234284 IP graham.domain > blue-canoe.54379: 45708 NXDomain 0/1/0 (96)
21:59:18.237202 IP blue-canoe.33771 > graham.domain: 46809+ A? www.google.com. (32)
21:59:18.237265 IP graham.domain > blue-canoe.33771: 46809 NXDomain 0/1/0 (104)
21:59:18.239327 IP blue-canoe.57118 > graham.domain: 29551+ A? www.google.com.xxxxxxx.info. (47)
21:59:18.239401 IP graham.domain > blue-canoe.57118: 29551 NXDomain 0/1/0 (96)
21:59:18.556466 IP blue-canoe.34221 > graham.domain: 20864+ A? www.google.com. (32)
21:59:18.556543 IP graham.domain > blue-canoe.34221: 20864 NXDomain 0/1/0 (104)
21:59:18.558089 IP blue-canoe.54049 > graham.domain: 19530+ A? www.google.com.xxxxxxx.info. (47)
21:59:18.558152 IP graham.domain > blue-canoe.54049: 19530 NXDomain 0/1/0 (96)
21:59:18.560237 IP blue-canoe.42765 > graham.domain: 16723+ A? www.google.com. (32)
21:59:18.560305 IP graham.domain > blue-canoe.42765: 16723 NXDomain 0/1/0 (104)
21:59:18.561713 IP blue-canoe.41813 > graham.domain: 21407+ A? www.google.com.xxxxxxx.info. (47)
21:59:18.561789 IP graham.domain > blue-canoe.41813: 21407 NXDomain 0/1/0 (96)

tcpdump of port 53 on router WAN interface

21:58:33.031563 IP one.one.one.one.domain > xxx-xxx-xxx-xxx.myrepublic.net.42790: 26354 0/1/0 (92)
21:59:16.698347 IP xxx-xxx-xxx-xxx.myrepublic.net.48441 > 103-217-165-53.myrepublic.net.domain: 6265+ [1au] A? asia.adform.net. (44)
21:59:16.708928 IP 103-217-165-53.myrepublic.net.domain > xxx-xxx-xxx-xxx.myrepublic.net.48441: 6265 6/0/1 CNAME track-apac.adformnet.akadns.net., A 185.84.60.25, A 185.84.60.29, A 185.84.60.23, A 185.84.60.27, A 185.84.60.12 (166)
21:59:17.432390 IP xxx-xxx-xxx-xxx.myrepublic.net.61432 > 103-217-165-53.myrepublic.net.domain: 2427+ [1au] A? bcp.crwdcntrl.net. (46)
21:59:17.444421 IP 103-217-165-53.myrepublic.net.domain > xxx-xxx-xxx-xxx.myrepublic.net.61432: 2427 4/0/1 CNAME td.crwdcntrl.net., CNAME nginx-bcp-stackA-1013960178.ap-southeast-2.elb.amazonaws.com., A 3.104.4.113, A 13.210.233.18 (169)
21:59:18.574420 IP xxx-xxx-xxx-xxx.myrepublic.net.13604 > 103-217-165-53.myrepublic.net.domain: 23454+ [1au] A? config.swm.digital. (47)
21:59:18.585203 IP 103-217-165-53.myrepublic.net.domain > xxx-xxx-xxx-xxx.myrepublic.net.13604: 23454 4/0/1 A 13.35.146.96, A 13.35.146.95, A 13.35.146.31, A 13.35.146.20 (111)
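The two captures above can be cross-checked mechanically by pairing each query with its reply (same client, source port and query ID) and counting the response codes per name. The following is only a minimal sketch: the regular expressions are assumptions fitted to the tcpdump text format shown in this report, and the function name `summarize` is hypothetical. Note that tcpdump prints no response code for a successful reply, so a reply starting with the `n/n/n` record counts is treated as NoError.

```python
import re
from collections import Counter

# Query line, e.g. "... IP blue-canoe.40818 > graham.domain: 14122+ A? www.google.com. (32)"
QUERY_RE = re.compile(
    r"^\S+ IP (?P<client>\S+)\.(?P<port>\d+) > \S+\.domain: "
    r"(?P<id>\d+)\+ (?:\[\w+\] )?(?P<qtype>\w+)\? (?P<name>\S+) "
)
# Reply line, e.g. "... IP graham.domain > blue-canoe.40818: 14122 NXDomain 0/1/0 (104)"
REPLY_RE = re.compile(
    r"^\S+ IP \S+\.domain > (?P<client>\S+)\.(?P<port>\d+): "
    r"(?P<id>\d+)[^\s\d]* (?P<rest>.*)$"
)

def summarize(lines):
    """Count (queried name, response code) pairs from tcpdump text output."""
    pending = {}       # (client, port, query id) -> queried name
    counts = Counter()
    for line in lines:
        line = line.strip()
        query = QUERY_RE.match(line)
        if query:
            pending[(query["client"], query["port"], query["id"])] = query["name"]
            continue
        reply = REPLY_RE.match(line)
        if not reply:
            continue
        name = pending.pop((reply["client"], reply["port"], reply["id"]), None)
        parts = reply["rest"].split()
        if name is None or not parts:
            continue  # reply without a matching query in this capture
        # A leading "answers/authority/additional" count means no error code was printed.
        rcode = "NoError" if parts[0][0].isdigit() else parts[0]
        counts[(name, rcode)] += 1
    return counts

# Lines copied from the LAN and WAN captures in this report.
SAMPLE = """
21:59:16.745527 IP blue-canoe.40818 > graham.domain: 14122+ A? www.google.com. (32)
21:59:16.745597 IP graham.domain > blue-canoe.40818: 14122 NXDomain 0/1/0 (104)
21:59:16.747666 IP blue-canoe.54593 > graham.domain: 9066+ A? www.google.com.xxxxx.info. (47)
21:59:16.747730 IP graham.domain > blue-canoe.54593: 9066 NXDomain 0/1/0 (96)
21:59:17.432390 IP xxx-xxx-xxx-xxx.myrepublic.net.61432 > 103-217-165-53.myrepublic.net.domain: 2427+ [1au] A? bcp.crwdcntrl.net. (46)
21:59:17.444421 IP 103-217-165-53.myrepublic.net.domain > xxx-xxx-xxx-xxx.myrepublic.net.61432: 2427 4/0/1 CNAME td.crwdcntrl.net., CNAME nginx-bcp-stackA-1013960178.ap-southeast-2.elb.amazonaws.com., A 3.104.4.113, A 13.210.233.18 (169)
"""

for (name, rcode), n in sorted(summarize(SAMPLE.splitlines()).items()):
    print(f"{name:35} {rcode:10} {n}")
```

Run against full captures, a summary like this makes it easy to see whether the names that fail on the LAN side were ever forwarded upstream during the outage window.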

22:06 it's back with no configuration changes at all in any way.

It's almost like it has its wires crossed about a cached value and is sure it has checked and the domain doesn't resolve. Then it does check again and resolves it forevermore until the next time about 6 hours later.
I have tried setting the cache size to 0 and also to 128 with no discernible difference.
I switched off dnssec in case that helped; no luck. I logged the packets dropped by the firewall; nothing interesting. I made sure port 53 is not firewalled; no difference. I've followed other superstitions too, with no luck.

vyos@graham-vyos:~$ cat /etc/powerdns/recursor.conf

### Autogenerated by dns_forwarding.py ###

# Non-configurable defaults
daemon=yes
threads=1
allow-from=0.0.0.0/0, ::/0
log-common-errors=yes
non-local-bind=yes
query-local-address=0.0.0.0
query-local-address6=::

# cache-size
max-cache-entries=128

# negative TTL for NXDOMAIN
max-negative-ttl=3600

# ignore-hosts-file
export-etc-hosts=yes

# listen-on
local-address=192.168.18.1

# domain ... server ...

# dnssec
dnssec=off

# name-server
forward-zones-recurse=.=103.217.165.53;45.248.197.53
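For anyone comparing configurations across affected routers, the file above is simple enough to read programmatically. This is a rough sketch only: the helper names `parse_recursor_conf` and `parse_forward_zones` are hypothetical, and the parsing assumes the plain `key=value` layout shown here, with zones separated by commas and forwarders by semicolons. It pulls out the two settings most relevant to the symptoms, the negative-cache ceiling and the upstream forwarders.

```python
def parse_recursor_conf(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")  # split on the first "=" only
        if sep:
            settings[key.strip()] = value.strip()
    return settings

def parse_forward_zones(value):
    """Turn 'zone=ip;ip, zone2=ip' into {zone: [ip, ...]}."""
    zones = {}
    for entry in value.split(","):
        zone, sep, servers = entry.strip().partition("=")
        if sep:
            zones[zone.strip()] = [s.strip() for s in servers.split(";") if s.strip()]
    return zones

# Excerpt of the recursor.conf shown in this report.
CONF = """
# cache-size
max-cache-entries=128

# negative TTL for NXDOMAIN
max-negative-ttl=3600

# dnssec
dnssec=off

# name-server
forward-zones-recurse=.=103.217.165.53;45.248.197.53
"""

settings = parse_recursor_conf(CONF)
print("negative TTL ceiling (s):", settings["max-negative-ttl"])
print("forwarders:", parse_forward_zones(settings["forward-zones-recurse"]))
```

The `max-negative-ttl` value is only an upper bound on how long a negative answer may be cached; nothing here shows that it is the cause of the outages described.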

Just quietly, I might be almost ready to go back to running straight Debian as a router, using dnsmasq, iptables commands etc., because I can't fix or even diagnose this. Which seems a shame: the vyos configuration UI seems nice to use, but this is actually quite annoying for users of the local network when suddenly, mid-flow, things just stop working.

Cheers vyos team. I really do like your work...

Details

Difficulty level
Unknown (require assessment)
Version
1.20
Why the issue appeared?
Will be filled on close

Related Objects

Status         Assigned  Task
Needs testing  None
Resolved       c-po

Event Timeline

hal8 created this task. Apr 10 2019, 4:22 PM
syncer changed the task status from Open to Needs testing. Apr 17 2019, 7:35 PM
syncer triaged this task as Low priority.
syncer edited projects, added VyOS 1.3 Equuleus; removed VyOS 1.2 Crux.
hal8 added a comment. Apr 22 2019, 1:09 AM

I don't understand "needs testing" here at all? What testing? How can I help? For me this is obviously high priority, in fact it's a deal-breaker, it's a problem I need to solve. As I think it would be for you if you had the bug on your network. I thought maybe the vyos team would also consider it high. Why low? Seems important that dns fails regularly. Why isn't it? (I have a solution that works but it's a selfish one that doesn't help vyos).

I have the fault, there's no doubt about that as can be seen from the logs I provided. I have all the information on this post I can think of to help diagnose the fault so it can be fixed for all vyos users.

If you need some additional, different info that I can extract when the fault presents itself please LET ME KNOW and I'll get it for you. It's an intermittent fault. I can't reproduce it at will, if you can, please tell me about that. Something is broken in a nasty way and needs fixing and I'm trying to help the vyos project do that. Right now my experience of vyos is that the quality of my routing is less than what the vyos team might hope.

This might be interesting to whoever is in charge of dns forwarding; is it you, syncer? Who made the decision to swap dnsmasq for pdns_recursor, for example? I'm sure they had reasons that involved the idea of "this is better", so I think they would probably care about this bug. Maybe not?

So what testing are you asking for? What do you want to know? Who is actually responsible for vyos dns forwarding? If it were me, and a fault like this happened with someone willing to do everything they can to get the details, I'd want to know that. If it's you and you don't, ok. Do let me know please so we can stop wasting each other's time.

By the way I have 1.2 installed, that's where the bug actually _is_ and _presents_. I'm not sure why it's been changed to a bug against 1.3 about which I know nothing but that's the vyos team's business I guess.

Additional information:
$ reset dns forwarding all

kicks dns forwarding in the right place to get it working again if waiting 5-10 minutes is too annoying. Which it most certainly is for the rest of the users of this network, who can't do that...

Or please just say in plain language that having very poor quality dns is not something the vyos project cares about as priorities are elsewhere. Which is a reasonable position to take and so we can all proceed accordingly in a sensible and polite fashion.

syncer added a subscriber: syncer. Apr 22 2019, 2:46 AM

You have not provided a procedure to reproduce the issue and there are no other reports of such a problem, therefore this task has low priority.
Either it's something specific to your environment, or you have issues with upstream DNS and/or the network.

hal8 added a comment. Apr 22 2019, 3:39 AM

Thank you for providing the necessary clarity that was lacking in your very misleading first response.

I noted from the outset it is intermittent, so I guess you have no interest unless it's an easy bug to fix? This is a bug that has only ever occurred with vyos on the router. No other time in the same environment. Any bog-standard consumer router works fine. Any linux on the same machine using dnsmasq works fine. I have a solution which fixes the problem with certainty: Remove vyos. I have no doubt that will work as it always has in this very environment.

On the router itself dns works fine. always. As I said. But hey it's easier not to read bug reports.

I noted I have the problem coming up regularly, about 3-4 times a day for about 5 minutes each time.
I provided logs, config files. I provided packet captures on both upstream and downstream interfaces. Maybe learn to read them?
I offered to help anyone who cares about a pretty nasty bug, given I have it occurring and can dig in to get it solved.
Up until right now I could have provided you with any information you wanted when the problem presents itself as it does a couple of times a day. gdb attach, sure.

I suppose everyone else who encountered this bug got to the "Delete vyos" solution more quickly and with less fuss.

Please close the bug as "won't fix." I am no longer in a position to assist.

While you're reading: the GitHub instructions to build in a jessie chroot are broken and out of date. "Welcome we value your assistance" strike 2.

As an idle thought, perhaps you should just have "vyos" as a few packages installed with an entry into sources.list on top of stable debian, easier to install and to update. It would also get you proper security updates in a timely fashion, which might be considered a useful thing for a router.

I note the button here says "Set Sail for Adventure." Well it's been real. Peace.

I'm experiencing this issue as well on a homespun 1.2.0 image; have been for a little over a month now. Occurs almost every day with no indication as to what the cause can be.

  • It seems completely intermittent with no indication that it's about to happen. DNS resolution works until it doesn't.
  • I cannot reproduce this. I have no idea what is causing this but it will eventually fix itself after some time if I haven't already poked it to start working again.
  • DNS continues to work on the router. Only clients are impacted by this.
  • Resolving local computer names continues to work; trying to resolve anything outside of the network fails.
  • My failures are longer than 5 minutes. I suspect that this is due to me running this on an Atom D525 with a slightly complex configuration (7 zones)? My router takes almost 20 minutes to boot.

Does this being tagged as a 1.3 issue mean that it's only going to be fixed in 1.3?

c-po added a subscriber: c-po. May 3 2019, 7:45 PM

I just spawned a SmokePing instance to monitor the DNS server with a cache of 0 entries, let's see.
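For anyone without SmokePing to hand, a small standard-library probe can log the same kind of signal. The following is a stand-in sketch, not SmokePing and not anything shipped with VyOS: the function names are hypothetical, and the resolver address 192.168.18.1 and the name google.com are simply taken from the original report. It sends a plain A query over UDP and reports the response code, so an outage shows up as a run of NXDOMAIN, SERVFAIL or TIMEOUT lines with timestamps.

```python
import random
import socket
import struct
import time

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}

def build_query(name, qid, qtype=1):
    """Build a minimal DNS query (RFC 1035) with the RD flag set; qtype 1 = A."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN

def parse_rcode(response):
    """Return (query id, rcode) from the header of a DNS response."""
    if len(response) < 12:
        raise ValueError("short DNS response")
    qid, flags = struct.unpack("!HH", response[:4])
    return qid, flags & 0x000F

def probe(server, name, timeout=2.0):
    """Send one query and return a status string such as 'NOERROR' or 'NXDOMAIN'."""
    qid = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qid), (server, 53))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
    rid, rcode = parse_rcode(data)
    if rid != qid:
        return "MISMATCH"
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")

def monitor(server, name, interval=10.0, rounds=360):
    """Probe repeatedly and print a timestamped line for every non-NOERROR result."""
    for _ in range(rounds):
        status = probe(server, name)
        if status != "NOERROR":
            print(time.strftime("%Y-%m-%d %H:%M:%S"), name, status)
        time.sleep(interval)

# Example (values from the original report; run from a LAN client):
# monitor("192.168.18.1", "google.com", interval=10, rounds=360)
```

Running it from a LAN client rather than on the router matters here, since several reporters note that resolution on the router itself keeps working during an outage.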

I've got a few more bits of information on this. I've managed to get DNS resolution to consistently work by placing the actual DNS servers in the dns section of my config instead of using just the system config keyword.

The following configuration snippet would eventually show intermittent failure:

service {
  dns {
    ...
    system
    ...
  }
}

I haven't experienced any intermittent failures with the following:

service {
  dns {
    name-server 8.8.8.8
    name-server 8.8.4.4
    system
  }
}

Hopefully this helps.

pasik added a subscriber: pasik. May 4 2019, 1:27 PM
jjakob added a subscriber: jjakob. (Edited) May 4 2019, 7:17 PM

I've had this bug occur on 1.2.0-rc11, at one site (with moderately high load) at least once a day, and at the second site (with small load) only once after several months.
After upgrading to the latest 1.2.0 rolling release I've had no issues any more, however the bug may still remain.
It may have something to do with the DNSSEC setting, as the second system, which had run flawlessly for months, started doing it immediately after setting dnssec=validate.

There are issues open in pdns-recursor github:

Still open:
https://github.com/PowerDNS/pdns/issues/6112

Closed with PR merged in 4.2.0-beta1:
https://github.com/PowerDNS/pdns/issues/5107

1.2.0-rc11 had pdns-recursor 4.0.4, 1.2.0-rolling+201905041032 has 4.1.12, so I don't know if that particular merge was backported into 4.1.x from 4.2.0.

Also, is there a way to see the pdns_recursor log output? I see --disable-syslog but no way to set a log file destination or verbosity.

jjakob added a comment. (Edited) Jun 23 2019, 1:38 PM

Still present on 1.2.0-rolling+20190616
Some domains were still working normally while most returned SERVFAIL.

This is the only thing out of the ordinary in the log:

Jun 20 16:11:13 vyos pdns_recursor[2802]: Could not retrieve security status update for '4.1.14' on 'recursor-4.1.14.security-status.secpoll.powerdns.com', DNSSEC validation result was Bogus!

Full pdns log from that day:

Also interesting that at around the same time "stats: 100000 questions" was reached. May or may not have something to do with it.

config

    dns {
        forwarding {
            cache-size 10000
            dnssec validate
            listen-address 10.10.0.1
            system
        }
    }
system {
    (snip...)
    name-server 1.1.1.1
    name-server 9.9.9.9
}

Seems like 4.1.14 is the newest so a bug must still be present in pdns_recursor.

Also pdns log output isn't saved to disk (grep -r pdns /var/log returns nothing), it's only in journald's in-memory journal. Might be a case for a separate bug.

c-po added a comment. Jun 23 2019, 3:57 PM

It's a known bug in PowerDNS and should be fixed in the 4.2 release.

c-po added a comment. Jul 19 2019, 4:05 PM

Unfortunately PowerDNS no longer supports the 4.2 series on Debian oldoldstable (Jessie) - reverting this on current to fix the build.

That's unfortunate. I've had to restart dns every few days at some clients due to outages caused by this bug. It would not be nice if it were to regress. Is there a way to build on buster with newer packages?

c-po added a comment. Jul 19 2019, 5:34 PM

I have run VyOS 1.2.1 and now 1.2.2 for quite some time and can no longer see this issue. 1.2.1 and 1.2.2 are both based on the PowerDNS recursor 4.1 series.

VyOS 1.3, which is based on Debian Buster, will definitely have 4.2!

@jjakob can you also set up a SmokePing DNS monitor?

I'll set up monitoring, sure.