Wireguard FQDN endpoint doesn't work after reboot
Closed, Wontfix | Public | BUG

Description

If you use an FQDN for a peer endpoint, the tunnel fails to come up after a reboot:

# show interfaces wireguard wg3
    address 172.27.112.2/30
    description VPSVULTR
    peer VPSVULTR {
        allowed-ips 0.0.0.0/0
        endpoint vps.domain.com:2224
        persistent-keepalive 15
        pubkey xxxxxxxxxxxx
    }
$ ping 172.27.112.1
PING 172.27.112.1 (172.27.112.1) 56(84) bytes of data.
From 172.27.112.2 icmp_seq=1 Destination Host Unreachable
From 172.27.112.2 icmp_seq=1 Destination Host Unreachable
...

ping: sendmsg: Required key not available

Using the IP for the endpoint, or disabling and reenabling the interface makes it work fine:

# set interfaces wireguard wg3 disable
[edit]
# commit
[edit]
# delete interfaces wireguard wg3 disable
[edit]
# commit
[edit]
# ping 172.27.112.1
PING 172.27.112.1 (172.27.112.1) 56(84) bytes of data.
64 bytes from 172.27.112.1: icmp_seq=1 ttl=64 time=22.1 ms

Details

Difficulty level
Unknown (require assessment)
Version
1.2.3
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Perfectly compatible

Event Timeline

kroy created this task. Sep 29 2019, 8:03 PM
kroy added a comment. Sep 29 2019, 8:12 PM

My guess: WireGuard is coming up before vyos-hostsd?

Yes, you need to be able to resolve your endpoint's name or have it mapped in /etc/hosts. The name is resolved (or at least resolution is attempted) when the wg command configures the tunnel. Unfortunately there is not much I can do about that, short of implementing a probe service or something similar (it could be as simple as a ping).
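
A minimal sketch of the check described in this comment: test whether the endpoint name resolves before it is handed to wg. The helper names `split_endpoint` and `resolve_endpoint` are hypothetical and not part of VyOS; the resolver is a parameter only so the logic can be exercised without network access.

```python
import socket


def split_endpoint(endpoint):
    """Split 'host:port' into (host, port); bracketed IPv6 literals are accepted."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"endpoint must look like host:port, got {endpoint!r}")
    return host.strip("[]"), int(port)


def resolve_endpoint(endpoint, getaddrinfo=socket.getaddrinfo):
    """Return 'ip:port' if the name resolves right now, otherwise None."""
    host, port = split_endpoint(endpoint)
    try:
        infos = getaddrinfo(host, port, type=socket.SOCK_DGRAM)
    except socket.gaierror:
        # DNS is not usable (yet), e.g. resolv.conf is still empty at boot.
        return None
    ip = infos[0][4][0]
    # IPv6 addresses need brackets in the ip:port notation.
    return f"[{ip}]:{port}" if ":" in ip else f"{ip}:{port}"
```

A caller could skip or postpone configuring the peer when `resolve_endpoint("vps.domain.com:2224")` returns None, instead of letting the single resolution attempt inside wg fail silently.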

kroy added a comment. Sep 30 2019, 4:22 PM

Changing when the tunnel comes up isn't an option? For whatever reason the tunnel comes up before DNS resolution works. Using a hostname once the system is running works perfectly.

There is not really an up or down; there is only a verified handshake and the transferred bytes. If you haven't sent and received anything, the interface is in an 'unknown' state in terms of WireGuard, even if it shows as 'up' when you look via iproute2. All one could do is check whether the endpoint resolves and, if it does, send a packet and see whether the handshake completes.
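
As a rough illustration of "see whether the handshake completes": `wg show <interface> latest-handshakes` prints one line per peer with the public key and the Unix timestamp of the last handshake (0 if there has never been one). The sketch below parses that output. The function names and the 180 second threshold are assumptions chosen for the example, not values from this task.

```python
import subprocess
import time


def parse_latest_handshakes(output):
    """Map peer public key -> Unix timestamp of the last handshake (0 = never)."""
    handshakes = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            handshakes[fields[0]] = int(fields[1])
    return handshakes


def peers_without_handshake(output, max_age=180, now=None):
    """Return peers with no handshake at all, or none within max_age seconds."""
    now = time.time() if now is None else now
    return [key for key, ts in parse_latest_handshakes(output).items()
            if ts == 0 or now - ts > max_age]


def read_handshakes(interface):
    """Run `wg show <interface> latest-handshakes` (needs root and wireguard-tools)."""
    result = subprocess.run(["wg", "show", interface, "latest-handshakes"],
                            check=True, capture_output=True, text=True)
    return result.stdout
```

On a router, `peers_without_handshake(read_handshakes("wg3"))` would list the peers that look dead, which is the closest thing to a "down" state that WireGuard exposes.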

c-po added a subscriber: c-po. Sep 30 2019, 5:21 PM

Could we raise the WireGuard priority to 999, so that it is launched very late?

runar added a subscriber: runar. Sep 30 2019, 5:44 PM

Changing the priority will only fix part of this. It could fix the situation where the user has a static IP and a default route, but it will have no effect when the user has DHCP or uses BGP etc. So my vote goes to not changing the priority for this. This is a losing race as long as we don't have a daemon or similar that manages the connections.

kroy added a comment. (Edited) Sep 30 2019, 8:31 PM

@runar This isn't a routing issue though.

Presumably at the point that WireGuard tunnels are brought up, you already have routing/networking capability. Otherwise bringing the tunnel up by IP wouldn't work. DHCP would have already pulled down IP/DNS/etc. The path to the endpoint is already set.

The problem is that resolv.conf is empty because that part of the configuration hasn't been applied yet (at least for my static configuration). I'm wondering whether this would work if I did have an interface that got its IP via DHCP. I'm almost betting it would.

This is really only a problem because the DNS stuff is brought up after the interfaces. This definitely isn't an unsupported or strange configuration in Wireguard. It's just a problem because of VyOS ordering.

hagbard added a comment. (Edited) Sep 30 2019, 9:30 PM

@kroy You can quickly test it by setting the priority to 999 in /opt/vyatta/share/vyatta-cfg/templates/interfaces/wireguard/node.def. It's currently 459. Let me know your results, please.

kroy added a comment. Sep 30 2019, 9:55 PM

@hagbard

Yep. Changing the priority fixes the issue completely.

hagbard claimed this task. Sep 30 2019, 10:14 PM
hagbard changed the task status from Open to Needs testing. Sep 30 2019, 10:18 PM
hagbard triaged this task as Normal priority.
hagbard added a project: VyOS 1.3 Equuleus.
hagbard moved this task from Need Triage to Backlog on the VyOS 1.2 Crux board.
hagbard moved this task from Need Triage to In Progress on the VyOS 1.3 Equuleus board.
kroy added a comment. (Edited) Oct 1 2019, 1:20 AM

@hagbard

This should be reverted, as the change is breaking. After more testing, I found some problems caused by things like static routing now being applied before WireGuard. So the WireGuard tunnel works, but in some cases routing that should be going over the tunnel does not work.

With that said, I'm not really sure why this would have been broken in the first place. Stuff like system name-server (and some other DNS-related stuff) is priority 400, but WireGuard was 459.

c-po added a comment. Oct 1 2019, 6:28 AM

Shouldn't OpenVPN have a similar problem?

runar added a comment. Oct 1 2019, 6:53 AM

As I tried to say, this fix will only work in some scenarios, and it comes down to the implementation of the application we're configuring. To be clear, WireGuard itself does NOT support DNS, but the wg config utility does. At execution time it reads the DNS name and tries to resolve it once, and only once. When that fails, things do not work. It is the same with e.g. NHRP, which works exactly the same way. Using this has race conditions with getting IP connectivity up and running, and not only with the hosts file. We do not wait for DHCP to delegate an address or DNS servers; these could arrive many milliseconds or seconds after WireGuard is configured. This is true even when you change the priority, and the length of the config and its execution time also feed into this race condition. So, if you ask me, revert the priority and instead create a DNS daemon of some kind that could read the config and populate the entry when resolution has failed.

runar added a comment. Oct 1 2019, 6:54 AM

As for OpenVPN, I don't know, but if the application itself does DNS queries on connect it will work quite fine (as I think it does).

kroy added a comment. Oct 1 2019, 1:58 PM

@runar

This is going to become more and more of a problem as WireGuard adoption continues. Most major WireGuard VPN services provide an FQDN as their endpoint, not an IP:

Mullvad: Endpoint = se4-wireguard.mullvad.net:3004
VPN.AC: Endpoint = wg-us21.cryptolayer.net:51820

runar added a comment. Oct 1 2019, 2:48 PM

@kroy Just to be clear, I'm not against using DNS names as endpoints for WireGuard. I'm for it, because I have the same issue as you do. What I'm against is the way of getting there. As the WireGuard protocol does not support DNS in itself, using this method is a losing game. What I'm not against is writing a daemon that does the name resolution for you when it becomes available, and available could mean 1 second, 1 minute, 1 hour or even longer after the system has booted. This daemon could also re-resolve when the peer is down and the DNS record has changed.
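
The core of such a daemon might be sketched as follows, purely as an illustration: keep retrying the lookup with growing delays until DNS becomes available, then point the peer at the resolved address with `wg set <interface> peer <pubkey> endpoint <ip>:<port>`. All function names and the backoff values are assumptions made for this example; `socket.gethostbyname` only covers IPv4.

```python
import socket
import time


def backoff_delays(first=1.0, factor=2.0, ceiling=3600.0):
    """Yield retry delays in seconds (1, 2, 4, ...), capped at `ceiling`."""
    delay = first
    while True:
        yield min(delay, ceiling)
        delay = min(delay * factor, ceiling)


def wait_for_address(host, resolve=socket.gethostbyname,
                     sleep=time.sleep, max_tries=None):
    """Retry until `host` resolves; return the IP, or None once max_tries is used up."""
    tries = 0
    for delay in backoff_delays():
        try:
            return resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            tries += 1
            if max_tries is not None and tries >= max_tries:
                return None
            sleep(delay)


def wg_set_endpoint_argv(interface, pubkey, ip, port):
    """Build the `wg set` command line that updates a peer's endpoint."""
    return ["wg", "set", interface, "peer", pubkey, "endpoint", f"{ip}:{port}"]
```

Running the same lookup again later and comparing the result with the endpoint currently in use would cover the re-resolving case mentioned above.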

Reverted the commit. I'm not sure a daemon would be a good idea. Another option is to allow only IPs to be entered via the CLI, or, whenever wg is executed, to check the name, resolve it and send it to vyos-hostsd so that it gets written to /etc/hosts. That would solve at least the issue at reboot, and in most cases the correct IP should be in /etc/hosts.

c-po added a comment. Oct 1 2019, 3:37 PM

The Linux kernel has embedded name resolution, so maybe this can be added to WireGuard itself. That would be better than us designing a patch for it.

In the no real internet world with rotating IP addresses this would have been very nice, but it is as it is :(

hagbard added a comment. (Edited) Oct 2 2019, 7:18 PM

Shall I close it as won't fix, given that it is an upstream issue? Anything built around it is, in my opinion, just a kludge, unless we go with a separate daemon which can check and re-establish connections if they fail. The danger is that VyOS then becomes more of a server than a router. As a workaround, a cronjob could do that as well: set an option via the CLI (wg-heartbeat or so, since keepalive is a wg option already), which drops a cronjob onto the box and checks the wg endpoint periodically. If the check fails, it just calls disable/enable and checks again up to X times before it sleeps for, let's say, 24 hours or so. @kroy would something like a cronjob help you? It could also be set as a @reboot job, and once the traffic flows it kicks itself out. Just wanna throw out ideas here.
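
The retry logic of the proposed heartbeat job could look something like the sketch below. It is not an existing VyOS feature: `heartbeat`, its parameters and the default values are hypothetical, and the two callbacks stand in for "check whether traffic flows" and "disable and re-enable the interface".

```python
import time


def heartbeat(tunnel_is_up, bounce_interface, max_attempts=5,
              retry_delay=10.0, sleep=time.sleep):
    """Bounce the interface until traffic flows or max_attempts is used up.

    tunnel_is_up:     callable returning True once the peer answers
    bounce_interface: callable that disables and re-enables the interface
    Returns the number of bounces performed, or None if the tunnel never came up.
    """
    for attempt in range(max_attempts + 1):
        if tunnel_is_up():
            return attempt
        if attempt < max_attempts:
            bounce_interface()
            sleep(retry_delay)
    return None
```

A script wrapping this function could be started from an @reboot cron entry and simply exit once it returns a number, which matches the "kicks itself out" behaviour; a return value of None would be the point at which the job backs off for a long time.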

c-po added a comment. Oct 2 2019, 11:25 PM

It is an upstream issue, so I totally agree with closing it as won't fix.

hagbard closed this task as Wontfix. Oct 3 2019, 5:38 PM
hagbard moved this task from Backlog to Finished on the VyOS 1.2 Crux board.
hagbard moved this task from In Progress to Finished on the VyOS 1.3 Equuleus board.