I’ve encountered a possible bug which renders DMVPN spokes almost unusable in DMVPN phase 2|3 scenario. Using static maps instead of nhrp resolutions eliminates the problem.
The lab to investigate is as follows:
HUB IP: NBMA: 192.168.77.66, tun IP: 10.160.17.131/25 Spoke1 IP: NBMA: 192.168.77.64, tun IP:10.160.17.141 Spoke2 IP: NBMA: 192.168.77.65, tun IP:10.160.17.151
No IPSEC, no NAT.
HUB config:
set interfaces ethernet eth0 address ‘dhcp’ set interfaces loopback lo set interfaces tunnel tun1 address ‘10.160.17.131/25’ set interfaces tunnel tun1 description ‘DMVPN_HUB_interface’ set interfaces tunnel tun1 encapsulation ‘gre’ set interfaces tunnel tun1 multicast ‘enable’ set interfaces tunnel tun1 parameters ip key ‘17128’ set interfaces tunnel tun1 source-address ‘192.168.77.66’ set protocols nhrp tunnel tun1 cisco-authentication ‘LPDMVPN’ set protocols nhrp tunnel tun1 holding-time ‘300’ set protocols nhrp tunnel tun1 multicast ‘dynamic’
Spoke1 config:
set interfaces ethernet eth0 address ‘dhcp’ set interfaces tunnel tun10 address ‘10.160.17.141/25’ set interfaces tunnel tun10 encapsulation ‘gre’ set interfaces tunnel tun10 multicast ‘enable’ set interfaces tunnel tun10 parameters ip key ‘17128’ set interfaces tunnel tun10 source-address ‘192.168.77.64’ set protocols nhrp tunnel tun10 cisco-authentication ‘LPDMVPN’ set protocols nhrp tunnel tun10 holding-time ‘300’ set protocols nhrp tunnel tun10 map 10.160.17.131/25 nbma-address ‘192.168.77.66’ set protocols nhrp tunnel tun10 map 10.160.17.131/25 register set protocols nhrp tunnel tun10 multicast ‘nhs’
Spoke2 config:
set interfaces ethernet eth0 address ‘dhcp’ set interfaces tunnel tun10 address ‘10.160.17.151/25’ set interfaces tunnel tun10 encapsulation ‘gre’ set interfaces tunnel tun10 multicast ‘enable’ set interfaces tunnel tun10 parameters ip key ‘17128’ set interfaces tunnel tun10 source-address ‘192.168.77.65’ set protocols nhrp tunnel tun10 cisco-authentication ‘LPDMVPN’ set protocols nhrp tunnel tun10 holding-time ‘300’ set protocols nhrp tunnel tun10 map 10.160.17.131/25 nbma-address ‘192.168.77.66’ set protocols nhrp tunnel tun10 map 10.160.17.131/25 register set protocols nhrp tunnel tun10 multicast ‘nhs’
At the beginning all the tunnels are nice and fine:
HUB:>
vyos@HUB:~$ show nhrp tu Status: ok Interface: tun1 Type: local Protocol-Address: 10.160.17.255/32 Alias-Address: 10.160.17.131 Flags: up Interface: tun1 Type: local Protocol-Address: 10.160.17.131/32 Flags: up Interface: tun1 Type: dynamic Protocol-Address: 10.160.17.141/32 NBMA-Address: 192.168.77.64 Flags: up Expires-In: 4:01 Interface: tun1 Type: dynamic Protocol-Address: 10.160.17.151/32 NBMA-Address: 192.168.77.65 Flags: up Expires-In: 3:32
Spoke1:
vyos@S1:~$ show nhrp tu Status: ok Interface: tun10 Type: local Protocol-Address: 10.160.17.255/32 Alias-Address: 10.160.17.141 Flags: up Interface: tun10 Type: local Protocol-Address: 10.160.17.141/32 Flags: up Interface: tun10 Type: cached Protocol-Address: 10.160.17.151/32 NBMA-Address: 192.168.77.65 Flags: used up Expires-In: 4:56 Interface: tun10 Type: static Protocol-Address: 10.160.17.131/25 NBMA-Address: 192.168.77.66 Flags: up
Spoke2:>
vyos@S2:~$ show nhrp tu Status: ok Interface: tun10 Type: local Protocol-Address: 10.160.17.255/32 Alias-Address: 10.160.17.151 Flags: up Interface: tun10 Type: local Protocol-Address: 10.160.17.151/32 Flags: up Interface: tun10 Type: cached Protocol-Address: 10.160.17.141/32 NBMA-Address: 192.168.77.64 Flags: up Expires-In: 2:27 Interface: tun10 Type: static Protocol-Address: 10.160.17.131/25 NBMA-Address: 192.168.77.66 Flags: up
NHRP logs from Spoke1 (when it works:):
Apr 21 17:38:35 opennhrp[1779]: NL-ARP(tun10) who-has 10.160.17.151 Apr 21 17:38:35 opennhrp[1779]: NL-ARP(tun10) 10.160.17.151 is-at 192.168.77.66 Apr 21 17:38:35 opennhrp[1779]: Adding incomplete 10.160.17.151/32 dev tun10 Apr 21 17:38:35 opennhrp[1779]: Sending Resolution Request to 10.160.17.151 Apr 21 17:38:35 opennhrp[1779]: Sending packet 1, from: 10.160.17.141 (nbma 192.168.77.64), to: 10.160.17.151 (nbma 192.168.77.66) Apr 21 17:38:35 opennhrp[1779]: Received Resolution Request from proto src 10.160.17.151 to 10.160.17.141 Apr 21 17:38:35 opennhrp[1779]: Sending Resolution Reply 10.160.17.141/32 is-at 192.168.77.64 (holdtime 300) Apr 21 17:38:35 opennhrp[1779]: Sending packet 2, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:38:35 opennhrp[1779]: Received Resolution Reply 10.160.17.151/32 is at proto 10.160.17.151 nbma 192.168.77.65 (code 0) Apr 21 17:38:35 opennhrp-script.py[2919]: Running script with arguments: [‘/etc/opennhrp/opennhrp-script.py’, ‘peer-up’], environment: environ({‘NHRP_TYPE’: ‘cached’, ‘NHRP_SRCADDR’: ‘10.160.17.141’, ‘NHRP_SRCNBMA’: ‘192.168.77.64’, ‘NHRP_DESTADDR’: ‘10.160.17.151’, ‘NHRP_DESTPREFIX’: ‘32’, ‘NHRP_DESTNBMA’: ‘192.168.77.65’, ‘NHRP_INTERFACE’: ‘tun10’, ‘NHRP_GRE_KEY’: ‘17128’, ‘LC_CTYPE’: ‘C.UTF-8’}) Apr 21 17:38:35 opennhrp-script.py[2919]: Peer UP event for spoke using IKE profile Apr 21 17:38:35 opennhrp-script.py[2919]: NBMA addresses: local 192.168.77.64, remote 192.168.77.65 Apr 21 17:38:35 opennhrp[1779]: [10.160.17.151] Peer up script: success Apr 21 17:38:35 opennhrp[1779]: NL-ARP(tun10) 10.160.17.151 is-at 192.168.77.65
At this point I see all the resolution requests and replies on the spoke side.
Steps to reproduce:
Modify any configuration on spoke-side with commit and optionally revert back.
(for example nhrp hold-time)
tear down connection between S-S|S-H peers or modify HUB config (works sometimes).
The Spoke stops sending resolution requests, while registration requests to hub works well. That can be seen in a traffic dump taken on the spoke/hub.
NHRP logs from Spoke1 (in failed state):
vyos@S1:~$ monitor log nhrp – Journal begins at Sat 2023-01-28 14:10:46 UTC. – Apr 21 17:45:51 opennhrp[3402]: Received Resolution Request from proto src 10.160.17.151 to 10.160.17.141 Apr 21 17:45:51 opennhrp[3402]: Sending Resolution Reply 10.160.17.141/32 is-at (unspecified) (holdtime 300) Apr 21 17:45:51 opennhrp[3402]: Sending packet 2, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma (unspecified)) Apr 21 17:45:51 opennhrp[3402]: Cannot send packet to tun10(9): Invalid argument Apr 21 17:45:56 opennhrp[3402]: Sending packet 1, from: 10.160.17.141 (nbma (unspecified)), to: 10.160.17.151 (nbma (unspecified)) Apr 21 17:45:56 opennhrp[3402]: Cannot send packet to tun10(9): Invalid argument Apr 21 17:45:56 opennhrp[3402]: Received Resolution Request from proto src 10.160.17.151 to 10.160.17.141 Apr 21 17:45:56 opennhrp[3402]: Sending Resolution Reply 10.160.17.141/32 is-at (unspecified) (holdtime 300) Apr 21 17:45:56 opennhrp[3402]: Sending packet 2, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma (unspecified)) Apr 21 17:45:56 opennhrp[3402]: Cannot send packet to tun10(9): Invalid argument Apr 21 17:46:01 opennhrp[3402]: Failed to resolve 10.160.17.151: timeout (65535) Apr 21 17:46:01 opennhrp[3402]: Removing incomplete 10.160.17.151/32 dev tun10 used
Spoke 2 in the meanwhile gets response timeouts because of S1 keeps silent about resolution replies:
vyos@S2:~$ monitor log nhrp – Journal begins at Sat 2023-01-28 14:10:46 UTC. – Apr 21 17:49:40 opennhrp[1613]: NL-ARP(tun10) who-has 10.160.17.141 Apr 21 17:49:40 opennhrp[1613]: NL-ARP(tun10) 10.160.17.141 is-at 192.168.77.66 Apr 21 17:49:40 opennhrp[1613]: Adding incomplete 10.160.17.141/32 dev tun10 Apr 21 17:49:40 opennhrp[1613]: Sending Resolution Request to 10.160.17.141 Apr 21 17:49:40 opennhrp[1613]: Sending packet 1, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:49:45 opennhrp[1613]: Sending packet 1, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:49:50 opennhrp[1613]: Sending packet 1, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:49:55 opennhrp[1613]: Sending packet 1, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:50:00 opennhrp[1613]: Sending packet 1, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:50:05 opennhrp[1613]: Sending packet 1, from: 10.160.17.151 (nbma 192.168.77.65), to: 10.160.17.141 (nbma 192.168.77.66) Apr 21 17:50:10 opennhrp[1613]: Failed to resolve 10.160.17.141: timeout (65535) Apr 21 17:50:10 opennhrp[1613]: Removing incomplete 10.160.17.141/32 dev tun10 used
If we modify any nhrp configuration on S2 it behaves the same as S1 and stops sending NHRP resolution requests: Apr 21 17:55:26 opennhrp[3654]: NL-ARP(tun10) who-has 10.160.17.141 Apr 21 17:55:26 opennhrp[3654]: NL-ARP(tun10) 10.160.17.141 is-at 192.168.77.66 Apr 21 17:55:26 opennhrp[3654]: Adding incomplete 10.160.17.141/32 dev tun10 Apr 21 17:55:26 opennhrp[3654]: Sending Resolution Request to 10.160.17.141 Apr 21 17:55:26 opennhrp[3654]: Sending packet 1, from: 10.160.17.151 (nbma (unspecified)), to: 10.160.17.141 (nbma (unspecified))
No real traffic here on a wire Apr 21 17:55:26 opennhrp[3654]: Cannot send packet to tun10(9): Invalid argument
Apr 21 17:55:31 opennhrp[3654]: Sending packet 1, from: 10.160.17.151 (nbma (unspecified)), to: 10.160.17.141 (nbma (unspecified)) Apr 21 17:55:31 opennhrp[3654]: Cannot send packet to tun10(9): Invalid argument
Restarting failed spoke always helps.
I’ve tested this issue on a plenty of 1.4 rolling versions as the issue persists for quite a long time during image upgrades. Here are some versions for example:
1: 1.4-rolling-202304130846 - latest
2: 1.4-rolling-202211290318
3: 1.4-rolling-202301280924
Observations:
There is no matter if we turn on nhrp redirect/shortcut on the hub/spoke. The spoke behaves the same in terms of nhrp resolution packets.
In the production env I use static S-S map entries which solves the problem, but scalability becomes a nightmare.
more info and testing described in the forum topic here:
https://forum.vyos.io/t/dmvpn-nhrp-spoke-stops-sending-resolution-requests-upon-restart/10874