
OSPF Stops distributing default route after a while
Closed, Resolved · Public · BUG

Description

I've got a number of devices (VyOS, Mikrotik, etc.) sharing routes via OSPF. The edge device for the network is a 1.2.0-rc7 install.

After freshly rebooting the edge VyOS instance, the routing table on the internal devices looks like it should:

O>* 0.0.0.0/0 [110/10] via 10.253.253.1, eth0.253, 00:03:56
S   0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:46:41
O   10.0.1.0/24 [110/20] via 10.253.253.1, eth0.253, 00:03:54

After a few hours, the routing table on one of the internal devices looks like this:

S>* 0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:41:24
C * 10.0.1.0/24 is directly connected, eth0, 16:40:37

Obviously I've added a static route to cover for the deficiency. Rebooting the Edge VyOS instance fixes it for a while.

This is something I've noticed in a number of RCs, but forgot about because I created manual routes to cover for it. I just happened to notice the missing default route on my Mikrotiks and it reminded me.
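The difference between the two routing tables above can be checked mechanically. The following is a minimal sketch, assuming the zebra-style `show ip route` line format pasted above; the function name `default_route_sources` is hypothetical and not part of VyOS or FRR.

```python
import re

# Matches route lines in the format pasted above, for example:
#   O>* 0.0.0.0/0 [110/10] via 10.253.253.1, eth0.253, 00:03:56
#   S   0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:46:41
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Z])\s*(?P<flags>[>*\s]*)\s*(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
)

def default_route_sources(table_text):
    """Return {protocol code: is_selected} for every 0.0.0.0/0 line.

    The protocol code is the first letter of the line (O = OSPF, S = static);
    a route counts as selected when its flags contain '>'.
    """
    found = {}
    for line in table_text.splitlines():
        m = ROUTE_RE.match(line.strip())
        if m and m.group("prefix") == "0.0.0.0/0":
            found[m.group("code")] = ">" in m.group("flags")
    return found
```

An empty result, or a result without an `O` key, corresponds to the failure described here: the OSPF default route is gone and only the static fallback (if any) remains.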

Config on the edge device:

area 0 {
    network 10.253.253.0/24
}
default-information {
    originate {
        always
        metric 10
        metric-type 2
    }
}
log-adjacency-changes {
}
parameters {
    abr-type cisco
    router-id 10.254.255.1
}
redistribute {
    connected {
        metric-type 2
    }
    static {
        metric-type 2
    }
}

As mentioned, it doesn't matter if the internal device is VyOS, Mikrotik, etc. After a few hours the default route ceases to get redistributed from the Edge device.

Details

Difficulty level
Unknown (require assessment)
Version
1.2.0-rc7
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Unspecified (please specify)

Event Timeline

Looks like this might be upstream and has been corrected there.

https://github.com/FRRouting/frr/issues/3124

syncer triaged this task as High priority. Dec 1 2018, 5:42 PM
syncer added a subscriber: syncer.

Can you retest on rc9?
FRR was updated to the latest version.

Unfortunately not corrected in RC9. After 30 minutes the default route fails to distribute.

Versions:

Version: VyOS 1.2.0-rc9

# /usr/lib/frr/ospfd -v
ospfd version 6.1-dev-vyos

4.19.4-amd64-vyos

Just to confirm, still a problem with RC10

We have a somewhat similar situation with a Ubnt ERLite and VyOS 1.1.7: sometimes it loses all routes...
Rebooting both routers solves it... (but it takes much longer than 30 minutes to happen)

Maybe this isn't the same issue? Still a problem in RC11 unfortunately.

I pulled down vyos-frr submodule and the above-mentioned commit is present.

That means whatever this bug is, it's not due to upstream, at least as far as I could find. I guess I'll have to lab it up and see if it's due to my config, though I'm not sure how it could be, considering that after a reboot it works as expected until the first timeout.

Unfortunately still a problem in the EPA2 release. 30 minutes hits, and I cease getting a default route on my entire OSPF infrastructure

As of (at least) VyOS 1.2.0-rolling+201901090739 I believe this problem to be corrected.

Edge on VyOS 1.2.0-rolling+201901090739
Subrouters on EPA-2

After 24+ hours, the default route is still being distributed. This fixed itself on the upgrade to that particular rolling.

I'll watch for a few more days then consider this issue fixed and close this

I was mistaken. Seems to have lost the route again.

We need a reproduction procedure so we can work with the FRR guys on this.

But we need something more, including FRR logs.

Okay, spent the whole day messing with this and I've tracked it down so it's reproducible.

Layout is:

Router1 (edge):

eth0: ip on existing network
eth1: 10.98.99.1/30 (to Router2), 10.98.99.5/30 (to Router3)

Router2:

eth0: 10.98.99.2/30

(if there's a Router3, its IP becomes 10.98.99.6/30, but it's not necessary. That layout just mimics my homelab setup a bit)

The result is the same whether Router1 is configured with this:

set protocols ospf area 0 network '10.98.99.0/24'

or this:

set protocols ospf area 0 network '10.98.99.0/30'
set protocols ospf area 0 network '10.98.99.4/30'

After 30 minutes, which is the first LSA timeout, the default route fails to flood.
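Both variants are meant to enable OSPF on the same two eth1 addresses. A minimal sketch of that equivalence, assuming an interface is matched when its address falls inside the prefix of a network statement (ospfd's actual matching rules may differ in detail); the helper name `covering_statements` is hypothetical:

```python
import ipaddress

def covering_statements(interface_addrs, network_statements):
    """Map each interface address to the network statements containing it."""
    nets = [ipaddress.ip_network(n) for n in network_statements]
    result = {}
    for addr in interface_addrs:
        ip = ipaddress.ip_interface(addr).ip  # host part only, mask ignored
        result[addr] = [str(n) for n in nets if ip in n]
    return result

# Router1 eth1 addresses from the configuration in this report.
eth1 = ["10.98.99.1/30", "10.98.99.5/30"]

print(covering_statements(eth1, ["10.98.99.0/24"]))
print(covering_statements(eth1, ["10.98.99.0/30", "10.98.99.4/30"]))
```

Under that assumption, every eth1 address is covered by exactly one statement in either variant, which is consistent with the observation that the choice between the /24 and the two /30s does not change the failure.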

I am not familiar enough with OSPF to know if this is even a valid network layout, but I do know this:

  • Works as I expect in RC3 and before. I know this because I dumbly installed RC3 this morning instead of EPA3, and wasted most of the day trying to figure out why the issue was suddenly gone.
  • Works as expected when a Mikrotik device is the edge.

If this isn't a valid network layout, it's worth noting that this still feels like a bug. As mentioned above, the default route is flooded exactly once, then never again. If it was an invalid layout, it feels like the default route would never get flooded.

Interestingly enough, if I use the latter setup (with the /30 instead of the /24), rebooting Router3 will force Router2 to get the default route with OSPF again.


Configuration (password is vyos for all):

Router1 (edge):

  • eth0 is existing network
  • eth1 is just a new empty vSwitch

config:

set interfaces ethernet eth0 address '10.3.1.201/24'
set interfaces ethernet eth0 duplex 'auto'
set interfaces ethernet eth0 smp-affinity 'auto'
set interfaces ethernet eth0 speed 'auto'
set interfaces ethernet eth1 address '10.98.99.1/30'
set interfaces ethernet eth1 address '10.98.99.5/30'
set interfaces loopback lo address '10.98.98.1/32'
set protocols ospf area 0 network '10.98.99.0/24'
set protocols ospf default-information originate always
set protocols ospf default-information originate metric '10'
set protocols ospf default-information originate metric-type '2'
set protocols ospf log-adjacency-changes
set protocols ospf parameters router-id '10.98.98.1'
set protocols ospf redistribute connected metric-type '2'
set protocols static route 0.0.0.0/0 next-hop 10.3.1.1
set service ssh port '22'
set system config-management commit-revisions '100'
set system console device ttyS0 speed '9600'
set system host-name 'vyos'
set system login user vyos authentication encrypted-password '$6$iwdp4jVqP$WC5uS6g.FBG3QDsNX8Kdv446J./DMO2jzWj8MKkYbiekl0bqQVK6rFJO1SxtnwYH42k09PXck0AD4QKSMReiL1'
set system login user vyos authentication plaintext-password ''
set system login user vyos level 'admin'
set system name-server '10.3.1.254'
set system ntp server 0.pool.ntp.org
set system ntp server 1.pool.ntp.org
set system ntp server 2.pool.ntp.org
set system syslog global facility all level 'info'
set system syslog global facility protocols level 'debug'
set system time-zone 'UTC'

Router2:

  • eth0 is the new vSwitch

config:

set interfaces ethernet eth0 address '10.98.99.2/30'
set interfaces ethernet eth0 duplex 'auto'
set interfaces ethernet eth0 smp-affinity 'auto'
set interfaces ethernet eth0 speed 'auto'
set interfaces loopback lo address '10.98.98.2/32'
set protocols ospf area 0 network '10.98.99.4/30'
set protocols ospf log-adjacency-changes
set protocols ospf parameters router-id '10.98.98.2'
set protocols ospf redistribute connected
set service ssh port '22'
set system config-management commit-revisions '100'
set system console device ttyS0 speed '9600'
set system host-name 'R2'
set system login user vyos authentication encrypted-password '$6$iwdp4jVqP$WC5uS6g.FBG3QDsNX8Kdv446J./DMO2jzWj8MKkYbiekl0bqQVK6rFJO1SxtnwYH42k09PXck0AD4QKSMReiL1'
set system login user vyos authentication plaintext-password ''
set system login user vyos level 'admin'
set system name-server '10.3.1.254'
set system ntp server 0.pool.ntp.org
set system ntp server 1.pool.ntp.org
set system ntp server 2.pool.ntp.org
set system syslog global facility all level 'info'
set system syslog global facility protocols level 'debug'
set system time-zone 'UTC'

Router3 is the same as Router2, except that the router-id and loopback change from 10.98.98.2 to 10.98.98.3 and the eth0 address becomes 10.98.99.6/30. Two devices are enough for reproducibility though.

Router1 frr.log:

Jan 20 18:53:03 vyos bgpd[1137]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 3 in VRF 0
Jan 20 18:53:03 vyos bgpd[1137]: [EC 100663301] INTERFACE_ADDRESS_ADD: Cannot find IF 3 in VRF 0
Jan 20 18:53:08 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Exchange -> Full (ExchangeDone)
Jan 20 19:27:56 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Full -> Deleted (KillNbr)
Jan 20 19:28:03 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Loading -> Full (LoadingDone)

Router2 frr.log:

Jan 20 18:53:07 R2 ospfd[1022]: Packet[DD]: Neighbor 10.98.98.1 Negotiation done (Master).
Jan 20 18:53:07 R2 ospfd[1022]: AdjChg: Nbr 10.98.98.1 on eth0:10.98.99.2: Loading -> Full (LoadingDone)
Jan 20 19:27:55 R2 ospfd[1022]: AdjChg: Nbr 10.98.98.1 on eth0:10.98.99.2: Full -> Init (1-WayReceived)
Jan 20 19:27:58 R2 ospfd[1022]: Packet[DD]: Neighbor 10.98.98.1 Negotiation done (Master).
Jan 20 19:27:58 R2 ospfd[1022]: AdjChg: Nbr 10.98.98.1 on eth0:10.98.99.2: Exchange -> Full (ExchangeDone)
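To relate the adjacency flap in these logs to the 30-minute mark, the time each neighbor spent in the Full state can be computed from the AdjChg lines. This is a minimal sketch that assumes the syslog line format shown above; the function name `full_state_durations` is hypothetical, and the year is an assumed parameter because the log timestamps carry none.

```python
import re
from datetime import datetime

# Matches ospfd adjacency-change lines such as:
#   Jan 20 18:53:08 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on
#   eth1:10.98.99.1: Exchange -> Full (ExchangeDone)
ADJ_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+ospfd\[\d+\]:\s+"
    r"AdjChg: Nbr (?P<nbr>\S+) on \S+ (?P<old>\S+) -> (?P<new>\S+)"
)

def full_state_durations(log_text, year=2019):
    """Return [(neighbor, seconds)] for every completed Full period.

    `year` is an assumption: syslog timestamps in this format omit it.
    """
    became_full = {}
    durations = []
    for line in log_text.splitlines():
        m = ADJ_RE.match(line.strip())
        if not m:
            continue  # skip bgpd lines and anything that is not an AdjChg
        when = datetime.strptime(
            f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        nbr = m.group("nbr")
        if m.group("new") == "Full":
            became_full[nbr] = when
        elif m.group("old") == "Full" and nbr in became_full:
            seconds = (when - became_full.pop(nbr)).total_seconds()
            durations.append((nbr, seconds))
    return durations
```

Applied to either log above, the sketch reports a single Full period of 2088 seconds (34 minutes 48 seconds) before the adjacency dropped and re-formed.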

Router2 routing table before 30 minutes:

O>* 0.0.0.0/0 [110/10] via 10.98.99.1, eth0, 00:26:03
O>* 10.3.1.0/24 [110/20] via 10.98.99.1, eth0, 00:26:03
O>* 10.98.98.1/32 [110/20] via 10.98.99.1, eth0, 00:26:03
C>* 10.98.98.2/32 is directly connected, lo, 01:09:22
O   10.98.99.0/30 [110/10] is directly connected, eth0, 01:07:30
C>* 10.98.99.0/30 is directly connected, eth0, 01:09:21

Router2 routing table after 30 minutes:

O>* 10.3.1.0/24 [110/20] via 10.98.99.1, eth0, 00:33:35
O>* 10.98.98.1/32 [110/20] via 10.98.99.1, eth0, 00:33:35
C>* 10.98.98.2/32 is directly connected, lo, 01:16:54
O   10.98.99.0/30 [110/10] is directly connected, eth0, 01:15:02
C>* 10.98.99.0/30 is directly connected, eth0, 01:16:53
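The two Router2 tables differ in exactly one OSPF prefix. A minimal sketch of that comparison, assuming the same `show ip route` line format as above; the helper names `ospf_prefixes` and `vanished` are hypothetical:

```python
import re

# First IPv4 prefix (a.b.c.d/len) appearing on a route line.
PREFIX_RE = re.compile(r"\b\d+\.\d+\.\d+\.\d+/\d+\b")

def ospf_prefixes(table_text):
    """Collect the prefixes of all lines whose protocol code is O (OSPF)."""
    prefixes = set()
    for line in table_text.splitlines():
        line = line.strip()
        if line.startswith("O"):
            m = PREFIX_RE.search(line)
            if m:
                prefixes.add(m.group(0))
    return prefixes

def vanished(before_text, after_text):
    """Return the OSPF prefixes present before but missing afterwards."""
    return sorted(ospf_prefixes(before_text) - ospf_prefixes(after_text))
```

For the two tables above, `vanished` returns only `0.0.0.0/0`: the other external routes from Router1 (10.3.1.0/24 and 10.98.98.1/32) survive, so the loss is specific to the originated default route.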

Some Feedback from the frr people:

eqvinox commented 10 minutes ago
Hi, sorry this somewhat fell into the cracks —

debug logs ("all the debugs") from ospfd would be great on this. I haven't been able to reproduce this yet, I'll go through the additional configs posted here next week (travelling this week.)

Attached are the pcap and debug logs from a simple setup as outlined above, two hosts. "Master" distributing the route.

0.0.0.0/0 reliably disappears sometime between 27 and 28 minutes of uptime.

I'd rather suggest building a test ISO, as we build the FRR source code directly from their repo. I would not permanently add this not-yet-mainlined patchset to the package feed.

I have set up two VyOS routers, one of which originates the default route.
Everything has been running fine with these patches for more than 40 minutes, so they fix the problem.

With 1.2.3-eap1 (FRR version 7.2-dev-10290718), the problem persists: the default route still disappears after 30 to 40 minutes.

It seems that upstream did not backport the fixes to the stable versions, so they are only included in FRR 7.2.
I asked them for a backport.

It seems that it has been merged, and in 1.2.3 it looks good to me at the moment:

vyos@r2:~$ show ip ospf route 
============ OSPF network routing table ============
N    185.137.128.16/31     [5] area: 0.0.0.0
                           directly attached to bond0.201

============ OSPF router routing table =============
R    185.137.xxx.1         [5] area: 0.0.0.0, ASBR
                           via 185.137.xxx.16, bond0.201

============ OSPF external routing table ===========
N E2 0.0.0.0/0             [5/10] tag: 0
                           via 185.137.xxx.16, bond0.201
N E2 185.137.128.1/32      [5/20] tag: 0
                           via 185.137.xxx.16, bond0.201

vyos@r2:~$ uptime
 17:02:57 up  6:17,  1 user,  load average: 0.00, 0.00, 0.00
vyos@r2:~$ 


vyos@r1:~$ show ip ospf route 
============ OSPF network routing table ============
N    185.137.xxx.16/31     [5] area: 0.0.0.0
                           directly attached to bond0.201

============ OSPF router routing table =============
R    185.137.xxx.2         [5] area: 0.0.0.0, ASBR
                           via 185.137.xxx.17, bond0.201

============ OSPF external routing table ===========
N E2 0.0.0.0/0             [5/10] tag: 0
                           via 185.137.xxx.17, bond0.201
N E2 185.137.128.2/32      [5/20] tag: 0
                           via 185.137.xxx.17, bond0.201

vyos@r1:~$ uptime
 17:05:44 up  6:27,  1 user,  load average: 0.02, 0.01, 0.00
vyos@r1:~$

Can confirm. All my routing tables now have 0.0.0.0/0, no matter what the device is. This is just in 1.2.3.

#show ip route ospf
Type Codes - B:BGP D:Connected O:OSPF R:RIP S:Static; Cost - Dist/Metric
BGP  Codes - i:iBGP e:eBGP
OSPF Codes - i:Inter Area 1:External Type 1 2:External Type 2
        Destination        Gateway         Port          Cost          Type Uptime
1       0.0.0.0/0          10.0.10.1       ve 10         110/1         O2   17h22m

> /ip route print
Flags: X - disabled, A - active, D - dynamic, C - connect, S - static, r - rip, b - bgp, o - ospf, m - mme, B - blackhole, U - unreachable, P - prohibit
 #      DST-ADDRESS        PREF-SRC        GATEWAY            DISTANCE
 0 ADo  0.0.0.0/0                          10.0.10.1               110

# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR,
       > - selected route, * - FIB route

O>* 0.0.0.0/0 [110/1] via 10.0.10.1, vlan10, 17:26:16

+1 for closing - never heard of this issue again and we upgraded to FRR 7.3

In which version was FRR upgraded to 7.3?

Can confirm that this can be closed now.

syncer moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus board.
syncer moved this task from Needs Triage to Finished on the VyOS 1.2 Crux (VyOS 1.2.5) board.

@wornet-mwo in 1.2.5 and rolling

dmbaturin set Is it a breaking change? to Unspecified (possibly destroys the router).
dmbaturin set Issue type to Unspecified (please specify).