
OSPF Stops distributing default route after a while
Confirmed, High, Public, BUG

Description

I've got a number of devices sharing routes via OSPF. VyOS/Mikrotiks/etc. The edge device for the network is an RC7 install.

After freshly rebooting the edge VyOS instance, the routing table on the internal devices looks like it should:

O>* 0.0.0.0/0 [110/10] via 10.253.253.1, eth0.253, 00:03:56
S   0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:46:41
O   10.0.1.0/24 [110/20] via 10.253.253.1, eth0.253, 00:03:54

After a few hours, the routing table on one of the internal devices looks like this:

S>* 0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:41:24
C * 10.0.1.0/24 is directly connected, eth0, 16:40:37

Obviously I've added a static route to cover for the deficiency. Rebooting the Edge VyOS instance fixes it for a while.
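
A quick way to tell the two states above apart is to check which protocol supplies the selected default route. The following is a minimal sketch in Python, assuming the routing table is available as text in the format shown above; the function name `selected_default_source` is hypothetical and not part of VyOS or FRR.

```python
import re

# Matches routing table lines such as
#   "O>* 0.0.0.0/0 [110/10] via 10.253.253.1, eth0.253, 00:03:56"
# code = protocol letter, flags = the ">" (selected) and "*" (FIB) markers.
ROUTE_RE = re.compile(
    r"^(?P<code>[A-Z])(?P<flags>[ >*]{0,3})\s*(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
)

def selected_default_source(route_output):
    """Return the protocol code of the selected 0.0.0.0/0 route, or None."""
    for line in route_output.splitlines():
        m = ROUTE_RE.match(line.strip())
        if m and m.group("prefix") == "0.0.0.0/0" and ">" in m.group("flags"):
            return m.group("code")
    return None

HEALTHY = """\
O>* 0.0.0.0/0 [110/10] via 10.253.253.1, eth0.253, 00:03:56
S   0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:46:41
O   10.0.1.0/24 [110/20] via 10.253.253.1, eth0.253, 00:03:54
"""

DEGRADED = """\
S>* 0.0.0.0/0 [220/0] via 10.253.253.1, eth0.253, 16:41:24
C * 10.0.1.0/24 is directly connected, eth0, 16:40:37
"""

print(selected_default_source(HEALTHY))   # prints O
print(selected_default_source(DEGRADED))  # prints S
```

A result of "S" means only the fallback static route is left, which is the symptom described here.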

This is something I've noticed in a number of RCs, but forgot about because I created manual routes to cover for it. I just happened to notice the missing default route on my Mikrotiks and it reminded me.

Config on the edge device:

area 0 {
    network 10.253.253.0/24
}
default-information {
    originate {
        always
        metric 10
        metric-type 2
    }
}
log-adjacency-changes {
}
parameters {
    abr-type cisco
    router-id 10.254.255.1
}
redistribute {
    connected {
        metric-type 2
    }
    static {
        metric-type 2
    }
}

As mentioned, it doesn't matter if the internal device is VyOS, Mikrotik, etc. After a few hours the default route ceases to get redistributed from the Edge device.

Details

Difficulty level
Unknown (require assessment)
Version
1.2.0-rc7
Why the issue appeared?
Will be filled on close

Event Timeline

kroy created this task. Nov 16 2018, 3:57 PM
kroy added a comment. Edited Nov 16 2018, 5:37 PM

Looks like this might be upstream and has been corrected there.

https://github.com/FRRouting/frr/issues/3124

syncer triaged this task as High priority. Dec 1 2018, 5:42 PM
syncer added a subscriber: syncer.

Can you retest on RC9?
FRR was updated to the latest version.

kroy added a comment. Dec 1 2018, 7:31 PM

Unfortunately not corrected in RC9. After 30 minutes the default route fails to distribute.

Versions:

Version: VyOS 1.2.0-rc9

# /usr/lib/frr/ospfd -v
ospfd version 6.1-dev-vyos

4.19.4-amd64-vyos

kroy added a comment. Dec 6 2018, 2:53 PM

Just to confirm, still a problem with RC10

hexes added a subscriber: hexes. Dec 6 2018, 2:58 PM

We have a somewhat similar situation with a Ubnt ERLite and VyOS 1.1.7: sometimes it loses all routes...
Rebooting both routers solves it... (but it takes much more than 30 minutes to show up)

pasik added a subscriber: pasik. Dec 16 2018, 11:20 AM

@dmbaturin this was fixed in frr master

kroy added a comment. Dec 18 2018, 5:09 PM

Maybe this isn't the same issue? Still a problem in RC11 unfortunately.

kroy added a comment. Dec 20 2018, 5:15 AM

I pulled down vyos-frr submodule and the above-mentioned commit is present.

That means whatever this bug is, it’s not due to upstream, at least as far as I could find. I guess I’ll have to lab it up and see if it’s due to my config, though I’m not sure how it could be, considering that after a reboot it works as expected until the first timeout.

kroy added a comment. Jan 2 2019, 10:39 PM

Unfortunately still a problem in the EPA2 release. Once 30 minutes pass, I stop getting a default route across my entire OSPF infrastructure.

kroy added a comment. Jan 13 2019, 2:29 AM

As of (at least) VyOS 1.2.0-rolling+201901090739 I believe this problem to be corrected.

Edge on VyOS 1.2.0-rolling+201901090739
Subrouters on EPA-2

After 24+ hours, the default route is still being distributed. This fixed itself on the upgrade to that particular rolling.

I'll watch for a few more days, then consider this issue fixed and close it.

kroy added a comment. Jan 13 2019, 5:00 PM

I was mistaken. Seems to have lost the route again.

syncer closed this task as Resolved. Jan 20 2019, 12:31 PM

We need a reproduction procedure so we can work with the FRR guys on this.

We also need more detail, including FRR logs.

syncer reopened this task as Confirmed. Jan 20 2019, 12:31 PM
kroy added a comment. Edited Jan 20 2019, 7:58 PM

Okay, spent the whole day messing with this and I've tracked it down so it's reproducible.

Layout is:

Router1 (edge):

eth0: ip on existing network
eth1: 10.98.99.1/30 (to Router2), 10.98.99.5/30 (to Router3)

Router2:

eth0: 10.98.99.2/30

(If there's a Router3, its IP becomes 10.98.99.6/30, but it's not necessary. That layout just mimics my homelab setup a bit.)

Whether I use this:

set protocols ospf area 0 network '10.98.99.0/24'

or this:

set protocols ospf area 0 network '10.98.99.0/30'
set protocols ospf area 0 network '10.98.99.4/30'

the result is the same: after 30 minutes, which is the first LSA timeout, the default route fails to flood.
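
To double-check that both variants cover the same interfaces, here is a small Python sketch using the standard `ipaddress` module. It assumes the eth1 addresses from the Router1 configuration posted in this comment, and it only tests whether each address falls inside a network statement, which is a simplification of how ospfd matches interfaces. The helper name `covering_networks` is made up for illustration.

```python
import ipaddress

def covering_networks(interface_cidrs, ospf_networks):
    """Map each interface address to the OSPF network statements containing it."""
    nets = [ipaddress.ip_network(n) for n in ospf_networks]
    result = {}
    for cidr in interface_cidrs:
        addr = ipaddress.ip_interface(cidr).ip
        result[cidr] = [str(n) for n in nets if addr in n]
    return result

# eth1 addresses on Router1 (edge), taken from the configuration in this comment.
ETH1 = ["10.98.99.1/30", "10.98.99.5/30"]

# Variant 1: a single /24 network statement.
print(covering_networks(ETH1, ["10.98.99.0/24"]))
# Variant 2: one /30 network statement per link.
print(covering_networks(ETH1, ["10.98.99.0/30", "10.98.99.4/30"]))
```

In both variants every eth1 address is matched by exactly one network statement, so by this check the two configurations should enable OSPF on the same links.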

I am not familiar enough with OSPF to know if this is even a valid network layout, but I do know this:

  • Works as I expect in RC3 and before. I know this because I dumbly installed RC3 this morning instead of EPA3, and wasted most of the day trying to figure out why the issue was suddenly gone.
  • Works as expected when a Mikrotik device is the edge.

If this isn't a valid network layout, it's worth noting that this still feels like a bug. As mentioned above, the default route is flooded exactly once, then never again. If it was an invalid layout, it feels like the default route would never get flooded.

Interestingly enough, if I use the latter setup (with the /30 instead of the /24), rebooting Router3 will force Router2 to get the default route with OSPF again.


Configuration (password is vyos for all):

Router1 (edge):

  • eth0 is existing network
  • eth1 is just a new empty vSwitch

config:

set interfaces ethernet eth0 address '10.3.1.201/24'
set interfaces ethernet eth0 duplex 'auto'
set interfaces ethernet eth0 smp-affinity 'auto'
set interfaces ethernet eth0 speed 'auto'
set interfaces ethernet eth1 address '10.98.99.1/30'
set interfaces ethernet eth1 address '10.98.99.5/30'
set interfaces loopback lo address '10.98.98.1/32'
set protocols ospf area 0 network '10.98.99.0/24'
set protocols ospf default-information originate always
set protocols ospf default-information originate metric '10'
set protocols ospf default-information originate metric-type '2'
set protocols ospf log-adjacency-changes
set protocols ospf parameters router-id '10.98.98.1'
set protocols ospf redistribute connected metric-type '2'
set protocols static route 0.0.0.0/0 next-hop 10.3.1.1
set service ssh port '22'
set system config-management commit-revisions '100'
set system console device ttyS0 speed '9600'
set system host-name 'vyos'
set system login user vyos authentication encrypted-password '$6$iwdp4jVqP$WC5uS6g.FBG3QDsNX8Kdv446J./DMO2jzWj8MKkYbiekl0bqQVK6rFJO1SxtnwYH42k09PXck0AD4QKSMReiL1'
set system login user vyos authentication plaintext-password ''
set system login user vyos level 'admin'
set system name-server '10.3.1.254'
set system ntp server 0.pool.ntp.org
set system ntp server 1.pool.ntp.org
set system ntp server 2.pool.ntp.org
set system syslog global facility all level 'info'
set system syslog global facility protocols level 'debug'
set system time-zone 'UTC'

Router2:

  • eth0 is the new vSwitch

config:

set interfaces ethernet eth0 address '10.98.99.2/30'
set interfaces ethernet eth0 duplex 'auto'
set interfaces ethernet eth0 smp-affinity 'auto'
set interfaces ethernet eth0 speed 'auto'
set interfaces loopback lo address '10.98.98.2/32'
set protocols ospf area 0 network '10.98.99.4/30'
set protocols ospf log-adjacency-changes
set protocols ospf parameters router-id '10.98.98.2'
set protocols ospf redistribute connected
set service ssh port '22'
set system config-management commit-revisions '100'
set system console device ttyS0 speed '9600'
set system host-name 'R2'
set system login user vyos authentication encrypted-password '$6$iwdp4jVqP$WC5uS6g.FBG3QDsNX8Kdv446J./DMO2jzWj8MKkYbiekl0bqQVK6rFJO1SxtnwYH42k09PXck0AD4QKSMReiL1'
set system login user vyos authentication plaintext-password ''
set system login user vyos level 'admin'
set system name-server '10.3.1.254'
set system ntp server 0.pool.ntp.org
set system ntp server 1.pool.ntp.org
set system ntp server 2.pool.ntp.org
set system syslog global facility all level 'info'
set system syslog global facility protocols level 'debug'
set system time-zone 'UTC'

Router3 is the same as Router2, except that the router-id and loopback address change from 10.98.98.2 to 10.98.98.3 and the eth0 address changes to 10.98.99.6/30. Two devices are enough to reproduce the issue, though.

Router1 frr.log:

Jan 20 18:53:03 vyos bgpd[1137]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 3 in VRF 0
Jan 20 18:53:03 vyos bgpd[1137]: [EC 100663301] INTERFACE_ADDRESS_ADD: Cannot find IF 3 in VRF 0
Jan 20 18:53:08 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Exchange -> Full (ExchangeDone)
Jan 20 19:27:56 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Full -> Deleted (KillNbr)
Jan 20 19:28:03 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Loading -> Full (LoadingDone)

Router2 frr.log:

Jan 20 18:53:07 R2 ospfd[1022]: Packet[DD]: Neighbor 10.98.98.1 Negotiation done (Master).
Jan 20 18:53:07 R2 ospfd[1022]: AdjChg: Nbr 10.98.98.1 on eth0:10.98.99.2: Loading -> Full (LoadingDone)
Jan 20 19:27:55 R2 ospfd[1022]: AdjChg: Nbr 10.98.98.1 on eth0:10.98.99.2: Full -> Init (1-WayReceived)
Jan 20 19:27:58 R2 ospfd[1022]: Packet[DD]: Neighbor 10.98.98.1 Negotiation done (Master).
Jan 20 19:27:58 R2 ospfd[1022]: AdjChg: Nbr 10.98.98.1 on eth0:10.98.99.2: Exchange -> Full (ExchangeDone)
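
Both logs show the adjacency leaving the Full state roughly 35 minutes after it was established. A rough Python sketch for measuring that interval from AdjChg lines follows. Syslog timestamps carry no year, so 2019 is assumed from the date of this comment, and the function name `full_state_durations` is hypothetical.

```python
import re
from datetime import datetime

# Matches ospfd adjacency-change lines as logged with log-adjacency-changes.
ADJ_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ ospfd\[\d+\]: "
    r"AdjChg: Nbr (?P<nbr>\S+) on \S+ (?P<old>\S+) -> (?P<new>\S+) "
)

def full_state_durations(log_text, year=2019):
    """Return (neighbor, seconds) for each period a neighbor stayed in Full."""
    entered = {}    # neighbor -> time it entered Full
    durations = []
    for line in log_text.splitlines():
        m = ADJ_RE.match(line.strip())
        if not m:
            continue  # e.g. bgpd lines or DD packet messages
        # The year is an assumption; syslog lines do not include one.
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        nbr = m.group("nbr")
        if m.group("new") == "Full":
            entered[nbr] = ts
        elif m.group("old") == "Full" and nbr in entered:
            seconds = int((ts - entered.pop(nbr)).total_seconds())
            durations.append((nbr, seconds))
    return durations

ROUTER1_LOG = """\
Jan 20 18:53:08 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Exchange -> Full (ExchangeDone)
Jan 20 19:27:56 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Full -> Deleted (KillNbr)
Jan 20 19:28:03 vyos ospfd[1164]: AdjChg: Nbr 10.98.98.2 on eth1:10.98.99.1: Loading -> Full (LoadingDone)
"""

print(full_state_durations(ROUTER1_LOG))  # prints [('10.98.98.2', 2088)]
```

2088 seconds is 34 minutes 48 seconds. The Router2 log gives the same figure for its view of neighbor 10.98.98.1 (18:53:07 to 19:27:55).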

Router2 routing table before 30 minutes:

O>* 0.0.0.0/0 [110/10] via 10.98.99.1, eth0, 00:26:03
O>* 10.3.1.0/24 [110/20] via 10.98.99.1, eth0, 00:26:03
O>* 10.98.98.1/32 [110/20] via 10.98.99.1, eth0, 00:26:03
C>* 10.98.98.2/32 is directly connected, lo, 01:09:22
O   10.98.99.0/30 [110/10] is directly connected, eth0, 01:07:30
C>* 10.98.99.0/30 is directly connected, eth0, 01:09:21

Router2 routing table after 30 minutes:

O>* 10.3.1.0/24 [110/20] via 10.98.99.1, eth0, 00:33:35
O>* 10.98.98.1/32 [110/20] via 10.98.99.1, eth0, 00:33:35
C>* 10.98.98.2/32 is directly connected, lo, 01:16:54
O   10.98.99.0/30 [110/10] is directly connected, eth0, 01:15:02
C>* 10.98.99.0/30 is directly connected, eth0, 01:16:53
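
To make the difference between the two tables explicit, here is a short Python sketch that compares the OSPF prefixes in each. It assumes the table format shown above, and the helper name `ospf_prefixes` is made up for illustration.

```python
def ospf_prefixes(route_output):
    """Collect the prefixes of all OSPF ("O") lines in a routing table dump."""
    prefixes = set()
    for line in route_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("O"):
            continue
        # The prefix is the first field after the code that looks like a CIDR.
        prefix = next((f for f in fields[1:] if "/" in f), None)
        if prefix:
            prefixes.add(prefix)
    return prefixes

BEFORE = """\
O>* 0.0.0.0/0 [110/10] via 10.98.99.1, eth0, 00:26:03
O>* 10.3.1.0/24 [110/20] via 10.98.99.1, eth0, 00:26:03
O>* 10.98.98.1/32 [110/20] via 10.98.99.1, eth0, 00:26:03
C>* 10.98.98.2/32 is directly connected, lo, 01:09:22
O   10.98.99.0/30 [110/10] is directly connected, eth0, 01:07:30
C>* 10.98.99.0/30 is directly connected, eth0, 01:09:21
"""

AFTER = """\
O>* 10.3.1.0/24 [110/20] via 10.98.99.1, eth0, 00:33:35
O>* 10.98.98.1/32 [110/20] via 10.98.99.1, eth0, 00:33:35
C>* 10.98.98.2/32 is directly connected, lo, 01:16:54
O   10.98.99.0/30 [110/10] is directly connected, eth0, 01:15:02
C>* 10.98.99.0/30 is directly connected, eth0, 01:16:53
"""

print(sorted(ospf_prefixes(BEFORE) - ospf_prefixes(AFTER)))  # prints ['0.0.0.0/0']
```

Only the default route goes missing. The redistributed connected routes (10.3.1.0/24 and 10.98.98.1/32) are still learned via OSPF.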

byabcz added a subscriber: byabcz. Apr 1 2019, 12:20 PM

Ran into the same issue with 1.2.2.

Some Feedback from the frr people:

eqvinox commented 10 minutes ago
Hi, sorry this somewhat fell into the cracks —

debug logs ("all the debugs") from ospfd would be great on this. I haven't been able to reproduce this yet, I'll go through the additional configs posted here next week (travelling this week.)

kroy added a comment. Jul 25 2019, 4:17 PM

Attached are the pcap and debug logs from a simple setup as outlined above, two hosts. "Master" distributing the route.

0.0.0.0/0 reliably disappears sometime between 27 and 28 minutes of uptime.

Can we make a nightly with the patches from:

https://github.com/FRRouting/frr/issues/4237

c-po added a subscriber: c-po. Jul 31 2019, 2:17 PM

I'd rather suggest building a test ISO, as we build the FRR source code from their repo directly. I would not permanently add this not-yet-mainlined patchset to the package feed.

I have set up two VyOS routers, one of which originates the default route.
Everything has been running fine with these patches for more than 40 minutes, so they fix the problem.

With 1.2.3-eap1 (FRR version 7.2-dev-10290718), the problem is still there: the default route disappears after 30 to 40 minutes.