
RPKI doesn't boot properly
Open, Requires assessment | Public | BUG

Description

Have a config like:

rpki {
     cache routinator {
         address 192.168.100.90
         port 3323
     }
 }

and a route-map like:

route-map ebgp-transit-rpki {

    rule 10 {
        action deny
        match {
            rpki invalid
        }
    }
    rule 20 {
        action permit
        match {
            rpki notfound
        }
        set {
            local-preference 20
        }
    }
    rule 30 {
        action permit
        match {
            rpki valid
        }
        set {
            local-preference 100
        }
    }
}

After a reboot this setup doesn't work: the BGP session that has this import route-map set comes up with 0 prefixes. It stays that way until I enter vtysh and execute rpki stop, rpki start and clear bgp *

Details

Difficulty level
Unknown (require assessment)
Version
1.3-rolling-202002161917 (but this has been going on for quite some time)
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

I can confirm this bug also with VyOS 1.2.4

bgp-rtor-01 should receive the prefixes.
bgp-rtor-02 advertises the prefixes.

After rebooting "bgp-rtor-01" there is no connection to the RPKI server:

[email protected]:~$ show ip bgp sum
...
Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
192.168.33.2    4     203115       4       3        0    0    0 00:00:14            0

[email protected]:~$ sudo vtysh -c "show rpki cache-connection"
No connection to RPKI cache server.
[email protected]:~$

We check that bgp-rtor-02 is exporting prefixes:

[email protected]:~$ show ip bgp neighbors 192.168.33.1 advertised-routes 
BGP table version is 8, local router ID is 192.168.33.2, vrf id 0
Default local pref 100, local AS 203115
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 100.64.0.0/24    0.0.0.0                  0         32768 i
*> 100.64.1.0/24    0.0.0.0                  0         32768 i
*> 100.64.2.0/24    0.0.0.0                  0         32768 i
*> 100.64.3.0/24    0.0.0.0                  0         32768 i
...

Total number of prefixes 8
[email protected]:~$

After rebooting bgp-rtor-01, a tcpdump on the routinator server does not show any attempts to connect to it:

[email protected]:/home/sever# tcpdump -ntti ens20 port 3323
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens20, link-type EN10MB (Ethernet), capture size 262144 bytes

Looks to me like an FRR bug. Has anyone contacted upstream?

We saw something similar to this, but it seems like FRR eventually connected to RTRR. I think it has a timeout parameter; is that how often (slowly) it tries to re-establish?
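
One way to tell "reconnects slowly" apart from "never reconnects" is to poll the cache connection state after boot for a bounded time. The following is a minimal sketch, not a tested tool: the function name wait_for_rpki, the VTYSH override variable and the default timeout and interval are assumptions made for illustration. The two status strings it relies on are the ones shown in the outputs in this thread.

```shell
#!/bin/sh
# Hypothetical helper: poll FRR until the RPKI cache connection is up,
# or give up after a timeout. VTYSH can be overridden (e.g. for testing).
VTYSH="${VTYSH:-vtysh}"

wait_for_rpki() {
    timeout="${1:-600}"   # seconds to keep trying (assumed default)
    interval="${2:-10}"   # seconds between polls (assumed default)
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        # "Connected to group N" is what a working setup prints in this thread.
        if "$VTYSH" -c "show rpki cache-connection" | grep -q "^Connected to group"; then
            echo "connected after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "still not connected after ${timeout}s"
    return 1
}
```

Running `wait_for_rpki 900 30` as root right after boot would report roughly how long the connection took, or that it never came up within 15 minutes.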

While testing T1874 the procedure we followed was:

  1. install 1.2.4
  2. configure
  3. observe bgpd crash after ~5 minutes
  4. upgrade to 1.2.5
  5. reboot
  6. check RPKI RTRR connection had established
  7. check BGP session had established
  8. observe no crash in bgpd
  9. celebrate

And I can confirm that after boot-up, FRR was indeed connected to an RTRR server:

[email protected]:~$ sudo vtysh -c "show rpki cache-connection"
Connected to group 1
rpki tcp cache 46.227.201.12 3323 pref 1

Checking from the other end, on the RTRR server:

[email protected]:~# ss -an | grep 46.227.201.78
tcp                ESTAB               0                    0                                                                   46.227.201.12:3323                                                      46.227.201.78:44676

I would say that this bug might be fixed?

I tried this today with 1.3-rolling-202004180117 ...

after reboot:

$ show rpki cache-connection
No connection to RPKI cache server.

again:
vRouter1# rpki stop
vRouter1# rpki start
vRouter1# show rpki cache-connection
Connected to group 1
rpki tcp cache 192.168.100.90 3323 pref 1
vRouter1# clear bgp *

solves everything.
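
Until the root cause is fixed, the manual workaround above could be scripted. This is a rough sketch under assumptions: the function name rpki_recover and the VTYSH override variable are hypothetical, and how the script would be hooked into boot is left open. It only issues the vtysh commands already used in this report. Note that clear bgp * resets every BGP session on the router, so it is disruptive.

```shell
#!/bin/sh
# Hypothetical workaround helper: restart RPKI only when FRR reports
# that no cache connection exists. VTYSH can be overridden for testing.
VTYSH="${VTYSH:-vtysh}"

rpki_recover() {
    if "$VTYSH" -c "show rpki cache-connection" \
            | grep -q "No connection to RPKI cache server"; then
        "$VTYSH" -c "rpki stop"
        "$VTYSH" -c "rpki start"
        # Resets ALL BGP sessions so routes are re-evaluated against RPKI.
        "$VTYSH" -c "clear bgp *"
        echo "restarted"
    else
        echo "ok"
    fi
}
```

Calling `rpki_recover` as root prints "restarted" when it had to intervene and "ok" when the cache connection was already in place.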

@primoz, I have exactly the same issue with "1.4-rolling-202103011828 (sagitta)"

erkin set Issue type to Bug (incorrect behavior). Aug 31 2021, 5:39 PM

Still reproducible on VyOS 1.3-beta-202111150443 after a reboot.

No imported routes:

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.122.11  4     203115         6         5        0    0    0 00:01:52            0        0

Restart RPKI:

r4-epa2# 
r4-epa2# show rpki cache-connection 
No connection to RPKI cache server.
r4-epa2# 

r4-epa2# rpki stop 
r4-epa2# rpki start
r4-epa2# exit

Reset the BGP peer:

[email protected]:~$ reset ip bgp all


[email protected]:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 192.168.122.14, local AS number 65001 vrf-id 0
BGP table version 8
RIB entries 15, using 2880 bytes of memory
Peers 1, using 21 KiB of memory

Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt
192.168.122.11  4     203115        10        10        0    0    0 00:00:02            8        0
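
The symptom in the two summaries above is the State/PfxRcd column: 0 before the RPKI restart, 8 after it. A small filter can flag affected neighbors. This is only a sketch: the function name zero_prefix_neighbors is hypothetical, it assumes the column layout shown here (State/PfxRcd is the tenth field), and it handles IPv4 neighbor addresses only.

```shell
#!/bin/sh
# Hypothetical filter: read "show ip bgp summary" output on stdin and print
# IPv4 neighbors that are established but have received 0 prefixes.
zero_prefix_neighbors() {
    # Field 10 is State/PfxRcd in the layout above; the string comparison
    # against "0" skips non-established states such as Idle or Active.
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $10 == "0" { print $1 }'
}
```

For example, `vtysh -c "show ip bgp summary" | zero_prefix_neighbors` would list the peers showing this symptom. An empty result does not prove RPKI is healthy, since a peer may legitimately send no prefixes or have all of them filtered for other reasons.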

I'm able to reproduce this with 1.4, using the new config structure:

rpki {
    cache 10.3.96.4 {
        port 8082
        preference 1
    }
}

The same procedure restores operation:

$ vtysh
dal-1# rpki stop
dal-1# rpki start
dal-1# clear bgp *
dal-1# exit