
Routes vanish when using FRR with ECMP and one of the ECMP paths is no longer available
Open, High, Public, BUG

Description

When using FRR with ECMP, one can end up in a situation where routes are suddenly dropped by FRR (Zebra). The cause is a race condition between FRR and the Linux kernel in how nexthop groups are handled when one of the ECMP paths, for whatever reason, is no longer available.

The situation can look like this in the log:

2023/07/26 10:26:30 ZEBRA: [HSYZM-HV7HF] Extended Error: Can not replace a nexthop with a nexthop group.
2023/07/26 10:26:30 ZEBRA: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Invalid argument, type=RTM_NEWNEXTHOP(104), seq=100352827, pid=2475708348
2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (1769049[19431932/19431933]) into the kernel
2023/07/26 10:26:30 ZEBRA: [HSYZM-HV7HF] Extended Error: Can not replace a nexthop with a nexthop group.
2023/07/26 10:26:30 ZEBRA: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Invalid argument, type=RTM_NEWNEXTHOP(104), seq=100352851, pid=2475708348
2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (1769049[19431932/19431933]) into the kernel
2023/07/26 10:26:34 ZEBRA: [SWQK6-6JY63][EC 4043309074] 132:1263:10.59.14.183/32: Failed to enqueue dataplane install
2023/07/26 10:27:10 BGP: [KTE2S-GTBDA][EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 492212 in VRF 196
2023/07/26 10:27:32 ZEBRA: [SWQK6-6JY63][EC 4043309074] 19:1468:10.115.5.246/32: Failed to enqueue dataplane install
2023/07/26 10:27:34 ZEBRA: [SWQK6-6JY63][EC 4043309074] 188:1411:10.11.4.9/32: Failed to enqueue dataplane install
2023/07/26 10:27:38 BGP: [KTE2S-GTBDA][EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 488260 in VRF 133
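To see which nexthop groups are affected, the log can be summarized with a short script. This is only a sketch: the function name `failed_nexthop_groups` is made up for illustration, and it assumes that the numbers in square brackets after the group ID are the IDs of the member nexthops.

```python
import re

# Matches lines such as:
#   ... Failed to install Nexthop (1769049[19431932/19431933]) into the kernel
# Assumption: the bracketed numbers are the member nexthop IDs of the group.
NHG_FAIL = re.compile(
    r"Failed to install Nexthop \((\d+)\[(\d+(?:/\d+)*)\]\) into the kernel"
)


def failed_nexthop_groups(log_text):
    """Return {group_id: (member_ids, failure_count)} for failed installs."""
    result = {}
    for line in log_text.splitlines():
        match = NHG_FAIL.search(line)
        if match is None:
            continue
        group_id = int(match.group(1))
        members = tuple(int(m) for m in match.group(2).split("/"))
        previous_count = result.get(group_id, (members, 0))[1]
        result[group_id] = (members, previous_count + 1)
    return result


# Two of the lines from the log excerpt above, plus one unrelated line.
sample = (
    "2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install "
    "Nexthop (1769049[19431932/19431933]) into the kernel\n"
    "2023/07/26 10:26:30 ZEBRA: [HSYZM-HV7HF] Extended Error: Can not replace "
    "a nexthop with a nexthop group.\n"
    "2023/07/26 10:26:30 ZEBRA: [X5XE1-RS0SW][EC 4043309074] Failed to install "
    "Nexthop (1769049[19431932/19431933]) into the kernel\n"
)

for gid, (members, count) in failed_nexthop_groups(sample).items():
    print(f"group {gid} members {members} failed {count} time(s)")
```

Grouping by ID makes it easier to tell whether a single group keeps failing or many different groups are involved.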

The fix is to add the following to frr.conf:

zebra nexthop-group keep 1

With the above the error condition no longer surfaces.

From the Zebra docs (https://docs.frrouting.org/en/latest/zebra.html#clicmd-zebra-nexthop-group-keep-1-3600):

zebra nexthop-group keep (1-3600)

Set the time that zebra will keep a created and installed nexthop group before removing it from the system if the nexthop group is no longer being used. The default time is 180 seconds.
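Since the command only accepts values from 1 to 3600, a quick check of the configuration text can catch a misspelled keyword or an out-of-range value before relying on the setting. The following is a rough sketch; `keep_timer` is a hypothetical helper, not part of FRR or VyOS, and it only inspects text (for example the contents of frr.conf), not the running daemon.

```python
import re

# Exact syntax from the Zebra docs: "zebra nexthop-group keep (1-3600)"
KEEP_RE = re.compile(r"^\s*zebra nexthop-group keep (\d+)\s*$")


def keep_timer(config_text):
    """Return the configured keep time in seconds, or None if absent or invalid."""
    for line in config_text.splitlines():
        match = KEEP_RE.match(line)
        if match and 1 <= int(match.group(1)) <= 3600:
            return int(match.group(1))
    return None


print(keep_timer("log syslog\nzebra nexthop-group keep 1\n"))
print(keep_timer("zebra nexthop-groups keep 1\n"))
```

The first call finds the setting; the second returns `None` because the keyword does not match the documented spelling.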

The above has been discussed at:

https://forum.vyos.io/t/frr-loses-routing-info-after-5-12k-l2tp-subs-connected/10422/16

Solution provided by:

https://github.com/FRRouting/frr/issues/12239

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.4-rolling-202307250317
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

As I recall, this case may be associated with this task: https://vyos.dev/T5077

@aserkin is it solved with this workaround, or does it only apply in this specific case?

From last night's tests it seems to be solved, though I'd prefer to test the node in production for a few weeks to be sure.

Cool! It's interesting to understand this complex scenario and how it works in a real environment, as well as the way zebra handles nexthop groups. In fact, I genuinely appreciate your valuable feedback so far!

There is still some problem with the proposed workaround. It does not seem to work fully when applied on a running system with active BGP sessions. At least, after our last tests I still see nexthop groups in the kernel that have only one nexthop:

$ ip route get 10.144.0.242 vrf PRM_MRSK
10.144.0.242  encap mpls  59704/57912 via 10.228.134.142 dev eth4 table PRM_MRSK src 10.228.134.143 uid 1003
    cache

Route updates for this prefix then cause log errors like the following:

2023/08/03 12:08:57 ZEBRA: [VYKYC-709DP] PRM_MRSK(133:1253):10.144.0.242/32: Route install failed

In a normal situation there should be two nexthops, like this:

10.144.97.123 nhid 2621235 proto bgp metric 20 
        nexthop  encap mpls  61011/57912 via 10.228.134.140 dev eth3 weight 1 
        nexthop  encap mpls  59704/57912 via 10.228.134.142 dev eth4 weight 1
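To find routes that have lost one of their ECMP paths, the number of nexthops per route can be counted from the textual route listing. This is a sketch under assumptions: it expects `ip route show` style output in which multipath routes list their paths on indented `nexthop` lines (as above) and single-path routes carry `via` on the prefix line. The function name `paths_per_route` and the single-path sample line are made up for illustration.

```python
def paths_per_route(ip_route_output):
    """Count nexthops per prefix in 'ip route show' style text output."""
    counts = {}
    current = None
    for line in ip_route_output.splitlines():
        if not line.strip():
            continue
        tokens = line.split()
        if line[0].isspace():
            # Indented continuation line: one path of a multipath route.
            if current is not None and tokens[0] == "nexthop":
                counts[current] += 1
        else:
            # New route entry; a gateway on this line means a single path.
            current = tokens[0]
            counts[current] = 1 if "via" in tokens else 0
    return counts


sample = (
    "10.144.97.123 nhid 2621235 proto bgp metric 20 \n"
    "        nexthop  encap mpls  61011/57912 via 10.228.134.140 dev eth3 weight 1 \n"
    "        nexthop  encap mpls  59704/57912 via 10.228.134.142 dev eth4 weight 1\n"
    # Hypothetical single-path entry, modelled on the degraded route above.
    "10.144.0.242 encap mpls 59704/57912 via 10.228.134.142 dev eth4 proto bgp metric 20\n"
)

counts = paths_per_route(sample)
degraded = [prefix for prefix, n in counts.items() if n < 2]
print(counts)
print("routes with fewer than two paths:", degraded)
```

A route with a single path is not necessarily broken, so the list is only a starting point for comparison against what BGP is expected to install.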

We probably need to reboot the system and see what will be there when booted with

zebra nexthop-groups keep 1

Or maybe there is a way to recreate the broken groups without a reboot?

Note that you had one "s" too many in your command.

And I assume you know that adding this during runtime would need something like this?

  1. vtysh
  2. configure terminal
  3. zebra nexthop-group keep 1
  4. exit
  5. write
  6. exit

I think it can be good to do a "did you turn it off and on again?" just to rule things out.

Please try these options to see if they have any effect (if possible all of them, but one at a time; if one fails, remove the changes before trying the next):

Option A:

  1. Edit /config/scripts/vyos-postconfig-bootup.script and add the following at the bottom of this file:
vtysh -c 'configure terminal' -c 'zebra nexthop-group keep 1'
vtysh -w
  2. Save the file and reboot.

Option B:

  1. Edit /etc/systemd/system/frr.service.d/override.conf and change this:
[Service]
LimitNOFILE=4096
ExecStartPre=/bin/bash -c 'mkdir -p /run/frr/config; \
             echo "log syslog" > /run/frr/config/frr.conf; \
             echo "log facility local7" >> /run/frr/config/frr.conf; \
             chown frr:frr /run/frr/config/frr.conf; \
             chmod 664 /run/frr/config/frr.conf; \
             mount --bind /run/frr/config/frr.conf /etc/frr/frr.conf'

to this:

[Service]
LimitNOFILE=4096
ExecStartPre=/bin/bash -c 'mkdir -p /run/frr/config; \
             echo "log syslog" > /run/frr/config/frr.conf; \
             echo "log facility local7" >> /run/frr/config/frr.conf; \
             echo "zebra nexthop-group keep 1" >> /run/frr/config/frr.conf; \
             chown frr:frr /run/frr/config/frr.conf; \
             chmod 664 /run/frr/config/frr.conf; \
             mount --bind /run/frr/config/frr.conf /etc/frr/frr.conf'

That is, you append this line (around line 6):

echo "zebra nexthop-group keep 1" >> /run/frr/config/frr.conf; \

just after the line with "log facility local7".

  2. Save the file and reboot.

Note that option A should mimic what you do at runtime after VyOS has completed its boot (and FRR etc. is already up and running), while option B alters frr.conf so that the setting already exists when FRR and the other FRR-related daemons first start (in case that makes a difference).

Yes, I did that as option A yesterday, and rebooted. Then I removed "zebra nexthop-group keep 1" and played a bit with interfaces up/down until the kernel routes vanished. Then I put "zebra nexthop-group keep 1" back and rebooted again.
I will try option B then.
Meanwhile it turned out to be possible to fix the "Route install failed" errors. I turned on debug zebra kernel, found the nhg_id which caused the route install error and created it manually using the nh1/nh2 provided by vtysh -c "show nexthop-group rib <nhg_id>", just as described in the original thread regarding IPv6 routes.
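The exact command used for the manual recreation is not shown in this thread. As a rough illustration only, the sketch below formats a command using iproute2's `ip nexthop add id <id> group <nh1>/<nh2>` syntax, which is an assumption about how the group was recreated. The helper name is made up, the IDs are borrowed from the log excerpt in the description, and the member nexthops are assumed to already exist in the kernel. The function only builds the command string and does not execute anything.

```python
def nexthop_group_command(group_id, member_ids):
    """Format an 'ip nexthop add' command for a group (string only, not executed)."""
    if len(member_ids) < 2:
        raise ValueError("an ECMP nexthop group needs at least two members")
    members = "/".join(str(member) for member in member_ids)
    return f"ip nexthop add id {group_id} group {members}"


# IDs taken from the log excerpt in the description, purely as an example.
print(nexthop_group_command(1769049, [19431932, 19431933]))
```

Printing the command first, rather than running it, leaves room to compare the IDs against the output of "show nexthop-group rib" before touching the kernel.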

Note also that 1.4 rolling as of today (3rd Aug) uses FRR 9.0 (previously I think 8.5.4 or something like that was used).

After 19 hours of production run since yesterday, the failure occurred again despite the workaround being applied. Routes are cleared from the kernel for some reason. During the run we observed a few L2TP tunnel drops, followed by 600 to 6000 sessions dropping. The reason is not clear for now, but I'm not sure this should kill zebra functionality this way.

And do the logs look the same as in your original post?

Is it also possible for you to add debugging to zebra, such as these lines, through vtysh (or use the previous tip on how to make it permanent in VyOS)?

debug zebra kernel
debug zebra dplane
debug zebra dplane dpdk
debug zebra nexthop


Adding what was available this time. Will try to turn on debugs next time if we have another chance. Yes, the behavior was identical to previous.

I added a comment to https://github.com/FRRouting/frr/issues/12239 so hopefully there might be some other commands or stuff to do other than the debug-commands to hunt this thing down.

I assume you can install the latest VyOS 1.4-rolling (6.1.43 kernel + FRR 9.0) to rule things out?

I checked the FRR version in the recent rolling release, and it is still a release candidate. Is it worth upgrading from 8.5.2? As for the possibility: yes, sure, we can build the latest image.

I tried digging through Google to see if somebody else has encountered the same thing, but I couldn't find any obvious hints (except for the zebra nexthop-group keep 1 already mentioned).

It also seems that a lot of people use FRR more like a route reflector, i.e. it is part of the BGP exchange but the packets never actually pass through the box itself. So if this problem exists in more installations, they won't see the error, since the physical path of the packets goes elsewhere.

The closest hint that wasn't already known was this:

https://lukasz.bromirski.net/post/openfrr-openbgpd-bird/

unfortunately, while it worked very well for my home network (FRRouting that is), when deploying in AS112 I hit some unexpected behaviors quite quickly after starting the project. FRRouting was forgetting ARP entries to directly connected next-hops, which in turn was invalidating some of the routes… and traffic was no longer returning over paths advertised by IXP members. that was bad, as it meant while we’re still sinkholing private DNS requests, we were not responding to them. that may have meant slow responses and Customers noticing delay in “no such host” response.

initially I thought it was something related to my sysctl.conf config that had networking stack tuned, or pf.conf configuration that is responsible for filtering traffic coming in and out.

after clearing out all customizations, unfortunately the problem didn’t go away - FRRouting was still dropping next-hops shortly after restart. I suspect it’s somehow related to interaction between bgpd and zebra that talk over TCP connections in BSD systems, but debugging this further meant digging in the code myself. FRRouting devs seems to be currently focused more on Linux it seems.

So ehm... add "check arp status" to the list of things to monitor next time this occurs.

Speaking of which, something like this can be used to maximize the ARP tables (and ND tables) of VyOS:

set system ip arp table-size '32768'
set system ipv6 neighbor table-size '32768'

The default is 8192 entries, could this be related to your issue?
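One way to check whether the neighbor table is anywhere near its limit is to count the entries and compare them with the configured size. This is a minimal sketch: `neighbor_table_usage` is a made-up helper, the sample lines are invented, and it assumes one entry per line as printed by `ip neigh show`.

```python
def neighbor_table_usage(ip_neigh_output, table_size=8192):
    """Return (entry_count, fraction_of_table_used) for 'ip neigh show' text."""
    entries = [line for line in ip_neigh_output.splitlines() if line.strip()]
    return len(entries), len(entries) / table_size


# Invented sample output; real data would come from 'ip neigh show'.
sample = (
    "10.228.134.140 dev eth3 lladdr 00:11:22:33:44:55 REACHABLE\n"
    "10.228.134.142 dev eth4 lladdr 00:11:22:33:44:66 STALE\n"
)

used, ratio = neighbor_table_usage(sample)
print(f"{used} entries, {ratio:.2%} of the table")
```

If the count stays far below the table size while the problem occurs, the table size is unlikely to be the cause.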

If that were PPPoE I'd have thought of ARP, but here, with a fixed number of L2TP tunnels (22 tunnels from LACs), I don't think the ARP cache outgrows the table.
Some more information, which I can't identify as a failure reason yet but which looks strange: just before the issue we see that a LAC drops its L2TP tunnel for some reason and starts to send SCCRQ with tid=0, as if it had just started working. After a while the accel-ppp daemon drops the old tunnels and starts new ones for a few LACs. This presumably causes a massive number (thousands) of route updates between zebra and the kernel. Sometimes the system can withstand this, sometimes it can't.

This sounds related to the long-running shitshow between FRR and the Linux kernel:

https://github.com/FRRouting/frr/issues/7299

https://github.com/FRRouting/frr/pull/11970

That is, FRR somehow gets out of sync: what FRR thinks should exist as routes differs from what the Linux kernel actually has configured as routes.

Does anybody know if that's going to be fixed in FRR?