Page MenuHomeVyOS Platform

BGP crash in if_destroy_via_zapi
Open, Requires assessmentPublicBUG

Description

I'm running VyOS 1.2.5 as a PPPoE server with "redistribute connected" PPPoE customer routes to iBGP. Lots of dynamic interfaces created/deleted all the time, especially with buggy cheap SOHO routers that try to establish two sessions in parallel for no good reason, fail and try again. Crashes (bgpd signal 11) about once a day, and after bgpd is restarted by watchfrr, the process is there in "ps" but no longer works ("show ip bgp summary" says "BGP instance not found"), needs reboot to recover.

Reproduced on two separate setups with similar config, one bare metal (PC Engines APU4D4) and one VM (xcp-ng 8.1, Supermicro H8SGL-F, Opteron 6338P), in both cases with 4 CPU cores and 4 GB RAM (physical, or allocated to VM) which should be plenty. In both cases signal 11 in the same function, so bad hardware can be ruled out (both setups have ECC RAM). See below for frr.log messages (a few seconds surrounding the crash, please ask if you need more) from both crashes.

[email protected]:~$ show version
Version: VyOS 1.2.5
Built by: Sentrium S.L.
Built on: Sun 12 Apr 2020 15:18 UTC
Build UUID: 1695c660-d785-4b16-a54b-66d6a02ea24f
Build Commit ID: 48cc9fc46569e6

Architecture: x86_64
Boot via: installed image
System type: bare metal

Hardware vendor: PC Engines
Hardware model: apu4
Hardware S/N: 1418430
Hardware UUID: Unknown

Copyright: VyOS maintainers and contributors

Aug 21 21:20:46 R5 ripd[1086]: interface delete ppp-lot29 vrf 0 index 6387 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:46 R5 bgpd[1078]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-lot29 in VRF 0
Aug 21 21:20:46 R5 ripngd[1092]: interface delete ppp-lot29 vrf 0 index 6387 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:46 R5 ripd[1086]: interface delete ppp53 vrf 0 index 6388 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:46 R5 ripngd[1092]: interface delete ppp53 vrf 0 index 6388 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:46 R5 bgpd[1078]: Received signal 11 at 1598044846 (si_addr 0x8, PC 0x7f43ac82da1c); aborting...#012x
Aug 21 21:20:46 R5 bgpd[1078]: Backtrace for 13 stack frames:
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f43ac81e3c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f43ac81e823]3ac81e3c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x73b85) [0x7f43ac83fb85]ac81e823]3ac81e3c7]
Aug 21 21:20:46 R5 bgpd[1078]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f43ab646890]fb85]ac81e823]3ac81e3c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_rb_remove+0x16c) [0x7f43ac82da1c]]3ac81e3c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_set_index+0xb3) [0x7f43ac8152a3]3ac81e3c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_destroy_via_zapi+0x25) [0x7f43ac816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x9452e) [0x7f43ac86052e][0x7f43ac816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f43ac84d530]c816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f43ac81c658]530]c816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/frr/bgpd(main+0x322) [0x55ac6d401ca2]run+0xd8) [0x7f43ac81c658]530]c816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f43ab2adb45]530]c816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: /usr/lib/frr/bgpd(+0x3ad9c) [0x55ac6d403d9c]_main+0xf5) [0x7f43ab2adb45]530]c816865]c7]
Aug 21 21:20:46 R5 bgpd[1078]: in thread zclient_read scheduled from lib/zclient.c:3266#0127f43ac82da1c); aborting...#012x
Aug 21 21:20:46 R5 zebra[1067]: [EC 100663299] stream_read_try: read failed on fd 20: Connection reset by peer
Aug 21 21:20:46 R5 watchfrr[1030]: [EC 268435457] bgpd state -> down : read returned EOF
Aug 21 21:20:46 R5 zebra[1067]: [EC 4043309121] Client 'bgp' encountered an error and is shutting down.
Aug 21 21:20:46 R5 zebra[1067]: [EC 100663299] stream_read_try: read failed on fd 32: Connection reset by peer
Aug 21 21:20:46 R5 zebra[1067]: [EC 4043309121] Client 'vnc' encountered an error and is shutting down.
Aug 21 21:20:48 R5 zebra[1067]: client 20 disconnected. 920462 bgp routes removed from the rib
Aug 21 21:20:49 R5 zebra[1067]: client 32 disconnected. 0 vnc routes removed from the rib
Aug 21 21:20:50 R5 ripngd[1092]: interface delete ppp57 vrf 0 index 6389 flags 0x10 metric 0 mtu 1500
Aug 21 21:20:50 R5 ripd[1086]: interface delete ppp57 vrf 0 index 6389 flags 0x10 metric 0 mtu 1500
Aug 21 21:20:51 R5 watchfrr[1030]: [EC 100663303] Forked background command [pid 10441]: /usr/lib/frr/watchfrr.sh restart bgpd
Aug 21 21:20:51 R5 zebra[1067]: client 20 says hello and bids fair to announce only bgp routes vrf=0
Aug 21 21:20:51 R5 zebra[1067]: client 32 says hello and bids fair to announce only vnc routes vrf=0
Aug 21 21:20:52 R5 watchfrr[1030]: bgpd state -> up : connect succeeded
Aug 21 21:20:52 R5 ripd[1086]: interface delete ppp-lot29 vrf 0 index 6388 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:52 R5 ripngd[1092]: interface delete ppp-lot29 vrf 0 index 6388 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:52 R5 bgpd[10449]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-lot29 in VRF 0
Aug 21 21:20:52 R5 ripngd[1092]: interface delete ppp36 vrf 0 index 6387 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:52 R5 ripd[1086]: interface delete ppp36 vrf 0 index 6387 flags 0x10 metric 0 mtu 1492
Aug 21 21:20:52 R5 bgpd[10449]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp36 in VRF 0
Aug 21 21:20:52 R5 ripd[1086]: interface delete ppp60 vrf 0 index 6390 flags 0x1090 metric 0 mtu 1492
Aug 21 21:20:52 R5 ripngd[1092]: interface delete ppp60 vrf 0 index 6390 flags 0x1090 metric 0 mtu 1492
Aug 21 21:20:52 R5 bgpd[10449]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp60 in VRF 0

[email protected]:~$ show version
Version: VyOS 1.2.5
Built by: Sentrium S.L.
Built on: Sun 12 Apr 2020 15:18 UTC
Build UUID: 1695c660-d785-4b16-a54b-66d6a02ea24f
Build Commit ID: 48cc9fc46569e6

Architecture: x86_64
Boot via: installed image
System type: Xen HVM guest

Hardware vendor: Xen
Hardware model: HVM domU
Hardware S/N: 2f4fa905-0407-e5cc-11ab-3296d4ed829c
Hardware UUID: 2f4fa905-0407-e5cc-11ab-3296d4ed829c

Copyright: VyOS maintainers and contributors

Aug 22 11:41:41 R6 ripngd[1107]: interface delete ppp-P.sta22 vrf 0 index 3970 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:41 R6 bgpd[1093]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-P.sta22 in VRF 0
Aug 22 11:41:41 R6 ripd[1103]: interface delete ppp-P.sta22 vrf 0 index 3970 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:41 R6 ripngd[1107]: interface delete ppp-lot29 vrf 0 index 3972 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:41 R6 ripd[1103]: interface delete ppp-lot29 vrf 0 index 3972 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:41 R6 bgpd[1093]: Received signal 11 at 1598096501 (si_addr 0x8, PC 0x7f8295ab1a1c); aborting...
Aug 22 11:41:41 R6 bgpd[1093]: Backtrace for 13 stack frames:
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f8295aa23c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f8295aa2823]295aa23c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x73b85) [0x7f8295ac3b85]95aa2823]295aa23c7]
Aug 22 11:41:41 R6 bgpd[1093]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f82948af890]3b85]95aa2823]295aa23c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_rb_remove+0x16c) [0x7f8295ab1a1c]]295aa23c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_set_index+0xb3) [0x7f8295a992a3]295aa23c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_destroy_via_zapi+0x25) [0x7f8295a9a865]c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x9452e) [0x7f8295ae452e][0x7f8295a9a865]c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f8295ad1530]5a9a865]c7]
Aug 22 11:41:41 R6 ripngd[1107]: interface delete ppp121 vrf 0 index 3974 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:41 R6 ripd[1103]: interface delete ppp121 vrf 0 index 3974 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f8295aa0658]530]5a9a865]c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/frr/bgpd(main+0x322) [0x56354800cca2]run+0xd8) [0x7f8295aa0658]530]5a9a865]c7]
Aug 22 11:41:41 R6 bgpd[1093]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f8294511b45]530]5a9a865]c7]
Aug 22 11:41:41 R6 bgpd[1093]: /usr/lib/frr/bgpd(+0x3ad9c) [0x56354800ed9c]_main+0xf5) [0x7f8294511b45]530]5a9a865]c7]
Aug 22 11:41:41 R6 bgpd[1093]: in thread zclient_read scheduled from lib/zclient.c:3266#0127f8295ab1a1c); aborting...
Aug 22 11:41:41 R6 zebra[1085]: [EC 100663299] stream_read_try: read failed on fd 30: Connection reset by peer
Aug 22 11:41:41 R6 zebra[1085]: [EC 4043309121] Client 'vnc' encountered an error and is shutting down.
Aug 22 11:41:41 R6 watchfrr[1025]: [EC 268435457] bgpd state -> down : read returned EOF
Aug 22 11:41:41 R6 watchfrr[1025]: bgpd state -> up : connect succeeded
Aug 22 11:41:41 R6 zebra[1085]: [EC 100663299] stream_read_try: read failed on fd 20: Connection reset by peer
Aug 22 11:41:41 R6 zebra[1085]: [EC 4043309121] Client 'bgp' encountered an error and is shutting down.
Aug 22 11:41:41 R6 watchfrr[1025]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
Aug 22 11:41:42 R6 zebra[1085]: client 30 disconnected. 0 vnc routes removed from the rib
Aug 22 11:41:42 R6 zebra[1085]: client 20 disconnected. 920320 bgp routes removed from the rib
Aug 22 11:41:45 R6 ripngd[1107]: interface delete ppp-lot29 vrf 0 index 3974 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:45 R6 ripngd[1107]: interface delete ppp55 vrf 0 index 3972 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:45 R6 ripd[1103]: interface delete ppp-lot29 vrf 0 index 3974 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:45 R6 ripd[1103]: interface delete ppp55 vrf 0 index 3972 flags 0x10 metric 0 mtu 1492
Aug 22 11:41:46 R6 watchfrr[1025]: [EC 100663303] Forked background command [pid 38779]: /usr/lib/frr/watchfrr.sh restart bgpd
Aug 22 11:41:46 R6 zebra[1085]: client 20 says hello and bids fair to announce only bgp routes vrf=0
Aug 22 11:41:46 R6 zebra[1085]: client 32 says hello and bids fair to announce only vnc routes vrf=0
Aug 22 11:41:46 R6 watchfrr[1025]: bgpd state -> up : connect succeeded
Aug 22 11:43:48 R6 ripngd[1107]: interface delete ppp-kub2c2 vrf 0 index 3646 flags 0x10 metric 0 mtu 1480
Aug 22 11:43:48 R6 bgpd[38787]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-kub2c2 in VRF 0
Aug 22 11:43:48 R6 ripd[1103]: interface delete ppp-kub2c2 vrf 0 index 3646 flags 0x10 metric 0 mtu 1480
Aug 22 11:43:48 R6 ripngd[1107]: interface delete ppp121 vrf 0 index 3974 flags 0x10 metric 0 mtu 1492
Aug 22 11:43:48 R6 bgpd[38787]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp121 in VRF 0
Aug 22 11:43:48 R6 ripd[1103]: interface delete ppp121 vrf 0 index 3974 flags 0x10 metric 0 mtu 1492

Details

Difficulty level
Unknown (require assessment)
Version
1.2.6-epa1
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)

Event Timeline

It crashed again after 5 days in 1.2.6-epa1, in the same function, also when a dynamic PPPoE interface was deleted.
It happens less frequently after the former customers who repeatedly failed authentication have been physically disconnected.
Again, BGP no longer works after watchfrr has restarted the bgpd process. All works again after reboot.

[email protected]:~$ show version
Version: VyOS 1.2.6-epa1
Release Train: crux

Built by: Sentrium S.L.
Built on: Thu 13 Aug 2020 11:57 UTC
Build UUID: 4e509d46-5370-4be1-82bd-83ccf0f9690c
Build Commit ID: fddba995759ec4

Architecture: x86_64
Boot via: installed image
System type: Xen HVM guest

Hardware vendor: Xen
Hardware model: HVM domU
Hardware S/N: 2f4fa905-0407-e5cc-11ab-3296d4ed829c
Hardware UUID: 2f4fa905-0407-e5cc-11ab-3296d4ed829c

Copyright: VyOS maintainers and contributors

Aug 27 17:30:56 R6 bgpd[1139]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 383 in VRF 0
Aug 27 17:30:56 R6 bgpd[1139]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 383 in VRF 0
Aug 27 17:30:56 R6 ripd[1150]: interface delete ppp-rod24 vrf 0 index 383 flags 0x1090 metric 0 mtu 1492
Aug 27 17:30:56 R6 ripngd[1154]: interface delete ppp-rod24 vrf 0 index 383 flags 0x1090 metric 0 mtu 1492
Aug 27 17:30:56 R6 bgpd[1139]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-rod24 in VRF 0
Aug 27 17:31:00 R6 bgpd[1139]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 341 in VRF 0
Aug 27 17:31:00 R6 bgpd[1139]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 341 in VRF 0
Aug 27 17:31:00 R6 ripd[1150]: interface delete ppp-spr38 vrf 0 index 341 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:00 R6 ripngd[1154]: interface delete ppp-spr38 vrf 0 index 341 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:00 R6 bgpd[1139]: Received signal 11 at 1598549460 (si_addr 0x18, PC 0x7f0b80519a00); aborting...
Aug 27 17:31:00 R6 bgpd[1139]: Backtrace for 13 stack frames:
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f0b8050a337]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f0b8050a793]b8050a337]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x73af5) [0x7f0b8052baf5]8050a793]b8050a337]
Aug 27 17:31:00 R6 bgpd[1139]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f0b7f317890]baf5]8050a793]b8050a337]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_rb_remove+0x1e0) [0x7f0b80519a00]]b8050a337]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_set_index+0xb3) [0x7f0b80501223]b8050a337]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_destroy_via_zapi+0x25) [0x7f0b805027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x9449e) [0x7f0b8054c49e][0x7f0b805027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f0b805394a0]05027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f0b805085d8]4a0]05027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/frr/bgpd(main+0x322) [0x55da04918c82]run+0xd8) [0x7f0b805085d8]4a0]05027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f0b7ef79b45]4a0]05027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: /usr/lib/frr/bgpd(+0x3ad7c) [0x55da0491ad7c]_main+0xf5) [0x7f0b7ef79b45]4a0]05027e5]37]
Aug 27 17:31:00 R6 bgpd[1139]: in thread zclient_read scheduled from lib/zclient.c:3266#012x7f0b80519a00); aborting...
Aug 27 17:31:00 R6 watchfrr[1087]: [EC 268435457] bgpd state -> down : read returned EOF
Aug 27 17:31:00 R6 watchfrr[1087]: bgpd state -> up : connect succeeded
Aug 27 17:31:00 R6 zebra[1134]: [EC 100663299] stream_read_try: read failed on fd 30: Connection reset by peer
Aug 27 17:31:00 R6 zebra[1134]: [EC 4043309121] Client 'vnc' encountered an error and is shutting down.
Aug 27 17:31:00 R6 watchfrr[1087]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
Aug 27 17:31:00 R6 zebra[1134]: [EC 100663299] stream_read_try: read failed on fd 20: Connection reset by peer
Aug 27 17:31:00 R6 zebra[1134]: [EC 4043309121] Client 'bgp' encountered an error and is shutting down.
Aug 27 17:31:01 R6 zebra[1134]: client 30 disconnected. 0 vnc routes removed from the rib
Aug 27 17:31:02 R6 zebra[1134]: client 20 disconnected. 921348 bgp routes removed from the rib
Aug 27 17:31:02 R6 ripd[1150]: interface delete ppp-jmg22 vrf 0 index 300 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:02 R6 ripngd[1154]: interface delete ppp-jmg22 vrf 0 index 300 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:02 R6 ripd[1150]: interface delete ppp-rol81 vrf 0 index 425 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:02 R6 ripngd[1154]: interface delete ppp-rol81 vrf 0 index 425 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:02 R6 ripd[1150]: interface delete ppp-rod8 vrf 0 index 322 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:02 R6 ripngd[1154]: interface delete ppp-rod8 vrf 0 index 322 flags 0x1090 metric 0 mtu 1480
Aug 27 17:31:05 R6 watchfrr[1087]: [EC 100663303] Forked background command [pid 44334]: /usr/lib/frr/watchfrr.sh restart bgpd
Aug 27 17:31:05 R6 zebra[1134]: client 20 says hello and bids fair to announce only bgp routes vrf=0
Aug 27 17:31:05 R6 zebra[1134]: client 32 says hello and bids fair to announce only vnc routes vrf=0
Aug 27 17:31:06 R6 watchfrr[1087]: bgpd state -> up : connect succeeded

marekm changed Version from 1.2.5 to 1.2.6-epa1.Aug 27 2020, 6:21 PM

@marekm
Can you check your BGP configuration if "router-id" is declared?
Also, what is with interface names?

ppp-lot29
ppp-jmg22
ppp-rol81
ppp-rod8

Do you use scripts with renaming? How to reproduce it?

Can you check your BGP configuration if "router-id" is declared?
Also, what is with interface names?

I had no router-id declared for BGP (only for OSPF) but the default is correct, same as loopback IP:
[email protected]:~$ show ip bgp summary

IPv4 Unicast Summary:
BGP router identifier 91.224.224.6, local AS number 56523 vrf-id 0
BGP table version 1738397
RIB entries 1511443, using 265 MiB of memory
Peers 4, using 82 KiB of memory
Peer groups 1, using 64 bytes of memory

Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
91.224.224.1 4 56523 570672 1844 0 0 0 16:07:21 744250
91.224.224.2 4 56523 451471 1844 0 0 0 16:07:21 357147
91.224.224.4 4 56523 1088 1844 0 0 0 16:07:21 82
91.224.224.5 4 56523 2037 1844 0 0 0 16:07:21 150

Total number of neighbors 4

There is a full mesh of iBGP between 5 routers, .1 and .2 are two main routers at two different locations where upstreams and local infrastructure (DNS servers etc.) are connected, .4 and .5 (and now also .6 in a VM for testing) are PPPoE servers where customers are connected.

PPPoE interfaces are renamed by RADIUS, /etc/freeradius/3.0/sites-available/default contains:
post-auth {

update reply {
        Acct-Interim-Interval = 900
        NAS-Port-Id = "ppp-%{Stripped-User-Name}"
}

...

I've just had two different routers (one bare metal and one VM) crash roughly at the same time, triggered by many PPPoE sessions disconnecting at the same time due to a short power failure (routers itself had power all the time, but power was interrupted for about a minute to a switch on the network between the routers and PPPoE clients). Stack traces are very similar (absolute addresses differ, but the same functions and offsets in them). And again, each time watchfrr restarted bgpd but it was not working until reboot. No problems so far with two other BGP routers running a similar configu but without any dynamic interfaces (only OSPF and BGP, no PPPoE servers).

Hello @marekm, I think [ppp]unit-cache=n might help in this case, but the main issue in FRR. Do you want a package for the test with these improvements?

unit-cache=n
By default is disabled: unit-cache=0

Specifies the number of interfaces to keep in cache. It means that don’t destroy the interface after the corresponding session is destroyed, instead place it to cache and use it later for new sessions repeatedly. This should reduce kernel-level interface creation/deletion rate lack.

I tried unit-cache earlier but it seems to have issues too - I've seen duplicate routes if the same client (all have static IP assigned by RADIUS based on username) connects to a different PPPoE server and the old route is not removed, as if the cached (not removed) PPPoE interfaces were not seen as removed in FRR. But I haven't investigated this in more detail as it's a production setup, can't experiment too much on live customers.
I'm considering if I could go back to redistributing PPPoE customers /32 routes in OSPF instead of iBGP - it has been that way for a few years (using MikroTik, before moving to VyOS), but I've recently changed it following "BGP Best Current Practices" http://www.bgp4all.com.au/pfs/_media/workshops/05-bgp-bcp.pdf which recommends using OSPF only for infrastructure, not customers - seems logical to me as BGP was designed for much larger routing tables (all of the Internet), but perhaps OSPF is still good enough for just a few hundreds of customers.

Even with customers routes redistributed by OSPF instead of iBGP, it has just crashed again:

Aug 31 13:20:58 R6 bgpd[1073]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 480 in VRF 0
Aug 31 13:20:58 R6 bgpd[1073]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 480 in VRF 0
Aug 31 13:20:58 R6 ripd[1084]: interface delete ppp-cyn8m1 vrf 0 index 480 flags 0x1090 metric 0 mtu 1492
Aug 31 13:20:58 R6 ripngd[1088]: interface delete ppp-cyn8m1 vrf 0 index 480 flags 0x1090 metric 0 mtu 1492
Aug 31 13:20:58 R6 bgpd[1073]: Received signal 11 at 1598880058 (si_addr 0x8, PC 0x7f38e9b4198c); aborting...
Aug 31 13:20:58 R6 bgpd[1073]: Backtrace for 13 stack frames:
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f38e9b32337]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f38e9b32793]8e9b32337]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x73af5) [0x7f38e9b53af5]e9b32793]8e9b32337]
Aug 31 13:20:58 R6 bgpd[1073]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f38e893f890]3af5]e9b32793]8e9b32337]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_rb_remove+0x16c) [0x7f38e9b4198c]]8e9b32337]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_set_index+0xb3) [0x7f38e9b29223]8e9b32337]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(if_destroy_via_zapi+0x25) [0x7f38e9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x9449e) [0x7f38e9b7449e][0x7f38e9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f38e9b614a0]9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f38e9b305d8]4a0]9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/frr/bgpd(main+0x322) [0x55b79684fc82]run+0xd8) [0x7f38e9b305d8]4a0]9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f38e85a1b45]4a0]9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: /usr/lib/frr/bgpd(+0x3ad7c) [0x55b796851d7c]_main+0xf5) [0x7f38e85a1b45]4a0]9b2a7e5]37]
Aug 31 13:20:58 R6 bgpd[1073]: in thread zclient_read scheduled from lib/zclient.c:3266#0127f38e9b4198c); aborting...
Aug 31 13:20:58 R6 watchfrr[1018]: [EC 268435457] bgpd state -> down : read returned EOF
Aug 31 13:20:58 R6 zebra[1065]: [EC 100663299] stream_read_try: read failed on fd 30: Connection reset by peer
Aug 31 13:20:58 R6 zebra[1065]: [EC 4043309121] Client 'vnc' encountered an error and is shutting down.
Aug 31 13:20:58 R6 watchfrr[1018]: bgpd state -> up : connect succeeded
Aug 31 13:20:58 R6 watchfrr[1018]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
Aug 31 13:20:58 R6 zebra[1065]: [EC 100663299] stream_read_try: read failed on fd 20: Connection reset by peer
Aug 31 13:20:58 R6 zebra[1065]: [EC 4043309121] Client 'bgp' encountered an error and is shutting down.
Aug 31 13:20:59 R6 zebra[1065]: client 30 disconnected. 0 vnc routes removed from the rib
Aug 31 13:20:59 R6 zebra[1065]: client 20 disconnected. 922535 bgp routes removed from the rib
Aug 31 13:20:59 R6 ripngd[1088]: interface delete ppp-P.sta39a vrf 0 index 421 flags 0x1090 metric 0 mtu 1480
Aug 31 13:20:59 R6 ripd[1084]: interface delete ppp-P.sta39a vrf 0 index 421 flags 0x1090 metric 0 mtu 1480
Aug 31 13:20:59 R6 ripngd[1088]: interface delete ppp-rol40 vrf 0 index 646 flags 0x1090 metric 0 mtu 1480
Aug 31 13:20:59 R6 ripd[1084]: interface delete ppp-rol40 vrf 0 index 646 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:00 R6 ripd[1084]: interface delete ppp-psz78 vrf 0 index 9 flags 0x1090 metric 0 mtu 1500
Aug 31 13:21:00 R6 ripngd[1088]: interface delete ppp-psz78 vrf 0 index 9 flags 0x1090 metric 0 mtu 1500
Aug 31 13:21:00 R6 ripngd[1088]: interface delete ppp-gzd14m1 vrf 0 index 34 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:00 R6 ripd[1084]: interface delete ppp-gzd14m1 vrf 0 index 34 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:00 R6 ripd[1084]: interface delete ppp-psz74 vrf 0 index 436 flags 0x1090 metric 0 mtu 1492
Aug 31 13:21:00 R6 ripngd[1088]: interface delete ppp-psz74 vrf 0 index 436 flags 0x1090 metric 0 mtu 1492
Aug 31 13:21:00 R6 ripd[1084]: interface delete ppp-wes20 vrf 0 index 641 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:00 R6 ripngd[1088]: interface delete ppp-wes20 vrf 0 index 641 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:03 R6 watchfrr[1018]: [EC 100663303] Forked background command [pid 14587]: /usr/lib/frr/watchfrr.sh restart bgpd
Aug 31 13:21:03 R6 ripd[1084]: interface delete ppp-rolXnw vrf 0 index 658 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:03 R6 ripngd[1088]: interface delete ppp-rolXnw vrf 0 index 658 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:03 R6 zebra[1065]: client 20 says hello and bids fair to announce only bgp routes vrf=0
Aug 31 13:21:03 R6 zebra[1065]: client 30 says hello and bids fair to announce only vnc routes vrf=0
Aug 31 13:21:04 R6 watchfrr[1018]: bgpd state -> up : connect succeeded
Aug 31 13:21:04 R6 ripd[1084]: interface delete ppp-wes44n vrf 0 index 703 flags 0x1090 metric 0 mtu 1492
Aug 31 13:21:04 R6 ripngd[1088]: interface delete ppp-wes44n vrf 0 index 703 flags 0x1090 metric 0 mtu 1492
Aug 31 13:21:04 R6 bgpd[14595]: [EC 100663301] INTERFACE_ADDRESS_DEL: Cannot find IF 703 in VRF 0
Aug 31 13:21:04 R6 bgpd[14595]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-wes44n in VRF 0
Aug 31 13:21:04 R6 ripd[1084]: interface delete ppp-klr6 vrf 0 index 645 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:04 R6 ripngd[1088]: interface delete ppp-klr6 vrf 0 index 645 flags 0x1090 metric 0 mtu 1480
Aug 31 13:21:04 R6 bgpd[14595]: [EC 100663301] INTERFACE_STATE: Cannot find IF ppp-klr6 in VRF 0

So the BGP crash is triggered by removing interfaces, even if their routes are not redistributed (no BGP routes advertised at all, full table of the Internet received).

This test VM is single-homed so I've just added a static default route as a workaround, but this is not optimal in general - PPPoE servers other than this test VM have separate connections to each of two main BGP routers, so that customers have slower but still working Internet access even if one router fails.