
FRR crashing triggered by RPKI
Closed, Resolved | Public | BUG

Description

I currently have a problem with FRR crashing in combination with RPKI. The router is running VyOS 1.2.4-epa1, but a similar error occurred when running 1.2.3. The same configuration did not show any errors with version 1.2.2. The RPKI validator used in the backend is Routinator.

Dec 13 08:39:09 rt-1 bgpd[1209]: [EC 100663314] Attempting to process an I/O event but for fd: 45(4) no thread to handle this!
Dec 13 08:43:01 rt-1 bgpd[1209]: Received signal 11 at 1576222981 (si_addr 0x2, PC 0x55aa62e883a5); aborting...
Dec 13 08:43:01 rt-1 bgpd[1209]: Backtrace for 11 stack frames:
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f9e60e523f7]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f9e60e52853]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x71305) [0x7f9e60e73305]
Dec 13 08:43:01 rt-1 bgpd[1209]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f9e5fc7c890]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x55aa62e883a5]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x5042) [0x7f9e5c0d7042]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f9e60e80b20]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f9e60e505d8]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(main+0x2ff) [0x55aa62e32b4f]
Dec 13 08:43:01 rt-1 bgpd[1209]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f9e5f8e3b45]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(+0x3cb6c) [0x55aa62e34b6c]
Dec 13 08:43:01 rt-1 bgpd[1209]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:509
Dec 13 08:43:01 rt-1 watchfrr[1154]: [EC 268435457] bgpd state -> down : read returned EOF
Dec 13 08:43:01 rt-1 watchfrr[1154]: bgpd state -> up : connect succeeded
Dec 13 08:43:01 rt-1 zebra[1202]: [EC 4043309116] Client 'vnc' encountered an error and is shutting down.
Dec 13 08:43:01 rt-1 watchfrr[1154]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
Dec 13 08:43:01 rt-1 zebra[1202]: [EC 4043309116] Client 'bgp' encountered an error and is shutting down.
Dec 13 08:43:01 rt-1 zebra[1202]: client 30 disconnected. 0 vnc routes removed from the rib
Dec 13 08:43:01 rt-1 zebra[1202]: client 27 disconnected. 77337 bgp routes removed from the rib
Dec 13 08:43:06 rt-1 watchfrr[1154]: [EC 100663303] Forked background command [pid 4044]: /usr/lib/frr/watchfrr.sh restart bgpd
Dec 13 08:43:06 rt-1 zebra[1202]: client 27 says hello and bids fair to announce only bgp routes vrf=0
Dec 13 08:43:06 rt-1 zebra[1202]: client 30 says hello and bids fair to announce only vnc routes vrf=0
Dec 13 08:43:06 rt-1 watchfrr[1154]: bgpd state -> up : connect succeeded
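When triaging a log like the one above, it can help to pull out the frame where the signal was raised. The following is a minimal Python sketch, not part of VyOS or FRR: it assumes the syslog line layout seen in this ticket, and the names `find_crash`, `SIGNAL_RE`, `FRAME_RE` and `SAMPLE` are hypothetical. It looks for the "Received signal" line and then for the backtrace frame whose address equals the reported PC.

```python
import re

# FRR logs the fatal signal together with the faulting program counter (PC).
SIGNAL_RE = re.compile(
    r"(\w+)\[\d+\]: Received signal (\d+) at \d+ "
    r"\(si_addr 0x[0-9a-f]+, PC (0x[0-9a-f]+)\)"
)
# One backtrace frame: path(symbol+offset) [address]; the symbol may be empty.
FRAME_RE = re.compile(r"\S+\((\w*)\+0x[0-9a-f]+\) \[(0x[0-9a-f]+)\]")


def find_crash(lines):
    """Return daemon, signal and faulting symbol for the first crash, or None."""
    crash = None
    for line in lines:
        if crash is None:
            m = SIGNAL_RE.search(line)
            if m:
                crash = {"daemon": m.group(1), "signal": int(m.group(2)),
                         "pc": m.group(3), "symbol": None}
            continue
        frame = FRAME_RE.search(line)
        # The frame whose address equals the PC is where the signal was raised.
        if frame and frame.group(2) == crash["pc"]:
            crash["symbol"] = frame.group(1) or None
            break
    return crash


# Three lines taken from the log in this ticket.
SAMPLE = [
    "Dec 13 08:43:01 rt-1 bgpd[1209]: Received signal 11 at 1576222981 (si_addr 0x2, PC 0x55aa62e883a5); aborting...",
    "Dec 13 08:43:01 rt-1 bgpd[1209]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f9e5fc7c890]",
    "Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x55aa62e883a5]",
]
print(find_crash(SAMPLE))
```

For this log the matching frame is `bgp_table_range_lookup`, called from the `bgpd_rpki.so` module, which is consistent with the crash only appearing when RPKI is configured.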

Details

Difficulty level
Unknown (require assessment)
Version
1.2.4-epa1
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)

Related Objects

Status    Subtype          Assigned  Task
Resolved  FEATURE REQUEST  c-po
Resolved  BUG              c-po

Event Timeline

Hi @MrXermon
Can you describe how we can reproduce this bug?
Can you share your configuration?

Actually, I'm currently unable to reproduce the bug: since I removed the RPKI configuration, everything works fine. Even more interesting, my second router with the exact same configuration does not have the problem.

We had this bug earlier today on 1.2.4.

We updated to 1.2.5 and removed the configuration for the RTR servers (Routinator 3000); that has been stable for ~12 hours now.

We are going to perform some tests to see if this is reproducible on a non-production router.

We managed to reproduce this on a test instance running VyOS 1.2.4 talking RTR to Routinator 3000 0.6.4:

Apr 18 07:06:16 test bgpd[918]: Backtrace for 11 stack frames:
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f0450c6a3d7]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f0450c6a833]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x712e5) [0x7f0450c8b2e5]
Apr 18 07:06:16 test bgpd[918]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f044fa94890]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x5579424c4415]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x5042) [0x7f044beef042]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f0450c98b00]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f0450c685c8]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/frr/bgpd(main+0x2ff) [0x55794246eb4f]
Apr 18 07:06:16 test bgpd[918]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f044f6fbb45]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/frr/bgpd(+0x3cb6c) [0x557942470b6c]
Apr 18 07:06:16 test bgpd[918]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:351
Apr 18 07:06:16 test watchfrr[878]: [EC 268435457] bgpd state -> down : read returned EOF

The configuration was simply:

policy {
    route-map in {
        rule 1 {
            action deny
            match {
                rpki invalid
            }
        }
        rule 2 {
            action permit
        }
    }
}
protocols {
    bgp 41495 {
        neighbor 46.227.201.1 {
            address-family {
                ipv4-unicast {
                    route-map {
                        import in
                    }
                }
            }
            ebgp-multihop 2
            remote-as 41495
        }
    }
    rpki {
        cache slm {
            address 46.227.201.12
            port 3323
        }
    }
}

This looks like it might be fixed from FRR version 7.2.1 onwards:

https://github.com/FRRouting/frr/commit/5911f65c7bcb05ee81a744bdc8eec5bdae54a591

https://github.com/FRRouting/frr/compare/frr-7.2...stable/7.2 shows that commit 4093d1e is included in FRR versions after 7.2

  • VyOS 1.2.4 uses FRR 7.2 (vulnerable to crashing)
  • VyOS 1.2.5 uses FRR 7.3.1
  • VyOS 1.3-rolling-202002290910 uses FRR 7.3
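The version reasoning above can be expressed as a small check. This is only a sketch under the assumptions made in this thread: that the fix is present from FRR 7.2.1 onwards and that FRR versions are plain dotted numbers. The names `parse_version`, `has_fix`, `FIXED_IN` and `BUNDLED_FRR` are hypothetical, and the check says nothing about older releases that predate the bug (1.2.2 showed no errors with the same configuration).

```python
def parse_version(text):
    """Turn a dotted version such as '7.3.1' into a comparable tuple (7, 3, 1)."""
    parts = tuple(int(p) for p in text.split("."))
    # Pad short versions so that '7.2' compares as (7, 2, 0).
    return parts + (0,) * (3 - len(parts))


# Assumption taken from this thread: the fix is present from 7.2.1 onwards.
FIXED_IN = parse_version("7.2.1")

# VyOS image -> bundled FRR version, as listed in the comment above.
BUNDLED_FRR = {
    "1.2.4": "7.2",
    "1.2.5": "7.3.1",
    "1.3-rolling-202002290910": "7.3",
}


def has_fix(frr_version):
    """True if the given FRR version is at or past the assumed fix version."""
    return parse_version(frr_version) >= FIXED_IN


for image, frr in BUNDLED_FRR.items():
    print(image, frr, "fixed" if has_fix(frr) else "affected")
```

Comparing tuples rather than strings avoids mis-ordering versions such as 7.10 and 7.2.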

From Slack:

Marek Isalski 09:25
If anybody is using 1.2.4 or earlier for BGP, and wants to do RPKI, I can highly recommend updating to 1.2.5 which fixes this nasty segfault in FRR's bgpd: https://phabricator.vyos.net/T1874 (and devs can probably consider closing T1874 now that we know it's fixed in 1.2.5)

c-po claimed this task.
c-po added a parent task: T1998: Update FRR to 7.3.
c-po moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus board.
c-po moved this task from Needs Triage to Finished on the VyOS 1.2 Crux (VyOS 1.2.5) board.