
FRR crashing triggered by RPKI
Closed, Resolved | Public | BUG

Description

I currently have a problem with FRR crashing in combination with RPKI. The router is running VyOS 1.2.4-epa1, but a similar error occurred when running 1.2.3. The same configuration did not show any errors with version 1.2.2. The RPKI validator used in the backend is Routinator.

Dec 13 08:39:09 rt-1 bgpd[1209]: [EC 100663314] Attempting to process an I/O event but for fd: 45(4) no thread to handle this!
Dec 13 08:43:01 rt-1 bgpd[1209]: Received signal 11 at 1576222981 (si_addr 0x2, PC 0x55aa62e883a5); aborting...
Dec 13 08:43:01 rt-1 bgpd[1209]: Backtrace for 11 stack frames:
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f9e60e523f7]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f9e60e52853]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x71305) [0x7f9e60e73305]
Dec 13 08:43:01 rt-1 bgpd[1209]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f9e5fc7c890]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x55aa62e883a5]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x5042) [0x7f9e5c0d7042]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f9e60e80b20]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f9e60e505d8]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(main+0x2ff) [0x55aa62e32b4f]
Dec 13 08:43:01 rt-1 bgpd[1209]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f9e5f8e3b45]
Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(+0x3cb6c) [0x55aa62e34b6c]
Dec 13 08:43:01 rt-1 bgpd[1209]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:509
Dec 13 08:43:01 rt-1 watchfrr[1154]: [EC 268435457] bgpd state -> down : read returned EOF
Dec 13 08:43:01 rt-1 watchfrr[1154]: bgpd state -> up : connect succeeded
Dec 13 08:43:01 rt-1 zebra[1202]: [EC 4043309116] Client 'vnc' encountered an error and is shutting down.
Dec 13 08:43:01 rt-1 watchfrr[1154]: [EC 268435457] bgpd state -> down : unexpected read error: Connection reset by peer
Dec 13 08:43:01 rt-1 zebra[1202]: [EC 4043309116] Client 'bgp' encountered an error and is shutting down.
Dec 13 08:43:01 rt-1 zebra[1202]: client 30 disconnected. 0 vnc routes removed from the rib
Dec 13 08:43:01 rt-1 zebra[1202]: client 27 disconnected. 77337 bgp routes removed from the rib
Dec 13 08:43:06 rt-1 watchfrr[1154]: [EC 100663303] Forked background command [pid 4044]: /usr/lib/frr/watchfrr.sh restart bgpd
Dec 13 08:43:06 rt-1 zebra[1202]: client 27 says hello and bids fair to announce only bgp routes vrf=0
Dec 13 08:43:06 rt-1 zebra[1202]: client 30 says hello and bids fair to announce only vnc routes vrf=0
Dec 13 08:43:06 rt-1 watchfrr[1154]: bgpd state -> up : connect succeeded
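
For anyone triaging a similar log, the backtrace can be reduced to its frames with a few lines of Python. This is only a sketch under stated assumptions: the names `Frame`, `parse_frames` and `crash_site` are made up for illustration, the regular expression matches only the first `object(symbol+0xoffset) [0xaddress]` group on each line and ignores anything after it, and the "first frame outside libfrr/libpthread" rule is a heuristic for skipping the signal-handling frames, not something FRR defines.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Frame(NamedTuple):
    obj: str               # binary or shared object the frame belongs to
    symbol: Optional[str]  # None when the log only shows "+0xOFFSET"
    offset: int
    address: int


# One frame as printed by bgpd: /path/to/object(symbol+0xOFF) [0xADDR]
FRAME_RE = re.compile(
    r"bgpd\[\d+\]: (?P<obj>/\S+?)\((?P<sym>[A-Za-z_]\w*)?\+0x(?P<off>[0-9a-f]+)\) "
    r"\[0x(?P<addr>[0-9a-f]+)\]"
)

# Heuristic: frames in these objects come from signal delivery and logging.
HANDLER_OBJECTS = ("libfrr.so", "libpthread.so")


def parse_frames(lines: Iterable[str]) -> List[Frame]:
    """Return the stack frames found in syslog lines, in log order."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append(
                Frame(m["obj"], m["sym"], int(m["off"], 16), int(m["addr"], 16))
            )
    return frames


def crash_site(frames: List[Frame]) -> Optional[Frame]:
    """Return the first frame that is not part of the signal-handling path."""
    for frame in frames:
        if not any(name in frame.obj for name in HANDLER_OBJECTS):
            return frame
    return None


# A few lines taken from the log in this report.
SAMPLE = [
    "Dec 13 08:43:01 rt-1 bgpd[1209]: Backtrace for 11 stack frames:",
    "Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f9e60e52853]",
    "Dec 13 08:43:01 rt-1 bgpd[1209]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f9e5fc7c890]",
    "Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x55aa62e883a5]",
    "Dec 13 08:43:01 rt-1 bgpd[1209]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x5042) [0x7f9e5c0d7042]",
]

if __name__ == "__main__":
    site = crash_site(parse_frames(SAMPLE))
    print(site.obj, site.symbol, hex(site.offset))
```

On the sample lines the heuristic lands on `bgp_table_range_lookup+0x65`, whose address equals the PC reported in the "Received signal 11" line, with the next frame up being the RPKI module.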

Details

Difficulty level: Unknown (require assessment)
Version: 1.2.4-epa1
Why the issue appeared?: Will be filled on close
Is it a breaking change?: Unspecified (possibly destroys the router)

Related Objects

Status    Subtype          Assigned  Task
Resolved  FEATURE REQUEST  c-po
Resolved  BUG              c-po

Event Timeline

MrXermon created this task. Dec 13 2019, 7:55 AM

Hi @MrXermon
Can you describe how we can reproduce this bug?
Can you share your configuration?

Actually, I'm currently unable to reproduce the bug: since I removed the RPKI configuration, everything works fine. Even more interesting, my second router with the exact same configuration does not have the problem.

maznu added a subscriber: maznu. Apr 17 2020, 8:18 PM

We had this bug earlier today on 1.2.4.

We updated to 1.2.5 and removed the configuration for the RTR servers (Routinator 3000), and that has been stable for ~12 hours now.

We are going to perform some tests to see if this is reproducible on a non-production router.

maznu added a comment. Apr 18 2020, 7:10 AM

We managed to reproduce this on a test instance running VyOS 1.2.4, speaking RTR to Routinator 3000 0.6.4:

Apr 18 07:06:16 test bgpd[918]: Backtrace for 11 stack frames:
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace_sigsafe+0x67) [0x7f0450c6a3d7]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_signal+0x113) [0x7f0450c6a833]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(+0x712e5) [0x7f0450c8b2e5]
Apr 18 07:06:16 test bgpd[918]: /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890) [0x7f044fa94890]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/frr/bgpd(bgp_table_range_lookup+0x65) [0x5579424c4415]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/modules/bgpd_rpki.so(+0x5042) [0x7f044beef042]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7f0450c98b00]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7f0450c685c8]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/frr/bgpd(main+0x2ff) [0x55794246eb4f]
Apr 18 07:06:16 test bgpd[918]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f044f6fbb45]
Apr 18 07:06:16 test bgpd[918]: /usr/lib/frr/bgpd(+0x3cb6c) [0x557942470b6c]
Apr 18 07:06:16 test bgpd[918]: in thread bgpd_sync_callback scheduled from bgpd/bgp_rpki.c:351
Apr 18 07:06:16 test watchfrr[878]: [EC 268435457] bgpd state -> down : read returned EOF

The configuration was simply:

policy {
    route-map in {
        rule 1 {
            action deny
            match {
                rpki invalid
            }
        }
        rule 2 {
            action permit
        }
    }
}
protocols {
    bgp 41495 {
        neighbor 46.227.201.1 {
            address-family {
                ipv4-unicast {
                    route-map {
                        import in
                    }
                }
            }
            ebgp-multihop 2
            remote-as 41495
        }
    }
    rpki {
        cache slm {
            address 46.227.201.12
            port 3323
        }
    }
}
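
To make explicit what the route-map in this configuration is meant to do, here is a simplified model of its evaluation. It is a sketch, not FRR's implementation: the `Rule` type and `evaluate` function are hypothetical names, and it assumes the usual route-map semantics (rules are tried in ascending order, the first matching rule decides, a rule without a match clause matches everything, and a route matching no rule is denied).

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    number: int
    action: str          # "permit" or "deny"
    rpki: Optional[str]  # RPKI state to match, or None for "match anything"


# Mirrors "policy route-map in" from the configuration above.
ROUTE_MAP_IN = [
    Rule(number=1, action="deny", rpki="invalid"),
    Rule(number=2, action="permit", rpki=None),
]


def evaluate(route_map: List[Rule], rpki_state: str) -> str:
    """Return the action of the first rule that matches the route."""
    for rule in sorted(route_map, key=lambda r: r.number):
        if rule.rpki is None or rule.rpki == rpki_state:
            return rule.action
    return "deny"  # assumed implicit deny when no rule matches


if __name__ == "__main__":
    for state in ("valid", "invalid", "notfound"):
        print(state, "->", evaluate(ROUTE_MAP_IN, state))
```

In this model only routes whose origin validation state is `invalid` are dropped; `valid` and `notfound` routes fall through to rule 2 and are accepted.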

maznu added a comment (edited). Apr 18 2020, 7:13 AM

This is looking like it might be fixed in FRR version 7.2.1 onwards:

https://github.com/FRRouting/frr/commit/5911f65c7bcb05ee81a744bdc8eec5bdae54a591

https://github.com/FRRouting/frr/compare/frr-7.2...stable/7.2 shows that commit 4093d1e is included in FRR versions after 7.2

  • VyOS 1.2.4 uses FRR 7.2 (vulnerable to crashing)
  • VyOS 1.2.5 uses FRR 7.3.1
  • VyOS 1.3-rolling-202002290910 uses FRR 7.3
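
The version comparison in the list above can be written down as a small check. This is a sketch that takes the comment at face value: it assumes the fix is present from FRR 7.2.1 onwards, uses the VyOS-to-FRR mapping exactly as listed in this comment, and says nothing about FRR releases older than 7.2. The names `VYOS_FRR`, `FIRST_FIXED` and `frr_has_fix` are made up for illustration.

```python
from typing import Tuple

# VyOS release -> bundled FRR version, as listed in this comment.
VYOS_FRR = {
    "1.2.4": "7.2",
    "1.2.5": "7.3.1",
    "1.3-rolling-202002290910": "7.3",
}

# Assumption from the comment: the crash is fixed from FRR 7.2.1 onwards.
FIRST_FIXED = (7, 2, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '7.3.1' into (7, 3, 1)."""
    return tuple(int(part) for part in text.split("."))


def frr_has_fix(frr_version: str) -> bool:
    """True if the FRR version is at or after the first fixed release."""
    return parse_version(frr_version) >= FIRST_FIXED


if __name__ == "__main__":
    for vyos, frr in VYOS_FRR.items():
        status = "has fix" if frr_has_fix(frr) else "affected"
        print(f"VyOS {vyos}: FRR {frr} ({status})")
```

Tuple comparison treats `(7, 2)` as lower than `(7, 2, 1)`, so FRR 7.2 is reported as affected while 7.3 and 7.3.1 are reported as having the fix.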

c-po added a subscriber: c-po. Apr 18 2020, 7:34 AM

From Slack:

Marek Isalski 09:25
If anybody is using 1.2.4 or earlier for BGP, and wants to do RPKI, I can highly recommend updating to 1.2.5 which fixes this nasty segfault in FRR's bgpd: https://phabricator.vyos.net/T1874 (and devs can probably consider closing T1874 now that we know it's fixed in 1.2.5)

c-po closed this task as Resolved. Apr 18 2020, 7:35 AM
c-po claimed this task.
c-po added a parent task: T1998: Update FRR to 7.3.
c-po moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus board.
c-po moved this task from Needs Triage to Finished on the VyOS 1.2 Crux (VyOS 1.2.5) board.