Page MenuHomePhabricator

OSPF - Process Crash after peer reboot
Needs testing, NormalPublicBUG

Description

I have a system with two routers acting as an HA Pair using VRRP.

I upgraded R2 from 1.1.8 to 1.2.0-rolling+201810190743 and it came up fine.

After upgrading R1 from 1.1.8 to 1.2.0-rolling+201810190743, the ospfd process appears to crash after the neighbor relationship gets deleted:

Oct 19 18:01:54 VYOS-R2 ospfd[1766]: AdjChg: Nbr 172.16.1.7 on eth3.1045:172.16.4.118: Full -> Deleted (InactivityTimer)
Oct 19 18:01:54 VYOS-R2 ospfd[1766]: nsm_change_state:(172.16.1.7, Full -> Deleted): scheduling new router-LSA origination
Oct 19 18:01:54 VYOS-R2 ospfd[1766]: Assertion `node' failed in file ospfd/ospf_packet.c, line 666, function ospf_write
Oct 19 18:01:54 VYOS-R2 zebra[1709]: [EC 4043309116] Client 'ospf' encountered an error and is shutting down.
Oct 19 18:01:54 VYOS-R2 zebra[1709]: client 41 disconnected. 39 ospf routes removed from the rib

I can privately share copies of the configuration on both routers if needed.

The same behavior seems to be present in 1.2.0-rc1 up to the current rolling release.

Details

Difficulty level
Unknown (require assessment)
Version
1.2.0-rc8
Why the issue appeared?
Will be filled on close

Event Timeline

mbailey created this task.Oct 19 2018, 10:12 PM
mbailey created this object in space S1 VyOS Public.

I did a bit of digging on my R1 after reboot and I am seeing the following logs in Zebra, so it looks like it's completely unrelated to the neighbor deletion:

Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [EC 134217739] interface eth2.1032:172.16.4.110: ospf_check_md5 bad sequence 5333618 (expect 5333649)
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: message repeated 3 times: [ [EC 134217739] interface eth2.1032:172.16.4.110: ospf_check_md5 bad sequence 5333618 (expect 5333649)]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: Assertion `node' failed in file ospfd/ospf_packet.c, line 666, function ospf_write
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: Backtrace for 8 stack frames:
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 0] /usr/lib/libfrr.so.0(zlog_backtrace+0x3a) [0x7fef3efe9f8a]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 1] /usr/lib/libfrr.so.0(_zlog_assert_failed+0x61) [0x7fef3efea501]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 2] /usr/lib/frr/ospfd(+0x2f15e) [0x562e0c91815e]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 3] /usr/lib/libfrr.so.0(thread_call+0x60) [0x7fef3f00d430]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 4] /usr/lib/libfrr.so.0(frr_run+0xd8) [0x7fef3efe7938]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 5] /usr/lib/frr/ospfd(main+0x153) [0x562e0c901753]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fef3d83db45]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 7] /usr/lib/frr/ospfd(+0x190be) [0x562e0c9020be]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: Current thread function ospf_write, scheduled from file ospfd/ospf_packet.c, line 881
Oct 19 18:04:00 VYOS-R1 zebra[1771]: [EC 4043309116] Client 'ospf' encountered an error and is shutting down.
Oct 19 18:04:00 VYOS-R1 zebra[1771]: client 41 disconnected. 0 ospf routes removed from the rib
syncer added a subscriber: syncer.Oct 20 2018, 4:06 AM

Hi @mbailey
thanks for reporting
you can use Paste app (on the left side of the menu)
set visibility to subscribers and set yourself and maintainers as subscribers
Also, do you know what runs neighbor?
If you can reproduce it, will be great to have tcpdump capture

syncer triaged this task as Normal priority.Oct 20 2018, 4:06 AM
syncer changed the subtype of this task from "Task" to "Bug".Oct 20 2018, 4:51 AM

Hey @syncer,

Sure I can get you tcpdumps. Do you want to see the ospf packets specifically?

Upstream from my two routers is an Ubiquiti router, downstream from them is another set of two VyOS routers.

It's full mesh of P2P links between these two routers and the downstream VyOS routers.

I'll see if I can get a capture this weekend.

pasik added a subscriber: pasik.Nov 4 2018, 11:21 AM
mbailey added a comment.EditedNov 23 2018, 10:10 PM

@syncer,

I managed to get some reproducible captures today. Life took me for a bit of a ride so I didn't have free time until just now.

When I have the full mesh OSPF adjacency I am able to get the crash to happen on reboot of any one of the peer routers.

It can take a few reboots to get it to trigger, but I've managed to reproduce it with rebooting any individual router in the mesh. At least one of the two 1.2.0-RC8 routers will hit the null assertion and tear down the ospf process.

I've attached a zip file containing (F266853):

  • (Sanitized) config backup of each router
  • Topology diagram
  • tail -f output of the /var/frr/frr.log when the process crashed
  • Packet capture of all ospf packets when the process crashed

I only collected the logs / captures from a single router. If you need it from the other two peers, let me know. I have it specifically shared out to you for privacy purposes. If any of the other maintainers need access, please let me know and I'll add you.

A couple of notes:

  • Right now I am running 1.1.8 on R1 & R2, and 1.2.0-rc8 on FW1 & FW2.
  • Rolling back FW1 & FW2 to 1.1.8, I can reboot all day without issues.
  • The crash does not always impact both 1.2.0-rcX peers
  • the ospf_write code seems to hit an assertion that is expecting the oi_write_q to be non-null.
  • ospf_packet.c seems to have changed a decent amount between the FRR version and tracked Vyos Quagga version, but I haven't really dug into it enough to determine if any of the changes are causing the issue
mbailey renamed this task from OSPF Process Crash to OSPF - Process Crash after peer reboot.Nov 23 2018, 10:35 PM
mbailey changed Version from 1.2.0-rolling+201810190743 to 1.2.0-rc8.
kroy added a subscriber: kroy.Dec 4 2018, 4:16 PM

I'll add here that I've got a reasonably complex OSPF setup with around 10 hosts. I converted it over to VyOS when the first RC came out and I haven't seen this issue at all, and I'm constantly rebooting hosts. Currently upgraded the whole setup to RC10 and not a single host crashed. It's worth adding that I've had a bunch of Mikrotiks in the mix at a time and no problem there either.

So maybe some incompatibility between FRR and Quagga due to 1.1.8 and RC? I don't know, but it's more information at least.

@kroy - I tried doing an upgrade to match all routers to the same version and it ended quite badly.. all four had their OSPF instance die.

I will see if I can replicate this in a closed environment. I have some spare resources on a VM host that I can use to spin up replicas of the existing VMs.

The only thing of note is that the two sets of routers (R1, FW1) and (R2, FW2) live on two separate Hyper-V hosts with switch independent teaming to the upstream top-of-rack switches.

I also notice all four routers frequently log "ospf_check_md5 bad sequence" a couple hundred times a day.

I had a suboptimal VLAN to subnet mapping that I recently corrected that appears to have zero impact on either the ospf_check_md5 or the process crash.

Update for everyone - I built out an exact replica of my lab on a single Hyper-V host with a single internal vSwitch connecting all interfaces (similar, in principle at least).

I let it run for about an hour and tried manually rebooting the hosts. Didn't seem like I was able to reproduce it.

I am wondering if this might be related to my physical network topology connecting the Hyper-V hosts.

syncer changed the task status from Open to Needs testing.Feb 7 2019, 11:33 PM

Can you retest on the latest rolling and see if that repeats?

@syncer Sorry for the delay.. life got a bit hectic the last few months.

I upgraded two of my routers to the Rolling version 2019-04-26-0337.

They've been running fairly well in the background for the last week. I just upgraded my third router (one of the two upstreams for each) and it looks like the OSPF process crashed again:

Here's the relevant log messages:

Apr 30 14:37:15 VYOS-R1 ospfd[1102]: AdjChg: Nbr 172.16.1.8 on eth2.414:172.16.4.102: Full -> Deleted (InactivityTimer)
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: nsm_change_state:(172.16.1.8, Full -> Deleted): scheduling new router-LSA origination
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: Assertion `node' failed in file ospfd/ospf_packet.c, line 666, function ospf_write
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: Backtrace for 8 stack frames:
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 0] /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(zlog_backtrace+0x3a) [0x7fd2d352266a]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 1] /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(_zlog_assert_failed+0x61) [0x7fd2d3522be1]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 2] /usr/lib/frr/ospfd(+0x2f5e6) [0x55ae8e2a85e6]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 3] /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(thread_call+0x60) [0x7fd2d354d4a0]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 4] /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0(frr_run+0xd8) [0x7fd2d351ffa8]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 5] /usr/lib/frr/ospfd(main+0x153) [0x55ae8e291a13]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fd2d1fb8b45]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: [bt 7] /usr/lib/frr/ospfd(+0x193be) [0x55ae8e2923be]
Apr 30 14:37:15 VYOS-R1 ospfd[1102]: Current thread function ospf_write, scheduled from file ospfd/ospf_packet.c, line 4030
Apr 30 14:37:15 VYOS-R1 watchfrr[1049]: [EC 268435457] ospfd state -> down : read returned EOF
Apr 30 14:37:15 VYOS-R1 zebra[1081]: [EC 4043309117] Client 'ospf' encountered an error and is shutting down.
Apr 30 14:37:15 VYOS-R1 zebra[1081]: client 47 disconnected. 33 ospf routes removed from the rib
Apr 30 14:37:20 VYOS-R1 watchfrr[1049]: [EC 100663303] Forked background command [pid 28604]: /usr/lib/frr/watchfrr.sh restart ospfd
Apr 30 14:37:20 VYOS-R1 watchfrr.sh[28608]: Reading deprecated /etc/frr/daemons.conf.  Please move its settings to /etc/frr/daemons and remove it.
Apr 30 14:37:20 VYOS-R1 watchfrr.sh[28612]: watchfrr_options contains a bash array value. The configured value is intentionally ignored since it is likely wrong. Please remove or fix the setting.
Apr 30 14:37:20 VYOS-R1 watchfrr.sh[28615]: Cannot stop ospfd: pid 1102 not running
Apr 30 14:37:20 VYOS-R1 zebra[1081]: client 47 says hello and bids fair to announce only ospf routes vrf=0
Apr 30 14:37:20 VYOS-R1 watchfrr[1049]: ospfd state -> up : connect succeeded

It looks like watchfrr is correctly restarting the OSPF daemon, but it doesn't seem to be operational, despite the process being alive.

The process crash in the first place is problematic, but if somehow we could get watchfrr to work correctly it would be an acceptable workaround until I can diagnose the error condition:

root@VYOS-R1:~# ps aux | grep ospf
root      1049  0.0  0.5  51412  3820 ?        Ss   Apr29   0:43 /usr/lib/frr/watchfrr -d zebra bgpd ripd ripngd ospfd ospf6d staticd
frr       1108  0.0  1.0  85960  8152 ?        Ss   Apr29   0:05 /usr/lib/frr/ospf6d -d --daemon -A ::1 -M snmp
frr      28617  0.0  0.7  84452  5444 ?        Ss   14:37   0:00 /usr/lib/frr/ospfd -d --daemon -A 127.0.0.1 -M snmp
root     29054  0.0  0.3  12736  2280 pts/0    S+   14:44   0:00 grep ospf

@syncer Which version of frrouting is VyOS currently built against?

It looks like the upstream fixed this issue way back in October: https://github.com/FRRouting/frr/commit/4bb420ab81b8d9cc1e7f9cde4729392bd2b12f1d#diff-7a06c247948630d4c8ef53bab09dd349

I can't seem to find out what version of the upstream is being used though.