
OSPF - Process Crash after peer reboot
Open, Normal, Public, Bug

Description

I have a system with two routers acting as an HA Pair using VRRP.

I upgraded R2 from 1.1.8 to 1.2.0-rolling+201810190743 and it came up fine.

After upgrading R1 from 1.1.8 to 1.2.0-rolling+201810190743, the ospfd process appears to crash after the neighbor relationship gets deleted:

Oct 19 18:01:54 VYOS-R2 ospfd[1766]: AdjChg: Nbr 172.16.1.7 on eth3.1045:172.16.4.118: Full -> Deleted (InactivityTimer)
Oct 19 18:01:54 VYOS-R2 ospfd[1766]: nsm_change_state:(172.16.1.7, Full -> Deleted): scheduling new router-LSA origination
Oct 19 18:01:54 VYOS-R2 ospfd[1766]: Assertion `node' failed in file ospfd/ospf_packet.c, line 666, function ospf_write
Oct 19 18:01:54 VYOS-R2 zebra[1709]: [EC 4043309116] Client 'ospf' encountered an error and is shutting down.
Oct 19 18:01:54 VYOS-R2 zebra[1709]: client 41 disconnected. 39 ospf routes removed from the rib

I can privately share copies of the configuration on both routers if needed.

The same behavior seems to be present from 1.2.0-rc1 up to the current rolling release.

Details

Difficulty level: Unknown (requires assessment)
Version: 1.2.0-rc8
Why the issue appeared: Will be filled on close
mbailey created this task. Oct 19 2018, 10:12 PM
mbailey created this object in space S1 VyOS Public.

I did a bit of digging on my R1 after the reboot and I am seeing the following in the logs, so it looks like the crash is completely unrelated to the neighbor deletion (see the note after this log excerpt for what the MD5 sequence check is doing):

Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [EC 134217739] interface eth2.1032:172.16.4.110: ospf_check_md5 bad sequence 5333618 (expect 5333649)
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: message repeated 3 times: [ [EC 134217739] interface eth2.1032:172.16.4.110: ospf_check_md5 bad sequence 5333618 (expect 5333649)]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: Assertion `node' failed in file ospfd/ospf_packet.c, line 666, function ospf_write
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: Backtrace for 8 stack frames:
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 0] /usr/lib/libfrr.so.0(zlog_backtrace+0x3a) [0x7fef3efe9f8a]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 1] /usr/lib/libfrr.so.0(_zlog_assert_failed+0x61) [0x7fef3efea501]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 2] /usr/lib/frr/ospfd(+0x2f15e) [0x562e0c91815e]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 3] /usr/lib/libfrr.so.0(thread_call+0x60) [0x7fef3f00d430]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 4] /usr/lib/libfrr.so.0(frr_run+0xd8) [0x7fef3efe7938]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 5] /usr/lib/frr/ospfd(main+0x153) [0x562e0c901753]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fef3d83db45]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: [bt 7] /usr/lib/frr/ospfd(+0x190be) [0x562e0c9020be]
Oct 19 18:04:00 VYOS-R1 ospfd[1811]: Current thread function ospf_write, scheduled from file ospfd/ospf_packet.c, line 881
Oct 19 18:04:00 VYOS-R1 zebra[1771]: [EC 4043309116] Client 'ospf' encountered an error and is shutting down.
Oct 19 18:04:00 VYOS-R1 zebra[1771]: client 41 disconnected. 0 ospf routes removed from the rib
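
For context on the "bad sequence" lines: OSPF MD5 authentication carries a per-neighbor cryptographic sequence number that must never decrease (RFC 2328, Appendix D); the log fires when a received number is below the last one recorded for that neighbor. A rough C sketch of that check, for illustration only (names and structure are mine, not the verbatim FRR source):

/* Illustrative sketch of the OSPF MD5 anti-replay check (RFC 2328,
 * Appendix D); field and function names are invented, not FRR's. */
#include <stdint.h>
#include <stdio.h>

struct nbr_crypt_state {
    uint32_t last_seqnum; /* highest cryptographic seqnum accepted so far */
};

/* Returns 1 if the packet passes the sequence check, 0 if dropped. */
int check_md5_seqnum(struct nbr_crypt_state *nbr, uint32_t rx_seqnum)
{
    /* The sequence number must be non-decreasing per neighbor. */
    if (rx_seqnum < nbr->last_seqnum) {
        printf("ospf_check_md5 bad sequence %u (expect %u)\n",
               rx_seqnum, nbr->last_seqnum);
        return 0;
    }
    nbr->last_seqnum = rx_seqnum;
    return 1;
}

A peer that reboots restarts its counter, so its packets would fail this check until the stale neighbor state (and the stored sequence number) is torn down, which would line up with these errors clustering around reboots.
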
syncer added a subscriber: syncer. Oct 20 2018, 4:06 AM

Hi @mbailey,
thanks for reporting.
You can use the Paste app (on the left side of the menu); set visibility to subscribers, and add yourself and the maintainers as subscribers.
Also, do you know what the neighbor is running?
If you can reproduce it, it would be great to have a tcpdump capture.
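
For reference, isolating the OSPF exchange just means filtering on IP protocol 89 (OSPF runs directly on top of IP), e.g. tcpdump -i <iface> -w ospf.pcap 'ip proto 89'. A minimal libpcap sketch of the same capture, with the interface and output file name as illustrative placeholders:

/* Capture OSPF (IP protocol 89) to a pcap file; equivalent to
 *   tcpdump -i <iface> -w ospf.pcap 'ip proto 89'
 * Build with: gcc ospf_cap.c -lpcap */
#include <pcap.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    const char *dev = (argc > 1) ? argv[1] : "eth0"; /* placeholder iface */

    /* Open the interface in promiscuous mode with a full snap length. */
    pcap_t *ph = pcap_open_live(dev, 65535, 1, 1000, errbuf);
    if (ph == NULL) {
        fprintf(stderr, "pcap_open_live(%s): %s\n", dev, errbuf);
        return 1;
    }

    /* Keep only OSPF: IP protocol number 89. */
    struct bpf_program prog;
    if (pcap_compile(ph, &prog, "ip proto 89", 1, PCAP_NETMASK_UNKNOWN) == -1
        || pcap_setfilter(ph, &prog) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(ph));
        return 1;
    }

    /* Write matching packets to a pcap file until interrupted. */
    pcap_dumper_t *out = pcap_dump_open(ph, "ospf.pcap");
    if (out == NULL) {
        fprintf(stderr, "pcap_dump_open: %s\n", pcap_geterr(ph));
        return 1;
    }
    pcap_loop(ph, -1, pcap_dump, (u_char *)out);

    pcap_dump_close(out);
    pcap_close(ph);
    return 0;
}

Plain tcpdump with the same filter is equivalent; the sketch just shows what that one-liner does.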

syncer triaged this task as Normal priority. Oct 20 2018, 4:06 AM
syncer changed the subtype of this task from "Task" to "Bug". Oct 20 2018, 4:51 AM

Hey @syncer,

Sure, I can get you tcpdumps. Do you want to see the OSPF packets specifically?

Upstream from my two routers is an Ubiquiti router, downstream from them is another set of two VyOS routers.

It's a full mesh of P2P links between these two routers and the downstream VyOS routers.

I'll see if I can get a capture this weekend.

mbailey added a comment (edited). Fri, Nov 23, 10:10 PM

@syncer,

I managed to get some reproducible captures today. Life took me for a bit of a ride so I didn't have free time until just now.

When the full-mesh OSPF adjacencies are up, I am able to get the crash to happen on reboot of any one of the peer routers.

It can take a few reboots to trigger, but I've managed to reproduce it by rebooting any individual router in the mesh. At least one of the two 1.2.0-rc8 routers will hit the null assertion and tear down the ospfd process.

I've attached a zip file (F266853) containing:

  • (Sanitized) config backup of each router
  • Topology diagram
  • tail -f output of /var/frr/frr.log from when the process crashed
  • Packet capture of all OSPF packets from when the process crashed

I only collected the logs and captures from a single router; if you need them from the other two peers, let me know. I have the file shared specifically to you for privacy purposes. If any of the other maintainers need access, please let me know and I'll add them.

A couple of notes:

  • Right now I am running 1.1.8 on R1 & R2, and 1.2.0-rc8 on FW1 & FW2.
  • After rolling FW1 & FW2 back to 1.1.8, I can reboot all day without issues.
  • The crash does not always impact both 1.2.0-rcX peers.
  • The ospf_write code hits an assertion that expects oi_write_q to have at least one entry (see the sketch after this list).
  • ospf_packet.c has changed a decent amount between this FRR version and the Quagga version VyOS tracked, but I haven't dug in enough to determine whether any of those changes cause the issue.
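
To show what that assertion guards, here is a rough C sketch of the shape of ospf_write() as I read it from the backtrace. This is a reconstruction for discussion, assuming FRR's thread and list APIs; it is not the verbatim source of ospf_packet.c:

/* Reconstructed shape of the crashing path in ospfd (sketch only;
 * based on the backtrace, not copied from ospf_packet.c). */
static int ospf_write(struct thread *thread)
{
    struct ospf *ospf = THREAD_ARG(thread);
    struct ospf_interface *oi;
    struct listnode *node;

    /* The write event assumes some interface still has packets queued. */
    node = listhead(ospf->oi_write_q);
    assert(node); /* <-- the "Assertion `node' failed" at line 666 */
    oi = listgetdata(node);

    /* ... transmit the head packet for oi, then re-schedule this
     * write event if oi_write_q is still non-empty ... */
    return 0;
}

If that reading is right, the crash means the write event fired while oi_write_q was already empty, for example if the queue was drained when a neighbor or interface went away without the pending write event being cancelled.
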
mbailey renamed this task from OSPF Process Crash to OSPF - Process Crash after peer reboot. Fri, Nov 23, 10:35 PM
mbailey changed Version from 1.2.0-rolling+201810190743 to 1.2.0-rc8.
kroy added a subscriber: kroy. Tue, Dec 4, 4:16 PM

I'll add here that I've got a reasonably complex OSPF setup with around 10 hosts. I converted it over to VyOS when the first RC came out and I haven't seen this issue at all, even though I'm constantly rebooting hosts. I've currently upgraded the whole setup to RC10 and not a single host has crashed. It's worth adding that I've had a bunch of MikroTiks in the mix at times, with no problems there either.

So maybe it's some incompatibility between FRR and Quagga from mixing 1.1.8 and the RC builds? I don't know, but it's more information at least.

@kroy - I tried upgrading all routers to the same version and it ended quite badly: all four had their OSPF instances die.

I will see if I can replicate this in a closed environment. I have some spare resources on a VM host that I can use to spin up replicas of the existing VMs.

The only thing of note is that the two sets of routers (R1, FW1) and (R2, FW2) live on two separate Hyper-V hosts with switch-independent teaming to the upstream top-of-rack switches.

I also notice that all four routers log "ospf_check_md5 bad sequence" a couple hundred times a day.

I had a suboptimal VLAN-to-subnet mapping that I recently corrected; the correction appears to have had zero impact on either the ospf_check_md5 errors or the process crash.

Update for everyone - I built out an exact replica of my lab on a single Hyper-V host, with a single internal vSwitch connecting all interfaces (similar in principle, at least).

I let it run for about an hour and tried manually rebooting the hosts, but I wasn't able to reproduce the crash.

I am wondering if this might be related to my physical network topology connecting the Hyper-V hosts.