- User Since
- Apr 24 2019, 5:50 AM (90 w, 5 d)
Sun, Jan 3
Here are my conclusions about the last week's shenanigans.
And a slightly longer-term traffic graph, showing CPU usage vs traffic levels across VyOS 1.2.5 to 1.3-rolling on the same XL710 box:
@drac we're a typical ISP/NSP, with a fair amount of eyeball traffic behind us, so expecting to see a fairly high amount of UDP for QUIC (but it's not the bulk of our traffic on our VyOS boxes which are BGP peering/transit edge). Each of our six VyOS boxes is pushing around 300-500Mbit/sec, of which two have XL710 NICs (the rest are a mix of ixgbe and qlcnic).
Sat, Jan 2
@drac are you seeing Slab in /proc/meminfo gradually increasing before the panic? If so, the sourceforge post at the top recommends disabling TUPLE "acceleration". It seems that the more traffic you have, the quicker the crash. We were getting them every ~6 hours.
Amending /etc/snmp/snmpd.conf as follows got it working for me (albeit temporarily). Our snmp listen-address is 10.13.0.56 in this instance.
Similar issue for snmpd:
Fri, Jan 1
Alternatively, we've got an i40e VyOS box in production which is stable with:
i40e is a tyre fire.
Frustratingly, 2.13.10 seems to have some other — very nasty — bugs in it. We've had three kernel crashes on the latest VyOS 1.3 releases (from around Christmas) as a result, and I currently believe they are the same as those problems described here:
Sun, Dec 27
We had problems with ospf6d crashing on VyOS 1.3 using FRR 7.3 (from around August 2020). However, according to FRR #6086 and FRR #6735 this might have been fixed in FRR 7.5 (which is in latest/current VyOS 1.3).
Sep 1 2020
@Dmitry in various reboots and real-config-tests we've seen it settle in a few seconds, and we've seen it do 121 failed again today:
Aug 31 2020
As per @Dmitry's suggestions, I did exactly the above. Upon reboot it did not look promising:
Aug 29 2020
Any news on this one? Have posted some of the pain I've been having in T291 where VyOS is neither behaving as per documentation (match on hw-id) nor consistently across reboots.
Neither does VyOS have predicable interface names, nor does it behave as per VyOS' documentation.
According to documentation — https://wiki.vyos.net/wiki/Troubleshooting — specifying the hw-id of an interface should be tell udev (or similar) to ensure that the interface with the MAC-address specified gets the name of e.g. eth0.
May 6 2020
The good news is that this can be fixed with:
May 4 2020
Would love to see this resolved — a large (but reasonable) configuration doing IRR-based filtering from BGP peers took 9 hours to boot up.
We don't do any firewalling — we have lots of prefix-lists for filtering eBGP sessions. Right now we're looking at a router that's taken more than 1h20minutes to boot up — and it is still not finished — on modern Xeon CPUs. That's doubled in length since adding a prefix-list of around 5000 entries (roughly double the total number of prefix-list entries as before).
Apr 28 2020
We've got full IPv4 and IPv6 routing tables on our VyOS boxes, and we *definitely* needed to increase net.ipv6.route.max_size (we picked 256k to give us some headroom).
Apr 18 2020
While testing T1874 the procedure we followed was:
This is looking like it might be fixed in FRR version 7.2.1 onwards:
We managed to reproduce this on a test instance running VyOS 1.2.4 talking RTRR to Routinator3000 0.6.4:
Apr 17 2020
We saw something similar to this, but it seems like FRR eventually connected to RTRR. I think it has a timeout parameter — is that how often (slowly) it tries to re-establish?
We had this bug earlier today on 1.2.4.
Apr 4 2020
Can highly recommend: http://repo.saltstack.com/2019.2.html#debian (includes Jessie)
Mar 25 2020
I'm not expecting a persisted-across-reboots FRR config — hence suggesting tmpfs — so when the system boots there is nothing there. Obviously something would need to create the (empty) FRR config files in tmpfs before running FRR, otherwise I expect all the FRR daemons will fail to start.
We've seen this recently on bleeding-edge (yesterday's version) of 1.3. I'm currently investigating what tripped ospf6d, but I suspect it's going to be some Ubiquiti routers spewing their nasty OSPFv3 implementation.
Sep 29 2019
Agreed, I'm going to workaround with set system sysctl custom, but also submit a PR: https://github.com/vyos/vyatta-cfg-system/pull/107
…or, indeed, it'd be great to be able to restart FRR and have it get a new config when this happened just now:
Sep 23 2019
That's fixed the problem we had, but we've encountered some other strangeness.
Thank you, @c-po, I'll go deploy it now, then! :-)
Has this been merged into 1.2, or just 1.3? Because all of the 1.2-rolling images currently available from downloads.vyos.io right now have this bug in them :-(
MikroTik RouterOS supports something like this:
Why does this BGP neighbor need to be configred in the VyOS CLI? Wouldn't it be added automatically as a side-effect of wanting netflow data to have ASNs? Maybe add a flag to netflow, for those of us who are carrying full tables.
Having had bgpd peg a core to 100% (for no discernible reason), I'd welcome the ability to give quag^WFRR a kick, rather than rebooting the entire VyOS box.
We run ntop on a separate device, and export netflow data to the ntop/nprobe box from our routers (VyOS included). Would that work in your scenario too?
Symptoms which cause no configuration of the device after booting into 1.2:
PR to fix this: https://github.com/vyos/vyos-1x/pull/136