- User Since
- Sep 10 2018, 3:30 PM (71 w, 6 d)
Fri, Jan 24
Mon, Jan 13
The described problem exists in stable FRR 7.2 but is fixed in the FRR master branch by https://github.com/FRRouting/frr/pull/5184
We have tested 7.2 with this PR applied, and the bug was gone, so we can apply this PR to our FRR package to solve the problem.
FRR 7.0.1 (VyOS 1.2.3) had a bug due to which static routes were not updated (maybe not in all cases or environments) after a next-hop state change. In VyOS 1.2.4 we use stable FRR 7.2, which handles this situation without problems. An example (key points from the FRR debug log):
Jan 13 15:29:51 vyos zebra: 0:10.230.230.0/30: Adding route rn 0x5612ea69d1f0, re 0x5612ea69d370 (type 2)
Jan 13 15:29:51 vyos zebra: 0:10.230.230.0/30: Redist update re 0x5612ea69d370 (type 2), old (nil) (type -1)
Jan 13 15:29:51 vyos zebra: 0:10.230.230.0/30: Adding route rn 0x5612ea69d490, re 0x5612ea69e110 (type 2)
Jan 13 15:29:51 vyos zebra: 0:10.230.230.0/30: Redist update re 0x5612ea69e110 (type 2), old (nil) (type -1)
Jan 13 15:29:51 vyos zebra: NHT processing check for zvrf default
Jan 13 15:29:51 vyos zebra: 0:10.230.230.1/32: Evaluate RNH, type 0
Jan 13 15:29:51 vyos zebra: 0:10.230.230.1/32: NH resolved over route 10.230.230.0/30
Jan 13 15:29:51 vyos zebra: 0:10.230.230.1/32: Notifying client static about NH
Jan 13 15:29:51 vyos zebra: 0:192.168.20.1/32: Evaluate RNH, type 0

Jan 13 15:33:23 vyos zebra: 0:10.230.230.0/30: Adding route rn 0x5574620a18b0, re 0x5574620a1930 (connected)
Jan 13 15:33:23 vyos zebra: 0:10.230.230.0/30: Adding route rn 0x5574620a29b0, re 0x5574620a1850 (connected)
Jan 13 15:33:23 vyos zebra: 0:10.230.230.0/30 update_from_ctx(): no fib nhg
Jan 13 15:33:23 vyos zebra: 0:10.230.230.0/30 update_from_ctx(): rib nhg matched, changed 'true'
Jan 13 15:33:23 vyos zebra: 0:10.230.230.0/30: Redist update re 0x5574620a1930 (connected), old 0x0 (None)
Jan 13 15:33:23 vyos zebra: 0:10.230.230.1/32: Evaluate RNH, type Nexthop
Jan 13 15:33:23 vyos zebra: 0:10.230.230.1/32: NH resolved over route 10.230.230.0/30
Jan 13 15:33:23 vyos zebra: 0:10.230.230.1/32: Notifying client static about NH
Jan 13 15:33:23 vyos zebra: rib_add_multipath: 0:10.0.0.0/8: Inserting route rn 0x5574620a1b10, re 0x5574620a1a30 (static) existing (nil)
Jan 13 15:33:23 vyos zebra: 0:10.0.0.0/8: Adding route rn 0x5574620a1b10, re 0x5574620a1a30 (static)
Jan 13 15:33:23 vyos zebra: netlink_route_multipath(): RTM_NEWROUTE 10.0.0.0/8 vrf 0(254)
Jan 13 15:33:23 vyos zebra: netlink_route_multipath() (single-path): nexthop via 10.230.230.1 if 3(0)
Jan 13 15:33:23 vyos zebra: netlink_talk: netlink-dp (NS 0) type RTM_NEWROUTE(24), len=60 seq=10 flags 0x501
Jan 13 15:33:23 vyos zebra: 0:10.0.0.0/8 update_from_ctx(): no fib nhg
Jan 13 15:33:23 vyos zebra: 0:10.0.0.0/8 update_from_ctx(): rib nhg matched, changed 'true'
Jan 13 15:33:23 vyos zebra: 0:10.0.0.0/8: Redist update re 0x5574620a1a30 (static), old 0x0 (None)
So, configured static routes are updated properly.
Thu, Jan 2
Tue, Dec 31
The problem is fixed in 1.3.
Unfortunately, I cannot find any other reliable way to ensure that the vyos-hostsd service is running before vyos-router. In fact, vyos-hostsd must be running for the VyOS system to work properly, so we can look at this from another point of view: how do we keep all services operable after a vyos-router restart?
If you have any ideas that can help to decrease the overall impact of this situation, we would be happy to hear them.
Dec 19 2019
Thank you for drawing our attention to this issue! It is really bad that such a simple action as changing the hostname can in some cases lead to a crash of the whole router (in fact, this is not the only trigger, but it is an easy way to reproduce the problem).
The problem consists of several parts:
- In old systemd versions (which are used in Debian Jessie and VyOS 1.2) there is a problem: during a restart of systemd-journald, all pipes between this daemon and systemd services are disconnected.
- In vyos-hostsd, which is responsible for hostname and DNS and is controlled by systemd, we used print() for logging and debugging purposes without sufficient error handling.
So, when there is no pipe connection between vyos-hostsd and systemd-journald, vyos-hostsd is not able to print messages and crashes. :(
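To illustrate the second point, here is a minimal sketch of a print() replacement that survives a broken pipe. This is not the actual vyos-hostsd fix; the function name safe_print is hypothetical and chosen only for this example.

```python
import sys


def safe_print(message, stream=None):
    """Write one log line to the stream.

    Returns True on success and False if the stream is unusable,
    instead of letting the exception terminate the daemon.
    safe_print is a hypothetical helper, not part of vyos-hostsd.
    """
    stream = stream if stream is not None else sys.stdout
    try:
        stream.write(message + "\n")
        stream.flush()
        return True
    except (OSError, ValueError):
        # OSError covers BrokenPipeError (the reader, e.g. journald,
        # closed its end of the pipe); ValueError is raised when the
        # stream object has already been closed.
        return False
```

With a wrapper like this, a lost connection to the journal only costs the log message, and the service keeps running.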
Dec 18 2019
As far as I can see, NAT events can be recorded only by nfacctd, so this is not possible with the current way of capturing traffic (NFLOG + uacctd). Please correct me if I have missed something.
Dec 17 2019
Hello, @elbuit!
We are almost ready to release the rewritten flow-accounting, and maybe we will be able to include your request in it. Can you describe in more detail exactly which records you want to have? It would be good to see an example pmacct configuration for your case.
Dec 13 2019
Dec 9 2019
Thanks for the information, @trae32566! I would be happy to change this fix so that unwanted records cannot be placed in resolv.conf at all, but I cannot reproduce a situation like yours to collect enough diagnostic data to be sure of the reason for this behavior.
Dec 6 2019
I have tried multiple times to reproduce this with 1.2-rolling-201912060217, with no luck. It would be great if, together with the logs, you could provide a detailed description of the environment, because it is possible that even the CPU core count or memory size can lead to a condition in which dhclient-script cannot get proper values from the config and adds unwanted servers to resolv.conf.
Dec 5 2019
Could you provide the log output for a case when DNS servers received from DHCP appear in resolv.conf? As I understand it, this should happen immediately after boot.
Also, please check whether they are removed after the first DHCP lease renewal.
Nov 25 2019
Nov 12 2019
Nov 2 2019
Oct 25 2019
Oct 21 2019
Oct 8 2019
The BGP scan-time parameter is unneeded in current FRRouting and VyOS: modern next-hop tracking is used instead. You should avoid using this option. I have prepared a PR to delete this option and migrate old configurations where it exists:
Sep 24 2019
Sep 16 2019
Sep 11 2019
Sep 4 2019
IP6GRE tunnels are supported in 1.2-rolling-201909041703. You are welcome to test.
Aug 27 2019
Pull request for fixing this problem: https://github.com/vyos/vyatta-netflow/pull/4
Aug 26 2019
Aug 21 2019
The problem is in FRRouting itself. It can be reproduced in 7.0.1-20190820-04-g047efd6 and 7.1-20190820-02-g1ed807a, but in 7.2-dev-20190820-03-g9316c82 everything works as expected.
We should try to find which changes fixed this problem and apply them to one of the current stable FRR versions, or wait for the next stable release.
Aug 9 2019
I have added two PRs with some fixes and new features. The most valuable changes:
- Fixed a bug that prevented changing or deleting BFD peers with custom options. For example, when any of source address/interface or multihop was used, such peers could not be deleted or changed.
- Added configuration checks that should prevent adding the BFD option to BGP neighbors or peer-groups without a corresponding peer configuration in protocols bfd. If the BGP and BFD configurations are out of sync, BGP sessions could be very unstable.
- Added a configuration check that should prevent deleting peers from protocols bfd if they are still used in BGP.
- Some other small fixes and changes.
Also, several new options were added:
set protocols bfd peer IP echo-mode
set protocols bfd peer IP interval echo-interval
set protocols bgp ASN neighbor IP bfd check-control-plane-failure
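The consistency checks between protocols bgp and protocols bfd described in this comment can be sketched roughly as follows. This is only an illustration, not the code from the PRs: the function names and the use of plain sets of peer addresses are assumptions made for the example.

```python
def missing_bfd_peers(bgp_bfd_neighbors, bfd_peers):
    """Return BGP neighbors that have the bfd option enabled
    but no matching peer under 'protocols bfd'.

    Both arguments are iterables of peer addresses (hypothetical
    input format; the real scripts read the VyOS configuration).
    """
    return sorted(set(bgp_bfd_neighbors) - set(bfd_peers))


def can_delete_bfd_peer(peer, bgp_bfd_neighbors):
    """A BFD peer may be deleted only if no BGP neighbor still uses it."""
    return peer not in set(bgp_bfd_neighbors)


# Example: 192.0.2.2 uses BFD in BGP but is not configured in protocols bfd.
bgp = ["192.0.2.1", "192.0.2.2"]
bfd = ["192.0.2.1"]
problems = missing_bfd_peers(bgp, bfd)
```

A commit would be rejected when the list returned by the first function is not empty, or when the second function returns False for a peer that is being removed.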
Aug 1 2019
@mb300sd, please create a new task with a detailed description for BGP if there are still some problems with it.
@thinkl33t, the recommended way is to use dynamic-dns-update; all other ways are not recommended at this moment.
Confirmed: the problem is not reproducible anymore in 1.2.0-rolling+201908010337 with keepalived 1:2.0.17+vyos1.2.
@ekim, we have never encountered such a problem and cannot reproduce it in our environments. The best way to continue the investigation would be to get access to this installation. If this is possible, we could continue debugging.
Jul 18 2019
@mb300sd, yes, there is nothing newer yet. Please test when a new build is available. :)
The problem that led to the malformed hostname in the Hostname Capability was fixed in T1531. I am marking this as "Resolved" because the problem with DNS servers was also resolved, according to feedback.
Thank you, @ekim!
What exactly were you doing at the moment of this log record? I see a lot of scripts that are almost permanently doing something in the system. Is it possible that this system contains some custom scripts (in the /config/scripts folder, for example)?
If yes, you should check their execution schedule and whether they need to be modified for the 1.2 command syntax.
Jul 16 2019
Jul 13 2019
Hello, @ekim! I see now. This looks more like waiting due to I/O. Please do the following:
- Run sudo atop -w /tmp/atop-mon.log -a 5 60 in a dedicated terminal.
- Try to work for several minutes in the terminal. It must freeze during this time; otherwise, the collected test data will be useless.
- Wait until atop finishes its work (~5 min from the start).
- Copy the file /tmp/atop-mon.log from the router and send it to us for analysis.
Jul 10 2019
@ekim, could you make a screen recording of this CLI delay?
I have prepared the pull requests for fixing this bug. They add hooks for two situations:
- if VRRP configuration changed;
- if firewall settings for interface changed.
Jul 8 2019
Such a significant increase in boot time with the same configuration is very strange, but still possible: even small changes can easily cause such behavior if you have a huge or otherwise specific configuration. What is more strange is the CLI response time. Regardless of configuration, the CLI should work without visible delays, except for autocompletion or commit operations.
Could you check the current load of the host at the moment when the CLI is slow? We must be sure that the system is not overloaded.
Jun 28 2019
I confirm that the problem is not reproducible in 2.0.17. We should upgrade keepalived.
Jun 26 2019
I have checked the behavior in versions 4.3.5 and 4.4.1. The hostname information is still not synced from primary to secondary.
As I see from the information in the Debian bug report, it is about another bug, where the hostname is not rewritten after offering a lease. From the ISC-DHCP changelog:
@dongjunbo, please show the configuration of this router so we can check why the gcdomestic pool is not counted correctly.
Jun 24 2019
Provided configuration from the first message was successfully loaded in 1.2.0-rolling+201906240337.
@csalcedo, could you test the new rolling release to check whether the problem is solved for you too?
Have you tried current rolling releases to check whether the lease information view works correctly now?
The safest solution will be to wait for 2.0.17, test compatibility with VyOS again, and then update the keepalived package inside VyOS.
As I see from the current VyOS scripts, keepalived is restarted only at router startup or if all VRRP groups were deleted. In case of a configuration change we use reload, which is correct.
This means that we gain nothing from keeping state across a restart: there is no sense in keeping the states of deleted groups, and we have nothing to keep at first startup.
Jun 23 2019
Jun 22 2019
I agree with you on many arguments. I just wanted to express my point of view.
There are many differences between vendors: in terminology, in behavior, in technologies. This is a normal situation. The same applies to engineers: we are very diverse, and this is beautiful :).
I strongly disagree with mixing dummy/loopback at any level (CLI or under the hood). Now we have two different types of interfaces, dummy and loopback, with corresponding names. They are equivalent only if you use a dummy for /32 addresses. In the general case:
- A loopback interface will be used to reach any address inside the configured network;
- If an IP address is assigned to a dummy interface, the system will respond only to this address, not to the whole network.
Example, to be more precise:
set interfaces loopback lo address 192.168.8.1/24
set interfaces dummy dum1 address 192.168.9.1/24
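The difference can be modeled with a few lines of Python. This is only a model of the behavior described in this comment, not kernel code; the function name answers_locally is made up for the illustration.

```python
import ipaddress


def answers_locally(kind, configured, destination):
    """Model which destinations the router treats as its own.

    kind        -- "loopback" or "dummy"
    configured  -- address with prefix length, e.g. "192.168.8.1/24"
    destination -- address being tested
    (hypothetical helper that only mirrors the behavior described above)
    """
    iface = ipaddress.ip_interface(configured)
    dest = ipaddress.ip_address(destination)
    if kind == "loopback":
        # The whole configured network is reachable via the loopback.
        return dest in iface.network
    if kind == "dummy":
        # Only the exact assigned address is answered.
        return dest == iface.ip
    raise ValueError("unknown interface kind: " + kind)
```

For the two commands above, answers_locally("loopback", "192.168.8.1/24", "192.168.8.77") is True, while answers_locally("dummy", "192.168.9.1/24", "192.168.9.77") is False.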
Jun 21 2019
Jun 20 2019
Example of the output when the value is below 10000000:
vyos@test-06:~$ show firewall name TESTFW rule 50
The problem was fixed in https://phabricator.vyos.net/R6:97c5ad3dca756635e83eb3bf667f742457d85d74.
Jun 19 2019
Jun 5 2019
Jun 4 2019
May 30 2019
Problems with viewing the DHCP server status can be fixed with the following patch for show_dhcp.py:
--- orig/show_dhcp.py 2019-05-30 22:45:01.625708032 +0300
+++ T1416/show_dhcp.py 2019-05-30 22:40:33.302777881 +0300
@@ -55,15 +55,28 @@
 return data
May 22 2019
May 16 2019
The solution was tested and works fully.
@hagbard, everything works fine now. Thank you!
May 15 2019
diff -Naur origin/dhclient-script pull2/dhclient-script
--- origin/dhclient-script 2019-05-15 19:32:59.001598203 +0300
+++ pull2/dhclient-script 2019-05-15 19:33:47.533181873 +0300
@@ -39,7 +39,6 @@
 echo " " > $new_resolv_conf
 fi
May 14 2019
This is not the same.
set protocols static table 100 interface-route 10.100.100.0/24 next-hop-interface eth1
test-06# show running-config staticd
Building configuration...
May 13 2019
@hagbard's solution works. Please also add it to the stable branch and to the set protocols static table section.
May 3 2019
Apr 18 2019
Unfortunately, the current Linux GRE and network stack implementations don't support Cisco-style GRE keepalives (GRE inside GRE, with spoofed IP addresses). From the Linux point of view, those packets look like martians, and the kernel drops them; you can see information about this in the log.
Try disabling keepalives on the Cisco side; after that, the tunnel should be fully functional.
Apr 5 2019
Sorry, I must reopen this task. Exactly the same situation occurs with multiple "lower" interfaces:
OPTIONS="-6 -l ::%eth1.100-l ::%eth1.102 -u 2001:db8:0:feed::2%eth2.88 -u 2001:db8:0:feed::3%eth2.88 "
                          ^^ here
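The missing space between the two -l arguments is typical for options assembled by plain string concatenation. A minimal sketch of a safer way to build the string is shown below. It is not the actual VyOS script: the function name build_relay_options and its arguments are assumptions made for this example.

```python
def build_relay_options(lower, upper):
    """Build the OPTIONS value for the DHCPv6 relay.

    lower -- list of listen ("lower") interface specs, passed with -l
    upper -- list of upstream ("upper") specs, passed with -u
    (hypothetical helper; the real script reads these values from
    the VyOS configuration)
    """
    parts = ["-6"]
    for spec in lower:
        parts += ["-l", spec]
    for spec in upper:
        parts += ["-u", spec]
    # Joining a list guarantees exactly one space between all tokens,
    # no matter how many interfaces are configured.
    return " ".join(parts)


options = build_relay_options(
    ["::%eth1.100", "::%eth1.102"],
    ["2001:db8:0:feed::2%eth2.88", "2001:db8:0:feed::3%eth2.88"],
)
```

Collecting the tokens in a list and joining them once avoids having to remember a separator in every place where an option is appended.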