
1.4.0-RC3 deleting portions of config in error (migration script)
Needs testing, High | Public | BUG

Description

In 1.4.0-rc3, entire portions of the config are being deleted when an interface is either not up yet, or no longer present. This behavior is not present in 1.4.0-rc1.

If an interface comes up after the migration script is run, like those created by containers, the migration script deletes not just the config related to the "missing" interface, but the entire parent config as well (MPLS, OSPF, etc.). This would also impact interfaces that may change due to the use of a USB Ethernet adapter, the pulling of an unused PCIe card from the system (causing interface numbering to change), and even missing logical interfaces like dummy interfaces. This is very undesirable behavior, as it means any remote reboot could cause an outage.

vyos@vyos# run show log | match migrat
Jan 26 17:14:07 vyos-router[974]: Starting VyOS router: migrate configure failed!

vyos@vyos# compare commands 0
delete interfaces ethernet eth10 mtu '1500'
delete protocols mpls interface 'eth10'
delete protocols ospf interface eth1 area '0'
delete protocols ospf interface eth10 area '0'
delete protocols ospf interface eth10 network 'point-to-point'
delete protocols pim interface eth1
delete protocols pim interface eth10
delete protocols pim rp address 10.0.0.1 group '224.0.0.0/4'

The interface that is not yet up in that example is eth10 (ZeroTier interface created from a container), but you can see the entire parent configs for OSPF and PIM are deleted. Nothing should be deleted, as eth10 will be up shortly.

NOTE: This is from a fresh installation of rc3, and a reboot. There is no actual migration occurring from an earlier version.

Details

Difficulty level
Unknown (require assessment)
Version
1.4.0-rc3
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Viacheslav changed the task status from Open to Needs testing.Jan 27 2024, 4:10 PM
Viacheslav triaged this task as High priority.

Firstly, note that this is a failure in boot configuration, and is not related to migration: the log output of vyos-router is misleading, reporting the steps following initialization. On success:
Starting VyOS router: migrate configure
on failure:
Starting VyOS router: migrate configure failed !

Adding vyos-config-debug to the boot options on first boot after installation will provide more information on the config error, if it is possible to try that; cf.:
https://docs.vyos.io/en/sagitta/contributing/debugging.html#kernel
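
To make the point about the log line concrete, here is a minimal sketch that splits such a line into the step names it reports and a failure flag. The helper name parse_router_line is hypothetical and not part of VyOS; the only assumption about the format is what the two sample lines above show. Note that the flag does not say which step failed.

```python
import re

# Hypothetical helper (not part of VyOS): parse a vyos-router boot log line.
LINE_RE = re.compile(r"Starting VyOS router:\s*(?P<rest>.*)$")

def parse_router_line(line):
    """Return (steps, failed) for a matching line, or None otherwise."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # "failed!" and "failed !" both appear in the thread; normalise them.
    words = match.group("rest").replace("!", " ").split()
    failed = "failed" in words
    steps = [word for word in words if word != "failed"]
    return steps, failed
```

The step list is the same on success and on failure, which is why the message reads as if migration itself had failed.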

Here's the trace. The main portions of concern are the container and OSPF configuration. eth10 will not be present upon boot, as it is created by the container. This is what is causing the failure.

root@vyos:/tmp# cat boot-config-trace 
Traceback (most recent call last):
  File "/usr/libexec/vyos/vyos-boot-config-loader.py", line 144, in <module>
    commit_out = session.commit()
                 ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/vyos/configsession.py", line 187, in commit
    out = self.__run_command([COMMIT])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/vyos/configsession.py", line 143, in __run_command
    raise ConfigSessionError(output)
vyos.configsession.ConfigSessionError:  Processing the Priority Queue
  Entering the _commit_check_cfg_node
   Executing the "system host-name vyos" ...
   Elapsed 0.006 sec: 
  Elapsed 0.006 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system host-name vyos" ...
   Elapsed 0.689 sec: 
  Elapsed 0.689 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "system console device ttyS0" ...
   Elapsed 0.009 sec: 
   Executing the "system console device ttyS0 speed 115200" ...
   Elapsed 0.008 sec: 
  Elapsed 0.018 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system console" ...
   Elapsed 0.758 sec: 
  Elapsed 0.758 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
  Elapsed 0.000 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system conntrack" ...
   Elapsed 0.557 sec: 
  Elapsed 0.557 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "interfaces loopback lo" ...
   Elapsed 0.006 sec: 
  Elapsed 0.006 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "interfaces loopback lo" ...
   Elapsed 0.366 sec: 
  Elapsed 0.366 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "interfaces ethernet eth0" ...
   Elapsed 0.009 sec: 
   Executing the "interfaces ethernet eth0 address dhcp" ...
   Elapsed 0.031 sec: 
   Executing the "interfaces ethernet eth0 hw-id 0c:d2:3b:a9:00:00" ...
   Elapsed 0.026 sec: 
  Elapsed 0.067 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "interfaces ethernet eth0" ...
   Elapsed 1.201 sec: 
  Elapsed 1.201 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "interfaces ethernet eth1" ...
   Elapsed 0.004 sec: 
   Executing the "interfaces ethernet eth1 hw-id 0c:d2:3b:a9:00:01" ...
   Elapsed 0.011 sec: 
  Elapsed 0.015 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "interfaces ethernet eth1" ...
   Elapsed 0.278 sec: 
  Elapsed 0.278 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "interfaces ethernet eth2" ...
   Elapsed 0.008 sec: 
   Executing the "interfaces ethernet eth2 hw-id 0c:d2:3b:a9:00:02" ...
   Elapsed 0.017 sec: 
  Elapsed 0.026 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "interfaces ethernet eth2" ...
   Elapsed 0.432 sec: 
  Elapsed 0.432 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "system syslog global facility all" ...
   Elapsed 0.009 sec: 
   Executing the "system syslog global facility all level info" ...
   Elapsed 0.009 sec: 
   Executing the "system syslog global facility local7" ...
   Elapsed 0.009 sec: 
   Executing the "system syslog global facility local7 level debug" ...
   Elapsed 0.009 sec: 
  Elapsed 0.037 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system syslog" ...
   Elapsed 0.823 sec: 
  Elapsed 0.823 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "system login user vyos" ...
   Elapsed 0.005 sec: 
   Executing the "system login user vyos authentication encrypted-password $6$rounds=656000$ocvDebsZwyFKVA1r$bgseK4PwPCZj62R6SmgrZ2O3iTdrOUdSIjw9Ks2qgad3nW0TClXPFU7n0lfGWDusSeXdrgkAC.rjLDBwadyro1" ...
   Elapsed 0.004 sec: 
  Elapsed 0.009 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system login" ...
   Elapsed 2.157 sec: 
  Elapsed 2.157 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "system name-server 4.2.2.2" ...
   Elapsed 0.128 sec: 
  Elapsed 0.128 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system name-server 4.2.2.2" ...
   Elapsed 0.241 sec: 
   Executing the "system name-server 4.2.2.2" ...
   Elapsed 0.244 sec: 
  Elapsed 0.485 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "system config-management commit-revisions 100" ...
   Elapsed 0.050 sec: 
  Elapsed 0.050 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "system config-management" ...
   Elapsed 0.033 sec: 
  Elapsed 0.033 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "container name zt1" ...
   Elapsed 0.006 sec: 
   Executing the "container name zt1 cap-add net-admin" ...
   Elapsed 0.005 sec: 
   Executing the "container name zt1 cap-add sys-admin" ...
   Elapsed 0.004 sec: 
  Elapsed 0.017 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "container" ...
   Elapsed 1.318 sec: 
  Elapsed 1.318 sec: _commit_exec_cfg_node
  Entering the _commit_check_cfg_node
   Executing the "protocols ospf interface eth1" ...
   Elapsed 0.020 sec: 
   Executing the "protocols ospf interface eth1 area 0" ...
   Elapsed 0.020 sec: 
   Executing the "protocols ospf interface eth10" ...
   Elapsed 0.017 sec: 
   Executing the "protocols ospf interface eth10 area 0" ...
   Elapsed 0.018 sec: 
  Elapsed 0.076 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "protocols ospf" ...
   Elapsed 0.039 sec: 
  Elapsed 0.039 sec: _commit_exec_cfg_node
[[protocols ospf]] failed**
  Entering the _commit_check_cfg_node
   Executing the "service ntp allow-client address 0.0.0.0/0" ...
   Elapsed 0.018 sec: 
   Executing the "service ntp allow-client address ::/0" ...
   Elapsed 0.020 sec: 
   Executing the "service ntp server time1.vyos.net" ...
   Elapsed 0.030 sec: 
   Executing the "service ntp server time2.vyos.net" ...
   Elapsed 0.016 sec: 
   Executing the "service ntp server time3.vyos.net" ...
   Elapsed 0.025 sec: 
  Elapsed 0.110 sec: _commit_check_cfg_node
  Entering the _commit_exec_cfg_node
   Executing the "service ntp" ...
   Elapsed 1.016 sec: 
  Elapsed 1.016 sec: _commit_exec_cfg_node
 Elapsed 10.738 sec: Commit execute priority tree
Commit failed

There are two primary issues with this:

  • The interface is not really missing, and will be up shortly. It's just not up yet when the OSPF configuration is applied. This would cause an outage in a production deployment.
  • While that is frustrating, the bigger problem is that the entire OSPF portion of the config is removed rather than just the offending line. eth1 is still up, but its configuration is removed along with any other OSPF configuration.
vyos@vyos# compare 0
[protocols]
- ospf {
-     interface eth1 {
-         area "0"
-     }
-     interface eth10 {
-         area "0"
-     }
- }

[edit]
vyos@vyos# show protocols ospf
Configuration under specified path is empty

@L0crian there are no changes between 1.4-rc1 and 1.4-rc3 that would explain this working in one and not the other, except perhaps a matter of timing in bringing the interface up (see below). If you can confirm that is the case and provide the container portion of your config, I can attempt a reproducer.

Regarding the general issue of support for hotplug devices/interfaces: that is not currently implemented or planned, though it has been considered. I'll consider what alternative approaches may be possible.

@jestabro I appreciate you looking into this. It does seem to be a timing issue. I checked on a few of my systems: it almost always happens in a VM on my slow server, but on a faster mini-PC it happens less often, though it does still happen on some reboots. Now that I have done more testing, it does happen on 1.4-rc1 as well. It does not, however, happen in 1.3.5.

I'll place my container config at the bottom, but a really easy way to reproduce this is to reference a dummy interface under a protocol config, commit;save, delete the dummy interface, and then commit;save and reboot. This is the same behavior as with the missing ethernet interface that the container creates. The router will reboot and perform commit-checks for the initial config and, finding the dummy interface not present, delete the entire config for that protocol section, rather than just the offending line.

1.3.5:

[email protected]:~$ configure 
[email protected]# set protocols mpls interface eth0
[email protected]# set protocols mpls interface dum0
[email protected]# set interfaces dummy dum0 address 10.0.0.1/32
[email protected]# commit;save
[email protected]# delete interfaces dummy 
[email protected]# commit;save
[email protected]:~$ reboot now

Welcome to VyOS - 1.3.5 ttyS0

[email protected]:~$ configure 
[email protected]# show protocols mpls 
 interface eth0
 interface dum0
[email protected]# compare 0
No changes between working and revision 0 configurations

1.4-rc3:

[email protected]:~$ configure 
[email protected]# set protocols mpls interface eth0
[email protected]# set protocols mpls interface dum0
[email protected]# set interfaces dummy dum0 address 10.0.0.1/32
[email protected]# commit;save
[email protected]# delete interfaces dummy 
[email protected]# commit;save
[email protected]# exit
[email protected]:~$ reboot now 

The system will reboot now!

Welcome to VyOS - 1.4-rc3 ttyS0

[email protected]:~$ configure 
WARNING: There was a config error on boot: saving the configuration now could overwrite data.
You may want to check and reload the boot config

[email protected]# show protocols mpls
Configuration under specified path is empty
[email protected]# compare commands 0
delete protocols mpls interface 'eth0'
delete protocols mpls interface 'dum0'

The container I'm using is ZeroTier, which will create an ethernet interface.
Container Config:

set container name zt1 allow-host-networks
set container name zt1 cap-add 'net-admin'
set container name zt1 cap-add 'sys-admin'
set container name zt1 device tun destination '/dev/net/tun'
set container name zt1 device tun source '/dev/net/tun'
set container name zt1 image 'zerotier/zerotier:latest'
set container name zt1 volume ZT_Path destination '/var/lib/zerotier-one'
set container name zt1 volume ZT_Path source '/config/containers/zt1'

Using a container is useful in creating tunnel interfaces from software like ZeroTier, Tailscale, NetBird, etc..., since the software can easily survive updating the VyOS version, but I understand that using these may be atypical for a lot of people.

A more common scenario where this would be problematic and outage causing is if someone had a PCIe network card which wasn't recognized upon a reboot. Rather than just delete the config associated with those interfaces, the entire config for a configured protocol could be missing. I would imagine a great deal of users probably have deployments similar to that.

I would love to see the container part of this be fixed, because that's very useful to me, but the more pressing issue in this is deleting far more of the config than just an offending line. Based on this output, it seems like the full config for that protocol section is checked, and if any line fails, it considers that full section invalid:

Entering the _commit_check_cfg_node
   Executing the "protocols ospf interface eth1" ...
   Elapsed 0.020 sec: 
   Executing the "protocols ospf interface eth1 area 0" ...
   Elapsed 0.020 sec: 
   Executing the "protocols ospf interface eth10" ...
   Elapsed 0.017 sec: 
   Executing the "protocols ospf interface eth10 area 0" ...
   Elapsed 0.018 sec: 
  Elapsed 0.076 sec: _commit_check_cfg_node

Hopefully that helps!

I raised this issue for discussion last week, and am testing more 'lenient' verification in cases such as above; I'll add results and plans here, as available.

Roger that, sounds good! Here's the napkin thought I had to solve the issue with interfaces that are not up at the time the config is checked (like those created by containers). Add something like a 'force-persist' switch under the interface config, which would create a list of interfaces that would be immune to being invalid under the config check (or at least the check that a configured interface exists). That way you can ensure it's an explicit action from the engineer rather than an accidental config/typo. So something like:

set interfaces ethernet eth0 force-persist
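
A minimal sketch of how such an exemption list could feed an existence check. Everything here is hypothetical: 'force-persist' is only the option proposed above, and missing_interfaces is an illustrative name, not a VyOS function.

```python
def missing_interfaces(referenced, present, force_persist=()):
    """List referenced interfaces that are absent and not explicitly exempted.

    referenced:    interface names used by a config section
    present:       interface names that currently exist on the system
    force_persist: names the operator marked with the proposed
                   'force-persist' option (hypothetical)
    """
    exempt = set(force_persist)
    return [name for name in referenced
            if name not in present and name not in exempt]
```

With referenced = ["eth1", "eth10"] and only eth1 present, the check reports eth10 unless eth10 is in the exemption list, so a typo in an interface name would still be caught by default.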

At least for FRR, it doesn't really care if you configure non-existent interfaces, so it's at least generally safe for sections that generate FRR configs:

interface dontdeleteme
 ip ospf area 0
 mpls enable
!
interface iwannalive
 ip ospf area 0
 mpls enable

The other half of the problem was removing the entire config section when an interface is absent. So if an interface on a PCIe adapter (or the whole adapter) were to fail, it would delete an entire config section rather than just the offending line. Maybe if a failure occurs in a section, fall back to a line-by-line validation for that section (this probably wouldn't scale to something like firewall rules, but might be fine for a section under protocols). I currently have a 25GbE card with one interface that doesn't show up, so somewhere in its lifetime it hit that failure condition.
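
The fallback idea can be sketched as follows. This is only an illustration of the proposal, not how VyOS verification works; validate_with_fallback and section_ok are hypothetical names, and the sketch assumes entries can be validated independently of one another, which will not hold for every config section.

```python
def validate_with_fallback(entries, section_ok):
    """Validate a section as a whole; on failure, retry entry by entry.

    entries:    the lines (here, interface names) in one config section
    section_ok: callable taking a list of entries, returning True if valid
    Returns (kept, rejected).
    """
    if section_ok(entries):
        return list(entries), []
    kept, rejected = [], []
    for entry in entries:
        # Re-run the same check on each entry in isolation.
        if section_ok([entry]):
            kept.append(entry)
        else:
            rejected.append(entry)
    return kept, rejected
```

The per-entry pass only runs after a whole-section failure, so the common case costs a single check.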

Just throwing out ideas; I can't promise that they're good ideas.

I'm curious, are the config checks at boot part of the added migration processes? Or were they added for a separate reason?

The configuration checks are the 'verify' stage of the respective config mode script; general structure here:
https://docs.vyos.io/en/sagitta/contributing/development.html#configuration-script-structure-and-behaviour

Unrelated to migration, each stage is run for configuration at boot.
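
As a rough illustration of why one failing check leaves a whole section unapplied, here is a simplified, self-contained model of the stage sequence. It is a sketch only: real config mode scripts use the vyos Python modules, whereas ConfigError below is a local stand-in, commit_node and the dictionary layout are hypothetical, and the assumption that a missing interface fails verification is taken from the behavior reported in this task.

```python
class ConfigError(Exception):
    """Local stand-in for the error a verify stage raises."""

def get_config(saved, present_interfaces):
    # Collect what the section needs; here just interface names.
    return {"ospf_interfaces": list(saved), "present": set(present_interfaces)}

def verify(config):
    # One failing check rejects the whole node, not a single line.
    for ifname in config["ospf_interfaces"]:
        if ifname not in config["present"]:
            raise ConfigError(f"Interface {ifname} does not exist")

def generate(config):
    return [f"interface {name}\n ip ospf area 0"
            for name in config["ospf_interfaces"]]

def apply(rendered, running):
    running.extend(rendered)

def commit_node(saved, present_interfaces, running):
    """Run the stages in order; return False if verification fails."""
    config = get_config(saved, present_interfaces)
    try:
        verify(config)
    except ConfigError:
        return False  # nothing from this node reaches the running state
    apply(generate(config), running)
    return True
```

Calling commit_node(["eth1", "eth10"], ["eth0", "eth1"], running) returns False and leaves running empty, which mirrors the empty 'protocols ospf' seen after boot even though eth1 was present.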

I wanted to add this post from the forums to this thread. It appears to be the same behavior where the failure of a single line deletes the logical section:
https://forum.vyos.io/t/upgrade-from-1-3-6-to-1-4-0-epa1-looses-complete-dhcp-configuration/13935

Config sections not applied after failed verification is not a surprise, as discussed above, though it is understandably frustrating: warnings are provided on failure and upon entering config-mode subsequent to failure, to help avoid overwriting the saved config before investigation. As we continue this discussion we should distinguish the various issues (and I will open subtasks as appropriate):
(1) missing config sections on migration need to be debugged, as with T6076, ongoing ...
(2) verification of 'late' arriving interfaces (general); configuring ZeroTier (specific): I'm looking into what may be considered for the general issue, as mentioned, and I have some initial ideas; thanks to the details provided by @L0crian I'll reproduce the specific configuration for experiment
(3) support for hotplugging interfaces: not planned at the moment, but may be discussed again for 1.5

@jestabro, do you guys consider the process of deleting the entire config section for a single offending line working as intended? Or do you consider it a design flaw/bug that will get fixed? I'm trying to understand if that is an intentional design choice and will permanently be part of VyOS going forward, or if it's just part of growing pains as VyOS moves from 1.3 to 1.4.

Well, nothing is deleted; rather, the config section is not applied if the verification steps do not pass. If the user ignores the error and warnings and saves the subsequent working configuration, the resident config.boot file will be overwritten. So I agree that it is an issue for consideration, if we distinguish those cases in which this may occur, as above. Considering those cases, we cannot, or would not want to, (1) automatically fix migration errors, in some sense or other, to avoid non-application of config sections on migration failure: bugs in the migration system should be fixed, and improvements to the migration system itself are part of the current development plans; (2) support non-verified/non-existent interfaces without clarification of when, if ever, that is appropriate, which is the subject of the current investigation. There is current development of the design to refine the verification stage, unrelated to this specific task. Given the preceding, I would not say that it is a case of either 'everything's fine' or 'design flaw'; 'design evolution' is perhaps the appropriate description.

Gotcha! The thing I keep struggling to understand is why is the full section of the config removed? This will 100% cause outages. If there's part of the config that is invalid, only that part should not be applied. You shouldn't remove 19 valid lines of config due to a single invalid line. Particularly with almost all of the instances I've seen of this, the invalid line is just cosmetically invalid, and doesn't cause any issues with the operation of the system.

That's the part I'm trying to understand whether it is intended behavior or not. I would think that isn't the intent and only invalid lines of config should be removed, but if it is intentional and will not change going forward, I want to know before I recommend 1.4 to any customer.

The short answer is that it will change going forward, and the work within this task, and the ongoing work mentioned above will avoid failing viable configs that have the symptom of 'full section of the config not applied'; that's what we are addressing here. The longer answer above attempted to distinguish several unrelated situations that have that same symptom, and in the extreme case of a user providing a corrupted, partial, or nonsensical config, there is not much that can be divined without some user feedback and suggested debugging: if critical dependencies are missing, that means that the system would be non-operational. What interests me here is some immediate improvements that may be made, and in proceeding with that, it is necessary to make these distinctions.

Perfect, that's what I was trying to better understand. Thanks for the clarification!

@jestabro @c-po

I tested https://github.com/vyos/vyos-1x/pull/3173 today, and it definitely helps the original use-case of this task. Configuration calling interfaces that come up post-boot, like those installed from a container, persist. The one problem is the interface level config is still deleted on-boot from some check that is performed.

Initial config without interface level config:

vyos@vyos# run show interfaces ethernet eth5 brief 
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth5             10.13.1.121/16                    u/u 

vyos@vyos# set protocols mpls interface 'eth5'
vyos@vyos# set protocols ospf interface eth5 area '0'
vyos@vyos# set protocols pim interface eth5
vyos@vyos# commit;save
vyos@vyos# exit
vyos@vyos# reboot now

Config calling the unconfigured interface is deleted from running config as expected:

vyos@vyos:~$ configure 
WARNING: There was a config error on boot: saving the configuration now could overwrite data.
You may want to check and reload the boot config
vyos@vyos# compare commands 0
delete protocols mpls interface 'eth5'
delete protocols ospf interface eth5 area '0'
delete protocols pim interface eth5

Interface level config:

vyos@vyos# rollback-soft 0
vyos@vyos# commit

vyos@vyos# set interfaces ethernet eth5 description 'ZeroTier'
vyos@vyos# commit;save
vyos@vyos# exit
vyos@vyos# reboot now

Config calling interface survives (protocol configs), but the interface level config is removed from running.

vyos@vyos:~$ configure 
WARNING: There was a config error on boot: saving the configuration now could overwrite data.
You may want to check and reload the boot config
[edit]
vyos@vyos# compare commands 0
delete interfaces ethernet eth5 description 'ZeroTier'

vyos@vyos# show | commands | match eth5
set protocols mpls interface 'eth5'
set protocols ospf interface eth5 area '0'
set protocols pim interface eth5

I'm not sure if it's a useful clue, but here is what is in the log:

Mar 25 12:58:08 vyos-configd[767]: Received message: {"type": "node", "last": false, "data": "VYOS_TAGNODE_VALUE=eth5/usr/libexec/vyos/conf_mode/interfaces_ethernet.py"}
Mar 25 12:58:08 vyos-configd[767]: 'Interface' object has no attribute 'iftype'

It is not an actual ethernet interface, but you name it as an ethernet interface.
You can check the output of: sudo ip link show type wireguard, or: sudo ip link show type tun

The interface type is Ethernet. The interface is being created by ZeroTier, which creates an interface of type ether (it creates a virtual LAN). I rename the interface to 'eth5' in this case so I can apply interface level configs like Description, MTU, VRF Assignment, etc...

If I revert back to the default ZT interface name, this is what is shown. You can see it is an Ethernet interface.

ztks5u3nzd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2800
        ether 4e:60:35:c5:5d:2d  txqueuelen 1000  (Ethernet)
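
The same check can be made programmatically. A minimal sketch, assuming a Linux system where /sys/class/net/<name>/type holds the numeric link-layer type (1 is ARPHRD_ETHER); is_ethernet is a hypothetical helper, and sysfs_root is a parameter only so the function can be exercised against a test directory.

```python
from pathlib import Path

ARPHRD_ETHER = 1  # link-layer type value Linux reports for Ethernet devices

def is_ethernet(ifname, sysfs_root="/sys/class/net"):
    """Return True if the interface exists and reports an Ethernet link type."""
    type_file = Path(sysfs_root) / ifname / "type"
    try:
        return int(type_file.read_text().strip()) == ARPHRD_ETHER
    except (FileNotFoundError, ValueError):
        # Interface not present (yet), or unexpected file content.
        return False
```

An interface that has not been created yet returns False here, which is the same "not present at check time" condition discussed in this task.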

The issue that I had previously is that I instantiate the ZeroTier interface using a container, so it'll be persistent across upgrades. But the verify checks are performed before the container fully starts, causing the interface to not be present and the verify checks to fail. This deleted whole blocks of config rather than individual lines, which is a separate issue.

c-po created a PR that helped this behavior, but it's not fully there for what I need it to do, though I don't think his change was specifically meant to solve this particular use-case.

Once the interface is up, I can configure anything under eth5, so my guess is that the failure on boot is an attempt to read 'iftype' before it is set, since the interface isn't up yet.
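
For reference, this is how that kind of message arises in Python generally. The class below is a hypothetical stand-in and not the real VyOS Interface class; the actual cause was not confirmed in this thread, so treat this purely as an illustration of the guess above.

```python
class Interface:
    """Hypothetical stand-in; NOT the real VyOS class."""

    def __init__(self, ifname, exists):
        self.ifname = ifname
        if exists:
            # Assumption for illustration: the attribute is only set
            # once the underlying device has been found.
            self.iftype = "ethernet"

def describe(iface):
    # Defensive read: returns None instead of raising AttributeError.
    return getattr(iface, "iftype", None)
```

Reading Interface("eth5", exists=False).iftype directly raises AttributeError with the text "'Interface' object has no attribute 'iftype'", which matches the form of the log line above; describe() returns None in that case.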

@L0crian thanks for the details: c-po's PR is part of the solution, but there is more to the story, as you point out.

@jestabro Thanks, a solution definitely seems close.