
http api: "Failed to generate committed config"
Closed, Resolved · Public · BUG

Description

When using the HTTP API, I sometimes run into a "Failed to generate committed config" state. Only a reboot resolves the issue. I am using the current rolling version.

I am creating and deleting firewall rules, policies, BGP peers, and WireGuard tunnels. I mix delete and set commands within a single API call (in order to start from a clean state). I haven't seen a pattern yet for when the error occurs, but it eventually appears after some number of requests.
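
For illustration, a single call mixing both kinds of operations looks roughly like this (the paths and values here are made up for illustration, not taken from my actual config):

[
  {"op": "delete", "path": ["interfaces", "wireguard", "wg12"]},
  {"op": "set", "path": ["interfaces", "wireguard", "wg12", "address"], "value": "10.10.10.250/30"},
  {"op": "set", "path": ["interfaces", "wireguard", "wg12", "port"], "value": "11212"}
]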

I know this bug report is not very specific, but I am having a hard time pinning it down. Any idea how I can debug this?

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.3-rolling-202004020117
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)

Event Timeline

In this failure case, there are some expected "failed" lines in /var/log/vyatta/cfg-stdout.log. Although they are not detailed log messages, they may help narrow down the source of the failure; could you share them?

Does this get cleared on each reboot? Because I usually rebooted after it "locked".

cp[/opt/vyatta/config/tmp/new_config_1310]->[/opt/vyatta/config/tmp/tmp_1310/work]
no tap[/opt/vyatta/config/tmp/tmp_1310/active]
cp[/opt/vyatta/config/tmp/new_config_1310]->[/opt/vyatta/config/tmp/tmp_1310/active]
cp[/opt/vyatta/config/tmp/new_config_2811]->[/opt/vyatta/config/tmp/tmp_2811/work]
no tap[/opt/vyatta/config/tmp/tmp_2811/active]
cp[/opt/vyatta/config/tmp/new_config_2811]->[/opt/vyatta/config/tmp/tmp_2811/active]
cp[/opt/vyatta/config/tmp/new_config_2811]->[/opt/vyatta/config/tmp/tmp_2811/work]
no tap[/opt/vyatta/config/tmp/tmp_2811/active]
cp[/opt/vyatta/config/tmp/new_config_2811]->[/opt/vyatta/config/tmp/tmp_2811/active]

These are the only lines in /var/log/vyatta/cfg-stdout.log. I will try to reproduce and get the logs.

I couldn't find anything in /var/log/vyatta the last few times I've run into boot-time commit errors. It seems the Python code doesn't log commit errors to a file or to syslog, or at least I couldn't find them. I had to log into the console, enter configuration mode, load the config (usually a pre-migration one, when there are migrator script errors) and commit it; the errors are then printed to the console. Someone can correct me if I'm wrong.
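
In case it helps others, that console workaround is roughly this sequence (the load path is just an example; substitute whichever saved config you need to test):

configure
load /config/config.boot
commit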

Regarding the first question, yes, cfg-stdout.log is cleared on reboot, unfortunately. The commit error logging is currently a mixture of (un-verbose) reporting from the backend and limited reporting from Python; improvements are pending. Firstly, if there are migration errors, those need to be investigated; secondly, the error that you initially reported is generally related to stale information in the config hierarchy itself, but it may be obscuring earlier errors.

What migration errors are you seeing? If you were able to capture post-migration errors with your workaround, can you post them? Thanks.

Note: there is /opt/vyatta/etc/config/vyos-migrate.log (/config/vyos-migrate.log after image installation) that will list failed migration scripts, if any. It is created before /var/log is available, hence the non-canonical location.

Re: logging, DANOS mounts a persistent /var/log at boot (I think it's shared by all images). If any work on logging is going to be done, /var/log should be mounted before any VyOS config migration or load happens, so it can be used as the destination for logging.

@fetzerms I am able to reproduce this, in a manner that's not completely clean, but which will allow me to investigate further. Feel free to add any other details you run across; thanks.

@fetzerms I was mistaken: cfg-stdout.log is logrotated, but not removed on boot, and this is useful info. When you are able to reproduce, please share. I believe the corner case I am seeing is distinct but related to what you are seeing. Thanks.

I think I can pin it down to the mesh generation using WireGuard. Please see the attached logs from two failed nodes. I can't find older rotations of the log file; that's all I have.

The API calls look like this:

[{"op": "set", "path": ["interfaces", "wireguard", "wg12", "address"], "value": "10.10.10.250/30"},{"op": "set", "path": ["interfaces", "wireguard", "wg12", "fwmark"], "value": "100"},{"op": "set", "path": ["interfaces", "wireguard", "wg12", "port"], "value": "11212"},{"op": "set", "path": ["interfaces", "wireguard", "wg12", "peer", "somepeer", "allowed-ips"], "value": "0.0.0.0/0"},{"op": "set", "path": ["interfaces", "wireguard", "wg12", "peer", "somepeer", "address"], "value": "xxx.xxx.xxx.162"},{"op": "set", "path": ["interfaces", "wireguard", "wg12", "peer", "somepeer", "port"], "value": "11221"},{"op": "set", "path": ["interfaces", "wireguard", "wg12", "peer", "somepeer", "pubkey"], "value": "pubkey"},{"op": "set", "path": ["firewall", "group", "address-group", "WIREGUARD-PEERS-V4", "address"], "value": "xxx.xxx.xxx.xxx"}]

@fetzerms Thank you for the logs; that was the output I was hoping to confirm. The WSGI server in python3-bottle is showing occasional instabilities; I am not able to confirm this with collections of set commands similar to yours, though I am able to induce it artificially. There are a few recent changes that will help avoid triggering this while I track down the root cause. If you have a chance, it would be helpful if you could try a recent rolling release (>= vyos-1.3-rolling-202004100117). If you are able to monitor the output of 'journalctl -fx' on the system while executing remote configuration, that would be very useful, though it may not be feasible in your setting.

I can try on four nodes with a recent rolling release in the next few days. I just need to find a proper maintenance window.

@fetzerms I believe this is resolved by the fix in T1899, which will be in the next rolling release.

jestabro changed the task status from Open to Needs testing.May 5 2020, 3:00 AM
jestabro triaged this task as High priority.

@fetzerms Have you seen this problem with the fix mentioned above? If not, I will close.

@jestabro I installed new rolling images today and will verify tomorrow.

@jestabro Sorry for the delay; I did not find a proper maintenance window until now. Today I verified the fix and it seems to work. My automation scripts run through just fine, without the error.

So I think this issue could be marked as resolved.