
Many DNAT rules make the VyOS HTTP API crash; even the showConfig op times out
Closed, ResolvedPublic

Description

VyOS info:

  • Hypervisor: VMware and Xen HVM, same problem with both / doesn't seem to matter.
  • VM specs: 1 vcpu, 1 GB RAM, 2x network interfaces.
  • Hardware: older lab hardware running Intel Xeon 5600 series CPU, but same problem seems to also happen on modern hardware.
  • Version: vyos-1.3-beta-202101260443-amd64.iso

I have a scenario where thousands of DNAT rules are needed, and it seems that as few as 215 DNAT rules completely break the VyOS HTTP API: even the showConfig op/request breaks once only 215 DNAT rules have been added to the configuration.

The VyOS HTTP API is enabled with:

set service https api debug
set service https api keys id MY-HTTP-API-ID key MY-HTTP-API-PLAINTEXT-KEY

The VyOS configuration is otherwise very simple: the SSH service is enabled and IPs are configured for the two network interfaces, and that's all.

Steps to reproduce the problem:

First, verify that the HTTP API showConfig request works with curl:

linux# time curl -k -X POST -F data='{"op": "showConfig", "path": []}' -F key=MY-HTTP-API-PLAINTEXT-KEY https://vyos-ip/retrieve 1> /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1199  100   911  100   288   2436    770 --:--:-- --:--:-- --:--:--  2435

real    0m0.406s
user    0m0.087s
sys     0m0.121s

All good; let's add 50 DNAT rules:

vyos@vyos:~$ configure 
[edit]
vyos@vyos# for i in `seq 1 50`; do set nat destination rule $i description "test $i"; set nat destination rule $i protocol tcp; set nat destination rule $i destination port $i; set nat destination rule $i translation port $i; set nat destination rule $i inbound-interface eth0; set nat destination rule $i translation address 192.168.1.5 ; done
[edit]
vyos@vyos# commit

Then check the VyOS HTTP API showConfig op:

linux# time curl -k -X POST -F data='{"op": "showConfig", "path": []}' -F key=MY-HTTP-API-PLAINTEXT-KEY https://vyos-ip/retrieve 1> /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9699  100  9411  100   288   4933    150  0:00:01  0:00:01 --:--:--  4932

real    0m1.939s
user    0m0.097s
sys     0m0.114s

1.9 seconds, not too bad. Let's add 50 more DNAT rules and check again with 100 DNAT rules total:

linux# time curl -k -X POST -F data='{"op": "showConfig", "path": []}' -F key=MY-HTTP-API-PLAINTEXT-KEY https://vyos-ip/retrieve 1> /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18203  100 17915  100   288   5317     85  0:00:03  0:00:03 --:--:--  5319

real    0m3.401s
user    0m0.092s
sys     0m0.118s

3.4 seconds. Let's add 100 more, for a total of 200 DNAT rules:

linux# time curl -k -X POST -F data='{"op": "showConfig", "path": []}' -F key=MY-HTTP-API-PLAINTEXT-KEY https://vyos-ip/retrieve 1> /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 35603  100 35315  100   288   5386     43  0:00:06  0:00:06 --:--:--  8017

real    0m6.587s
user    0m0.081s
sys     0m0.129s

6.5 seconds. OK, now let's try with 300 DNAT rules total:

linux# time curl -k -X POST -F data='{"op": "showConfig", "path": []}' -F key=MY-HTTP-API-PLAINTEXT-KEY https://vyos-ip/retrieve
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   471    0   183    0   288      0      0 --:--:--  0:10:00 --:--:--     0
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.14.2</center>
</body>
</html>

real    10m0.321s
user    0m0.111s
sys     0m0.123s

So for some reason, with 300 DNAT rules added, the request now takes up to 10 minutes (!) before nginx times it out. I wonder what's going on here? While the curl showConfig request is waiting (with nothing happening), the VyOS VM seems mostly idle: "top" shows the CPU 99% idle most of the time and no suspicious processes running.

To reproduce the problem, you can of course add 300 DNAT rules directly with the following:

vyos@vyos# for i in `seq 1 300`; do set nat destination rule $i description "test $i"; set nat destination rule $i protocol tcp; set nat destination rule $i destination port $i; set nat destination rule $i translation port $i; set nat destination rule $i inbound-interface eth0; set nat destination rule $i translation address 192.168.1.5 ; done
[edit]
vyos@vyos# commit

Then try fetching the configuration using the VyOS HTTP API showConfig op.

I tried different numbers of DNAT rules, and it seems 215 DNAT rules is already too much: that also results in the nginx gateway timeout at 10 minutes.

214 DNAT rules work OK and I get the showConfig op response in 7.5 seconds, but adding one more DNAT rule, to a total of 215, makes the HTTP API get stuck and hit the timeout at 10 minutes.

Any idea where the problem might be?

For reference here's the full vyos configuration *before* adding the DNAT rules using the script above:

vyos@vyos:~$ show configuration | strip-private
interfaces { 
    ethernet eth0 {
        address dhcp
        hw-id XX:XX:XX:XX:XX:b7
    }
    ethernet eth1 {
        address 192.168.1.1/24
        hw-id XX:XX:XX:XX:XX:0f
    }
    loopback lo {
    }
}
service {
    https {  
        api {
            debug
            keys {
                id MY-HTTP-API-ID {
                    key xxxxxx
                }
            }
        }
    }
    ssh {
    }
}
system {
    config-management {
        commit-revisions 100
    }
    console {
        device ttyS0 {
            speed 115200
        }
    }
    host-name xxxxxx
    login {  
        user xxxxxx {
            authentication {
                encrypted-password xxxxxx
                plaintext-password xxxxxx
            }
        }
    }
    name-servers-dhcp eth0
    ntp {
        server xxxxx.tld {
        }
        server xxxxx.tld {
        }
        server xxxxx.tld {
        }
    }
    syslog { 
        global {
            facility all {
                level info
            }
            facility protocols {
                level debug
            }
        }
    }
}

Details

Difficulty level
Unknown (require assessment)
Version
vyos-1.3-beta-202101260443-amd64.iso
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Perfectly compatible

Event Timeline

pasik created this object in space S1 VyOS Public.
pasik updated the task description.

Any comments here? Is this a known issue/limitation? I can look into the internals and do some debugging if this should be working in the current state.

@pasik this is not a known issue, and your report is much appreciated. I have not drilled down into the problem yet; any debugging that you have a chance to do is welcome!

It could be an nginx limitation; some timeouts may need to be changed or tuned.

@Viacheslav , good point, and unfortunately, that's not an exact science: cf. T2252, wherein the timeout was raised to 10 minutes, the time of the timeout reported above.

Yeah, I don't think it's really a timeout issue, because of this:

"214 DNAT rules works OK and I get the showConfig op response in 7.5 seconds, but adding one more DNAT rule to total of 215 DNAT rules makes the http api get stuck and hit the timeout @ 10 mins."

There's some bug here that needs to be figured out.

And thanks for the replies btw!

It seems that the stall happens in ConfigSession.show_config(), which calls /bin/cli-shell-api showConfig.

With 500 NAT rules, showConfig runs in a timely manner when invoked manually on the shell:

vyos@vyos# time /bin/cli-shell-api showConfig | wc -l
6031

real	0m1.825s
user	0m0.210s
sys	0m0.492s

Slightly modifying /usr/lib/python3/dist-packages/vyos/configsession.py to add some debug:

def show_config(self, path, format='raw'):
    print('in: show_config')
    config_data = self.__run_command(SHOW_CONFIG + path)

    print('exiting: show_config')
    if format == 'raw':
        return config_data

The logs then show that it doesn't reach the "exiting" line in a reasonable time:

Feb 16 21:51:28 vyos vyos-http-api[29146]: Serving on http://localhost:8080
Feb 16 21:51:39 vyos vyos-http-api[29146]: in retrieve_op
Feb 16 21:51:39 vyos vyos-http-api[29146]: got env
Feb 16 21:51:40 vyos vyos-http-api[29146]: got config
Feb 16 21:51:40 vyos vyos-http-api[29146]: in: show_config
<still waiting>

Tried to mess around with/instrument https://github.com/vyos/vyatta-cfg/blob/current/src/cnode/cnode-algorithm.cpp#L924, and ended up losing the active config, with /opt/vyatta/config/active becoming empty.

Not sure if that was the result of installing an incompatible version, or whether upgrading the package live simply drops the config, since these packages are meant to be upgraded by booting a new image rather than with dpkg/apt.

Seems like the issue manifests in cnode-algorithm.cpp:_diff_print_indent:

   228 static void
   229 _diff_print_indent(const CfgNode *cfg1, const CfgNode *cfg2, int level,
   230                    const char *pfx_diff)
   231 {
+  232   std::ofstream debug; debug.open("/tmp/cfg-debug", std::ios_base::app); 
+  233   debug << "in _diff_print_indent\n";
   234   /* note: activate/deactivate state output was handled here. pending
   235    *       redesign, the output notation will be changed to "per-subtree"
   236    *       marking, so the output will be handled with the rest of the node.
   237    */
+  238   debug << "_diff_print_indent: printing pfx_diff\n";
+  239   debug << "_diff_print_indent: pfx_diff: " << std::hex << static_cast<const void*>(pfx_diff) << "\n"; 
   240   printf("%s", pfx_diff);
+  241   debug << "_diff_print_indent: flag1\n";
   242   for (int i = 0; i < level; i++) {
+  243     debug << "_diff_print_indent: loooop\n"; 
   244     printf("    ");
   245   }
+  246   debug << "exiting _diff_print_indent\n"; debug.close();
   247 }

The end of the debug "trace" shows:

in _diff_check_and_show_leaf
in _diff_show_other
in _diff_print_indent
_diff_print_indent: printing pfx_diff
_diff_print_indent: pfx_diff: 0x7f038a303fb0
_diff_print_indent: flag1
_diff_print_indent: loooop
_diff_print_indent: loooop
exiting _diff_print_indent
in _show_diff
_show_diff: flag1
in _diff_check_and_show_leaf
_diff_check_and_show_leaf: flag1
_diff_check_and_show_leaf: flagB
_diff_check_and_show_leaf: flagB1
_diff_check_and_show_leaf: flagB2
_diff_check_and_show_leaf: flagB3
in _diff_print_indent
_diff_print_indent: printing pfx_diff
_diff_print_indent: pfx_diff: 0x7f038a303fb0

So it looks like printf("%s", pfx_diff); never finishes. Missing null termination? Overflowing some buffer?

GDB however tells a different story:

vyos@vyos# ps aux | grep cli
root      1624  0.1  1.9  35212 19508 ?        S    12:11   0:00 /bin/cli-shell-api showConfig
root      1684  0.0  0.5   9492  5720 ?        Ss   Feb16   0:00 /sbin/dhclient -4 -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
vyos      2534  0.0  0.0   6084   892 pts/0    S+   12:22   0:00 grep cli
vyos@vyos# sudo gdb -p 1624
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1624
Reading symbols from /opt/vyatta/sbin/my_cli_shell_api...done.
Reading symbols from /lib/libvyatta-cfg.so.1...done.
[...]
Reading symbols from /lib/x86_64-linux-gnu/libgpg-error.so.0...(no debugging symbols found)...done.
0x00007f0389d05504 in write () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f0389d05504 in write () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f0389c963bd in _IO_file_write () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f0389c9575f in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f0389c97509 in _IO_do_write () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f0389c96a8f in _IO_file_xsputn () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f0389c6b1fb in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f0389d2339c in __printf_chk () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f038a2ccc0f in boost::filesystem::recursive_directory_iterator::increment (this=0x1) at /usr/include/c++/8/bits/stl_iterator.h:783
#8  0x0000000000000000 in ?? ()

I'm not a C++ developer, but the stack probably shouldn't look like that.

Compared to interrupting some other run of cli-shell-api early:

# sudo gdb -p $(pgrep cli-shell-api)
[...]
(gdb) bt
#0  0x00007f85329ba1ae in open64 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f853294be52 in _IO_file_open () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f853294bffd in _IO_file_fopen () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f8532940159 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f8532ce2ee0 in std::__basic_file<char>::open(char const*, std::_Ios_Openmode, int) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f8532d1f80a in std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f8532d2021e in std::basic_ifstream<char, std::char_traits<char> >::basic_ifstream(char const*, std::_Ios_Openmode) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f8532f74304 in cstore::unionfs::UnionfsCstore::read_whole_file (this=<optimized out>, fpath=..., data="") at src/cstore/svector.hpp:76
#8  0x00007f8532f745ce in cstore::unionfs::UnionfsCstore::read_value_vec (this=0x563c74f46cb0, vvec=std::vector of length 0, capacity 0, active_cfg=<optimized out>) at src/cstore/unionfs/cstore-unionfs.cpp:1001
#9  0x00007f8532f626ae in cstore::Cstore::cfgPathGetValueDA (this=this@entry=0x563c74f46cb0, path_comps=..., value="", active_cfg=<optimized out>, include_deactivated=<optimized out>) at src/cstore/cstore.cpp:1461
#10 0x00007f8532f8114a in cnode::CfgNode::CfgNode (this=0x563c751accc0, cstore=..., path_comps=..., active=<optimized out>, recursive=<optimized out>) at src/cnode/cnode.cpp:182
#11 0x00007f8532f80afd in cnode::CfgNode::CfgNode (this=0x563c751ac680, cstore=..., path_comps=..., active=<optimized out>, recursive=<optimized out>) at src/cnode/cnode.cpp:220
#12 0x00007f8532f80afd in cnode::CfgNode::CfgNode (this=0x563c74f54ad0, cstore=..., path_comps=..., active=<optimized out>, recursive=<optimized out>) at src/cnode/cnode.cpp:220
#13 0x00007f8532f80afd in cnode::CfgNode::CfgNode (this=0x563c74f53c10, cstore=..., path_comps=..., active=<optimized out>, recursive=<optimized out>) at src/cnode/cnode.cpp:220
#14 0x00007f8532f80afd in cnode::CfgNode::CfgNode (this=0x563c74f536a0, cstore=..., path_comps=..., active=<optimized out>, recursive=<optimized out>) at src/cnode/cnode.cpp:220
#15 0x00007f8532f80afd in cnode::CfgNode::CfgNode (this=0x563c74f484f0, cstore=..., path_comps=..., active=<optimized out>, recursive=<optimized out>) at src/cnode/cnode.cpp:220
#16 0x00007f8532f84f99 in cnode::showConfig (cfg1="@WORKING", cfg2="@WORKING", path=..., show_def=<optimized out>, hide_secret=<optimized out>, context_diff=<optimized out>, show_cmds=false, ignore_edit=true)
    at src/cnode/cnode-algorithm.cpp:949
#17 0x0000563c74a79da1 in showConfig (cstore=..., args=...) at src/cli_shell_api.cpp:546
#18 0x0000563c74a77b85 in main (argc=<optimized out>, argv=<optimized out>) at src/cli_shell_api.cpp:776

This actually has nothing to do with the HTTP API; rather, it is a deadlock coming from ConfigSession.show_config(), namely the use of Popen.wait() in __run_command().

It is known that when using stdout=PIPE and the child process generates enough output to fill the OS pipe buffer, Popen.wait() can deadlock (there is an explicit warning about this in the Python subprocess documentation).

A reproducer is below: __run_command extracted from configsession.py. Running it from within a config session, with 300 DNAT rules configured, recreates the deadlock:

#!/usr/bin/env python3

import subprocess

def run_command(cmdlist):
    p = subprocess.Popen(cmdlist, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    print('run_command: before p.wait()')
    result = p.wait()  # deadlocks here once the child fills the pipe buffer
    print('run_command: after p.wait()')
    output = p.stdout.read().decode()
    p.communicate()
    if result != 0:
        print('error')
    return output

run_command(['/bin/cli-shell-api', 'showConfig'])

The following modification resolves the deadlock:

#!/usr/bin/env python3

import subprocess

def run_command(cmdlist):
    p = subprocess.Popen(cmdlist, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # communicate() reads stdout while waiting for the child, so the pipe
    # buffer can never fill up and block the child.
    (stdout_data, stderr_data) = p.communicate()
    output = stdout_data.decode()
    result = p.wait()
    if result != 0:
        print('error')
    return output

out = run_command(['/bin/cli-shell-api', 'showConfig'])

print(out)

A similar fix, or making use of the Popen() wrappers from util.py, will resolve the issue. This was discovered based on all the debugging provided above.

jestabro changed the task status from Open to In progress. Wed, Feb 17, 4:50 PM
jestabro triaged this task as Normal priority.

Good job debugging the issue @varesa and @jestabro! It looks like the actual cause of the deadlock was found already.

@pasik I should have a fix committed soon; thanks for the report!

jestabro moved this task from In Progress to Finished on the VyOS 1.3 Equuleus board.
jestabro changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.

Thanks a lot @jestabro ! I'll give it a go with the latest version(s).

And indeed the fix works: I'm now able to add more than 215 DNAT rules and still fetch the config over the VyOS HTTP API! Thanks a lot, everyone.