Page MenuHomePhabricator

GCP Networking Failure
Open, NormalPublicBUG

Description

Creation of multiple 1.2.1 instances in GCP all result in networking failure as soon as the following error is reported by cloud-init during boot. This is repeatable across reboots on the same instance and for every instance I attempt to create using the 1.2.1 image.

[ 17.698879] cloud-init[1284]: /sbin/dhclient-script: line 183: [: too many arguments

PING 35.236.63.255 (35.236.63.255) 56(84) bytes of data.
64 bytes from 35.236.63.255: icmp_seq=11 ttl=62 time=71.6 ms
64 bytes from 35.236.63.255: icmp_seq=12 ttl=62 time=69.4 ms
64 bytes from 35.236.63.255: icmp_seq=13 ttl=62 time=68.8 ms
64 bytes from 35.236.63.255: icmp_seq=14 ttl=62 time=149 ms
64 bytes from 35.236.63.255: icmp_seq=17 ttl=62 time=85.7 ms
64 bytes from 35.236.63.255: icmp_seq=18 ttl=62 time=119 ms
^C
--- 35.236.63.255 ping statistics ---
25 packets transmitted, 6 received, 76% packet loss, time 24019ms
rtt min/avg/max/mdev = 68.823/94.119/149.675/30.413 ms
[   12.753101] cloud-init[766]: Unknown option: path
[   12.771860] cloud-init[766]: /sbin/dhclient-script: line 183: [: too many arguments
[   13.079253] cloud-init[766]: Exiting: failed to connect to any daemons.
[   13.209247] cloud-init[766]: Exiting: failed to connect to any daemons.
[   13.296243] cloud-init[766]: Exiting: failed to connect to any daemons.
[   13.351845] cloud-init[766]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'init-local' at Mon, 06 May 2019 20:22:29 +0000. Up 12.31 seconds.
[  OK  ] Started Initial cloud-init job (pre-networking).
[  OK  ] Reached target Network (Pre).
         Starting LSB: Raise network interfaces....
[  OK  ] Started LSB: Raise network interfaces..
         Starting Initial cloud-init job (metadata service crawler)...
[  OK  ] Reached target Network.
[   14.071591] cloud-init[963]: Unknown option: path
[   14.086850] cloud-init[963]: /sbin/dhclient-script: line 183: [: too many arguments
[   14.219200] cloud-init[963]: Exiting: failed to connect to any daemons.
[   14.305910] cloud-init[963]: Exiting: failed to connect to any daemons.
[   14.394096] cloud-init[963]: Exiting: failed to connect to any daemons.
[   14.562575] cloud-init[963]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'init' at Mon, 06 May 2019 20:22:31 +0000. Up 13.91 seconds.
[   14.582832] cloud-init[963]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
[   14.602865] cloud-init[963]: ci-info: +--------+------+--------------------------+-----------------+--------+-------------------+
[   14.623328] cloud-init[963]: ci-info: | Device |  Up  |         Address          |       Mask      | Scope  |     Hw-Address    |
[   14.646858] cloud-init[963]: ci-info: +--------+------+--------------------------+-----------------+--------+-------------------+
[   14.666919] cloud-init[963]: ci-info: |  eth0  | True |        10.168.0.3        | 255.255.255.255 | global | 42:01:0a:a8:00:03 |
[  OK  ] Started Initial cloud-init job (metadata service crawler).
[  OK  ] Reached target Cloud-config availability.
[  OK  ] Reached target System Initialization.
[  OK  ] Listening on UUID daemon activation socket.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[   14.687154] cloud-init[963]: ci-info: |  eth0  | True | fe80::4001:aff:fea8:3/64 |        .        |  link  | 42:01:0a:a8:00:03 |
[   14.786706] cloud-init[963]: ci-info: |   lo   | True |        127.0.0.1         |    255.0.0.0    |  host  |         .         |
[   14.806713] cloud-init[963]: ci-info: |   lo   | True |         ::1/128          |        .        |  host  |         .         |
[   14.826723] cloud-init[963]: ci-info: +--------+------+--------------------------+-----------------+--------+-------------------+
[   14.846741] cloud-init[963]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
[  OK  ] Reached target Timers.
[   14.867004] cloud-init[963]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
[   14.898751] cloud-init[963]: ci-info: | Route | Destination |  Gateway   |     Genmask     | Interface | Flags |
[  OK  ] Listening on ACPID Listen Socket.
[   14.919010] cloud-init[963]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
[   14.954769] cloud-init[963]: ci-info: |   0   |   0.0.0.0   | 10.168.0.1 |     0.0.0.0     |    eth0   |   UG  |
[  OK  ] Reached target Sockets.
[   14.975000] cloud-init[963]: ci-info: |   1   |  10.168.0.1 |  0.0.0.0   | 255.255.255.255 |    eth0   |   UH  |
[   15.006713] cloud-init[963]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
[   15.026740] cloud-init[963]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   15.046801] cloud-init[963]: ci-info: +-------+-------------+---------+-----------+-------+
[  OK  ] Reached target Paths.
[   15.066991] cloud-init[963]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   15.098755] cloud-init[963]: ci-info: +-------+-------------+---------+-----------+-------+
[  OK  ] Reached target Basic System.
[   15.119018] cloud-init[963]: ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
[   15.150713] cloud-init[963]: ci-info: |   3   |    local    |    ::   |    eth0   |   U   |
         Starting Entropy daemon using the HAVEGE algorithm...
[  OK  ] Started Entropy daemon using the HAVEGE algorithm.
[   15.170889] cloud-init[963]: ci-info: |   4   |   ff00::/8  |    ::   |    eth0   |   U   |
[   15.216264] cloud-init[963]: ci-info: +-------+-------------+---------+-----------+-------+
[   15.234685] cloud-init[963]: No 'init' modules to run under section 'cloud_init_modules'
         Starting LSB: Cleans up pppoe peers files...
         Starting Deferred execution scheduler...
[  OK  ] Started Deferred execution scheduler.
         Starting Regular background program processing daemon...
[  OK  ] Started Regular background program processing daemon.
         Starting OpenBSD Secure Shell session cleanup...
         Starting /etc/rc.local Compatibility...
         Starting Login Service...
         Starting LSB: QEMU Guest Agent startup script...
         Starting LSB: Start the GNUstep distributed object mapper...
         Starting LSB: Monitor for system resources and process activity...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting System Logging Service...
         Starting ACPI event daemon...
[  OK  ] Started ACPI event daemon.
[  OK  ] Started LSB: Cleans up pppoe peers files.
[  OK  ] Started OpenBSD Secure Shell session cleanup.
[  OK  ] Started /etc/rc.local Compatibility.
[  OK  ] Started LSB: QEMU Guest Agent startup script.
[  OK  ] Started LSB: Start the GNUstep distributed object mapper.
[  OK  ] Started LSB: Monitor for system resources and process activity.
[  OK  ] Started Login Service.
[  OK  ] Reached target Network is Online.
         Starting Apply the settings specified in cloud-config...
         Starting LSB: Brings up/down network automatically...
[  OK  ] Started System Logging Service.
[  OK  ] Started LSB: Brings up/down network automatically.
[   16.224217] cloud-init[1126]: Unknown option: path
[   16.238857] cloud-init[1126]: /sbin/dhclient-script: line 183: [: too many arguments
[   16.371555] cloud-init[1126]: Exiting: failed to connect to any daemons.
[   16.458360] cloud-init[1126]: Exiting: failed to connect to any daemons.
[   16.543878] cloud-init[1126]: Exiting: failed to connect to any daemons.
[   16.584306] cloud-init[1126]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'modules:config' at Mon, 06 May 2019 20:22:33 +0000. Up 16.08 seconds.
[  OK  ] Started Apply the settings specified in cloud-config.
         Starting VyOS Router...
[  OK  ] Started VyOS Router.
         Starting Permit User Sessions...
[  OK  ] Started Permit User Sessions.
         Starting Getty on tty1...
[  OK  ] Started Getty on tty1.
         Starting Serial Getty on ttyS0...
[  OK  ] Started Serial Getty on ttyS0.
[   16.762172] vyos-router[1220]: Reading deprecated /etc/frr/daemons.conf. Please move its settings to /etc/frr/daemons and remove it. ... (warning).
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Multi-User System.
[   16.809175] vyos-router[1220]: watchfrr_options contains a bash array value. The configured value is intentionally ignored since it is likely wrong. Please remove or fix the setting. ... (warning).
         Starting Execute cloud user/final scripts...
[  OK  ] Reached target VyOS target.
[  OK  ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[   17.682538] cloud-init[1284]: Unknown option: path
[   17.698879] cloud-init[1284]: /sbin/dhclient-script: line 183: [: too many arguments
[   18.994422] cloud-init[1284]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'modules:final' at Mon, 06 May 2019 20:22:34 +0000. Up 17.38 seconds.
[   19.078920] vyos-router[1220]: Started watchfrr.
[   19.099946] vyos-router[1220]: Mounting VyOS Config...done.
[  OK  ] Started Execute cloud user/final scripts.
[  OK  ] Reached target Cloud-init target.
[   23.736240] vyos-router[1220]: Starting VyOS router: migrate rl-system firewall configure.
[   24.191514] vyos-config[1249]: Configuration success

Welcome to VyOS - instance-2 ttyS0

Details

Difficulty level
Unknown (require assessment)
Version
1.2.1
Why the issue appeared?
Will be filled on close

Event Timeline

joshua created this task.May 6 2019, 8:31 PM
joshua added a comment.May 8 2019, 8:49 PM

I made some progress here using the debug image, however, it looks like cloud-init is still starting a dhclient process that doesn't need to exist as part of it's fallback networking.

Manual fix on post boot on serial console :

debuguser@vyos-121-debug> ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:a8:00:06  
          inet addr:10.168.0.6  Bcast:10.168.0.6  Mask:255.255.255.255
          inet6 addr: fe80::4001:aff:fea8:6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
          RX packets:160 errors:0 dropped:0 overruns:0 frame:0
          TX packets:84 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:19220 (18.7 KiB)  TX bytes:9079 (8.8 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

debuguser@vyos-121-debug> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
debuguser@vyos-121-debug> sudo route add 10.168.0.1/32 dev eth0
debuguser@vyos-121-debug> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
10.168.0.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0
debuguser@vyos-121-debug> ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=122 time=0.925 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=122 time=0.985 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=122 time=0.322 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss,

Root cause appears to be redundant dhclients due to cloud init :

debuguser@vyos-121-debug> sudo -s
sh-4.3# ps aux | grep dhc
root      1444  0.0  0.4  25404  7912 ?        Ss   18:55   0:00 /sbin/dhclient eth0
root      1828  0.0  0.5  25404  8808 ?        Ss   18:55   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
ntp       2013  0.0  0.2  29176  4316 ?        Ss   18:55   0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /var/lib/ntp/ntp.conf.dhcp -u 105:112
root      2181  0.0  0.1  12732  2180 ttyS0    S+   18:56   0:00 grep dhc

sh-4.3# cat /etc/network/interfaces.d/50-cloud-init.cfg 
# This file is generated from information provided by
# the datasource.  Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

Resolution 1 (Lazy) - Remove DHCP from eth0 configuration in config file and let cloud init defaults handle addressing (This will immediately break if dhcp is configured on eth0 from the config file) :

debuguser@vyos-121-debug> cat /config/config.boot | grep interfaces -A 3
interfaces {
    ethernet "eth0"
    loopback "lo"
}

debuguser@vyos-121-debug> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
10.168.0.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0

More proper resolution (Still spawns un-needed dhclient on eth0) - Remove Cloud init networking and fix DHCLIENT generation in pl script to allow handling subnet data for googles networking design (The /32 with host routes) :

modify vyatta-interfaces.pl to enable RFC3442 array reading and enable the request of static routes via DHCLIENT config :
sh-4.3# cat /opt/vyatta/sbin/vyatta-interfaces.pl | grep "sub dhcp_update_config" -A 10
sub dhcp_update_config {
  my ($conf_file, $intf) = @_;

  my $output = dhcp_conf_header();
  my $hostname = get_hostname();
  $output .= "option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;\n";  <======
  $output .= "interface \"$intf\" {\n";
  if (defined($hostname)) {
    $output .= "\tsend host-name \"$hostname\";\n";
  }
  $output .= "\trequest subnet-mask, broadcast-address, routers, rfc3442-classless-static-routes, domain-name-servers"; <======

Disable Cloud-Init networking :

sh-4.3# cat /etc/cloud/cloud.cfg.d/01_disable-network-config.cfg
network: {config: disabled}

Remove Cloud-Init generated file : 
sh-4.3# rm /etc/network/interfaces.d/50-cloud-init.cfg 

Make sure config is now set to DHCP :
sh-4.3# cat /config/config.boot | grep interfaces -A 5
interfaces {
    ethernet "eth0" {
        address dhcp
    }
    loopback "lo"
}

Reboot :

sh-4.3# ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:a8:00:06
          inet addr:10.168.0.6  Bcast:10.168.0.6  Mask:255.255.255.255
          inet6 addr: fe80::4001:aff:fea8:6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
          RX packets:566 errors:0 dropped:0 overruns:0 frame:0
          TX packets:497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:58367 (56.9 KiB)  TX bytes:59611 (58.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

sh-4.3# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
10.168.0.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0

The only thing left to resolve is the duplicate dhclient instance that I believe is being spawned by cloud-init's fallback networking. I am not too terribly familiar with cloud-init other than completely disabling it on most of my own servers.

sh-4.3# ps aux | grep dhclient
root      1436  0.0  0.4  25404  7840 ?        Ss   19:58   0:00 /sbin/dhclient eth0
root      1820  0.0  0.5  25404  8732 ?        Ss   19:58   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
root      2259  0.0  0.1  12732  2176 pts/0    S+   20:04   0:00 grep dhclient

sh-4.3# ls /etc/network/interfaces.d/
sh-4.3#

sh-4.3# cat /etc/network/interfaces
# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

# The loopback network interface
auto lo
iface lo inet loopback
joshua added a comment.May 9 2019, 5:31 PM

Made some more progress, however, well either need to use the workaround for GCE images or come up with something more elegant that resolves the cloud-init behavior.

Testing :

Killing cloud-init stops the extra dhclient :

sh-4.3# touch /etc/cloud/cloud-init.disabled
sh-4.3# reboot
debuguser@vyos-121-debug> sudo -s
sh-4.3# ps aux | grep dhclient
root      1373  0.0  0.5  25404  8820 ?        Ss   16:11   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
root      1714  0.0  0.1  12732  2240 pts/0    S+   16:12   0:00 grep dhclient

I tried messing with interface configuration files to get eth0 up in an unconfigured state at boot, however, cloud-init still starts fallback networking in that case.

Work around with script (Includes previous modifications to VyOS interface.pl script and cloud-init networking disable :

debuguser@vyos-121-debug> cat /config/scripts/vyos-postconfig-bootup.script
#!/bin/sh
# This script is executed at boot time after VyOS configuration is fully applied.
# Any modifications required to work around unfixed bugs
# or use services not available through the VyOS CLI system can be placed here.

kill $(ps aux | grep dhclient | grep -v .conf | grep -v grep |awk '{print $2}')


debuguser@vyos-121-debug> ps aux | grep dhcli
root      1822  0.0  0.5  25404  8736 ?        Ss   17:21   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
debugus+  2191  0.0  0.1  12732  2064 pts/0    S+   17:23   0:00 grep dhcli
pasik added a subscriber: pasik.May 9 2019, 9:07 PM

We are closer to a resolution here, however, some legacy code references are causing issues with the 1st boot before our dhcp patches are effective. This code reference is in the dhclient-script being called by cloud-init on the 1st boot. With this failure cloud-init cannot hit the GCE data-source on the 1st boot causing some issues. This will require more work tomorrow, however, I feel a resolution is rather close.

I decided to patch this during build on my local workstation while building rolling for testing as shown below.

root@931f0dd4ff60:/vyos# cat tools/cloud-init/GCE/98-gce-dhcp-patch.chroot 
#!/bin/bash
grep 'rfc3442' /opt/vyatta/sbin/vyatta-interfaces.pl
if [ $? != 0 ]; then
    perl -pi.back -e 's/get_hostname\(\)\;/get_hostname\(\)\;\n  \$output \.\= \"option rfc3442-classless-static-routes code 121 \= array of unsigned integer 8\;\\n\"\;/' /opt/vyatta/sbin/vyatta-interfaces.pl
    perl -pi -e 's/\$output \.\= \"\\trequest subnet-mask\, broadcast-address\, routers\, domain-name-servers\"\;/\$output \.\= \"\\trequest subnet-mask\, broadcast-address\, routers\, rfc3442-classless-static-routes\, domain-name-servers\"\;/' /opt/vyatta/sbin/vyatta-interfaces.pl
fi

root@931f0dd4ff60:/vyos# cat Makefile | grep 98-
	cp tools/cloud-init/GCE/98-gce-dhcp-patch.chroot build/config/hooks/live/

jbrooks@ubuntu:~/vyos-build/build$ cat chroot/opt/vyatta/sbin/vyatta-interfaces.pl | grep rfc
  $output .= "option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;\n";
  $output .= "\trequest subnet-mask, broadcast-address, routers, rfc3442-classless-static-routes, domain-name-servers";

1st boot failure example :

1st boot failure due to missing legacy script still being referenced by /sbin/dhclient-script 
/sbin/dhclient-script: line 183: [: too many arguments
/sbin/dhclient-script: line 42: /opt/vyatta/sbin/vyatta_update_resolv.pl: No such file or directory
2019-05-14 02:15:03,359 - DataSourceGCE.py[WARNING]: address "http://metadata.google.internal/computeMetadata/v1/" is not resolvable
debuguser@instance-3> head -42 /sbin/dhclient-script | tail -1
    /opt/vyatta/sbin/vyatta_update_resolv.pl --dhclient-script 1

This corrects on reboot as the system recognizes DHCP is on eth0 and creates the configuration.  Then our patched DHCP client takes over. 

debuguser@instance-3> cat /config/config.boot | grep eth -A 2
    ethernet eth0 {
        address "dhcp"
    }

This is resolvable via the following two fixes. Once they are implemented GCP should work as expected.

T1379 T1381

syncer assigned this task to zsdc.Aug 31 2019, 12:36 AM
syncer triaged this task as Normal priority.
syncer edited projects, added VyOS 1.3 Equuleus; removed VyOS 1.2 Crux.