Page MenuHomePhabricator

GCP Networking Failure
Open, Requires assessmentPublicBUG

Description

Creation of multiple 1.2.1 instances in GCP all result in networking failure as soon as the following error is reported by cloud-init during boot. This is repeatable across reboots on the same instance and for every instance I attempt to create using the 1.2.1 image.

[ 17.698879] cloud-init[1284]: /sbin/dhclient-script: line 183: [: too many arguments

PING 35.236.63.255 (35.236.63.255) 56(84) bytes of data.
64 bytes from 35.236.63.255: icmp_seq=11 ttl=62 time=71.6 ms
64 bytes from 35.236.63.255: icmp_seq=12 ttl=62 time=69.4 ms
64 bytes from 35.236.63.255: icmp_seq=13 ttl=62 time=68.8 ms
64 bytes from 35.236.63.255: icmp_seq=14 ttl=62 time=149 ms
64 bytes from 35.236.63.255: icmp_seq=17 ttl=62 time=85.7 ms
64 bytes from 35.236.63.255: icmp_seq=18 ttl=62 time=119 ms
^C
--- 35.236.63.255 ping statistics ---
25 packets transmitted, 6 received, 76% packet loss, time 24019ms
rtt min/avg/max/mdev = 68.823/94.119/149.675/30.413 ms
[   12.753101] cloud-init[766]: Unknown option: path
[   12.771860] cloud-init[766]: /sbin/dhclient-script: line 183: [: too many arguments
[   13.079253] cloud-init[766]: Exiting: failed to connect to any daemons.
[   13.209247] cloud-init[766]: Exiting: failed to connect to any daemons.
[   13.296243] cloud-init[766]: Exiting: failed to connect to any daemons.
[   13.351845] cloud-init[766]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'init-local' at Mon, 06 May 2019 20:22:29 +0000. Up 12.31 seconds.
[  OK  ] Started Initial cloud-init job (pre-networking).
[  OK  ] Reached target Network (Pre).
         Starting LSB: Raise network interfaces....
[  OK  ] Started LSB: Raise network interfaces..
         Starting Initial cloud-init job (metadata service crawler)...
[  OK  ] Reached target Network.
[   14.071591] cloud-init[963]: Unknown option: path
[   14.086850] cloud-init[963]: /sbin/dhclient-script: line 183: [: too many arguments
[   14.219200] cloud-init[963]: Exiting: failed to connect to any daemons.
[   14.305910] cloud-init[963]: Exiting: failed to connect to any daemons.
[   14.394096] cloud-init[963]: Exiting: failed to connect to any daemons.
[   14.562575] cloud-init[963]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'init' at Mon, 06 May 2019 20:22:31 +0000. Up 13.91 seconds.
[   14.582832] cloud-init[963]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
[   14.602865] cloud-init[963]: ci-info: +--------+------+--------------------------+-----------------+--------+-------------------+
[   14.623328] cloud-init[963]: ci-info: | Device |  Up  |         Address          |       Mask      | Scope  |     Hw-Address    |
[   14.646858] cloud-init[963]: ci-info: +--------+------+--------------------------+-----------------+--------+-------------------+
[   14.666919] cloud-init[963]: ci-info: |  eth0  | True |        10.168.0.3        | 255.255.255.255 | global | 42:01:0a:a8:00:03 |
[  OK  ] Started Initial cloud-init job (metadata service crawler).
[  OK  ] Reached target Cloud-config availability.
[  OK  ] Reached target System Initialization.
[  OK  ] Listening on UUID daemon activation socket.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[   14.687154] cloud-init[963]: ci-info: |  eth0  | True | fe80::4001:aff:fea8:3/64 |        .        |  link  | 42:01:0a:a8:00:03 |
[   14.786706] cloud-init[963]: ci-info: |   lo   | True |        127.0.0.1         |    255.0.0.0    |  host  |         .         |
[   14.806713] cloud-init[963]: ci-info: |   lo   | True |         ::1/128          |        .        |  host  |         .         |
[   14.826723] cloud-init[963]: ci-info: +--------+------+--------------------------+-----------------+--------+-------------------+
[   14.846741] cloud-init[963]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
[  OK  ] Reached target Timers.
[   14.867004] cloud-init[963]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
[   14.898751] cloud-init[963]: ci-info: | Route | Destination |  Gateway   |     Genmask     | Interface | Flags |
[  OK  ] Listening on ACPID Listen Socket.
[   14.919010] cloud-init[963]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
[   14.954769] cloud-init[963]: ci-info: |   0   |   0.0.0.0   | 10.168.0.1 |     0.0.0.0     |    eth0   |   UG  |
[  OK  ] Reached target Sockets.
[   14.975000] cloud-init[963]: ci-info: |   1   |  10.168.0.1 |  0.0.0.0   | 255.255.255.255 |    eth0   |   UH  |
[   15.006713] cloud-init[963]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
[   15.026740] cloud-init[963]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   15.046801] cloud-init[963]: ci-info: +-------+-------------+---------+-----------+-------+
[  OK  ] Reached target Paths.
[   15.066991] cloud-init[963]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   15.098755] cloud-init[963]: ci-info: +-------+-------------+---------+-----------+-------+
[  OK  ] Reached target Basic System.
[   15.119018] cloud-init[963]: ci-info: |   1   |  fe80::/64  |    ::   |    eth0   |   U   |
[   15.150713] cloud-init[963]: ci-info: |   3   |    local    |    ::   |    eth0   |   U   |
         Starting Entropy daemon using the HAVEGE algorithm...
[  OK  ] Started Entropy daemon using the HAVEGE algorithm.
[   15.170889] cloud-init[963]: ci-info: |   4   |   ff00::/8  |    ::   |    eth0   |   U   |
[   15.216264] cloud-init[963]: ci-info: +-------+-------------+---------+-----------+-------+
[   15.234685] cloud-init[963]: No 'init' modules to run under section 'cloud_init_modules'
         Starting LSB: Cleans up pppoe peers files...
         Starting Deferred execution scheduler...
[  OK  ] Started Deferred execution scheduler.
         Starting Regular background program processing daemon...
[  OK  ] Started Regular background program processing daemon.
         Starting OpenBSD Secure Shell session cleanup...
         Starting /etc/rc.local Compatibility...
         Starting Login Service...
         Starting LSB: QEMU Guest Agent startup script...
         Starting LSB: Start the GNUstep distributed object mapper...
         Starting LSB: Monitor for system resources and process activity...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting System Logging Service...
         Starting ACPI event daemon...
[  OK  ] Started ACPI event daemon.
[  OK  ] Started LSB: Cleans up pppoe peers files.
[  OK  ] Started OpenBSD Secure Shell session cleanup.
[  OK  ] Started /etc/rc.local Compatibility.
[  OK  ] Started LSB: QEMU Guest Agent startup script.
[  OK  ] Started LSB: Start the GNUstep distributed object mapper.
[  OK  ] Started LSB: Monitor for system resources and process activity.
[  OK  ] Started Login Service.
[  OK  ] Reached target Network is Online.
         Starting Apply the settings specified in cloud-config...
         Starting LSB: Brings up/down network automatically...
[  OK  ] Started System Logging Service.
[  OK  ] Started LSB: Brings up/down network automatically.
[   16.224217] cloud-init[1126]: Unknown option: path
[   16.238857] cloud-init[1126]: /sbin/dhclient-script: line 183: [: too many arguments
[   16.371555] cloud-init[1126]: Exiting: failed to connect to any daemons.
[   16.458360] cloud-init[1126]: Exiting: failed to connect to any daemons.
[   16.543878] cloud-init[1126]: Exiting: failed to connect to any daemons.
[   16.584306] cloud-init[1126]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'modules:config' at Mon, 06 May 2019 20:22:33 +0000. Up 16.08 seconds.
[  OK  ] Started Apply the settings specified in cloud-config.
         Starting VyOS Router...
[  OK  ] Started VyOS Router.
         Starting Permit User Sessions...
[  OK  ] Started Permit User Sessions.
         Starting Getty on tty1...
[  OK  ] Started Getty on tty1.
         Starting Serial Getty on ttyS0...
[  OK  ] Started Serial Getty on ttyS0.
[   16.762172] vyos-router[1220]: Reading deprecated /etc/frr/daemons.conf. Please move its settings to /etc/frr/daemons and remove it. ... (warning).
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Multi-User System.
[   16.809175] vyos-router[1220]: watchfrr_options contains a bash array value. The configured value is intentionally ignored since it is likely wrong. Please remove or fix the setting. ... (warning).
         Starting Execute cloud user/final scripts...
[  OK  ] Reached target VyOS target.
[  OK  ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[   17.682538] cloud-init[1284]: Unknown option: path
[   17.698879] cloud-init[1284]: /sbin/dhclient-script: line 183: [: too many arguments
[   18.994422] cloud-init[1284]: Cloud-init v. 18.4-0ubuntu1~16.04.2 running 'modules:final' at Mon, 06 May 2019 20:22:34 +0000. Up 17.38 seconds.
[   19.078920] vyos-router[1220]: Started watchfrr.
[   19.099946] vyos-router[1220]: Mounting VyOS Config...done.
[  OK  ] Started Execute cloud user/final scripts.
[  OK  ] Reached target Cloud-init target.
[   23.736240] vyos-router[1220]: Starting VyOS router: migrate rl-system firewall configure.
[   24.191514] vyos-config[1249]: Configuration success

Welcome to VyOS - instance-2 ttyS0

Details

Difficulty level
Unknown (require assessment)
Version
1.2.1
Why the issue appeared?
Will be filled on close

Event Timeline

joshua created this task.May 6 2019, 8:31 PM
joshua added a comment.May 8 2019, 8:49 PM

I made some progress here using the debug image, however, it looks like cloud-init is still starting a dhclient process that doesn't need to exist as part of it's fallback networking.

Manual fix on post boot on serial console :

debuguser@vyos-121-debug> ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:a8:00:06  
          inet addr:10.168.0.6  Bcast:10.168.0.6  Mask:255.255.255.255
          inet6 addr: fe80::4001:aff:fea8:6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
          RX packets:160 errors:0 dropped:0 overruns:0 frame:0
          TX packets:84 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:19220 (18.7 KiB)  TX bytes:9079 (8.8 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

debuguser@vyos-121-debug> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
debuguser@vyos-121-debug> sudo route add 10.168.0.1/32 dev eth0
debuguser@vyos-121-debug> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
10.168.0.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0
debuguser@vyos-121-debug> ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=122 time=0.925 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=122 time=0.985 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=122 time=0.322 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss,

Root cause appears to be redundant dhclients due to cloud init :

debuguser@vyos-121-debug> sudo -s
sh-4.3# ps aux | grep dhc
root      1444  0.0  0.4  25404  7912 ?        Ss   18:55   0:00 /sbin/dhclient eth0
root      1828  0.0  0.5  25404  8808 ?        Ss   18:55   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
ntp       2013  0.0  0.2  29176  4316 ?        Ss   18:55   0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -c /var/lib/ntp/ntp.conf.dhcp -u 105:112
root      2181  0.0  0.1  12732  2180 ttyS0    S+   18:56   0:00 grep dhc

sh-4.3# cat /etc/network/interfaces.d/50-cloud-init.cfg 
# This file is generated from information provided by
# the datasource.  Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

Resolution 1 (Lazy) - Remove DHCP from eth0 configuration in config file and let cloud init defaults handle addressing (This will immediately break if dhcp is configured on eth0 from the config file) :

debuguser@vyos-121-debug> cat /config/config.boot | grep interfaces -A 3
interfaces {
    ethernet "eth0"
    loopback "lo"
}

debuguser@vyos-121-debug> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
10.168.0.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0

More proper resolution (Still spawns un-needed dhclient on eth0) - Remove Cloud init networking and fix DHCLIENT generation in pl script to allow handling subnet data for googles networking design (The /32 with host routes) :

modify vyatta-interfaces.pl to enable RFC3442 array reading and enable the request of static routes via DHCLIENT config :
sh-4.3# cat /opt/vyatta/sbin/vyatta-interfaces.pl | grep "sub dhcp_update_config" -A 10
sub dhcp_update_config {
  my ($conf_file, $intf) = @_;

  my $output = dhcp_conf_header();
  my $hostname = get_hostname();
  $output .= "option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;\n";  <======
  $output .= "interface \"$intf\" {\n";
  if (defined($hostname)) {
    $output .= "\tsend host-name \"$hostname\";\n";
  }
  $output .= "\trequest subnet-mask, broadcast-address, routers, rfc3442-classless-static-routes, domain-name-servers"; <======

Disable Cloud-Init networking :

sh-4.3# cat /etc/cloud/cloud.cfg.d/01_disable-network-config.cfg
network: {config: disabled}

Remove Cloud-Init generated file : 
sh-4.3# rm /etc/network/interfaces.d/50-cloud-init.cfg 

Make sure config is now set to DHCP :
sh-4.3# cat /config/config.boot | grep interfaces -A 5
interfaces {
    ethernet "eth0" {
        address dhcp
    }
    loopback "lo"
}

Reboot :

sh-4.3# ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:a8:00:06
          inet addr:10.168.0.6  Bcast:10.168.0.6  Mask:255.255.255.255
          inet6 addr: fe80::4001:aff:fea8:6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
          RX packets:566 errors:0 dropped:0 overruns:0 frame:0
          TX packets:497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:58367 (56.9 KiB)  TX bytes:59611 (58.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

sh-4.3# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
0.0.0.0         10.168.0.1      0.0.0.0         UG        0 0          0 eth0
10.168.0.1      0.0.0.0         255.255.255.255 UH        0 0          0 eth0

The only thing left to resolve is the duplicate dhclient instance that I believe is being spawned by cloud-init's fallback networking. I am not too terribly familiar with cloud-init other than completely disabling it on most of my own servers.

sh-4.3# ps aux | grep dhclient
root      1436  0.0  0.4  25404  7840 ?        Ss   19:58   0:00 /sbin/dhclient eth0
root      1820  0.0  0.5  25404  8732 ?        Ss   19:58   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
root      2259  0.0  0.1  12732  2176 pts/0    S+   20:04   0:00 grep dhclient

sh-4.3# ls /etc/network/interfaces.d/
sh-4.3#

sh-4.3# cat /etc/network/interfaces
# interfaces(5) file used by ifup(8) and ifdown(8)
# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

# The loopback network interface
auto lo
iface lo inet loopback
joshua added a comment.May 9 2019, 5:31 PM

Made some more progress, however, well either need to use the workaround for GCE images or come up with something more elegant that resolves the cloud-init behavior.

Testing :

Killing cloud-init stops the extra dhclient :

sh-4.3# touch /etc/cloud/cloud-init.disabled
sh-4.3# reboot
debuguser@vyos-121-debug> sudo -s
sh-4.3# ps aux | grep dhclient
root      1373  0.0  0.5  25404  8820 ?        Ss   16:11   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
root      1714  0.0  0.1  12732  2240 pts/0    S+   16:12   0:00 grep dhclient

I tried messing with interface configuration files to get eth0 up in an unconfigured state at boot, however, cloud-init still starts fallback networking in that case.

Work around with script (Includes previous modifications to VyOS interface.pl script and cloud-init networking disable :

debuguser@vyos-121-debug> cat /config/scripts/vyos-postconfig-bootup.script
#!/bin/sh
# This script is executed at boot time after VyOS configuration is fully applied.
# Any modifications required to work around unfixed bugs
# or use services not available through the VyOS CLI system can be placed here.

kill $(ps aux | grep dhclient | grep -v .conf | grep -v grep |awk '{print $2}')


debuguser@vyos-121-debug> ps aux | grep dhcli
root      1822  0.0  0.5  25404  8736 ?        Ss   17:21   0:00 /sbin/dhclient -q -nw -cf /var/lib/dhcp/dhclient_eth0.conf -pf /var/lib/dhcp/dhclient_eth0.pid -lf /var/lib/dhcp/dhclient_eth0.leases eth0
debugus+  2191  0.0  0.1  12732  2064 pts/0    S+   17:23   0:00 grep dhcli
pasik added a subscriber: pasik.May 9 2019, 9:07 PM

We are closer to a resolution here, however, some legacy code references are causing issues with the 1st boot before our dhcp patches are effective. This code reference is in the dhclient-script being called by cloud-init on the 1st boot. With this failure cloud-init cannot hit the GCE data-source on the 1st boot causing some issues. This will require more work tomorrow, however, I feel a resolution is rather close.

I decided to patch this during build on my local workstation while building rolling for testing as shown below.

root@931f0dd4ff60:/vyos# cat tools/cloud-init/GCE/98-gce-dhcp-patch.chroot 
#!/bin/bash
grep 'rfc3442' /opt/vyatta/sbin/vyatta-interfaces.pl
if [ $? != 0 ]; then
    perl -pi.back -e 's/get_hostname\(\)\;/get_hostname\(\)\;\n  \$output \.\= \"option rfc3442-classless-static-routes code 121 \= array of unsigned integer 8\;\\n\"\;/' /opt/vyatta/sbin/vyatta-interfaces.pl
    perl -pi -e 's/\$output \.\= \"\\trequest subnet-mask\, broadcast-address\, routers\, domain-name-servers\"\;/\$output \.\= \"\\trequest subnet-mask\, broadcast-address\, routers\, rfc3442-classless-static-routes\, domain-name-servers\"\;/' /opt/vyatta/sbin/vyatta-interfaces.pl
fi

root@931f0dd4ff60:/vyos# cat Makefile | grep 98-
	cp tools/cloud-init/GCE/98-gce-dhcp-patch.chroot build/config/hooks/live/

jbrooks@ubuntu:~/vyos-build/build$ cat chroot/opt/vyatta/sbin/vyatta-interfaces.pl | grep rfc
  $output .= "option rfc3442-classless-static-routes code 121 = array of unsigned integer 8;\n";
  $output .= "\trequest subnet-mask, broadcast-address, routers, rfc3442-classless-static-routes, domain-name-servers";

1st boot failure example :

1st boot failure due to missing legacy script still being referenced by /sbin/dhclient-script 
/sbin/dhclient-script: line 183: [: too many arguments
/sbin/dhclient-script: line 42: /opt/vyatta/sbin/vyatta_update_resolv.pl: No such file or directory
2019-05-14 02:15:03,359 - DataSourceGCE.py[WARNING]: address "http://metadata.google.internal/computeMetadata/v1/" is not resolvable
debuguser@instance-3> head -42 /sbin/dhclient-script | tail -1
    /opt/vyatta/sbin/vyatta_update_resolv.pl --dhclient-script 1

This corrects on reboot as the system recognizes DHCP is on eth0 and creates the configuration.  Then our patched DHCP client takes over. 

debuguser@instance-3> cat /config/config.boot | grep eth -A 2
    ethernet eth0 {
        address "dhcp"
    }

This is resolvable via the following two fixes. Once they are implemented GCP should work as expected.

T1379 T1381