
Suspending and resuming VyOS in VMware will result in loss of static ip addresses
Needs testing, Low, Public, BUG

Description

It seems that as of VyOS 1.2.x, the updated "open-vm-tools" package has a script that is called when you suspend the VM under VMware.

This script is located here: /etc/vmware-tools/scripts/vmware/network and is provided by the open-vm-tools package.

One of its functions is del_intf_ip(), which basically runs:

ip addr flush dev $nic

As a result, the interfaces lose their assigned IP addresses when you suspend the VM.

When you resume the VM, the IP addresses are not re-assigned by VyOS. This is not a problem for DHCP interfaces, but it is for interfaces with static IP addresses.

Making the script not flush the IP addresses makes it work in my case (e.g. by returning 0 early in the function).
However, I'm not sure what the preferred way is to make it work as intended.

Details

Difficulty level
Unknown (require assessment)
Version
v1.2.0-rc7
Why the issue appeared?
Will be filled on close

Event Timeline

yun created this task. Nov 20 2018, 3:36 PM
syncer assigned this task to hagbard. Dec 1 2018, 5:50 PM
syncer triaged this task as High priority.
syncer edited projects, added VyOS 1.2 Crux (VyOS 1.2.0-rc10); removed VyOS 1.2 Crux.
syncer added a subscriber: syncer.

@hagbard I think we need to remove or disable such behavior.

The vmware tools scripts work as expected: they stop and start the network config as they are supposed to, but they use the Debian defaults, so they do not execute the VyOS config. I'm going to check if we can extend it a little somewhere to execute the config again when 'resume' happens. In general that won't be an easy fix.

OK, so I think I know how to attack that. Either we use the tools from Debian directly and add an extra package with the user scripts, or we do it directly in the tools package, which is forked anyway. I would personally lean towards option 2, since Debian would take care of patching.

  # Handy reference/shorthand used in this doc/scripts:
  #    TOOLS_CONFDIR ::= Depends on platform and installation settings. Likely
  #                      "/etc/vmware-tools" or
  #                      "/Library/Application Support/VMware Tools"
  #    powerOp       ::= One of "poweron-vm", "poweroff-vm", "suspend-vm", and
  #                      "resume-vm".
  #    vmwScriptDir  ::= $TOOLS_CONFDIR/scripts/vmware
  #    userScriptDir ::= $TOOLS_CONFDIR/scripts/${powerOp}-default.d
  #
  # End users may install scripts of their own under $userScriptDir. They
  # are executed in alphabetical order with "$powerOp" as the only argument.
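
To illustrate the userScriptDir convention quoted above, here is a minimal sketch of what a user script could look like. It only shows the calling convention (the power operation arrives as the only argument). The handler bodies and return strings are hypothetical placeholders, not what open-vm-tools or the VyOS package actually ships.

```python
#!/usr/bin/env python3
"""Hypothetical user script for $TOOLS_CONFDIR/scripts/<powerOp>-default.d/."""
import sys

# The four power operations named in the open-vm-tools header comment.
POWER_OPS = ("poweron-vm", "poweroff-vm", "suspend-vm", "resume-vm")


def handle(power_op):
    """Return a short description of what this sketch would do for power_op."""
    if power_op not in POWER_OPS:
        raise ValueError("unknown power operation: %r" % power_op)
    if power_op == "resume-vm":
        # Placeholder: this is where addresses would be re-applied.
        return "re-apply network configuration"
    if power_op == "suspend-vm":
        return "nothing to do before suspend"
    return "ignored"


def main(argv):
    # The power operation is passed as the only argument.
    if len(argv) != 2:
        print("usage: %s <powerOp>" % argv[0], file=sys.stderr)
        return 2
    print(handle(argv[1]))
    return 0

# A real script would end with: sys.exit(main(sys.argv))
```

Because the scripts in a directory are run in alphabetical order, the file name decides when such a script runs relative to any others installed there.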

Hey @yun, could you test it under VMware? I have only used the scripts themselves to trigger resume and suspend.

https://github.com/hagbard-01/vyos-vmwaretools-scripts

If you need help building or installing the package, let me know. I'll see that it gets integrated into the rolling releases, which will take a little while, however.

hagbard changed the task status from Open to Needs testing. Dec 7 2018, 12:08 AM
yun added a comment. Dec 7 2018, 2:40 PM

@hagbard I tested the script, and it works perfectly for interfaces with static addresses. However, interfaces with "dhcp" remain without an IP address after resuming. This is caused by the following issue I reported: T894

If I install netplugd with the correct modifications I commented in that ticket, it works great :)

I will add the dhcp functionality too. The problem is that the VyOS network config is very different from the approach of the underlying OS, and additional software is written to work with whatever the OS provides; it doesn't know about VyOS and its CLI, etc.
Thanks for testing. I will integrate the dhcp functionality asap and see that I can quickly get it into the rolling branch.

yun added a comment. Dec 7 2018, 6:25 PM

Will you then use the netplugd way mentioned in T894 or also issue a dhcp renew in the resume vmware script? I prefer the netplug way as this also fixes issues when you switch network. I can imagine we want to avoid double renewing.

hagbard added a subscriber: c-po. Dec 7 2018, 7:20 PM

I don't think I'll use netplugd for now. I just have a check in the script for how an address was set up on the system; if it was DHCP, I send a release (suspend doesn't do that) and a new DHCP request. I have to chat with @c-po about T894 first.
If netplugd is introduced into the OS, I can simply remove an else clause.
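
The check described above could be sketched roughly as follows. This is not the code from the repository, only an illustration under stated assumptions: the function name and the shape of the `addresses` argument are hypothetical, and the commands are returned as lists rather than executed, so the sketch has no side effects.

```python
def plan_resume(ifname, addresses):
    """Return the commands a resume hook might run for one interface.

    addresses is assumed to be the list of configured address values,
    e.g. ["dhcp"] or ["192.0.2.1/24"] (hypothetical input format).
    """
    # Bring the link up first in either case.
    cmds = [["ip", "link", "set", "dev", ifname, "up"]]
    if "dhcp" in addresses:
        # Suspend does not release the lease, so release it explicitly
        # and then request a new one.
        cmds.append(["dhclient", "-r", ifname])
        cmds.append(["dhclient", "-q", ifname])
    else:
        # Static addresses were flushed on suspend; add them back.
        for addr in addresses:
            cmds.append(["ip", "addr", "add", addr, "dev", ifname])
    return cmds
```

If DHCP handling moved elsewhere (for example to netplugd), the DHCP branch is the part that would go away, leaving only the static re-add.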

Here is the latest version, which I requested to have added to the rolling releases:
https://github.com/vyos/vyos-vmwaretools-scripts

c-po added a comment. Dec 7 2018, 7:31 PM

Netplugd seems to be the thing we want

hagbard changed the task status from Needs testing to Blocked. Dec 11 2018, 6:22 PM

pending ci integration

pasik added a subscriber: pasik. Dec 16 2018, 11:22 AM
hagbard reassigned this task from hagbard to Unicron. Dec 18 2018, 6:05 PM
hagbard added subscribers: Unicron, hagbard.

@Unicron Can you please integrate the package below into ci?
https://github.com/vyos/vyos-vmwaretools-scripts

syncer reassigned this task from Unicron to UnicronNL. Jan 6 2019, 3:08 PM
hagbard closed this task as Resolved. Jan 17 2019, 7:57 PM
yun reopened this task as Needs testing. Edited Apr 21 2019, 6:41 PM

Hi,

I want to set this ticket back to "Needs testing" or even "Open". I have downloaded and tested vyos-rolling-2019-04-16, and it seems it is not properly fixed.

Steps to reproduce from livecd (I'm on VMware Fusion):

  1. Create a VM and attach one interface to it (e.g. the VMware NAT interface).
  2. Boot the VyOS livecd (I tested with vyos-rolling-2019-04-16), log in and configure eth0:
     config
     set interfaces ethernet eth0 address dhcp
     commit
  3. Test if networking works (e.g. run show interfaces).
  4. Leave a ping running: ping 8.8.8.8
  5. Now suspend and resume the VM.
  6. You will see that the ping fails: ping: sendmsg: Network is unreachable
  7. Test if networking works: run show interfaces. You should see that the interface remains in "Admin Down" state.
  8. Commands like sudo ifup eth0 or sudo ifconfig eth0 up don't work either.

This might be closely related to T894 as well.
So it seems that the interface down bug now also affects "dhcp" addresses and not only static IP addresses. I'm not sure in which version this started, as I tested both T894 and T1028 before and confirmed the fixes.

If you want I can open up a new ticket.

yun added a comment. Edited Apr 21 2019, 8:30 PM

I have a working fix, which combines the fixes I suggested earlier in T1028 and T894:

  • https://phabricator.vyos.net/T894#25889 (add working netplugd script, currently ifup/ifdown are missing and probe is commented, re-adding them is necessary)
  • In this original ticket I mentioned /etc/vmware-tools/scripts/vmware/network; ensure that del_intf_ip() doesn't flush addresses (e.g. by returning early)
  • Remove the added vmware script: /etc/vmware-tools/scripts/resume-vm-default.d/ether-resume.py
    • this Python script seems to try to do what netplugd tries to do (netplug probe, netplug in, netplug run-parts linkup.d)
    • I think removing this script and just relying on netplugd works better (https://phabricator.vyos.net/T1028#28225)

The only thing I'm not sure about is avoiding the ip addr flush dev $nic from the original vmware-scripts. But I haven't found a better fix yet.

yun added a comment. Edited Apr 21 2019, 11:19 PM

Attempt two of the fix, so disregard everything in the above attempt.

  • Fix /etc/netplug/netplug: prefix run-parts with the full path to the binary: /bin/run-parts
    • This ensures that /etc/netplug/linkup.d/dhclient and /etc/netplug/linkdown.d/dhclient are actually run on interface up and down.
    • The ifup and ifdown calls in the original netplug script are actually not needed anymore, I don't know why. Maybe netplug does it internally without relying on external tools?
  • Edit /etc/vmware-tools/scripts/resume-vm-default.d/ether-resume.py and remove the dhclient parts
    • This ensures that static address interfaces get their addresses added after interface up (because they were removed by del_intf_ip()).
    • However, the dhclient parts are removed because DHCP is still handled by the netplug scripts, and they do a better job as they use specific configuration flags.

I tested this, and this all seems to work:

  • dhcp interface -> suspend -> resume: dhcp is properly renewed after resumption
  • static address interface -> suspend -> resume: static addresses are properly readded to interface after resumption
  • dhcp interface -> switch network interface: dhcp is properly renewed, different subnet is assigned.
syncer lowered the priority of this task from High to Low. Apr 22 2019, 3:08 AM
syncer added a subscriber: dmbaturin.

suspend shouldn't be supported at all?
@c-po @hagbard @dmbaturin your thoughts on that?

It should work, at least it does for me. ether-resume is my script; it just makes sure that dhcp is called when it was configured, and if not, it just brings the interface up and reapplies the IP address. Routes should be in frr anyway, and with the interface up again they would become active again as well. Netplugd calls scripts on link up and down events and doesn't call anything within open-vm-tools.

yun added a comment. Apr 22 2019, 4:15 PM

@hagbard Can you please test the steps I mentioned here, to see if you can reproduce: https://phabricator.vyos.net/T1028#35591
Without any modifications to any scripts, it will bring the interface into permanent down state after suspend and resume.

Also, ether-resume.py just calls dhclient -r (for release) and dhclient -q (for starting it again). But dhclient is normally called with a lot of options that are not configured in this script. It would be better to keep this consistent throughout VyOS. Preferably it would call the same commands that are run by renew dhcp interface X and release dhcp interface X, which is the case with the netplug dhclient scripts.

The dhcp release and renew is actually already done by the netplugd scripts, but they were not called because run-parts was not in the path (which is a bug in itself). So I think if we fix this, we can just omit it from ether-resume.py, as it will already be properly handled by netplug as intended.

yun added a comment. Apr 23 2019, 8:02 PM

Ok final attempt and trivial fix.
It seems that changing run-parts to /bin/run-parts was not needed. So netplug works fine as it is.

So the only thing we need to do is remove the dhclient parts from /etc/vmware-tools/scripts/resume-vm-default.d/ether-resume.py. I can confirm this works and doesn't kill the interface.

Do you want me to create a Pull Request (currently don't know how) or can you make this small change @hagbard?

@yun yes, please create a PR, I'll have a look at it asap.

hagbard added a comment. Edited Apr 23 2019, 10:25 PM

I left you a few comments on the PR. I tested it now as well, and your code doesn't work from what I see. But I see that dhcp stopped working, so I'll have a look and see what I can find out. It looks like netplugd in the latest rolling has an issue too.

The interference seems to come from networkd, which is executed via ./scripts/vmware/network resume-vm executed by vmware-toolsd. So that looks like a longer mission.

yun added a comment. Edited Apr 24 2019, 7:29 AM

I wonder what changed then, will also test with latest rolling

Update: Works for me in 1.2.0-rolling-201904240337. What was exactly changed in netplug?

I didn't change anything. I did the netplug changes via T894, I think; I would have to look it up. The only change happened in vmware-tools itself: we switched to the Debian jessie package and now to the bpo (backports) jessie one.
That package contains the suspend, resume, poweroff and poweron scripts/structure. Netplug is entirely separate, and the package comes from our pool. It contains linkdown and linkup scripts, which basically trigger the link up/down scripts; these are the original Vyatta ones, which previously came via vyatta-cfg-system or so.
So there was basically a huge cleanup, plus making netplugd available again (it was removed for an unknown reason before), repackaging, and the latest open-vm-tools plus the script we deploy for it for the resume/suspend mechanism.
Right now I'm not sure how stable it is, so please let me know if you uncover further issues. It should be logged via syslog, so we have a chance to investigate what it may be doing when it's blocked.

yun added a comment. Apr 24 2019, 7:10 PM

Thanks for the detailed history, that makes things more clear.
So for me the latest rolling worked, do you know what part from networkd is interfering with dhcp for you? Did you see if netplug called dhcp correctly after resume?

You can check with:

journalctl -fu vyos-netplug

netplugd is just fine. vmwaretoolsd tries to start (resume-vm) the interfaces via systemd-networkd (it looks for the interface files), then ether-resume kicks in and starts dhclient; so far so good. Netplugd isn't the issue here. I have also seen it 2-3 times, but right now it works in the same environment I used yesterday. I think there might be some interference with systemd-networkd called by the vmware scripts.
If you observe it again, please let me know the image you used, so I can reproduce it better. Also, you should find /var/log/vmware-net.... logs on the system; they basically trace all calls from the vmware supplied scripts when you trigger an action via vmwaretoolsd. If it happens again, let's have a look at these files.

hagbard added a comment. Edited Apr 24 2019, 7:20 PM

I've closed your PR without merging, since it can't be the script. Shall I close this bug here for now and you open a new one when you hit the road bumps again?

yun added a comment. Edited Apr 24 2019, 8:03 PM

Hi hagbard, I don't understand why you closed the PR so early, without me testing the latest iso. When you refer to the "latest" iso, please also note the rolling date. This makes it easier for everyone who tries to contribute, I think.

I didn't say netplug is the issue here, I just asked if you see it renewing the DHCP lease. Because in that case there is no need for ether-resume.py to renew it.. or is there?

In my case the "bare minimum" dhclient -r and dhclient -q in ether-resume.py breaks the interface. So in that case it's better to just rely on netplug, don't you agree? (or others?)

I wonder what changed then, will also test with latest rolling

Update: Works for me in 1.2.0-rolling-201904240337. What was exactly changed in netplug?

So does it work for you, or does it not?
If it works, since nothing has changed in terms of the scripts, the scripts can't be an issue. The dhcp issue seems to be a runtime issue.

yun added a comment. Edited Apr 24 2019, 8:21 PM

@hagbard But we were talking about my patch, and that it didn't work for you in the latest rolling... So I tested my patch in the latest rolling (and noted the date) and found that it worked. Should I have made it more clear that I was testing my patch?

So to be clear: My patch works for 1.2.0-rolling-201904240337. Without the patch it still breaks as I mentioned in the bug report.

Ah, I see. You didn't mention that, so I was quite confused about your statement.

yun added a comment. Edited Apr 24 2019, 8:31 PM

When I read it back, I can understand the confusion. Sorry, will try to be more clear next time.

So for me the issue still exists, even in vyos-1.2.0-rolling+201904240337-amd64 and I am sure it is because of "dhclient -r" and restarting dhclient with just "dhclient -q" in ether-resume.py. Because dhclient is actually started with a lot of different flags and now it just starts without any extra flags.

I can confirm that after removing dhclient bits from ether-resume.py it works for me and fixes this issue. (so basically the PR)

Ok, just reopen the PR. I'll review and merge it in then.

yun added a comment. Apr 24 2019, 9:04 PM

Thanks, I tested it, my findings below:

  • From a LiveCD boot: I installed the new package, and it fixes the issue. Without the new package the issue remains.
  • However, I tested it more thoroughly by also installing to disk and updating the package. But then, after suspend and resume, the issue remains... weird??

I wonder if there is anything different between a livecd boot and an installed-to-disk image boot. I will investigate.

yun added a comment. Edited Apr 24 2019, 9:12 PM

Because you mentioned networkd earlier, I looked into this immediately and found the following differences:

Installed image:

vyos@vyos:~$ sudo systemctl status systemd-networkd
● systemd-networkd.service - Network Service
   Loaded: loaded (/lib/systemd/system/systemd-networkd.service; disabled)
   Active: active (running) since Tue 2019-04-23 22:00:21 UTC; 23h ago
     Docs: man:systemd-networkd.service(8)
 Main PID: 9824 (systemd-network)
   Status: "Processing requests..."
   CGroup: /system.slice/systemd-networkd.service
           └─9824 /lib/systemd/systemd-networkd

Freshly booted livecd:

vyos@vyos:~$ sudo systemctl status systemd-networkd
● systemd-networkd.service - Network Service
   Loaded: loaded (/lib/systemd/system/systemd-networkd.service; disabled)
   Active: inactive (dead)
     Docs: man:systemd-networkd.service(8)

So as you mentioned earlier, networkd is probably causing issues as well. I will continue debugging..
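
For comparing the two outputs above without reading them by eye, a small helper could pull the state out of the "Active:" line. This is only a sketch: the function name is made up, and it assumes the status text has the layout shown in the two listings above.

```python
import re

# Matches e.g. "   Active: active (running) since ..." or "   Active: inactive (dead)".
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)\s+\(([^)]+)\)", re.MULTILINE)


def active_state(status_text):
    """Return (state, sub_state) from `systemctl status` text, or None."""
    match = ACTIVE_RE.search(status_text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Fed with the text of the two listings, this yields ("active", "running") for the installed image and ("inactive", "dead") for the freshly booted livecd.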

hagbard added a comment. Edited Apr 24 2019, 9:49 PM

Check the /var/log/vmware-network.log files; the tool creates a log for each type and rotates it once the command has finished.

@yun I think I found something: vmware-tools doesn't always call ether-resume.py, sometimes it does and sometimes it doesn't. I tested with 1.2.0-rolling+201904240337 and did a suspend and resume multiple times with the old ether-resume.py, and everything is just working fine.

Can you please test with this image too? (https://downloads.vyos.io/rolling/current/amd64/vyos-1.2.0-rolling%2B201904240337-amd64.iso)

Let me know your findings.

yun added a comment. Edited Apr 25 2019, 8:24 PM

hi @hagbard, I did some extensive testing. Actually I was already testing with "1.2.0-rolling+201904240337". So here are my findings.

  • I can consistently reproduce the interface down state with the old "ether-resume.py" version (where it calls dhclient), both in livecd and in an installed image.
    • With the dhclient parts commented out, or with the new version, this doesn't happen.
  • In some freshly installed VyOS VMs, netplug is giving issues and does not bring the interface back up. It goes into PROBE and then just makes the interface INACTIVE.
    • This seems to be some inconsistency with VMware, as when I install a fresh VM from the iso it sometimes just works flawlessly with suspend and resume (with the dhclient part commented out in ether-resume.py).
    • I'm baffled as to what could cause these weird issues, as I create the VyOS VMs identically afaik.
    • I debugged netplug further in these "broken" VMs, and it seems that triggering the "dhclient up" script can cause the interface to go back into a "dhclient up event" again, eventually taking the interface from PROBE to INACTIVE state.

Also, the only difference (besides the Python code refactor) is that the new one does not do any "dhclient" stuff, because it is already handled by netplug. You can confirm this when you tcpdump on the interface that is on dhcp. You will see double DHCP REQUEST packets with the old script.

Another way is to just check journalctl of vyos-netplug, and check if you see dhclient action there.

If you can confirm this, then maybe we can both agree that "dhclient" is not needed in ether-resume.py.

Which VMware do you test with? I'm on VMware Fusion Professional Version 11.0.1 (10738065)

Hi @yun,

I had mixed results with the dhclient part, but that's not the major issue, and the April 25 iso should have the refactored script on board. I see exactly the same issues with netplug that you see. I went a few rolling isos back to test with, but couldn't determine yet when it started. Even an ip link set up dev <device> doesn't bring the interface back up. netplug gets the status information via the netlink interface from the kernel, so I'm going to start looking there to see if anything has changed. Going forward, I think systemd-networkd will be the successor sooner or later anyway, so I have to play around with it at some point. It usually monitors the interfaces via netlink as well, but has more filters and rules you can apply.
I'm still not too sure why it sometimes breaks and sometimes works. I didn't find anything useful in the logs either, only the information we already have.
I was using esxi 6.7.

yun added a comment. Edited Apr 25 2019, 10:42 PM

Yes, it's a pretty vague bug, and whether it works or not seems to depend more on how the VM was initially created.

Don't think it's really worth the effort to debug netplug then as it's quite old and outdated. systemd-networkd support would be pretty sweet. I actually got it working for dhcp as a test:

# stop and disable netplug
sudo systemctl stop vyos-netplug
sudo systemctl disable vyos-netplug

# create eth0.network
cat /etc/systemd/network/10-eth0.network
-------------
[Match]
Name=eth0

[Network]
DHCP=v4
------------

# check log output
sudo journalctl -fu systemd-networkd

Suspend and resume, or change the network adapter. It will work pretty flawlessly :)

I was thinking we could actually generate a /run/systemd/network/10-dhcp-interfaces.network file that just contains all the interfaces that are configured for dhcp. Like this:

[Match]
Name=eth0 eth1

[Network]
DHCP=v4

Of course we also need to check if we need other dhcp client settings. If this works it would be a good replacement for netplug and more future proof as you said :)

UPDATE: actually whitespace separated list doesn't seem to be supported by the systemd version in VyOS 1.2.
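
Given that limitation, one way around it would be to generate one .network file per DHCP interface instead of a single combined file. A rough sketch of that idea follows. The function names and the 10-<ifname>.network naming (modelled on the 10-eth0.network file above) are assumptions, and the sketch only returns the planned file contents rather than writing anything.

```python
def render_network_unit(ifname):
    """Render a minimal systemd-networkd unit enabling DHCPv4 on ifname."""
    return "[Match]\nName=%s\n\n[Network]\nDHCP=v4\n" % ifname


def plan_network_units(dhcp_interfaces, directory="/run/systemd/network"):
    """Map file path -> file content, one unit per DHCP interface."""
    return {
        "%s/10-%s.network" % (directory, ifname): render_network_unit(ifname)
        for ifname in sorted(dhcp_interfaces)
    }
```

For ["eth0", "eth1"] this plans two files under /run/systemd/network, each matching a single interface name, which avoids relying on a whitespace separated Name= list.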

Not sure if you're still looking into this, but the following script works perfectly for me on the crux branch:

https://github.com/unauthorized-access-bv/vyos-vmwaretools-scripts/blob/crux/scripts/resume-vm-default.d/ether-resume.py

yun added a comment. Aug 10 2019, 5:22 PM

Hi Donny,

Nice to see you here :)

I'm still on VyOS 1.2.0-rolling+201904160337 and didn't have the problem after my patch and a good install. I assume this suddenly broke in a newer version because the VyOS Python API changed? It looks like return_effective_values() now returns a list instead of a string.

Regards,
Yun
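
If the return type of return_effective_values() did change as suspected above, a script that has to run on both older and newer images could normalize the value before using it. A minimal sketch, assuming the older string form holds whitespace separated values (that detail is a guess, and the helper name is hypothetical):

```python
def as_list(value):
    """Normalize a config value that may be None, a string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: multiple values in the old string form are
        # separated by whitespace.
        return value.split()
    return list(value)
```

With this, a check such as "dhcp" in as_list(addresses) behaves the same for "dhcp" and ["dhcp"], and avoids the substring matching that a plain "in" test on a string would do.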