
open-vm-tools causing 100% CPU load
Closed, Resolved, Public, BUG

Description

VMware Tools (vmtoolsd) are using 100% CPU after installing VyOS 1.2.0-rc11 within VMware.

Tasks: 165 total,   4 running,  72 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.5 us, 65.4 sy,  0.0 ni, 18.5 id,  2.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:   1017016 total,   409800 used,   607216 free,    31236 buffers
KiB Swap:        0 total,        0 used,        0 free.   134692 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2209 root      20   0  197004  15076   8464 R 95.3  1.5   0:58.97 vmtoolsd
    1 root      20   0  110612   5140   3276 S  0.0  0.5   0:01.05 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I  0.0  0.0   0:00.00 rcu_gp
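When comparing workarounds it can help to pull the vmtoolsd figure out of `top` output programmatically. The following is a minimal sketch, assuming the column layout shown above (a header row containing PID, %CPU and COMMAND); `cpu_percent` is a hypothetical helper name, not part of any tool mentioned here.

```python
from typing import Optional


def cpu_percent(top_output: str, command: str) -> Optional[float]:
    """Return the %CPU value for the first row whose COMMAND matches.

    Assumes the header row names the columns exactly as in the paste
    above and that no column value contains whitespace.
    """
    header = None
    for line in top_output.splitlines():
        fields = line.split()
        if header is None:
            # Skip the summary lines until the column header is found.
            if "PID" in fields and "%CPU" in fields and "COMMAND" in fields:
                header = fields
            continue
        if len(fields) < len(header):
            continue
        if fields[header.index("COMMAND")] == command:
            return float(fields[header.index("%CPU")])
    return None
```

Feeding it the output of `top -b -n 1` and asking for `"vmtoolsd"` would return the value from the %CPU column, or `None` if the process is not listed.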

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.2.0-rc11
Why the issue appeared?
Will be filled on close

Event Timeline

MrXermon updated the task description.
pasik added a subscriber: pasik. Dec 20 2018, 10:36 PM

IIRC I worked around this by editing /etc/vmware-tools/tools.conf and adding:

[guestinfo]
poll-interval=0

See if it works for you also.
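For anyone applying this to several VMs, the edit could be scripted. This is only a sketch: it assumes tools.conf is plain INI-style text that Python's `configparser` can read, and `set_guestinfo_option` is a hypothetical helper. Note that `configparser` drops comments when it rewrites the file.

```python
import configparser
import io


def set_guestinfo_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with key=value set in the [guestinfo] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("guestinfo"):
        parser.add_section("guestinfo")
    parser.set("guestinfo", key, value)
    out = io.StringIO()
    # Write "key=value" without spaces, matching the snippet above.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

A caller would read /etc/vmware-tools/tools.conf, pass its contents with `"poll-interval"` and `"0"`, write the result back, and then restart the tools service.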

MrXermon renamed this task from open-vm-tool causing 100% CPU load to open-vm-tools causing 100% CPU load. Dec 20 2018, 10:48 PM

Yep, seems to do the job.

Do you have a full bgp table on this box?

Yes, but only IPv6. It seems that the VMware Tools only produce the load when the BGP sessions are established. Pretty strange! Maybe some sort of memory problem?

danhusan added a comment. Edited Dec 20 2018, 11:16 PM

It is vmware-tools collecting all the routes from VyOS and reporting them to the hypervisor so you can see them in the ESXi GUI. Not really optimal for full tables.

VyOS devs: I am not sure what else goes missing by setting 'poll-interval=0' but I have been running with it for many months without any ill effects.
A way of just disabling the route-polling would be this patch here: https://github.com/vmware/open-vm-tools/issues/231
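To check whether a guest carries a table large enough to matter, one rough option is to count the kernel's routes directly. The sketch below assumes a Linux guest where procfs exposes /proc/net/route (IPv4, with one header line) and /proc/net/ipv6_route (IPv6, no header); `count_routes` is a hypothetical helper, and the numbers include local and connected routes, so treat them as approximate.

```python
from pathlib import Path


def count_routes(path: str, has_header: bool) -> int:
    """Count non-empty lines in a procfs route table file."""
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    if has_header:
        return max(len(lines) - 1, 0)
    return len(lines)


if __name__ == "__main__":
    tables = (
        ("IPv4", "/proc/net/route", True),
        ("IPv6", "/proc/net/ipv6_route", False),
    )
    for family, path, has_header in tables:
        try:
            print(family, count_routes(path, has_header))
        except OSError as exc:
            # The file is missing on non-Linux systems or without procfs.
            print(family, "unavailable:", exc)
```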

Uh. That's pretty crazy! I'll play around with the option on the other VMs I am migrating from Proxmox VE to VMWare ESXi.

syncer triaged this task as High priority. Dec 21 2018, 10:16 AM
syncer edited projects, added VyOS 1.2 Crux ( VyOS 1.2.0-EPA); removed VyOS 1.2 Crux.
hagbard claimed this task. Dec 26 2018, 10:34 PM

Hi @danhusan, did you ever try another poll value, such as 3 or 5 seconds? If it is set to 0, the host system won't show any updated metadata, for example after you change the IP address.
(https://github.com/vmware/open-vm-tools/blob/master/open-vm-tools/services/plugins/guestInfo/guestInfoServer.c#L1662)
I'm therefore not sure whether this should be treated as a special case (we could publish a KB article for anyone who runs into it) or as a general issue, since as far as I know you two are the only ones who have experienced it.
I'm also not sure whether it is only triggered by your situation (a full BGP table) or whether it can happen in other circumstances as well. If you come across more issues related to that value, please let me know.

Hi @hagbard,
I played with different values, and in my case (full-table IPv6 router) the problem persists with the following values:
3s, 5s, 10s, 20s, 30s
At 60s the CPU load starts to cycle: more than 30s at full load, then it drops for a few seconds and rises again.

If I remove the full table from the router and only advertise a default route, the vmware-tools run fine.

Maybe the best way is to have an option to disable only the route-gathering module within the VMware Tools.

@MrXermon , yes that sounds reasonable. I found in the code that they limit it to 100 routes, can you please try the following:
(https://github.com/vmware/open-vm-tools/blob/master/open-vm-tools/lib/include/conf.h#L138)

Within the tools.conf you can test with:
max-ipv4-routes and max-ipv6-routes. Both default to 100, and 0 disables route gathering entirely. If that does the trick, I think I'll extend the tools-script package: VyOS is supposed to be a router, which usually deals with more than one route, so to avoid VMware load issues I guess that would be the right way to go.

tools.conf (please remove the poll-interval setting to test).
max-ipv6-routes=0
max-ipv4-routes=

danhusan added a comment. Edited Dec 28 2018, 7:19 PM

@hagbard I did try other poll values, and my experience was much the same as MrXermon's, except that I have 700k routes rather than ~60k, so the CPU was constantly pegged regardless.

Running with

[guestinfo]
max-ipv6-routes=0
max-ipv4-routes=0

does not seem to help, the process is still pegged at 99%

And FYI max-ipv6-routes=10 and max-ipv4-routes=10 doesn't seem to help either.

Thanks for testing that guys.

hagbard changed the task status from Open to In progress. Dec 29 2018, 7:09 PM

@syncer I was thinking of adding a CLI menu for vmwaretoolsd mitigations like these. It seems that not many users were affected by this, but if something were available in the CLI, it could configure toolsd to prevent issues like this, and toolsd has tons of options. So I'm not really sure how I should go forward with this one.

Hi all,

Just as additional info: stopping the polling causes some interaction between the ESXi host and the guest OS to fail, because the guest OS stops reporting useful information such as guest (ID, family, full name), hostname, IP addresses, and disks. Here is a comparison of guest properties between 1.1.8 and 1.2.0:

1.1.8

ToolsStatus                     : toolsOk
ToolsVersionStatus              : guestToolsUnmanaged
ToolsVersionStatus2             : guestToolsUnmanaged
ToolsRunningStatus              : guestToolsRunning
ToolsVersion                    : 2147483647
ToolsInstallType                : guestToolsTypeUnknown
GuestId                         : debian6_64Guest
GuestFamily                     : linuxGuest
GuestFullName                   : Debian GNU/Linux 6 (64-bit)
HostName                        : us-lxa-sl4ng1fw17-01
IpAddress                       : 10.4.73.42
Net                             : {4000, 4001, 4002, 4003...}
IpStack                         : {VMware.Vim.GuestStackInfo}
Disk                            : {VMware.Vim.GuestDiskInfo, VMware.Vim.GuestDiskInfo, VMware.Vim.GuestDiskInfo}
Screen                          : VMware.Vim.GuestScreenInfo
GuestState                      : running
AppHeartbeatStatus              : appStatusGray
GuestKernelCrashed              : False
AppState                        : none
GuestOperationsReady            : True
InteractiveGuestOperationsReady : False
GuestStateChangeSupported       : True
GenerationInfo                  :

1.2.0

ToolsStatus                     : toolsOk
ToolsVersionStatus              : guestToolsUnmanaged
ToolsVersionStatus2             : guestToolsUnmanaged
ToolsRunningStatus              : guestToolsRunning
ToolsVersion                    : 2147483647
ToolsInstallType                : guestToolsTypeUnknown
GuestId                         : 
GuestFamily                     : 
GuestFullName                   : 
HostName                        : 
IpAddress                       : 
Net                             : 
IpStack                         : 
Disk                            : 
Screen                          : VMware.Vim.GuestScreenInfo
GuestState                      : running
AppHeartbeatStatus              : appStatusGray
GuestKernelCrashed              : False
AppState                        : none
GuestOperationsReady            : True
InteractiveGuestOperationsReady : False
GuestStateChangeSupported       : True
GenerationInfo                  :

If you try to make some operations through VMware Tools like running scripts or copying files it may fail as ESXi/vCenter is not able to retrieve information about the guest so that it's not aware of what OS type is running inside (Linux/Windows, etc.) and some warnings appear when you try to do so.
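The comparison above can also be done mechanically when checking many guests. This is a minimal sketch that assumes each dump is text in the `Name : Value` layout shown above; `parse_properties` and `newly_empty` are hypothetical helper names.

```python
from typing import Dict, List


def parse_properties(dump: str) -> Dict[str, str]:
    """Parse 'Name : Value' lines into a dict (value may be empty)."""
    props = {}
    for line in dump.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        props[key.strip()] = value.strip()
    return props


def newly_empty(before: str, after: str) -> List[str]:
    """List properties that had a value in 'before' but are empty or missing in 'after'."""
    old = parse_properties(before)
    new = parse_properties(after)
    return [name for name, value in old.items() if value and not new.get(name, "")]
```

Applied to the two dumps above, it would list the properties that stopped being reported while polling was disabled, leaving out fields such as GenerationInfo that were empty in both.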

I know there is already an open bug against open-vm-tools, but I would recommend tracking it so that, once it is resolved, we can remove the CPU workaround and let the tools report guest info again.

syncer reopened this task as Open. Apr 9 2019, 10:08 AM

This requires a fix, since the workaround is not fully acceptable.

@syncer The only thing I can do is monitor upstream, which hasn't released a patch to address this issue in over a year.

hagbard removed hagbard as the assignee of this task. Apr 16 2019, 9:42 PM
hagbard added a subscriber: hagbard.
syncer changed the task status from Open to Needs testing. Apr 17 2019, 9:01 PM

Updated the tools to 10.3, which solves the issue.

Now it seems to report guest info successfully:

1.2.0-rolling+201904250337

ToolsStatus                     : toolsOk
ToolsVersionStatus              : guestToolsUnmanaged
ToolsVersionStatus2             : guestToolsUnmanaged
ToolsRunningStatus              : guestToolsRunning
ToolsVersion                    : 10346
ToolsInstallType                : guestToolsTypeOpenVMTools
GuestId                         : debian8_64Guest
GuestFamily                     : linuxGuest
GuestFullName                   : Debian GNU/Linux 8 (64-bit)
HostName                        : vyos
IpAddress                       : 10.5.73.222
Net                             : {4000}
IpStack                         : {VMware.Vim.GuestStackInfo}
Disk                            : {VMware.Vim.GuestDiskInfo, VMware.Vim.GuestDiskInfo, VMware.Vim.GuestDiskInfo, VMware.Vim.GuestDiskInfo...}
Screen                          : VMware.Vim.GuestScreenInfo
GuestState                      : running
AppHeartbeatStatus              : appStatusGray
GuestKernelCrashed              : False
AppState                        : none
GuestOperationsReady            : True
InteractiveGuestOperationsReady : False
GuestStateChangeSupported       : True
GenerationInfo                  :

As for CPU spikes when there are large routing tables, we haven't seen that behavior on our production systems. Anyway, I will try to set up a test system with thousands of static routes and let you know how it performs.

Thanks for the fix!

syncer assigned this task to dmbaturin. May 18 2019, 8:51 AM
syncer moved this task from Needs Triage to In Progress on the VyOS 1.2 Crux (VyOS 1.2.2) board.
syncer added a subscriber: dmbaturin.

@dmbaturin merge if it's not already there

With the new versions of open-vm-tools you can explicitly disable the pulling of the routing table. Maybe that's a better way than disabling the whole pulling of information.

https://github.com/vmware/open-vm-tools/issues/186
https://github.com/vmware/open-vm-tools/blob/fbfe0cbc116b6a171c29f29560d6dd1891a0870d/open-vm-tools/lib/include/conf.h#L146

/**
 * Define a custom max IPv4 routes to gather.
 *
 * @note Illegal values result in a @c g_warning and fallback to the default
 * NICINFO_MAX_ROUTES.
 *
 * @param int   User-defined max routes with range [0, NICINFO_MAX_ROUTES].
 *              Set to 0 to disable gathering.
 */
#define CONFNAME_GUESTINFO_MAXIPV4ROUTES "max-ipv4-routes"

/**
 * Define a custom max IPv6 routes to gather.
 *
 * @note Illegal values result in a @c g_warning and fallback to the default
 * NICINFO_MAX_ROUTES.
 *
 * @param int   User-defined max routes with range [0, NICINFO_MAX_ROUTES].
 *              Set to 0 to disable gathering.
 */
#define CONFNAME_GUESTINFO_MAXIPV6ROUTES "max-ipv6-routes"
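
The range and fallback rule described in those header comments can be modelled in a few lines. This is only an illustration of the documented behaviour, not the open-vm-tools implementation: it assumes NICINFO_MAX_ROUTES is 100 (the default quoted earlier in this task), and `effective_max_routes` is a hypothetical name.

```python
NICINFO_MAX_ROUTES = 100  # assumption: default value quoted earlier in this task


def effective_max_routes(raw: str, default: int = NICINFO_MAX_ROUTES) -> int:
    """Model of the documented rule: accept values in [0, default], else fall back."""
    try:
        value = int(raw.strip())
    except ValueError:
        # Non-numeric or empty values count as illegal here.
        return default
    if 0 <= value <= default:
        return value
    return default
```

Under this reading, `max-ipv4-routes=0` disables gathering, while an out-of-range or empty value would presumably behave like the default of 100 rather than turning the feature off.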
syncer closed this task as Resolved. Aug 22 2019, 10:36 PM