
open-vm-tools causing 100% CPU load
Closed, Resolved | Public | BUG

Description

VMware Tools are using 100% CPU after installing VyOS 1.2.0-rc11 within VMware.

Tasks: 165 total,   4 running,  72 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.5 us, 65.4 sy,  0.0 ni, 18.5 id,  2.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:   1017016 total,   409800 used,   607216 free,    31236 buffers
KiB Swap:        0 total,        0 used,        0 free.   134692 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 2209 root      20   0  197004  15076   8464 R 95.3  1.5   0:58.97 vmtoolsd
    1 root      20   0  110612   5140   3276 S  0.0  0.5   0:01.05 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I  0.0  0.0   0:00.00 rcu_gp

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.2.0-rc11
Why the issue appeared?
Will be filled on close
MrXermon updated the task description.
pasik added a subscriber: pasik. Dec 20 2018, 10:36 PM

IIRC I worked around this by editing /etc/vmware-tools/tools.conf and adding:

[guestinfo]
poll-interval=0

See if it works for you also.
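The workaround above can also be applied with a script. This is only a minimal sketch: the function name `set_guestinfo_option` is made up for illustration, and it assumes tools.conf is a plain INI-style file that Python's `configparser` can read. Note that `configparser` drops any comments already in the file when it rewrites it.

```python
import configparser
from pathlib import Path


def set_guestinfo_option(conf_path, key, value):
    """Set key=value in the [guestinfo] section, keeping other settings.

    Hypothetical helper for illustration; comments in the existing
    file are not preserved by configparser.
    """
    # interpolation=None: treat '%' literally; strict=False: tolerate duplicates.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names exactly as written
    path = Path(conf_path)
    if path.exists():
        parser.read(path)
    if not parser.has_section("guestinfo"):
        parser.add_section("guestinfo")
    parser.set("guestinfo", key, str(value))
    with path.open("w") as handle:
        # Write "key=value" without spaces, matching the snippet in the thread.
        parser.write(handle, space_around_delimiters=False)
```

As root on the affected guest, it would be called as `set_guestinfo_option("/etc/vmware-tools/tools.conf", "poll-interval", 0)`. The tools service presumably has to be restarted afterwards for the change to take effect; the thread does not say how.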

MrXermon renamed this task from open-vm-tool causing 100% CPU load to open-vm-tools causing 100% CPU load. Dec 20 2018, 10:48 PM

Yep, seems to do the job.

Do you have a full bgp table on this box?

Yes, but only IPv6. It seems that the VMware Tools only produce the load when the BGP sessions are established. Pretty strange! Maybe some sort of memory problem?

danhusan added a comment. Edited Dec 20 2018, 11:16 PM

It is vmware-tools collecting all the routes from VyOS and reporting them to the hypervisor so you can see them in the ESXi GUI. Not really optimal for full tables.

VyOS devs: I am not sure what else goes missing by setting 'poll-interval=0' but I have been running with it for many months without any ill effects.
A way of just disabling the route-polling would be this patch here: https://github.com/vmware/open-vm-tools/issues/231
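To judge whether a guest is likely to be affected, it can help to check how many routes the kernel actually holds. The sketch below counts the entries listed in `/proc/net/route` and `/proc/net/ipv6_route` on Linux. The function name is made up for illustration, the counts include local and link-local entries, and nothing in this thread confirms that these are the files vmtoolsd itself reads.

```python
from pathlib import Path


def count_kernel_routes(proc_net="/proc/net"):
    """Return (ipv4_routes, ipv6_routes) as listed by the Linux kernel."""
    base = Path(proc_net)

    def non_empty_lines(name):
        path = base / name
        if not path.exists():  # e.g. IPv6 disabled on this system
            return 0
        return sum(1 for line in path.read_text().splitlines() if line.strip())

    # /proc/net/route begins with a one-line column header; ipv6_route has none.
    ipv4 = max(non_empty_lines("route") - 1, 0)
    ipv6 = non_empty_lines("ipv6_route")
    return ipv4, ipv6
```

A result in the tens or hundreds of thousands would match the full-table setups described in this task.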

Uh. That's pretty crazy! I'll play around with the option on the other VMs I am migrating from Proxmox VE to VMWare ESXi.

syncer triaged this task as High priority. Dec 21 2018, 10:16 AM
syncer edited projects, added VyOS 1.2 Crux (VyOS 1.2.0-EPA); removed VyOS 1.2 Crux.
hagbard claimed this task. Dec 26 2018, 10:34 PM

Hi @danhusan, did you ever try another poll value, such as 3 or 5 seconds? If it is set to 0, the host system won't show any updated metadata, for example after you change the IP address.
(https://github.com/vmware/open-vm-tools/blob/master/open-vm-tools/services/plugins/guestInfo/guestInfoServer.c#L1662)
I'm therefore not entirely sure whether this should be treated as a special case (we could publish a KB article for anyone who runs into it) or as a general issue, since you two are the only ones who have experienced it as far as I know.
I'm also not sure whether it is triggered only by your situation (a full BGP table) or whether it can happen in other cases as well. If you come across more issues regarding that value, please let me know.

Hi @hagbard,
I played with different values, and in my case (full-table IPv6 router) the problem persists with the following values:
3s, 5s, 10s, 20s, 30s
At 60s the CPU load starts to cycle: more than 30s of full load, then it drops for a few seconds and rises again.

If I remove the full table from the router and only advertise a default route, the VMware Tools run fine.

Maybe the best way would be an option to disable only the route-gathering module within the VMware Tools.
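The load pattern described above for different poll intervals can be recorded rather than watched in top. The following is a rough sketch that samples a process's CPU time from `/proc/<pid>/stat` on Linux. The function names are invented for illustration, and the PID has to be looked up first (it was 2209 in the top output in the description).

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split only the part after the last closing parenthesis.
    fields = stat_line.rsplit(")", 1)[1].split()
    # Fields 14 (utime) and 15 (stime) of the full line; the slice starts at field 3.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, seconds, hz):
    """Share of one CPU core used between two samples, in percent."""
    return (ticks_after - ticks_before) / hz / seconds * 100.0


def watch(pid, interval=5.0, samples=12):
    """Print the CPU usage of `pid` once per interval."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    path = f"/proc/{pid}/stat"
    with open(path) as handle:
        previous = cpu_ticks(handle.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as handle:
            current = cpu_ticks(handle.read())
        print(f"{cpu_percent(previous, current, interval, hz):5.1f}% CPU")
        previous = current
```

Calling `watch(2209, interval=5, samples=24)` would cover two minutes, which should be long enough to show the cycling reported with a 60s poll interval.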

@MrXermon, yes, that sounds reasonable. I found in the code that they limit it to 100 routes. Can you please try the following:
(https://github.com/vmware/open-vm-tools/blob/master/open-vm-tools/lib/include/conf.h#L138)

Within tools.conf you can test with max-ipv4-routes and max-ipv6-routes. Both default to 100; 0 disables route gathering entirely. If that does the trick, I think I will extend the tools-script package: VyOS is supposed to be a router, which usually deals with more than one route, so I guess that would be the right way to avoid VMware load issues.

tools.conf (please remove the poll-interval setting to test).
max-ipv6-routes=0
max-ipv4-routes=

danhusan added a comment. Edited Dec 28 2018, 7:19 PM

> Hi @danhusan, did you ever try another poll value, such as 3 or 5 seconds? If it is set to 0, the host system won't show any updated metadata, for example after you change the IP address.
> (https://github.com/vmware/open-vm-tools/blob/master/open-vm-tools/services/plugins/guestInfo/guestInfoServer.c#L1662)
> I'm therefore not entirely sure whether this should be treated as a special case (we could publish a KB article for anyone who runs into it) or as a general issue, since you two are the only ones who have experienced it as far as I know.
> I'm also not sure whether it is triggered only by your situation (a full BGP table) or whether it can happen in other cases as well. If you come across more issues regarding that value, please let me know.

I did, my experience was much the same as MrXermon's. Except that I have 700k routes and not ~60k. So the CPU was constantly pegged regardless.

Running with

[guestinfo]
max-ipv6-routes=0
max-ipv4-routes=0

does not seem to help; the process is still pegged at 99%.

And FYI max-ipv6-routes=10 and max-ipv4-routes=10 doesn't seem to help either.

Thanks for testing that, guys.

hagbard changed the task status from Open to In progress. Dec 29 2018, 7:09 PM

@syncer I was thinking of adding a CLI menu for vmtoolsd mitigations like these. It seems that not many users are affected, but a CLI option could configure the tools daemon to prevent problems like this, and the daemon has tons of other options. So I'm not really sure how I should go forward with this one.