
High CPU usage by bgpd when snmp is active
Backport candidate, Normal, Public

Description

In some scenarios, when the routing table holds a large number of routes, the snmpd process consumes 100% of a CPU core. This causes bgpd to stop and lose its settings, and the host has to be restarted.

top - 19:06:24 up 23:50,  1 user,  load average: 0.23, 0.12, 0.04
Tasks: 142 total,   2 running,  99 sleeping,   0 stopped,   0 zombie
%Cpu0  :  3.7 us,  0.3 sy,  0.0 ni, 96.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  1.4 us,  0.7 sy,  0.0 ni, 96.6 id,  0.3 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu2  :  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 98.7 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   4040552 total,  2620536 used,  1420016 free,   111856 buffers
KiB Swap:        0 total,        0 used,        0 free.   244692 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3963 snmp      20   0  173256 119956   4504 R 100.0  3.0   2:05.66 snmpd
 4008 root      20   0  104248  20036  14980 S   4.0  0.5  50:36.23 uacctd
 4003 root      20   0  119708  33596  23456 S   1.3  0.8  26:15.33 uacctd
   17 root      20   0       0      0      0 S   0.3  0.0   1:14.61 ksoftirqd/1

After removing the ipCidrRouteTable and inetCidrRouteTable modules from the snmpd configuration as a test, CPU usage returned to normal.

SNMPDOPTS='-LSed -u snmp -g snmp -I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid'
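
A minimal sketch of how such an option string could be assembled from a script, so that the module list is always joined with bare commas. The helper name `build_snmpd_opts` and its parameters are hypothetical and are not taken from the VyOS source. The only assumption about snmpd itself is that `-I` accepts a comma-separated module list in which a leading `-` means "do not initialize these modules".

```python
def build_snmpd_opts(excluded_modules, user="snmp", group="snmp",
                     pidfile="/run/snmpd.pid"):
    """Return an snmpd option string that skips the given MIB modules.

    Hypothetical helper for illustration, not the actual VyOS code.
    """
    opts = ["-LSed", "-u", user, "-g", group]
    if excluded_modules:
        # Join with bare commas: a space inside the list would split it
        # into two separate command-line arguments.
        module_list = ",".join(m.strip() for m in excluded_modules)
        opts += ["-I", "-" + module_list]
    opts += ["-p", pidfile]
    return " ".join(opts)


print(build_snmpd_opts(["ipCidrRouteTable", "inetCidrRouteTable"]))
```

With an empty list the `-I` option is omitted entirely, which leaves snmpd with its default module set.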

Details

Difficulty level
Unknown (requires assessment)
Version
-
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Perfectly compatible

Event Timeline

fvbrasileiro created this object in space S1 VyOS Public.
Dmitry added a subscriber: Dmitry. Oct 8 2019, 2:49 PM
c-po added a subscriber: c-po. Oct 8 2019, 8:07 PM

Any way of reproducing this with a simple snmpwalk? snmpget?
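
One way to attempt a reproduction is to time a walk of each route table separately. The sketch below is only an assumption about how that could be scripted: the host `localhost`, the community string `public`, and the helper names are placeholders, and the numeric OIDs are the IP-FORWARD-MIB entries for the two tables named in the description.

```python
import subprocess
import time

# Numeric OIDs of the two route tables in IP-FORWARD-MIB.
ROUTE_TABLE_OIDS = {
    "ipCidrRouteTable": "1.3.6.1.2.1.4.24.4",
    "inetCidrRouteTable": "1.3.6.1.2.1.4.24.7",
}


def walk_command(oid, host="localhost", community="public"):
    """Build the snmpwalk argument list (host and community are placeholders)."""
    return ["snmpwalk", "-v2c", "-c", community, host, oid]


def time_walk(oid, timeout=300, **kwargs):
    """Run snmpwalk once and return (elapsed seconds, number of output lines)."""
    start = time.monotonic()
    result = subprocess.run(walk_command(oid, **kwargs),
                            capture_output=True, text=True, timeout=timeout)
    elapsed = time.monotonic() - start
    return elapsed, len(result.stdout.splitlines())


def main():
    # Walk each table in turn and report how long the agent took to answer.
    for name, oid in ROUTE_TABLE_OIDS.items():
        seconds, rows = time_walk(oid)
        print(f"{name}: {rows} rows in {seconds:.1f} s")
```

Calling `main()` on the affected router while watching `top` should show whether walking either table alone is enough to pin snmpd at 100%.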

hagbard claimed this task. Oct 10 2019, 2:50 PM
hagbard changed the task status from Open to Needs testing. Oct 10 2019, 3:07 PM
hagbard triaged this task as Normal priority.
hagbard added a project: VyOS 1.2 Crux.
hagbard moved this task from Need Triage to In Progress on the VyOS 1.2 Crux board.
c-po added a comment. Oct 10 2019, 4:03 PM

I do not see this problem on a full-table v4/v6 router with 2 cores and 4 GB RAM. The question is why? Is removing the tables a good idea? What was the state with 1.1.8?

hagbard added a comment. Edited Oct 10 2019, 4:17 PM

There were multiple complaints in the forum about bgpd crashes and memory issues. The users there successfully applied the workaround of removing the tables from snmpd.

me@vyos:/tmp$ dpkg -l | grep vyos-1x
ii vyos-1x 1.3.0-16 all VyOS configuration scripts and data

me@vyos:/tmp$ sudo dpkg -i vyos-1x_1.3.0-16_all.deb
(Reading database ... 56555 files and directories currently installed.)
Preparing to unpack vyos-1x_1.3.0-16_all.deb ...
Unpacking vyos-1x (1.3.0-16) over (1.3.0-16) ...
dpkg: dependency problems prevent configuration of vyos-1x:
vyos-1x depends on accel-ppp; however:
Package accel-ppp is not installed.

dpkg: error processing package vyos-1x (--install):
dependency problems - leaving unconfigured
Processing triggers for systemd (215-17+deb8u13) ...
Errors were encountered while processing:
vyos-1x

me@vyos:/tmp$ sh version
Version: VyOS 1.2-rolling-201909242149

@fvbrasileiro Yeah, we found that out today too. We are already working on a solution. Please be patient.

fvbrasileiro added a comment. Edited Oct 11 2019, 10:19 AM

No problem, I had already made the change manually in the snmp.py file. Since then, the problem has not occurred.

c-po added a comment. Oct 13 2019, 4:22 PM

With this change, SNMP no longer starts when I do not use BGP:

Oct 13 18:18:47 LR4 snmpd[5404]: getaddrinfo: inetCidrRouteTable Name or service not known
Oct 13 18:18:47 LR4 snmpd[5404]: getaddrinfo("inetCidrRouteTable", NULL, ...): Name or service not known
Oct 13 18:18:47 LR4 snmpd[5404]: Error opening specified endpoint "inetCidrRouteTable"
Oct 13 18:18:47 LR4 snmpd[5404]: Server Exiting with code 1
Oct 13 18:18:47 LR4 snmpd[5401]: Starting SNMP services::
Oct 13 18:18:47 LR4 systemd[1]: snmpd.service: control process exited, code=exited status=1
Oct 13 18:18:47 LR4 systemd[1]: Failed to start LSB: SNMP agents.
Oct 13 18:18:47 LR4 systemd[1]: Unit snmpd.service entered failed state.
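
The log does not state the root cause. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the module list reached snmpd split at a space after the comma, so that `inetCidrRouteTable` arrived as a separate positional argument. snmpd treats positional arguments as listening addresses, which would match the "Error opening specified endpoint" message. The sketch below uses a hypothetical helper, `stray_arguments`, to show how shell-style splitting handles the two variants of the option string.

```python
import shlex


def stray_arguments(option_string, flags_with_value=("-u", "-g", "-I", "-p")):
    """Return tokens that are neither options nor values of known options.

    Hypothetical helper: it mimics shell word splitting, not snmpd's
    real argument parser.
    """
    stray = []
    skip_next = False
    for token in shlex.split(option_string):
        if skip_next:
            # This token is the value of the preceding option.
            skip_next = False
            continue
        if token in flags_with_value:
            skip_next = True
        elif not token.startswith("-"):
            stray.append(token)
    return stray


with_space = ("-LSed -u snmp -g snmp "
              "-I -ipCidrRouteTable, inetCidrRouteTable -p /run/snmpd.pid")
without_space = ("-LSed -u snmp -g snmp "
                 "-I -ipCidrRouteTable,inetCidrRouteTable -p /run/snmpd.pid")

print(stray_arguments(with_space))     # ['inetCidrRouteTable']
print(stray_arguments(without_space))  # []
```

If this reading is right, joining the module names without whitespace would avoid the stray argument.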
c-po added a comment. Oct 13 2019, 5:54 PM

Okay, I just installed the latest rolling release and it does not even boot anymore. It's trapped in an snmp restart loop somehow. Going to revert this commit.

pasik added a subscriber: pasik. Mon, Oct 14, 5:15 PM

Can't create an ISO right now to test it.

Works with:

Version:          VyOS 1.2-rolling-201910110117
Built by:         autobuild@vyos.net
Built on:         Fri 11 Oct 2019 01:17 UTC
Build UUID:       48a11fa6-8c59-4dbb-94a3-215376c09a02
Build Commit ID:  46f9b2ab60e4fa
hagbard changed the task status from Needs testing to Backport candidate. Fri, Oct 18, 7:58 PM
hagbard moved this task from In Progress to Backlog on the VyOS 1.2 Crux board.
syncer moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus board.
syncer moved this task from Needs Triage to Backlog on the VyOS 1.2 Crux (VyOS 1.2.4) board.