
BGP process memory leak
Needs testing, NormalPublicBUG

Description

After 6 days, 17 hours uptime with 1.2.0-rc10 I get the message that bgpd had to be killed:
Out of memory: Kill process 1009 (bgpd) score 420 or sacrifice child
Killed process 1009 (bgpd) total-vm: 1153620kB, anon-rss:853680kB, file-rss:1244kB, shmem-rss:0kB

This is with several full tables. I will try to find out more information from the logs and memory usage.
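One way to confirm a leak before the OOM killer fires is to log the daemon's resident memory over time. A minimal sketch, assuming a Linux /proc filesystem; the process name "bgpd" and the log path are illustrative:

```shell
#!/bin/sh
# Log bgpd's resident memory once an hour; a leak shows up as a
# steadily growing VmRSS trend in the log.
PID=$(pidof bgpd)
while [ -n "$PID" ] && [ -d "/proc/$PID" ]; do
    printf '%s %s kB\n' "$(date -u +%FT%TZ)" \
        "$(awk '/^VmRSS:/ {print $2}' "/proc/$PID/status")" >> /tmp/bgpd-rss.log
    sleep 3600
done
```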

Details

Difficulty level
Normal (likely a few hours)
Version
1.2.0-rc10
Why the issue appeared?
Will be filled on close

Event Timeline

Merijn created this task.Dec 11 2018, 3:53 PM
c-po added a subscriber: c-po.Dec 11 2018, 7:17 PM

Please note:

It is not assured that the Linux kernel OOM killer will kill the process that caused the memory leak. Instead it selects a victim based on a heuristic badness score, so a different process may be killed.
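The kernel's choice of victim can be inspected directly via the per-process badness score in /proc (a sketch; the daemon names are just examples from this task):

```shell
# Print the kernel's OOM badness score for each routing daemon; the
# process with the highest score is the most likely victim at OOM time.
for pid in $(pidof bgpd zebra ospfd); do
    printf 'pid=%s score=%s\n' "$pid" "$(cat "/proc/$pid/oom_score")"
done
```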

pasik added a subscriber: pasik.Dec 16 2018, 11:16 AM
syncer changed the task status from Open to Needs testing.Dec 21 2018, 10:45 AM
syncer triaged this task as Normal priority.
syncer edited projects, added VyOS 1.3 Equuleus; removed VyOS 1.2 Crux.
syncer added a subscriber: syncer.

@Merijn can you provide specs (is that HW deployment?)

@syncer
It is a virtual machine on Hyper-V 2016 with 2 cores, 2GB memory.
I compared the output from top over the past week and the only changes are in the uacctd processes.

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 998 frr       20   0 1018300 792792   3492 S   0.0 38.9  40:33.17 bgpd
 968 frr       20   0 1320968 642852   2480 S   0.0 31.6  47:45.19 zebra
3452 root      20   0  115536  28224  21976 S   0.0  1.4   2:50.35 uacctd
3453 root      20   0  110540  24180  21760 S   0.0  1.2   0:03.99 uacctd
3455 root      20   0  100604  10236   8524 S   0.0  0.5   0:00.56 uacctd
3424 snmp      20   0   56540   7744   2276 S   0.0  0.4   6:38.34 snmpd
9786 root      20   0   88404   5988   5084 S   0.0  0.3   0:00.23 sshd
1038 frr       20   0   96856   5876   2020 S   0.0  0.3   0:19.62 ospfd
9789 vyos      20   0   27780   5612   3328 S   0.0  0.3   0:00.19 vbash
1046 frr       20   0   96128   5448   2004 S   0.0  0.3   0:08.25 ospf6d
1015 frr       20   0   95420   5076   2004 S   0.0  0.2   0:08.17 ripd

I think 2 GB is not enough for several full tables.
I would recommend at least 4 GB.

It was at 2GB because it is a test router running in the production network.
I increased memory and will check to see if this resolves it.

rgrant added a subscriber: rgrant.Apr 24 2019, 6:42 PM

I am also seeing a memory leak on a BGP full tables router. It is NOT using flow accounting, but IS a Crux 1.2.1 compiled VMWare image.

Identically configured router right next to it running VyOS 1.2.0-rolling+201904060337 does NOT have the same problem, but has others that concern us (performance issues, which is why we're trying the VMWare image).

Both are VMWare, rolling was an ISO installed, 1.2.1 was compiled from crux on April 21. Both have a single full tables BGP feed with 760K routes, and an internal OSPF network.

1.2.1 Compiled VMWare image: "free"

             total       used       free     shared    buffers     cached
Mem:       4040468    3937944     102524      11008       6564      41768
-/+ buffers/cache:    3889612     150856
Swap:            0          0          0

Rolling ISO: "free"

             total       used       free     shared    buffers     cached
Mem:       4040812    1929208    2111604      91480     168272     481276
-/+ buffers/cache:    1279660    2761152
Swap:            0          0          0

The VMWare image machine has locked up once already, I expect it to lock up again within the next few hours.

NOTE that "top" agrees with "free" on the totals, but the per-process numbers do not add up to what "free" reports as used:

1.2.1 VMWare image:

KiB Mem:   4040468 total,  3936520 used,   103948 free,     6624 buffers
KiB Swap:        0 total,        0 used,        0 free.    40984 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1158 frr       20   0 1124936 458396   4008 S   0.3 11.3   1:02.72 zebra
 1169 frr       20   0  796420 549388   4364 S   0.0 13.6   0:43.25 bgpd
 1723 root      20   0  258676   3152   2684 S   0.0  0.1   0:00.05 rsyslogd
  930 root      20   0  115460   5596   4324 S   0.0  0.1   0:05.91 vmtoolsd
    1 root      20   0  110836   4472   2348 S   0.0  0.1   0:03.87 systemd
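When the per-process figures in top don't account for the "used" number from free, summing RSS over all processes makes the gap explicit (a sketch; any large remainder lives in the kernel, e.g. slab caches, which top's per-process view never shows):

```shell
# Sum VmRSS over every process and compare with free(1)'s "used" figure.
total=0
for f in /proc/[0-9]*/status; do
    kb=$(awk '/^VmRSS:/ {print $2}' "$f" 2>/dev/null)
    total=$((total + ${kb:-0}))
done
echo "total process RSS: ${total} kB"
# Kernel-side memory that does not belong to any process:
grep -E '^(Slab|SReclaimable):' /proc/meminfo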

Rolling ISO:

KiB Mem:   4040812 total,  1930400 used,  2110412 free,   168272 buffers
KiB Swap:        0 total,        0 used,        0 free.   481120 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1109 frr       20   0 1131964 465184   5396 S   0.0 11.5  45:14.38 zebra
 1127 frr       20   0  799192 555112   7060 S   0.7 13.7  40:21.32 bgpd
 1740 root      20   0  258676   4188   2768 S   0.0  0.1   1:12.81 rsyslogd
  996 root      20   0  174020   7484   6660 S   0.3  0.2   7:18.88 vmtoolsd
    1 root      20   0  110740   5160   3148 S   0.0  0.1   1:50.51 systemd

Free memory fluctuates, so memory is being freed - but it is slowly trending downward.
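To separate the fluctuation from the trend, a timestamped sample of MemAvailable can be collected periodically (a sketch, e.g. from cron; the log path is illustrative):

```shell
# Append one timestamped MemAvailable sample per run; a real leak shows
# as a downward trend even though individual samples bounce around.
printf '%s %s kB\n' "$(date -u +%FT%TZ)" \
    "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" >> /tmp/memavail.log
```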

My next step is to replace the VMWare image with the standard ISO I build at the same time, to eliminate vmtools. I tried reducing the open-vm-tools config statement:

[guestinfo]
poll-interval=0

as per https://phabricator.vyos.net/T1131 but it didn't help - and the CPU wasn't high anyway.

Merijn added a comment.EditedApr 24 2019, 7:01 PM

I am running 1.2.1 compiled on 17-04-2019, uptime is 6 days without issue.
RIB entries 1366663, using 209 MiB of memory
Peers 16, using 330 KiB of memory
Peer groups 4, using 256 bytes of memory
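The three lines above match the header printed by FRR's `show ip bgp summary`. To track that figure over time from saved output, a small parser works (a hypothetical helper; the field positions assume the exact wording shown above):

```shell
# Pull "<size> <unit>" out of the "RIB entries ..." line of a saved
# `show ip bgp summary` header.
rib_mem() { awk '/^RIB entries/ {print $(NF-3), $(NF-2); exit}'; }

# Example against the text quoted above:
printf 'RIB entries 1366663, using 209 MiB of memory\n' | rib_mem   # -> 209 MiB
```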

free
             total       used       free     shared    buffers     cached
Mem:       3063344    2370308     693036      42180      55988     209084
-/+ buffers/cache:    2105236     958108
Swap:            0          0          0

Ah, Thanks Merijn!

What hypervisor, if any?

This one is running on Hyper-V 2016 and is not pushing any traffic. It is my test router, used for experimenting with RPKI.
The routers doing traffic are on hardware and not running 1.2.x yet.

Hmmm, yeah, this one isn't doing anything yet either - just a test.

I presume your 1366663 RIB entries includes a couple of full tables peers out of the 16 peers?

I am going to wait until this box crashes again, then switch to an ISO install of the same compiled version to see if there is any difference. The memory leak might not be in the BGP process.

This router is receiving BGP from several internal BGP routers, each with full-table peers or a couple of peerings.