Page MenuHomeVyOS Platform

VRRP lock configuration and High CPU utilization.
Confirmed, HighPublicBUG

Description

  1. Bug. After executing transition scripts by VRRP, the configuration is locked and I can't change any value.
  2. Bug. Timeouts on MASTER (show vrrp) clear every ~3 sec and high cpu utilization on the master. 80-100 %

To reproduce

Router01:

set interfaces ethernet eth1 vif 99 address '10.0.0.254/24'

set high-availability vrrp group lan hello-source-address '10.0.0.254'
set high-availability vrrp group lan interface 'eth1.99'
set high-availability vrrp group lan no-preempt
set high-availability vrrp group lan peer-address '10.0.0.253'
set high-availability vrrp group lan priority '250'
set high-availability vrrp group lan transition-script backup '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script fault '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script master '/config/scripts/vrrp-trans-master.sh'
set high-availability vrrp group lan virtual-address '10.0.0.1/24'
set high-availability vrrp group lan vrid '99'

Router02:

set interfaces ethernet eth1 vif 99 address '10.0.0.253/24'

set high-availability vrrp group lan hello-source-address '10.0.0.253'
set high-availability vrrp group lan interface 'eth1.99'
set high-availability vrrp group lan no-preempt
set high-availability vrrp group lan peer-address '10.0.0.254'
set high-availability vrrp group lan transition-script backup '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script fault '/config/scripts/vrrp-trans-fail.sh'
set high-availability vrrp group lan transition-script master '/config/scripts/vrrp-trans-master.sh'
set high-availability vrrp group lan virtual-address '10.0.0.1/24'
set high-availability vrrp group lan vrid '99'

Transition script fail

vyos@r1-roll# sudo cat /config/scripts/vrrp-trans-fail.sh 
#!/bin/vbash

source /opt/vyatta/etc/functions/script-template

configure
set interfaces ethernet eth1 vif 99 description BACKUP_by_script

commit

exit

Transition master

#!/bin/vbash

source /opt/vyatta/etc/functions/script-template

configure
set interfaces ethernet eth1 vif 99 description MASTER_by_script
commit

exit

Router 1 in master state.

vyos@r1-roll:~$ show int
Codes: S - State, L - Link, u - Up, D - Down, A - Admin Down
Interface        IP Address                        S/L  Description
---------        ----------                        ---  -----------
eth0             192.168.122.11/24                 u/u  
eth1             -                                 u/u  
eth1.99          10.0.0.254/24                     u/u  MASTER_by_script
                 10.0.0.1/24                            
lo               127.0.0.1/8                       u/u  
                 ::1/128                                
vyos@r1-roll:~$ 
vyos@r1-roll:~$ show vrrp
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   2s

Trying to add or del configuration

vyos@r1-roll# delete interfaces ethernet eth1 vif 99 description 

  Failed to delete specified config path
  Delete failed

[edit]
vyos@r1-roll# 

vyos@r1-roll# delete interfaces ethernet eth1 vif 99 

  Failed to delete specified config path
  Delete failed

[edit]
vyos@r1-roll#

vyos@r1-roll# set interfaces ethernet eth0 description FOO

  Set failed

[edit]
vyos@r1-roll#

Permissions

vyos@r1-roll# sudo ls -la /config/scripts/
total 20
drwxrwsr-x 2 root vyattacfg 4096 Jul 28 16:05 .
drwxrwsr-x 8 root vyattacfg 4096 Jul 28 15:55 ..
-rwxr-xr-x 1 root vyattacfg  154 Jul 28 16:05 vrrp-trans-fail.sh
-rwxr-xr-x 1 root vyattacfg  153 Jul 28 16:05 vrrp-trans-master.sh
-rwxr-xr-x 1 root vyattacfg  230 Jul 28 01:24 vyos-postconfig-bootup.script

Second bug with timeouts

vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   1s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   6s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   2s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   3s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   2s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   1s
vyos@vyos:~$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   3s
vyos@vyos:~$

Details

Difficulty level
Unknown (require assessment)
Version
VyOS 1.3-rolling-202007280117
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)

Event Timeline

Viacheslav changed the task status from Open to Confirmed.Jul 28 2020, 4:17 PM
Viacheslav created this task.
Viacheslav triaged this task as High priority.Jul 29 2020, 7:29 AM
Viacheslav updated the task description. (Show Details)
Viacheslav renamed this task from VRRP lock configuration to VRRP lock configuration and High CPU utilization..Jul 29 2020, 7:34 AM
Viacheslav added a comment.EditedJul 29 2020, 7:47 AM

The master change state every few seconds.

Jul 29 07:52:39 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-master.sh
Jul 29 07:52:40 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:40 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:40 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:41 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-fail.sh
Jul 29 07:52:42 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:42 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:42 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:43 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-master.sh
Jul 29 07:52:44 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:44 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:44 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:45 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-fail.sh
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:46 vyos Keepalived_vrrp[1821]: Warning: Failed to connect to the agentx master agent ([NIL]):
Jul 29 07:52:47 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-master.sh
Jul 29 07:52:48 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE
Jul 29 07:52:48 vyos Keepalived_vrrp[1821]: (lan) sent 0 priority
Jul 29 07:52:48 vyos Keepalived_vrrp[1821]: (lan) Entering MASTER STATE
Jul 29 07:52:49 vyos keepalived-fifo.py[1822]: Running the command: /config/scripts/vrrp-trans-fail.sh
Jul 29 07:52:50 vyos Keepalived_vrrp[1821]: (lan) Entering BACKUP STATE

It seems that the problem is serious and under attention

pasik added a subscriber: pasik.Jul 29 2020, 8:52 AM
thomas-mangin added a comment.EditedJul 29 2020, 10:55 AM

I do not have a lab to reproduce this ATM.

My guess is that one of "configure", "set" or "commit" function is failing and causing the lock.
The return code of configure and commit should be checked to see if commit failed.

if commit fails in particular, this should be checked with "$?" and if not zero, then rollback should be run.
I have like the felling that after too many failed commit, the number of running commit is over what the backend allows causing a lock until one of them closes ... which they will never do.

Removing this line from the master prevents erroneous changes master/backup. And CPU displays normal values.

set high-availability vrrp group lan transition-script backup '/config/scripts/vrrp-trans-fail.sh'

Last Transition

$ show vrrp 
Name    Interface      VRID  State    Last Transition
------  -----------  ------  -------  -----------------
lan     eth1.99          99  MASTER   5m19s
thomas-mangin removed thomas-mangin as the assignee of this task.Jul 29 2020, 2:00 PM
thomas-mangin added a subscriber: thomas-mangin.

Ideally which interface is master/slave should be recorded and handled by VyOS so that users do not have to put some workaround like this one to know.

zsdc added a subscriber: zsdc.Jul 29 2020, 2:16 PM

Changing description in a master transition script will lead to an endless loop, because of:

  1. Description change (or any other interface update) in a script trigger EthernetIf.update().
  2. EthernetIf.update() trigger a lot of interface changes:
Jul 29 14:05:36 vyos sudo[3097]:     root : TTY=ttyS0 ; PWD=/home/vyos ; USER=root ; COMMAND=/usr/bin/sh -c VYOS_TAGNODE_VALUE='eth1' /usr/libexec/vyos/conf_mode/interfaces-ethernet.py
Jul 29 14:05:36 vyos sudo[3097]: pam_unix(sudo:session): session opened for user root by vyos(uid=0)
Jul 29 14:05:36 vyos control.py[3098]: set_interface: alias,
Jul 29 14:05:36 vyos control.py[3098]: set_interface: link_detect, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: vrf,
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_cache_tmo, 30
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_filter, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_accept, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_announce, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_ignore, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp_pvlan, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_forwarding, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_accept_ra, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_autoconf, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_dad_transmits, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: mtu, 1500
Jul 29 14:05:36 vyos control.py[3098]: set_interface: alias, MASTER_by_script
Jul 29 14:05:36 vyos control.py[3098]: set_interface: link_detect, 1
Jul 29 14:05:36 vyos Keepalived_vrrp[1302]: (lan) Entering BACKUP STATE
Jul 29 14:05:36 vyos Keepalived_vrrp[1302]: (lan) sent 0 priority
Jul 29 14:05:36 vyos control.py[3098]: set_interface: vrf,
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_cache_tmo, 30
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_filter, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_accept, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_announce, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: arp_ignore, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: proxy_arp_pvlan, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_forwarding, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_accept_ra, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_autoconf, 0
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ipv6_dad_transmits, 1
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gro, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gso, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: sg, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: tso, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: ufo, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: admin_state, up
Jul 29 14:05:36 vyos Keepalived_vrrp[1302]: (lan) Entering MASTER STATE
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gro, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: gso, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: sg, off
Jul 29 14:05:36 vyos control.py[3098]: set_interface: tso, off
Jul 29 14:05:37 vyos control.py[3098]: set_interface: ufo, off
Jul 29 14:05:37 vyos control.py[3098]: set_interface: admin_state, up
  1. Something from this all trigger keepalived interface reinitialization.
  2. Keepalived change VRRP state to BACKUP and then MASTER, and run transition scripts.
  3. GOTO 1.

I think we do not need to touch those interface options that have not been changed in config. This should solve the issue.

thomas-mangin added a comment.EditedJul 29 2020, 2:23 PM

Yes, it would solve the issue ... but ... currently, we re-apply the whole interface setting, so there is no change to have the vyos and live configuration not sync'ed.
This would be lost. It is a trade-off, but it could be done. It would be however the only sub-system working that way.

(I will try to work a PR so we can discuss the details)

This comment was removed by thomas-mangin.