When we specify a listen address with
set service monitoring telegraf prometheus-client listen-address <address>
we should be able to specify a VRF as well, because the OAM interface is usually placed in a separate VRF for security reasons. The syntax could be something like
set service monitoring telegraf prometheus-client vrf <vrf-name>
if possible.
Thank you,
Alex
Description
Details
- Difficulty level
- Easy (less than an hour)
- Version
- -
- Why the issue appeared?
- Will be filled on close
- Is it a breaking change?
- Perfectly compatible
- Issue type
- Improvement (missing useful functionality)
Event Timeline
I tried adding VRF support, but it seems to require additional permissions; the service does not start.
diff --git a/data/templates/monitoring/override.conf.j2 b/data/templates/monitoring/override.conf.j2
index 9f1b4ebe..63e479af 100644
--- a/data/templates/monitoring/override.conf.j2
+++ b/data/templates/monitoring/override.conf.j2
@@ -1,7 +1,10 @@
+{% set vrf_command = 'ip vrf exec ' ~ vrf ~ ' ' if vrf is vyos_defined else '' %}
 [Unit]
 After=vyos-router.service
 ConditionPathExists=/run/telegraf/vyos-telegraf.conf
 [Service]
+ExecStart=
+ExecStart={{ vrf_command }}/usr/bin/telegraf -config /run/telegraf/vyos-telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
 Environment=INFLUX_TOKEN={{ influxdb.authentication.token }}
 CapabilityBoundingSet=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN
 AmbientCapabilities=CAP_NET_RAW CAP_NET_ADMIN
diff --git a/interface-definitions/service-monitoring-telegraf.xml.in b/interface-definitions/service-monitoring-telegraf.xml.in
index 36f40a53..dc014ee1 100644
--- a/interface-definitions/service-monitoring-telegraf.xml.in
+++ b/interface-definitions/service-monitoring-telegraf.xml.in
@@ -306,6 +306,7 @@
 </leafNode>
 </children>
 </node>
+ #include <include/interface/vrf.xml.i>
 </children>
 </node>
 </children>
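To make the effect of the template change easier to see, here is a minimal Python sketch of what the two added ExecStart lines would render to with and without a VRF. The helper name render_exec_start_lines is hypothetical and not part of VyOS, and the vyos_defined test is only approximated here by a plain truthiness check. The empty ExecStart= line is there because a systemd drop-in has to clear the original command before setting a new one.

```python
# Hypothetical helper (not VyOS code): mirrors the Jinja2 expression from the diff.
TELEGRAF_CMD = (
    "/usr/bin/telegraf -config /run/telegraf/vyos-telegraf.conf "
    "-config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS"
)


def render_exec_start_lines(vrf=None):
    """Return the two ExecStart lines the drop-in would contain."""
    # Assumption: 'vrf is vyos_defined' is approximated as "set and non-empty".
    vrf_command = f"ip vrf exec {vrf} " if vrf else ""
    return [
        "ExecStart=",  # clears the ExecStart inherited from the base unit
        f"ExecStart={vrf_command}{TELEGRAF_CMD}",
    ]


if __name__ == "__main__":
    for line in render_exec_start_lines("foo"):
        print(line)
```

With vrf set to foo, the second line matches the command shown in the systemctl status output of this comment; without a VRF the command is left unprefixed.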
Service status
vyos@r1# systemctl status vyos-telegraf.service
● vyos-telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/etc/systemd/system/vyos-telegraf.service; disabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/vyos-telegraf.service.d
             └─10-override.conf
     Active: failed (Result: exit-code) since Tue 2022-08-16 16:37:05 EEST; 1s ago
       Docs: https://github.com/influxdata/telegraf
    Process: 10453 ExecStart=ip vrf exec foo /usr/bin/telegraf -config /run/telegraf/vyos-telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS (code=exited, status=255/EXCEPTION)
   Main PID: 10453 (code=exited, status=255/EXCEPTION)
        CPU: 2ms

Aug 16 16:37:04 r1 systemd[1]: vyos-telegraf.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 16 16:37:04 r1 systemd[1]: vyos-telegraf.service: Failed with result 'exit-code'.
Aug 16 16:37:05 r1 systemd[1]: vyos-telegraf.service: Scheduled restart job, restart counter is at 5.
Aug 16 16:37:05 r1 systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 16 16:37:05 r1 systemd[1]: vyos-telegraf.service: Start request repeated too quickly.
Aug 16 16:37:05 r1 systemd[1]: vyos-telegraf.service: Failed with result 'exit-code'.
Aug 16 16:37:05 r1 systemd[1]: Failed to start The plugin-driven server agent for reporting metrics into InfluxDB.
[edit]
vyos@r1#
Log:
Aug 16 16:38:21 r1 sudo[10470]: vyos : TTY=pts/0 ; PWD=/home/vyos ; USER=root ; COMMAND=/usr/bin/systemctl restart vyos-telegraf
Aug 16 16:38:21 r1 sudo[10470]: pam_unix(sudo:session): session opened for user root(uid=0) by vyos(uid=1003)
Aug 16 16:38:21 r1 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Aug 16 16:38:21 r1 ip[10473]: mkdir failed for /sys/fs/cgroup/system.slice/vyos-telegraf.service/vrf/foo: Permission denied
Aug 16 16:38:21 r1 ip[10473]: Failed to setup vrf cgroup2 directory
Aug 16 16:38:21 r1 systemd[1]: vyos-telegraf.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 16 16:38:21 r1 systemd[1]: vyos-telegraf.service: Failed with result 'exit-code'.
Starting telegraf manually works for me:
root@vyos-lns-1:/etc/systemd/system# ip vrf exec oam /usr/bin/telegraf --debug -config /run/telegraf/vyos-telegraf.conf -config-directory /etc/telegraf/telegraf.d
2022-08-16T16:29:51Z I! : Plugin "outputs.prometheus_client" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.cpu" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.mem" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.linux_sysctl_fs" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.ntpq" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.net" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.kernel" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.interrupts" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.conntrack" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.nstat" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.disk" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.system" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.processes" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.ethtool" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.internal" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.syslog" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.diskio" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.netstat" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! : Plugin "inputs.systemd_units" deprecated since version and will be removed in :
2022-08-16T16:29:51Z I! Starting Telegraf 1.23.1
2022-08-16T16:29:51Z I! Loaded inputs: conntrack cpu disk diskio ethtool internal interrupts kernel linux_sysctl_fs mem net netstat nstat ntpq processes syslog system systemd_units
2022-08-16T16:29:51Z I! Loaded aggregators:
2022-08-16T16:29:51Z I! Loaded processors:
2022-08-16T16:29:51Z I! Loaded outputs: prometheus_client
2022-08-16T16:29:51Z I! Tags enabled: host=vyos-lns-1
2022-08-16T16:29:51Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"vyos-lns-1", Flush Interval:15s
2022-08-16T16:29:51Z D! [agent] Initializing plugins
2022-08-16T16:29:51Z D! [agent] Connecting outputs
2022-08-16T16:29:51Z D! [agent] Attempting connection to [outputs.prometheus_client]
2022-08-16T16:29:51Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2022-08-16T16:29:51Z D! [agent] Successfully connected to outputs.prometheus_client
2022-08-16T16:29:51Z D! [agent] Starting service inputs
2022-08-16T16:30:06Z D! [outputs.prometheus_client] Wrote batch of 247 metrics in 2.110746ms
2022-08-16T16:30:06Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
However, in that case it does not create /sys/fs/cgroup/system.slice/vyos-telegraf.service/vrf/oam.
The only way I found to start telegraf with ip vrf exec is to comment out
#User=telegraf
in /etc/systemd/system/vyos-telegraf.service and run
chown root:root /run/telegraf
Running telegraf as root is not a good solution, but it lets me move forward with telegraf anyway.
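The logs above show ip failing at mkdir under the service's cgroup, and removing User=telegraf makes the failure go away, which suggests the unprivileged user simply cannot create the vrf/<name> directory there. A rough way to check that from Python is sketched below. The function names are hypothetical, the path layout is taken from the log messages in this task rather than from any documented interface, and os.access only reflects ordinary file permissions for the calling user, so treat the result as a hint and not as proof.

```python
import os


def vrf_cgroup_dir(unit, vrf, root="/sys/fs/cgroup/system.slice"):
    """Path that 'ip vrf exec' tried to create, as seen in the logs (assumed layout)."""
    return os.path.join(root, unit, "vrf", vrf)


def can_create_child(parent):
    """True if the calling user may create entries inside an existing directory."""
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)


if __name__ == "__main__":
    target = vrf_cgroup_dir("vyos-telegraf.service", "foo")
    unit_dir = os.path.dirname(os.path.dirname(target))
    print(target)
    print("writable by current user:", can_create_child(unit_dir))
```

Running this as the telegraf user (for example via sudo -u telegraf) and as root should show the difference, assuming the unit's cgroup directory exists at that moment.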
Try adding some capabilities, for example CAP_CHOWN, CAP_DAC_OVERRIDE, or something else.
sudo nano /etc/systemd/system/vyos-telegraf.service.d/10-override.conf
CapabilityBoundingSet=CAP_NET_RAW CAP_NET_ADMIN CAP_SYS_ADMIN CAP_DAC_OVERRIDE CAP_CHOWN CAP_LEASE
Nothing helps:
Aug 19 14:13:50 ip[4307]: mkdir failed for /sys/fs/cgroup/system.slice/vyos-telegraf.service/vrf: Permission denied
Aug 19 14:13:50 ip[4307]: Failed to setup vrf cgroup2 directory
It seems to be working:
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
     Loaded: loaded (/lib/systemd/system/telegraf.service; disabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/telegraf.service.d
             └─10-override.conf
     Active: active (running) since Mon 2022-08-29 12:51:47 EEST; 1min 7s ago
       Docs: https://github.com/influxdata/telegraf
   Main PID: 6740 (telegraf)
      Tasks: 9 (limit: 9409)
     Memory: 49.7M
        CPU: 836ms
     CGroup: /system.slice/telegraf.service
             └─vrf
               └─foo
                 └─6740 /usr/bin/telegraf --config /run/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --pidfile /run/telegraf/telegraf.pid

Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! : Plugin "inputs.disk" deprecated since version and will be removed in :
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! : Plugin "inputs.net" deprecated since version and will be removed in :
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! Starting Telegraf 1.23.1
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! Loaded inputs: conntrack cpu disk diskio ethtool internal interrupts kernel linux_sysctl_fs mem net netstat nstat ntpq processes syslog system systemd_units
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! Loaded aggregators:
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! Loaded processors:
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! Loaded outputs: prometheus_client
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! Tags enabled: host=r14
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"r14", Flush Interval:15s
Aug 29 12:51:48 r14 ip[6740]: 2022-08-29T09:51:48Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
I'd suggest adding
Restart=always
RestartSec=10
to /usr/share/vyos/templates/telegraf/override.conf.j2, as is done for ntp.service.
Otherwise the telegraf service does not start: during boot it makes 5 start attempts in quick succession, failing with:
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Failed with result 'exit-code'.
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Scheduled restart job, restart counter is at 5.
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Start request repeated too quickly.
Sep 07 11:43:59 vyos-lns-1 systemd[1]: telegraf.service: Failed with result 'exit-code'.
and stays in a failed state.
See the attached boot log.
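To illustrate why a longer RestartSec should help, here is a simplified Python model of systemd's start rate limiting. It assumes the stock defaults of 5 starts per 10 seconds (StartLimitBurst and StartLimitIntervalSec) and a process that exits immediately; the function name and the model itself are my own approximation, not systemd code, and the actual unit may override these values.

```python
def hits_start_limit(restart_sec, burst=5, interval=10.0, attempts=50):
    """Simplified model: does a unit that fails instantly trip the start limit?

    restart_sec: delay between a failure and the next start attempt.
    burst/interval: assumed systemd defaults (5 starts per 10 s window).
    """
    window_begin = None
    count = 0
    now = 0.0
    for _ in range(attempts):
        if window_begin is None or now - window_begin > interval:
            # Window expired (or first start): open a new counting window.
            window_begin = now
            count = 1
        elif count < burst:
            count += 1
        else:
            # One start too many inside the window: "repeated too quickly".
            return True
        now += restart_sec
    return False


if __name__ == "__main__":
    print(hits_start_limit(0.1))   # default RestartSec of 100 ms
    print(hits_start_limit(10.0))  # proposed RestartSec=10
```

Under these assumptions, the default 100 ms restart delay exhausts the limit on the sixth start, which matches the "restart counter is at 5" message, while a 10 second delay never accumulates enough starts in one window and keeps retrying until the VRF is available.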