
SSH: configuration directory is not always created on boot
Closed, Resolved | Public | BUG

Description

Reported via Forums: https://forum.vyos.io/t/ssh-in-vrf-fails-after-reboot/6455/2

sshd cannot start when there is no sshd_config file.
The /run/sshd directory is not always created by render() on system startup.
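
A minimal sketch of the defensive pattern this implies: create the parent directory before writing the config, so rendering does not depend on systemd having created the runtime directory. The helper name render_config and its signature are hypothetical and are not the VyOS render() API. The demo writes to a scratch directory rather than the real /run/sshd.

```python
import os
import tempfile


def render_config(path: str, content: str) -> None:
    """Write content to path, creating the parent directory first.

    Hypothetical helper for illustration; not the VyOS render() function.
    """
    # /run is a tmpfs, so the directory may be missing after a reboot or
    # after systemd removed the service's RuntimeDirectory on stop.
    os.makedirs(os.path.dirname(path), mode=0o755, exist_ok=True)
    with open(path, "w") as f:
        f.write(content)


if __name__ == "__main__":
    # Use a scratch directory instead of the real /run/sshd.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sshd", "sshd_config")
        render_config(target, "Port 22\n")
        print(os.path.exists(target))
```

Because exist_ok=True is passed, calling the helper again when the directory already exists is harmless.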

Root cause:
The corresponding systemd unit file (/lib/systemd/system/ssh.service) uses a RuntimeDirectory= statement. By default, the runtime directory (/run/sshd) is removed when the service stops.

RuntimeDirectoryPreserve=
Takes a boolean argument or restart. If set to no (the default), the directories specified in RuntimeDirectory= are always removed when the service stops. If set to restart the directories are preserved when the service is both automatically and manually restarted. Here, the automatic restart means the operation specified in Restart=, and manual restart means the one triggered by systemctl restart foo.service. If set to yes, then the directories are not removed when the service is stopped. Note that since the runtime directory /run/ is a mount point of "tmpfs", then for system services the directories specified in RuntimeDirectory= are removed when the system is rebooted.

When running sshd inside a VRF, the daemon needs several attempts to start successfully because it first has to load some BPF code (according to the systemd log; see https://bugzilla.redhat.com/show_bug.cgi?id=1813599). When the service stops, however, the config is removed and the service can no longer start, which results in a deadlock.

A simple call to sudo systemctl restart ssh fixes the issue. It is not yet clear why the service does not recover automatically.

Details

Difficulty level
Easy (less than an hour)
Version
1.4-rolling-202101120217
Why the issue appeared?
Issues in third-party code
Is it a breaking change?
Perfectly compatible

Event Timeline

c-po changed the task status from Open to In progress. Wed, Jan 13, 6:18 PM
c-po claimed this task.
c-po created this task.
c-po changed Difficulty level from Unknown (require assessment) to Easy (less than an hour).
c-po changed Why the issue appeared? from Will be filled on close to Issues in third-party code.
c-po changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.
c-po updated the task description.

Further debugging revealed that the problem is inserting the VRF BPF code, but I wonder why systemd does not restart the service one more time, since it works on the next attempt.

strace[1579]: openat(AT_FDCWD, "/proc/1582/cgroup", O_RDONLY) = 5
strace[1579]: fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
strace[1579]: read(5, "10:freezer:/\n9:cpuset:/\n8:perf_e"..., 1024) = 260
strace[1579]: close(5)                                = 0
strace[1579]: mkdir("/sys", 0755)                     = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs", 0755)                  = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup", 0755)           = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified", 0755)   = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice", 0755) = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service", 0755) = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf", 0755) = 0
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf", 0755) = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf/foo", 0755) = 0
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf/foo", 0755) = -1 EEXIST (File exists)
strace[1579]: openat(AT_FDCWD, "/sys/fs/cgroup/unified/system.slice/ssh.service/vrf/foo", O_RDONLY|O_DIRECTORY) = 5
strace[1579]: bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SOCK, insn_cnt=6, insns=0x7ffe733340e0, license="GPL", log_level=1, log_size=262144, log_buf="", kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 112) = -1 EPERM (Operation not permitted)
systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION
strace[1579]: write(2, "Failed to load BPF prog: 'Operat"..., 51Failed to load BPF prog: 'Operation not permitted'
systemd[1]: ssh.service: Failed with result 'exit-code'.
strace[1579]: ) = 51
strace[1579]: close(5)                                = 0
strace[1579]: close(-1)                               = -1 EBADF (Bad file descriptor)
strace[1579]: getuid()                                = 0
strace[1579]: exit_group(-1)                          = ?
strace[1579]: +++ exited with 255 +++
systemd[1]: Failed to start OpenBSD Secure Shell server.

Note the line: systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION

The default systemd SSH unit file has RestartPreventExitStatus=255 set, and the man page states:

Restart=
Configures whether the service shall be restarted when the service process exits, is killed, or a timeout is reached.
...
As exceptions to the setting above, the service will not be restarted if the exit code or signal is specified in RestartPreventExitStatus= (see below) or the service is stopped with systemctl stop or an equivalent operation. Also, the services will always be restarted if the exit code or signal is specified in RestartForceExitStatus= (see below).
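
The rule quoted above can be sketched as a small decision function. This is a simplified model for illustration, not systemd's implementation. It only looks at plain exit codes and ignores signals, timeouts and manual stops. Checking the prevent list before the force list is an assumption about precedence, as is Restart=on-failure being the policy in the SSH unit file.

```python
def should_restart(exit_status: int,
                   restart: str = "on-failure",
                   prevent: frozenset = frozenset(),
                   force: frozenset = frozenset()) -> bool:
    """Simplified model of systemd's restart decision for an exited service."""
    # Exit codes listed in RestartPreventExitStatus= suppress the restart.
    # (Assumption: this check takes precedence over the force list.)
    if exit_status in prevent:
        return False
    # Exit codes listed in RestartForceExitStatus= always trigger a restart.
    if exit_status in force:
        return True
    # Otherwise the Restart= policy decides (only a subset is modeled here).
    if restart == "always":
        return True
    if restart == "on-failure":
        return exit_status != 0
    if restart == "on-success":
        return exit_status == 0
    return False  # "no", or a policy this sketch does not model


if __name__ == "__main__":
    # With RestartPreventExitStatus=255, the failed BPF load is not retried.
    print(should_restart(255, prevent=frozenset({255})))  # False
    # With the list cleared, exit status 255 is an ordinary failure.
    print(should_restart(255))  # True
```

In this model the exit status 255 from the log is exactly the case that RestartPreventExitStatus=255 excludes from restarting.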

Thus, clearing RestartPreventExitStatus= in the systemd unit file for SSH solves this issue.
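
One way to clear the setting without editing the packaged unit file is a systemd drop-in. The sketch below writes such a drop-in. The function name write_dropin, the file name override.conf and the use of a drop-in at all are assumptions for illustration; the task does not state how the fix was actually shipped. The demo writes to a scratch directory rather than /etc/systemd/system.

```python
import os
import tempfile

# An empty assignment resets the list inherited from the main unit file.
DROPIN_BODY = (
    "[Service]\n"
    "RestartPreventExitStatus=\n"
)


def write_dropin(base_dir: str,
                 unit: str = "ssh.service",
                 name: str = "override.conf") -> str:
    """Write a drop-in that clears RestartPreventExitStatus= for the unit.

    Hypothetical helper; base_dir would be /etc/systemd/system on a real host.
    """
    dropin_dir = os.path.join(base_dir, unit + ".d")
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, name)
    with open(path, "w") as f:
        f.write(DROPIN_BODY)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_dropin(tmp)
        with open(path) as f:
            print(f.read(), end="")
```

On a real system the drop-in only takes effect after systemctl daemon-reload has been run.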

c-po triaged this task as Normal priority.