SSH: configuration directory is not always created on boot
Closed, Resolved | Public | BUG

Assigned To

Authored By

	c-po
	Jan 13 2021, 6:18 PM

Description

Reported via Forums: https://forum.vyos.io/t/ssh-in-vrf-fails-after-reboot/6455/2

SSH cannot be started when there is no sshd_config file.
The /run/sshd directory is not always created via render() on system startup.

Root cause:
The corresponding systemd unit file (/lib/systemd/system/ssh.service) uses a RuntimeDirectory= statement. By default, the runtime directory (/run/sshd) is removed when the service stops.

RuntimeDirectoryPreserve=
Takes a boolean argument or restart. If set to no (the default), the directories specified in RuntimeDirectory= are always removed when the service stops. If set to restart the directories are preserved when the service is both automatically and manually restarted. Here, the automatic restart means the operation specified in Restart=, and manual restart means the one triggered by systemctl restart foo.service. If set to yes, then the directories are not removed when the service is stopped. Note that since the runtime directory /run/ is a mount point of "tmpfs", then for system services the directories specified in RuntimeDirectory= are removed when the system is rebooted.
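
For illustration only: the option quoted above could be set through a systemd drop-in so that /run/sshd survives a service stop. This is a sketch, not the change that was applied for this task, and the drop-in file name (override.conf) is an assumption.

```ini
# /etc/systemd/system/ssh.service.d/override.conf (hypothetical file name)
[Service]
# Keep the directories named in RuntimeDirectory= when the service stops.
RuntimeDirectoryPreserve=yes
```

A drop-in only takes effect after systemctl daemon-reload. As the quoted text notes, /run is a tmpfs, so the directory is still gone after a reboot.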

When running sshd inside a VRF, the daemon needs several attempts to start successfully, as it tries to load some BPF code (according to the systemd log; see https://bugzilla.redhat.com/show_bug.cgi?id=1813599). When the service stops, the runtime directory holding the config is removed and the service can no longer start, resulting in a deadlock.

A simple call to sudo systemctl restart ssh fixes the issue. It is not yet clear why the service does not recover automatically.

Details

Difficulty level: Easy (less than an hour)
Version: 1.4-rolling-202101120217
Why the issue appeared?: Issues in third-party code
Is it a breaking change?: Perfectly compatible
Issue type: Bug (incorrect behavior)

Event Timeline

c-po changed the task status from Open to In progress. Jan 13 2021, 6:18 PM

c-po claimed this task.

c-po created this task.

c-po updated the task description. Jan 13 2021, 7:14 PM

c-po changed Difficulty level from Unknown (require assessment) to Easy (less than an hour).

c-po changed Why the issue appeared? from Will be filled on close to Issues in third-party code.

c-po changed Is it a breaking change? from Unspecified (possibly destroys the router) to Perfectly compatible.

c-po updated the task description. Jan 13 2021, 8:24 PM

c-po updated the task description.

pasik added a subscriber: pasik. Jan 14 2021, 4:34 PM

c-po moved this task from Need Triage to Backlog on the VyOS 1.3 Equuleus board. Jan 14 2021, 5:39 PM

c-po moved this task from Backlog to In Progress on the VyOS 1.3 Equuleus board.

Further debugging revealed that the problem is inserting the VRF BPF code, but I wonder why systemd does not restart the service one more time; after one more restart it works.

strace[1579]: openat(AT_FDCWD, "/proc/1582/cgroup", O_RDONLY) = 5
strace[1579]: fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
strace[1579]: read(5, "10:freezer:/\n9:cpuset:/\n8:perf_e"..., 1024) = 260
strace[1579]: close(5)                                = 0
strace[1579]: mkdir("/sys", 0755)                     = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs", 0755)                  = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup", 0755)           = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified", 0755)   = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice", 0755) = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service", 0755) = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf", 0755) = 0
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf", 0755) = -1 EEXIST (File exists)
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf/foo", 0755) = 0
strace[1579]: mkdir("/sys/fs/cgroup/unified/system.slice/ssh.service/vrf/foo", 0755) = -1 EEXIST (File exists)
strace[1579]: openat(AT_FDCWD, "/sys/fs/cgroup/unified/system.slice/ssh.service/vrf/foo", O_RDONLY|O_DIRECTORY) = 5
strace[1579]: bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SOCK, insn_cnt=6, insns=0x7ffe733340e0, license="GPL", log_level=1, log_size=262144, log_buf="", kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS}, 112) = -1 EPERM (Operation not permitted)
systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION
strace[1579]: write(2, "Failed to load BPF prog: 'Operat"..., 51Failed to load BPF prog: 'Operation not permitted'
systemd[1]: ssh.service: Failed with result 'exit-code'.
strace[1579]: ) = 51
strace[1579]: close(5)                                = 0
strace[1579]: close(-1)                               = -1 EBADF (Bad file descriptor)
strace[1579]: getuid()                                = 0
strace[1579]: exit_group(-1)                          = ?
strace[1579]: +++ exited with 255 +++
systemd[1]: Failed to start OpenBSD Secure Shell server.

Note the line: systemd[1]: ssh.service: Main process exited, code=exited, status=255/EXCEPTION

The default systemd SSH unit file has RestartPreventExitStatus=255 set, and the man page (systemd.service(5)) states:

Restart=
Configures whether the service shall be restarted when the service process exits, is killed, or a timeout is reached.
...
As exceptions to the setting above, the service will not be restarted if the exit code or signal is specified in RestartPreventExitStatus= (see below) or the service is stopped with systemctl stop or an equivalent operation. Also, the services will always be restarted if the exit code or signal is specified in RestartForceExitStatus= (see below).

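The failed BPF load makes the main process exit with status 255, which is exactly the exit status excluded from automatic restarts. Thus, clearing RestartPreventExitStatus= in the systemd unit file for SSH solves this issue.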
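
One way to clear the setting without editing the packaged unit file is a systemd drop-in, since assigning an empty value resets the list. This is only a sketch: the task does not state how the change was shipped, the drop-in file name is an assumption, and it assumes the stock unit already sets a Restart= policy.

```ini
# /etc/systemd/system/ssh.service.d/override.conf (hypothetical file name)
[Service]
# An empty assignment resets the list, so exit status 255 no longer
# prevents the restart configured by Restart= in the stock unit.
RestartPreventExitStatus=
```

After adding the file, run systemctl daemon-reload so that systemd picks up the override.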

c-po closed this task as Resolved. Jan 18 2021, 4:53 PM

c-po triaged this task as Normal priority.

c-po moved this task from In Progress to Finished on the VyOS 1.3 Equuleus board. Mar 18 2021, 4:29 PM

SrividyaA set Issue type to Bug (incorrect behavior). Aug 30 2021, 5:08 PM

syncer edited projects, added VyOS 1.3 Equuleus (1.3.0); removed VyOS 1.3 Equuleus. Aug 29 2022, 12:17 PM

syncer moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus (1.3.0) board.
