Page MenuHomePhabricator

vyos hostname configuration failure and it triggers vyos-router restarted
In progress, Requires assessmentPublicBUG

Description

Hi,

Actually there are two problems happened here, but they are related.

This is log when I configured hostname:

[edit]
vyos@test-1# set system host-name test-2
[edit]
vyos@test-1# commit

Message from root@test-1 on (none) at 13:41 ...
Active configuration has been changed by user 'root' on '?'.
Please make sure you do not have conflicting changes. You can also discard
the current changes by issuing 'exit discard'.
EOF
[ system host-name test-2 ]
Could not connect to vyos-hostsd

[[]] failed
Commit failed
Failed to generate committed config
[edit]
vyos@test-1# commit
No configuration changes to commit
[edit]
vyos@test-1# set system host-name test-2
DEBUG vexit_internal: calling getCompletionEnv() without config session
DEBUG vexit_internal: calling getCompletionEnv() without config session
DEBUG vexit_internal: calling getCompletionEnv() without config session
[edit]
vyos@test-1#

The hostname configuration failure is caused by vyos-hostsd, obviously, something failed and trigger that restart:

Dec 17 13:41:05 test-1 systemd[1]: vyos-hostsd.service: main process exited, code=exited, status=1/FAILURE
Dec 17 13:41:05 test-1 systemd[1]: Unit vyos-hostsd.service entered failed state.
Dec 17 13:41:05 test-1 systemd[1]: vyos-hostsd.service holdoff time over, scheduling restart.
Dec 17 13:41:05 test-1 systemd[1]: Stopping VyOS DNS configuration keeper...
Dec 17 13:41:05 test-1 systemd[1]: Starting VyOS DNS configuration keeper...
Dec 17 13:41:05 test-1 systemd[1]: Started VyOS DNS configuration keeper.

I only can identify that the code reply_msg = self.__socket.recv().decode() in _communicate function of /usr/lib/python3/dist-packages/vyos/hostsd_client.py triggered the problem, maybe there was something wrong in socket communication between client and server, I'm not good at it, so please help to identify the root cause.

But if it's just about restart of vyos-hostsd, the log won't be like above. There is another consequent problem, at same time, the vyos-router service is restarted, actaully it's triggered by restart of vyos-hostsd. So all configuration is reloaded at that time.

Dec 17 13:41:05 test-1 systemd[1]: vyos-router.service: control process exited, code=exited status=127
Dec 17 13:41:05 test-1 systemd[1]: Unit vyos-router.service entered failed state.
Dec 17 13:41:05 test-1 systemd[1]: Starting VyOS Router...
Dec 17 13:41:05 test-1 systemd[1]: Started VyOS Router.

I checked /lib/systemd/system/vyos-hostsd.service, and found that they are dependency in service definition, I don't know if it's the cause and if it's on purpose. But I think it's not reasonable that restarting vyos-hostsd to trigger restarting of vyos-router which lead to many thing reconfigured unintentionally , it's very big impact.

[Install]
RequiredBy=cloud-init-local.service vyos-router.service

best regards.

Details

Difficulty level
Unknown (require assessment)
Version
1.2.3
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Perfectly compatible

Event Timeline

Hi,

I found a easy way to reproduce.

sudo systemctl restart systemd-journald
# then configure your hostname

everytime, restart of systemd-journald cause corruption of vyos-hostsd, and then configuring hostname cause restart of vyos-hostsd, and it eventually cause restart of vyos-router, and vyos is reconfigured unexpectably.

best regards.

zsdc changed the task status from Open to In progress.Dec 19 2019, 6:37 PM
zsdc added a subscriber: zsdc.

Hello, @MapleWang!
Thank you for pointing our attention to this issue! It is really bad that such simple action as changing hostname in some cases (well, in fact not only this but it is easy to reproduce) leads to the whole router crash.
The problem consists of several parts:

  1. In old systemd versions (which is used in Debian Jessie and VyOS 1.2) exists a problem, when during a restart of systemd-journald all pipes between this daemon and systemd services are disconnecting.
  2. In vyos-hostsd, which is responsible for hostname and DNS and controlled by systemd we used print() for logging and debug purposed without enough handling of errors.

So, when arises the situation when there is no PIPE connection between vyos-hostsd and systemd-journald, vyos-hostsd not able to print messages and crashes. :(

We have prepared a fix for this, which replaces unsafe print() to logging, which is not sensitive to output problems and silently passes any such errors by default.
The only problem, which stays: after the stdout disconnection all logs messages will be lost in the open space between vyos-hostsd and systemd-journald, but after switching to Buster in VyOS 1.3 it should disappear too with a new version of systemd.

Thanks again for the detailed bug report!

pasik added a subscriber: pasik.Dec 19 2019, 8:34 PM

Hi @zsdc ,

Really thanks for your reply. It's nice to have this fix. But to be honest, crash of vyos-hostsd is not so big deal for me, what really concern me is that restart of vyos-hostsd is followed by restart of vyos-router.

In my understanding, vyos-router should only be called on boot time in most of case, we should avoid it's called in running phase if it's unintentional. I have checked unit file of vyos-hostsd, it uses RequiredBy as following:

[Install]
RequiredBy=cloud-init-local.service vyos-router.service

It means that no matter what condition is, as long as vyos-hostsd is restarted, vyos-router will be restarted. I admit that this binding is necessary in boot phase when the configuration is unloaded, but it's unnecessary in running phase when configuration is already loaded, it's not safe.

Is it possible to unbind relationship between vyos-hostsd and vyos-router in running phase?

best regards.

zsdc added a comment.Dec 31 2019, 3:17 PM

Hello, @MapleWang!
Unfortunately, I cannot find any other reliable way to configure vyos-hostsd service to be running before the vyos-router. In fact, vyos-hostsd is really necessary to be running for proper work of the VyOS system, so we can consider this even from the other point of view - how to keep all services operable after the vyos-router restart?
If you will have any ideas, which can help to decrease the overall impact of this situation, we would be happy to get them.

Hi @zsdc

My original thoughts was quite straight forward, modify /usr/libexec/vyos/init/vyos-router as below:

start()
{
    # in the last line of start phase, add a flag file in temporary file system like
    touch /tmp/vyos-router
}

# add new restart phase
case "$action" in
     start)  start ;;
     stop)  stop ;;
     restart)  restart ;;
     force-reload)  stop && start ;;
esac

restart()
{
    # check if the flag exists, if it exists, it means the system is running, then don't allow related restart
    if [ -e "/tmp/vyos-router" ]
    then
        return
    else
       stop && start
    fi  
}

But the truth is the restart phase defined in /usr/libexec/vyos/init/vyos-router won't be used by systemd at all, I checked document of systemd, it seems restart in systemd just call "Execstop && Execstart", there is no way to specify execution command for restart action. Then it would be awkward to implement my thoughts above. But hope it give you some hints to fix this problem.

best regards

Hi @zsdc

Why not use WantedBy instead of RequiredBy in vyos-hostsd.service like:

[Install]
WantedBy=cloud-init-local.service vyos-router.service

It can maintain relationship vyos-hostsd and vyos-router in startup phase and also won't lead to problem we described above in running phase.

By the way, may I say there are several bugs of stop function in vyos-router?

It's original code:

local -i status=-1
log_daemon_msg "Stopping VyOS router"
for ((i=${#sub_inits[@]} - 1; i >= 0; i--)) ; do
    s=${subinit[$i]}
    log_progress_msg $s
    ${vyatta_sbindir}/${s}.init stop
    let status\|=$?
done
log_end_msg $status
log_action_begin_msg "Un-mounting VyOS Config"
umount ${vyatta_configdir}
log_action_end_msg $?

/usr/lib/frr/frr stop

sub_inits should be subinit and frr should be frrinit.sh