Page MenuHomeVyOS Platform

qemu-kvm grub issue
Closed, ResolvedPublicBUG

Description

The system enters grub mode after killing the VM process.

To reproduce

sever@sever:~/scripts/stresstest$ ps ax | grep qemu
 839938 ?        Sl     5:06 /usr/bin/qemu-system-x86_64 -name guest=r5-roll

sudo kill -9  839938

And start VM again.
Sometimes it happens the first time, sometimes the fifth.

I can't reproduce it with stable release 1.2.6-s1

Script for testing.

#!/usr/bin/env bash

VM_NAME="r5-roll"

while true; do

  virsh start ${VM_NAME}
  sleep 7
  sudo kill -9 $(ps ax | grep qemu | egrep -v grep | awk '{print $1}')
  sleep 1
done

Details

Difficulty level
Hard (possibly days)
Version
vyos-1.4-rolling-202101290218-amd64
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)
Issue type
Bug (incorrect behavior)

Event Timeline

Looks like grub.cfg is being overwritten (note the 0 file size):

grub> ls -l (hd0,msdos1)/boot/grub/
DIR          20210129100148 i386-pc/
DIR          20210129100144 locale/
DIR          20210129100147 fonts/
0            20210129155859 grub.cfg
1024         20210129100147 grubenv

I think this is just filesystem corruption caused by yanking the power to the VM. serial-getty@ttyS0.service is also showing 0 file size (both are written in generate() in src/conf_mode/system_console.py in vyos-1x):

grub> ls -l (hd0,msdos1)/boot/1.4-rolling-202101290218/rw/etc/systemd/system/
0            20210129162816 serial-getty@ttyS0.service
DIR          20210129162816 getty.target.wants/
DIR          20210129100224 ntp.service.d/
DIR          20210129100302 ssh.service.d/

In the crux branch of vyatta-cfg-system, scripts/system/vyatta_update_console.pl doesn't overwrite those files unless they've changed (see update()).

Some weirdness: vyos-rsyslog.conf looks fine even though it's also unconditionally written in generate() in src/conf_mode/system-syslog.py in vyos-1x:

grub> ls -l (hd0,msdos1)/boot/1.4-rolling-202101290218/rw/etc/rsyslog.d/       
181          20210129162817 vyos-rsyslog.conf

grub> cat (hd0,msdos1)/boot/1.4-rolling-202101290218/rw/etc/rsyslog.d/vyos-rsyslog.conf 
## generated by syslog.py ##
## file based logging
$outchannel global,/var/log/messages,262144,/usr/sbin/logrotate /etc/logrotate.d/vyos-rsyslog
*.info;local7.debug :omfile:$global

Only some files in /etc are adjusted by the Python scripts. We can extend the render() function to only write the file if the content changed (by either read back the content or comparind checksums of the content).

Might this help?

I think we definitely need to try because this reproducible. @Viacheslav did you reproduce this on the ESXi hypervisor?

Emergency recovery procedure:

grub> ls (hd0,msdos1)/boot/
#Figure out directory name, in this case, this is 1.4-rolling-202102040221

grub> linux /boot/1.4-rolling-202102040221/vmlinuz boot=live quiet rootdelay=5 noautologin net.ifnames=0 biosdevname=0 vyos-union=/boot/1.4-rolling-202102040221 console=ttyS0,9600 console=tty0

grub> initrd /boot/1.4-rolling-202102040221/initrd.img
grub> boot
Viacheslav changed Difficulty level from Normal (likely a few hours) to Hard (possibly days).Feb 22 2021, 9:23 AM
trae32566 added a subscriber: trae32566.

I've had this bite me a few times now as well, but I wasn't able to pin it down before to being a bug.

I attempted to fix this bug by journaling all filesystem data (ext3/4 mount option data=journal). Unfortunately, I still encountered an unbootable system.

I also attempted to fix this bug by writing to grub.cfg.new, calling fsync() on it, renaming it to grub.cfg, and calling fsync() on the directory. Unfortunately, I still encountered an unbootable system.

Viacheslav changed the task status from Open to Needs testing.Mar 8 2021, 12:32 AM

Looks good on 1.4-rolling-202103080218:

vyos@vyos:~$ ls -l /etc/systemd/system/serial-getty@ttyS0.service 
-rw-r--r-- 1 root vyattacfg 1145 Mar  8 15:20 /etc/systemd/system/serial-getty@ttyS0.service
vyos@vyos:~$ ls -l /boot/grub/grub.cfg 
-rw-r--r-- 1 root root 2260 Mar  8 10:02 /boot/grub/grub.cfg
vyos@vyos:~$ show version

Version:          VyOS 1.4-rolling-202103080218
Release Train:    sagitta

Built by:         autobuild@vyos.net
Built on:         Mon 08 Mar 2021 02:18 UTC
Build UUID:       4debd270-2278-4f2e-84eb-200965ec8b40
Build Commit ID:  e758a269e84509

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest

Hardware vendor:  QEMU
Hardware model:   Standard PC (Q35 + ICH9, 2009)
Hardware S/N:     
Hardware UUID:    06d0fd71-2f1e-4e2f-a29c-79dc6ca82ca9

Copyright:        VyOS maintainers and contributors

I kill the vm 80+ times.
All works fine.
@c-po can we "cherry-pick" it to 1.3.0?

stepler changed the task status from Needs testing to Backport candidate.Mar 11 2021, 8:38 PM
c-po moved this task from Backport Candidates to Finished on the VyOS 1.4 Sagitta board.
c-po moved this task from Need Triage to Finished on the VyOS 1.3 Equuleus board.

Looks good on 1.3-rolling-202103280642:

vyos@vyos:~$ ls -l /etc/systemd/system/serial-getty@ttyS0.service 
-rw-r--r-- 1 root vyattacfg 1145 Mar 29 18:47 /etc/systemd/system/serial-getty@ttyS0.service
vyos@vyos:~$ ls -l /boot/grub/grub.cfg 
-rw-r--r-- 1 root root 2260 Mar 29 09:06 /boot/grub/grub.cfg
vyos@vyos:~$ show version

Version:          VyOS 1.3-rolling-202103280642
Release Train:    equuleus

Built by:         autobuild@vyos.net
Built on:         Mon 29 Mar 2021 03:42 UTC
Build UUID:       a3b25a73-a77e-440d-9000-099e5b504cd2
Build Commit ID:  56ecc5b821b2f0

Architecture:     x86_64
Boot via:         installed image
System type:      KVM guest

Hardware vendor:  QEMU
Hardware model:   Standard PC (Q35 + ICH9, 2009)
Hardware S/N:     
Hardware UUID:    688112b3-7dd7-42b5-9642-83ab9b7242ee

Copyright:        VyOS maintainers and contributors
SrividyaA set Issue type to Bug (incorrect behavior).Aug 30 2021, 6:13 PM