Page MenuHomeVyOS Platform

qemu-kvm grub issue
Needs testing, HighPublicBUG

Description

The system enters grub mode after killing the VM process.

To reproduce

sever@sever:~/scripts/stresstest$ ps ax | grep qemu
 839938 ?        Sl     5:06 /usr/bin/qemu-system-x86_64 -name guest=r5-roll

sudo kill -9  839938

And start VM again.
Sometimes it happens the first time, sometimes the fifth.

I can't reproduce it with stable release 1.2.6-s1

Script for testing.

#!/usr/bin/env bash

VM_NAME="r5-roll"

while true; do

  virsh start ${VM_NAME}
  sleep 7
  sudo kill -9 $(ps ax | grep qemu | egrep -v grep | awk '{print $1}')
  sleep 1
done

Details

Difficulty level
Hard (possibly days)
Version
vyos-1.4-rolling-202101290218-amd64
Why the issue appeared?
Will be filled on close
Is it a breaking change?
Unspecified (possibly destroys the router)

Event Timeline

Looks like grub.cfg is being overwritten (note the 0 file size):

grub> ls -l (hd0,msdos1)/boot/grub/
DIR          20210129100148 i386-pc/
DIR          20210129100144 locale/
DIR          20210129100147 fonts/
0            20210129155859 grub.cfg
1024         20210129100147 grubenv

I think this is just filesystem corruption caused by yanking the power to the VM. serial-getty@ttyS0.service is also showing 0 file size (both are written in generate() in src/conf_mode/system_console.py in vyos-1x):

grub> ls -l (hd0,msdos1)/boot/1.4-rolling-202101290218/rw/etc/systemd/system/
0            20210129162816 serial-getty@ttyS0.service
DIR          20210129162816 getty.target.wants/
DIR          20210129100224 ntp.service.d/
DIR          20210129100302 ssh.service.d/

In the crux branch of vyatta-cfg-system, scripts/system/vyatta_update_console.pl doesn't overwrite those files unless they've changed (see update()).

Some weirdness: vyos-rsyslog.conf looks fine even though it's also unconditionally written in generate() in src/conf_mode/system-syslog.py in vyos-1x:

grub> ls -l (hd0,msdos1)/boot/1.4-rolling-202101290218/rw/etc/rsyslog.d/       
181          20210129162817 vyos-rsyslog.conf

grub> cat (hd0,msdos1)/boot/1.4-rolling-202101290218/rw/etc/rsyslog.d/vyos-rsyslog.conf 
## generated by syslog.py ##
## file based logging
$outchannel global,/var/log/messages,262144,/usr/sbin/logrotate /etc/logrotate.d/vyos-rsyslog
*.info;local7.debug :omfile:$global

Only some files in /etc are adjusted by the Python scripts. We can extend the render() function to only write the file if the content changed (by either read back the content or comparind checksums of the content).

Might this help?

I think we definitely need to try because this reproducible. @Viacheslav did you reproduce this on the ESXi hypervisor?

Emergency recovery procedure:

grub> ls (hd0,msdos1)/boot/
#Figure out directory name, in this case, this is 1.4-rolling-202102040221

grub> linux /boot/1.4-rolling-202102040221/vmlinuz boot=live quiet rootdelay=5 noautologin net.ifnames=0 biosdevname=0 vyos-union=/boot/1.4-rolling-202102040221 console=ttyS0,9600 console=tty0

grub> initrd /boot/1.4-rolling-202102040221/initrd.img
grub> boot
Viacheslav changed Difficulty level from Normal (likely a few hours) to Hard (possibly days).Mon, Feb 22, 9:23 AM
trae32566 added a subscriber: trae32566.

I've had this bite me a few times now as well, but I wasn't able to pin it down before to being a bug.

I attempted to fix this bug by journaling all filesystem data (ext3/4 mount option data=journal). Unfortunately, I still encountered an unbootable system.

I also attempted to fix this bug by writing to grub.cfg.new, calling fsync() on it, renaming it to grub.cfg, and calling fsync() on the directory. Unfortunately, I still encountered an unbootable system.

Viacheslav changed the task status from Open to Needs testing.Mon, Mar 8, 12:32 AM