Failsafe reboot timer
Open, Wishlist, Public

Assigned To

Authored By

	jjakob
	Jun 6 2020, 9:26 AM

Description

It would be very nice to have a "failsafe reboot" mode for use after image upgrades: if the upgrade goes wrong and the system fails to boot or is inaccessible after the reboot, it would reboot back into the old image after a preset delay (e.g. 10-30 minutes). The op-mode command could be reboot-failsafe [time]. Of course, for this to work the migration scripts would need to save their output to a temporary place and only replace the default config.boot after a successful boot commit. For cases where the commit succeeds but the system is still inaccessible, the failsafe reboot would need to replace config.boot (which has already been migrated to the new version) with an archived copy of the pre-upgrade config.boot.
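
The "migrate to a temporary place, replace only after a successful commit" idea could look roughly like the sketch below. It is only an illustration: migrate_then_commit and its migrate and commit callbacks are hypothetical names, not existing VyOS functions, and the real migration and commit steps are far more involved.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def migrate_then_commit(config_path: Path,
                        migrate: Callable[[str], str],
                        commit: Callable[[str], bool]) -> bool:
    """Migrate into a temporary file; replace config_path only if commit succeeds.

    `migrate` and `commit` are hypothetical stand-ins for the real migration
    scripts and the boot-time commit.
    """
    migrated = migrate(config_path.read_text())
    # Create the temporary file next to config.boot so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=config_path.parent,
                                    prefix=config_path.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(migrated)
        if not commit(migrated):
            return False  # config.boot is left untouched
        os.replace(tmp_name, config_path)  # atomic swap after a successful commit
        return True
    finally:
        # Clean up the temporary file if it was not moved into place.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
```

With this ordering a failed commit leaves the old-syntax config.boot in place, so booting back into the old image needs no restore step.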

Details

Difficulty level: Unknown (require assessment)
Version: -
Why the issue appeared?: Will be filled on close
Is it a breaking change?: Behavior change
Issue type: Feature (new functionality)

Related Objects

Mentioned Here: T4516: Rewrite system image manipulation tools in Python
T3285: Schedule reboots through systemd-shutdownd instead of atd

Event Timeline

jjakob triaged this task as Wishlist priority. Jun 6 2020, 9:26 AM

jjakob created this task.

jjakob created this object in space S1 VyOS Public.

Sorry if I have misunderstood, but please allow me to share my opinion. When you add a new image from the old image in order to upgrade, the new image may be unable to load config.boot normally at startup (especially with rolling releases). In that case, it makes no difference whether the old config.boot was migrated or modified: either the configuration syntax has changed and errors still occur after migration, or some part of the system image itself has a problem. A better solution is to change GRUB's default boot entry so that the system boots from the old, working system image, if it exists, but this requires user authorization.

Of course, it would be very good if some settings could be disabled automatically when part of the configuration fails to load.
My understanding may be wrong. If so, please don't laugh at me. Thanks again.

Perhaps a good way is still for the user to choose whether to reconfigure or restart with the old image.

@jjakob Thanks for a good summary of one of the issues, namely, not replacing the config until 'success' (at least of boot, if not other criteria). Migration does save a backup, but we had recently discussed not making in-place changes by default. I would like to assess this in relation to other recent discussions of failsafe and rollback, and then proceed with that compatible change first ...

> Sorry if I have misunderstood, but please allow me to share my opinion. When you add a new image from the old image in order to upgrade, the new image may be unable to load config.boot normally at startup (especially with rolling releases).

Right, the rolling image often has bugs that prevent it from successfully committing config.boot on startup. If this failure leaves the system inaccessible via SSH (failure to configure ethernet, bond or bridge interfaces, DHCP failures, firewall failures, ...) there's no way to fix it remotely. If the router is at a remote site, it's very time-consuming and costly to go out and fix it physically via the console.

> In that case, it makes no difference whether the old config.boot was migrated or modified.

The way the current boot-time commit works (I'm not 100% certain and may be wrong in some parts) is that it first checks whether any components of the config require syntax migration. If so, it backs up config.boot to config.boot.pre-migration-$(date) and runs the migration scripts, each of which modifies the configuration to bring its syntax up to date. It then saves the new component version string and the new config into config.boot, and commits it. The config.boot is therefore already migrated to the new syntax even if the commit then fails, which means you can't simply reboot into the old image and expect things to work: the old image expects the old config syntax, but the config has already been migrated to the new one. We need to replace config.boot with its .pre-migration version before rebooting into the old image.
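
As a rough illustration of that last step, the sketch below picks the newest config.boot.pre-migration-* file and copies it over config.boot. The function names are made up for this example, and because the exact $(date) suffix format isn't confirmed here, it orders backups by modification time rather than by parsing the name.

```python
import shutil
from pathlib import Path
from typing import Optional


def latest_pre_migration_backup(config_dir: Path) -> Optional[Path]:
    """Return the most recently modified config.boot.pre-migration-* file, if any."""
    backups = list(config_dir.glob("config.boot.pre-migration-*"))
    if not backups:
        return None
    # Assumption: the $(date) suffix format is unknown, so mtime decides the order.
    return max(backups, key=lambda p: p.stat().st_mtime)


def restore_pre_migration(config_dir: Path) -> bool:
    """Copy the newest pre-migration backup over config.boot.

    Returns False (and changes nothing) if no backup exists.
    """
    backup = latest_pre_migration_backup(config_dir)
    if backup is None:
        return False
    shutil.copy2(backup, config_dir / "config.boot")
    return True
```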

> Either the configuration syntax has changed and errors still occur after migration, or some part of the system image itself has a problem. A better solution is to change GRUB's default boot entry so that the system boots from the old, working system image, if it exists, but this requires user authorization.

Yes, my proposal would work like this: if the user reboots the system with reboot-failsafe (or reboot-confirm, as it's similar to commit-confirm), the command would enable a systemd timer. This timer would periodically (e.g. every 5 minutes) alert the user, by printing a message via "wall", that the reboot still needs to be confirmed or the system will reboot in x minutes. If the system uptime then reaches a set limit (I'm thinking 30 minutes would be a good default), it would restore the old config.boot from /config/archive/config.boot or config.boot.1 (I still need to check in which cases it is committed to the archive and which historical version is the latest pre-migration/commit one), replace the default entry in grub.cfg with the previously active image (we'd need to record this somewhere, likely in grub.cfg itself, as the previously active image isn't necessarily the next-oldest image!), disable the reboot-confirm timer, and reboot. On the next login the user would be warned that a reboot-confirm failed and the system was automatically rebooted into the old image (I'd need to research how this would best be done).
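
The timing part of this proposal can be sketched as a small decision function. Everything here is hypothetical (Action, next_action, warning_text); the defaults just mirror the 5 and 30 minute figures above. The caller is assumed to supply the uptime (on Linux it could come from /proc/uptime) and to do the actual wall message, GRUB change and reboot.

```python
import math
from enum import Enum


class Action(Enum):
    WAIT = "wait"          # nothing to do yet
    WARN = "warn"          # remind the user to confirm the reboot
    ROLLBACK = "rollback"  # deadline reached: switch image and reboot


def next_action(uptime_s: float, last_warn_s: float,
                deadline_s: float = 30 * 60,
                warn_every_s: float = 5 * 60) -> Action:
    """Decide what an unconfirmed failsafe reboot should do at this uptime."""
    if uptime_s >= deadline_s:
        return Action.ROLLBACK
    if uptime_s - last_warn_s >= warn_every_s:
        return Action.WARN
    return Action.WAIT


def warning_text(uptime_s: float, deadline_s: float = 30 * 60) -> str:
    """Build the reminder message, rounding the remaining time up to whole minutes."""
    minutes_left = max(0, math.ceil((deadline_s - uptime_s) / 60))
    return (f"Reboot not confirmed: the system will reboot into the "
            f"previous image in {minutes_left} minute(s).")
```

Keeping the decision separate from the side effects makes it testable without rebooting anything, and it works the same whether it is driven by a systemd timer or a long-running process.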

> Of course, it would be very good if some settings could be disabled automatically when part of the configuration fails to load.

This already kind of exists: if a certain part of the config fails, the console will show that the vyos-router service config load failed, and the other services will still be operational. As I said, this proposal is for failures in those services that make the system inaccessible from the network, so that the failure can't be recovered from remotely.

Yes, let me confirm a few details here ....

> The way the current boot-time commit works (I'm not 100% certain and may be wrong in some parts) is that it first checks whether any components of the config require syntax migration. If so, it backs up config.boot to config.boot.pre-migration-$(date) and runs the migration scripts, each of which modifies the configuration to bring its syntax up to date. It then saves the new component version string and the new config into config.boot, and commits it. The config.boot is therefore already migrated to the new syntax even if the commit then fails, which means you can't simply reboot into the old image and expect things to work: the old image expects the old config syntax, but the config has already been migrated to the new one. We need to replace config.boot with its .pre-migration version before rebooting into the old image.

In my experience, a VyOS upgrade uses the following command:

add system image image.iso

During the upgrade, config.boot is migrated into the new image, but it seems that the old image's config.boot is not changed (or maybe I remember it wrong). I used to simply choose the older boot image when the migration failed.

Use the following command to view the old image's config.boot file:

show file image-name://config/config.boot

Before the upgrade, you could ask the user whether they want fallback-on-failure enabled for the first boot after the upgrade, and have them select a known-good old image to fall back to if the migration fails.

Ah right, then none of the things about replacing config.boot are necessary. I was thinking that /config was shared between images; I don't know how I forgot that it lives inside each image separately.

The timer isn't necessary either. It can be a systemd service that starts a Python process which monitors uptime and issues periodic wall messages, then edits grub.cfg and reboots if it hasn't been stopped by a reboot-confirm first.

The whole point of this command is to make failure recovery easy on systems where you don't have remote console access, just SSH, so editing grub.cfg is necessary.

I look forward to your success.

In fact, you may be able to switch the default boot entry simply by calling the following command:

set system image default-boot image-name

Right, that's obvious. The issue is that we need to know *which* image to switch to, but that's easy to solve, as I'll describe.

Imagine a scenario where you install a newer image, reboot into it, and it fails to boot, so you boot back into the old image. When the bug that caused the failure is supposedly fixed, you install a still newer image. Now you have three images: the oldest one, which works; the first newer one, which doesn't; and the second newer one, which you just installed. You now reboot-failsafe (reboot-confirm) into the newest one, but there's still something wrong and it fails to boot. The system then needs to reboot not into the next-oldest image (the first newer one, which doesn't work) but into the oldest one (the one the new image was installed from).

Therefore, when modifying the default boot image (either via 'add system image' or 'set system image default-boot'), the script that modifies grub.cfg needs to record the currently running image in it (easiest done by adding a comment line to grub.cfg with that image's menu entry number), so that the failsafe reboot knows which image to reboot into (it parses the entry number out of the GRUB comment and sets the default entry to it).
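
Here is a minimal sketch of that comment-line idea, working on the text of grub.cfg. The "# failsafe-recovery-entry=N" marker is invented for this example, and the code assumes grub.cfg selects its default with a "set default=N" line; the existing image scripts may handle this differently.

```python
import re
from typing import Optional

# Hypothetical marker comment; not something existing VyOS scripts write today.
MARKER = "# failsafe-recovery-entry="
_MARKER_RE = re.compile(r"^# failsafe-recovery-entry=(\d+)[ \t]*$", re.MULTILINE)
# Assumption: the default entry is selected by a "set default=N" line.
_DEFAULT_RE = re.compile(r"^([ \t]*set default=)\S+[ \t]*$", re.MULTILINE)


def record_recovery_entry(cfg: str, entry: int) -> str:
    """Store the recovery menu entry number as a comment, replacing any old one."""
    line = f"{MARKER}{entry}"
    if _MARKER_RE.search(cfg):
        return _MARKER_RE.sub(line, cfg, count=1)
    return line + "\n" + cfg


def read_recovery_entry(cfg: str) -> Optional[int]:
    """Return the recorded recovery entry number, or None if no marker is present."""
    match = _MARKER_RE.search(cfg)
    return int(match.group(1)) if match else None


def set_default_entry(cfg: str, entry: int) -> str:
    """Point the first 'set default=' line at the given menu entry number."""
    new_cfg, count = _DEFAULT_RE.subn(lambda m: f"{m.group(1)}{entry}", cfg, count=1)
    if count == 0:
        raise ValueError("no 'set default=' line found")
    return new_cfg
```

The failsafe reboot would then call read_recovery_entry on the current grub.cfg and pass the result to set_default_entry before rebooting.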

When upgrading, could you ask the user whether they want to enable the migration-failure fallback mechanism for the first boot of the new image? If the user chooses to enable it, you should let them select a known-good old image (and run the mechanism only on the first boot).

> Therefore, when modifying the default boot image (either via 'add system image' or 'set system image default-boot'), the script that modifies grub.cfg needs to record the currently running image in it (easiest done by adding a comment line to grub.cfg with that image's menu entry number), so that the failsafe reboot knows which image to reboot into (it parses the entry number out of the GRUB comment and sets the default entry to it).

This is indeed a solution, but consider whether it will conflict with the following command.

set system image default-boot image-name

> When upgrading, could you ask the user whether they want to enable the migration-failure fallback mechanism for the first boot of the new image? If the user chooses to enable it, you should let them select a known-good old image (and run the mechanism only on the first boot).

That's exactly what I was thinking, and I was just about to comment with that suggestion. 'add system image' already has a prompt that asks which image to make the default boot image. That could be followed by a prompt asking which image the user wants to make the "recovery" image, defaulting to the currently booted image. We could even add a command, set system image recovery, that shows the same prompt.
Regarding the prompt for a failsafe reboot: if add system image currently prompts for a normal reboot (I don't remember whether it does), that prompt could be replaced by one for a failsafe reboot.
If there is no prompt, we could perhaps add a simple notice, "You may do a failsafe reboot with 'reboot-failsafe' instead of a standard reboot if you prefer. Remember to confirm the reboot if it's successful.", and the user can then run whichever reboot they want (the normal reboot still being an option).
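
The "defaulting to the currently booted image" part of the recovery prompt could be as small as the sketch below. choose_recovery_image is a hypothetical helper; the image list and the user's answer are assumed to come from the existing prompt code.

```python
from typing import Sequence


def choose_recovery_image(installed: Sequence[str], running: str,
                          answer: str = "") -> str:
    """Pick the recovery image; an empty answer selects the running image."""
    choice = answer.strip() or running
    if choice not in installed:
        raise ValueError(f"unknown image: {choice!r}")
    return choice
```

Defaulting to the running image rather than the next-oldest installed one covers the three-image scenario described earlier, where the next-oldest image is the broken one.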

> This is indeed a solution, but consider whether it will conflict with the following command.

No, it won't, as that's the script that actually needs to record the old image too, so it needs to be modified as part of this change. Done right, there are no conflicts.

Is this solved with T3285?

syncer edited projects, added VyOS 1.3 Equuleus (1.3.0); removed VyOS 1.3 Equuleus. Nov 6 2021, 11:24 AM

syncer edited projects, added VyOS 1.3 Equuleus (1.3.3); removed VyOS 1.3 Equuleus (1.3.0). Aug 29 2022, 7:05 AM

This should be implemented under the rewrite of system-image-tools:
https://vyos.dev/T4516

pasik added a subscriber: pasik. Apr 13 2023, 5:15 PM

GernhardReinlunzen added a subscriber: GernhardReinlunzen. Jun 27 2023, 5:26 PM

I-n-d-y added a subscriber: I-n-d-y. Jun 30 2023, 6:27 AM

kevinrausch added a subscriber: kevinrausch. Jan 16 2024, 2:15 PM

dmbaturin edited projects, added VyOS 1.5 Circinus; removed VyOS 1.4 Sagitta. Fri, Apr 12, 3:07 PM
