Use governance for emergency upgrades

lexa · May 30, 2024, 3:02pm

Planned software upgrades are coordinated through a software upgrade proposal with a 2-week voting period. Discussions about an expedited software upgrade proposal process (1-week voting period) are underway here.

Changing the process for regular software upgrade proposals is beyond the scope of this post.

Current state of the Hub’s coordinated/emergency upgrades

To begin, let’s think about our current process for all software upgrades. Why do some of them use governance and some not?

Gaia releases use semantic versioning.
In the binary’s code, the version name is a parameter which is identified by only the MAJOR name (e.g., v16, not v16.1 or v16.4). The binary for v16.0.0 and the binary for v16.1.0 both used ‘v16’ as the version name parameter.
- For those who remember v7-Theta and v8-rho, we had issues because the names are case-sensitive, which introduced opportunity for human error. After this, we changed to just the numeral format.
- Examples of where this constant is identified in the gaia code: v10, v16
- Example of v15.2, where you can see the version name is still v15
It’s a Hub norm that governance is used only for major upgrades (e.g., moving from v16 to v17, NOT moving from v16.0 to v16.1).
Cosmovisor requires a version to have a new name during upgrade to work reliably.
- Good: “v15” is the name for the v15.2.0 version; “v16” is the name for the v16.0.0 version. Cosmovisor is expected to handle this upgrade.
- Bad: “v16” is the name for the v16.0.0 version; “v16” is also the name for the v16.1.0 version. Cosmovisor is not expected to handle this upgrade.
- In some testnet events, we’ve had validators successfully use Cosmovisor for these minor upgrades. If this is a priority, we can keep investigating it as a possibility.
Minor and patch upgrades (regardless of whether they are addressing a minor bug or a major security vulnerability) are coordinated off-chain with a manually chosen halt-height.
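
To make the naming constraint above concrete, here is a minimal Python sketch (not Gaia or Cosmovisor code). The helpers `upgrade_name` and `name_changes` are hypothetical names made up for illustration; they only model the rule that an upgrade needs a new name, under the two naming schemes discussed in this thread.

```python
import re


def upgrade_name(version: str, full: bool = False) -> str:
    """Derive an upgrade name from a semantic version such as 'v16.1.0'.

    full=False models the major-only convention ("v16").
    full=True models the full semantic name ("v16.1.0").
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    major, minor, patch = match.groups()
    return f"v{major}.{minor}.{patch}" if full else f"v{major}"


def name_changes(old: str, new: str, full: bool = False) -> bool:
    """Return True if the upgrade name differs between the two releases."""
    return upgrade_name(old, full) != upgrade_name(new, full)


print(name_changes("v15.2.0", "v16.0.0"))             # True: "v15" -> "v16"
print(name_changes("v16.0.0", "v16.1.0"))             # False: both are "v16"
print(name_changes("v16.0.0", "v16.1.0", full=True))  # True: full names differ
```

With major-only names, a minor upgrade has no new name to hand to Cosmovisor; with full semantic names it would, at the cost of having to keep the release and the upgrade name in sync.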

Issues with current state

Coordinating upgrades involves a lot of human touch – Dante picks a halt height, @btruax sends a lot of emails and makes a lot of posts, I answer a lot of questions in Discord and Telegram
Tooling can easily track on-chain data for upgrades but can’t reliably track information about off-chain coordination for emergency upgrades
Validators who miss a non-governance upgrade potentially face consensus mismatches and AppHash errors

Suggested change

Coming from a thread on this post, we’re entertaining the idea of using governance for emergency upgrades.

I’m not the discussion police, but here’s what I think is useful to think about at this stage in the ideation:

What would using governance for emergency upgrades improve?
What new risks or issues could occur if we use governance for emergency upgrades?
How do these improvements or risks change based on other parameters that we might tweak (e.g., voting period for emergency upgrades)?

lexa · May 30, 2024, 3:13pm

Top-level post is intended to collect relevant info! Here are my personal opinions:

Validator attention

The Hub has a very responsive validator set that can be counted on in emergencies. I wouldn’t want to abuse that responsiveness by introducing a lot more rapid governance processes. However, I prefer (and I think validators would prefer) rapid governance with on-chain data for emergency upgrades over rapid off-chain coordination via email and Discord.

Below, I highlight the risk of exposing vulnerabilities and wonder about setting up an empty upgrade proposal (pointing at an empty repo) and then adding the binary 24h in advance of the upgrade height. If this is possible (and acceptable, from a software dev and social norm perspective) then I think the governance doesn’t actually have to be ‘rapid’ after all. We could probably use the same voting period as a normal software upgrade (whether that’s 1 or 2 weeks).

Risk of exposing vulnerabilities

Imo, the biggest risk in using governance for emergency upgrades is when we are fixing a major security vulnerability.

In this case, we need to publish the fix and have it go through governance before doing the upgrade. The worst case Ontario is that this fix is published (exposing the vulnerability) but the upgrade doesn’t actually pass and the chain doesn’t actually upgrade. Terrible!

@mpoke – would it be possible to put an upgrade on chain and point it at a repo where the fix will be published 24h in advance of upgrade height, the way we currently do for coordinated emergency upgrades? Like we set up the upgrade proposal to happen but don’t actually reveal the binary until we know the proposal is going to pass and it’s safe to reveal the fix.

Shift to using full semantic version names

I’d also welcome Marius’ thoughts on shifting from naming versions using only the major version to using the full semantic name. I believe Neutron does this (example) and thus uses governance for all upgrades, but it also introduces the risk of human error in failing to keep the release and upgrade name in sync (example from Neutron).

freak12techno · May 30, 2024, 3:34pm

As a validator, I heavily support this idea, here’s my overview:

What would using governance for emergency upgrades improve?

Currently, having an off-chain upgrade (that’s done via all validators setting a halt-height on their nodes and upgrading their binaries at the same time) has a few downsides compared to using governance upgrades:

There is no on-chain information about such an upgrade at all; the only way to learn about it is for the team to reach out via Discord, email, or similar. Personally, I prefer relying on automated monitoring that alerts me about things I need to act on rather than on Discord messages or emails: it’s really easy to get distracted and miss an important notification about an emergency chain upgrade, while a monitoring system is far more reliable. This is one of the reasons I built https://github.com/QuokkaStake/cosmos-node-exporter: it reports the current upgrade plan for a proposal that has passed but has not yet been applied, whether Cosmovisor has the appropriate binary in the correct folder for it, and the estimated time until the block at which it is applied. I can set up (and actually did set up) alerts for when I have an upgrade that I have no Cosmovisor binary prepared for and for when the upgrade is less than 30 minutes away, so I can be present. With on-chain information, I can fetch it from the chain or build tools to do it for me; if it’s off-chain, there’s no technical way to do so.
If a consensus-breaking upgrade is applied without governance, it’s going to create a lot of problems for people who want to bootstrap archive nodes from scratch. Imagine fetching all of the blocks and at some point your node crashes with an AppHash error because there was an upgrade that Cosmovisor knows nothing about and wasn’t prepared for. With governance upgrades, you can easily build all the binaries beforehand, and Cosmovisor will apply all of them for you without you needing to set a halt-height and replace the binary every time a non-governance upgrade happens.
If a consensus-breaking upgrade is applied without governance and a validator forgets to upgrade, their node produces a different AppHash (instead of the crash that would happen with a governance proposal), and it is far from obvious to the node owner why this happens. With the governance upgrade approach, the node just crashes saying that some upgrade wasn’t applied, which is much clearer.
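
As a rough illustration of the kind of check that on-chain upgrade data makes possible, here is a minimal Python sketch. It is not how cosmos-node-exporter is implemented. It assumes the plan has already been fetched from the Cosmos SDK REST endpoint `/cosmos/upgrade/v1beta1/current_plan` (which returns the height as a string) and that binaries follow the standard Cosmovisor layout `<DAEMON_HOME>/cosmovisor/upgrades/<name>/bin/<DAEMON_NAME>`. The function names, the average block time, and the 30-minute threshold are illustrative assumptions.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    name: str
    height: int


def plan_from_response(payload: dict) -> Optional[Plan]:
    """Parse a current_plan response; 'plan' is null when nothing is scheduled."""
    plan = payload.get("plan")
    if not plan:
        return None
    # The REST gateway encodes the int64 height as a JSON string.
    return Plan(name=plan["name"], height=int(plan["height"]))


def seconds_until(plan: Plan, current_height: int, avg_block_seconds: float) -> float:
    """Estimate the time left until the upgrade height (assumed block time)."""
    return max(plan.height - current_height, 0) * avg_block_seconds


def binary_path(daemon_home: str, daemon_name: str, plan: Plan) -> str:
    """Where Cosmovisor looks for the binary of a named upgrade."""
    return os.path.join(daemon_home, "cosmovisor", "upgrades", plan.name, "bin", daemon_name)


def alerts(plan: Plan, current_height: int, avg_block_seconds: float,
           daemon_home: str, daemon_name: str,
           threshold_seconds: float = 30 * 60) -> List[str]:
    """Return human-readable alerts for a scheduled upgrade."""
    found = []
    if not os.path.isfile(binary_path(daemon_home, daemon_name, plan)):
        found.append(f"no Cosmovisor binary prepared for upgrade {plan.name}")
    if seconds_until(plan, current_height, avg_block_seconds) < threshold_seconds:
        found.append(f"upgrade {plan.name} is less than {int(threshold_seconds // 60)} minutes away")
    return found
```

None of this is possible for a halt-height upgrade coordinated off-chain, because there is no plan to query.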

Considering everything above, in the ideal world IMO there should never be an upgrade that is done via halt-height and all the upgrades, even the emergency ones, should be applied via governance.

What new risks or issues could occur if we use governance for emergency upgrades?

I see two risks:

If it’s an expedited proposal, there’s obviously a risk of not reaching quorum in time, in which case the proposal would be converted to a regular one and everybody would have to wait longer while the vulnerability is still present.
If it’s really urgent, the fix should be applied as soon as possible, and having to wait longer might do more harm.

How do these improvements or risks change based on other parameters that we might tweak (e.g., voting period for emergency upgrades)?

One parameter that might influence this in both a good and a bad way is the voting period for emergency upgrades. Ideally, there should be a compromise between a shorter voting period for such a proposal (and therefore risking not reaching quorum in time) and a longer voting period (and therefore risking somebody figuring out the vulnerability and abusing it).
For me, a voting period that is somewhat okay considering both of these concerns is 2-3 days: there are chains that actually have a 2-day regular governance voting period (namely, Kujira) and somehow they reach quorum, so this should be manageable here. Also, all of the emergency upgrades I have seen here usually take 1-2 days to prepare.

mpoke · May 30, 2024, 5:15pm

@lexa Thanks for starting the conversation and for sharing your views.

I’d also welcome Marius’ thoughts on shifting from naming versions using only the major version to using the full semantic name. I believe Neutron does this (example) and thus uses governance for all upgrades, but it also introduces the risk of human error in failing to keep the release and upgrade name in sync (example from Neutron).

This shouldn’t be a problem. We used only the major version until now just because we didn’t need to use the minor. We never had a software upgrade proposal to bump minor versions of Gaia.

Imo, the biggest risk in using governance for emergency upgrades is when we are fixing a major security vulnerability.

In my view, emergency upgrades for both major and critical security vulnerabilities should not go through governance at all. Even sharing publicly that there is a critical vulnerability on the Hub would help attackers to exploit it.

would it be possible to put an upgrade on chain and point it at a repo where the fix will be published 24h in advance of upgrade height, the way we currently do for coordinated emergency upgrades? Like we set up the upgrade proposal to happen but don’t actually reveal the binary until we know the proposal is going to pass and it’s safe to reveal the fix.

I think this should be possible as the binaries are just provided in the info field of a software upgrade proposal. We actually had scenarios where we asked the validators to use another binary than the one in the proposal (e.g., the v15 upgrade).
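
To illustrate the point about the info field, here is a minimal Python sketch, assuming the info field uses the JSON layout that Cosmovisor's auto-download understands, i.e. `{"binaries": {"<os>/<arch>": "<url>"}}`. The function name `binary_url` and the example URL are hypothetical. An empty or binary-less info field models a proposal that is put on chain before the fix is revealed.

```python
import json
from typing import Optional


def binary_url(info: str, platform: str = "linux/amd64") -> Optional[str]:
    """Return the binary URL for a platform, or None if none is published."""
    if not info.strip():
        return None  # empty info field: nothing revealed yet
    try:
        parsed = json.loads(info)
    except json.JSONDecodeError:
        return None  # free-text info, no machine-readable binaries
    if not isinstance(parsed, dict):
        return None
    binaries = parsed.get("binaries")
    if not isinstance(binaries, dict):
        return None
    return binaries.get(platform)


# Hypothetical example values, not a real release URL.
info = '{"binaries": {"linux/amd64": "https://example.com/gaiad-linux-amd64"}}'
print(binary_url(info))          # https://example.com/gaiad-linux-amd64
print(binary_url(""))            # None
```

In the "reveal later" scenario, validators would still need to obtain the real binary out of band and place it in the Cosmovisor upgrade folder themselves, as happens today when the team asks for a different binary than the one in the proposal.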
