Cosmos Hub 3 Upgrade Post-Mortem
In preparation for the upgrade, validators halted the cosmoshub-2 chain at 11:39 am UTC. Around twenty-five minutes into the migration, multiple validators uncovered an issue with the derived genesis file that was caused by a bug in the migration command.
This bug meant that cosmoshub-3 could not launch with the instructions provided in the upgrade proposal. Validators then followed the proposal’s recovery plan to relaunch the cosmoshub-2 chain, which resumed operating at 1:54 pm UTC. Total downtime for this failed upgrade was 2 hours and 15 minutes.
This bug would likely have been caught during the gaia-13004 (v0.34.7) to gaia-13005 (v1.0.0-rc3) upgrade, but there were some additional changes to the migration logic that were not tested during the gaia-13005 to gaia-13006 (v2.0.0) upgrade. A fix to the migration issue has been contributed by Kwun Yeung, and the SDK team is putting together a release (v2.0.2) of gaia (stay tuned).
This failure to upgrade exposed flaws in our testing and upgrade procedures as well as in some internal processes. Discussions with validators over the last week identified the following actions to reduce the likelihood of a similar issue:
- The automated upgrades work being done by the Regen Network team (under contract with the ICF) should be prioritized to mitigate issues on subsequent releases. This upgrade method achieved a full upgrade of a decentralized network with under 2.5 minutes of downtime. For comparison, the cosmoshub-3 upgrade anticipated 1 hour of downtime.
- Run a full export/migration of mainnet against the simulator for each release. This is an easy step to add to the release process and would have prevented the issue.
- Create tooling that allows validators to easily spin up a testnet from a clone of mainnet. This tooling would replace validator public keys in the genesis file and allow a small number of validators to test any upgrade/migration against a clone of mainnet. There is an ongoing community effort to launch a testnet forked from mainnet with the patch. If you would like to be involved, coordination is currently happening in this Telegram group.
- Improve contingency and rollback documentation to ensure that validators are better able to recover from any state. Potentially, some large validators and/or ICF/AiB could maintain recent copies of the ~/.gaiad/ directories in S3 buckets to help validators quickly start up in the case of failure.
- Future upgrades should include explicit failure-to-launch criteria to remove ambiguity.
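As a rough illustration of the mainnet-clone tooling suggested above, the following Python sketch swaps validator public keys throughout an exported genesis file and gives the fork its own chain ID. It is only a sketch under stated assumptions, not existing tooling: the function names `replace_pubkeys` and `fork_genesis` are hypothetical, keys are assumed to appear as plain string values in the JSON, and the chain ID is assumed to live in a top-level `chain_id` field. If the same key appears in more than one encoding, each encoding would need its own entry in the mapping.

```python
import json


def replace_pubkeys(node, key_map):
    """Return a copy of a parsed JSON value with mapped strings swapped.

    key_map maps a mainnet public key string to the test key that should
    replace it. Only string *values* are replaced; dict keys are untouched.
    """
    if isinstance(node, dict):
        return {k: replace_pubkeys(v, key_map) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_pubkeys(v, key_map) for v in node]
    if isinstance(node, str):
        return key_map.get(node, node)
    # Numbers, booleans and null pass through unchanged.
    return node


def fork_genesis(genesis_text, key_map, new_chain_id):
    """Build a testnet genesis from exported genesis JSON text.

    Assumption: the chain ID is stored in a top-level "chain_id" field.
    """
    genesis = json.loads(genesis_text)
    forked = replace_pubkeys(genesis, key_map)
    # A distinct chain ID keeps the fork from being confused with mainnet.
    forked["chain_id"] = new_chain_id
    return json.dumps(forked, indent=2, sort_keys=True)
```

A small group of validators could agree on the key mapping ahead of time, generate the forked genesis once, and then run the proposed migration against it before the real upgrade.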
On a positive note, the issue was quickly identified and validators were able to relaunch the chain with no double-signing incidents or downtime slashing. This continues to demonstrate the operational excellence of the Cosmos validator set and their ability to execute under fire as a group. There were also a couple of validators who managed to write scripts to fix the genesis file during the chaos of the failing upgrade (shout out to Oliver from StakeWithUs).
One item of note: splitting communication across multiple channels is a debated topic. Some view it as decentralizing and positive, while others would like one channel for all communication.