Cosmos Hub v17.1 Chain Halt - Post-mortem

Overview

On June 5th, 2024, at 19:21 (UTC), the Cosmos Hub chain halted, and users could not successfully execute any transactions. The incident occurred shortly after the scheduled v17 software upgrade took place. The upgrade introduced a bug that is triggered when a validator leaves the active set of validators and another validator takes its place. The Informal Systems Cosmos Hub team was first informed about the incident by the Informal Staking team. The Informal Systems Cosmos Hub team, in coordination with Hypha and Binary Builders, provided the fix, and the chain resumed on June 6th, 2024, at 0:02 (UTC).

Timeline

| Event | Block Height | Time (UTC) |
| --- | --- | --- |
| Chain upgrade started | 20739800 | 16:58 (on June 5th, 2024) |
| Chain upgrade completed | 20739802 | 17:15 |
| Chain halts | 20740970 | 19:21 |
| Informed (over Slack) by the Informal Staking team that the chain has halted | chain is halted | 19:46 |
| Hypha (over Slack) confirms the error in their mainnet node as well | chain is halted | 19:47 |
| Hypha, the Informal Systems Cosmos Hub team, and Binary Builders meet on Zoom to fix the issue | chain is halted | 20:06 |
| Fix the issue and cut Cosmos SDK v0.47.15-ics-lsm | chain is halted | 21:35 (from git show v0.47.15-ics-lsm) |
| Cut Gaia v17.2.0 (using Cosmos SDK v0.47.15-ics-lsm) | chain is halted | 22:28 (from git show v17.2.0) |
| Published the release binaries after running automated tests | chain is halted | 22:57 (see the action) |
| Chain resumes | 20740972 | 0:02 (on June 6th, 2024) |

The total time the chain was halted was around 4 hours and 40 minutes.

The above timeline does not include the communication steps taken to keep validators updated on what happened (over Discord) or the announcement of when they should switch to the new v17.2.0 release.

Issue

The Cosmos-SDK staking module has “hooks” that allow 3rd party modules to run code at specific points within the execution of a block. The ICS provider module runs logic in the AfterUnbondingInitiated hook, which is called during EndBlock when the staking module computes updates to the validator set. Under certain circumstances, the state of the staking module is inconsistent when this hook executes 3rd party code. Specifically, when a validator is added to the active set and another validator is removed from the active set in the same block, the number of validators recorded as bonded can momentarily exceed the MaxValidators parameter.
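To make the hook mechanism concrete, here is a minimal, self-contained toy model in Go. The type and method names are simplified stand-ins, not the actual Cosmos SDK or ICS interfaces; it only illustrates how a third-party module’s code ends up running in the middle of the staking module’s EndBlock processing.

```go
package main

import "fmt"

// StakingHooks is a simplified stand-in for the hook interface the staking
// module exposes to third-party modules such as the ICS provider module.
type StakingHooks interface {
	// Called when an unbonding operation starts, which happens during
	// EndBlock while the validator-set update is still being computed.
	AfterUnbondingInitiated(unbondingID uint64) error
}

// providerHooks models the ICS provider module's hook implementation.
type providerHooks struct{}

func (providerHooks) AfterUnbondingInitiated(unbondingID uint64) error {
	// Third-party logic runs here. At this point the staking state may be
	// mid-update: momentarily more validators can be recorded as bonded
	// than MaxValidators allows.
	fmt.Println("provider module reacting to unbonding operation", unbondingID)
	return nil
}

// endBlockValidatorUpdates models the staking module's EndBlock step that
// invokes the hook before the bookkeeping for the outgoing validator is done.
func endBlockValidatorUpdates(hooks StakingHooks) {
	// ...compute which validators join or leave the active set...
	_ = hooks.AfterUnbondingInitiated(42) // the hook fires mid-update
	// ...only afterwards is the outgoing validator removed...
}

func main() {
	endBlockValidatorUpdates(providerHooks{})
}
```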

The ICS provider module code that caused this chain halt calls the function GetLastValidators. This function happens to include a sanity check that halts the chain if the number of validators exceeds the MaxValidators parameter. As a temporary fix for the chain halt, we changed the GetLastValidators function to return, without panicking, only the first MaxValidators validators with the most power.
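The shape of the change, shown on a simplified, self-contained toy (the names and types below are stand-ins, not the actual Cosmos SDK code):

```go
package main

import "fmt"

// Toy model of the temporary workaround (simplified; not the actual SDK code).
type validator struct{ operator string }

type keeper struct {
	maxValidators int
	// stand-in for the entries stored under LastValidatorPowerKey
	lastValidatorPowers []validator
}

// getLastValidators used to panic when the store held more entries than
// MaxValidators; the workaround stops iterating instead.
func (k keeper) getLastValidators() []validator {
	out := make([]validator, 0, k.maxValidators)
	for i, v := range k.lastValidatorPowers {
		if i >= k.maxValidators {
			// The original sanity check panicked here, halting the chain.
			break // workaround: truncate the result instead of panicking
		}
		out = append(out, v)
	}
	return out
}

func main() {
	k := keeper{
		maxValidators: 2, // think "180" on the Hub
		// three entries model the transient 181-validator window in EndBlock
		lastValidatorPowers: []validator{{"valA"}, {"valB"}, {"valC"}},
	}
	fmt.Println(k.getLastValidators()) // prints [{valA} {valB}] instead of panicking
}
```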

This workaround does not risk corrupting the state of the ICS module because the list of validators is not used at this point and is merely computed as part of a convenience function. As a permanent fix for this issue, we will refactor the ICS provider code to remove the call to GetLastValidators at this point. This fix should also have a slight performance benefit.

Room for improvement

Although we routinely and robustly test code before we run an upgrade, sometimes bugs happen. With that said, there is always room for improvement. We identified several key areas in which we can improve.

  1. In both our automated randomized tests and the testnet, we did not set the MaxValidators parameter to a number lower than the number of validators on the chain. For this reason, the bug was never hit in tests. Test cases where we randomize all chain parameters would catch more edge cases like this (a test sketch follows this list), and we should have inactive validators in future testnet scenarios.
  2. The bug was triggered by a convenience function in our code, which retrieves several different pieces of data, not all of which are used by all callers of the function. In this case, the data retrieved by the query that triggered the bug was not even used where the bug was triggered. We have opened an issue to optimize the code to query only the necessary data at every point (Create lightweight version for GetAllConsumerChains · Issue #1943 · cosmos/interchain-security · GitHub).
  3. The bug was triggered because the state is inconsistent while certain Cosmos-SDK hooks run. This may be unavoidable since the intended use of hooks is to allow third-party modules to insert logic into the functioning of another module. We will work with the Cosmos-SDK team to understand which hooks may see an inconsistent state and whether they can be made consistent, or to document the inconsistency where it is unavoidable.
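As an illustration of the first point, here is a hedged sketch of what a randomized-parameter test could look like. Everything in it is hypothetical (the package, helper, and scenario names are made up for illustration and are not our actual test harness); the key idea is simply that MaxValidators is drawn so that it can be lower than the number of validators.

```go
package chaintest

import (
	"math/rand"
	"testing"
)

// Hypothetical sketch: randomize chain parameters so that MaxValidators can
// fall below the number of candidate validators, forcing active-set rotation
// (one validator in, one out) in the generated scenarios.
func TestRandomizedValidatorSetRotation(t *testing.T) {
	for i := 0; i < 100; i++ {
		numValidators := 5 + rand.Intn(200)
		// Deliberately allow MaxValidators to be smaller than the number of
		// validators, so some validators stay outside the active set.
		maxValidators := 1 + rand.Intn(numValidators)

		active := simulateOneBlockWithRotation(numValidators, maxValidators)
		if len(active) > maxValidators {
			t.Fatalf("active set has %d validators, exceeding MaxValidators=%d",
				len(active), maxValidators)
		}
	}
}

// simulateOneBlockWithRotation is a trivial placeholder that just fills the
// active set up to maxValidators; a real test would execute a block in which
// a jailed validator rejoins and pushes the weakest bonded validator out.
func simulateOneBlockWithRotation(numValidators, maxValidators int) []int {
	active := make([]int, 0, maxValidators)
	for v := 0; v < numValidators && len(active) < maxValidators; v++ {
		active = append(active, v)
	}
	return active
}
```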

More details

This log (given to us by Hypha) provides the stack trace of what led to the halt. By looking at this log and at the code, we were able to determine the root cause of the issue.

The chain halted due to a panic in the GetLastValidators method. The method found more bonded validators than the maximum of 180 bonded validators, and since GetLastValidators was called during EndBlock, the resulting panic halted the chain.

To see why the panic was raised, we need to understand that validator addresses are stored in at least two different places in the Cosmos SDK, under the following keys:

  1. LastValidatorPowerKey: intended to store exactly the bonded validators (though, as described below, it can temporarily hold more);
  2. ValidatorsByPowerIndexKey: all validators are stored under this key.

GetLastValidators returns all the bonded validators and hence iterates over LastValidatorPowerKey. When the staking module computes the validator updates to send to CometBFT (in ApplyAndReturnValidatorSetUpdates), it iterates over ValidatorsByPowerIndexKey, going through validators up to the maximum number of validators. For each validator, it checks whether that validator was previously a bonded validator (i.e., whether it is found in the last map; note that the last map is created by iterating over LastValidatorPowerKey and hence contains the bonded validators). If the validator is not found, it is a new validator that was not bonded and not part of the active set and is about to join it, so it is stored under LastValidatorPowerKey and is from then on considered a bonded validator.

At this point, more than 180 validators are stored under the LastValidatorPowerKey key, so calling GetLastValidators would iterate over more than 180 validators and panic. Later in the same function, validators that are no longer part of the active set are deleted, and the number of bonded validators stored under LastValidatorPowerKey drops back to 180. However, in the window between the new validator joining and the outgoing validator being deleted, GetLastValidators can panic. This is precisely what happened, because during this window we call bondedToUnbonding, which ends up calling GetLastValidators.

More specifically, the incident occurred because a validator issued a MsgUnjail transaction that resulted in this validator joining the active set. The unjail calls unjailValidator, which in turn calls SetValidatorByPowerIndex. This stores the validator under the types.GetValidatorsByPowerIndexKey(validator, k.PowerReduction(ctx)) key, namely under ValidatorsByPowerIndexKey. ValidatorsPowerStoreIterator iterates through the store under this key, so when ApplyAndReturnValidatorSetUpdates goes through all the validators, the just-unjailed validator is not found in the previous block's set (because it was jailed), and the staking module calls SetLastValidatorPower for it. At this point, from the LastValidatorPowerKey perspective, we have 181 validators, because the validator that will be removed from the active set has not yet been removed with DeleteLastValidatorPower. Afterwards, as we go through the remaining validators, we call bondedToUnbonding, which in turn calls BeginUnbondingValidator, which calls the AfterUnbondingInitiated hook, which finally calls GetAllConsumerChains. GetAllConsumerChains calls GetLastValidators while we still have 181 validators, so GetLastValidators panics.
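To tie the pieces together, here is a self-contained toy program (in Go, with made-up names and values; a model of the logic described above, not the actual SDK code) that reproduces the window in which the bonded set temporarily holds MaxValidators + 1 entries while the hook runs:

```go
package main

import "fmt"

// maxValidators plays the role of the Hub's 180.
const maxValidators = 2

var (
	// lastValidatorPower models the entries under LastValidatorPowerKey
	// (the bonded set); powerIndex models ValidatorsByPowerIndexKey.
	lastValidatorPower = map[string]int64{"valA": 10, "valB": 9}
	powerIndex         = []string{"valC", "valA", "valB"} // valC just unjailed, highest power
)

// getLastValidators mimics the sanity check that panicked on mainnet.
func getLastValidators() []string {
	if len(lastValidatorPower) > maxValidators {
		panic("more bonded validators than MaxValidators")
	}
	out := make([]string, 0, len(lastValidatorPower))
	for v := range lastValidatorPower {
		out = append(out, v)
	}
	return out
}

// afterUnbondingInitiatedHook mimics the provider hook, which ends up calling
// getLastValidators. In this toy we recover from the panic only to print a
// message; the real chain has no such recovery and simply halts.
func afterUnbondingInitiatedHook() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("chain would halt here:", r)
		}
	}()
	_ = getLastValidators()
}

// applyAndReturnValidatorSetUpdates mimics the staking EndBlock logic: the
// incoming validator is written under LastValidatorPowerKey before the
// outgoing one is deleted, and the hook fires in between.
func applyAndReturnValidatorSetUpdates() {
	// Walk the power index (as the real code walks ValidatorsByPowerIndexKey),
	// up to maxValidators entries, bonding any validator not bonded before.
	for _, v := range powerIndex[:maxValidators] {
		if _, bonded := lastValidatorPower[v]; !bonded {
			// valC enters the active set: lastValidatorPower now has 3
			// entries while maxValidators is 2 (the "181 validators" state).
			lastValidatorPower[v] = 11
		}
	}

	// valB (now outside the top maxValidators by power) begins unbonding;
	// the hook fires and hits the sanity check.
	afterUnbondingInitiatedHook()

	// Only afterwards is the outgoing validator removed, which is why the
	// window above exists.
	delete(lastValidatorPower, "valB")
	fmt.Println("bonded set back to", len(lastValidatorPower), "entries")
}

func main() { applyAndReturnValidatorSetUpdates() }
```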

Note that the incident could have occurred not only due to an unjailing operation but whenever the active set of 180 validators changed by one validator coming in and another going out.

Finally, note that although we call GetLastValidators in multiple other places (e.g., in QueueVSCPackets, which is called in EndBlockVSU and hence during an EndBlock), the panic never triggers there. This is because those calls to GetLastValidators do not happen during ApplyAndReturnValidatorSetUpdates, when we might momentarily have slightly more than 180 validators.

17 Likes

This was an incredibly rapid and effective post-mortem process – huge props to everyone involved from Informal especially in how quickly we got to the bottom of this behaviour from a technical standpoint.

> we should have inactive validators in future testnet scenarios.

In Hypha’s internal debrief, we talked about why we didn’t encounter this issue on testnet even though we’ve been running v17 for weeks, including during ISLE.

We try to keep most of the params on the testnet close to Hub params, including maxValidators. Right now our provider chain’s maxValidators param is 175 despite having only 45 validators actually active and securing the network. There’s plenty of room in the active set, so validators performing normal operations like bonding, unbonding, unjailing, etc don’t actually put the chain in a state where we have 176 bonded validators.

Another thing we discussed is that our testnets have to meet two (sometimes contradictory) goals: training and realism. We want validators to know what to do and be capable of doing it perfectly, but we also want to experience the chaos that we’re sure to see on mainnet.

For both ISLE and the Testnet Incentives Program, we expect validators to remain unjailed for the full period. So this exact incident (validator unjails, joins the active set, and causes there to momentarily be more than maxValidators) is unlikely to occur on testnet even if we reduce maxValidators because… testnet validators rarely get jailed.

In other words – great training, but not great realism.

Going forward on testnet, we’ll consider some different methods of better reproducing mainnet conditions here (e.g., reducing the param, running lots of small validators so we always have some inactive vals, introducing more chaos into our stake distributions, holding game days etc) even though our validators are unrealistically well trained :sweat_smile:

6 Likes

The QA team gets paid way too much money for this to happen. This clearly wasn’t some kind of weird edge case bug because it transpired immediately after the upgrade - in other words it should have been caught in the testnet.

From what I see in lexa’s post, it is clear that the testnet is only a partial copy of the production environment. This is the reason Gavin has Kusama, which is a pre-production environment. Pre-production environments are different from testing environments: they are not environments in which new functionality is tested but environments that mirror production conditions. Pre-prod (or UAT - user acceptance testing) environments are there exactly to catch bugs like the one that transpired here. You are not testing new functionality, you are testing the compatibility of the new code with the production environment.

If there is anything to fix here, you need to create a UAT/pre-prod environment of some sort and then add a couple of days of UAT testing to your schedule before initiating a prod release.

At one of the companies, the UAT environment was such a good mirror of production that we utilized it as an active backup. If prod failed, the load balancer immediately redirected traffic to UAT, which became the new prod. Eventually prod-uat became an active-active set. We’d deploy the new version to UAT; if there were no issues, we’d point the load balancer to UAT and then upgrade the PROD environment. The old prod environment then becomes the UAT and the backup env. We’d rotate the environments each release: UAT turns to PROD, PROD to UAT, and so on.

I don’t know if something like this is possible with a blockchain. It would probably involve all the validators creating a fork, deploying the new version on it, running it for a day or two, and then killing the fork before upgrading the original chain.

2 Likes

Thank you for the thorough technical explanation and timeline, especially on the heels of running this incident! The retrospective itself is exemplary work, and all of the teams involved have done a fantastic job to investigate, resolve, and account for the issue that triggered the chain halt. :heart: :shield: :star2:

2 Likes

First of all, thanks for the detailed post-mortem overview, it seems really thorough and I agree with all of what was said here.

A few thoughts from a validator point of view:

  • IMO the solution to basically ignore the error if it happens seems like a workaround rather than a proper solution, and I also voiced this in Discord, but a permanent fix has already been outlined in the post-mortem, which I appreciate a lot
  • sometimes I had a feeling that the validator set had to wait for the team to do stuff without knowing the exact timing. I recall the GitLab case, when they accidentally dropped the production database: they made a public post-mortem with all of the details and what they are doing to prevent this from happening in the future (also mentioned in @lexa ’s post), but one cool thing they did was hold a livestream while they restored the data and basically answered the questions in chat. Not sure about that, but maybe have a Zoom call or something with the devs who are trying to fix the issue and validators who can provide support, as some of us are also programmers and can help (but also so they’d be aware of what’s happening, be more informed, and be able to understand how much time is left before the fix can be deployed). This has its own downsides, like if it’s not moderated properly it’s gonna turn into a mess and do more harm than good, but I wonder if the idea is nice and what the team thinks of it
  • really appreciate the validator set, who’ve done an outstanding job waiting, reacting accordingly, and applying the fix in time, and the team, who put in an outstanding effort to fix the bug

Now onto how it was possible to miss this bug before and only find it on mainnet. I love the Swiss cheese model; basically, the idea is that there are a lot of places this bug could have been found and prevented, but somehow it went through all of these cheese holes undetected. I see three possible places for that: 1) automated testing, 2) manual testing (as in devnet), and 3) the public testnet (ISLE) (maybe I am missing some other places, let me know your thoughts on that).

For 1), obviously there’s a need for automated tests (ideally as CI checks) for these edge cases. Maybe even have a code coverage metric to ensure all of the lines of code are covered by tests? This might not be an ideal solution, but it’s better than ending up in a situation where there are no tests for a specific thing that causes a chain halt and nobody is aware of the gap.

For 2), I’m not sure about the solution; probably having a checklist of what to check before pushing a release to a public testnet, and adding such edge cases there?

For 3), I wrote a few things in the incentivised testnet channel. TLDR: it would be nice if there were some incentives for validators who test edge cases and basically make sure everything is working as expected. The idea is the following: currently, TIP and ISLE are all about following rules (like not getting jailed, voting on proposals, upgrading, etc.), but that won’t help in finding these edge cases and the bugs hiding in them, and validators probably won’t go the extra mile if it is not incentivised. I suggest it can be done this way: have, let’s say, 5 days when you have to follow the rules and, let’s say, 2 days when you can do whatever you want without losing your points, but if you manage to find something that is not working as intended or something like that, you get incentivised (bonus points if you can thoroughly describe it, provide as many details as possible, and maybe even fix it yourself).

One more thing I’d like to add (maybe it’s something really basic that everybody knows, but it’s never bad to highlight it) is that when reading a post-mortem or providing feedback it’s really important to avoid pointing fingers and blaming others; instead it’s all about how we as the community can make Hub a more stable and secure place and make sure it will never halt again.

Cheers!

4 Likes

Thank you for providing an excellent post-mortem. This issue, initially identified on the Cosmos Hub chain, is something that could potentially occur on other chains utilizing the Cosmos SDK. We believe it’s crucial to thoroughly investigate potential corner cases within the cosmos-sdk and devise appropriate solutions.
What are your thoughts on establishing a contribution group to collectively conduct source code reviews and add test cases? We can definitely take this opportunity to enhance our processes.

  • Please note that this is my personal opinion and does not reflect the views of DSRV.
1 Like

Exemplary execution. Thank you very much for the rapid and careful response and the clarity about the problem and solution.

1 Like