Overview
On June 5th, 2024, at 19:21 (UTC), the Cosmos Hub chain halted, and users could not execute any transactions. The incident occurred roughly two hours after the scheduled v17 software upgrade completed. The upgraded code contained a bug that is triggered when a validator leaves the active set of validators and another validator takes its place in the same block. The Informal Systems Cosmos Hub team was first informed about the incident by the Informal Staking team. The Informal Systems Cosmos Hub team, in coordination with Hypha and Binary Builders, provided the fix, and the chain resumed on June 6th, 2024, at 0:02 (UTC).
Timeline
Event | Block Height | Time (UTC) |
---|---|---|
Chain upgrade started | 20739800 | 16:58 (on June 5th, 2024) |
Chain upgrade completed | 20739802 | 17:15 |
Chain halts | 20740970 | 19:21 |
Informed (over Slack) by the Informal Staking team that the chain has halted | chain is halted | 19:46 |
Hypha (over Slack) confirms the error in their mainnet node as well | chain is halted | 19:47 |
Hypha, the Informal Systems Cosmos Hub team, and Binary Builders meet on Zoom to fix the issue | chain is halted | 20:06 |
Fixed the issue and cut Cosmos SDK v0.47.15-ics-lsm | chain is halted | 21:35 (from git show v0.47.15-ics-lsm) |
Cut Gaia v17.2.0 (using Cosmos SDK v0.47.15-ics-lsm) | chain is halted | 22:28 (from git show v17.2.0) |
Published the release binaries after running automated tests | chain is halted | 22:57 (see the action) |
Chain resumes | 20740972 | 0:02 (on June 6th, 2024) |
The total time the chain was halted was around 4 hours and 40 minutes.
The above timeline does not include the communication steps taken (over Discord) to update validators on what happened and to tell them when to switch to the newly released v17.2.0.
Issue
The Cosmos-SDK staking module has “hooks” that allow third-party modules to run code at specific points within the execution of a block. The ICS provider module runs logic in the AfterUnbondingInitiatedHook hook, which is called during EndBlock when the staking module computes updates to the validator set. Under certain circumstances, the state of the staking module is inconsistent when this hook executes third-party code. Specifically, when a validator is added to the active set and another validator is removed from it in the same block, the number of validators can temporarily exceed the MaxValidators parameter.
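To make the hook mechanism concrete, here is a minimal sketch of how a module can let other modules run code in the middle of one of its operations. The types below (Keeper, StakingHooks, providerHooks) are simplified stand-ins written for this post, not the actual Cosmos-SDK or ICS definitions; in particular, the real hook also receives a context argument.

```go
package main

import "fmt"

// StakingHooks is a simplified stand-in for the staking module's hook interface.
type StakingHooks interface {
	AfterUnbondingInitiated(id uint64) error
}

// Keeper is a toy staking keeper that holds the registered hooks.
type Keeper struct {
	hooks []StakingHooks
}

// AddHooks registers a third-party hook implementation.
func (k *Keeper) AddHooks(h StakingHooks) {
	k.hooks = append(k.hooks, h)
}

// beginUnbonding models a staking operation that invokes the hooks while
// the keeper may still be in the middle of updating its own state.
func (k *Keeper) beginUnbonding(id uint64) error {
	for _, h := range k.hooks {
		if err := h.AfterUnbondingInitiated(id); err != nil {
			return fmt.Errorf("hook failed for unbonding op %d: %w", id, err)
		}
	}
	return nil
}

// providerHooks models a third-party module reacting to the hook.
type providerHooks struct {
	seen []uint64
}

// AfterUnbondingInitiated records the unbonding operation id it was given.
func (p *providerHooks) AfterUnbondingInitiated(id uint64) error {
	p.seen = append(p.seen, id)
	return nil
}

func main() {
	p := &providerHooks{}
	k := &Keeper{}
	k.AddHooks(p)
	if err := k.beginUnbonding(7); err != nil {
		panic(err)
	}
	fmt.Println(p.seen)
}
```

The important property for this incident is that the hook body runs at whatever point the calling module invokes it, so any state the hook reads reflects a partially completed update.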
The ICS provider module code that caused this chain halt calls the function GetLastValidators. This function happens to include a sanity check that panics, and therefore halts the chain, if the number of validators exceeds the MaxValidators parameter. As a temporary fix for the chain halt, we changed the GetLastValidators function to truncate the list to the MaxValidators validators with the most power instead of panicking.
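The difference between the two behaviours can be sketched as follows. This is a simplified model, not the code in Cosmos SDK v0.47.15-ics-lsm: the Validator type and both function names are hypothetical, and the sketch assumes the truncation is done by sorting on power.

```go
package main

import (
	"fmt"
	"sort"
)

// Validator is a toy record holding only what the sketch needs.
type Validator struct {
	Name  string
	Power int64
}

// getLastValidatorsStrict models the pre-fix behaviour: it panics when more
// than maxValidators entries are stored.
func getLastValidatorsStrict(stored []Validator, maxValidators int) []Validator {
	if len(stored) > maxValidators {
		panic(fmt.Sprintf("more validators than maxValidators found: %d > %d", len(stored), maxValidators))
	}
	return stored
}

// getLastValidatorsTruncated models the temporary fix: it keeps the
// maxValidators entries with the most power and never panics.
func getLastValidatorsTruncated(stored []Validator, maxValidators int) []Validator {
	out := make([]Validator, len(stored))
	copy(out, stored) // do not reorder the caller's slice
	sort.SliceStable(out, func(i, j int) bool { return out[i].Power > out[j].Power })
	if len(out) > maxValidators {
		out = out[:maxValidators]
	}
	return out
}

func main() {
	stored := []Validator{{"a", 30}, {"b", 10}, {"c", 20}}
	// With maxValidators = 2 the strict variant would panic here.
	fmt.Println(getLastValidatorsTruncated(stored, 2))
}
```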
This workaround does not risk corrupting the state of the ICS module because the list of validators is not used at this point and is merely computed as part of a convenience function. As a permanent fix for this issue, we will refactor the ICS provider code to remove the call to GetLastValidators at this point. This fix should also have a slight performance benefit.
Room for improvement
Although we routinely and robustly test code before we run an upgrade, sometimes bugs happen. With that said, there is always room for improvement. We identified several key areas in which we can improve.
- In both our automated randomized tests and the testnet, we did not set the MaxValidators parameter to a number lower than the number of validators on the chain. For this reason, the bug was never hit in tests. Test cases where we randomize all chain parameters would catch more edge cases like this, and we should have inactive validators in future testnet scenarios.
- The bug was triggered by a convenience function in our code, which retrieves several different pieces of data, not all of which are used by all callers of the function. In this case, the data retrieved by the query that triggered the bug was not even used where the bug was triggered. We have opened an issue to optimize the code to query only the necessary data at every point: Create lightweight version for GetAllConsumerChains (Issue #1943, cosmos/interchain-security on GitHub).
- The bug was triggered because the database is inconsistent in certain Cosmos-SDK hooks. This may be unavoidable since the intended use of hooks is to allow third-party modules to insert logic into the functioning of another module. We will work with the Cosmos-SDK team to understand which hooks may have an inconsistent state and whether they can be made consistent or if the inconsistency can be documented if it is unavoidable.
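As an illustration of the first point, a randomized test generator can be written so that MaxValidators is regularly drawn below the number of validators, which guarantees that some scenarios contain inactive validators. This is a hypothetical sketch; the Scenario type, the value ranges, and the function names are assumptions made for this example and are not taken from our test suite.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Scenario holds the two parameters relevant to this incident.
type Scenario struct {
	NumValidators int
	MaxValidators int
}

// String renders the scenario for test logs.
func (s Scenario) String() string {
	return fmt.Sprintf("validators=%d max=%d", s.NumValidators, s.MaxValidators)
}

// HasInactiveValidators reports whether some validators fall outside the active set.
func (s Scenario) HasInactiveValidators() bool {
	return s.MaxValidators < s.NumValidators
}

// randomScenario draws MaxValidators from a range that extends both below
// and above the validator count, so both situations get exercised.
func randomScenario(rng *rand.Rand) Scenario {
	n := 2 + rng.Intn(50)    // 2..51 validators
	max := 1 + rng.Intn(2*n) // 1..2n, may be smaller or larger than n
	return Scenario{NumValidators: n, MaxValidators: max}
}

// countInactive returns how many of the generated scenarios have inactive validators.
func countInactive(seed int64, runs int) int {
	rng := rand.New(rand.NewSource(seed))
	count := 0
	for i := 0; i < runs; i++ {
		if randomScenario(rng).HasInactiveValidators() {
			count++
		}
	}
	return count
}

func main() {
	fmt.Printf("%d of 1000 scenarios have inactive validators\n", countInactive(1, 1000))
}
```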
More details
This log (given to us by Hypha) provides the stack trace of what led to the halt. By looking at this log and at the code we could figure out the root cause of the issue.
The chain halted due to a panic in the GetLastValidators method. The method found more bonded validators than the maximum number of 180 bonded validators, and since GetLastValidators was called during an EndBlock, the chain halted.
To see why the panic was raised, we need to understand that validator addresses are stored in at least two different places in Cosmos SDK that use the following keys:
- LastValidatorPowerKey: “supposed” to store all the bonded validators;
- ValidatorsByPowerIndexKey: all validators are stored under this key.
GetLastValidators returns all the bonded validators and hence iterates over the LastValidatorPowerKey. When the staking module computes the validator updates (in ApplyAndReturnValidatorSetUpdates) to be sent to CometBFT, it iterates through all the validators up to the maximum number of validators, so it iterates over the ValidatorsByPowerIndexKey. For each validator, it checks whether this validator was previously a bonded validator, i.e., whether it is found in the last map (the last map is created by iterating over LastValidatorPowerKey and hence contains the bonded validators). If the validator is not found, it is stored under the LastValidatorPowerKey, so it is now considered a bonded validator. Note that a validator not being found means that a validator that was not bonded, and so not part of the active set, is about to join the active set. At this point, we have more than 180 validators stored under the LastValidatorPowerKey key, so a call to GetLastValidators would iterate over more than 180 validators and panic. Afterwards, validators that are no longer part of the active set are deleted, and the number of bonded validators stored under LastValidatorPowerKey is 180 again. However, between the moment a new validator joins and the moment the departing validator is deleted, GetLastValidators can panic. This is precisely what happened, because during this window we call bondedToUnbonding, which ends up calling GetLastValidators.
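The window described above can be reproduced with a toy model of the update loop. The sketch below is not the Cosmos SDK implementation: applyValidatorSetUpdates, the Validator type, and the onUnbond callback are hypothetical simplifications, with a slice standing in for the ValidatorsByPowerIndexKey index and a map standing in for the LastValidatorPowerKey store.

```go
package main

import (
	"fmt"
	"sort"
)

// Validator is a toy record holding only what the sketch needs.
type Validator struct {
	Name  string
	Power int64
}

// String renders a validator for log output.
func (v Validator) String() string {
	return fmt.Sprintf("%s(%d)", v.Name, v.Power)
}

// applyValidatorSetUpdates is a toy model of the flow described above.
// byPower must be sorted by descending power. onUnbond stands in for the
// hook reached through bondedToUnbonding and receives the number of
// entries currently held in lastPower.
func applyValidatorSetUpdates(byPower []Validator, lastPower map[string]int64, maxValidators int, onUnbond func(stored int)) {
	// Snapshot of the previously bonded validators (the "last" map).
	last := make(map[string]bool, len(lastPower))
	for name := range lastPower {
		last[name] = true
	}
	// Walk the top maxValidators validators by power. This toy version
	// writes every entry; what matters is that new entrants are added here.
	for i := 0; i < len(byPower) && i < maxValidators; i++ {
		v := byPower[i]
		lastPower[v.Name] = v.Power
		delete(last, v.Name)
	}
	// Whatever remains in last has dropped out of the active set.
	leaving := make([]string, 0, len(last))
	for name := range last {
		leaving = append(leaving, name)
	}
	sort.Strings(leaving)
	for _, name := range leaving {
		onUnbond(len(lastPower)) // the hook runs before the stale entry is deleted
		delete(lastPower, name)
	}
}

func main() {
	const maxValidators = 2
	lastPower := map[string]int64{"a": 30, "b": 20}
	// c has just become eligible with more power than b.
	byPower := []Validator{{"a", 30}, {"c", 25}, {"b", 20}}
	applyValidatorSetUpdates(byPower, lastPower, maxValidators, func(stored int) {
		fmt.Printf("hook sees %d stored validators (max %d)\n", stored, maxValidators)
	})
	fmt.Printf("after the update: %d stored validators\n", len(lastPower))
}
```

With a maximum of 2, the callback observes 3 stored validators while the update is in progress, and the store is back to 2 once the update finishes. On the Hub the same pattern produced 181 entries against a maximum of 180.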
More specifically, the incident occurred because a validator issued a MsgUnjail transaction that resulted in this validator joining the active set. The unjail calls unjailValidator, which in turn calls SetValidatorByPowerIndex. This stores the validator under the types.GetValidatorsByPowerIndexKey(validator, k.PowerReduction(ctx)) key, namely under the ValidatorsByPowerIndexKey. ValidatorsPowerStoreIterator iterates through the store under this key, so when ApplyAndReturnValidatorSetUpdates goes through all the validators, the just-unjailed validator is not found in the last map (it was jailed, and therefore not bonded, in the previous block), and SetLastValidatorPower is called for it. At this point, from the LastValidatorPowerKey perspective, we have 181 validators, because the validator that will be removed from the active set has not yet been removed with DeleteLastValidatorPower. Afterwards, when we go through the remaining validators, we call bondedToUnbonding, which calls BeginUnbondingValidator, which calls the AfterUnbondingInitiatedHook hook, which finally calls GetAllConsumerChains. GetAllConsumerChains calls GetLastValidators while we still have 181 validators, so GetLastValidators panics.
Note that the incident could have been triggered not only by an unjailing operation but by any change to the active set of 180 validators in which one validator comes in and another goes out.
Finally, note that although we call GetLastValidators in multiple other places (e.g., in QueueVSCPackets, which is called in EndBlockVSU and therefore in an EndBlock), the panic never triggers there. This is because those calls to GetLastValidators are not made during ApplyAndReturnValidatorSetUpdates, which is the only time we might have slightly more than 180 validators stored.