Vulnerability Coordination Update: CosmosSDK Security Advisory 05-30-2019
On Tuesday, May 28th, a high-severity security vulnerability in the staking module of the CosmosSDK was reported to the Tendermint team through security@tendermint.com. The issue was reported at 21:57 UTC, with triage beginning immediately among core development teams who were operating in North America, Europe, and Australia.
The Vulnerability
The Cosmos economic security model depends on a set of related concepts: Bonding, Unbonding, Delegation and Redelegation. As a way to strike a balance between decentralization and fast block times, we also only support a fixed number of validators in the Cosmos Hub and allow any Atom holder to participate through delegation.
One of the key components of this security model is that once Atoms have been bonded, they cannot be unbonded for 21 days. In addition, however, we wanted to improve the quality of life for delegators who took a risk and delegated to a validator that never made it into the active set of validators (ranked by delegated stake) by allowing them to unbond instantly.
The vulnerability that was reported demonstrated a flaw in the code that managed the combination of redelegation, instant unbonding, and the handling of validators who are out of the voting set. Simply put, this bug created an opportunity for validators to misbehave and break the rules without having to deal with any consequences.
In the current Gaia state machine, when a bonded stakeholder wants to undelegate their staked ATOMs from a bonded validator, they must wait the full unbonding period before their ATOMs are liquid again. If the delegator wants to undelegate their staked ATOMs from an unbonding validator, they must wait until the validator completes unbonding before their ATOMs are liquid again. However, if the delegator wants to undelegate from an already unbonded validator, the undelegated amount is immediately liquid.
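The three cases above can be summarized in a small, self-contained Go sketch. The names used here (`Status`, `Validator`, `Remaining`, `waitBeforeLiquid`) are hypothetical and are not the CosmosSDK API; only the three rules and the 21-day unbonding period come from the text.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a simplified stand-in for a validator's bond status.
type Status int

const (
	Unbonded Status = iota
	Unbonding
	Bonded
)

// unbondingPeriod is the 21-day period described in the text.
const unbondingPeriod = 21 * 24 * time.Hour

// Validator is a hypothetical, minimal model of a validator.
type Validator struct {
	Status    Status
	Remaining time.Duration // time left until an Unbonding validator finishes unbonding
}

// waitBeforeLiquid returns how long undelegated ATOMs stay locked,
// following the three rules described above.
func waitBeforeLiquid(v Validator) time.Duration {
	switch v.Status {
	case Bonded:
		return unbondingPeriod // full unbonding period
	case Unbonding:
		return v.Remaining // until the validator completes unbonding
	default: // Unbonded
		return 0 // immediately liquid
	}
}

func main() {
	fmt.Println(waitBeforeLiquid(Validator{Status: Bonded}))                               // 504h0m0s
	fmt.Println(waitBeforeLiquid(Validator{Status: Unbonding, Remaining: 48 * time.Hour})) // 48h0m0s
	fmt.Println(waitBeforeLiquid(Validator{Status: Unbonded}))                             // 0s
}
```

The zero wait in the last case is the convenience feature for delegators that the rest of this section shows being abused.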
Given these conditions, a delegator (or even a validator) could have bypassed the full unbonding period when undelegating from a bonded validator and had their funds become liquid immediately, essentially insta-unbonding, by following these steps:
- Redelegate from a bonded validator to an unbonded validator
- In the same or a subsequent tx, unbond from the unbonded validator
As a result, the delegator would immediately receive their staked ATOMs without being slashed. The funds are immediately liquid because, when a MsgUndelegate is processed, the staking keeper’s Undelegate is called, which internally calls getBeginInfo:
func (k Keeper) Undelegate(
    ctx sdk.Context, delAddr sdk.AccAddress, valAddr sdk.ValAddress, sharesAmount sdk.Dec,
) (completionTime time.Time, sdkErr sdk.Error) {
    // create the unbonding delegation
    completionTime, height, completeNow := k.getBeginInfo(ctx, valAddr)
    // completeNow is true and completionTime is the zero value of time.Time
    // ...
}

// This function is called by Undelegate and BeginRedelegation.
// get info for begin functions: completionTime and CreationHeight
func (k Keeper) getBeginInfo(ctx sdk.Context, valSrcAddr sdk.ValAddress) (
    completionTime time.Time, height int64, completeNow bool) {
    validator, found := k.GetValidator(ctx, valSrcAddr)
    switch {
    case !found || validator.Status == sdk.Bonded: // ...
    case validator.Status == sdk.Unbonded: // in case of BeginRedelegation?
        // completionTime and height are still their zero values here
        return completionTime, height, true
    case validator.Status == sdk.Unbonding: // in case of BeginRedelegation?
    default: // ...
    }
}
Now, since completeNow is true, the following is executed in Undelegate:
func (k Keeper) Undelegate(
    ctx sdk.Context, delAddr sdk.AccAddress, valAddr sdk.ValAddress, sharesAmount sdk.Dec,
) (completionTime time.Time, sdkErr sdk.Error) {
    // create the unbonding delegation
    completionTime, height, completeNow := k.getBeginInfo(ctx, valAddr)
    returnAmount, err := k.unbond(ctx, delAddr, valAddr, sharesAmount)
    if err != nil {
        return completionTime, err
    }
    balance := sdk.NewCoin(k.BondDenom(ctx), returnAmount)
    // no need to create the ubd object just complete now
    // THIS BLOCK WILL BE EXECUTED
    if completeNow {
        if !balance.IsZero() {
            if _, err := k.bankKeeper.UndelegateCoins(...); err != nil {
                return completionTime, err
            }
        }
        // NO UNBONDING OBJECT IS CREATED
        return completionTime, nil
    }
    // ...
}
Simply put, a delegator could have redelegated to a validator that was out of the voting set and then immediately unbonded. Though exploiting this logic would have required creating a custom transaction, an attacker could have used the flaw to their advantage with just one transaction. This bug also allowed stake to be unbonded instantaneously: Atom holders could vote in governance and instantly unbond, or a validator could quickly unbond their self-bond and then double sign to attack their delegators.
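The two-step bypass can be reproduced in a toy model. This is a minimal sketch, not the Gaia state machine: `Chain`, `newChain`, `Redelegate`, and the validator names are hypothetical, and the model deliberately omits slashing, shares, and input validation. It only shows why moving stake to an unbonded validator first turns a 21-day wait into no wait at all.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a simplified stand-in for a validator's bond status.
type Status int

const (
	Unbonded Status = iota
	Bonded
)

const unbondingPeriod = 21 * 24 * time.Hour

// Chain is a hypothetical toy ledger for a single delegator.
type Chain struct {
	status map[string]Status // validator name -> status
	staked map[string]int64  // validator name -> delegated amount
	liquid int64             // spendable balance of the delegator
}

func newChain() *Chain {
	return &Chain{
		status: map[string]Status{"bondedVal": Bonded, "unbondedVal": Unbonded},
		staked: map[string]int64{"bondedVal": 100},
	}
}

// Redelegate moves stake between validators. As in the flaw described above,
// the toy Undelegate below does not remember that the stake came from a
// bonded validator.
func (c *Chain) Redelegate(src, dst string, amt int64) {
	c.staked[src] -= amt
	c.staked[dst] += amt
}

// Undelegate returns how long the delegator must wait. Stake held with an
// unbonded validator is released immediately; the delayed release for the
// bonded case is not modeled.
func (c *Chain) Undelegate(val string, amt int64) time.Duration {
	c.staked[val] -= amt
	if c.status[val] == Unbonded {
		c.liquid += amt
		return 0
	}
	return unbondingPeriod
}

func main() {
	direct := newChain()
	fmt.Println("direct undelegate wait:", direct.Undelegate("bondedVal", 100))

	bypass := newChain()
	bypass.Redelegate("bondedVal", "unbondedVal", 100)
	wait := bypass.Undelegate("unbondedVal", 100)
	fmt.Println("redelegate then undelegate wait:", wait, "liquid:", bypass.liquid)
}
```

Running the sketch prints a 504-hour wait for the direct path and a zero wait with the full 100 units liquid for the redelegate-then-undelegate path.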
Investigation and Response
Within hours of the bug being reported, the triage team created an investigation tool to detect exploitation of the vulnerability. This tool enabled us to look back to the genesis of Cosmos Hub 2, which was initiated on April 22, 2019.
Within the first few hours of our response, our tooling detected 5 false positives and 4 separate incidents of exploitation. It appears that 2-3 of these events were the initial bug reporter testing out the bug. In these artifacts, exploitation is easy to identify by looking for instances where the unbonding time was miscomputed as 0001-01-01T00:00:00Z.
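The timestamp 0001-01-01T00:00:00Z is the zero value of Go's time.Time, which is why it is such a recognizable artifact. The sketch below shows one way such a check could look; it is not the triage team's investigation tool, and `looksExploited` is a hypothetical helper that assumes completion times are available as RFC 3339 strings.

```go
package main

import (
	"fmt"
	"time"
)

// looksExploited reports whether an unbonding completion timestamp is the
// zero value of time.Time, the artifact described above.
func looksExploited(completion string) (bool, error) {
	t, err := time.Parse(time.RFC3339, completion)
	if err != nil {
		return false, err
	}
	return t.IsZero(), nil
}

func main() {
	samples := []string{
		"0001-01-01T00:00:00Z", // zero time: suspicious
		"2019-06-18T21:57:00Z", // ordinary completion time
	}
	for _, ts := range samples {
		hit, err := looksExploited(ts)
		fmt.Println(ts, hit, err)
	}
}
```

A zero completion time is only one signal, so a check like this would still need manual review of each hit.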
Within the first 24 hours of receiving the bug report, our tooling detected ~22 events total, with several more false positives showing up in the data. After further review, it was determined that false positives were the product of the detection tooling’s broad search logic.
In this particular case, it is likely that a bug collision existed, and that multiple parties were aware of this security vulnerability existing before it was reported. While the existence of a security vulnerability is not an incident, the exploitation of a vulnerability is: once we confirmed evidence of exploitation beyond the initial proof of concept sent by the bug reporter, this became an active incident that was closed when the Cosmos Hub validators successfully updated the network.
Remediation
After validating the issue, members of the core development team that work on the CosmosSDK began discussing technical solutions for the issue that was reported. Among those options, the team discussed enforcing unbonding periods, tracking sources of funds during unbonding, and not allowing redelegations to unbonded validators. Additionally, the team had a stated preference for making the migration version upgradeable without a state dump and restart.
While several avenues for patching the vulnerability exist, the core development team favored simplicity in code over a string of complex if/then statements. After writing, reviewing, and testing the code comprising the patch, the CosmosSDK team made the patch available in v 0.34.6 of the CosmosSDK.
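To make two of the options discussed above more concrete, here is a hypothetical Go sketch of what "enforcing unbonding periods" and "not allowing redelegations to unbonded validators" could look like in isolation. It is an illustration only: the names `enforcedWait` and `checkRedelegation` are invented for this example, and the sketch does not reproduce the patch shipped in v 0.34.6.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Status is a simplified stand-in for a validator's bond status.
type Status int

const (
	Unbonded Status = iota
	Unbonding
	Bonded
)

const unbondingPeriod = 21 * 24 * time.Hour

// enforcedWait sketches the first option: always require the full
// unbonding period, whatever the validator's status.
func enforcedWait(_ Status) time.Duration {
	return unbondingPeriod
}

var errUnbondedDestination = errors.New("redelegation to an unbonded validator is not allowed")

// checkRedelegation sketches the other option: refuse redelegations whose
// destination validator is unbonded.
func checkRedelegation(dst Status) error {
	if dst == Unbonded {
		return errUnbondedDestination
	}
	return nil
}

func main() {
	fmt.Println(enforcedWait(Unbonded))      // 504h0m0s
	fmt.Println(checkRedelegation(Unbonded)) // redelegation to an unbonded validator is not allowed
	fmt.Println(checkRedelegation(Bonded))   // <nil>
}
```

Either rule on its own closes the redelegate-then-undelegate path in a toy model; the first is the simpler of the two because it removes the special case instead of adding a new check.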
Vulnerability Coordination
In parallel with the technical remediation work for the issue, the triage team kicked off our internal process for vulnerability notification.
In the case of high and critical severity bugs, we are committed to providing pre-notification of security vulnerabilities 24 hours before publishing a public advisory to organizations, projects, and entities dependent on code we have developed. As part of our round of pre-notification, we proactively reached out to several projects, the validator community, and exchanges that would be impacted by the issue to let them know that a critical issue had been reported, and that they should prepare for a patch and coordination plans to take place within the next 24 hours. Because we are committed to providing the safest, most up-to-date version of our software to everyone, we actively chose to make the patch for the issue available to everyone in the ecosystem at once.
As part of the coordination process, an ad-hoc, ephemeral communication channel was created to allow for confidential coordination and discussion of the vulnerability with validators. In this discussion, targets for a block height and time to apply the security update were chosen, and validators additionally chose to use the governance module on the Cosmos Hub to pass a governance proposal as a means to coordinate a hard fork for patching the vulnerability.
Within 24 hours of pre-notifying impacted parties that we would be releasing a security update for the CosmosSDK, we published a security advisory on the Cosmos Forum. The security vulnerability was fully remediated in v 0.34.6 of the Cosmos SDK when on Friday, May 31, the Cosmos Hub validators successfully upgraded the network to run on the patched version of the software.
Retrospective
For security to be successful, it must be a shared responsibility across the entire Cosmos ecosystem. More than anything else, we are extremely proud of the quick coordination between the bug reporter, the triage team, core developers, impacted organizations, and the Cosmos Hub validators that resolved this issue.
Though we were able to successfully collaborate to resolve the first security vulnerability on the Cosmos mainnet, we continuously strive to improve our processes to enable us to better coordinate vulnerability disclosure in the future. In this case, the triage team has identified several areas where action from the community would help us improve coordination in the future.
- To improve our ability to proactively notify the validator community of a security issue 24 hours in advance of publishing an advisory, a dedicated communication channel to share information is needed.
  * In this case, we recommend that Cosmos Hub validators set up a dedicated email (security@) to receive security advisories.
  * On our end, we will also be working to add a parameter field for validators to share their dedicated security address as a way to opt-in to confidential vulnerability coordination in the future.
- To improve our ability to proactively notify organizations of security issues, it would be invaluable for more exchanges, development teams, and organizations that may be impacted by security bugs to set up and publish a security@ email as well. This would improve our ability to directly pre-notify impacted stakeholders, and ensure that we’re not having to go through customer support channels to coordinate vulnerability remediation for critical issues.
- To improve our overall incident response capabilities with network operators, we will be exploring solutions to better coordinate and share intelligence with the Cosmos Hub validators.
  * A proposal with working requirements for response will be published on the Cosmos forum within the next 1-2 days about standing up a CosmosCERT, a vulnerability and incident coordination team comprised of core developers from All in Bits, the Interchain Foundation, and Cosmos Hub validators.
An additional retrospective of this update from Figment Networks is available here.
Special thanks to Zaki Manian, Jack Zampolin, and Alex Bezbochuk for their contributions to this post.