Security advisory ICS-2024-002 and Gaia v19.2.0 patch retrospective

A security advisory has been made public here with a description of the bug and severity assessment performed by the ICS team and me. Please review this information as context for the patch process, timeline, and feedback received throughout this upgrade.

Patch process

Based on our severity assessment, Informal and Hypha decided on the following process:

  1. Soft patch in a 6-hour window that covers most timezones (09:00 UTC - 15:00 UTC)

    • Have >⅓ of the Hub’s active set apply a privately released security patch to mitigate the impact of an attack.
    • Once >⅓ have applied the patch, the chain will halt due to consensus errors if any of the four vulnerable message types are sent.
    • To recover from a chain halt, the remaining ⅔ of the active set will need to apply the same patch ASAP.
  2. Coordinated upgrade on Sep-05 at 15:00 UTC

    • The Hub’s coordinated upgrade process for publicly released binaries is reliable and operators know how to do it.
    • If the chain halts, ⅓+ of the set has already upgraded and is running the canonical version of the chain. The dev team is primed to roll back a problematic transaction and instruct validators on next steps.

Soft patch and binary information was communicated through Discord and Telegram channels. We also used a Google Form to confirm upgrade status and collect feedback from node operators about how this patch was being distributed and communicated (see Feedback section).

Timeline

2024-Sep-02 (Labour Day in North America)

  • 13:30 UTC: The Informal dev team informs representatives from Hypha and Amulet about a critical bug in ICS that is live in production as of the v19 upgrade (Aug-21).
  • 14:30 UTC: Informal and Hypha decide to move forward with a soft patch distributed privately to 40% of the validator set on 2024-Sep-04, followed by a coordinated upgrade of the entire set.
  • 18:30 UTC: Informal and Hypha assemble a list of responsive validators comprising 40% of the active set. Hypha finalizes copy for messages.
  • 20:00 UTC: Informal and Hypha begin sending messages to validators asking them to confirm their availability during a 6-hour window on Sep-04 to apply a patch.

2024-Sep-03

  • 7:30 UTC: >33.3% of the active set has confirmed that they will be available during the requested 6-hour window.
  • 9:30 UTC: Patched binaries are privately built and tested.
  • 19:00 UTC: Hypha finalizes copy for messages to distribute binary and communicate next steps.

2024-Sep-04

  • 8:00 UTC: Permissions for the privately released binary are set to allow chosen validators to download it.
  • 8:30 UTC: Binaries are distributed to ~40% of the active set via Telegram and Discord.
  • 13:30 UTC: >33.3% of the active set has upgraded and the vulnerability has been mitigated.
  • 13:45 UTC: Public security advisory is created and a public release is cut.
  • 17:30 UTC: Email is sent to all Hub validators announcing a coordinated upgrade on 2024-Sep-05 at 15:00 UTC. Discord announcement confirms this.
    • The email also tells validators that it is safe to immediately upgrade to v19.2.0 if they wish.
    • Some validators begin upgrading immediately based on conversation in Discord.

2024-Sep-05

  • 15:22 UTC: Coordinated upgrade height arrives and Hub validators are expected to upgrade.
  • 17:40 UTC: ICS security advisory is made public.

Feedback from early ⅓ validators

The timeline for the patch was good and clear.

There was plenty of time to confirm that operators could be available to apply the patch, and the 6-hour window comfortably covered several timezones.

Consider using a Google sheet for validators to self-identify that they’ve upgraded.

Other chains use a Google sheet where validators can mark themselves as having completed the upgrade and see the voting power tally up. This is feedback we received after the last emergency upgrade.

To reduce the risk of collusion, we chose to keep the list of upgraded validators private and collect it via a Google form. The voting power tally was available to the dev and comms team in the background, which is how we tracked when the chain had passed the ⅓ threshold.
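As a rough illustration of the kind of tally involved, here is a minimal Go sketch under stated assumptions: the validator names, powers, and confirmations below are placeholders, whereas the real numbers came from on-chain voting power and the Google Form responses.

```go
package main

import "fmt"

func main() {
	// Placeholder voting powers; in practice these come from the chain.
	votingPower := map[string]int64{
		"validator-a": 1_200_000,
		"validator-b": 800_000,
		"validator-c": 500_000,
	}
	// Placeholder confirmations; in practice these come from the form responses.
	confirmed := []string{"validator-a", "validator-c"}

	var total, done int64
	for _, p := range votingPower {
		total += p
	}
	for _, v := range confirmed {
		done += votingPower[v]
	}
	fmt.Printf("confirmed %.1f%% of voting power (need >33.3%% to mitigate)\n",
		100*float64(done)/float64(total))
}
```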

Operators want to see the code when applying privately communicated upgrades.

Depending on the severity of the issue and the security of the comms channel, this may or may not be possible. Currently, our comms go through Telegram and Discord channels, which are not secure enough for sharing code involved in a live exploit.

Communication security.

A lot of the feedback we received was about the security of communication and distribution of the binary. We got many suggestions for how to address this, and most of them are blocked on having secure email addresses for Hub validators.

  • Set up a private channel for communication
  • Use signed emails
  • Use a private git repo

Any private channel or repository requires that validator teams are highly responsive to adding and removing their node operators when people join or leave the team, or when the validator moves in or out of the active set. These are operations that are often neglected, even in active Telegram groups where we currently communicate.

Some actionable suggestions and work that is currently in progress (but not completed in time to be used for this patch):

  • Send emails to validators’ security contact address listed on-chain. Not all validators have filled in this field; it can be set with gaiad tx staking edit-validator --security-contact <email> --from <key> (placeholders shown in angle brackets).
  • Use signed emails to prove that the sender is trustworthy. We have not set up keys and communicated with validators to do this yet, but it’s something we’re aware of.
  • Create a hash of the security patch message (i.e., the messages distributed in Telegram and Discord) and display the hash on an official account, such as Informal’s Github or Twitter, to verify that the message is legitimate (a minimal sketch follows this list).
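For the hash suggestion above, any standard hashing tool would do; here is the same idea as a minimal Go sketch (the filename is hypothetical):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	// Hash the exact announcement text sent in Telegram/Discord so validators
	// can compare it against the hash posted on an official public account.
	msg, err := os.ReadFile("patch-announcement.txt") // hypothetical filename
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	sum := sha256.Sum256(msg)
	fmt.Println(hex.EncodeToString(sum[:]))
}
```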

Feedback from later ⅔ validators

Governance.

Validators typically prefer governance-gated upgrades. Given the severity of this issue, a governance upgrade was not possible. This feedback was received in Discord long before the security advisory was released.

Be clearer about what “consensus breaking” means.

For developers, “consensus breaking” generally means that a consensus break is possible but not guaranteed. For example, this patch is “consensus breaking” but only if one of the four vulnerable message types is sent on-chain.

Feedback from validators is that “consensus breaking” means that a consensus issue is guaranteed to occur. For example, a change to CometBFT is a guaranteed consensus break because it is a change to how validators compute consensus in every single block. A patch like the one in gaia v19.2.0 is not “consensus breaking” by validator standards.

For live security issues, there’s a limit to how much detail can be provided about what will break consensus, because those details could lead someone to identify the vulnerability. This leads into the next piece of feedback:

Deciding and communicating about a coordinated upgrade vs a freely released patch.

The dev/comms team thinks about how likely a consensus break is when deciding to do a coordinated upgrade vs telling validators to upgrade at their own pace.

The goal in this case was to have a coordinated upgrade to provide confidence that the full validator set had upgraded, since we do not have monitoring tools that provide this information. However, the way we communicated about the coordinated upgrade included the idea that it was safe to apply the upgrade before the halt-height, which caused confusion amongst the validator set and made it hard to tell when we’d passed the ⅔+ threshold.

There are two major thoughts here:

When a patch is only potentially consensus-breaking, the dev/comms team needs to make a decision and be clear about it. In this case, we were unclear about the decision and it caused confusion. Whether the correct decision was a coordinated upgrade or a freely released patch is hard to say, but either option is manageable as long as the communication is clear.

The dev/comms team needs a way to confirm that ⅔+ of the validator set has upgraded and the chain is no longer vulnerable to a chain halt. Several options have been suggested:

  • Thorchain uses a particular upgrade tx that can be tracked on chain
  • Polkadot supports forkless runtime upgrades, which let the chain upgrade without halting.
  • Use a coordinated upgrade and trust that Hub validators perform very well during upgrades.
  • Manually ask validators if they have upgraded after a freely released patch (this is what we’re currently doing – hi, you’ve probably received a DM from me)
  • Use the Google Sheet/Form to have all validators confirm their upgrade after a freely released patch (we did this for only ⅓ of the set this time)

The most immediate option is to use the Google Form for the entire set when we have a freely released patch.

Final thoughts

:+1: A high-severity issue in production was mitigated and then resolved with no chain halt. Coordination amongst the first ⅓+ of the validator set was quick and effective compared to past emergencies.

:-1: Confusion around how the full validator set should apply the patch. Still lacking a secure communication channel with any subset of the validator set.

:handshake: Much appreciation to Amulet for their guidance and advice throughout this issue.

:pray: Thank you, as always, to the validators who we count on to patch the chain in an emergency. Your reliability and quick actions keep the Hub safe no matter the severity of the issue.

:point_right: Next steps: Informal and Hypha (in consultation with Hub validators) to produce internal runbooks for analyzing Hub-specific security issues, including how to categorize an issue, which communication channels to use, how to protect against compromised accounts, and how to communicate about different kinds of emergencies.

Please contribute any feedback here. All of it is helpful as we refine the security response process on the Hub.

1 Like

Appreciate the detailed response.

One question: I used to work at a big tech company with a strong culture of incident reviews, and one of the most important parts was the so-called “action items”, i.e. the list of things being done so that the same issue would technically not be able to happen again. Does the team have any answer on how to prevent such bugs in the future?
Asking because, considering the bug description, it could have been really catastrophic if someone had actually been able to abuse it.

Communication-wise I don’t have anything bad to say, other than that it would have been better if the announcement had said we were safe to update at any time but that it was better to do it ASAP. I agree with the rest of what’s written in the initial post.

A bit off-topic, but I also wonder whether it would be possible to build this into the cosmos-sdk somehow, so there would be a reliable way to tell which validator is running which version and we wouldn’t need Google spreadsheets or similar. I’m raising it here because this is quite a common need in the Cosmos ecosystem and, unfortunately, it comes up quite often, so different teams have to deal with it, and Google spreadsheets tend to be unreliable: you have to manually mark yourself as upgraded, and someone may mark you as upgraded by mistake while your node is in fact still running the older binary. Can someone from Informal or the sdk developers comment on whether this is technically possible to do the way I suggest?

2 Likes

In reviewing the security issue, the procedure taken was acceptable.

The only source of confusion was the unusual instruction to use a halt-height when 33% had already upgraded.

A halt-height makes sense only if a halt until the estimated upgrade time is acceptable in the case of an early attack.

Eliminating the halt risk entirely would have required all remaining validators to upgrade immediately.

1 Like

Not off-topic at all, and this would make it much easier to keep track of upgrades. I’ve heard several technical ways of doing this discussed in the past, such as broadcasting the version in every signature (or maybe once per epoch nowadays?), but I honestly don’t remember why those ideas were dismissed. I recall someone suggesting that it would be a security issue if someone were able to detect that the chain is split between versions, but I’m not an expert in this. I’ll ask around.

The Google Sheet is not my preferred method, for the reasons you describe: it’s an open link and prone to human error. In the absence of a technical solution, the Google Form worked well enough; it’s a lot harder to accidentally type the wrong validator name than it is to tick the wrong checkbox. Here’s the form, for reference: https://forms.gle/pqpoREbYpCtbXmo1A

Does the team have any answer on how to prevent such bugs in the future?

We’re still in conversation about it, and I know that’s not a very satisfying answer (either to give or to receive), but it’s only been 4 days.

We’re talking about better awareness of API changes and of the details of how the stack technologies interact with Hub-specific technology. I would like to see more robust automated testing when we integrate stack updates, covering more than just happy paths, and a more robust set of tests in general. This sort of test coverage requires time and creativity to set up, and it often gets pushed to the bottom of the backlog because it doesn’t seem important until a bug like this occurs. Test-driven development (in which the code is written iteratively alongside small unit tests) is another idea, but I am not sure whether it is already being used in the dev process. My team typically receives an essentially completed piece of work, then runs local tests and puts it on the testnets.

I’ll see if any other team members have anything to add at this time.

1 Like

Thank you for the question and the engagement.

One approach that seems worth exploring is using vote extensions.

This is the general idea:
It seems possible to write a module that would utilize vote extensions to signal that an upgrade was completed by a validator.

Upon starting the patched binary, the validator node would add a short string specifying which version of the software it is currently running. In this case, the nodes would have reported v19.2.0.

We could use this information to establish the current rate of adoption of the new patched binary.
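For concreteness, here is a minimal sketch of what such a module could look like, assuming Cosmos SDK v0.50-style vote extension handlers; the package name and version constant are illustrative, and a complete module would also need verification and tallying logic:

```go
package upgradesignal

import (
	abci "github.com/cometbft/cometbft/abci/types"
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// Version string baked into the patched binary; v19.2.0 matches this incident.
const runningVersion = "v19.2.0"

// NewExtendVoteHandler attaches the node's software version to its precommit
// as a vote extension. A real module would also register a
// VerifyVoteExtensionHandler and proposal-side logic to tally the reported
// versions by voting power.
func NewExtendVoteHandler() sdk.ExtendVoteHandler {
	return func(ctx sdk.Context, req *abci.RequestExtendVote) (*abci.ResponseExtendVote, error) {
		return &abci.ResponseExtendVote{VoteExtension: []byte(runningVersion)}, nil
	}
}
```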

Limitations:
This would also signal to potential adversaries that nodes are upgrading, because the data is accessible to everyone. Mitigating this is possible, but it quickly turns a simple module into a more involved piece of code.

We are open to suggestions if you have other ideas.

EDIT: I will make another post to answer other questions.

1 Like

Actually seems like a nice solution. I’d love to see an ADR on that.
Thanks for elaborating!

As a suggestion, from my experience working with enterprise-level companies, a nice approach would be to hold a public meeting to debrief incidents like this.
Here’s how we did it: I worked at a company with around 20 different services, and every time there was a severe incident there was a public meeting where representatives of the affected service gave a report covering the incident timeline, how it was fixed, what went well, what went badly, and (most importantly) the action items (i.e. what can be done so this problem can never happen again). Then everyone was invited to suggest ideas for making things even more stable.
Actually, this thread is already something like that; my idea is to have a place where people can brainstorm ideas for making things more stable (it would be ideal if other chains’ developers and validators could also join, since all Cosmos chains share a lot of similarities and often face the same issues).

(Not sure whether this applies to Gaia/Cosmos development; just sharing a point of view on how it can be done.)