A security advisory has been made public here with a description of the bug and a severity assessment performed by the ICS team and me. Please review it as context for the patch process, timeline, and feedback received throughout this upgrade.
Patch process
Based on our severity assessment, Informal and Hypha decided on the following process:
- Soft patch in a 6 hour window that covers most timezones (9:00 UTC - 15:00 UTC)
- Have >⅓ of the Hub’s active set apply a privately-released security patch to mitigate the impact of an attack.
- Once >⅓ have applied the patch, the chain will halt due to consensus errors if any of the four vulnerable message types are sent (see the sketch after this list).
- To recover from a chain halt, the remaining ⅔ of the active set will need to apply the same patch ASAP.
- Coordinated upgrade on Sep-05 at 15:00 UTC
- The Hub’s coordinated upgrade process for publicly released binaries is reliable and operators know how to do it.
- If the chain halts, ⅓+ of the set has already upgraded and is running the canonical version of the chain. The dev team is primed to roll back a problematic transaction and instruct validators on next steps.
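To illustrate the mechanism behind the soft patch, here is a minimal sketch, assuming hypothetical message names and a simplified handler rather than the actual gaia code: patched nodes reject the vulnerable message types outright, so patched and unpatched nodes disagree on any block that contains one, and once >⅓ of voting power runs the patch such a block can no longer be finalized.

```go
package main

import (
	"errors"
	"fmt"
)

// vulnerableMsgTypes stands in for the four affected message type URLs
// (hypothetical names, not the real ones).
var vulnerableMsgTypes = map[string]bool{
	"/hypothetical.provider.MsgA": true,
	"/hypothetical.provider.MsgB": true,
	"/hypothetical.provider.MsgC": true,
	"/hypothetical.provider.MsgD": true,
}

// handleMsg shows what a patched node would do before normal processing:
// refuse the message where an unpatched node would execute it. Because the
// two groups compute different results for the same block, a block containing
// such a message cannot gather ⅔+ agreement once >⅓ of voting power runs the
// patch, and the chain halts instead of executing the exploit.
func handleMsg(typeURL string) error {
	if vulnerableMsgTypes[typeURL] {
		return errors.New("message type disabled by security patch")
	}
	// ... normal message handling would continue here ...
	return nil
}

func main() {
	if err := handleMsg("/hypothetical.provider.MsgA"); err != nil {
		fmt.Println("patched node rejects:", err)
	}
}
```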
Soft patch and binary information was communicated through Discord and Telegram channels. We also used a Google Form to confirm upgrade status and collect feedback from node operators about how this patch was being distributed and communicated (see Feedback section).
Timeline
2024-Sep-02 (Labour Day in North America)
- 13:30 UTC: The Informal dev team informs representatives from Hypha and Amulet about a critical bug in ICS that is live in production as of the v19 upgrade (Aug-21).
- 14:30 UTC: Informal and Hypha decide to move forward with a soft patch distributed privately to 40% of the validator set on 2024-Sep-04, followed by a coordinated upgrade of the entire set.
- 18:30 UTC: Informal and Hypha assemble a list of responsive validators comprising 40% of the active set. Hypha finalizes copy for messages.
- 20:00 UTC: Informal and Hypha begin sending messages to validators asking them to confirm their availability during a 6 hour window on Sep-04 to apply a patch.
2024-Sep-03
- 7:30 UTC: >33.3% of the active set has confirmed that they will be available in the 6 hour window requested.
- 9:30 UTC: Patched binaries are privately built and tested.
- 19:00 UTC: Hypha finalizes copy for messages to distribute binary and communicate next steps.
2024-Sep-04
- 8:00 UTC: Permissions for the privately released binary are set to allow chosen validators to download it.
- 8:30 UTC: Binaries are distributed to ~40% of the active set via Telegram and Discord.
- 13:30 UTC: >33.3% of the active set has upgraded and the vulnerability has been mitigated.
- 13:45 UTC: The security advisory is drafted and a public release is cut.
- 17:30 UTC: Email is sent to all Hub validators announcing a coordinated upgrade on 2024-Sep-05 at 15:00 UTC. Discord announcement confirms this.
- The email also tells validators that it is safe to immediately upgrade to v19.2.0 if they wish.
- Some validators begin upgrading immediately based on conversation in Discord.
2024-Sep-05
- 15:22 UTC: Coordinated upgrade height arrives and Hub validators are expected to upgrade.
- 17:40 UTC: ICS security advisory is made public.
Feedback from early ⅓ validators
The timeline for the patch was good and clear.
There was plenty of time to confirm that operators could be available to apply the patch and the 6 hour window comfortably covered several timezones.
Consider using a Google sheet for validators to self-identify that they’ve upgraded.
Other chains use a Google sheet where validators can mark themselves as having completed the upgrade and see the voting power tally up. This is feedback we received after the last emergency upgrade.
To reduce the risk of collusion, we chose to keep the list of upgraded validators private and collect it via a Google form. The voting power tally was available to the dev and comms team in the background, which is how we tracked when the chain had passed the ⅓ threshold.
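For context, the tally itself is straightforward; a minimal sketch with hypothetical validator names and voting power numbers (not our actual tooling) looks like this:

```go
package main

import "fmt"

func main() {
	// Bonded voting power per validator in the active set (hypothetical numbers).
	votingPower := map[string]int64{
		"validator-a": 900_000,
		"validator-b": 600_000,
		"validator-c": 450_000,
		"validator-d": 300_000,
	}
	// Validators who confirmed via the Google Form that they applied the patch.
	confirmed := []string{"validator-a", "validator-c"}

	var total, upgraded int64
	for _, power := range votingPower {
		total += power
	}
	for _, name := range confirmed {
		upgraded += votingPower[name]
	}

	share := float64(upgraded) / float64(total)
	fmt.Printf("patched share of voting power: %.1f%%\n", share*100)
	fmt.Println("past ⅓ threshold:", share > 1.0/3.0)
}
```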
Operators want to see the code when applying privately communicated upgrades.
Depending on the severity of the issue and the security of the comms channel, this may or may not be possible. Currently, our comms go through Telegram and Discord channels, which are not secure enough to share code involved in a live exploit.
Communication security.
Much of the feedback we received was about the security of communication and of binary distribution. We got many suggestions for how to address this, and most of them are blocked on having secure email contacts for validators on the Hub.
- Set up a private channel for communication
- Use signed emails
- Use a private git repo
Any private channel or repository requires validator teams to be highly responsive about adding and removing their node operators when people join or leave the team, or when the validator moves in or out of the active set. These operations are often neglected, even in the active Telegram groups where we currently communicate.
Some actionable suggestions and work that is currently in progress (but not completed in time to be used for this patch):
- Send emails to the security contact address validators list on-chain. Not all validators have filled in this field; it can be set with gaiad tx staking edit-validator --security-contact <email> --from <key_name>
- Use signed emails to prove that the sender is trustworthy. We have not set up keys and communicated with validators to do this yet, but it’s something we’re aware of.
- Create a hash of the security patch message (i.e., the messages distributed in Telegram and Discord) and display the hash on an official account, such as Informal’s GitHub or Twitter, to verify that the message is legitimate (a minimal sketch follows this list).
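As a sketch of that last suggestion (with a placeholder standing in for the real message text), both sides compute a SHA-256 digest of the exact message: the team posts the digest from an official account, and a validator recomputes it over the text they received in Telegram or Discord.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Placeholder: the exact text of the privately distributed patch message.
	msg := []byte("<exact text of the patch message sent via Telegram/Discord>")

	digest := sha256.Sum256(msg)
	// The hex digest is what gets posted on the official account for comparison.
	fmt.Printf("sha256: %x\n", digest)
}
```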
Feedback from later ⅔ validators
Governance.
Validators typically prefer governance-gated upgrades. Given the severity of this issue, a governance upgrade was not possible. This feedback was received in Discord long before the security advisory was released.
Be clearer about what “consensus breaking” means.
For developers, “consensus breaking” generally means that a consensus break is possible but not guaranteed. For example, this patch is “consensus breaking” but only if one of the four vulnerable message types is sent on-chain.
Feedback from validators is that “consensus breaking” means that a consensus issue is guaranteed to occur. For example, a change to CometBFT is a guaranteed consensus break because it is a change to how validators compute consensus in every single block. A patch like the one in gaia v19.2.0 is not “consensus breaking” by validator standards.
For live security issues, there’s a limit to how much detail can be provided about what will break consensus, because those details would help someone identify the vulnerability. This leads into the next piece of feedback:
Deciding and communicating about a coordinated upgrade vs a freely released patch.
The dev/comms team thinks about how likely a consensus break is when deciding to do a coordinated upgrade vs telling validators to upgrade at their own pace.
The goal in this case was to have a coordinated upgrade to provide confidence that the full validator set had upgraded, since we do not have monitoring tools that provide this information. However, the way we communicated about the coordinated upgrade included the idea that it was safe to apply the upgrade before the halt-height, which caused confusion amongst the validator set and made it hard to tell when we’d passed the ⅔+ threshold.
There are two major thoughts here:
- When a patch is only potentially consensus-breaking, the dev/comms team needs to make a decision and be clear about it. In this case, we were unclear about the decision and it caused confusion. Whether the correct decision was a coordinated upgrade or a freely released patch is hard to say, but either option is manageable as long as the communication is clear.
- The dev/comms team needs a way to confirm that ⅔+ of the validator set has upgraded and the chain is no longer vulnerable to a chain halt. Several options have been suggested:
- Thorchain uses a particular upgrade tx that can be tracked on-chain.
- Polkadot uses a protocol called JAM that lets the chain upgrade without halting.
- Use a coordinated upgrade and trust that Hub validators perform very well during upgrades.
- Manually ask validators if they have upgraded after a freely released patch (this is what we’re currently doing – hi, you’ve probably received a DM from me)
- Use the Google Sheet/Form to have all validators confirm their upgrade after a freely released patch (we did this for only ⅓ of the set this time)
The most immediate option is to use the Google Form for the entire set when we have a freely released patch.
Final thoughts
A high-severity issue in production was mitigated and then resolved with no chain halt. Coordination amongst the first ⅓+ of the validator set was quick and effective compared to past emergencies.
There was confusion around how the full validator set should apply the patch, and we still lack a secure communication channel with any subset of the validator set.
Much appreciation to Amulet for their guidance and advice throughout this issue.
Thank you, as always, to the validators who we count on to patch the chain in an emergency. Your reliability and quick actions keep the Hub safe no matter the severity of the issue.
Next steps: Informal and Hypha (in consultation with Hub validators) to produce internal runbooks for analyzing Hub-specific security issues, including how to categorize an issue, which communication channels to use, how to protect against compromised accounts, and how to communicate about different kinds of emergencies.
Please contribute any feedback here. All of it is helpful as we refine the security response process on the Hub.