Cosmos Hub v17.1 Chain Halt - Post-mortem

This was an incredibly rapid and effective post-mortem process – huge props to everyone involved, from Informal especially, for how quickly we got to the bottom of this behaviour from a technical standpoint.

We should have inactive validators in future testnet scenarios.

In Hypha’s internal debrief, we talked about why we didn’t encounter this issue on testnet even though we’ve been running v17 for weeks, including during ISLE.

We try to keep most of the params on the testnet close to the Hub’s params, including maxValidators. Right now our provider chain’s maxValidators param is 175, but only 45 validators are actually active and securing the network. There’s plenty of room in the active set, so validators performing normal operations like bonding, unbonding, unjailing, etc. never put the chain in a state where we have 176 bonded validators.
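To make the headroom argument concrete, here is a rough toy model in Python. It is not the staking or provider module logic. It only assumes that a validator joining the bonded set is counted before any displaced validator is removed, so the bonded count can briefly rise by one. The function names and the "full set" scenario are made up for illustration; 175 and 45 are the testnet numbers mentioned above.

```python
def transient_bonded_count(bonded: int) -> int:
    """Bonded count at the instant one more validator joins,
    before any displaced validator is removed (toy assumption)."""
    return bonded + 1


def can_exceed_cap(bonded: int, max_validators: int) -> bool:
    """True if a single join can momentarily push the bonded count
    above max_validators under the toy assumption above."""
    return transient_bonded_count(bonded) > max_validators


# Testnet numbers from this post: cap of 175, only 45 bonded validators.
testnet = can_exceed_cap(bonded=45, max_validators=175)

# Hypothetical chain whose active set is already full.
full_set = can_exceed_cap(bonded=175, max_validators=175)

# Spare slots on the testnet before the cap even comes into play.
headroom = 175 - 45

print(testnet, full_set, headroom)
```

Under this simplification, the cap can only be crossed when the active set is already full, which is why a testnet with 130 spare slots never gets there.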

Another thing we discussed is that our testnets have to meet two (sometimes contradictory) goals: training and realism. We want validators to know what to do and be capable of doing it perfectly, but we also want to experience the chaos that we’re sure to see on mainnet.

For both ISLE and the Testnet Incentives Program, we expect validators to remain unjailed for the full period. So this exact incident (validator unjails, joins the active set, and causes there to momentarily be more than maxValidators) is unlikely to occur on testnet even if we reduce maxValidators because… testnet validators rarely get jailed.

In other words – great training, but not great realism.

Going forward on testnet, we’ll consider some different methods of better reproducing mainnet conditions here (e.g., reducing maxValidators, running lots of small validators so we always have some inactive vals, introducing more chaos into our stake distributions, holding game days, etc.) even though our validators are unrealistically well trained :sweat_smile:
