Gaia-7003 Postmortem

We’ll investigate the gaia-7003 halt by traveling back in (virtual) time: starting at the fault and tracing back the causal antecedents until we find the root issue.

Starting with the halt stack trace, we see that at block 28132, the Gaia state machine incorrectly attempted to unbond a validator (974A7590D8BC1B9D323F7A9770D5388D324DA0E7) which was already unbonded. This isn’t itself the bug - it’s just a sanity check, we architect the code defensively to panic any time we hit an unexpected state, which is the case here. Bonding and unbonding happen in the UpdateBondedValidators sequence, so that’s a likely place to start debugging. Using @jaekwon’s gaiareplay debugging branch, we log the validator iteration sequence and start searching for any irregularities - if you want to follow along, you can run it yourself or download the output logs.

Starting from the end of the log, we first see 974A7590D8BC1B9D323F7A9770D5388D324DA0E7 set as the cliff validator in block 27480 (line 200139) - but when we look at the validator set updates two lines below, we see that 974A7590D8BC1B9D323F7A9770D5388D324DA0E7 was unbonded! (compare the pubkeys) That’s not supposed to happen; the cliff validator is by-definition the lowest-power bonded validator - so if we set a cliff validator that’s unbonded, we might later try to unbond it, as we did in this case causing the halt. This isn’t the bug - just an invalid state - we now need to understand why we would have unbonded the cliff validator. Let’s look at the code - we set the cliff validator to the lowest-ranked bonded validator, and if we bonded a new validator, we unbond the cliff validator. What if we bonded a new validator but the cliff validator didn’t change? We’d unbond the cliff validator, then set the cliff validator key again to the old (now unbonded) cliff validator - as observed.

In what situation would we bond a new validator but leave the cliff validator unchanged? Look a few lines above in the logs, at 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 - which was above the cliff validator in the power store, not jailed, but not bonded. That’s definitely wrong - the top n validators in the power store should always be bonded. We can check in the Tendermint updates that 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 was indeed bonded in block 27480, but since it was already above the cliff validator in the power store, the cliff validator didn’t change - explaining why the cliff validator was unbonded.

We still haven’t found the bug, but we’ve found a new cause: validator 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 being in the top n validator candidates - but not bonded. Let’s go back to block 27438. Again we see 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 in the power store (line 199714), bonded as expected - but looking at the Tendermint updates, we see that 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 (compare pubkeys) was unbonded at the end of iteration! Only the cliff validator can be unbonded in an UpdateBondedValidators call, so perhaps 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 was set as the cliff validator - but it was clearly higher in the power store than other bonded validators.

Strange. Let’s go back another block, to 27237. Ah-hah! Indeed, we see that 286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 was set as the cliff validator - but we actually see it twice in the ranked power store: first on line 198339, and then again on line 198348. That explains how it could have been both the cliff validator and ranked higher - but why would it be in the power store twice? Take a close look at the store keys on both lines:

>> VALIDATOR 153: 0501303030303030303030303035FFFFFFFFFFFFCDDC0000FFEC (0501000000000005FFFFFFFFFFFFCDDC0000FFEC) -> {286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 PubKeyEd25519{06A959C49906DE8521EF1D78F81569897721288F5F6D0E06DA595A811262B61C} false 2 9/2 5/1 {agaoye   } 11685 0  0/1 0/1 0/1 0/1 0/1}
>> VALIDATOR 160: 0501303030303030303030303034FFFFFFFFFFFFD25A0000FFFF (0501000000000004FFFFFFFFFFFFD2Z0000FFFF) -> {286CB8F73D08FE76ACA6A63DEE1793B0E6D4B622 PubKeyEd25519{06A959C49906DE8521EF1D78F81569897721288F5F6D0E06DA595A811262B61C} false 2 9/2 5/1 {agaoye   } 11685 0  0/1 0/1 0/1 0/1 0/1}

In particular, they differ in the last two bytes: the first is 0xEC, the second 0xFF. Those bytes are the bond intra-tx counter, used for ordering validators who increase their stake at the same height. This validator was bonded at genesis, and it wasn’t the first validator, so we’d expect it to have a non-zero bond intra-tx counter per the genesis initialization logic - so 0xEC is the right bond intra-tx key, and 0xFF is the wrong one (the key is inverted since the sort is in reverse order, 0xFF corresponds to a bond intra-tx counter of zero).

Looking at the validator struct, we see that the bond intra-tx counter there is 0 - matching one of the store keys but not the other. This is an already-known bug - #1839 - caught by our simulation before it halted the testnet, but sadly after we released the version of the code that gaia-7003 uses. That explains our issue, as when the power key was updated the old one which should have been deleted would have been incorrectly calculated since the validator struct was wrong, and left in the ranked power store as observed. Turns out we’d already fixed this bug - but still worth tracing in detail to be sure.