How to Recover From Permanently Lost Consensus in an Emergency?

I know Discord seems to be the place to go for questions, but I’m rarely able to access it, so I figured I’d ask here as I experiment and post whatever answer I find, to help flesh out the forum a little bit. :slight_smile:

Anyway, my question is: how do you recover from permanently lost consensus? For instance, I have 2 validator nodes, and one is permanently destroyed. The other validator will never reach consensus now because it will be waiting for the second one to come up. How do I recover the cluster so that the remaining validator can continue from where it left off?

Another note, in my case, I added the second validator to the cluster through the ABCI interface, so it isn’t in the genesis.json file. Otherwise I would have thought that I could just update the genesis file with a new chain ID and validator list and restart the cluster.

@zicklag you need more than 2/3 of the voting power online for the network to reach consensus. A single validator can keep producing blocks if it retains > 2/3 of the voting power. But you should know that you need a live network to even re-delegate tokens to shift voting power around. I’m not sure what you’re looking to achieve, though: it may actually be desirable for the network to halt if half the validators are down. You might also want more than just 2 validators for anti-fragility.
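
To make the threshold concrete, here is a minimal Python sketch of the check. The function name `has_quorum` and the power values are made up for illustration; this is not Tendermint's actual code, just the "strictly more than 2/3 of the total voting power" rule written out with integer arithmetic.

```python
def has_quorum(online_power: int, total_power: int) -> bool:
    """Return True if the online validators hold strictly more than 2/3 of the total power."""
    # Compare 3 * online against 2 * total to avoid floating-point rounding.
    return 3 * online_power > 2 * total_power


# Two validators with equal power, one destroyed: 10 of 20 is only 1/2.
print(has_quorum(10, 20))  # False

# Hypothetical uneven split: the survivor holds 21 of 31, which is above 2/3.
print(has_quorum(21, 31))  # True
```

With two equal validators, losing either one drops the live power to 1/2, which is why the remaining node stalls.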

My use-case is similar to a raft cluster that has lost consensus, for instance, with Docker Swarm.

The swarm cluster will lock up if it loses consensus, but it will allow you to force one of the servers to be the new master, and essentially re-start the cluster using that node as a trusted seed. In the context of a blockchain, I suppose it’s like a fork, if I am using that term correctly.

I want to be able to fork the network, creating a new network with one of the nodes as the authority for the current state of the chain, so that I don’t lose any data (assuming that node was up to date) and have a way to completely restart consensus with a new validator set.

I plan on using Tendermint in a context similar to Docker Swarm, where the nodes are all part of a private cluster and use Tendermint to replicate the state machine across the cluster. Users running the software in their cluster may have any number of machines, so while a minimum of 4 nodes would be recommended for fault tolerance, I want to have a way to recover if the cluster loses consensus for any reason, even operator error.
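
For reference, the 4-node recommendation falls out of the > 2/3 voting power rule. A rough sketch, assuming every validator has equal voting power (the helper name `max_tolerated_failures` is hypothetical):

```python
def max_tolerated_failures(num_validators: int) -> int:
    """Largest number of equal-power validators that can be offline
    while the rest still hold strictly more than 2/3 of the power."""
    if num_validators < 1:
        raise ValueError("need at least one validator")
    # Liveness requires 3 * (n - f) > 2 * n, which simplifies to f < n / 3.
    return (num_validators - 1) // 3


for n in (1, 2, 3, 4, 7):
    print(n, max_tolerated_failures(n))
```

Under that assumption, clusters of 2 or 3 validators tolerate no failures at all, and 4 is the smallest size that survives losing one node.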

Perhaps you want to look at exporting state for genesis? This allows you to generate a new genesis json file at a particular block height on the network. You can use this exported genesis file to start up the new network. You can see how it’s done on the Cosmos Hub in the docs here. And here are the SDK docs.
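
As a rough illustration of the last step, here is a Python sketch that takes an exported genesis document and gives it a new chain ID and validator list before it is used to start the new network. The top-level field names `chain_id` and `validators` are what I understand a Tendermint genesis file to use; the function name, chain IDs, and validator entries are made up and trimmed down (real entries also carry keys and addresses). On an SDK-based chain the validator set also lives in the application state, so treat this as a sketch and not a full recovery procedure.

```python
import copy
import json


def fork_genesis(genesis: dict, new_chain_id: str, validators: list) -> dict:
    """Return a copy of an exported genesis document with a new chain ID
    and validator list; everything else is carried over unchanged."""
    forked = copy.deepcopy(genesis)  # leave the exported document untouched
    forked["chain_id"] = new_chain_id
    forked["validators"] = validators
    return forked


# Hypothetical, heavily trimmed exported genesis for a 2-validator chain.
exported = {
    "chain_id": "old-chain",
    "validators": [
        {"name": "node-a", "power": "10"},
        {"name": "node-b", "power": "10"},
    ],
    "app_state": {"example": "state carried over unchanged"},
}

# Keep only the surviving validator and restart under a new chain ID.
survivor_only = [v for v in exported["validators"] if v["name"] == "node-a"]
forked = fork_genesis(exported, "old-chain-recovered", survivor_only)
print(json.dumps(forked, indent=2))
```

Changing the chain ID keeps the new network from being confused with the old one, and the surviving node ends up holding all of the voting power in the new validator set.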


That sounds along the right lines. I’ll look into that, thanks!