Troubleshooting Missed blocks


#1

Opening this topic in the forum as a place to discuss this issue and hopefully get some ideas.

Looking in Hubble, many validators occasionally miss a block. How should an operator diagnose missed blocks? There is a wealth of data in the blockchain, and in logs that we keep on our validators and our sentry nodes. What should we be looking for?

Does the Tendermint/Cosmos team have expectations for what this should look like in the production network? If there are 100 validators distributed globally & 5 second block times, is it expected that all properly operating validators will have 100% uptime? Or is it expected that a properly operating validator will miss some percentage of blocks? Is there any theory that the team can share that would help the validator community understand the dynamics?


#2

Also very interested in this. I missed a block on gaia-7003 :confused:


#3

Yeah, great question. I think it could also be helpful if people posted how they have their validators and sentries set up, to see whether the issue is related to a specific architecture.


#4

This is not the solution you are asking for, but you can query /commit?height= on any well-connected full node to see any validator’s block signing status. When I tested this, I found that we should query /commit?height= for a height at least 2 blocks behind the current height, because the response is sometimes updated late. Below is an implementation.
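
A minimal sketch of such a check (not the poster's original code) might look like this in Python. The JSON key names are assumptions: the layout of the /commit response differs between Tendermint versions, so the helper tries a couple of plausible shapes and should be adjusted to whatever your node actually returns. The function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_commit(rpc_url, height):
    """Fetch the /commit response for one height from a full node's RPC."""
    with urlopen(f"{rpc_url}/commit?height={height}") as resp:
        return json.load(resp)


def signed_addresses(commit_response):
    """Return the set of validator addresses that have a vote in the commit.

    ASSUMPTION: votes live under result -> SignedHeader (or signed_header)
    -> commit -> precommits (or signatures), and an absent vote is null or
    has an empty address. Adjust the keys to match your node's output.
    """
    result = commit_response.get("result", commit_response)
    header = result.get("SignedHeader") or result.get("signed_header") or {}
    commit = header.get("commit", {})
    votes = commit.get("precommits") or commit.get("signatures") or []
    return {
        v["validator_address"].upper()
        for v in votes
        if v and v.get("validator_address")
    }


def did_sign(commit_response, validator_address):
    """True if validator_address appears among the votes in the commit."""
    return validator_address.upper() in signed_addresses(commit_response)
```

Following the observation above, one would call something like `did_sign(fetch_commit(rpc_url, latest_height - 2), my_address)`, where `rpc_url`, `latest_height`, and `my_address` are placeholders for your own node and validator.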


#5

How should an operator diagnose missed blocks? There is a wealth of data in the blockchain, and in logs that we keep on our validators and our sentry nodes. What should we be looking for?

This is a great question but it can be difficult to answer without having access to logs across many validators.

Here are some general notes:

  • A proposer will wait timeout_commit after seeing a commit for a block before proposing the new block, to give time for more than the required +2/3 votes to get in. This currently defaults to 5s, but it used to be 1s, so it’s possible some validators still have the old setting.

  • We distinguish between canonical and non-canonical commits. A non-canonical commit is the first +2/3 you’ve seen for a block. A canonical commit is the actual set of votes for block H that gets included in H+1. A canonical and a non-canonical commit for the same block intersect in at least +1/3 of voting power. See the canonical field in the /commit response. We could do a better job of exposing the non-canonical commit seen first by each node, to get a sense of how the votes were propagated (currently, you’d have to keep polling /commit on each node for the next height and catch it while canonical=false, i.e. before the next block is committed).

  • Gossiping of votes is done by routines that randomly select a vote we think the peer hasn’t seen and send it. Note this means sentries won’t necessarily prioritize gossiping the votes of their validators - we may want to address this!

  • Sending routines have some sleep parameters:

    • In consensus, if we think there’s nothing to send, we sleep peer_gossip_sleep_duration.
    • In the underlying connection, we only actually flush bytes out every flush_throttle_timeout.
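
The intersection claim in the note on canonical and non-canonical commits follows from a short counting argument. As a sketch, write V for the total voting power, A and B for the sets of validators behind the two commits, and P(·) for the voting power of a set. These symbols are introduced here for illustration only.

```latex
\begin{align*}
P(A) &> \tfrac{2}{3}V, \qquad P(B) > \tfrac{2}{3}V, \qquad P(A \cup B) \le V \\
P(A \cap B) &= P(A) + P(B) - P(A \cup B) \\
            &> \tfrac{2}{3}V + \tfrac{2}{3}V - V = \tfrac{1}{3}V
\end{align*}
```

So any two commits for the same block share validators holding more than 1/3 of the voting power, even if the exact vote sets differ from node to node.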

One place to start is to look at when your sentries hear about your validator’s vote vs votes from the rest of the network.

You should see “Signed and pushed vote” logs on the validator when it pushes out a vote, and “Added to prevote”/“Added to precommit” logs on the sentries with votes from your validator and all the others. If the vote is only being added after a non-canonical commit has already been observed (i.e. within timeout_commit), you’ll see “Added to lastPrecommits”.

There is also a debug log message, “setHasVote”, which tells you when other nodes have received a particular vote, so you can find out when each of your sentries’ peers has received the vote from your validator.
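
One way to compare these timestamps is to pull the first matching line out of each log and subtract. The sketch below is only an illustration: it assumes each log line carries a bracketed timestamp such as `I[11-07|10:03:45.123]` (optionally with a year), that the validator and sentry clocks are synchronized, and that both events fall on the same day. The function names and the idea of filtering by extra substrings (for example a height or a validator address fragment) are hypothetical; check them against your own log output.

```python
import re

# ASSUMPTION: lines look like "I[11-07|10:03:45.123] message key=value ...",
# possibly with a leading year in the date part.
TIMESTAMP = re.compile(
    r"\[(?:\d{4}-)?\d{2}-\d{2}\|(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]"
)


def first_seen_ms(lines, needles):
    """Milliseconds since midnight of the first line containing every needle.

    Returns None if no line matches or the matching line has no timestamp.
    """
    for line in lines:
        if all(n in line for n in needles):
            m = TIMESTAMP.search(line)
            if m:
                h, mi, s, ms = (int(g) for g in m.groups())
                return ((h * 60 + mi) * 60 + s) * 1000 + ms
    return None


def vote_delay_ms(validator_lines, validator_needles,
                  sentry_lines, sentry_needles):
    """Delay between the validator pushing a vote and a sentry adding it."""
    sent = first_seen_ms(validator_lines, validator_needles)
    seen = first_seen_ms(sentry_lines, sentry_needles)
    if sent is None or seen is None:
        return None
    return seen - sent
```

Running the same comparison for votes from other validators gives a baseline: if your own validator's vote consistently reaches the sentry later than everyone else's, the validator-to-sentry link is the first thing to examine.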

If there are 100 validators distributed globally & 5 second block times, is it expected that all properly operating validators will have 100% uptime? Or is it expected that a properly operating validator will miss some percentage of blocks?

Given the partial asynchrony inherent in the internet, we can’t reasonably expect 100% signature inclusion, though we should be able to get close. Certainly a properly configured validator should be satisfying the app-level liveness requirements, even if it’s occasionally missing a block.
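
To put "occasionally missing a block" in perspective, here is a rough sketch that checks a signing history against a sliding-window threshold. The window size and minimum signed fraction are hypothetical placeholders, not the network's actual liveness parameters, and the function name is made up for illustration.

```python
from collections import deque


def meets_liveness(signed_flags, window, min_signed_fraction):
    """Check every full sliding window of `window` blocks against a threshold.

    signed_flags: iterable of booleans, one per block, True if the block
    was signed. window and min_signed_fraction are HYPOTHETICAL values
    supplied by the caller, not real chain parameters.
    """
    recent = deque(maxlen=window)
    signed_in_window = 0
    for flag in signed_flags:
        if len(recent) == window:
            # The oldest entry is dropped by the append below.
            signed_in_window -= recent[0]
        recent.append(bool(flag))
        signed_in_window += bool(flag)
        if len(recent) == window and signed_in_window < min_signed_fraction * window:
            return False
    return True
```

With a window of 100 blocks and a minimum fraction of 0.5, for instance, a validator that misses 2 blocks in 100 passes comfortably, which matches the point that isolated misses are not in themselves a sign of misconfiguration.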

Is there any theory that the team can share that would help the validator community understand the dynamics?

Modelling this is a work in progress. Basically we need to consider network latency, the number of hops between validators, and the specifics of how we actually do the gossip. Now that the consensus protocol paper is out (https://arxiv.org/abs/1807.04938), we plan to focus more on the gossip piece. As noted above, there are at least a few things we could do to optimize the gossip, especially for sentries gossiping on behalf of their validator.