Our understanding of the Cosmos Hub mempool issues

First, I would like to thank everyone for contributing their understanding to this very important topic and for offering ideas on how things can be improved. Let me try to offer my perspective:

  • I agree with @zaki_iqlusion that a nice property of the existing system is that it is very robust and that messages propagate through the network quite fast. In fact, part of the problem is that the current design is probably too robust. We did some research on this back in 2020, as it was clear even then that this is probably the most challenging and least mature part of tendermint/comet. And we at Informal are clearly not the only ones aware of this: zaki_iqlusion, jack, dev, ismail, xla, chris goes, and others have also been talking about it for a very long time. It turns out that, although there is a ton of research on gossip and on consensus-based systems, there is almost no research on the mix of the two, i.e., consensus systems (which include tx gossiping, aka the mempool, as well as vote and block gossiping, not just the consensus protocol) built on top of gossip systems. The closest research in our view is the line of papers around the BAR (Byzantine, Altruistic, Rational) model, for example BAR Gossip (https://www.cs.cornell.edu/lorenzo/papers/bar-gossip.pdf). We have published our results in this paper (https://www.inf.usi.ch/faculty/pedone/Paper/2021/middleware2021b.pdf); note that it assumes only crash faults. Scientific work on the BFT version is still in progress, and we will hopefully have something to share soon. So how to design efficient and secure large-scale, Byzantine fault tolerant, gossip-based consensus systems is, in my view, still an open and very challenging research question. And this might help explain my perspective to @jacobgadikian that this is a known design problem, not an issue, in the sense that we can’t fix it in a short time frame, as you would an implementation or misconfiguration bug.
Since this was raised as a security incident, we (Informal) wanted to keep a low comms profile while working on it (the right, professional thing to do), and that might have been misinterpreted as not considering the issue important or not working hard on it. In fact, we care deeply about it and have been working on it for a very long time (not just the last two months).
  • The other aspect here is that comet’s existing architecture and implementation do not allow us to easily implement any novel protocol or design idea, especially one that involves changing the gossip/p2p layer. The existential challenges we faced in 2022 with tendermint (before Informal took over stewardship of the comet project) were actually related to this exact problem, and hopefully we all learned that changes to the tendermint/comet gossip/p2p layer need to be made with a lot of understanding and care, and with a super thorough QA process. Strengthening the QA and testing processes is actually an area where the Comet team at Informal has spent a lot of effort since taking over stewardship of comet, and you can read about this here: CometBFT Documentation - CometBFT Quality Assurance - v0.38 and CometBFT Documentation - Method - v0.38. Mempool and gossip inefficiencies have been identified as among the most important and critical problems of comet, and the Comet team has been working on them the whole year. Bucky has mentioned in his post ([Proposal] Cosmos Hub adopt the Skip Block SDK - #35 by ebuchman) some results and some work that is still in progress. Unfortunately, we needed to spend quite significant time reverse engineering the current design and its intentions from the existing code so that we could start making improvements in a safe way. We also needed to put some monitoring tooling in place so that we can really measure the impact of the changes we are making. But this is all in place now, and we are ready to unlock R&D in those areas, and we are super happy to collaborate with anyone interested in helping with this. Dogemos’s perspective re CheckTx is also a super valid observation and insight.
    Finally, there is a somewhat orthogonal effort to get rid of comet’s mempool altogether (ADR 110: Remote mempool by thanethomson · Pull Request #1565 · cometbft/cometbft · GitHub), and I strongly support this initiative for two main reasons: 1) it separates concerns nicely (and strongly signals that tx propagation and block creation should happen outside comet, and that we want to work with others on figuring out the right APIs and responsibilities), and 2) it reduces comet’s footprint and responsibilities (we might avoid the need to solve the super hard research problem mentioned above, or at least be able to focus on a simpler version of it).