Have you read https://arxiv.org/abs/1807.04938 ? It mostly answers your questions.
Why TCP? Tendermint is designed for gossip networks, not point-to-point networks. It might make sense to have a FIBRE-like block relay network at scale eventually, but TCP is a good fit for our target scenario.
Why no view change? Because Jae invented a novel termination mechanism that eliminates the need for a view change.
You couldn’t be bothered to read the paper I just linked to could you?
- Formal specification
- Consensus proof for Tendermint.
- A complete bibliography.
Of course the Gossip model vs the TCP model matters. PBFT models a network topology where all the nodes are directly connected to each other. The consensus protocol can provide the fault detection and recovery instead of doing it at the transport layer.
Tendermint works in a model where there is a heterogeneous mix of validator nodes and full nodes in the network. Under normal conditions, we don’t expect to have any sort of direct network link between validator nodes. Validator nodes are also gossiping about many things other than consensus, like new transactions, evidence of Byzantine faults, etc. It makes sense to have only one p2p layer for all of that.
Tendermint is also designed to operate over the internet, not enterprise networks. This means our communication protocols need to be as friendly to middleboxes as possible. While it is possible to deploy new internet-scale UDP protocols like QUIC, Google has been working on this for years and has run into numerous challenges. There is certainly room for improvement in our network stack, but you are approaching this from entirely the wrong angle.
I think we moved past the complaints of lack of intellectual rigor in Tendermint consensus and into a networking stack design debate and your last comment sort of moves in the direction of a concrete proposal.
Here is the logic for what we send a peer when that peer is behind in the same height/round.
Because of how we think about proposal validation, we always need to send PolVotes or PreVotes to the peer. We can’t just skip to sending precommits.
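To make that ordering constraint concrete, here is a rough sketch. It is written in Python for brevity and is not the actual Tendermint reactor code: `PeerRoundState`, its fields, and `next_vote_to_send` are hypothetical names, and the strict "earliest phase first" rule is an assumption drawn from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class PeerRoundState:
    """What we believe a peer already holds for one height/round (hypothetical)."""
    pol_prevotes: Set[int] = field(default_factory=set)  # validator indices
    prevotes: Set[int] = field(default_factory=set)
    precommits: Set[int] = field(default_factory=set)


def next_vote_to_send(
    peer: PeerRoundState,
    our_pol_prevotes: Set[int],
    our_prevotes: Set[int],
    our_precommits: Set[int],
) -> Optional[Tuple[str, int]]:
    """Pick one vote the peer is missing, always from the earliest phase.

    Precommits are only offered once the peer has every POL prevote and
    prevote we hold, so the first phase is never skipped.
    """
    for kind, ours, theirs in (
        ("pol_prevote", our_pol_prevotes, peer.pol_prevotes),
        ("prevote", our_prevotes, peer.prevotes),
        ("precommit", our_precommits, peer.precommits),
    ):
        missing = sorted(ours - theirs)
        if missing:
            return kind, missing[0]
    return None  # peer is caught up for this round
```

With a peer that has only validator 0's prevote, the function keeps returning prevotes until the peer has them all, and only then moves on to precommits.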
Would love a concrete proposal on how to improve Tendermint here.
There is definitely going to be a Tendermint 2.0 and a lot of re-architecting of the gossip layer is in scope. We are also planning on switching to BLS signatures and doing aggregation in the gossip layer and if we can see benefits from switching to UDP it might make sense.
Thanks for taking an interest in our project and taking the time to raise your concern. You’re certainly right that we could/should do a better job of documenting this important design decision.
First thing I would point out is that the most successful live BFT systems in the world today use TCP connections. These are of course Bitcoin and Ethereum. Note these are full production systems that run on an internet scale decentralized p2p network and handle many concerns beyond just active consensus - there’s peer exchange, broadcasting transactions, helping old peers sync, etc. Of course these are probabilistic systems, unlike PBFT and Tendermint. But note that even BFT-Smart, the apparent leading implementation of PBFT-like protocols besides Tendermint, also uses TCP. Correct me if I’m wrong, but I don’t believe Castro’s system has ever been used for or destined for production.
We have very deliberate reasons for using TCP. We want to have connection oriented protocols with our peers because we have to multiplex many concerns over a single connection (transactions, peer exchange, etc.), as Zaki already mentioned. While of course we could use UDP, that would put significant additional burden on the protocol developers that we deemed unnecessary since TCP handles it so well already.
There’s actually another important reason we use TCP. It allows us to implement our consensus protocol more efficiently. Using TCP, we make explicit use of the guarantee that messages we have sent will be received (or else the connection will break). This way, we can reduce the number of application-level messages we need to send over the network: we only send peers the consensus messages we think they need. As soon as we send a peer a message, we consider it received, and thus we don’t need to worry about sending it again. Of course, if the connection breaks, the peer starts from scratch, but using TCP gives us very nice guarantees that allow us to implement a more efficient consensus protocol. This is a tremendously helpful property, especially in a p2p gossip network, since message overhead is a fundamental bottleneck in such systems. This is how we managed to design a system that can support over 100 validators heterogeneously connected over the public internet. Again, as far as we know, no other PBFT-like system has even come close to such a feat.
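As an illustration of the bookkeeping this enables, here is a minimal sketch. It is Python rather than Tendermint's Go, and `GossipBookkeeper` and its method names are hypothetical, not taken from the codebase. The assumption it encodes is the one described above: on a reliable transport, a completed send is treated as delivery until the connection breaks.

```python
from collections import defaultdict
from typing import Dict, List, Set


class GossipBookkeeper:
    """Tracks which message IDs each peer is assumed to hold."""

    def __init__(self) -> None:
        self._known: Dict[str, Set[str]] = defaultdict(set)

    def pending_for(self, peer_id: str, our_messages: List[str]) -> List[str]:
        """Messages we hold that the peer is not yet assumed to have."""
        known = self._known[peer_id]
        return [m for m in our_messages if m not in known]

    def mark_sent(self, peer_id: str, message_id: str) -> None:
        """On a reliable transport, a completed send counts as delivered."""
        self._known[peer_id].add(message_id)

    def on_disconnect(self, peer_id: str) -> None:
        """A broken connection voids the assumption; start from scratch."""
        self._known.pop(peer_id, None)
```

Over an unreliable transport, `mark_sent` could not be trusted without an application-level acknowledgement, which is exactly the extra traffic being avoided here.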
This is why Zaki was showing you application-level code. Hopefully this explanation makes what Zaki has already said even clearer.
As for this comment:
The consensus protocol can provide the fault detection and recovery instead of doing it at the transport layer.
This was explicitly about protocols using point-to-point fully connected networks, like how PBFT was designed. As already stated, this is fundamentally insufficient for us.
As for your concern re catching-up. That’s a fine concern, but in the scheme of things, it seems minor compared to the many other things we’re dealing with and hardly worth accusing us of having done no due diligence. We have mechanisms that allow peers to catch up both across heights and across rounds when necessary. If it’s a bit slower because they’re catching up on old TCP messages, it’s a small price to pay for the extra benefits we get. If you can find a serious vulnerability here, we encourage you to submit it to our bug bounty program: https://hackerone.com/tendermint
As for your continued claims on our lack of due diligence, I would point you to the following:
- We just published a paper with formal proofs of safety and liveness of the Tendermint algorithm. It was presented recently at FLoC to an audience including senior researchers from Microsoft who were quite interested in collaboration. In any case, we would love your feedback/review on the paper if you have the time https://arxiv.org/abs/1807.04938
- Last summer we performed extensive industry standard testing on Tendermint as a distributed BFT database. I encourage you to read the results: https://jepsen.io/analyses/tendermint-0-10-2
What I demonstrated in the application-layer code is that unless you rewrite the application layer so that you don’t have to send POLVotes, head-of-line blocking is irrelevant to Tendermint. The primary advantage of switching from a connection-oriented protocol to a datagram-oriented protocol is the ability of the application-layer logic to control HOL blocking. In Tendermint, the application layer also blocks on sending the first phase of the two-phase commit, so UDP doesn’t buy us anything.
Comparing the p2p stack to Ethereum’s is perfectly fine, since it doesn’t matter at the p2p layer. The UDP / TCP discussion in a gossip network is pretty independent of the algorithm used for consensus.
Our requirements are message delivery guarantees and multiplexing. You’re right that TCP provides message ordering (and associated overhead), which I don’t think Tendermint needs after the handshake. (We definitely want message delivery guarantees, and multiplexing is a nice plus.) QUIC does this with UDP plus per-stream message ordering for several streams multiplexed together. It took Google engineers several years to develop this. A proposal to create a new scheme based on QUIC (but perhaps without the message ordering) once QUIC itself is sufficiently standardized and deployed would be reasonable, though it would take tons of work to build. However, such a thing does not exist today, and it is not something we need to block the launch of the Cosmos ecosystem on for over a year. TCP seems to me to be a good choice for the needs of Tendermint given the infrastructure that exists today, especially since TCP is already implemented on every system and months don’t have to be spent auditing its implementation.
If there is sufficient need, the p2p stack can be upgraded in the future. It’s not like the ecosystem is locked into this forever. (On-chain governance has the power to change all of this.) There is a plan in progress for how to handle p2p upgrades: https://github.com/tendermint/tendermint/pull/1983, so we could even have upgrades in a backwards-compatible manner. (Since we only need message ordering for the handshake, we could use TCP there, and then switch to a UDP-based transport that confirms receipt of each message.) There is a huge design space that can be explored here post-launch; it’s important to note that anything we have at launch doesn’t lock us in forever. It’s more important to have something super well reviewed and secure at launch than something optimized for maximum efficiency with less safety. (This is part of the reason why we are not using any novel cryptography at launch, despite many of us being ecstatic about its use cases.) Optimized code and algorithms can and will be brought in by governance as time progresses.
Zaki already mentioned that we have plans to update the p2p layer once BLS gets added in.
This sounds like progress.
Multiplexing is for the many protocols we run in Tendermint. It’s not just consensus. We also help old peers sync blocks, we broadcast recent transactions in the mempool, we gossip evidence of byzantine behaviour, we gossip about peers, and we do this all over the same connection with the peer. It’s quite nice having TCP guarantees in all of these protocols.
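For readers unfamiliar with the idea, multiplexing several protocols over one ordered byte stream usually comes down to framing. The sketch below is a generic illustration in Python, not Tendermint's actual wire format: the channel IDs, the 5-byte header layout, and the function names are all assumptions made for the example.

```python
import struct
from typing import Iterator, Tuple

# 1-byte channel ID followed by a 4-byte big-endian payload length.
HEADER = struct.Struct(">BI")

# Hypothetical channel IDs, chosen only for this example.
CONSENSUS, MEMPOOL, EVIDENCE, PEX = 1, 2, 3, 4


def encode_frame(channel: int, payload: bytes) -> bytes:
    """Prefix a payload with its channel ID and length."""
    return HEADER.pack(channel, len(payload)) + payload


def decode_frames(stream: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (channel, payload) for each complete frame in the byte stream."""
    offset = 0
    while offset + HEADER.size <= len(stream):
        channel, length = HEADER.unpack_from(stream, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(stream):
            break  # incomplete trailing frame; wait for more bytes
        yield channel, stream[start:end]
        offset = end
```

Because the underlying stream is ordered and reliable, the receiver can split it back into per-protocol messages using the length prefix alone, with no per-message sequence numbers or acknowledgements.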
As for the consensus: in fully connected point-to-point networks, we all agree, it could be better to use UDP, because you don’t really care about dropping messages - every node sends messages to every other node and the PBFT mechanism handles the faults. That’s great. But in Tendermint, we expect the network to be not fully connected and for there to be many hops between nodes, which means we have to be more intelligent about which messages we send to whom and about the reliability of those sends.
We can’t just say “use UDP and let the consensus handle it” because having a non-fully-connected network means there’s a layer of abstraction (i.e. the p2p network) between the connections and the consensus that we need to bridge for the consensus to even be able to handle anything. Even if we use TCP, we have no guarantee that messages will get to the final destination, because they could get dropped at any of the multiple hops in the p2p network along the way. So if you think of the p2p network itself as a kind of transport that connects otherwise non-directly-connected nodes, then that transport certainly does not have message ordering or delivery guarantees, and it’s the responsibility of the consensus to handle that, just as you would expect. Our reason for using TCP is that it allows us to implement the gossip over this non-fully-connected network more efficiently, since we don’t have to worry about application-level acknowledgements of message receipt.
We intend to write a paper about the Tendermint gossip layer and how we actually implemented the consensus gossip, which will hopefully clear all of this up. We simply haven’t had the time to write it yet.
In the meantime, if you can point to a real implementation of an asynchronous deterministic BFT protocol like Tendermint and PBFT that uses UDP, I’d be interested to see it.
Speaking as a network engineer, UDP has the following advantages:
- Eliminates head-of-line blocking
- Can avoid retransmission of stale data when messages are dropped
These are both micro-optimizations over TCP, and there is presently no evidence either of these are bottlenecks in Tendermint.
UDP has the following disadvantages:
- Congestion control: where TCP provides built-in congestion control, with UDP it is generally handled by userspace congestion control algorithms, which are tricky to get right and take years to develop. There are off-the-shelf solutions for datagram congestion control like DCCP, however performance results with DCCP vary wildly: sometimes it can provide performance improvements over TCP or UDP with userspace congestion control, and other times it’s slower than TCP. It really depends on the particular problem and the particular DCCP stack.
- Transport encryption with UDP is significantly more difficult. Replay defense in datagram-oriented protocols requires explicit nonces and a sliding window protocol rather than a simple, implicit, incrementing nonce. DTLS is almost universally reviled for these reasons.
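To show what the sliding-window requirement involves, here is a minimal sketch of a replay filter over explicit sequence numbers, in the general style used by datagram security protocols. The class name, the default window size of 64, and the method name are choices made for this example, not taken from any particular stack.

```python
class ReplayWindow:
    """Sliding-window replay filter for explicit per-packet sequence numbers."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        self.highest = -1  # highest sequence number accepted so far
        self.bitmap = 0    # bit i set => (highest - i) has already been seen

    def accept(self, seq: int) -> bool:
        """Record seq and return True if fresh; False if replayed or too old."""
        if seq > self.highest:
            # Slide the window forward and mark the new highest as seen.
            shift = seq - self.highest
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:
            return False  # fell off the back of the window
        if self.bitmap & (1 << offset):
            return False  # duplicate within the window
        self.bitmap |= 1 << offset
        return True
```

In a real protocol this check would only be applied to packets that have already passed authentication, since otherwise an attacker could advance the window with forged sequence numbers. With a stream transport, none of this state is needed: both sides simply increment a counter per record.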
The arguments being made in this post are grossly overstated hyperbole which completely overlook the practical considerations. The Tendermint team has other priorities besides micro-optimizing the performance of the network layer. Given that, TCP is a perfectly reasonable starting place, and should significantly reduce the amount of work needed to actually ship a working MVP.
I think UDP and DCCP as alternative transports are worth investigating, but will be difficult and time consuming, and probably don’t make sense to investigate until after the Cosmos mainnet is live.
Adding another point to the discussion: filtering TCP DDoS attacks is a well-understood problem, whereas it’s very difficult for UDP. Effective DDoS mitigation is inherently stateful, and this is much harder to get right for UDP protocols since each protocol has its own handshake and session handling mechanisms. Stateless UDP protocols like DNS, NTP, SSDP and (particularly annoyingly) memcached are also the reason the DDoS issue is as bad as it is. The QUIC protocol has been carefully designed to avoid both issues.
I implemented a custom UDP DDoS mitigation system, so I’m well-aware of the tradeoffs at play.
This is an important consideration for Cosmos, which is very likely to experience DDoS attacks (an attacker might attempt to attack each validator whose turn it is to propose, for instance).
Middleboxes and ISPs also tend to treat UDP traffic worse than TCP and rate limit it much more aggressively. This is due to the DDoS issue and the fact that UDP protocols tend to have bad congestion control and won’t react to packet loss and ECN the same way TCP does (i.e. by reducing the window).
UDP protocols are harder to troubleshoot - any competent network engineer knows how the TCP state machine works and how to troubleshoot connectivity issues. For UDP-based protocols, custom tooling and dissectors are required for proper troubleshooting.
I do agree that from a theoretical point of view, the consensus mechanism would be better off with UDP, but there are important practical considerations in favor of TCP. Head-of-line blocking is a real issue, especially with packet loss and multiplexing - in the presence of packet loss and an elephant flow inside the multiplexed stream, a node will quickly miss rounds while TCP is busy retransmitting. A UDP-based protocol or SCTP would handle this more gracefully, at the expense of higher complexity. Building a resilient UDP gossip protocol which won’t suffer from the usual issues in adversarial networks is a lot of work and hard to get right.
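A toy model makes the head-of-line effect easy to see. This is a sketch with made-up arrival times, not a measurement: `delivery_times` is a hypothetical helper that compares when packets reach the application under in-order delivery versus per-datagram delivery.

```python
from typing import Dict, List


def delivery_times(arrivals: Dict[int, float], in_order: bool) -> List[float]:
    """Time at which each packet, by sequence number, reaches the application.

    arrivals maps sequence number -> time the packet finally arrives at the
    receiver, including any retransmission delay.
    """
    times: List[float] = []
    latest = 0.0
    for seq in sorted(arrivals):
        if in_order:
            # An ordered stream cannot release a packet before all earlier ones.
            latest = max(latest, arrivals[seq])
            times.append(latest)
        else:
            # Datagrams are handed to the application as soon as they arrive.
            times.append(arrivals[seq])
    return times


# Packet 1 is lost and only arrives after a retransmission at t=9.0.
arrivals = {0: 1.0, 1: 9.0, 2: 3.0, 3: 4.0}
print(delivery_times(arrivals, in_order=True))
print(delivery_times(arrivals, in_order=False))
```

In the ordered case, packets 2 and 3 are held until the retransmission of packet 1 lands, even if they belong to an unrelated multiplexed channel.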
I think the above thread demonstrates that a great deal of due diligence and thought went into the design. There is also some excellent insight in the discussion about the relative merits of UDP and TCP for different applications. This discussion appears to be over, so I’m going to go ahead and lock this thread.