[Proposal] Cosmos Hub adopt the Skip Block SDK

Thank you for the additional information @hxrts

Adopting an EIP-1559 mempool is a mistake for the Cosmos Hub and Osmosis chains.

It makes absolutely no sense to constrain what fee the next block proposer must charge for transactions based on how full the previous block is.

In a Cosmos style blockchain, we need to treat each block as a unique auction that occurs within the next proposer.

It doesn’t matter if the previous proposer proposed a lot of transactions or a few transactions.

It is a mistake to treat previous blocks as a view of congestion in the network.

This is the wrong design and will result in worsening UX for Cosmos.

7 Likes

I think this could be accurate.

Since we have full finality, each block is fully independent of the others, especially if we do not have a mempool, aka Satoshi’s mind prison.

3 Likes

The implementation of the “FeeMarket” module we have does not necessarily mandate including previous block capacity as a parameter, and doesn’t mandate copying EIP-1559. If we find previous block capacity is not a good signal (though I think in periods of high load like today on Osmosis, it’s hard for me to argue it wasn’t), then we can hot-swap other parameters. EIP-1559 would be a general improvement on the fee market today, and we’d love to be engaged on finding yet better ones over time that are dependent on other factors.

Finality does not seem to change the calculus of why EIP-1559 is worse for Cosmos than it is for Ethereum. In Ethereum, even though finality takes longer, block-by-block base fees are still independent, as they would be in this implementation. However, this version of EIP-1559 allows for adjustable learning rates so we could scale up/down how much we index on block size or capacity.
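
As a rough illustration of what an adjustable learning rate means here, below is a minimal sketch of an EIP-1559-style base fee update. The function and parameter names are hypothetical and are not taken from the FeeMarket module; the default of 0.125 mirrors Ethereum's 1/8 maximum change per block, and the Hub could pick any other value.

```python
def next_base_fee(
    base_fee: float,
    gas_used: int,
    gas_target: int,
    learning_rate: float = 0.125,  # assumption: Ethereum's 1/8; a tunable parameter here
) -> float:
    """EIP-1559-style update: move the base fee in proportion to how far the
    previous block's gas usage was from the target."""
    # +1.0 for a block at twice the target, -1.0 for an empty block, 0.0 at target.
    utilization_delta = (gas_used - gas_target) / gas_target
    return base_fee * (1 + learning_rate * utilization_delta)


# A block at twice the target raises the fee by the learning rate (12.5% here).
print(next_base_fee(100.0, gas_used=20, gas_target=10))
# A lower learning rate indexes less heavily on the previous block.
print(next_base_fee(100.0, gas_used=20, gas_target=10, learning_rate=0.01))
```

Setting the learning rate to zero would make the base fee ignore previous blocks entirely, which is one way to read the "scale up/down" remark above.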

If we wanted a pure PGA on the Hub, it can support this too. The module is a general framework to build stateful-aware fee markets, and dynamically adjust learning parameters on a set of inputs - i.e. can create any fee market. With the Block SDK wrapping the module, we can constrain the fee market to a lane that starts at 1% of total blockspace, and increase it gradually to 100%, on any timeframe, as we gain confidence in its effects.

See (working) spec here: https://skip-protocol.notion.site/RFC-011-FeeMarket-Module-c4a7d374854e42acb3ddf752d8cc8e72?pvs=4

2 Likes

Here are my assumptions.

  1. The current Cosmos mempool is riddled with extensive security flaws and will not be a part of Cosmos in the near term.

  2. The idea of a widely gossiped mempool makes no sense and is not part of any successful blockchain’s block proposal process.

  3. There are a bunch of flaws in post-proof-of-stake Ethereum related to EIP-1559. The reality is that 50% of Ethereum block space is DEX swaps, and CEX/DEX arbitrage is a big consumer of Ethereum block space.

1 Like

What are the flaws? The change to proof-of-stake on Ethereum happened over a year ago and the implementation of EIP-1559 went live over two years ago, and all is well as far as I can tell. I ask sincerely since I’m 100% sure that you know more about the subject than I do.

What kind of security flaws does the current Cosmos mempool have?

Could you give reasoning for it?

Security flaw of the current mempool design.

The current mempool allows malicious nodes to join the network and flood their peers with transactions until those peers fall out of sync with the network. Peers are allowed to consume unlimited compute and bandwidth by spamming nodes.

1 Like

EIP-1559 was designed for a proof of work system where it was unpredictable who would propose the next block. Once Ethereum moved to proof of stake, a number of attacks became cheaper, and EIP-1559 now interacts in problematic ways with the rest of the protocol.

There are many documented problems with EIP-1559. It enables a bunch of cheap censorship attacks. For instance, it might be in the interest of one malicious proposer to increase the costs of the next proposer to include transactions by filling prior blocks with txs.

There is really zero reason to believe that blocks being full represents real user demand rather than value extraction opportunities by validators.

Generally Cosmos validators tend to be more altruistic and less rational than Ethereum validators, but there are a lot of potential problems.

Here are some additional references about problems that emerged once EIP-1559 was applied in a proof of stake setting.

2 Likes

Why is it a flaw? All txs must have a tx fee. Even if there is a flood of txs, it doesn’t matter because they will pay the fee.
Txs without a tx fee will be dropped by peers automatically.

It could be a flaw only if the current mempool accepts FREE txs (no tx fee included).

There is a lot to be said about the references you posted here, some of the analysis hinges on control over the gas limit (in the first reference it also incorrectly assumes that you can change it without changing the gas target), which is not at all a requirement of EIP-1559 anyways. To me, the most relevant flaws of EIP-1559 are the following:

  • The mechanism can be defeated by (smart contract) coordination. This is pretty much true of any other mechanism, and I think the details of how this is done matter quite a bit to think through defences.
  • Not giving part of the fee to the block producer (BP) mechanically lowers censorship-resistance, as the value for the BP to include a transaction is no longer equal to the fee-value of its inclusion, so a censor only needs to promise to pay the partial fee received by the BP + epsilon to convince the BP to leave out a transaction. Note that the censor would actually need to pay the partial fee + the MEV induced by the transaction + epsilon, in a fully rational model and without a mechanism for mev-capture.
    • That being said, I am actually not convinced that the fee market should be the place to enforce censorship-resistance, since high CR is then predicated upon high fees, which are undesirable by construction. In an ideal world, users pay subcent fees to transact, which in this case means anyone’s tx is cheap to censor.
    • The naive censorship attack also doesn’t hold when it is fully played out. Naively, a user makes the first move and it is their only move: they issue a tx that declares some priority fee PF for the BP. The censor then replies by bribing the BP to not insert the tx, against a payment of PF + epsilon. But if the user had the opportunity to move again, the user would be willing to then keep raising their PF until the total fee they paid was equal to their willingness to pay. In other words, in a fully rational model, the censor must pay the willingness to pay of the user minus the base fee to censor. WTP is assumed to be many multiples beyond the base fee in a world with low fees, so this would not significantly decrease the cost of censorship to the censor.
    • The question is whether the user can make this second move (and possibly iterate until they either reach their WTP or the censor decides it’s not worth it anymore). In my opinion, if we assume a world where the per-tx bribing infrastructure is mature enough for censors to bribe BPs, we should also assume that an equivalently mature infrastructure would exist for users to reply to the bribe. (Note that per-tx bribing infrastructure is very different from per-block bribing infrastructure, which is more like PBS)
    • Note also that a world where the user is always fully extorted for their surplus is a terrible world, and we better hope that there is a better solution. So if fee markets aren’t the place to resolve CR, I believe Multiplicity-style gadgets are better suited instead.
  • Known proposers might give opportunities for BPs to schedule load in their favour. This has also been known for a while, but again I think the details matter quite a bit. You can schedule load ex ante but demand is not known in advance, so it might not turn out in your favour.
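
To make the bribery argument in the bullets above concrete, here is a small worked sketch comparing the censor's cost in the one-shot model and in the iterated model. The function names and the integer fee units are hypothetical, MEV is ignored, and adding epsilon in the iterated case is an assumption carried over from the one-shot case.

```python
def naive_censor_cost(priority_fee: int, epsilon: int = 1) -> int:
    """One-shot model: the user declares PF once, so the censor only has to
    outbid what the block producer would earn from inclusion."""
    return priority_fee + epsilon


def iterated_censor_cost(willingness_to_pay: int, base_fee: int, epsilon: int = 1) -> int:
    """Iterated model: the user keeps raising PF until base fee + PF equals
    their willingness to pay, so the censor must outbid WTP - base fee."""
    max_priority_fee = max(willingness_to_pay - base_fee, 0)
    return max_priority_fee + epsilon


# Hypothetical numbers: base fee 10, initial tip 2, willingness to pay 1000.
print(naive_censor_cost(priority_fee=2))                          # 3
print(iterated_censor_cost(willingness_to_pay=1000, base_fee=10))  # 991
```

With WTP far above the base fee, the iterated cost is close to WTP itself, which is the point made above about low-fee environments.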

As you mentioned, Cosmos validators being more altruistic than Ethereum validators seems to me an argument for why these are less of a problem here, but still problems to some extent.

To EIP-1559, value extraction opportunities by validators is user demand. EIP-1559 (at least in Ethereum) doesn’t intend to price on-chain economic opportunities accurately, it merely attempts to arrive at a price for network resources which is compatible with long-term targets and short-term limits. In fact, as recent literature shows, the willingness to pay base fee is not always a good signal for the BP of whether the transaction should be included or not: a high MEV tx which carries low user fees (below what base fee requires) would induce a rational BP to subsidise the user tx so that they can include it and extract the MEV from it. Such scenarios might be good for users, which is why I suggested looking into charging base fee over the block rather than per-transaction, as is currently done in Ethereum. These scenarios lower the utility of the base fee as an oracle for inclusion (in the sense that willingness to pay basefee should highly correlate with your chances for inclusion), but here the lanes explored by Skip might allow you to discriminate between different types of demand.

It seems that the fee market question here is entangled with other questions relative to the mempool, on which I have less of an opinion, but I hope these arguments illuminate the conversation around EIP-1559, and why it may or may not be desirable in this case.

16 Likes

@barnabemonnot’s Cosmos debut! Thanks so much for writing this up. Also for anyone unaware, Barnabé is one of the people at the Ethereum Foundation who led the EIP-1559 research effort. He has published extensively on this topic, but I’ll post a short primer he wrote a while back that may be helpful for some here.

Barnabé’s comment addresses Zaki’s criticism about validators’ ability to manipulate the base fee and/or block size via consecutive proposals (block size and transaction fee are dynamically coupled in EIP-1559). To summarize, it is possible to manipulate, however it is also costly to do so. EIP-1559 is not a holistic solution to blockspace allocation, rather it introduces a commitment to pay on inclusion for anyone submitting a transaction.

Now Zaki also raises a second issue regarding the mempool architecture itself, which is really an orthogonal concern. The fee market aims to price transaction inclusion. The mempool is a distribution system for candidate transactions run by the validators. Today this gossip network is part of Comet, which passes data to the Cosmos SDK and Block SDK module for processing. The Comet gossip system is probably the most cursed part of the entire Cosmos stack. It is extremely inefficient and there have been multiple failed rewrite attempts, in part due to how interwoven the code is with the rest of the system. Skip is well aware of this issue and this proposal does not address the gossip layer. We have been in dialogue with Informal and Binary about possible solutions, however it’s a very large engineering effort that should be led by the teams closest to that part of the codebase. We will continue to support in whatever way we can moving forward.

Full DDOS prevention will require addressing both the fee market and transaction intake system, so it’s important to note that the addition of EIP-1559 is only a partial mitigation, and an incremental step toward making the Hub more secure.

12 Likes

i love this discussion even if i understand 1/10 of it.

glad Barnabe came here btw!

@Elijah would be nice to have your (multiplicity’s) point of view too :wink:

1 Like

Great discussion, and thanks @barnabemonnot for chiming in - great to see you here! Just wanted to share some notes on the Comet mempool and try to decouple a few things.

The mempool does have a very inefficient flooding (“push”) protocol to gossip txs, and neither the mempool nor the underlying peer connection degrade gracefully. However the expectation is that spam prevention is handled at the app-layer by tx fees. The mempool’s interface to this is the CheckTx method of ABCI, which allows the application to accept or reject txs into the mempool (i.e. by checking for fees, signature, etc). Only if a tx is accepted into the mempool by the application does the mempool start gossipping it.
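
The gating described above can be sketched roughly as follows. This is a toy model, not the ABCI or CometBFT API: `Tx`, `check_tx`, and `Mempool` are hypothetical names, and the `gossiped` list is a stand-in for actually broadcasting to peers.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Tx:
    sender: str
    fee: int
    signature_valid: bool


def check_tx(tx: Tx, min_fee: int) -> bool:
    """App-side admission check (hypothetical): valid signature and enough fee."""
    return tx.signature_valid and tx.fee >= min_fee


@dataclass
class Mempool:
    check: Callable[[Tx], bool]  # supplied by the application
    pending: List[Tx] = field(default_factory=list)
    gossiped: List[Tx] = field(default_factory=list)

    def submit(self, tx: Tx) -> bool:
        # Rejected txs never enter the mempool and are never gossiped.
        if not self.check(tx):
            return False
        self.pending.append(tx)
        self.gossiped.append(tx)  # stand-in for broadcasting to peers
        return True


pool = Mempool(check=lambda tx: check_tx(tx, min_fee=10))
print(pool.submit(Tx("alice", fee=10, signature_valid=True)))  # True
print(pool.submit(Tx("spam", fee=0, signature_valid=True)))    # False
```

The sketch shows why the quality of the app's fee check matters: whatever it lets through is what the flooding protocol then amplifies.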

This is not to say there’s no problems with the mempool design, but ultimately entry into the mempool is gated by the app’s fee system, and a fee system that can adequately modulate entrance into the mempool has been a gaping hole in the ecosystem forever. This is compounded further by a desire to accept low or zero fee IBC txs to improve UX and lower burden on relayers.

Currently validators set subjective min fees in their local config, and increase them / restart nodes as necessary to respond to spam. This is obviously not really sustainable. Apps could also automatically adjust this min fee locally to help validators respond in a more automated way to increased load in their local mempool without encoding something in protocol - this would potentially be easier and less controversial than an in-protocol fee system. Validators could set a local floor and increase based on a PID or some controller that monitors the local mempool size. It would probably need to be tuned a bit. Of course then there’s also the possibility of an in-protocol fee adjustment mechanism like EIP-1559 or variations thereupon, but there’s plenty of considerations around that, as others are discussing.
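
A minimal sketch of such a local controller is below, assuming a simple proportional-integral form (the derivative term of a full PID is left out for brevity). The class name, gains, and target size are hypothetical and untuned; nothing here corresponds to an existing Cosmos SDK or Comet setting.

```python
from dataclasses import dataclass


@dataclass
class LocalMinFeeController:
    floor: float         # validator's configured minimum; never go below it
    target_size: int     # mempool size the validator is comfortable with
    kp: float = 0.001    # proportional gain (assumption, needs tuning)
    ki: float = 0.0001   # integral gain (assumption, needs tuning)
    integral: float = 0.0

    def update(self, mempool_size: int) -> float:
        """Return the local min fee to enforce given the current mempool size."""
        error = mempool_size - self.target_size
        # Clamp at zero so quiet periods do not build up negative "credit".
        self.integral = max(self.integral + error, 0.0)
        adjustment = self.kp * error + self.ki * self.integral
        return max(self.floor, self.floor + adjustment)


controller = LocalMinFeeController(floor=1.0, target_size=1000)
for size in (500, 3000, 3000, 800):
    print(size, controller.update(size))
```

Because this runs purely on local state, each validator could tune or disable it independently, which is the appeal over an in-protocol mechanism mentioned above.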

There are a few different things being done to improve bandwidth use in the mempool. The most immediately useful is probably this patch I started work on last week (based on an idea from the Comet team’s Dan) that just reduces the number of peers you gossip mempool txs to. Many validators want large peer connectivity to reduce missed blocks, but they really only need to gossip txs to a subset of those peers. This should massively reduce amplification and can be rolled out by individual validators so it’s easy to start testing, perhaps as soon as next week.
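
The idea of gossiping to only a subset of connected peers can be sketched as below. This is an illustration only: `pick_gossip_peers` and `max_gossip_peers` are hypothetical names, and uniform random selection is an assumption, not a description of how the actual patch chooses peers.

```python
import random
from typing import List, Optional, Sequence


def pick_gossip_peers(
    peers: Sequence[str],
    max_gossip_peers: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose at most max_gossip_peers of the connected peers to send mempool txs to."""
    rng = rng or random.Random()
    if max_gossip_peers >= len(peers):
        return list(peers)
    # Sample without replacement so no peer is picked twice.
    return rng.sample(list(peers), max_gossip_peers)


connected = [f"peer{i}" for i in range(40)]
targets = pick_gossip_peers(connected, max_gossip_peers=8)
# The node stays connected to all 40 peers but sends each tx to only 8 of them.
print(len(connected), len(targets))
```

Per-node outbound tx traffic then scales with the cap rather than with total connectivity, while consensus messages can still use every connection.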

There’s also been a bunch of work on a push-pull mempool initially prototyped by Celestia that we upstreamed and fixed to work with Comet. This allows the mempool to move more towards being request (“pull”) based, which can greatly limit amplification, ultimately at the cost of latency. This is currently targeting Comet v1.0, whose alpha release should hopefully come out before end of year, but unfortunately this might not be able to help with current issues unless we backport. It also seems to be most effective for large txs.

There were also some improvements earlier this year to the block part gossip in the consensus reactor, which isn’t related to the mempool, but can significantly improve bandwidth usage overall under high load. Those improvements are also only targeting Comet v1.0, but in principle it shouldn’t be hard to backport if there’s significant demand.

There might be more we can do to improve the mempool and relieve some of the pressure here, but ultimately there seems to be widespread agreement that the mempool should just not be Comet’s concern. The fixes we have should help stabilize things, but the ecosystem probably needs to move towards alternative solutions that move mempool functionality into the app or validator side car processes. Thane started an ADR on that here - we’d welcome any feedback there!

9 Likes

agree with zaki, look forward to seeing his proposal

The challenge here is that the Cosmos networking stack doesn’t actually have a concept of “validators”. It only knows about “nodes”.

Nodes then implement policies about what they propagate to other nodes but the network doesnā€™t know about where the next block proposal is gonna come from.

So a tx is ingested into the network at the first node, floods through some number of nodes with unknown filtering policies and then eventually finds itself in the block proposing node, which also has opaque policies. This process applies zero discrimination and is subject to spam and contention from low-value txs.

A user only gets feedback from the first node in this chain and that feedback is essentially irrelevant.

Once you start introducing adaptive behavior into this architecture, the net result is that QoS starts to degrade in completely opaque ways.

Now an on chain adaptive oracle is bad because it’s constantly lagging.

Having your network reacting seconds later to issues will create the experience of a profoundly unreliable system.

A design goal is to have systems that can provide back pressure and credible commitments to inclusion in <100 ms, and provide either directly to a client.

2 Likes

To summarize the first half of Zaki’s argument:

  1. The Comet network layer does not innately distinguish between validators and any other node
  2. The transaction propagation policy of every node is discretionary and private
  3. The gestalt behavior of such a system is highly opaque and vulnerable to attack

We are 100% aligned on these points. It’s a miracle this system has worked up until now. Everyone agrees the whole transaction ingress system needs to be replaced. The Skip Sentinel was such a bypass mechanism, and the in-protocol oracle system we are working on likewise provides a side-car to bypass Comet transaction intake. We have, and will continue to build systems that provide more robust ways to ingest transactions.

The part of Zaki’s argument I disagree with is that the networking layer + the EIP-1559 fee controller will have a reinforcing interaction that will further degrade the system. This argument hinges on a very nuanced premise that the timescale on which the fee controller reacts to information is of a different order than what the gossip system exposes to the validator. Let’s break down the possible scenarios to help with our intuition.

So under normal operation, the system should work as follows:

  • mempool signals decreasing load, controller detects decreasing load → fees decrease
  • mempool signals increasing load, controller detects increasing load → fees increase

It’s valid to say that in the above system the response rate will lead to suboptimal fee pricing, but this is really the price one pays for in-protocol automation and a public fee oracle. In collaboration with the Osmosis team, we actually plan to augment EIP-1559 to reduce this economic leakage via an adaptive learning rate for fee repricing. You can read more about that here.

Ok so now the two scenarios Zaki is concerned about:

  • mempool signals decreasing load, controller detects increasing load → fees increase
  • mempool signals increasing load, controller detects decreasing load → fees decrease

The first scenario is only possible if during times of low network activity the validator artificially stuffs blocks. In this case validators must pay to manipulate the fee, even if Comet assigns them to propose a series of consecutive blocks. If such an attack is happening, the behavior is also easily detectable because the gossip network is otherwise quiet.

The second scenario is really the most concerning, and this is what Jacob has been raising a flag about recently: that the gossip layer can be overwhelmed, either by large messages or many small messages. Poor propagation may induce users or nodes to retry submission and further amplify network traffic. Now the only way the controller would fail to detect increased transaction load would be the chain being completely clogged and simply unable to complete a consensus round, in which case validators will have to coordinate to wall their nodes off from new transactions, manually clear their mempool, peer privately, and reload transactions from another source to recover. Alternatively, validators end up with a saturated mempool, but only a small portion of transactions can be used to construct a block. The proposer publishes an unfilled block, it’s accepted by the network, but fees decrease because the in-protocol signal is that transaction demand is low. Recovery from this state would look similar to Comet failing to finalize rounds.

Now the concern is that the controller is pushing the fee down as all this is going on. And here I’d like to note that Skip’s version of EIP-1559 will have a global min fee that can be set by governance, so while this is a real failure mode, it’s no worse than what we have today and should be more robust unless the network is actively pushed into this state. For this reason I’ve tried to be careful to state that the proposed work is not a complete solution to the transaction supply chain problems faced by Comet and the Hub today, it is however a small step that chips away at one specific problem: repricing the cost of transaction inclusion given at least a directionally accurate indication of load.
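
The failure mode and the role of the floor can be illustrated with a short simulation: a run of unfilled blocks drives an EIP-1559-style base fee down, but it stops at the global minimum. All names and numbers below are hypothetical, and the update rule is the plain Ethereum-style one rather than Skip's actual implementation.

```python
from typing import List


def next_base_fee_with_floor(
    base_fee: float,
    gas_used: int,
    gas_target: int,
    min_base_fee: float,           # stand-in for the governance-set global min fee
    learning_rate: float = 0.125,  # assumption: Ethereum's 1/8
) -> float:
    delta = (gas_used - gas_target) / gas_target
    return max(min_base_fee, base_fee * (1 + learning_rate * delta))


def simulate(
    base_fee: float, blocks_gas_used: List[int], gas_target: int, min_base_fee: float
) -> List[float]:
    """Apply the update once per block and return the base fee after each block."""
    fees = []
    for gas_used in blocks_gas_used:
        base_fee = next_base_fee_with_floor(base_fee, gas_used, gas_target, min_base_fee)
        fees.append(base_fee)
    return fees


# Ten empty blocks in a row, e.g. proposers unable to build from a saturated mempool.
fees = simulate(base_fee=1.0, blocks_gas_used=[0] * 10, gas_target=100, min_base_fee=0.5)
print(fees)
```

In this toy run the fee falls for the first few blocks and then sits at the floor, so the worst case is the static minimum fee that validators effectively rely on today.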

5 Likes

We’re sorry we couldn’t provide feedback on this proposal before having it on-chain. This mempool architecture discussion was much needed indeed. As we are late to the party, we will skip on our regular 3-step breakdown (context, analysis & solution). If we had to share our opinion on the matter, this is probably the best quote we would like to take out of this entire discussion:

We are entirely aligned with this analysis.

CONCLUSION:
We think the best strategy to overcome the situation is a two-step process.
The first step addresses the base fee, to prevent the basic direct DDOS attack via economic measures.
The second should be addressed directly at the SDK layer regarding the Comet gossip system, which clearly needs a refresher. There seem to be very interesting paths explored in this thread and we think @ebuchman has raised an elegant longer-term solution by making the mempool more “pull” based. We also believe this is the most suitable strategy for a durable cure. In the meantime, if development takes longer than expected, it would be worth exploring a shorter-term fix by simply limiting the number of peers to gossip mempool transactions to. This would at least provide a temporary solution until the more sophisticated one is implemented.

3 Likes

You think wrong. Notional did the research, learned of the CAT mempool by @cmwaters from @valardragon, and made the first PR to Comet to adopt CAT. @AdiSeredinschi told us that CAT was non-viable, and bucky, as well as all of Informal, was fully disengaged from the process of solutions engineering for the p2p storms issue.

The only contribution Informal made here was slowing down the fix and creating horrific vibes.

Proof:

Our intention was to get CAT into the v0.34.x branch as rapidly as possible so that it could be deployed onto the hub. Alas, Informal has merge rights on v0.34.x and the hub, so a good deal of stuff tends to not get merged or even reviewed.

Informal is not a collaborative team, and has recently threatened to ban me from contributing to CometBFT despite the fact that I spent 2.5 months on reproducing the p2p storms issue.

If you have not voted on proposal 839, don’t. Let it fail due to not reaching quorum. If you have voted, change to veto.

1 Like

I don’t see why we shouldn’t give this a try, it is an improvement compared to the current fee market. YES!

1 Like