Our understanding of the Cosmos Hub mempool issues

EDIT: @Rarma used his own homebrew BananaKing, not one he got from Jacob Gadikian

Around 11/2/2023, @Rarma submitted a large volume of transactions to the Cosmos Hub using a shell script. I’ll refer to this as Someone Posting A Lot of Messages, aka S.P.A.M.

The transactions submitted were of the “BananaKing” type, IBC transfers with a long random string inserted into the memo field. This random string bulked up the transaction size. Due to gas pricing factors that I will get into further down, it also didn’t cost a huge amount of gas (only 2,000,000 gas units per tx).

However, there doesn’t seem to be anything particularly malicious about “BananaKing” transactions. Many legitimate IBC transfer transactions could have large memo fields. A good example is a complicated workflow using Packet Forward Middleware and IBC hooks.

The script ran for a short time and got a very large number of transactions into the network. Over the next few days, a number of validators and full nodes struggled, with missed blocks and network saturation, and one of the S.P.A.M. transactions got into a block every few minutes, even though the script had stopped running days before.

I found this very interesting. Working together with the Informal Comet team, we (the Informal Hub team) developed a hypothesis for what was going on. There are several components.

Excessive mempool size

Networking and uptime were degraded for some validators and full nodes. They seemed to be under a heavy load. Not all validators were affected, which I will address in the next section. I’ll call the ones affected the “struggling subgraph”.

The most likely cause was something called the mempool. When blockchain nodes receive a transaction that a user would like to put on the chain, they store it on their system in something called a mempool, and send it out to every validator they are connected to. By sending transactions from the mempool around like this (a process called “gossip”), validators make sure that everyone has the transactions and whoever proposes the next block can put them in.
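
For readers who want the shape of that flow in code, here is a toy version of mempool gossip (a deliberately simplified sketch, not the actual CometBFT reactor logic):

```go
package mempool

// Rough sketch of the flow described above (simplified; not the actual
// CometBFT reactor code): a node validates an incoming tx, stores it in its
// mempool, and forwards it to its peers, who each do the same.
type node struct {
	mempool map[string][]byte // txHash -> tx bytes
	peers   []*node
}

func (n *node) receiveTx(hash string, tx []byte) {
	if _, seen := n.mempool[hash]; seen {
		return // already have it, so don't gossip it again
	}
	if !checkTx(tx) {
		return // fails local validity rules (fees, etc.), so it is dropped
	}
	n.mempool[hash] = tx
	for _, p := range n.peers {
		p.receiveTx(hash, tx) // gossip to everyone we're connected to
	}
}

// checkTx stands in for the application's CheckTx call.
func checkTx(tx []byte) bool { return len(tx) > 0 }
```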

During the S.P.A.M. event, so many transactions were sent that it filled the mempools of the affected validators (the “struggling subgraph”). They also seem to have used a ton of bandwidth gossiping the S.P.A.M. transactions around. This is likely to have caused the strain on those nodes.

But even if nodes were gossiping unending spam transactions, why should the rest of their system be so heavily affected? Isn’t there some setting to limit the resource consumption of the mempool?

Yes. There is a setting called max_txs_bytes. This limits the size of the mempool. On the Hub, the default is currently around 1gb. This has been a default setting since way back in the day. As far as we can tell, there is no reason to have a mempool this large. Our reasoning is as follows: a user is going to retry a transaction themselves if it takes more than a few minutes. There seems to be no reason for the mempool to store a backlog of transactions that could take hours to clear. A mempool size of 2x-10x the block size should be entirely sufficient.

Maybe someone who was working on the Hub at launch can chime in if we’re wrong and they know why a default mempool of 1gb was chosen.

A relatively conservative adjustment of max_txs_bytes to 10mb (50x a 200kb blocksize, and 5x a 2mb blocksize) could cut mempool bandwidth and memory usage by 100x. If there are no unexpected side effects of reducing the size, this should ensure that nodes run smoothly even during a S.P.A.M. scenario.
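
For concreteness, here is the arithmetic behind that claim as a small Go snippet. The 1gb, 200kb, and 2mb figures are the ones quoted above (actual defaults on a given node may differ), and the small differences from the 50x/100x numbers in the text are just binary vs decimal rounding:

```go
package main

import "fmt"

func main() {
	const (
		defaultMaxTxsBytes  = 1 << 30   // ~1gb, the current max_txs_bytes default in config.toml
		proposedMaxTxsBytes = 10 << 20  // the suggested 10mb cap
		smallBlock          = 200 << 10 // ~200kb, today's max block size
		largeBlock          = 2 << 20   // 2mb, the proposed max block size
	)

	fmt.Printf("10mb mempool = %dx a 200kb block, %dx a 2mb block\n",
		proposedMaxTxsBytes/smallBlock, proposedMaxTxsBytes/largeBlock)
	fmt.Printf("worst-case mempool memory drops by ~%dx\n",
		defaultMaxTxsBytes/proposedMaxTxsBytes)
}
```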

Hypha is currently testing this adjustment, and should have results by Friday. Hopefully it will help performance without any bad side effects, and when we confirm it, we will work to roll it out across all Hub validators.

The Struggling Subgraph problem (inconsistent transaction validity criteria)

Even with the mempool size thing, there were still a couple of mysteries. Why did only a few nodes struggle? Why were these nodes the only ones who put the S.P.A.M. transactions in blocks? Why did the S.P.A.M. keep going days after the script stopped?

Most validators have the recommended setting from our documentation: a minimum gas price of 0.0025uatom. But a minority of validators had a different, lower gas price set. S.P.A.M. transactions were rejected by those with the recommended setting, and accepted by those with a different (lower) setting.
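
As a rough sketch of the kind of check involved (not the SDK’s actual ante handler code): in CheckTx a node compares the fee attached to a tx against its locally configured minimum gas price times the tx’s gas limit, so the same tx can pass on one node and fail on another. The attached fee below is an assumed number for illustration; the fees Rarma actually attached aren’t stated here.

```go
package main

import "fmt"

// passesFeeGate is a simplified version of the min-gas-price check a node
// applies in CheckTx: required fee = gas limit * locally configured price.
func passesFeeGate(feeUatom float64, gasLimit uint64, minGasPrice float64) bool {
	required := float64(gasLimit) * minGasPrice
	return feeUatom >= required
}

func main() {
	const gasLimit = 2_000_000 // gas used by one of the S.P.A.M. txs, per the post
	const attachedFee = 1000.0 // uatom; assumed for illustration only

	fmt.Println(passesFeeGate(attachedFee, gasLimit, 0.0025)) // recommended setting: needs 5000uatom -> rejected
	fmt.Println(passesFeeGate(attachedFee, gasLimit, 0))      // zero/low-price node: accepted
}
```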

Because only a minority of validators were even handling these transactions, they only made it into a block every few minutes. This is why they continued hanging around for so long after the script stopped running.

If a minority of validators on the network have different transaction inclusion criteria than the rest, it is possible to fill their mempools with transactions that use up resources but only get into blocks very slowly, if at all. These nodes form a “struggling subgraph” in the gossip network.

So, it is important that all validators share the same transaction inclusion criteria, in this case gas prices. One step towards this is prop 843, which sets a global minimum price of 0.005uatom.

We will audit the default config for other settings which could cause inconsistent transaction validity criteria, and work with validators to make sure that the network is consistent. We will also work with the Cosmos-SDK team to think about whether it makes sense to disable customization of these settings on a per-node basis.

Other possible improvements

We make two recommendations above:

  1. Reduce the mempool size to something that can get cleared out within a few blocks to cut down on unnecessary mempool resource usage.
  2. Make sure that validators do not have inconsistent transaction validity criteria to avoid a struggling subgraph problem.

We are still working to test these changes, but if they work, they could make it so that the network behaves a lot better in a S.P.A.M. scenario. I would go so far as to say that if these changes work, CometBFT and the network will be functioning as intended. However, even if the network keeps humming along nicely, there are still ways for the Hub to allow more legitimate transactions to get in, and to make more money during a S.P.A.M. scenario.

Increasing block size

If the recommendations above work and there is no network degradation, a S.P.A.M. scenario is no longer a problem, since the spammer is using the chain for what it’s meant for: paying to put transactions into blocks. But it’s still not great. Other people’s transactions are not going to get in often, and the spammer will really not need to pay all that much in gas to tie up the Hub’s block space for a period of time.

It’s not really a technical problem if there’s no network degradation; the Hub is just not charging enough for its time.

One way to fix this is to simply raise the global min gas price to somewhere above 0.005uatom, to correctly price the Hub’s blockspace. However, this isn’t great for regular users.

We can also make the Hub’s blocks bigger. Prop 845 proposes just that. Raising the block size from 200kb to 2mb means that around 10x the number of S.P.A.M. transactions get in during a given period of time, earning the Hub 10x the fees from full blocks. This makes it 10x more expensive to tie up the block space for a given period of time, and makes the Hub 10x more money during it.

Optimizing gas pricing

Another factor is that this type of “BananaKing” transaction writes a lot of data, while only using a moderate amount of gas. It’s possible that gas is mispriced in the Cosmos-SDK when it comes to data writes and we should be charging more. Gas tuning is a dark art, and we’ve done very little of it in Cosmos. Ethereum has made a bunch of small tweaks to gas prices over the years.

Transaction size seems to be one of the heaviest sources of load for gossip, and it sticks around forever in the blockchain state afterward. Data writes should probably be one of the more expensive things in terms of gas.
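
To make that concrete: as far as I know, the SDK’s ante handler already charges a flat per-byte cost (the auth param TxSizeCostPerByte, 10 gas per byte by default), so transaction size does show up in gas, just arguably too cheaply. A rough illustration follows; the memo and base tx sizes are assumptions for illustration, not measurements of the actual S.P.A.M. txs.

```go
package main

import "fmt"

func main() {
	const (
		txSizeCostPerByte = 10      // Cosmos SDK default for the auth param TxSizeCostPerByte
		memoBytes         = 100_000 // assumed ~100kb random memo, for illustration only
		baseTxBytes       = 500     // rough size of an IBC transfer without the memo (assumption)
	)

	byteGas := txSizeCostPerByte * (memoBytes + baseTxBytes)
	fmt.Printf("gas charged just for tx bytes: %d\n", byteGas) // ~1,005,000 gas
	fmt.Println("at 0.0025uatom/gas that is about", float64(byteGas)*0.0025, "uatom in fees")
}
```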

Comet bandwidth improvements

The Comet team has also been working on bandwidth improvements in Comet throughout the year, both for block gossiping and for the mempool, all of which may help with S.P.A.M. events like this. These improvements were largely summarized by Bucky recently on the forum, as part of the larger discussion on integrating a fee market into the Hub. See his post for more details.

Conclusion and a note on fee markets

So, in conclusion, we are making two recommendations (elimination of inconsistent transaction validity criteria, and reduction of mempool size) which should allow the network to handle load from a S.P.A.M. scenario gracefully. We also support a recommendation to raise block sizes to 2mb, just to increase the network’s throughput, which is good in general.

But as I’ve alluded to, there is something else that will work synergistically with the above recommendations to make the network run smoothly, as well as cutting any S.P.A.M. scenario short and making the network a lot of money in gas fees: fee markets.

Fee markets raise the gas price when there is a lot of demand for block space. Under normal circumstances, they make the chain a lot of money during high usage, while giving users low prices when usage is low.

They also have a lot of benefits under an S.P.A.M. scenario like the one described here. As a large volume of transactions comes in, blocks start getting full. The price ramps up, which automatically removes a lot of the spammer’s transactions from the mempool. If the spammer raises the price, they are then paying even more money. It quickly becomes completely non-viable.
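
For intuition, here is a generic EIP-1559-style base fee update (a sketch only; not necessarily the exact mechanism prop 842 / Skip’s BlockSDK would use, and the gas numbers are made up): when blocks run above the target, the minimum price ratchets up every block, which prices out a sustained spammer fairly quickly.

```go
package main

import "fmt"

// nextBaseFee is a generic EIP-1559-style update rule: the base fee moves by
// up to 1/8 per block toward whatever level keeps blocks at the target size.
func nextBaseFee(baseFee, gasUsed, gasTarget float64) float64 {
	delta := baseFee * (gasUsed - gasTarget) / gasTarget / 8
	return baseFee + delta
}

func main() {
	price := 0.005 // uatom per gas, starting at the proposed global minimum
	gasTarget := 50_000_000.0
	gasUsed := 100_000_000.0 // spammer keeps blocks completely full (2x the target)

	for block := 1; block <= 20; block++ {
		price = nextBaseFee(price, gasUsed, gasTarget)
	}
	fmt.Printf("price after 20 full blocks: %.4fuatom/gas (~%.1fx the start)\n", price, price/0.005)
}
```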

A fee market can actually fix a lot of problems, even without the other fixes I’ve talked about here. A fee market smoothes over a lot of other potential performance tuning issues, while also improving the chain’s economics. Prop 842 proposes to install a fee market on the Hub through Skip’s BlockSDK, and I am very excited to get it installed on the Hub.

To underscore, there are architectural and design problems in the Cosmos fee system and mempool. These are well known and being worked on through a variety of efforts. Validators are expected to respond to S.P.A.M. by adjusting their fee and mempool settings. The Comet team has been working all year on ways to reduce bandwidth usage and has some patches that should help (as I summarized here). Ultimately Cosmos needs a more sustainable fee system and mechanisms for building blocks and gossiping txs in app-specific ways.

Appreciate you highlighting the props we submitted to mitigate against this. Afaik Jacob has not given the script to anyone. The method Jacob has refined through weeks of research has far more severe consequences than what Rarma submitted onchain, and it is most certainly not something we ever want to see tentatively tested on mainnet. There are a few ways to skin this cat, as we have highlighted in the report we sent to the relevant teams. 100% share your thoughts on a fee market.

One of the points I am trying to make is that these two properties are most of what is good about the current mempool design:

  1. Txs rarely get stuck in subgraphs that never reach a block proposer
  2. Latency from a random part of the node graph to the proposer is generally <1 second.

As you start to harden the mempool against attacks, these properties diminish and the current design just becomes less and less usable.

These scenarios seem to primarily consider situations where innocent nodes are recruited via RPC to participate in the attack because of the tx fee policies on their RPC.

The scenario that worries me a lot is a fleet of bots that connect to the p2p layer rather than the RPC layer to inject txs into the network and also try to trick nodes into connecting to them.

The way to mitigate this is going to be tracking source IP addresses and rate limiting connection churn. These limits would have to be correctly tuned.
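
A minimal sketch of what that kind of mitigation might look like (hypothetical; not an existing CometBFT feature): track recent connection attempts per source IP and refuse peers that churn too fast. Picking the window and limit is the hard part, a point raised again later in the thread.

```go
package p2p

import (
	"net"
	"sync"
	"time"
)

// churnLimiter is a hypothetical per-IP connection rate limiter: it allows at
// most maxPerWindow new inbound connections from one source IP per window.
type churnLimiter struct {
	mu           sync.Mutex
	window       time.Duration
	maxPerWindow int
	attempts     map[string][]time.Time // source IP -> recent connection times
}

func newChurnLimiter(window time.Duration, maxPerWindow int) *churnLimiter {
	return &churnLimiter{
		window:       window,
		maxPerWindow: maxPerWindow,
		attempts:     map[string][]time.Time{},
	}
}

// Allow reports whether a new connection from addr should be accepted, and
// records the attempt if it is.
func (l *churnLimiter) Allow(addr net.IP) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	key := addr.String()
	now := time.Now()

	// Drop attempts that have fallen out of the window (in-place filter).
	recent := l.attempts[key][:0]
	for _, t := range l.attempts[key] {
		if now.Sub(t) < l.window {
			recent = append(recent, t)
		}
	}

	if len(recent) >= l.maxPerWindow {
		l.attempts[key] = recent
		return false
	}
	l.attempts[key] = append(recent, now)
	return true
}
```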

But if we eliminate the in-node mempool concept, we have a lot more degrees of freedom in designing a secure and performant system that offers back pressure in the form of rising fees, doesn’t require every node to handle bursts of expensive-to-verify messages, and ensures the consensus vote and block propagation layer remains available.

Jehan, how about receive addresses? Can legitimate IBC transactions have multi-megabyte receive address fields?

What has informal systems been doing since September 21 2023?

Do you and @ebuchman agree with @zmilosevic when he says:


Jacob, the issue you are talking about is a mempool/protocol fee design issue, so it is not a security issue; it is a complex design issue that will take some time to be properly designed and implemented. And it is not a Comet-only issue: it involves things to be designed and implemented at the level of the whole stack, and then every application will also need to implement what makes sense to them. We heard you when you surfaced it for the first time; at this point in time, there are workarounds that involve validators adjusting fees or someone proposing that the global fee be non-zero. I personally don’t believe that having a non-zero global fee is a solution for anything, as validators can already adjust it locally and the benefit is that we can be more adaptive. Changing global params takes weeks. You repeating the same thing every day will not change this reality. From our perspective we don’t see you trying to collaborate in a professional way, as each time you are not happy with our perspective you go public and talk shit for days. This does not seem like a good way to create a healthy relationship. We have tried putting some structure in place so we can make sure we have healthy communication between our two orgs, but I don’t see it being used. You prefer to attack us in public, and fair enough, but don’t be surprised when we say it does not work for us to work like this. On repositories we steward, we expect people to respect each other and to offer technical and product perspective. Everyone doing this is more than welcome and appreciated for their input. We don’t believe that there is room for personal attacks, and this is a very common practice in all decent open source projects, so nothing really new here. On our side, there is no problem with you and Notional contributing to projects, you are more than welcome, but we expect everyone to approach others with respect and to try to understand the other side.

That message, where @zmilosevic claims that there are no security problems, is dated November 6th, 2023.

Proof:

I strongly disagree with his claim.


Notional began to research this in a channel called invalid-block-parts, which was later renamed to p2p-storms. Informal team members, as well as Stride, Range Security, and Skip team members, were present from the start. We were working on this issue because it had been experienced on Stride.

Proof:

How was that harassment, @ebuchman @zmilosevic ?


I have been reporting the replication of incidents seen on Stride to you @jtremback since September 20, 2023.

Proof:


Informal team members began to leave on September 19, 2023.

Proof:


I attempted to add @Jessysaurusrex to invalid-block-parts on September 21, 2023. She left and subtweeted.

Proof:

That is to say: had the channel actually been read, what Informal is apparently only realizing today could have been known to Informal and Amulet as of the 21st of September, 2023.

The hub secures billions of dollars.

Cosmos itself secures many billions of dollars.

Therefore it’s very easy to see that Informal Systems has been gambling with billions of dollars. I do not approve of this.


I completed the reproduction and delivered scripts to ICFormulet on the 25th of September.

Proof:


Yes sir, you mean like we reported here


@jtremback we were taking down the Cosmos Hub replicated security testnet for 2 months before @Rarma did his thing. Why did it only become interesting then?

Because he hit mainnet?

Sir that is what I was trying to prevent, with zero help, and plenty of obstruction from ICFormulet.


Indeed, we’ve been discussing that since Istanbul. The only Informal Systems team member who participated in those conversations is @jtremback.

I’d also like to add another scenario that worries me a great deal: seeing the Hub get exploited, and Notional being blamed, despite doing all of this work to ensure that mainnet is safe.


Claim: informal has no grip on this issue because they made no effort to understand it.


**PLEASE NOTE THAT THIS IS FACTUAL INFORMATION, NOT A PERSONAL ATTACK**

That is absolutely correct.

Of course, in order to know that, Informal Systems would have needed to spend less time making personal attacks against me, such as claims that I’m unprofessional and “a hysterical child”, and that my incident report was nothing more than a series of personal attacks against Informal Systems team members. If any Informal Systems team member is aware of any personal attack made by myself or anyone at Notional, they should please let us know here in public.

But the honest truth is that they did not. Furthermore, Informal Systems has been in possession of numerous videos and statistical data, as well as the full code for the attack, since the 25th of September 2023.

**PLEASE NOTE THAT THIS IS FACTUAL INFORMATION, NOT A PERSONAL ATTACK**

I’m mostly looking at it from the perspective of trying to make sure that the network doesn’t struggle under a high volume of transactions. Limiting mempool gossip traffic by limiting the size of the mempool seems like a common sense step that I’m surprised nobody has suggested before.

Once you do this, the mempool could still be filled with crap but the nodes will be running fine.

With the addition of a fee market, it seems that any sustained attack will start to hit a limit as its transactions get in and are charged for gas. What’s the scenario that you’re imagining?

We studied small mempools in Istanbul. I also studied them back in 2020. We found it doesn’t really help. @joe-bowman will be familiar with the Istanbul experiments.

It helps if the only attack vector is via RPC methods on publicly available nodes.

Once you introduce “attack nodes” that stream large amounts of txs over p2p, those attack nodes are still able to introduce unstable subgraphs.

The only mitigations that I can think of are rate limiting both the amount of data a peer can send to you and the frequency with which new peers can connect.

But there are lots of scenarios where putting these limits in will slow recovery, for things like:

  1. Node catch-up
  2. A network self-healing after a large / high-compute block

If you remember, during the early eras of Osmosis the entire p2p network would collapse and be rebuilt after the Osmosis epoch.

If you start limiting connection rates and churn, you lose the self-healing function of the current design.

This. But as you are aware, Informal is 2.5 months behind and focused on blaming the reporter and banning them from the Comet repo.

Proof:

@zaki_iqlusion

[quote=“jtremback, post:8, topic:12040”]
Once you do this, the mempool could still be filled with crap but the nodes will be running fine.
[/quote]

You should base nothing on the kangaroo attack. Please run the attack yourself. I have asked you this many, many times since the 25th of September. Please do it.

That is how you will understand it.

I do not know what @Rarma did. I know what I can do. These aren’t the same thing.

Please note this is not a personal attack (I need to include this in all informal comms now)

block gossip

Hi @jtremback @ebuchman, I’m surprised that this analysis does not even touch on the issues with block gossip. Do you have an understanding of the issues with block gossip? In your opinion, do any such issues exist?

timeline

@jtremback @ebuchman @Jessysaurusrex

The timeline of this issue is incredibly important. Do you see any inaccuracies here? Please respond.

@zaki_iqlusion Is the timeline that I have presented accurate in your opinion?

Hey @jtremback, thanks for addressing this, and hopefully allowing open discussions/suggestions to take place to make CometBFT + Cosmos SDK more robust.

Wanted to pitch in our 2uatoms here.

The Keplr team has had a lot of experience running node endpoints, and has frequently run into issues where nodes would miss blocks at times when S.P.A.M. happens, though we haven’t done a full-on detailed investigation into this (as much as we’d love to, we don’t have sufficient resources on this side rn).

We hypothesize that the block-dropping issue is unrelated to the issues Jacob raised, block size, gas pricing, etc., but is something more fundamental.

In the words of my co-founder (since he’s the technical one, not me, so pardon if I mess up the technical nuance in translation): it seems likely that because ABCI is not multithreaded and runs checkTx() sequentially, checkTx() takes up the whole process when handling a massive amount of transactions. This leads to issues where important ABCI calls such as beginBlock(), endBlock(), deliverTx(), commit() (or even potentially other p2p comms) are stalled until the massive backlog of checkTx() calls has completed.

Parallelizing these queries works when they come in through the gRPC or REST interface, because they are concurrent at the SDK layer, but this is not the case at the abci_query layer, as far as our understanding goes.

I believe a similar issue was raised by Michael Fig from Agoric several years ago here: Allowing ABCI Queries while computing blocks · Issue #6899 · tendermint/tendermint · GitHub

Raising the gas prices, changing the bandwidth of mempool communications, adjusting max_txs_bytes, etc. feel like solutions at the high level, whereas the low-level issue is just how checkTx() is handled in relation to mission-critical processes such as beginBlock(), endBlock(), deliverTx(), commit().

Since beginBlock(), endBlock(), deliverTx(), commit() as a group should be handled sequentially, this can be left as-is, but maybe there is a way to multithread checkTx() and abci_query?

This also reflects our experience: running a CPU with stronger single-threaded performance on validators / endpoint nodes reduced the chance of missing blocks, likely because it powers through the checkTx() backlog faster and gets to the mission-critical processes such as beginBlock() sooner.
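
To illustrate the hypothesis (a deliberately simplified sketch, not the actual CometBFT or SDK code): if one mutex guards every ABCI call, a burst of checkTx() work competes directly with the consensus-critical calls for the same lock.

```go
package main

import "sync"

// Simplified model of the contention described above (not the real code):
// a single mutex serializes every ABCI call into the application.
type app struct {
	mu sync.Mutex
}

// CheckTx is called for every gossiped transaction and takes the same lock
// as block processing, so a large backlog keeps the lock busy.
func (a *app) CheckTx(tx []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()
	execute(tx) // ante handlers / fee checks against the last committed state
}

// BeginBlock/DeliverTx/EndBlock/Commit are collapsed into one call here; the
// point is that it must contend with the CheckTx burst for the same mutex.
func (a *app) ProcessBlock(txs [][]byte) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, tx := range txs {
		execute(tx)
	}
}

func execute(tx []byte) { _ = tx } // placeholder for real execution work

func main() {
	a := &app{}
	// A S.P.A.M. burst: thousands of CheckTx calls hammer the lock...
	for i := 0; i < 10_000; i++ {
		go a.CheckTx([]byte("spam"))
	}
	// ...while the node is also trying to process the next block.
	a.ProcessBlock([][]byte{[]byte("a real tx")})
}
```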

Any thoughts on this being the potential problem here?

If he had used the attack methods that I have developed, the Hub would have stopped.

I did not give them to @Rarma.

How sad that it took a real attack on mainnet to get Informal’s attention, when I have been taking down the testnet for months.

This isn’t stewardship; it borders on sabotage and is without a doubt gross negligence.

No one should blame the most responsive person at informal though (@jtremback)

Blame should rightly be placed on the executives who have been denying the entire thing while falsely claiming to be harassed, and simultaneously describing me as a hysterical child and kindergartner (actual harassment).

So, specifically I am naming @ebuchman and @zmilosevic.

Furthermore, blame should be placed on the security contractor who allowed a universal chain halt to be published, @Jessysaurusrex.

I would love to have a purely technical conversation, but unfortunately, those conversations have been deliberately stopped by the above named executives.

@crainbf sir, I hope you take appropriate action.


All of cosmos owes @Rarma some thanks.


Of course, @zarko_informal did say that there was no security problem…

AFTER RARMA HAD ATTACKED MAINNET


So what that means is that even a direct attack on the Cosmos Hub does not get Informal’s attention. @AdiSeredinschi and @thanethomson threatened to ban me from contributing to Comet

AFTER RARMA HAD ATTACKED MAINNET


Please, everyone, 839 should be vetoed, and we should help Jehan, Marius, and frens get funded in any other organization, because work can’t be done properly at Informal Systems. It is not permitted.


The informal systems team claims there’s no security problem, while in possession of this video:


More videos coming. They just need to be carefully censored so that the exact mechanism isn’t shown.

The title refers to mempool issues in the Cosmos Hub. This is untrue: these are global and universal issues for every chain that uses Comet.

Two things related to this:

  1. Did you have pruning enabled on those nodes? Pruning can affect the amount of time a node spends holding the ABCI lock (in Commit, if I recall correctly) and delay block production. Injective disabled pruning for this reason on some of their nodes.
  2. More importantly: in Comet we added fine-grained locking that will be shipped in v1. This allows higher levels of parallelism across ABCI calls. dYdX is already using fine-grained locks, and I suspect Osmosis is too. We can investigate backporting this to older versions of Comet – and the SDK would need to absorb these changes – but if there’s interest, this is a good way forward.

Forgot to say also, as someone mentioned earlier here, the NewMetric team has been investigating parallelism across ABCI queries and we’ve been chatting with them to learn from their experience. It seems like the direction with fine-grained locking is appropriate. One problem left that we’re not sure yet how to approach (but this is getting off-topic) is that CheckTx is considered intrusive in the way it requires lock holding at any moment during the block lifecycle (whenever a tx arrives). This problem overlaps with mempool architecture, which is where ADR 110 (Remote mempool) comes into play, following recommendations from the community.
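
Very roughly, the difference between a single coarse lock and fine-grained locking looks like this (a hypothetical sketch, not the actual CometBFT v1 implementation): mempool checks and block execution take separate locks, and only Commit has to hold both while the CheckTx state is rebased.

```go
package main

import "sync"

// Hypothetical sketch of fine-grained locking (not the actual CometBFT v1
// code): mempool checks and block execution take different locks, so a
// CheckTx backlog no longer stalls consensus-critical calls.
type app struct {
	checkMu sync.Mutex // guards the CheckTx state (a branch of the last committed state)
	blockMu sync.Mutex // guards block execution and Commit
}

func (a *app) CheckTx(tx []byte) {
	a.checkMu.Lock()
	defer a.checkMu.Unlock()
	execute(tx)
}

func (a *app) ProcessBlock(txs [][]byte) {
	a.blockMu.Lock()
	defer a.blockMu.Unlock()
	for _, tx := range txs {
		execute(tx)
	}
}

// Commit is where the two sides still have to synchronize: the CheckTx state
// is rebased onto the newly committed state, so both locks are held briefly.
func (a *app) Commit() {
	a.blockMu.Lock()
	defer a.blockMu.Unlock()
	a.checkMu.Lock()
	defer a.checkMu.Unlock()
	// flush committed state, then re-check the remaining mempool txs against it
}

func execute(tx []byte) { _ = tx } // placeholder

func main() { _ = app{} }
```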

IMHO the existence of the lock is itself a problem, no matter how efficient and practical the state tree is via pruning. Technically, and at least logically, CheckTx() doesn’t need to be locked because it’s really just doing two things: dry-running the transaction to see if it makes sense, and putting it into the CList.

The dry-run to check the validity of a transaction doesn’t need a lock; the tx is checked against a previous state root plus a merge-sorted in-memory CacheKV, which resides on the app side. Therefore it could technically be an async call whose response is eventually routed back to the mempool.

One thing we do get by locking is serializability between txns in the order they were received. But AFAIK this isn’t a goal of CometBFT, nor does it make sense: the order in which transactions arrive at any validator’s mempool is never guaranteed in the first place.
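
A minimal sketch of what an async CheckTx could look like under that assumption (hypothetical; not how CometBFT currently behaves):

```go
package mempool

// Hypothetical async CheckTx flow, assuming the dry-run only reads an
// immutable snapshot of the last committed state (plus the in-memory
// CacheKV) and therefore needs no global lock.
type checkResult struct {
	tx  []byte
	err error // nil means the tx passed the dry-run
}

// checkTxAsync validates tx off the consensus path and routes the result
// back to the mempool via a channel instead of blocking the caller.
func checkTxAsync(dryRun func([]byte) error, tx []byte, results chan<- checkResult) {
	go func() {
		results <- checkResult{tx: tx, err: dryRun(tx)}
	}()
}
```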

Given that the lock could be lifted in mempool-related ABCI functions, another interesting lock point is the mutex in CList. I believe this can be removed too; we could just use a simple lockless RingBuffer with in-place tombstones for Reap. We are doing that internally for our product (along with experimental lockless CheckTx…), and this seems to work great as long as the RingBuffer is big enough to handle incoming txns.
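
To show the idea (a minimal sketch assuming a single CheckTx writer and a single Reap reader; not NewMetric’s actual implementation, and a properly lock-free multi-producer version needs more care):

```go
package mempool

import "sync/atomic"

// Minimal sketch of a fixed-size ring buffer mempool with in-place
// tombstones: Remove marks entries dead instead of shifting them, and Reap
// simply skips tombstoned slots.
type ringMempool struct {
	buf  []atomic.Pointer[memTx] // nil slot = empty
	next atomic.Uint64           // next slot to write
}

type memTx struct {
	bytes   []byte
	removed atomic.Bool // tombstone flag, set when the tx lands in a block
}

func newRingMempool(size int) *ringMempool {
	return &ringMempool{buf: make([]atomic.Pointer[memTx], size)}
}

// Push claims the next slot and stores the tx. When the ring is full it
// overwrites the oldest entry, so the buffer must be sized well above the
// expected backlog.
func (m *ringMempool) Push(b []byte) {
	slot := m.next.Add(1) - 1
	m.buf[slot%uint64(len(m.buf))].Store(&memTx{bytes: b})
}

// Reap walks the ring and collects up to maxBytes of live (non-tombstoned)
// transactions for the next proposal.
func (m *ringMempool) Reap(maxBytes int) [][]byte {
	var out [][]byte
	total := 0
	for i := range m.buf {
		t := m.buf[i].Load()
		if t == nil || t.removed.Load() {
			continue
		}
		if total+len(t.bytes) > maxBytes {
			break
		}
		out = append(out, t.bytes)
		total += len(t.bytes)
	}
	return out
}

// Remove tombstones a committed tx in place instead of shifting entries.
func (m *ringMempool) Remove(t *memTx) { t.removed.Store(true) }
```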

The scenario that @dogemos mentioned (which I’m all too familiar with :sweat_smile:) is very real, and often ends up being the single point of failure for the liveness goal. This is also true for queries, as most of them are routed through QueryABCI.

Fine-grained locking is great, but I think most of the control should just be moved to the application for methods that aren’t related to consensus-critical serializability. This IMO leaves BeginBlocker, DeliverTx, EndBlocker, Commit, and the Proposal-related methods.

btw: ADR110 rocks

This post frames a block size increase as strictly positive. In order to enable voters on proposal#845 to make an informed decision, can this post please also document the downsides associated with a block size increase?

Hey, @rootulp – to my knowledge there aren’t any downsides to a block size increase, until the block size reaches ~5mb, at which point there could be issues propagating the block to peers due to issues with block gossip.

That said, there’s a bigger issue at play, which came up during the testing that Notional did on the replicated security testnet:

  • 200kb and 21mb performed about the same, for different reasons
    • with 200kb blocks it is much easier to fill the mempool and cause it to freak out.
    • with 21mb blocks it is much easier to get the block gossip mechanism to freak out
  • 1mb was much better than 200kb
  • 2mb was a bit better than 1mb

Banana Kings use the receiver address field.