EDIT: @Rarma used his own homebrew BananaKing, not one he got from Jacob Gadikian
Around 11/2/2023, @Rarma submitted a large volume of transactions to the Cosmos Hub using a shell script. I’ll refer to this as Someone Posting A lot of Messages, aka S.P.A.M.
The transactions submitted were of the “BananaKing” type, IBC transfers with a long random string inserted into the memo field. This random string bulked up the transaction size. Due to gas pricing factors that I will get into further down, it also didn’t cost a huge amount of gas (only 2,000,000 gas units per tx).
However, there doesn’t seem to be anything particularly malicious about “BananaKing” transactions. Many legitimate IBC transfer transactions could have large memo fields. A good example is a complicated workflow using Packet Forward Middleware and IBC hooks.
The script ran for a short time and got a very large number of transactions into the network. Over the next few days, a number of validators and full nodes struggled, with missed blocks and network saturation, and one of the S.P.A.M. transactions got into a block every few minutes, even though the script had stopped running days earlier.
I found this very interesting. Working together with the Informal Comet team, we (the Informal Hub team) developed a hypothesis for what was going on. There are several components.
Networking and uptime were degraded for some validators and full nodes. They seemed to be under a heavy load. Not all validators were affected, which I will address in the next section. I’ll call the ones affected the “struggling subgraph”.
The most likely cause was something called the mempool. When blockchain nodes receive a transaction that a user would like to put on the chain, they store it on their system in something called a mempool, and send it out to every validator they are connected to. By sending transactions from the mempool around like this (a process called “gossip”), validators make sure that everyone has the transactions and whoever proposes the next block can put them in.
During the S.P.A.M. event, so many transactions were sent that it filled the mempools of the affected validators (the “struggling subgraph”). They also seem to have used a ton of bandwidth gossiping the S.P.A.M. transactions around. This is likely to have caused the strain on those nodes.
But even if nodes were gossiping unending spam transactions, why should the rest of their system be so heavily affected? Isn’t there some setting to limit the resource consumption of the mempool?
Yes. There is a setting called max_txs_bytes. This limits the size of the mempool. On the Hub, the default is currently around 1gb. This has been a default setting since way back in the day. As far as we can tell, there is no reason to have a mempool this large. Our reasoning is as follows: a user is going to retry a transaction themselves if it takes more than a few minutes. There seems to be no reason for the mempool to store a backlog of transactions that could take hours to clear. A mempool size of 2x-10x the block size should be entirely sufficient.
Maybe someone who was working on the Hub at launch can chime in if we’re wrong and they know why a default mempool of 1gb was chosen.
A relatively conservative adjustment of max_txs_bytes to 10mb (50x a 200kb blocksize, and 5x a 2mb blocksize) could cut mempool bandwidth and memory usage by 100x. If there are no unexpected side effects of reducing the size, this should ensure that nodes run smoothly even during a S.P.A.M. scenario.
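The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1mb = 1,000,000 bytes) and treats the current default as exactly 1gb, whereas the post only says "around 1gb". The function and variable names are made up for illustration.

```python
# Sizes in bytes, using decimal units to match the round numbers in the post.
KB = 1_000
MB = 1_000 * KB
GB = 1_000 * MB


def mempool_to_block_ratio(max_txs_bytes: int, block_bytes: int) -> float:
    """How many full blocks' worth of transactions the mempool can hold."""
    return max_txs_bytes / block_bytes


current_default = 1 * GB  # assumption: "around 1gb" taken as exactly 1 GB
proposed = 10 * MB        # the proposed max_txs_bytes value

ratio_small_block = mempool_to_block_ratio(proposed, 200 * KB)
ratio_large_block = mempool_to_block_ratio(proposed, 2 * MB)
reduction = current_default / proposed

print(f"10mb mempool vs 200kb blocks: {ratio_small_block:.0f}x")
print(f"10mb mempool vs 2mb blocks:   {ratio_large_block:.0f}x")
print(f"reduction from current size:  {reduction:.0f}x")
```

Note that the 100x figure is an upper bound on the saving: it compares a completely full 1gb mempool with a completely full 10mb one.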
Hypha is currently testing this adjustment, and should have results by Friday. Hopefully it will help performance without any bad side effects, and when we confirm it, we will work to roll it out across all Hub validators.
Even with the mempool size thing, there were still a couple of mysteries. Why did only a few nodes struggle? Why were these nodes the only ones who put the S.P.A.M. transactions in blocks? Why did the S.P.A.M. keep going days after the script stopped?
Most validators have the recommended setting from our documentation: a gas price of 0.0025uatom. But a minority of validators had a different, lower gas price set. S.P.A.M. transactions were rejected by those with the recommended setting, and accepted by those with a different (lower) setting.
Because only a minority of validators were even handling these transactions, they only made it into a block every few minutes. This is why they continued hanging around for so long after the script stopped running.
If a minority of validators on the network have different transaction inclusion criteria than the rest, it is possible to fill their mempools with transactions that use up resources but only get into blocks very slowly, if at all. These nodes form a “struggling subgraph” in the gossip network.
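Here is a toy model of how the struggling subgraph forms. Everything in it is hypothetical: the validator set, the voting power split, and the spam fee are invented for illustration, and the acceptance check is a simplified stand-in for a node's local minimum gas price check, not the actual Cosmos-SDK code. Only the 0.0025uatom recommended price and the 2,000,000 gas figure come from the post.

```python
from dataclasses import dataclass


@dataclass
class Validator:
    name: str
    min_gas_price: float  # uatom per gas unit, a local per-node setting
    voting_power: float   # fraction of total stake (hypothetical)


def accepts(validator: Validator, fee_uatom: float, gas_limit: int) -> bool:
    """Simplified local check: the fee must cover gas_limit * min_gas_price."""
    return fee_uatom >= gas_limit * validator.min_gas_price


# Hypothetical network: most nodes use the recommended 0.0025uatom,
# a minority run with a lower (or zero) minimum.
validators = [
    Validator("a", 0.0025, 0.30),
    Validator("b", 0.0025, 0.30),
    Validator("c", 0.0025, 0.30),
    Validator("d", 0.001, 0.05),
    Validator("e", 0.0, 0.05),
]

spam_gas = 2_000_000  # gas per S.P.A.M. tx, from the post
spam_fee = 3_000      # hypothetical fee in uatom (0.0015uatom per gas)

subgraph = [v.name for v in validators if accepts(v, spam_fee, spam_gas)]
inclusion_share = sum(
    v.voting_power for v in validators if accepts(v, spam_fee, spam_gas)
)

print("nodes holding the spam in their mempools:", subgraph)
print(f"share of voting power willing to include it: {inclusion_share:.0%}")
```

In this made-up network, only two small validators accept and gossip the spam. Assuming block proposers are picked roughly in proportion to voting power, only about one block in ten would come from a proposer willing to include it, so the backlog drains slowly while sitting in those nodes' mempools the whole time.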
So, it is important that all validators share the same transaction inclusion criteria, in this case gas prices. One step towards this is prop 843, which sets a global minimum price of 0.005uatom.
We will audit the default config for other settings which could cause inconsistent transaction validity criteria, and work with validators to make sure that the network is consistent. We will also work with the Cosmos-SDK team to think about whether it makes sense to disable customization of these settings on a per-node basis.
We make two recommendations above:
- Reduce the mempool size to something that can get cleared out within a few blocks to cut down on unnecessary mempool resource usage.
- Make sure that validators do not have inconsistent transaction validity criteria to avoid a struggling subgraph problem.
We are still working to test these changes, but if they work, they could make it so that the network behaves a lot better in a S.P.A.M. scenario. I would go so far as to say that if these changes work, CometBFT and the network will be functioning as intended. However, even if the network keeps humming along nicely, there are still ways for the Hub to allow more legitimate transactions to get in, and to make more money during a S.P.A.M. scenario.
If the recommendations above work and there is no network degradation, a S.P.A.M. scenario is no longer a problem, since the spammer is using the chain for what it’s meant for: paying to put transactions into blocks. But it’s still not great. Other people’s transactions are not going to get in often, and the spammer will really not need to pay all that much in gas to tie up the Hub’s block space for a period of time.
It’s not really a technical problem if there’s no network degradation; the Hub is just not charging enough for its time.
One way to fix this is to simply raise the global min gas price somewhere above 0.005uatom, to correctly price the Hub’s blockspace. However, this isn’t great for regular users.
We can also make the Hub’s blocks bigger. Prop 845 proposes just that. Raising the block size from 200kb to 2mb means that around 10x the number of S.P.A.M. transactions get in during a given period of time, earning the Hub 10x the fees from full blocks. This makes it 10x more expensive to tie up the block space for a given period of time, and makes the Hub 10x more money during it.
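To make the 10x claim concrete, here is a back-of-the-envelope sketch. The 2,000,000 gas per transaction, the 0.005uatom minimum price, and the two block sizes come from the post. The transaction size of 50kb is purely an assumption, and the sketch ignores any per-block gas limit, which could cap the number of transactions before the byte limit does.

```python
GAS_PER_TX = 2_000_000     # gas used by one S.P.A.M. tx, from the post
PRICE_NUM, PRICE_DEN = 5, 1_000  # 0.005 uatom per gas, kept as a fraction for exact integer math
ASSUMED_TX_BYTES = 50_000  # hypothetical size of one S.P.A.M. tx


def fee_per_tx_uatom() -> int:
    """Minimum fee for one spam transaction at the global minimum price."""
    return GAS_PER_TX * PRICE_NUM // PRICE_DEN


def spam_txs_per_block(block_bytes: int) -> int:
    """How many spam transactions fit in a block, by size alone."""
    return block_bytes // ASSUMED_TX_BYTES


def cost_to_fill_block_uatom(block_bytes: int) -> int:
    """What a spammer pays to fill one whole block."""
    return spam_txs_per_block(block_bytes) * fee_per_tx_uatom()


small = cost_to_fill_block_uatom(200_000)    # 200kb blocks
large = cost_to_fill_block_uatom(2_000_000)  # 2mb blocks

print(f"fill a 200kb block: {small} uatom")
print(f"fill a 2mb block:   {large} uatom")
print(f"ratio: {large // small}x")
```

The ratio does not depend on the assumed transaction size as long as transactions are small relative to the block; only the absolute numbers do.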
Another factor is that this type of “BananaKing” transaction writes a lot of data, while only using a moderate amount of gas. It’s possible that gas is mispriced in the Cosmos-SDK when it comes to data writes and we should be charging more. Gas tuning is a dark art, and we’ve done very little of it in Cosmos. Ethereum has made a bunch of small tweaks to gas prices over the years.
Transaction size seems to be one of the heaviest sources of load for gossip, and it sticks around forever in the blockchain state afterward. Data writes should probably be one of the more expensive things in terms of gas.
The Comet team has also been working on bandwidth improvements in Comet throughout the year, both for block gossiping and for the mempool, all of which may help with S.P.A.M. events like this. These improvements were largely summarized by Bucky recently on the forum, as part of the larger discussion on integrating a fee market into the Hub. See his post for more details.
So, in conclusion, we are making two recommendations (elimination of inconsistent transaction validity criteria, and reduction of mempool size) which should allow the network to handle load from a S.P.A.M. scenario gracefully. We also support a recommendation to raise block sizes to 2mb, just to increase the network’s throughput, which is good in general.
But as I’ve alluded to, there is something else which will work synergistically with the above recommendations to make the network run smoothly, as well as cutting any S.P.A.M. scenario short and making the network a lot of money in gas fees: fee markets.
Fee markets raise the gas price when there is a lot of demand for block space. Under normal circumstances, they make the chain a lot of money during high usage, while giving users low prices when usage is low.
They also have a lot of benefits under a S.P.A.M. scenario like the one described here. As a large volume of transactions comes in, blocks start getting full. The price ramps up, which automatically removes a lot of the spammer’s transactions from the mempool. If the spammer raises the price, they are then paying even more money. It quickly becomes completely non-viable.
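The dynamic can be sketched with a toy price-update rule. To be clear, this is an EIP-1559-style rule picked for illustration, not necessarily what the fee market proposed for the Hub implements. The 12.5% maximum change per block, the target of half a 2mb block, and the spammer's fixed bid are all assumptions.

```python
def next_price(price: float, used_bytes: int, target_bytes: int,
               max_change: float = 0.125, floor: float = 0.005) -> float:
    """EIP-1559-style update: raise the price when blocks run above target,
    lower it when they run below, and never go under the floor."""
    delta = (used_bytes - target_bytes) / target_bytes
    return max(floor, price * (1 + max_change * delta))


def simulate(spam_bid: float, blocks: int,
             max_block: int = 2_000_000, target: int = 1_000_000):
    """Return (price, bytes_used) per block for a spammer with a fixed bid."""
    price = 0.005  # start at the floor (uatom per gas)
    history = []
    for _ in range(blocks):
        # Spam fills the block only while its bid meets the current price.
        used = max_block if spam_bid >= price else 0
        history.append((price, used))
        price = next_price(price, used, target)
    return history


history = simulate(spam_bid=0.01, blocks=10)
for height, (price, used) in enumerate(history):
    status = "full of spam" if used else "spam priced out"
    print(f"block {height}: price {price:.5f} uatom/gas, {status}")
```

In this toy run, a spammer bidding twice the floor price fills six blocks in a row, after which the price has climbed past their bid and their transactions stop qualifying until it falls back. To keep going they have to keep raising their bid, which is the escalating cost described above.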
A fee market can actually fix a lot of problems, even without the other fixes I’ve talked about here. A fee market smooths over a lot of other potential performance tuning issues, while also improving the chain’s economics. Prop 842 proposes to install a fee market on the Hub through Skip’s BlockSDK, and I am very excited to get it installed on the Hub.