problem statement
There is no environment where load can be properly tested. Let’s be real and direct: the fact that testnet doesn’t resemble mainnet poses a threat to the network, because there is nowhere to test load scenarios and observe a real-world reaction.
This governance proposal authorizes world+dog to run load tests on mainnet, including load tests that could impact the performance of the Cosmos Hub.
Reasons for load testing on mainnet:
- No surprises - so far, at pretty much every value/load spike, we have had significant performance degradation. Since no testnet matches any mainnet, we have seen teams like Informal act “surprised” when issues they’ve been aware of for ages harm user funds.
- Actual environment - mainnet is more than 10x as large as testnet in terms of node count.
- Learn definitively what is going on - we actually have very few network peaks to study. A good one is the Luna death spiral; another is the exploit of Osmosis that was used to exploit Levana. There have been many others as well - hundreds, if I had to guess - but often we don’t even notice, especially if the impact isn’t very large or performance doesn’t change that much.
- Identify inadequate validator setups - I would rather learn which ones before an incident instead of after, as happened with Luna and many other incidents. While I think there are some validators who simply aren’t great, there’s another reality here: validators don’t get the opportunity to experience load. They don’t know what load will do or require of them. We should give them load.
rules
- Techniques may not involve theft. Theft will be prosecuted.
- Don’t take the Hub down for more than 8 hours, unless by mistake.
- Be present in the cosmos-sdk channel of Cosmonaut HQ and report on your techniques as you apply them.
- P2P-only techniques (DDoS or P2P API exploration) are not encouraged by this. This proposal is specifically about transaction volume, so we can work on that specific issue.
- Unfortunately, P2P storms are far worse than DDoS: a DDoS requires lots of machines, and it would be very hard to DDoS Cosmos even just from an expense perspective. P2P storms cause amplification across the stack, and they are definitely not one issue: as of two years ago, it was 18 issues in Comet alone.
additional note
It is already entirely legal to use the Cosmos Hub to its limits, and beyond, since its documentation describes capacities that either the code or the validators are incapable of meeting.
Transactions in Cosmos are paid for. People can pay for as many as they want, and any downtime is entirely the fault of validators, since validators determine tx inclusion. Validators are standing there with a sign: pay me for transactions.
It is never a user’s fault that they paid for an advertised service.
so why gov prop?
Because unfortunately this isn’t risk-free, and it makes sense to get the community’s consent and support. Taking a direct approach to diagnosing and understanding network problems at scale requires pushing the network to its actual limits, learning what those limits are, and learning why they limit the network’s performance.
While I am aware that the Osmosis and Celestia teams did huge work on Comet P2P, I am also aware of a number of frankly puzzling performance/liveness issues.
The incident on the Hub two weeks ago, for example, where 3,000 bank sends were packed into one multisend and then sent over and over to existing Hub addresses, was actually very surprising. I knew that it was possible to significantly degrade performance with just bank send, due to some research I did recently with a client. But I really didn’t know that packaging the transactions in that manner would have an enormous liveness impact like that. Looking at what happened, I don’t actually think it was an attack; I think somebody was just spamming. I could be wrong about that, of course.
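For concreteness, here is a minimal sketch of what packing sends into one multisend looks like with the cosmos-sdk x/bank types. The addresses, denom, and amounts are placeholders of mine; I can’t reconstruct the incident’s exact parameters here.

```go
// Sketch: many bank sends packed into a single MsgMultiSend, the tx
// shape described above. All concrete values are placeholder assumptions.
package main

import (
	"fmt"

	sdk "github.com/cosmos/cosmos-sdk/types"
	banktypes "github.com/cosmos/cosmos-sdk/x/bank/types"
)

func buildMultiSend(from string, dests []string, amt sdk.Coins) *banktypes.MsgMultiSend {
	outputs := make([]banktypes.Output, 0, len(dests))
	total := sdk.NewCoins()
	for _, d := range dests {
		outputs = append(outputs, banktypes.Output{Address: d, Coins: amt})
		total = total.Add(amt...)
	}
	// A single input funds every output; recent SDK versions require
	// exactly one input in a MultiSend in any case.
	return &banktypes.MsgMultiSend{
		Inputs:  []banktypes.Input{{Address: from, Coins: total}},
		Outputs: outputs,
	}
}

func main() {
	// 3,000 outputs to existing addresses, per the incident description.
	dests := make([]string, 3000)
	for i := range dests {
		dests[i] = "cosmos1..." // placeholder recipient
	}
	msg := buildMultiSend("cosmos1...", dests, sdk.NewCoins(sdk.NewInt64Coin("uatom", 1)))
	fmt.Println("outputs in one msg:", len(msg.Outputs))
}
```

The point isn’t this exact construction: it’s that a perfectly valid, well-audited message type, packaged a certain way, apparently carries an outsized liveness cost.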
Point being: I think we probably have a lot to learn and the best place to learn it is directly on mainnet.
It’s scary as hell to test on mainnet. It freaks me out. It freaks out Skip, too. So I thought of this gov prop.
what was P2P storms anyhow?
Making large, valid transactions could cause contention between the block gossip and mempool-related P2P layers. Other performance issues would also come up; some of them, in my opinion, are still not identified.
what is bank storms?
It’s whatever happened two weeks ago. Bank storms was a very surprising finding: the bank module is very well audited. I now suspect that this is possible with any tx type, and even more so with high-gas ones that are heavy in some area that blocks.
what would we do?
We would just be agreeing to a social standard where this kind of thing is the norm and encouraged, even though (or maybe because) it’s a pain in the ass.
We really don’t want the pain in the ass to hit at some financially or technically volatile moment outside of our control.
…among other things, we build simple apps with web wallets that compete in some fashion: whoever submits the most txs of a given type wins a rubber chicken (see the sketch below).
The point is to understand this threat model so that we have it beaten before the Hub sees enormous tx volume during an economic event and simply fails.
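As a sketch of how the rubber-chicken scoring could work: count a competitor’s txs through any node’s tx_search RPC. This assumes the node runs with the tx indexer enabled and indexes the standard cosmos-sdk message.sender event; the endpoint URL and the program name are placeholders of mine.

```go
// Sketch: count how many indexed txs an address has sent, via the
// CometBFT tx_search endpoint. Assumes tx indexing is enabled and the
// standard "message.sender" event is emitted (both are assumptions).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

const node = "http://localhost:26657" // placeholder RPC endpoint

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: chickenscore <bech32-address>")
		return
	}
	sender := os.Args[1]

	q := url.QueryEscape(fmt.Sprintf("\"message.sender='%s'\"", sender))
	resp, err := http.Get(fmt.Sprintf("%s/tx_search?query=%s&per_page=1", node, q))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// We only need total_count, so one result per page is enough.
	var out struct {
		Result struct {
			TotalCount string `json:"total_count"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s indexed txs\n", sender, out.Result.TotalCount)
}
```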
what’s the goal?
There is a set of poorly understood performance problems that occur in Cosmos networks at scale. Some of them are very surprising. The goal is to identify all of these problems and fix them, so that even extreme load testing on mainnet does not cause it to stutter, with stuttering defined as a block taking longer than 10 seconds.
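Since stuttering has a numeric definition, anyone can check for it. Here’s a minimal sketch of a stutter detector against a node’s CometBFT /block RPC; the endpoint and the height range are placeholders of mine.

```go
// Sketch: flag any block whose header time is more than 10s after its
// parent's, i.e. a "stutter" as defined above. Endpoint and heights
// are placeholders; the /block response shape is standard CometBFT RPC.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const node = "http://localhost:26657" // placeholder RPC endpoint

type blockResp struct {
	Result struct {
		Block struct {
			Header struct {
				Time time.Time `json:"time"`
			} `json:"header"`
		} `json:"block"`
	} `json:"result"`
}

func headerTime(height int64) time.Time {
	resp, err := http.Get(fmt.Sprintf("%s/block?height=%d", node, height))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var b blockResp
	if err := json.NewDecoder(resp.Body).Decode(&b); err != nil {
		panic(err)
	}
	return b.Result.Block.Header.Time
}

func main() {
	var end int64 = 1000000 // placeholder: some recent height
	prev := headerTime(end - 100)
	for h := end - 99; h <= end; h++ {
		t := headerTime(h)
		if gap := t.Sub(prev); gap > 10*time.Second {
			fmt.Printf("stutter at height %d: %s block time\n", h, gap)
		}
		prev = t
	}
}
```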
This kind of testing doesn’t work on testnets, for many reasons, including:
- No economic incentive for testnets (I’m not really saying there should be)
- Lack of validator interest
- Lack of usage resulting in wildly divergent state and state sizes
- Different patches
…all in all, I have come to think that testnets aren’t a viable way to test for these issues because they’re so different from mainnets.
I think that this issue is the key blocker for the whole ecosystem. It isn’t just about security; it’s also about being performant and reliable. Cosmos chains should be really fast, and it’s kind of weird that they aren’t.
Maybe what is even stranger is that the breakages and downtimes are not really very predictable. I tried bank send in meteorite because I figured it would never work, but sure enough I took out two testnets with it, both pretty badly, in November and December of 2024.
But what happened 2 weeks ago was far more damaging than what I had baked into meteorite.
Other tx types may behave the same or similarly. Sometimes blocks commit with partial or no gas used, and I can’t say I know why.
why so extreme man?
Right now, the Hub is basically the only place where not much is time-sensitive. We want to change that, and I’m very supportive of changing it. We should figure out what we are dealing with first.
Plus, I want our networks to be capable of provable feats of volume like the Trump launch. Think what you will of Trump; that was very impressive blockchain performance. And we can totally have that, or better. It just takes practice.
why’d you say 8 hours that’s crazy
To be honest, I’ve never really tried to load test a mainnet. Every test I have run has been incredibly gentle. And the results often don’t show up if you do exactly the same thing on a testnet.
That points to a P2P issue, because testnets have far simpler P2P networks.
But it isn’t that clear at all, because testnets also have a lower lifetime tx count: so much lower that the KV store is much smaller, and seeks don’t take nearly as long on a testnet. There are countless scenarios where testnets outperform mainnets by 100x, and that’s really the problem. That’s why mainnets explode as they do.
And this is why every single attempt (and Osmosis and Celestia made great attempts) to work at this from testnets + code + instrumentation has failed. Something happens on mainnets that doesn’t happen on testnets. Sadly, that could be one of many, many things.
…so I said 8 hours because I don’t actually know what some tx types would do if there were lots of them, or what combinations of tx types would do. But I know it’s wasted energy to do this on a testnet, because the testnet could lie to us: something might work fine on testnet and fail on mainnet, which means the testnet itself is the problem.
I kind of doubt anything could take the network down for 8 hours, BUT I would want it to be us that finds it, where “us” is defined as people who enjoy Cosmos Hub blocks.
alternative path
Every single alternative, without exception, provides no proof of real-world results.
Building the test systems needed to simulate mainnet is impossible. Mainnet is a totally unique set of circumstances, especially past about 1,000 nodes total.
So basically, if we do not do this or something like it, our next test of this issue will be the next time volume spikes due to price volatility, or the next time that spammer hits us up. And to be painfully direct: we keep failing that test, over and over.
This is because we have pursued alternatives to testing on mainnet.
What is the worst possible time for performance to degrade?
When price volatility causes tx volume to spike.
Why?
Because that’s when users most need to transact, and when it is most important for the chain to work properly.
vote option
yes - authorize load testing that can cause performance degradation on the Hub, so we can learn what degrades performance and fix it now