Syncing on gaia-7001 (graphs!)

Some users noticed syncing new nodes on gaia-7001 was slow, and getting slower over time, so I ran some tests and generated some graphs.

To generate the data for these, I ran commands like:
gaiad start --pruning=x | grep "Executed block" | tee x-machine_size.log
I then used the following Python script to get the differences in timestamps between successive blocks from the log.

import time

fname = "everything-micro"
lines = open(fname + ".log", "r").readlines()
# The timestamp (month-day|H:M:S.millis, e.g. "06-25|19:00:00.000") sits at characters 2:20 of each log line
ts = [li[2:20] for li in lines]
# Parse to epoch seconds (adding the year, which the log omits), then add the millisecond fraction
tsf = [time.mktime(time.strptime("2018-" + t[:-4], "%Y-%m-%d|%H:%M:%S")) + int(t[-3:]) / 1000.0 for t in ts]
# Time taken per block = difference between successive timestamps
diffs = [tsf[i + 1] - tsf[i] for i in range(len(tsf) - 1)]
open(fname + "-diff.txt", "w").write("\n".join(str(d) for d in diffs))

To generate the graphs I used gnuplot, with simple commands like:
plot "everything-micro-diff.txt"

The first test I ran was with pruning=everything on a t2.micro box on AWS (1 core, 1 GB of RAM).
By default, nodes prune most old states, keeping only one every 10,000 blocks plus the last 100. Since the network was below block 10,000, most nodes would just be keeping the last 100. With pruning=everything, the node deletes each historical state as soon as a new one is saved.
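
To make that retention rule concrete, here's a toy illustration of the behaviour described above; it's my own sketch, not the actual SDK pruning code:

# Toy sketch of the pruning behaviour described above (not the SDK implementation)
def kept_by_default(height, latest_height):
    # Default: keep the last 100 states plus one snapshot every 10,000 blocks
    return latest_height - height < 100 or height % 10000 == 0

def kept_with_pruning_everything(height, latest_height):
    # pruning=everything: only the most recent state survives
    return height == latest_height

latest = 5300  # an example height below 10,000, roughly where the network was
print(sum(kept_by_default(h, latest) for h in range(1, latest + 1)))               # 100
print(sum(kept_with_pruning_everything(h, latest) for h in range(1, latest + 1)))  # 1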


Around block 3500 it started having trouble getting new blocks, I think because of network congestion.

At the time I thought it might be a lack of resources on the box, so I switched to a t2.medium (2 cores, 4 GB of RAM) and ran the same test again.


Around block 5300 it caught up and switched to the consensus reactor. Block time predictably moved to around 5 seconds per block, though with some interesting stratification of slower times. My assumption is that those lines correspond to slower block times when the proposer was further away from my node in the topology, but it could be something else.
For comparison with the micro graph, which stopped at block 3500, I cropped this graph similarly.

The trendline is quite close to the results from the micro box, with the only difference being that on the micro box there’s a clear secondary trendline at higher times. Presumably this is due to resource contention.

To see if this was being caused by state pruning, I then ran the test again on the same t2.medium box, with pruning set to nothing. With pruning=nothing, the node stores all old state trees, and presumably saves some time on each block since it can skip the deletion process.


Here the syncing started getting choppy around block 5000, but the node wasn’t fully synced yet at that point, so I think this was due to network congestion.
I also cropped this graph to 3500 blocks for comparison.

The results are strikingly similar to those with pruning=everything, which would imply that pruning has very little effect on block syncing time.
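
One quick way to back up "strikingly similar" beyond eyeballing the graphs would be to compare summary statistics for the two runs. The run names below are just placeholders for however you named your log files:

import statistics

# Placeholder names for the two t2.medium runs; substitute your own file names
for run in ("everything-medium", "nothing-medium"):
    diffs = [float(line) for line in open(run + "-diff.txt")]
    print("%s: %d blocks, mean %.3fs, median %.3fs"
          % (run, len(diffs), statistics.mean(diffs), statistics.median(diffs)))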

These graphs do show a clearly increasing trendline for block syncing time though.
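
A rough way to put a number on that trend is to fit a straight line to block time against block index, for example with numpy (assuming it's installed):

import numpy as np

# One block time per line, as produced by the timestamp-diff script above
diffs = np.loadtxt("everything-micro-diff.txt")
slope, intercept = np.polyfit(np.arange(len(diffs)), diffs, 1)
print("block time grows by roughly %.3f ms per block (intercept %.3f s)"
      % (slope * 1000, intercept))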

A community member managed to find a particular bug, https://github.com/cosmos/cosmos-sdk/issues/1736, which may be the cause of this, or is at least a contributing factor. A fix has been merged, though I believe it cannot be deployed on 7001 without breaking consensus, so we will have to wait till 7002 to test it.

I’m planning to rerun these tests once that fix is live, so we can see whether we get a nice flat line or whether there are other causes of slowdown over time. If anyone else wants to run these steps and produce graphs, by all means feel free, and if anyone has better ways to generate or display this sort of data, please chime in.
