Persistent peers management on gaiad


#1

case study

  1. one of the sentries was dead for more than 20 min
  2. the dead sentry came back alive after that period
  3. a relay has the sentry as a persistent peer
  4. relay node config : pex and seed mode both set to false

tested result

  1. the relay never retried the connection after the dead sentry came back alive
  2. rpc dial_peer from the relay threw an error : Permanently Removed
  3. the only way to re-establish the connection is to restart the relay node, or to rpc-dial the relay node from the sentry

problem and solution

  1. although the sentry had been offline for some time, the relay should try to reconnect to it at least every 1 minute. there is no harm in doing so, and it is only fair given the meaning of PERSISTENT. the 1 minute interval could be made configurable in config.toml.
  2. gaiad should never prohibit users from manually dialing a peer via the rpc endpoint. a human does that only when there is a good enough reason to do it.

opinion

  1. this is quite urgent to fix because it affects the connection stability of validators and of the whole network.
  2. in particular, prohibiting manual rpc dialing by a human is a malfunction in my opinion.

#2

Good stuff! Has an issue been submitted?


#3

Not yet. I will do it if there is enough agreement on this subject.


#4

I think we talked about this a little bit in the riot chat and I agree. We at least need a clear definition of what ‘persistent’ means when it is enabled. In my opinion, persistent peers should never be removed automatically if they go offline. There is a reason people want a peer to persist, and I think that should also cover extended downtime.


#5

Did this issue ever get submitted, and if so was there a resolution?


#6

Not yet submitted. I will do it now on github, because this malfunction keeps giving me headaches over the stability of persistent peer connections. I will suggest 2 solutions.

  • Dial persistent peers “FOREVER” at a given frequency (default: every 1 minute)
  • Don’t prohibit RPC dial_peer in ANY CIRCUMSTANCES

github link : https://github.com/cosmos/cosmos-sdk/issues/2436