net_info vs Prometheus peer count mismatch and send failed errors

I noticed that the number of peers reported by the net_info endpoint and by the Prometheus metric “p2p_peers” diverge. I observed this after changing the elastic public IP of a sentry.

net_info showed 30 peers but Prometheus showed only 13. At the same time, I saw a large number of send failed errors from more than 10 peer IPs.
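For anyone who wants to check for the same mismatch, here is a minimal sketch in Python of how the two numbers could be compared. It assumes the net_info response is a JSON-RPC body with an `n_peers` field under `result`, and that the metrics endpoint uses the Prometheus text format with a metric whose name ends in `p2p_peers` (the exact prefix and labels may differ on your node). The function names are made up for illustration.

```python
import json


def net_info_peer_count(body: str) -> int:
    """Return the peer count from a net_info JSON-RPC response body."""
    result = json.loads(body)["result"]
    # Assumption: n_peers may be a string or a number depending on version.
    return int(result["n_peers"])


def prometheus_gauge(body: str, suffix: str = "p2p_peers") -> float:
    """Sum every sample whose metric name ends with `suffix`."""
    total = 0.0
    found = False
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Form: name{labels} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            # Form: name value [timestamp]
            name, *fields = line.split()
        if name.endswith(suffix) and fields:
            total += float(fields[0])
            found = True
    if not found:
        raise KeyError(f"no metric ending with {suffix!r}")
    return total


def peer_count_gap(net_info_body: str, metrics_body: str) -> int:
    """Peers reported by net_info minus peers reported by the metric."""
    return net_info_peer_count(net_info_body) - int(prometheus_gauge(metrics_body))
```

The two bodies would be fetched from the node's RPC and metrics listeners (commonly `http://localhost:26657/net_info` and `http://localhost:26660/metrics`, but those addresses are assumptions that depend on your config). A gap that stays non-zero over several polls matches what I describe above.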

From this, I suspect that gaiad could not detect at that point that some peers had disconnected, so it kept trying to send a lot of data to peers that were already gone.

So I suspect this is a bug in gaiad, and also the main cause of the send failed errors.

I think gaiad needs to check its socket connections with peers frequently, and when a connection looks dead, it should drop that IP from its peer list and stop sending pointless packets to it.
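To make the idea concrete, here is a conceptual sketch in Python of the kind of liveness check I mean. It is not gaiad or Tendermint code, and the class and method names are hypothetical: it records the last time anything was received from each peer and treats a peer as dead once it has been silent for longer than a timeout.

```python
from typing import Dict, List


class PeerLiveness:
    """Tracks the last time each peer was heard from (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def record_activity(self, peer: str, now: float) -> None:
        # Call this whenever any data (or a pong) arrives from the peer.
        self._last_seen[peer] = now

    def stale_peers(self, now: float) -> List[str]:
        # Peers that have been silent for longer than the timeout.
        return sorted(
            peer
            for peer, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )

    def drop_stale(self, now: float) -> List[str]:
        # Remove stale peers so nothing more is sent to them.
        stale = self.stale_peers(now)
        for peer in stale:
            del self._last_seen[peer]
        return stale

    def peers(self) -> List[str]:
        return sorted(self._last_seen)
```

Passing `now` explicitly keeps the sketch easy to test; a real implementation would use a monotonic clock and would probably send a ping before giving up on a quiet peer.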

Another solution is to add an RPC endpoint that lets users disconnect and remove specific peers; it could be called “hanguppeers”. Users could monitor traffic or connection health, identify zombie connections, then disconnect the zombies and dial them again through RPC.
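A rough sketch of that operator-side workflow in Python, under stated assumptions: the per-peer received-byte counters come from whatever monitoring you already have, and `hangup` and `dial` are placeholder callables standing in for the proposed “hanguppeers” endpoint and a dial call. Neither is a real gaiad RPC here.

```python
from typing import Callable, Dict, List


def find_zombies(prev: Dict[str, int], curr: Dict[str, int]) -> List[str]:
    """Peers seen in both snapshots whose received-byte counter did not grow."""
    return sorted(
        peer
        for peer, received in curr.items()
        if peer in prev and received <= prev[peer]
    )


def hangup_and_redial(
    zombies: List[str],
    hangup: Callable[[str], None],  # hypothetical: would call "hanguppeers"
    dial: Callable[[str], None],    # hypothetical: would redial the peer
) -> None:
    """Disconnect each zombie peer, then try to connect to it again."""
    for peer in zombies:
        hangup(peer)
        dial(peer)
```

Peers that appear only in the newer snapshot are left alone, since there is no earlier counter to compare against.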

I came across the same issue. I sometimes got negative peer numbers.
This is not good :rofl:

Thanks for the reports. Let’s track issues like this on GitHub. In this case there was already one: https://github.com/tendermint/tendermint/issues/2332. We’ll look into it!