Net_info vs prometheus peer num and send failed

bharvest · October 14, 2018, 3:11am

I noticed that the number of peers reported by the net_info endpoint diverges from the Prometheus metric “p2p_peers”. I observed this after changing the elastic public IP of a sentry.

Net_info showed 30 peers but Prometheus showed only 13. At the same time I saw a large number of send failed errors from more than 10 peer IPs.
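For anyone who wants to check for the same gap, here is a minimal Python sketch that compares the two counts. It assumes the net_info response carries the count in `result.n_peers` (as a string or an integer) and that the metrics page exposes a gauge whose name ends in `p2p_peers`; the helper names are made up for illustration, and the URLs you fetch from depend on your own node configuration.

```python
import json
import re
import urllib.request

# Matches a Prometheus text-format sample whose metric name ends in
# "p2p_peers", with optional labels and an optional trailing timestamp.
_PEERS_SAMPLE = re.compile(r'^(\w*p2p_peers)(\{[^}]*\})?\s+(\S+)(\s+\d+)?\s*$')


def fetch(url: str) -> str:
    """Download a URL and return the body as text (not used by the parsers)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def peers_from_net_info(body: str) -> int:
    """Peer count according to a net_info JSON response.

    Assumption: the count lives in result.n_peers, either as "30" or 30.
    """
    return int(json.loads(body)["result"]["n_peers"])


def peers_from_metrics(text: str) -> int:
    """Peer count according to the first *p2p_peers sample on a metrics page."""
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _PEERS_SAMPLE.match(line)
        if match:
            return int(float(match.group(3)))
    raise ValueError("no p2p_peers sample found")


def peer_count_gap(net_info_body: str, metrics_text: str) -> int:
    """net_info count minus Prometheus count; 0 means the two views agree."""
    return peers_from_net_info(net_info_body) - peers_from_metrics(metrics_text)
```

With the numbers from this report (30 from net_info, 13 from Prometheus), `peer_count_gap` would return 17. Running it periodically against `fetch(...)` output from both endpoints would show whether the gap appears right after an IP change.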

From this, I suspect that gaiad could not detect at that time that some peers had disconnected, so it kept trying to send a lot of data to peers that were already gone.

So I suspect this is a bug in gaiad, and also the most critical cause of the send failed errors.

I think gaiad needs to check its socket connections with peers frequently, and when a connection looks dead it should drop that IP from its peer list and stop sending meaningless packets to it.
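To make the idea concrete, here is a rough Python sketch of the kind of liveness check I mean. It is not how gaiad or Tendermint actually tracks peers; `PeerTable` and its methods are hypothetical names, and the timeout value is arbitrary.

```python
import time
from typing import Callable, Dict, List


class PeerTable:
    """Hypothetical peer list that forgets peers it has not heard from."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock                      # injectable for testing
        self._last_seen: Dict[str, float] = {}   # peer address -> last activity

    def add(self, addr: str) -> None:
        """Register a newly connected peer."""
        self._last_seen[addr] = self._clock()

    def mark_alive(self, addr: str) -> None:
        """Record traffic (for example a pong) from a known peer."""
        if addr in self._last_seen:
            self._last_seen[addr] = self._clock()

    def evict_stale(self) -> List[str]:
        """Drop every peer silent for longer than timeout_s; return them."""
        now = self._clock()
        stale = sorted(addr for addr, seen in self._last_seen.items()
                       if now - seen > self.timeout_s)
        for addr in stale:
            del self._last_seen[addr]
        return stale

    def peers(self) -> List[str]:
        """Addresses still considered connected."""
        return sorted(self._last_seen)
```

Calling `evict_stale()` on a timer would keep the peer list, and any counter derived from it, in line with the connections that are actually alive, so nothing is sent to peers that have silently gone away.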

Another solution is to build an RPC endpoint that lets users disconnect and remove specific peers; it could be called “hanguppeers”. Users could monitor traffic or connection health regularly, identify zombie connections, and then disconnect the zombies and redial them through RPC.

suyu · October 18, 2018, 1:33am

I came across the same issue, and I sometimes got negative peer numbers.
This is not good :rofl:

ebuchman · October 25, 2018, 1:28am

Thanks for the reports. Let’s track issues like this on GitHub. In this case there was already one: https://github.com/tendermint/tendermint/issues/2332. We’ll look into it!
