Send failed error

I am seeing "send failed" errors in the gaiad log.
They occur with one or several peers at the same time.

For each peer, the error count can reach 30,000 in an hour. When the errors are concentrated, several peers together can produce more than 200 errors in one second.

The peers that trigger send failed change over time. I have seen more than 20 peers causing this error, so I guess it is not a node-specific problem. It looks more like a structural problem in the gaiad software.

While the errors were being logged, neither the origin (the node that logged the send failed error) nor the victim had any issue with hardware resources, including CPU, RAM, traffic, max peer number, etc.

I suspect two problems in the gaiad software or its configuration.

  1. Too many attempts to send data to a specific peer even though sending keeps failing.
  2. The receiver’s mempool is too small relative to its strong hardware.
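
To make the first suspicion concrete, here is a minimal sketch of what backing off from a failing peer could look like. This is not how gaiad or Tendermint behaves today; `SendGate`, `BackoffDelay`, and the two default durations are hypothetical names and values I made up for illustration. The idea is that each consecutive failure doubles the wait before the next attempt, up to a cap, so a stuck peer cannot produce hundreds of errors per second.

```go
package main

import "time"

// Hypothetical defaults, chosen only for illustration.
const (
	DefaultBase = 100 * time.Millisecond
	DefaultMax  = 5 * time.Second
)

// BackoffDelay returns how long to wait after the given number of
// consecutive failures: base, 2*base, 4*base, ... capped at max.
func BackoffDelay(failures int, base, max time.Duration) time.Duration {
	if failures <= 0 {
		return 0
	}
	delay := base
	for i := 1; i < failures; i++ {
		delay *= 2
		if delay >= max {
			return max // stop early so the value cannot overflow
		}
	}
	if delay > max {
		return max
	}
	return delay
}

// SendGate is a hypothetical per-peer gate that decides whether a send
// should be attempted at all.
type SendGate struct {
	failures int
	nextTry  time.Time
}

// Allow reports whether a send may be attempted at time now.
func (g *SendGate) Allow(now time.Time) bool {
	return !now.Before(g.nextTry)
}

// OnFailure records a failed send and pushes the next attempt further out.
func (g *SendGate) OnFailure(now time.Time) {
	g.failures++
	g.nextTry = now.Add(BackoffDelay(g.failures, DefaultBase, DefaultMax))
}

// OnSuccess resets the gate after a successful send.
func (g *SendGate) OnSuccess() {
	g.failures = 0
	g.nextTry = time.Time{}
}
```

A sender would call `Allow` before each attempt, `OnFailure` when the send fails, and `OnSuccess` when it goes through. With these example values the rate of failed attempts to one peer drops to at most one every 5 seconds once the cap is reached.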

Let’s discuss this topic further and get over this together.


We are seeing these as well. Coincidentally, I have just created an issue to get better Prometheus metrics. One of them (tendermint_p2p_peer_pending_transmit_bytes) would help pinpoint lagging peers easily, without the wall of text.
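
As a rough sketch of how such a metric could be used: the snippet below scans Prometheus text-format output and lists the peers whose pending transmit bytes exceed a threshold. The metric name is the one from the issue; the `peer_id` label, the function names, and the threshold are assumptions of mine, since the metric does not exist yet. The label lookup is a simple substring match, not a full parser for the exposition format.

```go
package main

import (
	"strconv"
	"strings"
)

// Metric name as proposed in the issue; it is not an existing metric.
const pendingMetric = "tendermint_p2p_peer_pending_transmit_bytes"

// LaggingPeers returns the peer_id label (an assumed label name) of every
// sample of pendingMetric whose value is above threshold, in input order.
func LaggingPeers(exposition string, threshold float64) []string {
	var lagging []string
	for _, line := range strings.Split(exposition, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, pendingMetric+"{") {
			continue // comment, blank line, or another metric
		}
		end := strings.LastIndex(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(pendingMetric)+1 : end]
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil || value <= threshold {
			continue
		}
		if id, ok := labelValue(labels, "peer_id"); ok {
			lagging = append(lagging, id)
		}
	}
	return lagging
}

// labelValue extracts name="value" from a label list by substring search.
// It does not handle escaped quotes or labels whose name ends in name.
func labelValue(labels, name string) (string, bool) {
	key := name + `="`
	start := strings.Index(labels, key)
	if start < 0 {
		return "", false
	}
	rest := labels[start+len(key):]
	stop := strings.Index(rest, `"`)
	if stop < 0 {
		return "", false
	}
	return rest[:stop], true
}
```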

It might still make good sense to log this, but I think it might make better sense to give up quickly on the peer. Say, disconnect immediately and then increase a counter (perhaps persisted in addrbook) that tracks how many times this happened. The likelihood of connecting to this peer should then decrease as this counter increases (so new peer candidates should be ordered by counter, with the fewest failures tried first).
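
Roughly what I have in mind, as a sketch only: the current addrbook has no such counter as far as I know, so `PeerRecord`, `FailureBook`, and their methods are hypothetical names, and persistence is left out. Each send failure bumps the peer's counter, and candidate selection prefers peers with the fewest recorded failures.

```go
package main

import "sort"

// PeerRecord is a hypothetical address-book entry with a failure counter.
type PeerRecord struct {
	ID           string
	SendFailures int
}

// FailureBook is a hypothetical in-memory stand-in for the addrbook.
type FailureBook struct {
	records map[string]*PeerRecord
}

// NewFailureBook returns an empty book.
func NewFailureBook() *FailureBook {
	return &FailureBook{records: make(map[string]*PeerRecord)}
}

// AddPeer registers a peer with a zero failure count if it is not known yet.
func (b *FailureBook) AddPeer(id string) {
	if _, ok := b.records[id]; !ok {
		b.records[id] = &PeerRecord{ID: id}
	}
}

// RecordSendFailure increments the peer's counter. The caller is expected
// to disconnect from the peer right after calling this.
func (b *FailureBook) RecordSendFailure(id string) {
	b.AddPeer(id)
	b.records[id].SendFailures++
}

// Candidates returns peer IDs ordered by ascending failure count, so the
// least troublesome peers are dialed first. Ties are broken by ID to keep
// the order deterministic.
func (b *FailureBook) Candidates() []string {
	recs := make([]*PeerRecord, 0, len(b.records))
	for _, r := range b.records {
		recs = append(recs, r)
	}
	sort.Slice(recs, func(i, j int) bool {
		if recs[i].SendFailures != recs[j].SendFailures {
			return recs[i].SendFailures < recs[j].SendFailures
		}
		return recs[i].ID < recs[j].ID
	})
	ids := make([]string, len(recs))
	for i, r := range recs {
		ids[i] = r.ID
	}
	return ids
}
```

A strict sort like this means a peer with many failures is only tried when nothing better is left. A weighted random pick would be a softer alternative if we want such peers to get an occasional second chance.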


I really like the idea of better prom metrics. Where is that issue?
