Sentry nodes - What they are! - How they work! - Why they exist!

Let’s fill this thread with our questions and answers about everything related to sentry nodes.

3 Likes

Hello, I have this setup for sentry nodes. Is it OK, or is it wrong? Thanks.

I also opened this issue; it happens when my validator works with sentry nodes and config.toml is set up with pex = false.

I’ve been thinking it’d be nice to have something a bit more lightweight than a full node to act as a sentry.

We’re running our sentries on GCP, and there are a lot of downsides to doing that. While we have nice beefy servers in our datacenter dedicated to our validator nodes, on GCP the monthly cost of a full node is pretty expensive ($50/mo).

Running a full node also means we have to keep the full state. We can periodically snapshot these instances so we can spawn new sentries quickly, but that’s still a lot of data to hang onto.

I’ve been wondering if it would be possible to have something a bit more lightweight… sort of like a caching proxy for talking to a validator. I think this might be a fun thing to write in Rust for a few reasons: we’ll soon have a Rust implementation of SecretConnection available, and I think having the sentry written in something different from the validator would help ensure that if there is a severe (e.g. RCE-style) compromise in one, it hopefully wouldn’t be present in both.

3 Likes

There is an ongoing issue that impacts any node that runs in an environment where the local IP address of an instance does not match its public IP address. This is the case with Google Cloud and AWS, where instances always have an RFC1918 IP address which is mapped to a public IP address. Gaiad nodes running on GCP/AWS instances never get dialed, and are unable to maintain consistent outbound connections.

I think there are currently two open tendermint issues that represent different approaches to resolving this issue, but neither of them made it into the release for the gaia-7000 testnet.

758 suggested letting a node configure the IP that it self-reports to its peers. 758 was superseded by 873, which develops that idea into having a node remember the IP a peer actually connects from, regardless of what IP the peer reports. If I read it correctly, 873 suggests that a node should maintain its address book using the real IP addresses of peer connections rather than the IPs that peers report.

1720 takes the opposite approach, and suggests that if an id@ip:port is set in persistent_peers, the node should keep dialling that address, even if the peer reports back a different listen address. In other words, 1720 says persistent_peers should override the address book.

There’s an additional complication, the impact of which I’m a little unsure of. Instances running in GCP/AWS and other environments with similar setups will communicate with peers using two or more different IP addresses. Peers that communicate internally (or externally via VPN or VPC peering) will see the internal address. Peers that communicate over the public internet will see the same node peering from its external IP address.

I find it difficult to maintain healthy sentry nodes on GCP or AWS because of these peering issues. We have had more success with DigitalOcean, OVH, and a few other cloud providers that provision routable IP addresses. In Figment’s architecture, we would like to spread sentries across numerous platforms, and take advantage of the sophisticated services that only the large platforms offer. It seems to me that the approach suggested by 1720 is limited, in that it will allow us to establish and maintain persistent_peers relationships with these GCP/AWS nodes, but will not help those nodes establish public peering relationships. If I understand correctly, the approach suggested by 873 would allow these nodes to get gossiped about and dialled by other nodes via the PEX. My knowledge of the p2p layer’s internals is shallow, so my opinion is not strong, but I think a solution is needed.

I’m interested to know what the team’s thinking is on this, and also hoping that others will share their experience.

1 Like

Good title, but it would be good to actually read what they are, how they work, and why they exist. ATM it seems the people who have this knowledge are just writing about technical problems.

Good point. I’ll try to explain it briefly for those of us who don’t know what sentry nodes are.

What are sentry nodes & how do they work?
Sentry nodes are full nodes, i.e. nodes that store the whole blockchain. They mostly run on cloud providers like AWS, GCP, etc. Sentry nodes are used to isolate your validator from the public: your validator node only establishes private connections to your sentry nodes, and the sentries connect to the rest of the Cosmos network.

By doing this, sentry nodes protect your validator from being attacked directly. One of the most common attack vectors is DDoS, and sentry nodes can mitigate such attacks. This is especially important, since a DDoS attack can prevent a validator node from communicating with the rest of the network, which leads to downtime and slashing. Therefore, securing your setup with sentry nodes is a must-have for validators.
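As a rough sketch, the setup described above maps onto a handful of standard Tendermint config.toml keys. The node IDs and addresses below are placeholders, not real values:

```toml
# Validator's config.toml (sketch): talk only to the sentries, never gossip peers.
[p2p]
pex = false
persistent_peers = "SENTRY1_ID@10.0.0.5:26656,SENTRY2_ID@10.0.0.6:26656"  # private sentry addresses

# Each sentry's config.toml (sketch): peer with the public network, but never
# gossip the validator's address to other peers:
# [p2p]
# pex = true
# persistent_peers = "VALIDATOR_ID@10.0.0.2:26656"
# private_peer_ids = "VALIDATOR_ID"
```

The key idea is that the validator turns off peer exchange entirely, while the sentries keep it on but mark the validator’s ID as private so it is never advertised.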

What is being discussed above?
It is tricky to create a secure and reliable sentry node architecture. There are a lot of details that need to be worked out, because we are currently seeing some technical issues, mostly around peering and keeping the connection to the validator node.

Hope this helps :slight_smile:

1 Like

I am interested in how autoscaling can be done with sentry nodes. As long as the validator node is connected to its sentry nodes, the validator is protected. However, if all the connected sentry nodes go down due to DDoS, the validator node still can’t reach the network, which still leads to validator downtime. I think sentry nodes protect against direct attacks on the validator node but might not prevent downtime.

1 Like

I think the idea is that services like AWS provide auto-scaling, so that new sentry nodes are automatically spawned when traffic increases (Auto Scaling with Elastic Load Balancing in AWS). However, what I don’t know, and hope someone can answer, is: how can the validator node add new peers (sentries) without restarting gaiad with a changed config.toml? How can something like this be achieved in a live system?

It should be possible with the /dial_peers RPC endpoint. When a new sentry node has been spawned, a request is made to the validator node’s /dial_peers endpoint to add the listen address of the new sentry node.
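A sketch of what such a request could look like, assuming the validator’s RPC is listening on localhost with unsafe endpoints enabled; the sentry node ID and address below are hypothetical:

```shell
# Sketch: telling a running validator to dial a freshly spawned sentry via the
# unsafe /dial_peers RPC endpoint ([rpc] unsafe = true must be set).
SENTRY_ID="f9baeaa15fedf5e1ef7448dd60f46c01f1a9e9c4"  # hypothetical sentry node ID
SENTRY_ADDR="10.0.0.7:26656"                          # hypothetical internal address
# peers is a JSON array of strings; %5B %22 %5D are the URL-encoded [ " ] characters
URL="http://127.0.0.1:26657/dial_peers?persistent=true&peers=%5B%22${SENTRY_ID}@${SENTRY_ADDR}%22%5D"
echo "$URL"
# Against a live validator you would then run:
# curl -s "$URL"
```

Your autoscaling hook would run the curl call after the new sentry reports itself healthy.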

I can’t find any information about the /dial_peers endpoint other than this: https://github.com/tendermint/tendermint/issues/866

Is the endpoint implemented yet? If so, where can I find more information about it?

Very much indeed. Perhaps should be pinned at the top of the thread. It is an excellent introduction. Thanks.

Would love to see any work here! It would really increase operational flexibility if we could do it that way. You could also drop this in the tools post: List of tools created by validators for validators

@katernoir This doc has some more information about the /dial_peers endpoint. You need to enable it by setting unsafe = true in the [rpc] section of your config.toml.
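For reference, a sketch of the relevant config.toml fragment. Since unsafe RPC methods can reconfigure the node, it seems wise to keep that RPC listener bound to localhost or a private interface:

```toml
[rpc]
laddr = "tcp://127.0.0.1:26657"  # keep the RPC off the public interface
unsafe = true                    # enables /dial_peers and other unsafe endpoints
```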

Also, to enable autoscaling you will need to take regular snapshots of full-node state in order to spin up nodes quickly. I’ve got a high-level overview of how to do this using GCE in a notes GitHub repo.

@mattharrop Looks like this issue will be fixed by the external_addr option added to the config. That will be available for the upcoming testnet.

1 Like

external_addr is good news. I think this will resolve the peering issues for GCP/AWS nodes, and solve the problem some of us had with validators losing connectivity.
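A sketch of what this could look like on a GCP/AWS sentry, assuming the option lands as an external address key in the [p2p] section (the exact key name may differ between releases; the addresses are hypothetical):

```toml
[p2p]
laddr = "tcp://0.0.0.0:26656"                  # bind to the instance's private interface
external_address = "tcp://203.0.113.10:26656"  # public IP the node advertises to peers
```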

I wonder if it will cause an inverse problem in a sentry node topology that will likely be common. GCP, AWS, and other complex cloud providers provision private IPs in virtual private cloud networks, and allow public IPs to be mapped to the private IP. Bare-metal validators in co-location facilities, connected to sentries in GCP/AWS over VPN, will likely be a common topology for resilient validators. In this topology, the sentry will be peering with outside nodes via its public IP, and with the validator via its private IP. I haven’t thought through the network routing issue in depth; this may be easy to solve with static routes in the VPN, and I don’t know enough about the P2P layer to know whether it’s even possible for one instance of gaiad to peer using different IP addresses.

Anyone have thoughts about this? There hasn’t been much discussion about the topologies that validators will build. I assume that VPN from co-lo to cloud, private peering relationships between operators using VPC peering, and eventually SDN or private links between sites will be used to develop the resiliency that we want to build.

The external_addr option sounds good. Can we use a domain name with this option instead of exposing the IP address?