Sentry Node Architecture Overview

Overview

The Sentry Node Architecture (referred to as SNA in this document) is an infrastructure example for DDoS mitigation on Gaia / Cosmos Hub network validator nodes.

Disclaimer

It is important to understand that this is only one example of solving DDoS mitigation for validator nodes. For diversity in the network, validators are encouraged to implement their own solutions. Each validator is responsible for their own solution. This example might be missing crucial security features that need to be implemented for production use.

Problem description

On the Cosmos Hub, a validator node can be targeted by a Distributed Denial of Service (DDoS) attack. The validator node has a fixed IP address and exposes a RESTful API port to the Internet.

Proposed solution

To mitigate the issue, multiple distributed nodes (sentry nodes) are deployed in cloud environments. Because sentry nodes are easy to scale, it becomes harder for an attack to reach the validator node. New sentry nodes can be brought up during a DDoS attack and integrated into the transaction flow through the gossip network.

Network layout

The solution provided here is based on Amazon AWS services. Google Cloud has similar solutions to solve this issue.

The proposed network diagram is similar to the classical backend/frontend separation of services in a corporate environment. The “backend” in this case is the private network of the validator in the data center. The data center network might involve multiple subnets, firewalls and redundancy devices, which are not detailed in this diagram. The important point is that the data center allows direct connectivity to the chosen cloud environment. Amazon AWS has “Direct Connect”, while Google Cloud has “Partner Interconnect”. This is a dedicated connection to the cloud provider (usually directly to your virtual private cloud instance in one of the regions).

All sentry nodes (the “frontend”) connect to the validator using this private connection. The validator does not have a public IP address to provide its services.

Amazon has multiple availability zones within a region. Sentry nodes can be installed in other regions too. In that case, the second, third and further regions need a private connection to the validator node. This can be achieved by VPC Peering (“VPC Network Peering” in Google Cloud). Traffic from sentry nodes in those regions is then routed to the first region and through the direct connection to the data center, arriving at the validator.

A more resilient solution (not detailed in the diagram) is to have multiple direct connections from the data center to different regions. This way VPC Peering is not mandatory, although it is still beneficial for the sentry nodes. This removes the risk of depending on a single region, but it is more costly.

Logical configuration

The validator is only going to talk to the sentry nodes, while sentry nodes have the ability to talk to the validator node on the private channel and talk to public nodes elsewhere on the Internet. Optionally, they could be set up to talk to each other on the private network too.

The config.toml configuration is going to determine the logical setup of the network. Four parameters define how a node communicates:

  • pex: boolean value. It turns the peer exchange reactor (gossip protocol) on or off in a node. When pex=false, only the nodes in the persistent_peers list are available for connection.
  • persistent_peers: comma-separated list of nodeid@ip:port values that defines the peers that are expected to be online at all times and that the node is expected to be able to connect to. This is necessary at first startup so the node has a few other nodes to connect to. It is less crucial once the peer exchange reactor has filled the list of available nodes. If some nodes are not available, they are skipped and retried for a while before being dropped completely. If no nodes from this list are available and pex=false, the node will not be able to join the network.
  • private_peer_ids: comma-separated list of nodeid values that should never be gossiped. This setting tells the node which peers should not be handed out to others when pex=true. If pex=false, this setting can be omitted.
  • addr_book_strict: boolean value with a confusing name. In short, turn this off if some of the nodes are on a LAN IP. By default, only nodes with a routable address are considered for connection. This is what a “strict” address book means. If this setting is turned off (false), non-routable IP addresses, such as addresses in a private network, can be added to the address book.
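
As a rough orientation, and assuming a stock Tendermint config.toml, these four settings sit in the [p2p] section. The values below are what I believe the defaults to be; treat them as an assumption and check the file generated by your own version.

```toml
# Sketch of the relevant part of config.toml.
# Values are assumed defaults, not taken from the original post.
[p2p]

# Peer exchange reactor (gossip) on or off.
pex = true

# Comma-separated nodeid@ip:port entries; empty in this sketch.
persistent_peers = ""

# Comma-separated node IDs that must not be gossiped; empty in this sketch.
private_peer_ids = ""

# true = only routable addresses are accepted into the address book.
addr_book_strict = true
```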

Validator node configuration

| Config Option | Setting |
| --- | --- |
| pex | false |
| persistent_peers | list of sentry nodes |
| private_peer_ids | omitted |
| addr_book_strict | false |

The validator node should have pex=false set so it doesn’t even try to gossip. The validator node will only communicate with the sentry nodes. The sentry nodes should be added to the persistent_peers list, so the validator is able to connect to them. As pex=false, the private_peer_ids setting can be omitted. Since the validator is on a private network and it will connect to the sentry nodes also on a private network, addr_book_strict=false has to be set.
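
A minimal sketch of what the validator settings above might look like in config.toml. The node IDs, the private IP addresses and the port 26656 (assumed here to be the P2P listen port) are hypothetical placeholders, not values from this post.

```toml
# Validator node: hypothetical example, replace every placeholder.
[p2p]

# No peer exchange: the validator never gossips.
pex = false

# Only the sentry nodes, reached over the private network.
# <sentry1_node_id>, <sentry2_node_id> and the 10.0.x.x addresses are placeholders.
persistent_peers = "<sentry1_node_id>@10.0.1.11:26656,<sentry2_node_id>@10.0.1.12:26656"

# Can stay empty because pex is off.
private_peer_ids = ""

# Allow non-routable (private network) addresses.
addr_book_strict = false
```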

Sentry Node Configuration

| Config Option | Setting |
| --- | --- |
| pex | true |
| persistent_peers | validator node, optionally other sentry nodes |
| private_peer_ids | validator node id |
| addr_book_strict | false |

The sentry nodes should be able to talk to nodes on the Internet and they should benefit from the peer exchange reactor, hence pex=true is set. They should also make sure they don’t gossip the validator node ID and IP address, hence private_peer_ids should contain the validator node’s ID. Also, the validator node is expected to be up and running, and since it is not gossiped, the only way to connect to it is to add it to the persistent_peers list. Because the validator is on a private network, addr_book_strict=false needs to be set.
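
A matching sketch for a sentry node, under the same assumptions as before: the node IDs, private IP addresses and port 26656 are hypothetical placeholders chosen for illustration.

```toml
# Sentry node: hypothetical example, replace every placeholder.
[p2p]

# Peer exchange on: the sentry gossips with public nodes.
pex = true

# The validator (private address) and, optionally, other sentry nodes.
persistent_peers = "<validator_node_id>@10.0.0.10:26656,<sentry2_node_id>@10.0.1.12:26656"

# Never hand out the validator's ID and address to other peers.
private_peer_ids = "<validator_node_id>"

# Allow the validator's non-routable (private network) address.
addr_book_strict = false
```

Note that private_peer_ids takes bare node IDs, while persistent_peers takes full nodeid@ip:port entries.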

This setup implies that sentry nodes have both a public and a private address, but only the public IP should be gossiped. This can be achieved by explicitly setting the --external-ip option when initializing the sentry node. Unfortunately, as of this writing, this option is under review and not yet implemented.

Challenges

Direct connection

Although both Google Cloud and Amazon AWS have direct connection capabilities, costs can go up quickly when multiple direct connections are established to the data center. The network configuration becomes convoluted too.

On the other hand, it is the only way to make cloud connectivity redundant. If the region where the direct connection is established goes down, the regions connected using VPC peering lose connection to the validator.

Also, not all data centers have direct connect capabilities. Check the relevant documentation in the cloud and with the data center.

Dynamic scaling

The sentry node architecture in the cloud begs for automated scaling. Unfortunately, adding a new sentry node to the network poses a few challenges.

  • The persistent_peers settings in the configuration need to be updated and the service reloaded, at least on the validator. This requires some kind of configuration management that is out of scope for Tendermint right now. (Look into DevOps tooling.)
  • The new node will take a long time to sync if it has to start from scratch. One interesting idea is to save the blockchain state from other nodes on a regular basis (for example to S3 or by snapshotting an instance) and deploy that when a new node is added. The added complexity requires scaling services like CloudFormation and AutoScaling to carry extra logic for proper deployment.

Ops

The validator node doesn’t require a public Internet connection for its service but it still requires maintenance, security updates and monitoring. This can be achieved through a server hosted in the cloud, or a separate Internet connection in the data center (preferably with VPN connectivity). It is strongly recommended to hire/contract someone with operations experience for a secure setup.

The sentry nodes require maintenance and security updates too. Since they have a public-facing interface, extra care needs to be taken for proper maintenance. For example, we don’t want to end up with snapshots that contain malware that was introduced on an instance earlier.

Why not all-in with the cloud?

Putting the validator in the cloud has technical difficulties. The KMS services provided by Amazon and Google are missing some of the algorithms Tendermint is using. Setting up a validator without KMS is a big security risk for production use.

The other issue is that VPC Peering is not available among different cloud providers. The alternative is a VPN over the public Internet. Otherwise the validator is locked into one cloud provider, which is also a risk.

Other attacks to resolve

The DDoS attack described here is not the only attack to be resolved for a validator node. For example, the HTTP port open to the Internet is susceptible to man-in-the-middle attacks. Although this attack doesn’t impact the Cosmos Hub, it can impact the validator node. An attacker can present an altered state of the network to the validator node and force it to behave incorrectly from the network’s perspective - which can lead to slashing of the validator node’s tokens. The SNA does not aim to resolve this; however, it provides some coverage by making the man-in-the-middle attack harder, because of the multiple sentry nodes.

Summary

The above solution provides a way to hide the IP address of the validator node and provide a more easily scalable list of public IP addresses for DDoS mitigation. As mentioned before, this is one such proposed architecture and hopefully more will surface in the community with active discussion around them.

To give credit where it is due, please thank @Greg for his great work on this doc! There will be additions and more info added here!

Thanks for this post.
Just one note: in persistent_peers for the sentries, there is no need to list the validator or the other sentries. Just add other full nodes, not your validator.
Like .

Great overview! Thanks Jack and Greg!

Just wanted to comment on the Sentry Node Configuration section where the text “They should also make sure they don’t gossip the validator node id and IP address, hence the private_peer_ids should contain the validator node’s ID.” is not supported by the table which shows that private_peer_ids is omitted (which would be true for the validator configuration above).

From my understanding, the body text is correct and the table should be updated to include the validator ID in the private_peer_ids field.

Cheers

This passage is excellent! It gives us more knowledge on security concerns and the architecture.

A little confusion on pex=false and private_peer_ids.

As discussed before, if pex=false is set on the validator node, it is not necessary to put the node ID of the validator node in the private_peer_ids of the sentry nodes, as the validator node will not gossip anyway. Or have I gotten this wrong?

Schema is hard to read in dark theme :confused:

Thanks for this very helpful information. I plan to read it a second time, to process it more thoroughly. Until then, I’m wondering -

1 - What is the benefit of allowing sentry nodes to communicate directly with each other, i.e. with the optional private connections?

2 - Is there still a concept of public and private sentry nodes? I can’t remember if I saw this mentioned in Riot somewhere or not.

Jack is fixing it right now. Thanks for pointing it out.

I have the draw.io files, we can create a light version. (The PNGs are transparent.) Will do later.

pex=false on the validator node means that the validator node will not gossip to anyone. pex=true on the sentry node means that the sentry node will gossip to everyone. This means that the persistent_peers list on the sentry node will be gossiped together with every other node detail that the sentry node receives through other gossips.
In effect, the sentry node will give away the node ID and private IP address of the validator node, unless the sentry node is explicitly told not to do that by listing the validator’s node ID in private_peer_ids.

As Zaki pointed out to me in an internal discussion, it’s only the private IP of the validator node. My security standards tell me to not give away any crumb of information unless necessary but it’s not the end of the world, if private_peer_ids are not set.

Interesting idea. I was under the assumption that a new sentry node can automatically talk to the validator node if persistent_peers is set up - which is true, but it might not help the validator at all. (The sentry might deplete the validator’s resources while trying to sync up, and the validator will not try to connect to the sentry node.) I’ll do some more digging and if I don’t see any problems with this, I’ll update the doc.

One thing I don’t like is that a sentry node only becomes useful when the validator node is updated (sentry node added to persistent_peers on the validator). Currently, this is only possible during a maintenance schedule of the validator, since it requires the validator to be restarted. (So, I wouldn’t automate it just yet.) I have to check whether sending a SIGHUP signal would reload the config; I’m not sure.

The other thing that comes to mind is that there is a limit on how many peers the validator node is going to connect to. (Current default is 50.) It’s something to be aware of when someone sets up too many sentry nodes. :wink:

I was not going to address this because I think these are the kind of questions the community should discuss among the members. But alas, here’re my two cents:

  1. This is something to discuss. I see the benefit of connecting to trusted nodes - especially in a hostile environment. A scenario I can think of is trying to sync up during a DDoS attack and sync-up is hindered because of malicious nodes timing out on you. (Or your public Internet connection is already saturated.) It’s definitely not necessary for the core SNA setup but you might find it useful when creating your threat model.

  2. Based on the configuration, you can set up a sentry node to be private. (The task is left for the reader as an exercise. :wink: ) You can use it to have “warmed up” nodes that you want to add soon-but-not-just-yet to your defense system, or for other use cases, like making snapshots of the private node so you can use them as a template for sentry nodes (see the issue with slow syncing when you bring up a node, and the automation ideas). This is again up to you and you have to find your own use case. The core SNA didn’t discuss it. Maybe it’s worth creating extensions to this document that deal with problems not yet resolved.

Hi there, thanks.

When you said:
“One thing I don’t like is that a sentry node only becomes useful when the validator node is updated (Sentry node added to persistent_peers on the validator.)”

The sentry is useful all the time, because it is connected to 40 or more peers almost all the time, while the validator is only connected to the sentries (for example, 4).

Maybe I am missing something in translation. Help me understand. :wink:

On the other hand, when you said: “Currently, this is only possible during a maintenance schedule of the validator, since it requires the validator to be restarted. (So, I wouldn’t automate it just yet.)”

It is not automated yet, but as I understand it, when the validator loses peers from its persistent_peers list, it should logically reconnect automatically to the sentries in that list. There is an issue open about this now, but it is in Tendermint.
Because of that, I am not sure sentries are going to work in the next release, gaia-7000.

But the setup is different for a sentry that works with private_peer_ids and one that works with persistent_peers.

As far as I understand, a sentry can work in private or public mode (with a different setup for each mode).

In private mode you have to use private_peer_ids with only the ID; there is no need for id+ip+port.

@kwunyeung I have some notes from the test. Am I wrong now?

Thanks for the reply. This is what I expected to happen, but it doesn’t. I was confused because in gaia-6002, when I put the node ID of the validator node in private_peer_ids, one can still see the info of the validator node under n_peers, with the private IP of the node and is_outbound equal to false. This made me think of setting up a private sentry that doesn’t gossip, that only the validator node connects to, and that connects to the public sentry over the private network. The public sentry would then gossip info about the private sentry but never get the info of the validator node.

Maybe the behavior is different in gaia-7000. Let’s see how it happens with the setup.

Thanks!

What @Greg wanted to explain was that when there is a new sentry, its node address has to be added to persistent_peers on the validator node. Otherwise, the new sentry node won’t be useful, as the validator node is not talking to it.

He meant that the validator node has to be restarted once the config.toml file has been updated with the new sentry node address.

@Greg there is a way to avoid restarting the validator node, which is to add a new peer using the /dial_peers endpoint. @jack mentioned this in the following thread.

Question: let’s say I’m not using a private network to connect a sentry node to a validator. Should I set -

addr_book_strict = false

In this case too?

If there are no private addresses you can leave addr_book_strict as true.

When you reference private addresses, do you mean addresses you don’t want gossiped, or private networking addresses?