I’m not sure what kind of setup @archjun is running, but it seems totally possible to run your Validator behind a Load Balancer with a master/slave type setup. To speak specifically to your question, @jack posted an answer to this:
When your Sentrys are spun up you can use the /dial_peers rpc endpoint to add them to your validator. This can easily be automated.
I think there are some important distinctions to make here. First, what I was referring to was a type of HA setup for the Validator only (nothing to do with Sentry Nodes). We were discussing how you would run more than one Validator in order to make it highly available. We determined that it should theoretically be possible to run a Validator and a Full Node side by side behind a type of Load Balancer or Proxy. You could have some kind of health check running on the Validator, and if it fails, the Full Node is turned into a Validator (with the same key) and traffic is routed to it. This isn’t really “load balancing” because only one server receives traffic at any time, but a load balancer could be used to achieve it. I hope that makes sense.
Another thing to note with this setup is that you would need a way to manage state between the two Validators. For example, when a new Sentry comes online, you could use the /dial_peers endpoint to add it to the Validator, but you would also have to make sure the secondary Validator knows about the new Sentry in case of failover.
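To make the failover idea concrete, here is a minimal sketch of the decision logic only, not a working HA setup. The names `FailoverState`, `record_health_check`, `register_sentry`, and the threshold of three failed checks are all hypothetical. It assumes failover is one-way, since letting two nodes sign with the same key at the same time risks a double-sign.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverState:
    """Tracks which backend receives traffic and which sentries both nodes must know."""
    active: str = "primary"          # "primary" or "standby"
    consecutive_failures: int = 0
    known_sentries: set = field(default_factory=set)


def record_health_check(state: FailoverState, primary_healthy: bool,
                        failure_threshold: int = 3) -> str:
    """Update the state after one health check and return the active backend.

    Failover is one-way in this sketch: once the standby is promoted, the
    primary is never re-activated automatically, because two nodes signing
    with the same key at once risks a double-sign.
    """
    if state.active == "standby":
        return state.active
    if primary_healthy:
        state.consecutive_failures = 0
    else:
        state.consecutive_failures += 1
        if state.consecutive_failures >= failure_threshold:
            state.active = "standby"
    return state.active


def register_sentry(state: FailoverState, peer: str) -> None:
    """Remember a new sentry (id@ip:port) so it can be dialed on both nodes."""
    state.known_sentries.add(peer)
```

The shared `known_sentries` set stands in for whatever state store you would actually use so the secondary node learns about new Sentrys before it is promoted.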
For the Sentry Nodes specifically, that’s something I’m trying to dig into now. I’m not familiar enough with them and how they work yet, so I can’t really speculate on how to handle those. Maybe someone could answer a question I have regarding those…
Where does the node_id come from and what is it used for? It seems like maybe it’s derived from the ~/.gaiad/config/node_key.json? I understand that other peers are expecting that node_id to match the node_id of the Sentry they are connecting to, but what’s stopping you from running 10 Sentrys with the same node_id? For example, if I had a load balancer on the domain validator.example.com and I have 10 Sentrys behind that load balancer all with the same node_id, wouldn’t that work fine for others to connect? I’m probably missing something there…
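On where the node_id comes from, here is a minimal sketch of how it can be computed from the key file. It assumes the scheme used by later Tendermint releases: node_key.json holds a base64-encoded 64-byte ed25519 private key whose last 32 bytes are the public key, and the ID is the hex of the first 20 bytes of SHA-256 over that public key. The version discussed in this thread may derive it differently, so treat the function name and the format as assumptions.

```python
import base64
import hashlib
import json


def node_id_from_key_file(node_key_json: str) -> str:
    """Derive a node ID from the contents of a node_key.json file.

    Assumes priv_key.value is a base64-encoded 64-byte ed25519 private key
    whose last 32 bytes are the public key.
    """
    doc = json.loads(node_key_json)
    raw = base64.b64decode(doc["priv_key"]["value"])
    if len(raw) != 64:
        raise ValueError("expected a 64-byte ed25519 private key")
    public_key = raw[32:]
    # Assumed scheme: hex of the first 20 bytes of SHA-256(public key).
    return hashlib.sha256(public_key).digest()[:20].hex()
```

Under that assumption the ID is a pure function of the key file, so copying the file to several servers yields the same ID on each. Whether peers tolerate several servers presenting the same ID is a separate question.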
I’ve been thinking that validators should consider adding another type of node to their architecture: Relay Nodes.
Here is the definition of a Relay Node. A Relay Node is a full node that only makes connections to the sentry nodes of other validators that the operator of the Relay Node has whitelisted. It runs with pex disabled. The firewall on the Relay Node blocks all connections from IP addresses other than those on a whitelist.
The Relay Node operator will whitelist the IP addresses of other validators’ sentries on the Relay Node firewall, and the validators will add the Relay Node’s ID and IP address to persistent peers.
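The whitelist rule can be sketched in a few lines. This only illustrates the check, since in practice the firewall itself would enforce it. The helper names and the default P2P port 26656 are assumptions, and the addresses in the usage note are documentation ranges, not real nodes.

```python
import ipaddress


def build_allowlist(entries):
    """Parse whitelist entries (single IPs or CIDR ranges) into network objects."""
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_allowed(source_ip: str, allowlist) -> bool:
    """Return True only if the connecting IP falls inside a whitelisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowlist)


def persistent_peer_entry(node_id: str, ip: str, port: int = 26656) -> str:
    """Format the Relay Node's ID and address the way persistent peers are listed."""
    return f"{node_id}@{ip}:{port}"
```

For example, `is_allowed("198.51.100.77", build_allowlist(["198.51.100.0/24"]))` accepts the connection, while any address outside the listed ranges is rejected.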
The presence of a small number of relay nodes could help ensure that consensus operates at maximum efficiency.
I think these relay nodes from different validators can connect to each other too. Assuming the validator node only connects to the relay nodes, the validator node depends on the network performance of the relay nodes. Actually, the relay node you are defining is my original interpretation of “private sentry”.
I do have some questions, and hopefully I can find some answers:
With the Sentry node architecture, the validator’s IP is hidden. How will the validators know each other’s IP addresses to perform consensus? Or is this not required, and we rely solely on broadcasting?
If the validators do not know each other’s IP addresses, wouldn’t there be a performance hit (additional relay time)?
How would the network measure a validator’s uptime now that validators are hidden?
This is an interesting idea. So would Relay Nodes be an alternative to Sentry Nodes? If I’m understanding correctly, these seem the same as Sentry Nodes, but they only allow connections from other trusted Sentry Nodes.
If this is instead a proposal for Relay Nodes to be an addition to Sentry Nodes, then it seems like maybe an over complication. But maybe I need to evaluate it more.
I’d extend this idea by suggesting that private sentry / relay nodes can communicate over private links, rather than public internet. VPC networks within GCP and AWS can peer within a cloud platform, and connect between platforms using VPN links.
1.) Yes, Validator IP Addresses are hidden. They shouldn’t even have a public IP at all. The rest of the network will know about it because its votes and blocks are broadcasted via the Sentry Nodes.
2.) Yes, theoretically there would be a small performance hit when using Sentry Nodes. I think you are just expected to build an infrastructure for your Validator that is highly optimized and as quick as possible. As far as I know, the performance of a Validator doesn’t matter as long as it adheres to the rules: no double-signing, no downtime, and always vote. If it does, it will stay in the network. Maybe @zaki could chime in here about performance requirements of Validators.
3.) It’s not like the network does a ping to the Validator to see if it’s up. It’s checked by its participation in the network. If it’s private, you should still see it participating in the network.
I’ve started to look into Sentry Nodes more in depth and I think my original vision for them was a little off. I was originally imagining something like SentryA, SentryB, and SentryC where each of those actually goes to a Load Balancer. Then have a group of servers behind the load balancer with the same external_addr and ~/.gaiad/config/node_key.json. These servers would scale based on # of requests or something to handle DDoS attacks. (I’m still exploring the possibility of this even being an option.)
But the more I’ve thought about it, shouldn’t Sentry Nodes be transient? Set up something like a lifecycle for Sentry Nodes where they only live for 6 or 12 hours, then automatically replace the servers with fresh ones (and fresh IPs). These could also scale based on traffic in the event of a DDoS attack. And if one server is getting hammered, just remove it and create a new one. This would be much more of a moving target than strictly handling the load horizontally. Of course, this method also requires a lot more tooling to handle an infrastructure this dynamic.
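A minimal sketch of the lifecycle idea, assuming each Sentry records its launch time and some external monitor sets a flag when it is being hammered. The `Sentry` record, the `under_attack` flag, and the 6-hour default are hypothetical. Creating and destroying the servers is left to whatever tooling you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentry:
    name: str
    created_at: float            # Unix timestamp when the instance was launched
    under_attack: bool = False   # set by external monitoring (assumed)


def sentries_to_replace(sentries, now: float, max_age_hours: float = 6.0):
    """Return the sentries that are past their lifetime or flagged as overloaded."""
    max_age_seconds = max_age_hours * 3600
    return [s for s in sentries
            if s.under_attack or now - s.created_at >= max_age_seconds]
```

A rotation job would run this periodically, bring up replacements first, and only then remove the old servers, so the Validator is never left without a Sentry.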
Sentry Node availability is just as important as Validator availability, right? If all your Sentrys go down, your Validator does as well, correct?
That’s an interesting idea. I’m also thinking the private sentry/relay nodes should not always connect to the same set of sentry nodes in the persistent peers. As the relay nodes won’t gossip and rely on the public sentry nodes to connect to the network, if that small number of public sentry nodes disconnects, the validator node can’t sync or push votes.
It would be interesting if the relay nodes switched to different known healthy sentry nodes from time to time. The list of sentry nodes should be managed by the validators themselves.
This is exactly what I experienced in 7001. The sync speed was slow. Even when all my connected sentries were healthy and fully synced, the validator node was always out of sync even though catching up was false. It had to wait
public network > sentry > relay > validator
The validator node had to wait until the relay was synced, and the relay waited until the sentry was synced. That made the validator node always miss votes. If we need the validator node to be HA, the performance and availability of the front-facing façade are also very important.
Currently, a single-core instance is enough for the sentries, as their job is mainly to stay in sync. The public sentries need more memory because memory usage grows as they connect to more peers. Relay nodes use less memory than public sentries as they only connect to a limited number of persistent peers. The validator node requires at least a 2-core instance with a similar amount of memory to the private sentries. In short, memory depends mostly on the number of connected peers, while the validator node needs more CPU cores to keep in sync while signing votes.
Where/how do we find the “ID” for a node on which gaiad isn’t running yet?
“gaiacli status” won’t work if gaiad isn’t running. I don’t want to run gaiad first, because then the validator would be visible. Ideally, there’s a way to find “ID” on a node where gaiad isn’t running.
I think @kwunyeung might have pointed me to this earlier in Riot…
Avoids the hazard of requiring sentries to “dial in” to the validator(s), but instead let the validator discover sentries and only establish outbound connections.
Enables “local peer” discovery between sentries
Requires unsafe RPC (unsafe = true in config.toml)
As a consequence, the RPC should be proxied by nginx or similar to ensure only /status is exposed
Sentry RPC must be behind a load balancer which will distribute traffic among instances (round robin)
So the basic idea is that anyone (be it a 3rd party, sentry or validator) requesting /status via the load balancer will receive a random status containing node-id, ip and port. Do this enough times, periodically, and one will eventually learn about all sentry nodes.
A local node (sentry or validator) can then feed this information into the local gaiad instance using the /dial_peers RPC. Like so:
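A minimal sketch of that flow, split into pure helpers so the HTTP calls stay out of the way. It assumes the /status response exposes `result.node_info.id` and `result.node_info.listen_addr`, that `listen_addr` holds a reachable address, and that /dial_peers accepts a JSON list in `peers` plus a `persistent` flag. The function names and the default ports are assumptions.

```python
import json
from urllib.parse import urlencode


def peer_from_status(status: dict, p2p_port: int = 26656) -> str:
    """Build an id@host:port string from one /status response (assumed layout)."""
    info = status["result"]["node_info"]
    listen = info["listen_addr"].split("://")[-1]   # drop "tcp://" if present
    if ":" in listen:
        host, port = listen.rsplit(":", 1)
    else:
        host, port = listen, str(p2p_port)
    return f"{info['id']}@{host}:{port}"


def collect_peers(status_responses) -> list:
    """Deduplicate the peers seen across repeated /status polls, keeping order."""
    seen = []
    for status in status_responses:
        peer = peer_from_status(status)
        if peer not in seen:
            seen.append(peer)
    return seen


def dial_peers_url(rpc_base: str, peers, persistent: bool = True) -> str:
    """Build the URL for the unsafe /dial_peers RPC on the local node."""
    query = urlencode({
        "peers": json.dumps(list(peers), separators=(",", ":")),
        "persistent": "true" if persistent else "false",
    })
    return f"{rpc_base.rstrip('/')}/dial_peers?{query}"
```

A small loop would fetch /status through the load balancer every so often, pass the parsed responses to `collect_peers`, and then request the URL from `dial_peers_url("http://127.0.0.1:26657", peers)` against the local gaiad RPC.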
I posted this to Cosmos Discord too, but in the interest of time and greater exposure, am posting it here too. I want to get this right, and need expert feedback on sentry architecture. Promise to write a medium post on this once I am done :).
Here is what I’ve designed as my sentry-validator architecture based on numerous posts I have seen. I am a bit puzzled why nobody has suggested a VPN for both sentries and validators as I mention here (unless it amounts to crazy VPN costs).
Would greatly appreciate any input as I am in the process of automating this. ONE big note: I am placing the validators in the cloud – not in a data center with dedicated hardware. Please tell me your most critical thoughts. Also, I don’t see any specific mention of a “sentry” P2P option and assume that Sentry is a result of settings and context.
We need a VPN on which all the sentries and validators have an IP address. The VPN itself is inaccessible from the public Internet, so the VPN IP addresses are not Internet accessible.
Each sentry is assumed to be PAIRED with one or more validators.
Validators: ONLY have a single interface, which is their interface to the VPN.
Sentries: DOUBLY homed (two interfaces), with one interface being to the VPN, and the other to the public Internet. Internet-facing interface is how clients (CLI, REST server, etc) communicate with the chain.
Validators: pex = false in config.toml
Sentries: pex = true in config.toml
Validators: addr_book_strict = false in config.toml
Sentries: addr_book_strict = false in config.toml
Validators: persistent_peers = list of nodeid@IP of all sentries
Sentries: persistent_peers = nodeid@IP of paired validators + (optionally) nodeid@IP of some or all the other sentries
Sentries: private_peer_ids = nodeid of paired validators (is this even necessary if the IP of the validators is private?)
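The settings listed above can be generated per role, which helps when automating the setup. This is a minimal sketch: `p2p_settings` and `to_toml_lines` are hypothetical helpers, and the peer strings are assumed to be `nodeid@ip:port` on the VPN (the port is an addition to the `nodeid@IP` form used above). The sentry case leaves out the optional list of other sentries.

```python
def p2p_settings(role: str, sentries, validators) -> dict:
    """Return the config.toml values listed above for one node role.

    sentries and validators are lists of "nodeid@ip:port" strings using
    VPN addresses.
    """
    if role == "validator":
        return {
            "pex": False,
            "addr_book_strict": False,
            "persistent_peers": ",".join(sentries),
        }
    if role == "sentry":
        return {
            "pex": True,
            "addr_book_strict": False,
            "persistent_peers": ",".join(validators),
            # Only the node IDs of the paired validators, without addresses.
            "private_peer_ids": ",".join(p.split("@", 1)[0] for p in validators),
        }
    raise ValueError(f"unknown role: {role}")


def to_toml_lines(settings: dict) -> list:
    """Render the settings as key = value lines suitable for config.toml."""
    lines = []
    for key, value in settings.items():
        rendered = str(value).lower() if isinstance(value, bool) else f'"{value}"'
        lines.append(f"{key} = {rendered}")
    return lines
```

Calling `to_toml_lines(p2p_settings("sentry", sentries, validators))` gives the lines to merge into a sentry’s config.toml, and the same call with `"validator"` gives the validator side.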
Overall, I am trying to simplify this while also keeping security as utmost importance as well as bandwidth minimization. It occurs to me that the following cardinal rules (proposed) could apply:
a. Validator AND Sentry’s persistent peers are ALL validators and ALL sentries. (literally the same value for both)
b. Sentry’s private_peer_ids are ALL validators.
Connecting the sentries and validator node via VPN has been mentioned many times in different posts. It was also mentioned in the first post from @jack in this thread. Sentry, Relay, and Validator are just what we call the nodes depending on how you set them up in the infrastructure, just like proxy and load balancer. It’s more about the functionality and how you connect the nodes. If you are looking to set up a VPN between nodes, you may consider WireGuard or Tinc.