@Greg or @jack Assuming a sentry and validator are running, how can one confirm the validator is running through the sentry and not directly through to the network, i.e. how can one confirm the validator is running “behind the sentry”?
What logging are you seeing on the validator? Is it signing blocks? That's one way to tell definitively. You should also see your validator in the dump_consensus_state results:
http://<node_ip>:26657/dump_consensus_state
Yes, it's signing blocks and I see it in the dump_consensus_state results. However, how do I know it's connecting through the sentry rather than simply connecting directly to the network?
You can see what peers your node is connected to by using the net_info route:
http://<node_ip>:26657/net_info
Thanks. So I see my validator connected to ~38 peers. If it was connected to my sentry, I would only see the sentry, correct?
That’s correct. You can get a quick list of all the peers your validator (or any node) is connected to with:
curl -s http://localhost:26657/net_info | grep moniker
In order to get your validator to connect only to your sentries, you need to do a few things (sample config.toml snippets follow each list below). On the validator:
- pex needs to be false
- no seeds
- only your sentries in persistent peers
Since you already have a bunch of peers, and you want to get rid of them, delete your address book.
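In config.toml terms, that looks roughly like this (the IDs and addresses are placeholders, and key names can vary a bit between Tendermint versions):

# [p2p] section of the validator's ~/.gaiad/config/config.toml
pex = false
seeds = ""
# only your own sentries, as id@ip:port
persistent_peers = "<sentry1_id>@<sentry1_ip>:26656,<sentry2_id>@<sentry2_ip>:26656"
# then remove the stale address book before restarting, e.g.:
#   rm ~/.gaiad/config/addrbook.json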
On your sentries:
- Do not put your validator in persistent peers
- Do put your validator's ID (just the ID, not ID@IP:port) in private peers
- pex needs to be true
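And on each sentry, something like this (again, placeholder values):

# [p2p] section of each sentry's config.toml
pex = true
# the validator's node ID only (no IP:port), so it is never gossiped to other peers
private_peer_ids = "<validator_node_id>"
# note: the validator is deliberately NOT listed in persistent_peers here;
# the validator dials out to the sentry instead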
I think you will still have a problem, because at least 32 nodes already know about your validator and will try to reconnect. Firewall rules that prohibit incoming connections will help, so that your validator decides who it connects to, and/or you can move it to a new IP address that won't be in anyone's address book.
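For the firewall piece, a minimal sketch with ufw, assuming the default Tendermint p2p port 26656 and placeholder sentry IPs (adapt to whatever firewall you actually use, and remember to allow your own management access, e.g. SSH, separately):

# allow inbound p2p traffic only from your own sentries
sudo ufw default deny incoming
sudo ufw allow from <sentry1_ip> to any port 26656 proto tcp
sudo ufw allow from <sentry2_ip> to any port 26656 proto tcp
sudo ufw enable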
I’m pretty new to Cosmos and still understanding Validators, so forgive me if this is a stupid question.
Is it possible to run multiple Validator servers behind your Sentry Nodes? Either by load balancing them, or just by having your Sentry Node talk to one of them directly.
@gkrizek: HA validators is something I'm thinking about, too. You can't just set up HAProxy in front of 2 validators and round-robin between them because of the double-signing issue. Let's say delegators have staked their tokens with your validator but you are running two carbon copies behind an LB… The carbon copies will also have the same ~/.gaiad/config/priv_validator.json, which will cause you to double-sign blocks and get slashed and unbonded.
What I am thinking of doing is running a full node alongside my live validator; the full node will keep a copy of all the blocks, but when the validator dies, I will overwrite the full node's priv_validator.json with the one from the validator (and perhaps re-run gaiacli stake create-validator ...).
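Roughly, the manual switchover I have in mind would look like this (the systemd unit name and the backup path are assumptions about my own setup):

# on the standby full node, once the primary validator is confirmed dead
sudo systemctl stop gaiad
cp /secure/backup/priv_validator.json ~/.gaiad/config/priv_validator.json
sudo systemctl start gaiad
# make absolutely sure the old validator stays offline,
# otherwise both nodes sign and you double-sign and get slashed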
Fully automated failover will require scripting and some way to trigger the script when the validator dies. Any suggestions?
Glad I'm not the only one! That's exactly what I was afraid would be the problem: double-signing. Keeping a stand-by node to flip to is not a bad idea for a Validator. Although I would still much prefer to run multiple Validators at once; if that's not possible, then this would do. If you didn't do something like this, how would you perform maintenance? You get slashed for downtime, so how are you supposed to update your Validator?
Yeah, you could definitely script the whole failover. How depends on what software you are using, but regardless, you should have a service monitoring your whole Validator infrastructure. You should be able to check whether the Validator is running by hitting the /health endpoint. (There might be a better way to check if it's running than that, but it would work.) If it doesn't respond, or responds in a bad way, then trigger your failover script. There are tons of ways to actually script that out, either with bash scripts and HAProxy or with some cloud solutions.
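As a rough illustration, here is a minimal bash watcher that polls the validator's /health endpoint and calls a failover script after a few consecutive failures; the RPC address, the promote-standby.sh script, and the thresholds are all placeholders:

#!/usr/bin/env bash
# Naive health watcher: poll the validator's RPC /health endpoint and
# trigger a failover script after several consecutive failures.
VALIDATOR_RPC="http://10.0.0.10:26657"              # assumed private address of the validator
FAILOVER_CMD="/usr/local/bin/promote-standby.sh"    # hypothetical failover script
FAILS=0

while true; do
  if curl -sf --max-time 5 "${VALIDATOR_RPC}/health" > /dev/null; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
  fi
  if [ "${FAILS}" -ge 3 ]; then
    echo "validator unhealthy, triggering failover"
    "${FAILOVER_CMD}"
    break
  fi
  sleep 10
done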
Hello. Are you guys also using Load Balancers? If so, how do you inject newly spun-up instances [id@ip:port] into config.toml? Each new node will have a new ID, and maybe even a new IP in some cases. Cheers!
I'm not sure what kind of setup @archjun is running, but it seems totally possible to run your Validator behind a Load Balancer with a master/slave type setup. To speak specifically to your question, @jack posted an answer to this:
When your Sentries are spun up, you can use the /dial_peers RPC endpoint to add them to your validator. This can easily be automated.
But I don't quite understand how load balancing would work over a p2p connection.
I think there are some important distinctions to make here. First, what I was referring to was a type of HA setup for the Validator only (nothing to do with Sentry Nodes). We were discussing how you would run more than one Validator in order to make it highly available. We determined that it theoretically should be possible to run a Validator and a Full Node side by side behind a type of Load Balancer or Proxy. You could have some kind of health check running on the Validator, and if it fails, it turns the Full Node into a Validator (with the same key) and routes traffic to it. This isn't really “load balancing” because there is only one server receiving traffic at any time, but a load balancer could be used to achieve it. I hope that makes sense.
Another thing to note with this setup is that you would need a way to manage state between the two Validators. For example, when a new Sentry comes online, you could use the /dial_peers endpoint to add it to the Validator, but you would have to make sure the secondary Validator knows about the new Sentry as well in case of failover.
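Something like the call below, assuming the RPC port is reachable and the node was started with the unsafe RPC routes enabled (I believe /dial_peers is gated behind that in Tendermint); the ID and address are placeholders:

# tell the validator (and the standby) to dial the new sentry
# -g disables curl's URL globbing so the brackets pass through untouched
curl -s -g 'http://localhost:26657/dial_peers?persistent=true&peers=["<new_sentry_id>@<new_sentry_ip>:26656"]'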
For the Sentry Nodes specifically, that's something I'm trying to dig into now. I'm not familiar enough with them and how they work yet, so I can't really speculate on how to handle those. Maybe someone could answer a question I have regarding them…
Where does the node_id come from, and what is it used for? It seems like maybe it's derived from ~/.gaiad/config/node_key.json? I understand that other peers expect that node_id to match the node_id of the Sentry they are connecting to, but what's stopping you from running 10 Sentries with the same node_id? For example, if I had a load balancer on the domain validator.example.com and 10 Sentries behind that load balancer, all with the same node_id, wouldn't that work fine for others to connect? I'm probably missing something there…
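From what I can tell, the node_id does come from node_key.json: it appears to be the first 20 bytes of the SHA256 of the node's p2p public key, hex encoded (at least in recent Tendermint versions). A rough shell sketch under that assumption, and assuming node_key.json stores a base64 64-byte ed25519 private key whose last 32 bytes are the public key:

# Extract the base64 key, take the trailing 32-byte public key,
# SHA256 it, and keep the first 20 bytes (40 hex chars): that should match the node ID.
jq -r '.priv_key.value' ~/.gaiad/config/node_key.json \
  | base64 -d \
  | tail -c 32 \
  | sha256sum \
  | cut -c 1-40

If your gaiad build has it, gaiad tendermint show-node-id should print the same value (I'm not certain the subcommand name is identical across versions).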
I've been thinking that validators should consider adding another type of node to their architecture: Relay Nodes.
Here is the definition of a Relay Node. A Relay Node is a full node that only makes connections to the sentry nodes of other validators chosen by the operator of the Relay Node. It runs with pex disabled. The firewall on the Relay Node blocks all connections from IP addresses other than those on a whitelist.
The Relay Node operator will whitelist the IP addresses of other validators' sentries in the Relay Node firewall, and those validators will add the Relay Node's ID and IP address to their persistent peers.
The presence of a small number of relay nodes could help ensure that consensus operates at maximum efficiency.
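Concretely, a Relay Node's p2p config would look roughly like this (placeholder IDs and addresses, same whitelisting idea on the firewall as described above):

# [p2p] section of a Relay Node's config.toml
pex = false
seeds = ""
# the sentries of the other validators this operator has agreed to peer with
persistent_peers = "<their_sentry_id>@<their_sentry_ip>:26656,<other_sentry_id>@<other_sentry_ip>:26656"
# plus firewall rules that drop everything except those whitelisted IPs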
I think these relay nodes from different validators can connect to each other too. Assuming the validator node connects only to the relay nodes, the validator depends on the network performance of the relay nodes. Actually, the relay node you are defining is my original interpretation of a “private sentry”.
I do have some questions, and hopefully I can find some answers:
- With the Sentry Node architecture, the validator's IP is hidden. How will the other validators know each other's IP addresses to perform consensus, or is this not required and we rely solely on broadcasting?
- If the validators do not know each other's IP addresses, wouldn't there be a performance hit (additional relay time)?
- How would the network measure a validator's uptime now that validators are hidden?
Thank you
This is an interesting idea. So would Relay Nodes be an alternative to Sentry Nodes? If I’m understanding correctly, these seem the same as Sentry Nodes, but they only allow connections from other trusted Sentry Nodes.
If this is instead a proposal for Relay Nodes to be an addition to Sentry Nodes, then it seems like maybe an overcomplication. But maybe I need to evaluate it more.
I’d extend this idea by suggesting that private sentry / relay nodes can communicate over private links, rather than public internet. VPC networks within GCP and AWS can peer within a cloud platform, and connect between platforms using VPN links.
1.) Yes, Validator IP addresses are hidden. They shouldn't even have a public IP at all. The rest of the network will know about the validator because its votes and blocks are broadcast via the Sentry Nodes.
2.) Yes, theoretically there would be a small performance hit when using Sentry Nodes. I think you are just expected to build an infrastructure for your Validator that is highly optimized and as quick as possible. As far as I know, the performance of a Validator doesn't matter; as long as it adheres to the rules of no double-signing, no downtime, and always voting, it will stay in the network. Maybe @zaki could chime in here about the performance requirements of Validators.
3.) It's not like the network does a ping to the Validator to see if it's up. It's checked by its participation in the network. If it's private, you should still see it participating in the network.
I've started to look into Sentry Nodes more in depth, and I think my original vision for them was a little off. I was originally imagining something like SentryA, SentryB, and SentryC, where each of those actually points to a Load Balancer, with a group of servers behind the load balancer sharing the same external_addr and ~/.gaiad/config/node_key.json. These servers would scale based on the number of requests or something, to handle DDoS attacks. (I'm still exploring whether this is even an option.)
But the more I've thought about it, shouldn't Sentry Nodes be transient? Set up something like a lifecycle for Sentry Nodes where they only live for 6 or 12 hours, then automatically replace the servers with fresh ones (and fresh IPs). These, too, could scale based on traffic in the event of a DDoS attack, and if one server is getting hammered, just remove it and create a new one. This would be much more of a moving target, rather than strictly handling the load horizontally. Of course, this method also requires a lot more tooling to handle such a dynamic infrastructure.
Sentry Node availability is just as important as Validator availability, right? If all your Sentries go down, your Validator effectively does as well, correct?