Redundancy of Validator Server (Physical Datacenter)
Security and maintenance have come up many times among validators, since they play a critical role in the safe operation of validator nodes within the network.
While considering several security architectures, such as the one described in the article “Sentry Node Architecture”, our team identified backing up the validator server as one of the crucial issues.
The Cosmos Network documentation recommends that validators keep their validator nodes in a local datacenter while operating sentry nodes in a cloud environment such as AWS or GCP.
But even in a well-managed datacenter, there can be several unexpected issues that will bring the validator node down:
The datacenter loses power
Numerous other causes that can disrupt the healthy operation of a validator node
Thus, we have come up with an idea to prevent the above problems by setting up the validator node with the procedure below:
Connect NFS to two servers
Set up validator in the connected NFS
Create two identical accounts in each of the two servers
Run validator on one of the newly created accounts within a server
If one server is shut down, or goes down for any reason, we can simply start the validator again on the other, identical server. Since both servers are connected to a single network storage, there should be only one copy of the block data, which should prevent double-signing.
This method lets the validator keep up with the current block height, but the storage connection is extremely slow. Using a SAN (Storage Area Network) instead of NFS should resolve this, but a SAN is extremely expensive and would not be worthwhile when evaluated from several points of view.
Knowing that connection speed is a problem, it would still be very helpful to know what other validators think of backing up a validator server with this method. If there is any other effective method, please feel free to share!
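As a rough illustration of the takeover decision described above, the standby server could look at a heartbeat file that the active server keeps updating on the shared storage. This is only a sketch under assumptions: the heartbeat file, the function names and the 30-second threshold are hypothetical and not part of our actual setup, and a stale heartbeat alone does not prove that the other validator process has really stopped.

```python
import os
import time

HEARTBEAT_MAX_AGE = 30.0  # seconds; hypothetical threshold, tune for your setup


def write_heartbeat(path, now=None):
    """Called periodically by the active server (hypothetical helper)."""
    now = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(now))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to storage
    os.replace(tmp, path)  # swap the new heartbeat into place


def active_looks_dead(path, now=None, max_age=HEARTBEAT_MAX_AGE):
    """Return True only if a readable heartbeat exists and is older than max_age.

    A missing or unreadable heartbeat returns False: if we cannot tell,
    we do not take over (missing blocks is cheaper than double-signing).
    """
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            last = float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False
    return (now - last) > max_age
```

The standby would start the validator only after `active_looks_dead(...)` returns True and an operator has confirmed that the other server is really off.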
Some notes on this approach:
Your NFS server will be a single point of failure, and running a highly available NFS cluster (or anything that touches storage, really) is a science in itself. You now have two separate interdependent HA clusters to care about instead of just one (the validator and NFS).
A highly available enterprise SAN is very expensive and there’s still a chance of failure.
NFS is very latency-sensitive, so you can’t distribute it across multiple data centers. Same goes for a SAN - there are mechanisms for cross-data center mirroring, but they’re asynchronous (and therefore useless).
You will need a bullet-proof failover mechanism like Pacemaker with an odd number of nodes to ensure that there is never more than one validator process running; otherwise, you will end up double signing. Pacemaker and friends aren’t designed for cross-data center operation, either, and are finicky to operate.
Failover will be rather slow and you will miss blocks.
By sharing the disk storage, you effectively have a single failure domain: there are a number of failure scenarios that you can’t recover from, like corrupted files, a filled-up disk or filesystem corruption.
Most importantly: This setup does not reliably prevent a split brain/double signing scenario - there’s plenty of edge cases. If your active validator crashes at just the right time, you will double sign. Write barriers are hard enough with local storage, and even harder with any network file system (we believe we just found a Tendermint bug while verifying this).
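To make the write-barrier point concrete, here is a simplified sketch of a "persist before you sign" guard. It is not Tendermint's actual implementation: the class, the file layout and the field names are made up for illustration, and it is stricter than a real signer (it refuses any repeat of the same height/round/step). The safety of this pattern depends entirely on `fsync` and the rename being honoured by the storage underneath, which is exactly what is hard to guarantee on a network file system.

```python
import json
import os


class DoubleSignError(Exception):
    """Raised when a signing request does not advance past the last signed state."""


class SignState:
    """Hypothetical last-signed record: (height, round, step)."""

    def __init__(self, path):
        self.path = path
        self.last = self._load()

    def _load(self):
        try:
            with open(self.path) as f:
                d = json.load(f)
            return (d["height"], d["round"], d["step"])
        except FileNotFoundError:
            return (0, 0, 0)  # nothing signed yet

    def _persist(self, hrs):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"height": hrs[0], "round": hrs[1], "step": hrs[2]}, f)
            f.flush()
            os.fsync(f.fileno())  # the write barrier: must hit stable storage
        os.replace(tmp, self.path)
        # A production version would also fsync the containing directory.

    def sign(self, height, round_, step, sign_fn):
        hrs = (height, round_, step)
        if hrs <= self.last:
            raise DoubleSignError(f"refusing to sign {hrs}, last signed {self.last}")
        # Record the intent BEFORE producing the signature. A crash after this
        # point costs a missed vote; a crash before it cannot cause a re-sign.
        self._persist(hrs)
        self.last = hrs
        return sign_fn()
```

If the record is written after signing, or the write is acknowledged before it is durable, a crash at the wrong moment lets the restarted (or standby) process sign the same height again.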
With Tendermint/Cosmos, you’re always going to want to sacrifice availability for consistency (a “CP” system in terms of the CAP theorem - any reliable distributed system needs to be partition-tolerant). The penalty for double signing is much harsher than missing blocks.
In practical terms, this means that unless you have a solid distributed systems background and operational experience, you might be better off running a single node on highly redundant enterprise hardware rather than building an HA setup on your own.
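The "consistency over availability" rule, and the reason cluster managers want an odd number of nodes, can be illustrated with a strict-majority check. This is a toy sketch, not how Pacemaker is configured or implemented; the function name and parameters are hypothetical.

```python
def may_run_validator(reachable_peers: int, cluster_size: int) -> bool:
    """Allow the validator to run only if this node sees a strict majority.

    reachable_peers: number of OTHER cluster nodes this node can reach.
    cluster_size: total number of nodes in the failover cluster.
    """
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be in [0, cluster_size)")
    votes = reachable_peers + 1  # count ourselves
    # Strict majority: two disjoint partitions can never both satisfy this,
    # so at most one side of a network split keeps running.
    return votes * 2 > cluster_size


# Three nodes: one node can fail and the remaining two still have a majority.
print(may_run_validator(reachable_peers=1, cluster_size=3))
# Two nodes split from each other: neither side may run (availability is lost).
print(may_run_validator(reachable_peers=0, cluster_size=2))
```

With an even number of nodes, a clean half-and-half split leaves nobody with a majority, so the extra node adds cost without adding tolerance. In every ambiguous case the check chooses to stop the validator rather than risk two of them running.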
thanks for your kind answer
I guess it’s important to understand that ‘slashing’, or any other type of punishment, is greater for double signing and other ‘arbitrary’ actions than for missing a few blocks once in a while.
You can use DRBD for dual-node replication, but it is important to manage the risk of double signing!