Backing up Validator Server (Physical Data Center)

certus_zl · August 23, 2018, 8:59pm

Some notes on this approach:

Your NFS server will be a single point of failure, and running a highly available NFS cluster (or anything that touches storage, really) is a science in itself. You now have two separate interdependent HA clusters to care about instead of just one (the validator and NFS).
A highly available enterprise SAN is very expensive and there’s still a chance of failure.
NFS is very latency-sensitive, so you can’t distribute it across multiple data centers. Same goes for a SAN - there are mechanisms for cross-data center mirroring, but they’re asynchronous (and therefore useless here, since the replica can lag behind the last signed state).
You will need a bullet-proof failover mechanism like Pacemaker with an odd number of nodes to ensure that there’s never more than one validator process running; otherwise, you will end up double signing. Pacemaker and friends aren’t designed for cross-data center operation either, and they are finicky to operate.
Failover will be rather slow and you will miss blocks.
By sharing the disk storage, you effectively have a single failure domain: there are a number of failure scenarios that you can’t recover from, like corrupted files, a filled-up disk or filesystem corruption.
Most importantly: this setup does not reliably prevent a split brain/double signing scenario - there are plenty of edge cases. If your active validator crashes at just the right time, you will double sign. Write barriers are hard enough with local storage, and even harder with any network file system (we believe we just found a Tendermint bug while verifying this).
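
To make the write-barrier point concrete, here is a minimal sketch of the kind of guard a signer needs: record the last signed height/round/step durably before releasing a signature, and refuse anything that isn't strictly newer. This is not Tendermint's actual implementation; the names (SignState, DoubleSignGuard, authorize) and the JSON file layout are made up for illustration. Note that the guard is only as good as the fsync underneath it, and whether that really reaches stable storage on a network file system depends on the setup, which is exactly the problem described above.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass(order=True, frozen=True)
class SignState:
    # Compared field by field, in this order (hypothetical layout).
    height: int = 0
    round: int = 0
    step: int = 0


class DoubleSignGuard:
    """Refuses to authorize a signature unless the request is strictly
    newer than the last durably recorded one."""

    def __init__(self, path):
        self.path = path
        self.last = self._load()

    def _load(self):
        try:
            with open(self.path) as f:
                return SignState(**json.load(f))
        except FileNotFoundError:
            return SignState()

    def _persist(self, state):
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(asdict(state), f)
            f.flush()
            os.fsync(f.fileno())      # barrier: file contents hit the disk
        os.replace(tmp, self.path)    # atomic swap of old state for new
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)          # barrier: the rename itself is durable
        finally:
            os.close(dir_fd)

    def authorize(self, height, round_, step):
        new = SignState(height, round_, step)
        if new <= self.last:
            return False              # same or older position: never sign
        self._persist(new)            # persist first, sign afterwards
        self.last = new
        return True
```

The ordering is the important part: the state is persisted before the caller is allowed to sign, so a crash between the two steps costs you a missed block rather than a double sign. Two processes pointed at the same file over NFS would still each trust their own in-memory copy of the state, so this does nothing to fix the shared-storage setup.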

With Tendermint/Cosmos, you’re always going to want to sacrifice availability for consistency (a “CP” system in terms of the CAP theorem - any reliable distributed system needs to be partition-tolerant). The penalty for double signing is much harsher than missing blocks.
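
As a rough illustration of what choosing consistency over availability looks like in a failover cluster, here is a small sketch of a strict-majority rule. The function names are hypothetical and this is not how Pacemaker or Tendermint implement anything; it only shows why a side of a partition that can't see a majority has to stay down, and why an even split means nobody signs.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    # reachable_nodes includes the node doing the counting.
    # A strict majority can exist on at most one side of a partition.
    return reachable_nodes > cluster_size // 2


def sides_allowed_to_sign(partition_sizes):
    # Given the sizes of each side of a network partition, return the
    # indices of the sides that may run the validator.
    cluster_size = sum(partition_sizes)
    return [
        i for i, size in enumerate(partition_sizes)
        if has_quorum(size, cluster_size)
    ]


# 3 nodes split 2/1: only the larger side may sign.
print(sides_allowed_to_sign([2, 1]))
# 4 nodes split 2/2: neither side may sign. Blocks are missed,
# but nothing gets double signed.
print(sides_allowed_to_sign([2, 2]))
```

With an even number of nodes, a clean half/half split leaves the validator down entirely, which is one reason to use an odd node count.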

In practical terms, this means that unless you have a solid distributed systems background and operational experience, you might be better off running a single node on highly redundant enterprise hardware rather than building an HA setup on your own.
