[Sane Defaults] Universal use of remote signers

The community consensus is quite clear that using a remote signer is the correct way to configure one’s validator.

Why doesn’t Notional use a remote signer?

human error

I personally call validators after they get slashed. In the case of this most recent slash on the hub, one of them is a dog, so I haven’t called them.

The other is @serejandmyself and I haven’t called them because they have produced very detailed documentation on the incident.

Both were victims of human error, and of not using a remote signer.

Using a remote signer goes outside the default flow of comet. In the default flow of comet, there is a signing key in the filesystem, e.g.:

cd ~/.gaia/config
config % ls -a -l
total 239248
drwxr-xr-x  9 faddat  staff        288 Jun 18 04:33 .
drwxr-xr-x  4 faddat  staff        128 Jun 16 15:22 ..
-rw-r--r--  1 faddat  staff    1058374 Jun 16 15:30 addrbook.json
-rw-r--r--  1 faddat  staff       9446 Jun 16 15:22 app.toml
-rw-------  1 faddat  staff        742 Jun 18 04:33 client.toml
-rw-r--r--  1 faddat  staff      18605 Jun 16 15:22 config.toml
-rw-r--r--  1 faddat  staff  121386449 Jun 16 15:22 genesis.json
-rw-------  1 faddat  staff        148 Jun 16 15:22 node_key.json
-rw-------  1 faddat  staff        345 Jun 16 15:22 priv_validator_key.json
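The defaults that wire this key into consensus live in config.toml. As a quick sanity check (these are the stock CometBFT defaults; exact values can vary by chain and version):

config % grep '^priv_validator' config.toml
priv_validator_key_file = "config/priv_validator_key.json"
priv_validator_state_file = "data/priv_validator_state.json"
priv_validator_laddr = ""

With priv_validator_laddr left empty, the node signs directly from the naked key shown above.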

In fact, this makes adding a remote signer more dangerous.

A number of double signing incidents have actually come from users in the process of adopting a remote signer: they forget to take the key out of the filesystem and end up double signing. This is my terror. No one is perfect, so we’ve got to design systems that avoid problems like these. It’s happened much more than you’d think.

Because of this risk, and given our current situation of not having a remote signer, I have always decided against adopting one. There are a couple of other things motivating this decision, for example the fact that we generally operate on-premises at our office in Hanoi.

It should not be hard to eliminate these double signing issues.

This is also the only reason that I can think of to not slash @pupmos and @serejandmyself. They were operating the software by its defaults.

To give a brief run-through of what happened:

Both the dog and the robot were running the old version of Neutron and had signed the block at the halt. When they changed the binary to the new version of Neutron, they did not preserve the priv_validator_state.json file, which would have prevented them from signing that block again with the new version of Neutron.
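For context, priv_validator_state.json lives in <node_home>/data and records the last height/round/step the node signed. Field formats vary a little between versions, but a freshly initialized file looks roughly like this:

data % cat priv_validator_state.json
{
  "height": "0",
  "round": 0,
  "step": 0
}

A node restarted with a fresh state file has no memory of the halt-height block it already signed, so nothing stops it from signing that height again.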

So indeed, in both cases this was an incident of human error.

However, I have always been a really big believer in sane defaults. Our current defaults are insane. If we want remote signers to be used, then naked signing should no longer be an option, and it most certainly should not be the default.

I’m going to suggest that comet have explicit modes of operation. First of all, a testing mode, where keys are simply kept in RAM because they don’t need to stick around for a long time. In the case of a long-running test, like the testnet command, maybe we can have a flag or something like --long-running-test, and then it could sign from a key in the file system. However, my preference is to actually remove that feature altogether.
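To make this concrete, here is a purely illustrative sketch of how explicit modes might look on the command line. None of these flags exist today; the names are placeholders for the idea, not a proposed interface:

# production: refuse to sign unless a remote signer is configured
gaiad start --signer remote --priv-validator-laddr tcp://a.b.c.d:1234

# testing: ephemeral key held only in RAM, discarded on shutdown
gaiad start --signer ephemeral

# long-running test (e.g. a testnet): explicitly opt in to a file-based key
gaiad start --signer file --long-running-test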

I would like to thank @zaki_iqlusion for reviewing this concept with me, and invite @marbar, @AdiSeredinschi, @valardragon and other ecosystem technology leads to review this and provide commentary.

motivation

I asked myself: what would get me to adopt a remote signer?

And then it hit me: I had made all of these calls to validators who had gotten slashed, and so many of them had gotten slashed in the process of adopting a remote signer. The thing that would make me adopt a remote signer is making that impossible by default. That would mean that we take the signing key out of the file system.

I think that it’s really important to protect production environments, and I always want to have the most secure, performant setup possible. By making the most secure mode of operation the default for production environments, we should be able to fully avoid this issue.

How I came up with this

These are my favorite quotes on security:

Bullet points

  • Default behavior should be safe behavior
  • Currently, signing by default from a naked key in the filesystem means that default behavior isn’t safe
  • Given numerous reports of validators getting slashed while adopting remote signers due to human error, I’ve personally been hesitant to adopt a remote signer.
  • It seems to me that by making sure that default behavior is safe and sane, we can:
    • eliminate the opportunity to double sign while configuring a remote signer
    • reduce or eliminate equivocation in cosmos by changing default software behavior to require a remote signer to sign blocks

Hi Jacob,

The remote signer migration

When you adopt a remote signer, you set the priv_validator_laddr:

# TCP or UNIX socket address for Tendermint to listen on for
# connections from an external PrivValidator process
priv_validator_laddr = "tcp://a.b.c.d:1234"

Once this setting is set, the node stops looking at the local priv_validator_state and instead interacts with the remote signer to get the signatures to broadcast.
This is the point where validators fail: they do not carry the current validator node state over to the remote signer.

Migrating to a remote signer is something to do when you are not drunk, tired, or distracted.

I also suspect that the validators who double signed on migration had not done the migration on a testnet first to exercise the process.
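For what it is worth, a rough sketch of a safer migration order; the service names and paths are placeholders, and the details depend on your signer (tmkms, Horcrux, etc.):

# 1. stop the node so nothing gets signed locally while you work
sudo systemctl stop gaiad                      # placeholder service name

# 2. point the node at the signer in config.toml:
#    priv_validator_laddr = "tcp://a.b.c.d:1234"

# 3. seed the signer with the last signed height/round/step
#    (taken from <node_home>/data/priv_validator_state.json),
#    then move the naked key off the node entirely
mv ~/.gaia/config/priv_validator_key.json /path/to/offline/backup/

# 4. start the signer first, confirm the node can reach it, then start the node
sudo systemctl start my-remote-signer          # placeholder service name
sudo systemctl start gaiad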

The node data reset

In the case of Pupmos and Citizen Cosmos, the error came from a reset of the node data, i.e. the command <bin> tendermint unsafe-reset-all.
Even though the word “unsafe” is in the command, most validators use it without much precaution.

In most cases, we use this command because:

  • The node database is corrupt, or
  • The database is getting too big and we want to state-sync or restart from a pruned snapshot.

But this command cleans up <node_home>/data, which contains the priv_validator_state.
Without this file, the node cannot know which rounds/blocks it has already signed.

When you do this while new blocks are flowing, there is almost zero risk of double signing: the time it takes to state-sync or extract the snapshot leaves the node catching up, and it will not try to sign the same round/block twice.

But in the case of a network halt, if you have already signed a round and then reset your node,
the node will catch up to the halt block and sign the round again, leading to a double sign.
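Until the defaults change, a hedged workaround when a reset is unavoidable (the paths and binary name assume a default gaiad home; adjust for your chain):

# back up the signing state before wiping the data directory
cp ~/.gaia/data/priv_validator_state.json ~/priv_validator_state.json.bak

gaiad tendermint unsafe-reset-all --home ~/.gaia

# restore it before restarting, so rounds you already signed stay recorded
cp ~/priv_validator_state.json.bak ~/.gaia/data/priv_validator_state.json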

conclusion

Instead of enforcing the use of a remote signer, I think it could be better and simpler to:

  • Move priv_validator_state.json to the <node_home> directory (changing the default priv_validator_state_file option in config.toml), as sketched just below this list,
  • Keep priv_validator_state.json when running the <bin> tendermint unsafe-reset-all command,
  • Add an option --reset-state to the <bin> tendermint unsafe-reset-all command.
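Purely for illustration, the first bullet would amount to a change like this in config.toml (the new path is part of the proposal, not a current default):

# current default: the state file sits inside data/ and gets wiped on a reset
priv_validator_state_file = "data/priv_validator_state.json"

# proposed default: keep it directly under <node_home>, out of reach of unsafe-reset-all
priv_validator_state_file = "priv_validator_state.json"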

DPoS is a contract between the two parties, in which the delegator places their trust in the validator.
It is up to the validator to do everything in their power to fulfill that mission.
But when a validator is not confident enough to take an action, they must avoid doing it.


Exactly my point.

You’ve described a complex and error-prone process that people can get wrong. I’d love to make it impossible for people to get it wrong.

My point again. So, we’ve tested remote signers, and my concern is human error.

Without too terribly much work, we can fully eliminate the possibility of human error in this process, so that:

  1. migrations are completely safe
  2. everyone uses remote signers in production

Sounds like a lot of overhead and changes for simply asking validators NOT to double sign. The system works amazingly as-is.

Just need validators to operate their machines as if they have hundreds of thousands (millions) at stake…which they do.


It’s actually not a ton of overhead though. I think that the changes are relatively minor.

We already have a test mode that keeps the key in RAM, and adding a flag is not hard.

The real goal here is to put an end to conversations where we encourage validators to do things that are outside of the default flow.

There’s also the insight that I’ve gained from seeing even really excellent teams screw up various processes; the interviews that I have had with validators after they got slashed have been a real eye-opener. I wish that I had been keeping formal notes, and we’ll do so in the future, but going from memory, I would say that we have had no malicious double signs except for the case on chronos, where an actual attacker had seized control of a validator’s system, and the attacker deliberately chose to double sign because the team that was being attacked kept trying to take back control of their node.

I would also say that approximately half of the teams I’ve contacted post-slashing have failed in the configuration of a remote signer. Contrary to what @David_Crosnest is claiming above, these have at times been very experienced teams who know the stack extremely well, where I really have no doubt about their operational capabilities.

So that’s what’s driven my choices and I recognize that it means that there is room for improvement in our ops at notional.

I decided to look at the situation from that perspective specifically: “Jacob, what the hell is holding you back?”

And that is where I found the answer.

What I’m saying is, looking at the actual data from double signing events, it’s more dangerous right now to switch to using a remote signer than it is to not use one.

And specifically that’s what was holding me back.

If the server is compromised, and the remote signer is running next to the node, the result will be the same.

And we cannot go against the opinion of @Golden-Ratio-Staking.

When you have the responsibility of millions of USD, the savings of people who trust you not to lose them, you must learn every day and run many tests before doing it live.

If they fail the configuration on a testnet, that’s fine; it is supposed to happen.
At that point you are able to learn from your mistakes and make multiple successful attempts before doing it for real on mainnet.
Before moving to Horcrux on our mainnet nodes, we ran for six months on testnets in order to make sure we would not make mistakes.


There isn’t one single way to accept and carry out responsibility, @David_Crosnest.
