How to upgrade or maintain a validator on a live mainnet

Can we discuss how to upgrade or maintain a live validator without stopping gaiad?

As we know, stopping the service will get our atoms slashed, but sometimes we have to upgrade or fix bugs.

Do you have any ideas?

The mainnet should have a longer downtime threshold so that validators have time to fix downtime issues. If a validator can bring itself back up within the threshold, it should not be slashed, so you can upgrade the software and restart your service in that period. We expect to gain more experience with this on gaia-7000.

In another thread we have been discussing an autoscaling idea for keeping uptime without restarting the service. You may refer to the usage of the /dial_peers endpoint here.
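
For anyone who wants to try it, here is a minimal sketch of how a /dial_peers request could be built. The RPC address, the port, and the peer string are placeholders, and the `peers` / `persistent` parameter names are my assumption of what the Tendermint RPC expects, so check them against the version you run. As far as I know the endpoint is only served when unsafe RPC is enabled on the node.

```python
import json
from urllib.parse import urlencode


def dial_peers_url(rpc_addr, peers, persistent=False):
    """Build a /dial_peers request URL for a node's RPC server.

    rpc_addr: "host:port" of the RPC listener (assumed value, check your config).
    peers:    list of "node_id@ip:port" strings (placeholders in this sketch).
    """
    query = urlencode({
        # The peer list is sent as a JSON array in the query string.
        "peers": json.dumps(peers, separators=(",", ":")),
        "persistent": "true" if persistent else "false",
    })
    return f"http://{rpc_addr}/dial_peers?{query}"


if __name__ == "__main__":
    # Hypothetical sentry address; replace with a real node ID and IP.
    print(dial_peers_url("localhost:26657", ["<node_id>@10.0.0.5:26656"]))
```

The function only builds the URL; you can then send it with curl or any HTTP client against a newly started sentry so it joins the network without restarting the existing nodes.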


Thanks.
Can we put the validator node and sentry nodes into a cluster like Kubernetes or Docker Swarm?
Then we could update each instance one by one.
But I am not sure whether this would cause double signing.

If you put them inside Kubernetes, I believe you still have to give each pod its own IP to connect to; you can’t treat them as one single node. Kubernetes is good for deploying and upgrading everything at once, but it may not be appropriate for maintaining them as separate nodes. It seems @aurel is using Kubernetes.


@ping Running in a dynamic environment like that would be difficult. To bring up nodes with full data you would need snapshots of the validator which would be difficult with current kubernetes APIs.

One way to do this would be to have a “warm backup” gaiad that you move the validator key over to after shutting down your running validator.
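
A rough sketch of the key hand-over, just to show the ordering that matters: the primary must be stopped before the backup ever sees the key, and the key is moved rather than copied so two nodes can never sign with it at once. The file name `priv_validator.json` and the idea that both config directories are reachable from one machine are assumptions for illustration; in practice the move would happen over whatever secure channel you use between hosts.

```python
import shutil
from pathlib import Path

# Assumed key file name; it differs between gaiad/Tendermint versions.
KEY_FILE = "priv_validator.json"


def move_validator_key(primary_cfg, backup_cfg, primary_stopped):
    """Move the signing key from the stopped primary to the warm backup."""
    if not primary_stopped:
        # Two running nodes with the same key is exactly how double signing happens.
        raise RuntimeError("stop the primary gaiad first to avoid double signing")
    src = Path(primary_cfg) / KEY_FILE
    dst = Path(backup_cfg) / KEY_FILE
    if dst.exists():
        # Keep the backup node's own (non-validator) key so it can be restored later.
        dst.replace(dst.with_name(KEY_FILE + ".bak"))
    # Move, not copy: only one copy of the validator key remains afterwards.
    shutil.move(str(src), str(dst))
    return dst
```

After the move, the backup gaiad would need a restart to pick up the validator key, and the old primary can be upgraded at leisure since it no longer holds it.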


I agree that a warm backup is one way to do it.
We will try this later and share the results with everyone.


To my knowledge, liveness slashing only occurs after 5000 missed blocks. That should be enough for any kind of update/upgrade.
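
To put that number into wall-clock time, here is a quick back-of-the-envelope calculation. The 5000-block figure is the one quoted above, and the 5-second block time is only an assumption for illustration; plug in the block time you actually observe on the network.

```python
def downtime_window_hours(missed_blocks, block_time_seconds):
    """Hours a validator can stay offline before reaching the missed-block limit."""
    return missed_blocks * block_time_seconds / 3600


# 5000 blocks at an assumed 5 s per block
print(round(downtime_window_hours(5000, 5.0), 2))  # 6.94
```

So under these assumptions the window is roughly seven hours, and it shrinks proportionally if blocks come faster.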


I agree that this should be enough for every single validator to update/upgrade. However, what happens if a new update is released and a majority of validators try to update at the same time? In that case the chain will halt, because we go below the threshold, right? This could raise issues if every major update makes the chain halt for a few thousand blocks.
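
To make the concern concrete: Tendermint needs more than 2/3 of the total voting power online to commit a block, so the question is how much power is offline at the same moment. A small sketch with made-up validator names and voting powers:

```python
def chain_can_progress(voting_power, offline):
    """True if the validators still online hold more than 2/3 of total power."""
    total = sum(voting_power.values())
    online = sum(p for name, p in voting_power.items() if name not in offline)
    # Integer comparison avoids floating-point rounding at the 2/3 boundary.
    return 3 * online > 2 * total


# Hypothetical validator set, powers chosen only for illustration.
powers = {"a": 40, "b": 30, "c": 20, "d": 10}
print(chain_can_progress(powers, {"d"}))  # True: 90 of 100 still online
print(chain_can_progress(powers, {"a"}))  # False: only 60 of 100 online
```

In this example a single large validator going down for an upgrade is already enough to stop block production until it comes back.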

If a validator node can miss 5000 blocks without being slashed, that is enough!

@katernoir Yes, that could happen, so there should be an update plan for validators.

If validators are diversified enough, they should be spread over different time zones and have different maintenance hours. The effect should be minimal, as an update should be done within 15 minutes, unless an individual validator has been delegated too many tokens.

Or can validators temporarily unbond themselves before the upgrade? There are new commands in gaiacli for unbonding. Not sure if they are related.

I agree that in theory they should have different maintenance hours. However, I think that large validators will want to upgrade their nodes as fast as possible. Therefore a new release could create unexpected downtime for the network.

Maybe it’s best to communicate scheduled updates between validators. If we coordinate this over the chat/forum, we can prevent downtime. Or maybe the cosmos team has already figured out a way smarter option to do this :slight_smile:
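
As a starting point for such coordination, here is a hedged sketch of how upgrade slots could be planned: validators are grouped into batches so that each batch holds less than 1/3 of the total voting power, which keeps more than 2/3 online while a batch restarts. The names and powers are invented, and the greedy grouping is just one simple way to do it, not something the cosmos team has proposed.

```python
def plan_upgrade_batches(voting_power):
    """Group validators into batches that each hold less than 1/3 of total power."""
    total = sum(voting_power.values())
    batches, current, current_power = [], [], 0
    # Place the largest validators first so they end up in small batches.
    for name, power in sorted(voting_power.items(), key=lambda kv: -kv[1]):
        if 3 * power >= total:
            # This validator cannot go offline at all without halting the chain.
            raise ValueError(f"{name} alone holds 1/3 or more of the voting power")
        if 3 * (current_power + power) >= total:
            batches.append(current)
            current, current_power = [], 0
        current.append(name)
        current_power += power
    if current:
        batches.append(current)
    return batches


# Hypothetical validator set, powers chosen only for illustration.
print(plan_upgrade_batches({"a": 30, "b": 25, "c": 20, "d": 15, "e": 10}))
# [['a'], ['b'], ['c'], ['d', 'e']]
```

Each batch would then get its own time slot announced in the chat/forum. The error case also shows why a validator with too much delegated power is a problem for any upgrade plan.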
