Monitoring/Alerting for your Validator

jack · July 10, 2018, 6:49pm

This post is a discussion about what metrics are important to monitor/alert on when running a validator. The following are places where metrics can be obtained and some notes on those locations:

Prometheus Port (26660)
- Number of network peers (p2p_peers) can be alerted on if below a threshold
- Time between blocks (consensus_block_interval_seconds) can be alerted on if above a threshold
- Many of the other statistics are useful for tracking validator profitability and should also be tracked, such as:
  - consensus_validators_power
  - consensus_num_txs
RPC Port (26657)
- The /health endpoint can be used as a confirmation that the process is still running
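
As a rough illustration of the two threshold alerts above, here is a minimal Python sketch that parses Prometheus text-format output and flags a low peer count or a long block interval. The threshold values and the function names (`parse_metrics`, `find_metric`, `check_alerts`) are hypothetical. Depending on the Tendermint version, metric names may carry a namespace prefix and the block interval may be exported as a histogram rather than a single value; this sketch matches on the name suffix and only handles the single-value case.

```python
from typing import Dict, List, Optional


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition format into {metric_name: value}.

    Labels are dropped, so if a name appears with several label sets
    the last sample wins. Good enough for simple gauges.
    """
    metrics: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            name, *rest = line.split()
        if rest:
            metrics[name] = float(rest[0])  # rest[1], if present, is a timestamp
    return metrics


def find_metric(metrics: Dict[str, float], suffix: str) -> Optional[float]:
    """Return the first metric named `suffix` or ending in `_suffix`."""
    for name, value in metrics.items():
        if name == suffix or name.endswith("_" + suffix):
            return value
    return None


def check_alerts(metrics: Dict[str, float], min_peers: int,
                 max_block_interval: float) -> List[str]:
    """Return human-readable alerts for the two thresholds from the post."""
    alerts: List[str] = []
    peers = find_metric(metrics, "p2p_peers")
    if peers is not None and peers < min_peers:
        alerts.append(f"peer count {peers:g} is below {min_peers}")
    interval = find_metric(metrics, "consensus_block_interval_seconds")
    if interval is not None and interval > max_block_interval:
        alerts.append(f"block interval {interval:g}s is above {max_block_interval:g}s")
    return alerts
```

The input text would typically come from an HTTP GET against the Prometheus port listed above (assumed here to be served at /metrics), and the returned alert strings could be forwarded to whatever paging or chat system you already use.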

Would be interested in some feedback from @ajc and the rest of the Figment team on this post, as they are working on alerting and monitoring in their Hubble tool.

ajc · July 10, 2018, 8:24pm

We think there will be a couple different kinds of alerts and long running metrics to track.

A web app comparison: you definitely want to monitor your servers closely for uptime and error logs, and you also use a tool like New Relic to track more complex and longer ranging metrics.

We want Hubble to be more like New Relic - surfacing more complex things in near real-time. Stuff that would be difficult for every Validator to reimplement and run on their own infrastructure.

Hubble currently syncs the blockchain once per minute and processes new blocks in less than a second, so it won’t be instantaneous, but should be fast enough.

The initial version of Alerts was built around this spec:

Event types

In and out of Validator set
M of N pre-commit votes missed
N in a row pre-commit votes missed
Voting Power changed by N percent
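
To make the event types above concrete, here is a small Python sketch of how such checks could be expressed. It is not Hubble's actual implementation. It assumes you already have a list of booleans, oldest first, recording whether the validator's pre-commit was present in each recent block; the function names are hypothetical.

```python
from typing import Sequence


def missed_m_of_n(signed: Sequence[bool], m: int, n: int) -> bool:
    """True if at least m of the last n blocks (n >= 1) were missed."""
    window = list(signed[-n:])
    return window.count(False) >= m


def missed_in_a_row(signed: Sequence[bool], n: int) -> bool:
    """True if the most recent n blocks (n >= 1) were all missed."""
    tail = list(signed[-n:])
    return len(tail) == n and not any(tail)


def voting_power_changed_by(old: int, new: int, percent: float) -> bool:
    """True if voting power moved by at least `percent` percent of the old value."""
    if old == 0:
        return new != 0
    # Cross-multiplied to avoid division and floating point rounding.
    return abs(new - old) * 100 >= percent * old
```

For example, with a history of `[True, True, False, True, False, False, False]`, both `missed_m_of_n(history, 4, 7)` and `missed_in_a_row(history, 3)` would fire.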

Notification types

Instantaneous
Daily summary

Instantaneous notifications will only be sent once every 15 minutes per-user per-validator.
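
A rate limit like this one can be sketched in a few lines of Python. The class below is an illustration only and says nothing about how Hubble does it; the name `NotificationThrottle` and the in-memory dictionary are assumptions, and a real service would likely persist the timestamps.

```python
from typing import Dict, Tuple


class NotificationThrottle:
    """Allow at most one notification per (user, validator) pair per interval."""

    def __init__(self, interval_seconds: float = 15 * 60) -> None:
        self.interval_seconds = interval_seconds
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def allow(self, user: str, validator: str, now: float) -> bool:
        """Return True and record the send time if a notification may go out.

        `now` is a timestamp in seconds, for example from time.time().
        """
        key = (user, validator)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.interval_seconds:
            return False  # still inside the quiet period for this pair
        self._last_sent[key] = now
        return True
```

Passing the clock in as `now` keeps the class easy to test; in production you would call `throttle.allow(user, validator, time.time())` before sending.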

So that’s what we’ve got so far. Curious to hear what kind of alerts and metrics are interesting for others!

mattharrop · July 10, 2018, 11:42pm

Just expanding on that thought a little bit. Hubble will alert for some things that we think are highly relevant to us as a validator, and hopefully to others, but it doesn’t help diagnose problems or provide any general health monitoring. We don’t think Hubble is a replacement for internal monitoring tools. Normal system monitoring, network health, security, ID and monitoring specific to gaiad on each node will still be necessary. We think an external monitoring system like Hubble is an important complement to internal tools, because it tells us how our validator is performing from the perspective of the blockchain.

suyu · July 11, 2018, 1:40am

I’m loving it!!!

kwunyeung · July 11, 2018, 3:35am

Hello, the “Engineer”!

ping · July 11, 2018, 5:13am

@suyu good job!
Will you share it with us?

wimel · July 11, 2018, 8:23pm

Good point, but I have one question for the other validators: which monitoring program are you using? I’ve read about Graylog, Zabbix and ELK; do you have any recommendations?
Thanks!!

jack · July 11, 2018, 8:52pm

@suyu Can you share that dashboard code? Would love for the community to have an easy Grafana dashboard for the prometheus metrics!

jack · July 11, 2018, 8:54pm

ELK is a great choice for the logging portion of it. As far as metrics go, I would recommend either Prometheus (which is what that Grafana dashboard up there is pulling its data from) or InfluxDB, which provides a similar feature set.

wimel · July 11, 2018, 8:59pm

I’ll read about them, thanks!

aurel · July 12, 2018, 7:35am

Wow! That’s so cool. Good job @suyu

katernoir · July 18, 2018, 3:29pm

I created a very simple dashboard for Grafana using the Prometheus metrics. Uploaded it to Grafana, so anyone can use it. Here is the link: https://grafana.com/dashboards/7044
Feedback on useful metrics welcome!

jack · July 18, 2018, 3:54pm

This is awesome @katernoir! I’ll give it a try!

wimel · July 18, 2018, 8:55pm

simple??? I think it’s not simple
great work!!

kwunyeung · July 19, 2018, 5:07am

This is awesome! Great work @katernoir!

pbostrom · July 20, 2018, 3:11am

I’m trying to set up Grafana/Prometheus and I feel like I’m missing something.
I set prometheus = true in config.toml. I checked that it’s listening on port 26660, and it appears Grafana is connecting:

netstat -an |grep 26660

tcp 0 0 127.0.0.1:41142 127.0.0.1:26660 ESTABLISHED
tcp 0 0 127.0.0.1:41146 127.0.0.1:26660 ESTABLISHED
tcp6 0 0 :::26660 :::* LISTEN
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41142 ESTABLISHED
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41146 ESTABLISHED

However, in the dashboard I see this:

And I see lots of 503 errors in the logs when the dashboard is running:
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=1 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=4 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"

kwunyeung · July 20, 2018, 3:16am

You need to run Prometheus itself and add port 26660 as a scrape target by editing prometheus.yml. Prometheus listens on port 9090 by default, so point your Grafana data source at <your_address_running_prometheus>:9090 rather than at port 26660 directly.

The config in prometheus.yml can be as simple as this.
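
For reference, a minimal scrape configuration along these lines might look like the following. This is a sketch and not necessarily the exact file posted here: the job name and scrape interval are illustrative, and the target assumes Prometheus runs on the same host as gaiad.

```yaml
# prometheus.yml (minimal sketch; job name and interval are illustrative)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: cosmos
    static_configs:
      # Assumes gaiad exposes its metrics on this host at port 26660
      - targets: ['localhost:26660']
```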

ajc · July 22, 2018, 6:51pm

We’ve turned on Hubble Alerts and Events for gaia-7001.

Instructions for how to use and subscribe are here:

chris-chainflow · July 31, 2018, 1:32pm

@katernoir Do you plan on updating this for 7004?

chris-chainflow · July 31, 2018, 1:33pm

@kwunyeung How about the telegram bot for 7004?
