Monitoring/Alerting for your Validator


#1

This post is a discussion of which metrics are worth monitoring and alerting on when running a validator. Below are the places where metrics can be obtained, with some notes on each:

  • Prometheus Port (26660)
    • Number of network peers (p2p_peers) can be alerted on if below a threshold
    • Time between blocks (consensus_block_interval_seconds) can be alerted on if above a threshold
    • Many of the other statistics are useful for tracking validator profitability and are worth recording, for example:
      • consensus_validators_power
      • consensus_num_txs
  • RPC Port (26657)
    • The /health endpoint can be used as a confirmation that the process is still running
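As a rough illustration of the threshold checks listed above, here is a minimal Python sketch that parses the Prometheus text output and flags a low peer count or a slow block interval. The threshold values, the /metrics URL and the helper names are my own assumptions, not part of any tool mentioned in this thread. Metric names are matched by suffix because the exporter may add a namespace prefix, and the sketch assumes both metrics are exposed as plain gauges.

```python
import re
import urllib.request

# Hypothetical thresholds; tune them for your own node and network.
MIN_PEERS = 5
MAX_BLOCK_INTERVAL_SECONDS = 15.0

# Matches "name value" or "name{labels} value" lines of the text format.
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def fetch_metrics(url="http://localhost:26660/metrics"):
    """Download the raw metrics text (the URL is an assumed local default)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse samples into {metric_name: value}.

    Labels are ignored and the last sample seen for a name wins,
    which is enough for simple gauges.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match:
            continue
        try:
            values[match.group(1)] = float(match.group(3))
        except ValueError:
            continue
    return values


def find_metric(values, suffix):
    """Return the first metric whose name ends with `suffix`, or None."""
    for name, value in values.items():
        if name.endswith(suffix):
            return value
    return None


def check_alerts(values):
    """Return a list of human-readable alert strings (empty if all is well)."""
    alerts = []
    peers = find_metric(values, "p2p_peers")
    if peers is not None and peers < MIN_PEERS:
        alerts.append(f"peer count {peers:.0f} is below {MIN_PEERS}")
    interval = find_metric(values, "consensus_block_interval_seconds")
    if interval is not None and interval > MAX_BLOCK_INTERVAL_SECONDS:
        alerts.append(
            f"block interval {interval:.1f}s is above {MAX_BLOCK_INTERVAL_SECONDS}s"
        )
    return alerts
```

A cron job could call check_alerts(parse_metrics(fetch_metrics())) and forward any returned strings to whatever notification channel you already use.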

I would be interested in feedback from @ajc and the rest of the Figment team on this post, as they are working on alerting and monitoring in their Hubble tool.


#2

We think there will be a couple of different kinds of alerts and long-running metrics to track.

A web app comparison: you definitely want to monitor your servers closely for uptime and error logs, and you also use a tool like New Relic to track more complex, longer-range metrics.

We want Hubble to be more like New Relic - surfacing more complex things in near real-time. Stuff that would be difficult for every Validator to reimplement and run on their own infrastructure.

Hubble currently syncs the blockchain once per minute and processes new blocks in less than a second, so it won’t be instantaneous, but should be fast enough.

The initial version of Alerts was built around this spec:

Event types

  1. In and out of Validator set
  2. M of N pre-commit votes missed
  3. N in a row pre-commit votes missed
  4. Voting Power changed by N percent
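To make the four event types concrete, here is a small Python sketch of how each condition could be evaluated. This is not Hubble's actual implementation; the function names and the representation of vote history (a list of booleans, oldest first, True meaning the pre-commit was included) are assumptions made for illustration.

```python
def validator_set_transition(was_in_set, is_in_set):
    """Return "joined", "left", or None if membership did not change."""
    if was_in_set == is_in_set:
        return None
    return "joined" if is_in_set else "left"


def missed_m_of_n(signed, m, n):
    """True if at least m of the most recent n pre-commits were missed."""
    window = list(signed)[-n:]
    return sum(1 for included in window if not included) >= m


def missed_n_in_a_row(signed, n):
    """True if the most recent n pre-commits were all missed."""
    window = list(signed)[-n:]
    return len(window) == n and not any(window)


def voting_power_changed(old_power, new_power, percent):
    """True if voting power moved by at least `percent` percent.

    Uses cross-multiplication so integer inputs avoid float rounding.
    A change from zero to any non-zero power always counts.
    """
    if old_power == 0:
        return new_power != 0
    return abs(new_power - old_power) * 100 >= percent * old_power
```

For example, with a history of [True, False, True, False, False, False], missed_m_of_n(history, 4, 6) and missed_n_in_a_row(history, 3) both hold.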

Notification types

  1. Instantaneous
  2. Daily summary

Instantaneous notifications will only be sent once every 15 minutes per-user per-validator.
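The 15-minute limit per user and per validator can be pictured as a simple throttle keyed on the (user, validator) pair. The sketch below is only an illustration under that reading; the class name, the in-memory dictionary and the injectable clock are assumptions, and a real service would need persistent storage.

```python
import time


class NotificationThrottle:
    """Allow at most one notification per (user, validator) per interval."""

    def __init__(self, min_interval_seconds=15 * 60, clock=time.monotonic):
        self._min_interval = min_interval_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_sent = {}  # (user_id, validator_id) -> time of last send

    def allow(self, user_id, validator_id):
        """Return True and record the send if the pair is not throttled."""
        key = (user_id, validator_id)
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._min_interval:
            return False
        self._last_sent[key] = now
        return True
```

Events that arrive while a pair is throttled are simply dropped here; they could instead be queued for the daily summary.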

So that’s what we’ve got so far. Curious to hear what kind of alerts and metrics are interesting for others!


#3

Just expanding on that thought a little bit. Hubble will alert for some things that we think are highly relevant to us as a validator, and hopefully to others, but it doesn’t help diagnose problems or provide any general health monitoring. We don’t think Hubble is a replacement for internal monitoring tools. Normal system monitoring, network health, security, ID and monitoring specific to gaiad on each node will still be necessary. We think an external monitoring system like Hubble is an important complement to internal tools, because it tells us how our validator is performing from the perspective of the blockchain.


#4


I’m loving it!!!


#5

Hello, the “Engineer”!


#6

@suyu good job!
will you share with us :blush:


#7

Good point, but I have one question for the other validators: which monitoring program are you using? I’ve read about Graylog, Zabbix and ELK; do you have any recommendations?
thanks!! :grinning:


#8

@suyu Can you share that dashboard code? Would love for the community to have an easy Grafana dashboard for the Prometheus metrics!


#9

ELK is a great choice for the logging portion of it. As for metrics, I would recommend either Prometheus (which the Grafana dashboard up there is pulling its data from) or InfluxDB, which provides a similar feature set.


#10

I’ll read about them, thanks! :hugs:


#11

Wow! That’s so cool. Good job @suyu


#12

I created a very simple dashboard for Grafana using the Prometheus metrics. Uploaded it to Grafana, so anyone can use it. Here is the link: https://grafana.com/dashboards/7044
Feedback on useful metrics welcome!


#13

This is awesome @katernoir! I’ll give it a try!


#14

simple??? I think it’s not simple :wink:
great work!!


#15

This is awesome! Great work @katernoir!


#16

I’m trying to set up Grafana/Prometheus and I feel like I’m missing something.
I set prometheus = true in config.toml. I checked that it’s listening on port 26660, and it appears Grafana is connecting:

netstat -an |grep 26660

tcp 0 0 127.0.0.1:41142 127.0.0.1:26660 ESTABLISHED
tcp 0 0 127.0.0.1:41146 127.0.0.1:26660 ESTABLISHED
tcp6 0 0 :::26660 :::* LISTEN
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41142 ESTABLISHED
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41146 ESTABLISHED

However, in the dashboard I see this:

And I see lots of 503 errors in the logs when the dashboard is running:
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=1 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=4 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"


#17

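You need to run a Prometheus server and add the node’s port 26660 as a scrape target in prometheus.yml. Port 26660 only exposes the raw metrics; Grafana cannot query it directly. Prometheus itself listens on port 9090 by default, so you then point your datasource in Grafana to <your_address_running_prometheus>:9090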

The config in prometheus.yml can be as simple as this.
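A minimal sketch of such a config, written as a short Python helper that renders the YAML so the target can be changed in one place. The job name 'cosmos', the 15s scrape interval and the localhost:26660 target are assumptions; adjust them to your setup.

```python
def render_prometheus_config(target="localhost:26660",
                             job_name="cosmos",
                             scrape_interval="15s"):
    """Return a minimal prometheus.yml that scrapes one validator node.

    All default values are illustrative assumptions, not recommendations.
    """
    return (
        "global:\n"
        f"  scrape_interval: {scrape_interval}\n"
        "\n"
        "scrape_configs:\n"
        f"  - job_name: '{job_name}'\n"
        "    static_configs:\n"
        f"      - targets: ['{target}']\n"
    )


if __name__ == "__main__":
    # Print the config; redirect the output to prometheus.yml to use it.
    print(render_prometheus_config())
```

With that file in place and Prometheus restarted, the node should show up under Status > Targets in the Prometheus web UI on port 9090.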


#18

We’ve turned on Hubble Alerts and Events for gaia-7001.

Instructions for how to use and subscribe are here:


#19

@katernoir Do you plan on updating this for 7004?


#20

@kwunyeung How about the telegram bot for 7004?