This post is a discussion of which metrics are important to monitor and alert on when running a validator. The following are the places where metrics can be obtained, with some notes on each:
Prometheus Metrics Port (26660)
Number of network peers (p2p_peers) can be alerted on if it drops below a threshold
Time between blocks (consensus_block_interval_seconds) can be alerted on if it rises above a threshold (see the example alerting rules after this list)
Many of the other statistics are useful for tracking validator profitability and should be tracked as well, such as:
consensus_validators_power
consensus_num_txs
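To make the two alert thresholds above concrete, here is a minimal sketch of a Prometheus alerting rules file. The metric names are the ones listed above; the thresholds, durations, and rule names are placeholder assumptions you would tune for your own network.

# load this file from prometheus.yml via the rule_files section
groups:
  - name: validator-alerts            # placeholder group name
    rules:
      - alert: LowPeerCount
        expr: p2p_peers < 5           # assumed threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fewer than 5 peers connected"
      - alert: SlowBlocks
        # assumed threshold; if your Tendermint version exports this metric as a
        # histogram, alert on an averaged expression instead, e.g.
        # rate(consensus_block_interval_seconds_sum[5m]) / rate(consensus_block_interval_seconds_count[5m])
        expr: consensus_block_interval_seconds > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Block interval above 30 seconds"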
RPC Port (26657)
The /health endpoint can be used as a confirmation that the process is still running
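A simple liveness probe against that endpoint could be a periodic curl from your existing monitoring; the address below assumes the default local RPC listener.

# /health returns HTTP 200 with an empty result while the node is up;
# -f makes curl exit non-zero on an error status so a cron/monitoring job can alert
curl -sf http://localhost:26657/health || echo "gaiad RPC is not responding"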
I'd be interested in some feedback from @ajc and the rest of the Figment team on this post, as they are working on alerting and monitoring in their Hubble tool.
We think there will be a couple of different kinds of alerts and long-running metrics to track.
A web app comparison: you definitely want to monitor your servers closely for uptime and error logs, but you also use a tool like New Relic to track more complex, longer-range metrics.
We want Hubble to be more like New Relic: surfacing more complex things in near real-time, stuff that would be difficult for every validator to reimplement and run on their own infrastructure.
Just expanding on that thought a little bit. Hubble will alert on some things that we think are highly relevant to us as a validator, and hopefully to others, but it doesn't help diagnose problems or provide general health monitoring. We don't think Hubble is a replacement for internal monitoring tools: normal system monitoring, network health, security, ID, and monitoring specific to gaiad on each node will still be necessary. We think an external monitoring system like Hubble is an important complement to internal tools, because it tells us how our validator is performing from the perspective of the blockchain.
Good point, but I have one question for the other validators: which monitoring tool are you using? I've read about Graylog, Zabbix, and ELK; any recommendations?
thanks!!
ELK is a great choice for the logging portion of it. As far as metrics go, I would recommend either Prometheus (which is what the Grafana dashboard up there is pulling its data from) or InfluxDB, which provides a similar feature set.
I created a very simple dashboard for Grafana using the Prometheus metrics. Uploaded it to Grafana, so anyone can use it. Here is the link: https://grafana.com/dashboards/7044
Feedback on useful metrics welcome!
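For anyone wiring up their own panels rather than importing the dashboard, a few example PromQL expressions over the metrics discussed above (assuming they are exported as plain gauges) might look like this; the exact panel set in the linked dashboard may differ.

# connected peers
p2p_peers
# total voting power of the validator set seen by this node
consensus_validators_power
# transactions in the most recent block
consensus_num_txs
# average block interval over the last 10 minutes
avg_over_time(consensus_block_interval_seconds[10m])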
I’m trying to set up Grafana/Prometheus and I feel like I’m missing something.
I set prometheus = true in config.toml. I checked that it’s listening on port 26660, and it appears Grafana is connecting:
netstat -an |grep 26660
tcp 0 0 127.0.0.1:41142 127.0.0.1:26660 ESTABLISHED
tcp 0 0 127.0.0.1:41146 127.0.0.1:26660 ESTABLISHED
tcp6 0 0 :::26660 :::* LISTEN
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41142 ESTABLISHED
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41146 ESTABLISHED
And I see lots of 503 errors in the logs when the dashboard is running:
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=1 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=4 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"
You need to run Prometheus itself and configure it to scrape the 26660 target by editing prometheus.yml. Prometheus will listen on port 9090, and you then point your data source in Grafana to <your_address_running_prometheus>:9090 instead of at 26660 directly.
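If you prefer to configure that data source from a file rather than the Grafana UI, Grafana 5+ can provision it from a YAML file in its provisioning/datasources directory; the name and URL below are placeholders for a single-host setup.

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090    # <your_address_running_prometheus>:9090
    isDefault: true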
The config in prometheus.yml can be as simple as this.
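For reference, a minimal scrape config pointing Prometheus at the Tendermint metrics port might look like the following; the job name and scrape interval are arbitrary choices.

global:
  scrape_interval: 15s            # arbitrary choice

scrape_configs:
  - job_name: gaiad               # placeholder job name
    static_configs:
      - targets: ['localhost:26660']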