Monitoring/Alerting for your Validator

This is awesome! Great work @katernoir!

I’m trying to set up Grafana/Prometheus and I feel like I’m missing something.
I set prometheus = true in config.toml. I checked that it’s listening on port 26660, and it appears Grafana is connecting:

netstat -an |grep 26660

tcp 0 0 127.0.0.1:41142 127.0.0.1:26660 ESTABLISHED
tcp 0 0 127.0.0.1:41146 127.0.0.1:26660 ESTABLISHED
tcp6 0 0 :::26660 :::* LISTEN
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41142 ESTABLISHED
tcp6 0 0 127.0.0.1:26660 127.0.0.1:41146 ESTABLISHED

However, in the dashboard I see this:

And I see lots of 503 errors in the logs when the dashboard is running:
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=1 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"
t=2018-07-20T03:09:37+0000 lvl=info msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/datasources/proxy/1/api/v1/query status=503 remote_addr=127.0.0.1 time_ms=4 size=59 referer="http://localhost:3000/d/ajjGYQdmz/cosmos-network-dashboard?refresh=5s&orgId=1"
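
A side note on the netstat output above: an open port 26660 only shows that something is listening, not that metrics are actually being served. Here is a minimal sketch for checking that directly. It assumes the node exposes the Prometheus text format at the default /metrics path on localhost:26660 and emits samples without timestamps. `parse_metrics` and `fetch_metrics` are illustrative names, not part of gaiad or Prometheus.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition format into {sample_name: float}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. The sample
    name keeps its label set, e.g. 'peers{chain_id="x"}'.
    Assumption: samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Without timestamps, the value is the last space-separated field.
        name, sep, value = line.rpartition(" ")
        if not sep:
            continue
        try:
            samples[name.strip()] = float(value)
        except ValueError:
            # Not a numeric sample line; ignore it.
            continue
    return samples


def fetch_metrics(url="http://localhost:26660/metrics", timeout=5):
    """Fetch and parse the node's metrics page (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

If `fetch_metrics()` returns samples but the dashboard stays empty, the node side is probably fine and the problem sits between Grafana and Prometheus.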

You need to run Prometheus and configure it, by editing prometheus.yml, to scrape the target on port 26660. Prometheus itself listens on port 9090, so you then point your data source in Grafana to <your_address_running_prometheus>:9090
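
To confirm that Prometheus really sees the node, you can ask its HTTP API for the built-in `up` metric, which Prometheus records for every scrape target. A rough sketch follows, assuming Prometheus runs on localhost:9090. `build_query_url` and `targets_up` are made-up helper names for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a Prometheus server."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def targets_up(response_body):
    """Map each scraped instance to True/False from an `up` query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    result = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # An instant-vector sample is [unix_timestamp, "value-as-string"].
        result[instance] = series["value"][1] == "1"
    return result


def check(base_url="http://localhost:9090", timeout=5):
    """Query a running Prometheus (address is an assumption) for `up`."""
    url = build_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return targets_up(resp.read().decode("utf-8"))
```

A target mapped to False means Prometheus is running but cannot scrape that address. A connection error from `check()` means Prometheus itself is not reachable, which is also what makes Grafana's data source proxy fail.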

The config in prometheus.yml can be as simple as this.

We’ve turned on Hubble Alerts and Events for gaia-7001.

Instructions for how to use and subscribe are here:

@katernoir Do you plan on updating this for 7004?

@kwunyeung How about the telegram bot for 7004?

You mean the Grafana dashboard? It should work with any network your node is running, as long as the connection to Prometheus doesn’t get changed. So no need to update it :slight_smile: I haven’t tried it in 7004 yet, though.

We have updated it and I’m testing with it now. I receive an absent-validator notification whenever our validator node fails to send a vote for a given height. You can add the bot and subscribe to your validator address to try it.

Where is prometheus.yml?

Thanks for your work on this! I’m trying to configure the dashboard to monitor one of my validators. Could you please provide some guidance on the required settings for the Prometheus data source, shown in the screenshot?

The URL in the HTTP section must be set to the HTTP API URL of the Prometheus server that scrapes your validator, for example http://localhost:9090 if Prometheus runs on the same machine as Grafana.

The Grafana documentation on this topic might be helpful as well.

Can you share your prometheus.yml so that we can understand the setup better?

Hey, try switching the access mode in Grafana’s data source settings from server to browser.

Because I’ve been asked this a lot, here are short step-by-step instructions for setting up Grafana with my dashboard. I hope this works and helps people get started.

  1. Step: Install Grafana (http://docs.grafana.org/installation/debian/) & start it

  2. Step: In .gaiad/config.toml set prometheus=true

  3. Step: Restart gaiad to apply config changes

  4. Step: Download prometheus (https://prometheus.io/docs/introduction/first_steps/), edit prometheus.yml
    Add the following under scrape_configs:

       # COSMOS MONITORING
       # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
       - job_name: 'cosmops'

         # metrics_path defaults to '/metrics'
         # scheme defaults to 'http'.

         static_configs:
         - targets: ['localhost:26660']
           labels:
             group: 'cosmops'
    
  5. Step: Start Prometheus with: ./prometheus --config.file=prometheus.yml

  6. Step: Open Grafana in Browser & Do initial Setup

  7. Step: Under Configuration -> Data Source -> Add a new Data Source

Name: CosmosDataSource
Type: Prometheus
URL: http://localhost:9090
Scrape Interval: 5s
Rest is Default

-> Save&Test should add DataSource

  8. Step: In Grafana go to Dashboard -> Import
  9. Step: Paste 7044 (this is my dashboard template for Grafana), choose “CosmosDataSource” as the data source
  10. Step: You should now have a working Dashboard :slight_smile:
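
If the dashboard stays empty after these steps, it can help to check which link in the chain is down. The sketch below only tests whether the three ports used in this thread accept TCP connections (26660 for the node's metrics, 9090 for Prometheus, 3000 for Grafana). It assumes everything runs on localhost, and `port_open` and `first_unreachable` are hypothetical helper names.

```python
import socket

# (label, host, port) in the order the data flows; hosts are assumptions.
STACK = [
    ("gaiad metrics", "localhost", 26660),
    ("Prometheus", "localhost", 9090),
    ("Grafana", "localhost", 3000),
]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_unreachable(stack=STACK):
    """Return the label of the first service that refuses connections, or None."""
    for name, host, port in stack:
        if not port_open(host, port):
            return name
    return None


if __name__ == "__main__":
    print(first_unreachable() or "all three ports are reachable")
```

A reachable port does not prove the service is configured correctly, but an unreachable one tells you which step to revisit first.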

I’m working on some monitoring and alerting for validators and sentries -

1 - Using Icinga for alerts

2 - Updating a Grafana/Prometheus dashboard

3 - Log analysis

I plan to open source the tools when they’re ready.

For starters, I’m wondering if anyone has done any research into log patterns that indicate missed pre-commits?

Adding feedback from -

@mattharrop If set to do so, gaiad will write every signature in each block to syslog. Just check for your validator’s ID among each block’s signatures; if it’s not there, that’s a miss.

@haasted We’ve created a tool to monitor for pre-votes. https://github.com/validator-network/votewatcher Feedback welcome
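
The check @mattharrop describes can be sketched roughly as follows. The syslog line format is not shown in this thread, so the parsing step is left out. The input here is a hypothetical mapping from block height to the set of validator IDs that signed that block, and the function names are illustrative.

```python
def missed_heights(signers_by_height, validator_id):
    """Return the sorted heights whose signature set lacks validator_id."""
    return sorted(
        height
        for height, signers in signers_by_height.items()
        if validator_id not in signers
    )


def longest_streak(heights):
    """Length of the longest run of consecutive heights in a sorted list."""
    best = run = 0
    prev = None
    for height in heights:
        # Extend the run if this height directly follows the previous one.
        run = run + 1 if prev is not None and height == prev + 1 else 1
        best = max(best, run)
        prev = height
    return best
```

Alerting on the streak length rather than on every single miss is one way to cut down on noise. The threshold to use is a judgment call.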

Our old gaiabot uses a systemd package to monitor the journal. If you run gaiad as a systemd service, its journal can be read this way. You may want to take a look.

However, I don’t much like this approach: checking every line of the process’s journal to decide whether to send an alert uses a lot of resources.

Feedback from @jack https://twitter.com/jack_zampolin/status/1115987603243683841

A Grafana dashboard compatible with all the cosmos-sdk and tendermint based blockchains: https://github.com/zhangyelong/cosmos-dashboard



This post of mine is almost 2 years old. I’m not keeping this up-to-date anymore. Please find some more recent information.