# Alert On Metrics (AOM)
This is the new repository for the Alert On Metrics project configurations.

The Alert On Metrics (AOM) project lets you set up alerts that trigger based on tracking a metric value collected via Metrics as a Service. You "track" your metric via a KairosDB or Prometheus query, so you are not limited to raw metrics: you can sample with the aggregators available in KairosDB to create new metric views, or use PromQL if you are using Prometheus. Typically people use min, max or count. All "tracked" metrics are rewritten to the metrics data store as a new metric, telgraf.aom_stats_value, tagged by Alert-On-Metrics to show their origin.
You can trigger an alert based on any combination of the following (a hypothetical configuration sketch follows the list):
- An upper critical threshold based on the value of a metric increasing
- An upper warning threshold based on the value of a metric increasing
- A lower critical threshold based on the value of a metric decreasing
- A lower warning threshold based on the value of a metric decreasing
- Combine any lower and upper threshold to create a 'band'
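
As a rough illustration, the thresholds portion of an alert configuration might look like the sketch below. The field names (upper_critical, lower_warning, etc.) are placeholders invented for this example, not the authoritative schema; check the files under alert_configs/ for the exact format used in this repository.

```yaml
# Hypothetical threshold block -- field names are illustrative only.
# Combining a lower and an upper threshold creates a 'band': the alert
# fires when the metric value leaves the band.
thresholds:
  upper_critical: 95   # critical when the value rises above 95
  upper_warning: 85    # warning when the value rises above 85
  lower_warning: 10    # warning when the value drops below 10
  lower_critical: 5    # critical when the value drops below 5
```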

## Sensu and alert subdue (NEW!)
Some changes have been introduced in the latest AOM versions. Alerts can now be sent through Sensu (email is not supported yet). Using Sensu also lets you create check dependencies (vo is now victorops for Sensu). Example:

```yaml
alerts:
  sensu:
    victorops:
      'blackhole'
    slack:
      '#aom_test_channel'
    dependencies:
      - name_of_check1
      - name_of_check2
```
The filters option has also been enabled. It works the same way as in Hiera. If you only want to receive critical alerts through one channel, you can set "channel"_subdue to true for it. Example:

```yaml
filters:
  slack_subdue: true
  victorops_subdue: false
```
You can make use of anything that the Sensu API supports: anything you add to your configuration under sensu will be sent directly to the Sensu API.
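
For instance (a hypothetical sketch; which attributes take effect depends on your Sensu version and handler setup), you could pass standard Sensu check attributes such as occurrences or refresh alongside the routing keys:

```yaml
# Hypothetical example: extra keys under 'sensu' are passed straight
# through to the Sensu API. Whether they are honored depends on your
# Sensu version and handler configuration.
alerts:
  sensu:
    slack:
      '#aom_test_channel'
    occurrences: 3   # standard Sensu check attribute, passed through unchanged
    refresh: 1800    # re-notification interval in seconds, also passed through
```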

## Availability metric
If you want to track how long your check stays in a CRITICAL state over a given period of time, you can enable this feature by setting this option to true:

```yaml
availability: true
```
This will start sending metrics constantly and recording the check output. You can then visualize this metric in the following [dashboard](https://grafana.eng.qops.net/d/5OsrZSdiz/aom-availability?orgId=1) (or you can create your own). To get a more accurate result, don't set the refresh interval lower than 60 seconds.

## Routing per tag value (NEW!)
This feature allows you to configure different alert routing based on the values of tags in your metric. For instance, let's say you want a different alert policy for beta, gamma and prod:

- beta: I want to alert my #my-project-dev channel
- gamma: I want to alert my #my-project-gamma channel
- prod: I want to alert my #my-project channel and page the on-call on VictorOps
We can use the dc tag available in the metric query, define specific configurations for beta and gamma, and use a default one for all other values (prod in this case). Everything is configured inside the alerts object of the YAML configuration. Instead of adding the alert configuration directly, add a lookup key. Inside it, you have to provide three values:

- default: the alert policy to apply by default if we can't find a configuration for a specific combination of tags. The format is exactly the same as for classic alerts (sensu, vo, slack, etc.).
- tags: the tags that will be used to look up the alert routing configuration. You can use more than one tag.
- lookups: an array where each element specifies a combination of tag values and the routing to apply in that case.
Here is the configuration of our example:

```yaml
alerts:
  lookup:
    default:
      sensu:
        slack: my-project
        victorops: my-on-call-key
    tags:
      - dc
    lookups:
      -
        alert:
          sensu:
            slack: my-project-dev
        tags:
          dc: b1-prv
      -
        alert:
          sensu:
            slack: my-project-gamma
        tags:
          dc: g1-iad
```
You can move the lookups part into a separate file so it can be reused across different AOM configurations. To do that, instead of a lookups key, provide a lookup_file key with the filename, including the extension:

```yaml
alerts:
  lookup:
    default: ...
    lookup_file: my_lookup_file.yaml
    tags: ...
```
Save this file under the alert_routing_lookup folder. The syntax for the alert routing is the same as before; it is just in a different file:

```yaml
---
-
  alert:
    sensu:
      slack: my-project-dev
  tags:
    dc: b1-prv
-
  alert:
    sensu:
      slack: my-project-gamma
  tags:
    dc: g1-iad
```

## How do I register a new alert with AOM?
Alert configurations for AOM are just a KairosDB or Prometheus query specified in YAML, wrapped in some controlling configuration that determines how frequently the query is executed, the thresholds, the occurrences, and where to route the alerts. We have built a small UI, packaged with the AOM GitLab project, that will help you generate a suitable YAML configuration. You can rehearse your queries on the [KairosDB UI](http://kairosdb-metrics.service.eng.consul:8080/) or at any Prometheus endpoint, and take a look at other examples in the alert_configs/ folder for help.
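
To give a feel for the overall shape, here is a hypothetical skeleton of such a configuration. The field names outside of alerts (id, interval, occurrences, query, thresholds) are illustrative guesses rather than the authoritative schema; use the generator UI described below and the existing files in alert_configs/ as the real reference.

```yaml
# Hypothetical skeleton of an AOM alert configuration -- field names are
# illustrative only. Copy a real file from alert_configs/ as your template.
id: my_service_error_rate      # the id field must be unique across alert_configs/
interval: 60                   # how often the query runs, in seconds
occurrences: 3                 # consecutive breaches required before alerting
query:                         # the KairosDB (or Prometheus) query, in YAML
  metrics:
    - name: my_service.errors
      aggregators:
        - name: max
          sampling:
            value: 1
            unit: minutes
  start_relative:
    value: 10
    unit: minutes
thresholds:
  upper_critical: 100          # same placeholder naming as the earlier sketch
alerts:
  sensu:
    slack:
      '#aom_test_channel'
```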
Follow the instructions below to launch the YAML generator UI on your local desktop and use it to generate a merge request (Docker is required).
- Clone the project
- cd into the project's directory
- Run the script ./generate_config.sh
- Once up, navigate in a browser to localhost:80/
- Fill out the form and click generate
- Hit Ctrl+C when you have the alert configuration
- Submit the merge request in a new branch
This process starts a local webserver that provides a convenient interface for generating the YAML you need. Most of the fields have helpful info tips explaining what each value is and how it is used.

## Visualization tool [BETA]
Along with the project, a simple Python script is provided to show what your metrics will look like and to help you set the thresholds. This tool requires Python 3 and some additional Python 3 modules:
- yaml
- json
- requests
- numpy
- matplotlib
These modules should be easy to install using 'pip' or 'homebrew'.
Usage:

```
python3 show_config.py [X] alertname_without_yaml_extension
```
Where X is an optional parameter that defines the interval length you want to display. It is a multiplier factor, set to 10 by default, that increases the start_relative window (so you will see more datapoints).

The script should open a window showing the metrics along with the defined thresholds. If the query doesn't return any values, it will exit.

## How does my new alert get to production?
Once you submit a merge request, a Jenkins job will quickly validate your alert files, checking only that they contain all required fields and proper syntax. Setting appropriate thresholds and alerting channels (VictorOps, email, Slack) is the user's responsibility.

If Jenkins returns a PASS result for the test, the new alert files will be merged into the master branch and a deploy job will be triggered (also from Jenkins). The AOM service actively looks for changes in the alert_configs folder and picks up any changes (by default every 300 seconds).

## Helpful Tidbits
IMPORTANT: The alert id field must be unique; it can be useful to run grep within the alert_configs directory to make sure the id is not already defined.
Use the UI on the KairosDB box to help you generate / determine the proper query. Remember, you want to get the query down to just one or two entries per group-by so that the service can quickly iterate over it.
Once the request has been merged, you can check whether your query is getting processed by hitting the URL.

You can also check out the Grafana dashboard that has the results of this service's queries and verify that your alert metric is showing up regularly.
From the KairosDB docs: you must specify either start_absolute or start_relative, but not both. Similarly, you may specify either end_absolute or end_relative, but not both. If no end time is specified, the current date and time is assumed. We suggest using end_relative (greater than 1 minute) as this makes for steadier graphs (if you draw a graph up to now, some of the latest metrics may still be missing, so the end of the graph will be lower than it should be).
We do not recommend setting align_sampling or align_start_time (both false by default, so they can be skipped), as they might change the alignment of metrics and change graphs over time (if more than one is set, unexpected results will occur).
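
As a sketch of the two points above, here is the relative time-range portion of a KairosDB query expressed in YAML (the surrounding AOM fields are omitted; the start_relative/end_relative structure follows the KairosDB query API):

```yaml
# Relative time range for a KairosDB query, in YAML.
# An end_relative of at least 1 minute avoids the incomplete latest
# datapoints you get when drawing the graph all the way up to "now".
start_relative:
  value: 10
  unit: minutes
end_relative:
  value: 1
  unit: minutes
# align_sampling / align_start_time are omitted (both default to false).
```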
If you have any doubts about KairosDB query metrics, you can take a look at their documentation.

## The Gotchas
- Alerts only fire when KairosDB returns a result. If your KairosDB metric query returns no results for X (currently 10) attempts, any active alerts will clear with a message explaining that AOM could not get any further results from KairosDB, so the user must manually verify the RECOVERY. Earlier versions of AOM had no flap protection like this built in. Long term we will move alerting to Sensu, which has more advanced built-in flap protection. You can reduce flapping of results by building your Kairos query well. Please talk to Engineering Visibility for help with this.
- Metrics are only collected every 60 seconds, so an interval set below that will automatically be bumped up to 60 seconds by the web-based config generator. Match the interval to how often the metric is collected and measured.
- The Email field only requires a list of names, not the @qualtrics part, as it will only send to Qualtrics addresses using the internal-smtp1-app.eng.qops.net box.
- Email and Slack alerts fire once during an event. This way, if an outage is occurring, you won't get flooded with emails and Slack alerts the entire time.
- Email and Slack alerts can be helpful to share with the team so they are aware of what is happening.
- Email and Slack alerts can be helpful for working out your alerts before routing them to VictorOps.