Alertmanager: Notifications and Routing

Prometheus detects a problem, but who gets told? Alertmanager receives alerts from Prometheus and routes them to the right channels. This guide walks through the whole path: create an alert → route it → receive 1 notification (not 50) → mute it during maintenance.

The full chain (mental model)

Before configuring anything, understand what happens:

1. Prometheus evaluates a PromQL rule → detects a problem → produces an alert
2. The alert carries labels (used for routing) and annotations (used for display)
3. Alertmanager receives the alert and applies 4 mechanisms:
   - Routing: where to send it (Slack, PagerDuty, email…)
   - Grouping: anti-spam (50 pods down = 1 notification)
   - Inhibition: suppresses redundant alerts (cluster down → no per-pod alerts)
   - Silence: deliberate mute during maintenance
4. The receiver notifies via Slack, email, PagerDuty, Teams, webhook…

Figure: Alertmanager architecture. Prometheus sends alerts to Alertmanager, which routes them to Slack, PagerDuty, and email.

Labels vs annotations: the key distinction

Type	Role	Examples	Used for
Labels	Identify and route	`severity`, `team`, `env`, `service`	Routing, grouping, inhibition
Annotations	Explain and document	`summary`, `description`, `runbook_url`	Notification content

Your first complete alert (30 minutes)

Follow these steps to send your first notification.

Step 1: Create an actionable alert

A good alert carries both routing labels and useful annotations.

groups:
  - name: app
    rules:
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
          env: production
          # `service` is not set here: each alert inherits it from the `by (service)` aggregation
        annotations:
          summary: "5xx error rate > 5% ({{ $labels.service }})"
          description: "5xx error rate elevated for 5 minutes. Check logs and traces."
          runbook_url: "https://wiki.example.com/runbooks/api-high-error-rate"
          dashboard_url: "https://grafana.example.com/d/api-errors"

Step 2: Point Prometheus at Alertmanager

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"

Step 3: Minimal Alertmanager configuration

global:
  resolve_timeout: 5m

route:
  receiver: slack-default
  group_by: [alertname, service, env]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        # Literal block (`|-`) so that Slack shows one field per line
        text: |-
          *Service:* {{ .CommonLabels.service }}
          *Env:* {{ .CommonLabels.env }}
          *Severity:* {{ .CommonLabels.severity }}
          *Summary:* {{ .CommonAnnotations.summary }}
          *Runbook:* {{ .CommonAnnotations.runbook_url }}

Step 4: Validate before deploying

Never deploy without validating; otherwise you end up debugging blind.

# Prometheus side: validate config and rules
promtool check config prometheus.yml
promtool check rules rules/*.yml

# Alertmanager side: validate config
amtool check-config alertmanager.yml

Step 5: Verify

1. Trigger the alert: inject 5xx errors or wait for a real incident
2. Check in Prometheus: http://prometheus:9090/alerts (the alert stays pending for the `for: 5m` window, then turns firing)
3. Check in Alertmanager: http://alertmanager:9093
4. Receive the Slack notification with the details and the runbook link
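Waiting for a real incident is slow, so a synthetic alert can be used to exercise the Alertmanager → Slack leg on its own. The sketch below builds a payload for Alertmanager's `POST /api/v2/alerts` endpoint. The alert name `SyntheticTest`, the helper names, and the base URL (taken from the step above) are assumptions for illustration. It bypasses Prometheus entirely, so it checks routing and receivers, not the rule itself.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(labels, annotations, minutes=5):
    """Return a one-element alert list in the shape /api/v2/alerts accepts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": labels,
        "annotations": annotations,
        "startsAt": now.isoformat(),
        # Alertmanager treats the alert as resolved once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url="http://alertmanager:9093"):
    """POST the alerts; raises urllib.error.URLError if the server is unreachable."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


# Hypothetical test alert reusing the label set from Step 1.
alerts = build_test_alert(
    labels={"alertname": "SyntheticTest", "severity": "critical",
            "team": "backend", "env": "production", "service": "api"},
    annotations={"summary": "Synthetic alert to test routing"},
)
print(json.dumps(alerts, indent=2))
# send_alerts(alerts)  # uncomment once an Alertmanager is reachable
```

Because the labels decide the route, changing `severity` or `env` here is a quick way to check that each receiver is wired correctly.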

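Routing: the decision tree

Routing is a tree: the alert enters at the root and is tested against the child routes in order. It stops at the first child that matches, unless that route sets continue: true, in which case the following siblings are evaluated too. If no child matches, the alert goes to the parent's receiver.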
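The traversal can be modelled in a few lines. This is a simplified sketch, not Alertmanager's code: matchers are reduced to plain equality on a dict, every node carries its own receiver, and the tree below is a hypothetical example. It shows why a `continue: true` route only has an effect when it sits before the route that would otherwise end the walk.

```python
def matches(route, labels):
    """A route matches when every one of its equality matchers holds."""
    return all(labels.get(k) == v for k, v in route.get("matchers", {}).items())


def resolve(route, labels):
    """Return the receivers an alert reaches, walking the children in order."""
    receivers = []
    for child in route.get("routes", []):
        if matches(child, labels):
            receivers += resolve(child, labels)
            if not child.get("continue", False):
                break  # first match wins unless continue is set
    # No child matched: the alert stays on the current node.
    return receivers or [route["receiver"]]


tree = {
    "receiver": "slack-default",
    "routes": [
        {"matchers": {"team": "backend"}, "receiver": "slack-backend",
         "continue": True},
        {"matchers": {"env": "production", "severity": "critical"},
         "receiver": "pagerduty-prod"},
        {"matchers": {"env": "staging"}, "receiver": "slack-staging"},
    ],
}

print(resolve(tree, {"team": "backend", "env": "production", "severity": "critical"}))
# ['slack-backend', 'pagerduty-prod']
print(resolve(tree, {"team": "data", "env": "dev"}))
# ['slack-default']
```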

Realistic routes (prod/staging/teams)

route:
  receiver: slack-default
  group_by: [alertname, service, env]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Team backend → extra copy to the backend channel.
    # Listed first with continue: true so that matching carries on to the routes below.
    - matchers:
        - team="backend"
      receiver: slack-backend
      continue: true

    # Prod + critical → PagerDuty (on-call)
    - matchers:
        - env="production"
        - severity="critical"
      receiver: pagerduty-prod

    # Prod + warning → Slack prod
    - matchers:
        - env="production"
        - severity="warning"
      receiver: slack-prod

    # Staging → Slack staging
    - matchers:
        - env="staging"
      receiver: slack-staging

receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"

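  - name: pagerduty-prod
    pagerduty_configs:
      # Placeholder: Alertmanager does not expand environment variables in its
      # config, so inject the real key at deploy time (or use routing_key_file).
      - routing_key: "$PAGERDUTY_ROUTING_KEY"
        send_resolved: true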

  - name: slack-prod
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts-prod"

  - name: slack-staging
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts-staging"

  - name: slack-backend
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#backend-alerts"

Testing the routing

# Simulate an alert and see which receiver it would reach
amtool config routes test --config.file=alertmanager.yml \
  severity=critical env=production team=backend

Grouping: the anti-spam mechanism

Grouping answers one question: how many notifications do you get when 50 pods go down?

Parameter	Role	Typical value
`group_by`	Labels that define "one incident"	`[alertname, service, env]`
`group_wait`	Initial wait to collect alerts into a group	30s
`group_interval`	Minimum wait before notifying about new alerts added to the group	5m
`repeat_interval`	Reminder if still firing	4h

Rule of thumb:

- If you group by instance, you will spam (1 notification per pod)
- If you group by service + env, you get "1 incident per service"

route:
  group_by: [alertname, service, env]  # 1 incident = 1 service in 1 env
  group_wait: 30s      # Wait 30s to collect alerts
  group_interval: 5m   # Send updates at most every 5m
  repeat_interval: 4h  # Remind every 4h while still firing
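To see why the choice of `group_by` matters, the sketch below buckets alerts by the values of the grouping labels, which is the idea behind Alertmanager's grouping. It only models the grouping key, not the timers. The `PodDown` alert name and the pod names are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing label -> '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return groups


# 50 pods of the same service failing at once.
alerts = [
    {"alertname": "PodDown", "service": "api", "env": "production",
     "instance": f"pod-{i}"}
    for i in range(50)
]

by_service = group_alerts(alerts, ["alertname", "service", "env"])
by_instance = group_alerts(alerts, ["alertname", "instance"])
print(len(by_service))   # 1 group, so 1 notification
print(len(by_instance))  # 50 groups, so 50 notifications
```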

Silences vs inhibition: two different tools

Silences: deliberate mute (maintenance)

When	Example
Planned maintenance	"Upgrading MySQL, mute DB alerts for 2h"
Known incident	"We know the API is slow, we're working on it"

# Via amtool (the server comes from --alertmanager.url or the amtool config file)
amtool silence add alertname=HighCPU instance=web-01 \
  --comment="Planned maintenance JIRA-1234" \
  --duration=2h

# Via the API (v2 expects a JSON content type)
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
  "matchers": [
    {"name": "alertname", "value": "HighCPU", "isRegex": false},
    {"name": "instance", "value": "web-01", "isRegex": false}
  ],
  "startsAt": "2026-02-09T10:00:00Z",
  "endsAt": "2026-02-09T12:00:00Z",
  "createdBy": "admin",
  "comment": "Planned maintenance JIRA-1234"
}'
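Hard-coded timestamps are easy to get wrong, so the API payload is often generated instead. The following sketch builds the same JSON body as the curl call from a duration in hours. The helper name `build_silence` and its parameters are hypothetical, and only the fields shown in the curl example are produced.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, created_by, comment, start=None):
    """Build a silence body for POST /api/v2/silences (equality matchers only)."""
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # UTC timestamps, as in the curl example
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }


silence = build_silence(
    {"alertname": "HighCPU", "instance": "web-01"},
    hours=2,
    created_by="admin",
    comment="Planned maintenance JIRA-1234",
    start=datetime(2026, 2, 9, 10, 0, tzinfo=timezone.utc),
)
print(json.dumps(silence, indent=2))
```

With the fixed `start` above, `endsAt` comes out as `2026-02-09T12:00:00Z`, matching the curl example. Omitting `start` makes the silence begin now.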

Inhibition: automatic suppression (noise reduction)

When	Example
Global incident	"ClusterDown → no point alerting on every pod"
Severity hierarchy	"Critical firing → suppress the warnings"

inhibit_rules:
  # If ClusterDown is firing, suppress the Pod* alerts
  - source_matchers:
      - alertname="ClusterDown"
      - env="production"
    target_matchers:
      - alertname=~"Pod.*"
      - env="production"
    equal: [cluster]  # Same cluster only

  # If a critical alert is active, suppress warnings for the same service
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [alertname, service]

Label design (recommended standards)

Label	Values	Role
`severity`	critical, warning, info	Priority (routing + on-call)
`team`	backend, frontend, sre, data	Ownership (routing)
`env`	production, staging, dev	Environment (routing)
`service`	api, auth, payments, db	Product breakdown (grouping)
`cluster` / `region`	eu-west-1, us-east-1	Multi-cluster operations
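A convention like this is only useful if it is enforced, for example in CI before rules are merged. The sketch below lints the labels of a rule against the table. The `REQUIRED` mapping and the `lint_labels` helper are hypothetical, and the rule labels are passed as plain dicts, so loading the YAML files is left out.

```python
# Allowed values per required label; None means "any non-empty value".
REQUIRED = {
    "severity": {"critical", "warning", "info"},
    "env": {"production", "staging", "dev"},
    "team": None,
}


def lint_labels(rule_name, labels):
    """Return a list of human-readable problems (empty when the rule is fine)."""
    problems = []
    for name, allowed in REQUIRED.items():
        value = labels.get(name)
        if not value:
            problems.append(f"{rule_name}: missing label '{name}'")
        elif allowed is not None and value not in allowed:
            problems.append(
                f"{rule_name}: {name}='{value}' not in {sorted(allowed)}"
            )
    return problems


good = lint_labels("ApiHighErrorRate",
                   {"severity": "critical", "team": "backend", "env": "production"})
bad = lint_labels("DiskFull", {"severity": "crit", "env": "prod"})
print(good)  # []
for line in bad:
    print(line)
```

A typo such as `severity: crit` would otherwise fall through every route silently and land on the default receiver.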

Turning an alert into a runbook

An alert without an action is noise. Always include:

annotations:
  summary: "One-line summary ({{ $labels.service }})"
  description: "Context: what to check first"
  runbook_url: "https://wiki.example.com/runbooks/alert-name"
  dashboard_url: "https://grafana.example.com/d/service-dashboard"
  logs_url: "https://loki.example.com/explore?query={service=\"{{ $labels.service }}\"}"

Slack template with links:

text: |-
  *Service:* {{ .CommonLabels.service }}
  *Summary:* {{ .CommonAnnotations.summary }}
  *Runbook:* {{ .CommonAnnotations.runbook_url }}
  *Dashboard:* {{ .CommonAnnotations.dashboard_url }}

Installation

Download Alertmanager v0.31.0

cd /tmp
VERSION="0.31.0"
wget https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/alertmanager-${VERSION}.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/sha256sums.txt
grep "alertmanager-${VERSION}.linux-amd64.tar.gz" sha256sums.txt | sha256sum -c -

Install

tar xvfz alertmanager-${VERSION}.linux-amd64.tar.gz
sudo cp alertmanager-${VERSION}.linux-amd64/{alertmanager,amtool} /usr/local/bin/
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

Create the systemd service (hardened), for example in /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=127.0.0.1:9093
ExecReload=/bin/kill -HUP $MAINPID
Restart=always

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target

Start

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

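Alertmanager listens on 127.0.0.1:9093 only, because the unit passes --web.listen-address=127.0.0.1:9093 (without that flag it binds to all interfaces).

With Docker (here -p 9093:9093 publishes the port on every host interface):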

docker run -d \
  --name alertmanager \
  -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
  -v alertmanager-data:/alertmanager \
  prom/alertmanager:v0.31.0

With kube-prometheus-stack, Alertmanager is included; configure it through the Helm values:

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: 'slack-default'
      group_by: [alertname, service, env]
      routes:
        - matchers:
            - severity="critical"
          receiver: 'pagerduty'
    receivers:
      - name: 'slack-default'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/...'
            channel: '#alerts'
      - name: 'pagerduty'
        pagerduty_configs:
          # Placeholder: not expanded by Alertmanager; inject the real key at deploy time
          - routing_key: '$PAGERDUTY_KEY'

Native receivers

Alertmanager natively supports:

Receiver	Typical use
Slack	Team notifications
Email	Backup / management
PagerDuty	On-call, escalation
Opsgenie	On-call
Microsoft Teams	Microsoft-centric organizations
Webhook	Custom integrations (CMDB, ticketing)

See the official Alertmanager documentation for the full list of integrations.

Recommended strategy:

1. Slack + email: start simple
2. Add PagerDuty/Opsgenie: for on-call
3. Add a webhook: to integrate your ticketing (Jira, ServiceNow)

Templates

Available variables

Variable	Description
`.Status`	"firing" or "resolved"
`.Alerts`	List of alerts
`.CommonLabels`	Labels common to all alerts in the group
`.CommonAnnotations`	Annotations common to all alerts in the group
`.ExternalURL`	Alertmanager URL

For each alert in .Alerts:

Variable	Description
`.Labels`	All labels of the alert
`.Annotations`	All annotations
`.StartsAt`	Start timestamp
`.EndsAt`	End timestamp (if resolved)

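Reusable external template

Declare the template files in alertmanager.yml, define named templates in a .tmpl file, then reference them from a receiver, for example title: '{{ template "slack.title" . }}'.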

templates:
  - '/etc/alertmanager/templates/*.tmpl'

{{ define "slack.title" }}
[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Instance:* {{ .Labels.instance }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
*Runbook:* {{ .Annotations.runbook_url }}
---
{{ end }}
{{ end }}

High availability

Alertmanager supports clustering:

# Instance 1
alertmanager --cluster.listen-address=0.0.0.0:9094 \
             --cluster.peer=alertmanager-2:9094

# Instance 2
alertmanager --cluster.listen-address=0.0.0.0:9094 \
             --cluster.peer=alertmanager-1:9094

Prometheus must point at every instance (not at a load balancer in front of them):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1:9093'
            - 'alertmanager-2:9093'

Production checklist

Before going to production, check:

- Routes: prod/staging/dev have distinct receivers
- Grouping: group_by avoids spam (not by instance)
- Inhibition: a global incident suppresses the child alerts
- Documented silences: who may create them, with what comment
- Templates: include runbook_url and dashboard_url
- Tests: promtool check rules + amtool check-config before every reload
- HA: clustering if alerting is critical
- Security: bind to localhost or put an authenticating reverse proxy in front

Troubleshooting

Symptom	Likely cause	Fix
No notification	Misconfigured receiver	`amtool config routes test`, then check the logs
Duplicate notifications	Several instances running without clustering	Configure HA
Too many alerts	group_by too fine-grained	Group by service, not instance
Silence has no effect	Wrong matchers	Check the exact labels
Alert is not routed as expected	Wrong route matchers	`amtool config routes test`

Useful commands:

# Validate the config
amtool check-config alertmanager.yml

# Simulate the routing
amtool config routes test --config.file=alertmanager.yml severity=critical env=production

# List active silences
amtool silence query

# Logs
journalctl -u alertmanager -f

Key takeaways

- Labels = routing (who must act), annotations = content (what to do)
- Routing = decision tree; use matchers (not the deprecated match)
- Grouping = anti-spam; group by service, not by instance
- Silence = deliberate mute (maintenance)
- Inhibition = automatic suppression (cluster down)
- Always validate: amtool check-config before deploying

Next steps

- Exporters: the 3 families and the multi-target pattern
- Kubernetes: kube-prometheus-stack and ServiceMonitors
- Configuration: recording rules and advanced relabeling