Setting up Telegram alerts for Prometheus

1. Overview.

In this article we set up Prometheus and Alertmanager with Docker Compose and route alerts to Telegram. Prometheus is an open-source monitoring and alerting system, and Alertmanager is the component that groups, deduplicates, and delivers the alerts Prometheus raises. Docker Compose makes both services easy to deploy and manage. Each configuration file is explained line by line so that you understand what is being configured and why.

2. Deployment steps.

Create the data directories.

The following command creates /home/prometheus/data and /home/alertmanager/data, the host directories that the Compose file mounts into the Prometheus and Alertmanager containers.

mkdir -p /home/prometheus/data /home/alertmanager/data

Create the Docker Compose file.

cat > /home/docker-compose.yml << 'OEF'
version: '3'
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - /home/prometheus/data:/etc/prometheus
      - /home/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /home/prometheus/rules.yml:/etc/prometheus/rules.yml
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - /home/alertmanager/data:/etc/alertmanager
      - /home/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /home/alertmanager/template.tmpl:/etc/alertmanager/template.tmpl
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
OEF

This shell snippet writes docker-compose.yml with two services, Prometheus and Alertmanager.

version: '3': The Compose file format version. Recent Docker Compose releases ignore this field.
services: The services to deploy.
- prometheus: The Prometheus service.
  - image: Uses the prom/prometheus Docker image.
  - container_name: Names the container prometheus.
  - ports: Publishes port 9090 so that Prometheus is reachable from the host.
  - volumes: Mounts the host directory and configuration files into the container.
  - command: Starts Prometheus with the configuration file /etc/prometheus/prometheus.yml.
  - networks: Attaches the service to the monitoring network.
- alertmanager: The Alertmanager service.
  - image: Uses the prom/alertmanager Docker image.
  - container_name: Names the container alertmanager.
  - ports: Publishes port 9093 so that Alertmanager is reachable from the host.
  - volumes: Mounts the host directory and configuration files into the container.
  - command: Starts Alertmanager with the configuration file /etc/alertmanager/alertmanager.yml.
  - networks: Attaches the service to the monitoring network.
networks: Defines the monitoring network with the bridge driver.

Create the Prometheus configuration file.

cat > /home/prometheus/prometheus.yml << 'OEF'
global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
      monitor: 'monitoring-system'

rule_files:
  - /etc/prometheus/rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'demo_metric'
    static_configs:
      - targets: ['192.168.100.253:9999']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
OEF

This shell snippet writes prometheus.yml, the Prometheus configuration file.

global: Global settings.
- scrape_interval: How often targets are scraped (15 seconds).
- evaluation_interval: How often rules are evaluated (15 seconds).
- external_labels: Adds the label monitor with the value monitoring-system.
rule_files: The alerting rule files to load.
- /etc/prometheus/rules.yml: Path to the alerting rule file.
scrape_configs: The scrape jobs.
- job_name: 'prometheus': A job that scrapes Prometheus itself.
  - static_configs: A statically defined list of targets.
    - targets: The endpoints to scrape (Prometheus on port 9090).
- job_name: 'demo_metric': A job that scrapes the demo_metric endpoint.
  - static_configs: A statically defined list of targets.
    - targets: The endpoints to scrape (192.168.100.253:9999).
alerting: Alerting settings.
- alertmanagers: The Alertmanager instances that receive alerts.
  - static_configs: A statically defined list of targets.
    - targets: The Alertmanager endpoints (Alertmanager on port 9093).

Create the Alertmanager configuration file.

cat > /home/alertmanager/alertmanager.yml << 'OEF'
global:
  resolve_timeout: 10s

route:
  group_by: ['alertname', 'alias']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'telegram'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'critical'
    equal: ['alertname', 'alias']

receivers:
  - name: 'telegram'
    telegram_configs:
    - bot_token: '5536432897:AAEsq6I7lfiZdO6mRYDX7lWgU5-bamih-MI'
      chat_id: -863816906
      message: '{{ template "telegram.message" . }}'
      send_resolved: true

templates:
  - '/etc/alertmanager/template.tmpl'
OEF

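This shell snippet writes alertmanager.yml, the Alertmanager configuration file.

global: Global settings.
- resolve_timeout: Time after which an alert is declared resolved if it has not been updated and carries no end time (10 seconds).
route: The routing settings.
- group_by: Labels used to group alerts into a single notification (alertname and alias).
- group_wait: How long to wait before sending the first notification for a new group (30 seconds).
- group_interval: How long to wait before sending a notification about changes to a group that has already been notified (5 minutes).
- repeat_interval: How long to wait before re-sending a notification for alerts that are still firing (1 hour).
- receiver: The receiver that handles alerts on this route (telegram).
inhibit_rules: Rules that mute some alerts while others are firing (covered in more detail in the next part).
receivers: The destinations that can receive alerts.
- name: Receiver name (telegram).
  - telegram_configs: Telegram settings.
    - bot_token: The Telegram bot token. Treat it as a secret and replace the sample value with your own.
    - chat_id: ID of the Telegram chat that receives the messages.
    - message: The message template to render.
    - send_resolved: Also send a notification when an alert is resolved (true).
templates: Template files to load.
- /etc/alertmanager/template.tmpl: Path to the template file.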
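The interaction between group_wait, group_interval and repeat_interval is easier to see with numbers. The sketch below is a deliberately simplified model written for this article, not Alertmanager's actual scheduler: it assumes a single alert group, flushes on a fixed tick, and sends only when the group changed or the repeat interval has elapsed. The function name and its parameters are illustrative.

```python
def notification_times(change_times, horizon,
                       group_wait=30, group_interval=300, repeat_interval=3600):
    """Return the seconds at which one alert group would send a notification.

    change_times: seconds at which the group changed (alert added or resolved).
    horizon: stop simulating after this many seconds.
    Defaults mirror the values in alertmanager.yml (30s, 5m, 1h).
    """
    if not change_times:
        return []
    changes = sorted(change_times)
    sent = []
    last_sent = None
    # The first flush of a new group happens group_wait after its first alert.
    tick = changes[0] + group_wait
    while tick <= horizon:
        # Did anything change since the last notification?
        changed = any(c <= tick and (last_sent is None or c > last_sent)
                      for c in changes)
        # Is an unchanged, still-firing group due for a reminder?
        repeat_due = last_sent is not None and tick - last_sent >= repeat_interval
        if changed or repeat_due:
            sent.append(tick)
            last_sent = tick
        # Later flushes happen every group_interval.
        tick += group_interval
    return sent
```

For example, under this model a group whose only alert starts at second 0 notifies at second 30 and again at second 3630. If a second alert joins the group at second 100, the notifications land at seconds 30, 330 and 3930.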

The block below, already part of the configuration above, is intended to keep Alertmanager from re-sending alerts that are still unresolved whenever another alert resolves. It uses Alertmanager's inhibit rules, which suppress notifications for an alert while another matching alert is firing.

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'critical'
    equal: ['alertname', 'alias']

In more detail:

inhibit_rules: Opens the list of inhibition rules.
source_match: The conditions an alert must meet to act as the "source alert", that is, the firing alert that suppresses others.
- severity: 'critical': The source alert must have severity critical.
target_match: The conditions an alert must meet to be a "target alert", that is, the alert whose notifications are suppressed.
- severity: 'critical': The target alert must also have severity critical.
equal: ['alertname', 'alias']: The source and target alerts must have the same alertname and the same alias for the rule to apply.

Note that the source and target conditions are identical here. Alertmanager does not let an alert that matches both sides be inhibited by another alert that also matches both sides, so matchers are usually chosen so that no alert matches both (for example critical as the source and warning as the target).
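To make the source/target roles concrete, here is a minimal sketch of the inhibition check in Python. It is a simplified illustration, not Alertmanager's code: alerts are plain label dictionaries, only exact-match conditions are supported, and the function names are invented for this example.

```python
def matches(labels, matcher):
    """True when every key/value pair in matcher is present in labels."""
    return all(labels.get(key) == value for key, value in matcher.items())


def is_inhibited(target, firing_alerts, rule):
    """Return True when `target` is muted by some firing alert under `rule`.

    rule is a dict with the keys source_match, target_match and equal,
    mirroring one entry of inhibit_rules.
    """
    if not matches(target, rule["target_match"]):
        return False
    target_is_also_source = matches(target, rule["source_match"])
    for source in firing_alerts:
        if not matches(source, rule["source_match"]):
            continue
        # Labels listed in `equal` must agree; a missing label counts as empty.
        if any(source.get(name, "") != target.get(name, "")
               for name in rule["equal"]):
            continue
        # Guard against self-inhibition: an alert matching both sides cannot
        # be muted by another alert that also matches both sides.
        if target_is_also_source and matches(source, rule["target_match"]):
            continue
        return True
    return False
```

With the rule from this article (critical on both sides), two critical InstanceDown alerts never mute each other in this model. With a hypothetical rule that uses critical as the source and warning as the target, a warning alert with the same alertname and alias as a firing critical alert would be muted.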

Create the Telegram message template.

cat > /home/alertmanager/template.tmpl << 'OEF'
{{ define "telegram.message" -}}
{{ range .Alerts -}}
{{ if eq .Status "firing" -}}
---- Alerts Firing: {{ .Labels.alertname }} ----
{{ range $key, $value := .Labels -}}
{{ if ne $key "severity" -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ end -}}
Severity: {{ if eq .Labels.severity "critical" }}critical 🔥{{ else }}{{ .Labels.severity }}{{ end }}
{{ range $key, $value := .Annotations -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ else if eq .Status "resolved" -}}
---- Alerts Resolved: {{ .Labels.alertname }} ----
{{ range $key, $value := .Labels -}}
{{ if ne $key "severity" -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ end -}}
Severity: resolved ✅
{{ range $key, $value := .Annotations -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ end }}
{{ end -}}
{{ end -}}
OEF

Defining the template:

{{ define "telegram.message" -}}: Starts the definition of a template named "telegram.message". Everything inside it makes up the body of the Telegram message.
{{ end -}}: The final {{ end -}} closes the template definition.

Looping over the alerts:

{{ range .Alerts -}}: Starts a loop over every alert in the notification.

Handling firing alerts:

{{ if eq .Status "firing" -}}: Checks whether the current alert is in the "firing" state.
If it is, the message shows the alert details: the alert name, the labels, the severity, and the annotations. The severity label is skipped in the label loop and printed on its own line.
Labels: Additional information that identifies the alert, such as the host name or the service type.
Severity: How serious the alert is (for example critical or warning).
Annotations: Descriptive information such as the likely cause and how to fix the problem.

Handling resolved alerts:

{{ else if eq .Status "resolved" -}}: Checks whether the alert has been resolved.
If it has, a different message is shown, stating that the alert is resolved.
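If you want to preview the message layout without triggering a real alert, the template logic can be mirrored in a few lines of Python. This is only an approximation of what the Go template produces (exact whitespace may differ), and the input dictionary shape is an assumption made for this sketch, not Alertmanager's data model.

```python
def render_alert(alert):
    """Render one alert roughly the way the telegram.message template does.

    alert: dict with the keys status, labels and annotations (assumed shape).
    """
    status = alert["status"]
    labels = alert["labels"]
    annotations = alert.get("annotations", {})

    if status == "firing":
        lines = [f"---- Alerts Firing: {labels['alertname']} ----"]
    elif status == "resolved":
        lines = [f"---- Alerts Resolved: {labels['alertname']} ----"]
    else:
        return ""  # the template prints nothing for other states

    # Labels are listed in sorted key order; severity is handled separately.
    for key in sorted(labels):
        if key != "severity":
            lines.append(f"{key}: {labels[key]}")

    if status == "firing":
        severity = labels.get("severity", "")
        lines.append("Severity: " + ("critical 🔥" if severity == "critical" else severity))
    else:
        lines.append("Severity: resolved ✅")

    for key in sorted(annotations):
        lines.append(f"{key}: {annotations[key]}")
    return "\n".join(lines)
```

Calling render_alert with status "firing", the labels of an InstanceDown alert and a summary annotation yields a block with the header line, one line per label, the severity line and then the annotations.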

Create the alerting rule file for Prometheus.

cat > /home/prometheus/rules.yml << 'OEF'
groups:
  - name: InstanceDown
    rules:
      - alert: InstanceDown
        expr: demo_metric == 0
        for: 10s
        labels:
          severity: critical
          contact: "hoanghd, thienln"
        annotations:
          summary: "Instance {{ $labels.instance }} with alias {{ $labels.alias }} has been down."
OEF

This shell snippet writes rules.yml, the alerting rule file for Prometheus.

groups: The rule groups.
- name: InstanceDown: Name of the group.
  - rules: The rules in the group.
    - alert: InstanceDown: Name of the alert.
    - expr: demo_metric == 0: The PromQL expression that triggers the alert.
    - for: 10s: How long the expression must keep matching before the alert fires (10 seconds).
    - labels: Extra labels attached to the alert.
      - severity: critical: Severity of the alert.
      - contact: "hoanghd, thienln": The people responsible for handling the incident.
    - annotations: Descriptive information attached to the alert, such as the cause and how to fix it.
      - summary: A short description of the alert.
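The effect of the for clause depends on evaluation_interval, because the rule is only checked once per evaluation. The sketch below models the inactive, pending and firing states for one time series. It is a simplified illustration with made-up function and parameter names, assuming evaluations happen exactly every eval_interval seconds.

```python
def alert_states(samples, eval_interval=15, for_seconds=10):
    """Return the alert state after each rule evaluation.

    samples: one boolean per evaluation, True when the expression
             (here: demo_metric == 0) matches at that evaluation.
    """
    states = []
    active_since = None
    for index, matching in enumerate(samples):
        now = index * eval_interval
        if not matching:
            # The condition cleared, so the pending timer resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once the condition has held for at least `for`.
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states
```

With the values from this article (evaluation every 15 seconds, for: 10s), an alert is pending at the first matching evaluation and firing at the next one, so roughly 15 seconds pass between the metric dropping to 0 and the alert being sent to Alertmanager.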

Start the services with Docker Compose.

docker-compose -f /home/docker-compose.yml up -d

This command starts the services defined in docker-compose.yml in detached mode (-d).
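Containers can take a moment to become ready after they start. As a rough way to confirm both services are up, the sketch below polls the /-/healthy endpoint that Prometheus and Alertmanager each expose. The localhost URLs assume you run it on the Docker host with the port mappings above; the helper names are illustrative.

```python
import time
import urllib.error
import urllib.request

# Assumed URLs: the ports published in docker-compose.yml, reached from the host.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def is_healthy(url, timeout=3):
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(endpoints, attempts=10, delay=2.0,
                       check=is_healthy, sleep=time.sleep):
    """Poll until every endpoint is healthy or the attempts run out."""
    pending = dict(endpoints)
    for _ in range(attempts):
        # Keep only the services that are still failing the check.
        pending = {name: url for name, url in pending.items() if not check(url)}
        if not pending:
            return True
        sleep(delay)
    return False
```

Calling wait_until_healthy(ENDPOINTS) returns True once both services respond, or False after the attempts are exhausted.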

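You can also put all of these steps into a single shell script, for example build.sh. Note that the script stops the existing containers and deletes the data directories before recreating everything.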

#!/bin/bash
docker-compose -f /home/docker-compose.yml down
rm -rf /home/prometheus/data /home/alertmanager/data
mkdir -p /home/prometheus/data /home/alertmanager/data

cat > /home/docker-compose.yml << 'OEF'
version: '3'
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - /home/prometheus/data:/etc/prometheus
      - /home/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /home/prometheus/rules.yml:/etc/prometheus/rules.yml
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - /home/alertmanager/data:/etc/alertmanager
      - /home/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - /home/alertmanager/template.tmpl:/etc/alertmanager/template.tmpl
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge
OEF

cat > /home/prometheus/prometheus.yml << 'OEF'
global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
      monitor: 'monitoring-system'

rule_files:
  - /etc/prometheus/rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'demo_metric'
    static_configs:
      - targets: ['192.168.100.253:9999']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
OEF

cat > /home/alertmanager/alertmanager.yml << 'OEF'
global:
  resolve_timeout: 10s

route:
  group_by: ['alertname', 'alias']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'telegram'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'critical'
    equal: ['alertname', 'alias']

receivers:
  - name: 'telegram'
    telegram_configs:
    - bot_token: '5536432897:AAEsq6I7lfiZdO6mRYDX7lWgU5-bamih-MI'
      chat_id: -863816906
      message: '{{ template "telegram.message" . }}'
      send_resolved: true

templates:
  - '/etc/alertmanager/template.tmpl'
OEF

cat > /home/alertmanager/template.tmpl << 'OEF'
{{ define "telegram.message" -}}
{{ range .Alerts -}}
{{ if eq .Status "firing" -}}
---- Alerts Firing: {{ .Labels.alertname }} ----
{{ range $key, $value := .Labels -}}
{{ if ne $key "severity" -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ end -}}
Severity: {{ if eq .Labels.severity "critical" }}critical 🔥{{ else }}{{ .Labels.severity }}{{ end }}
{{ range $key, $value := .Annotations -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ else if eq .Status "resolved" -}}
---- Alerts Resolved: {{ .Labels.alertname }} ----
{{ range $key, $value := .Labels -}}
{{ if ne $key "severity" -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ end -}}
Severity: resolved ✅
{{ range $key, $value := .Annotations -}}
{{ $key }}: {{ $value }}
{{ end -}}
{{ end }}
{{ end -}}
{{ end -}}
OEF

cat > /home/prometheus/rules.yml << 'OEF'
groups:
  - name: InstanceDown
    rules:
      - alert: InstanceDown
        expr: demo_metric == 0
        for: 10s
        labels:
          severity: critical
          contact: "hoanghd, thienln"
        annotations:
          summary: "Instance {{ $labels.instance }} with alias {{ $labels.alias }} has been down."
OEF

docker-compose -f /home/docker-compose.yml up -d

Then run the script for a quick deployment.

shell> bash /home/build.sh
[+] Running 3/3
 ⠿ Container prometheus     Removed                                                                        0.4s
 ⠿ Container alertmanager   Removed                                                                        0.3s
 ⠿ Network home_monitoring  Removed                                                                        0.3s
[+] Running 3/3
 ⠿ Network home_monitoring  Created                                                                        0.1s
 ⠿ Container prometheus     Started                                                                        0.3s
 ⠿ Container alertmanager   Started                                                                        0.4s

And here is the result.

shell> docker ps
CONTAINER ID   IMAGE               COMMAND                  CREATED             STATUS             PORTS                                       NAMES
31c526f8536d   prom/prometheus     "/bin/prometheus --c…"   About an hour ago   Up About an hour   0.0.0.0:9090->9090/tcp, :::9090->9090/tcp   prometheus
cd20bef343cd   prom/alertmanager   "/bin/alertmanager -…"   About an hour ago   Up About an hour   0.0.0.0:9093->9093/tcp, :::9093->9093/tcp   alertmanager

3. Testing.

I have a small Python script that generates test metrics, and I run it as shown below.

shell> python3 cli.py -d 1 0 1 0
-> Set values = ['1', '0', '1', '0'], finished 1 times in 0.05031442642211914 seconds
-> Set values = ['1', '0', '1', '0'], finished 2 times in 5.055529832839966 seconds
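The source of cli.py is not shown in this article. If you need something similar to test against, the sketch below is a hypothetical stand-in, not the author's script: it serves a demo_metric gauge per value in the Prometheus text format on port 9999, using only the standard library. The alias naming (ceph-exporter-0, ceph-exporter-1, ...) is an assumption inferred from the alert messages shown later.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values, alias_prefix="ceph-exporter"):
    """Build the text exposition body: one demo_metric sample per value."""
    lines = [
        "# HELP demo_metric Demo value used to exercise the alert rule.",
        "# TYPE demo_metric gauge",
    ]
    for index, value in enumerate(values):
        # Assumed labelling scheme: alias is the prefix plus a zero-based index.
        lines.append(f'demo_metric{{alias="{alias_prefix}-{index}"}} {float(value)}')
    return "\n".join(lines) + "\n"


def make_handler(values):
    """Create a request handler that serves the given values on /metrics."""
    body = render_metrics(values).encode("utf-8")

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(values, port=9999):
    """Serve the metrics until interrupted (blocks the calling thread)."""
    HTTPServer(("0.0.0.0", port), make_handler(values)).serve_forever()
```

Calling serve(["1", "0", "1", "0"]) on the host 192.168.100.253 would give the demo_metric job something to scrape, with the second and fourth series at 0.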

Check the test endpoint.

And these are the test metrics.

An alert is raised when a value is 0.

And I receive the alert message.

Below is a text copy that is easier to read.

---- Alerts Firing: InstanceDown ----
alertname: InstanceDown
alias: ceph-exporter-3
contact: hoanghd, thienln
instance: 192.168.100.253:9999
job: demo_metric
monitor: monitoring-system
Severity: critical 🔥
summary: Instance 192.168.100.253:9999 with alias ceph-exporter-3 has been down.

---- Alerts Firing: InstanceDown ----
alertname: InstanceDown
alias: ceph-exporter-1
contact: hoanghd, thienln
instance: 192.168.100.253:9999
job: demo_metric
monitor: monitoring-system
Severity: critical 🔥
summary: Instance 192.168.100.253:9999 with alias ceph-exporter-1 has been down.

To test the resolved case, I change the last metric from 0 to 1.

shell> python3 cli.py -d 1 0 1 1
-> Set values = ['1', '0', '1', '1'], finished 1 times in 0.0021190643310546875 seconds

Now only one metric still has the value 0.

At this point only one alert remains.
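The remaining alerts can also be checked from a script through the Prometheus HTTP API, which lists active alerts at /api/v1/alerts. The sketch below is a minimal example; the base URL assumes the port mapping used in this article, and the helper names are illustrative.

```python
import json
import urllib.request


def firing_alerts(payload):
    """Extract the label sets of firing alerts from an /api/v1/alerts response."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    alerts = payload.get("data", {}).get("alerts", [])
    # Pending alerts are listed too, so filter on the state field.
    return [alert["labels"] for alert in alerts if alert.get("state") == "firing"]


def fetch_alerts(base_url="http://localhost:9090", timeout=5):
    """Download and decode the alert list (assumed base URL)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as response:
        return json.load(response)
```

For example, firing_alerts(fetch_alerts()) returns one label dictionary per firing alert, so after the change above you would expect a single entry.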

And the resolved message for the fourth metric is sent to you.

Here is a text copy that is easier to read.

---- Alerts Resolved: InstanceDown ----
alertname: InstanceDown
alias: ceph-exporter-3
contact: hoanghd, thienln
instance: 192.168.100.253:9999
job: demo_metric
monitor: monitoring-system
Severity: resolved ✅
summary: Instance 192.168.100.253:9999 with alias ceph-exporter-3 has been down.

4. Conclusion.

In this article we set up Prometheus and Alertmanager with Docker Compose. We created and configured the required files: docker-compose.yml, prometheus.yml, alertmanager.yml, template.tmpl and rules.yml. By walking through each part of the build.sh script, we gained a clearer picture of how Prometheus and Alertmanager are configured and deployed. I hope this article helps you deploy your own monitoring and alerting system.
