ZeroLoom

Installing Prometheus and Grafana on Ubuntu

February 7, 2025

Prerequisites


Reference for checking Prometheus / NodeExporter / AlertManager versions

NodeExporter / AlertManager / Loki documentation

Dashboards shared by the Grafana community

Install Prometheus


Packages managed by apt are also available (listed by apt list -a prometheus*), but their versions may be outdated

bash
cd /tmp && curl -OL https://github.com/prometheus/prometheus/releases/download/v3.2.0-rc.1/prometheus-3.2.0-rc.1.linux-amd64.tar.gz && tar xvf prometheus-*.tar.gz
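The Prometheus, NodeExporter, and AlertManager tarballs downloaded in this post share one URL pattern on GitHub. As a minimal sketch, the helper below builds that URL from a repository name and a version, so the version only has to be changed in one place. `release_url` is a hypothetical name (not part of any tool), and the default `linux-amd64` suffix is assumed from the commands in this post.

```shell
# release_url <repo> <version> [arch]
# Prints the GitHub release tarball URL for a component of the
# prometheus organization (hypothetical helper, for illustration only).
release_url() {
  local repo="$1" version="$2" arch="${3:-linux-amd64}"
  printf 'https://github.com/prometheus/%s/releases/download/v%s/%s-%s.%s.tar.gz\n' \
    "$repo" "$version" "$repo" "$version" "$arch"
}
```

For example, curl -OL "$(release_url prometheus 3.2.0-rc.1)" would fetch the same archive as the command above.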
bash
sudo mv prometheus-*/prometheus /usr/local/bin/ && sudo mv prometheus-*/promtool /usr/local/bin/
bash
sudo mkdir -p /etc/prometheus /var/lib/prometheus
bash
sudo mv prometheus-*/{console*,prometheus.yml} /etc/prometheus/
/etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
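A typo in prometheus.yml only surfaces when the service fails to start, so it can help to validate the file first. The sketch below wraps promtool check config, using the promtool binary installed in the earlier step; `check_prometheus_config` is a hypothetical wrapper name and the default path is the one used in this post.

```shell
# check_prometheus_config [path]
# Validates a Prometheus config file with promtool before a (re)start.
check_prometheus_config() {
  local cfg="${1:-/etc/prometheus/prometheus.yml}"
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  # promtool exits non-zero when the file is invalid.
  promtool check config "$cfg"
}
```

Calling check_prometheus_config with no argument checks /etc/prometheus/prometheus.yml.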
bash
sudo vi /etc/systemd/system/prometheus.service
/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
 
[Service]
User=root
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090
 
Restart=always
 
[Install]
WantedBy=multi-user.target
bash
sudo systemctl daemon-reload
bash
sudo systemctl enable prometheus && sudo systemctl start prometheus
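Right after starting, Prometheus may need a moment before it answers requests. Prometheus exposes a readiness endpoint at /-/ready, so one way to confirm the service came up is to poll it. The sketch below is a generic retry helper (`retry` is a hypothetical name); the attempt count and delay are arbitrary choices.

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Pause only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}
```

For example, retry 10 1 curl -fsS http://localhost:9090/-/ready tries up to ten times, one second apart, and returns non-zero if Prometheus never becomes ready.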

Install NodeExporter


bash
cd /tmp && curl -OL https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz && tar xvf node_exporter-*.tar.gz
bash
sudo mv node_exporter-*/node_exporter /usr/local/bin/
bash
sudo vi /etc/systemd/system/node_exporter.service
/etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
 
[Service]
User=root
ExecStart=/usr/local/bin/node_exporter
Restart=always
 
[Install]
WantedBy=default.target
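The steps above create the NodeExporter unit file but do not show reloading systemd or starting the service. Assuming the same procedure as for Prometheus applies here, the sketch below bundles both steps; `enable_service` is a hypothetical helper name.

```shell
# enable_service <unit>
# Reloads systemd unit files, then enables and starts the given unit.
enable_service() {
  local unit="$1"
  sudo systemctl daemon-reload &&
    sudo systemctl enable --now "$unit"
}
```

After enable_service node_exporter, curl -s http://localhost:9100/metrics should return metrics in the Prometheus text format.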

Install Grafana


bash
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg >/dev/null
bash
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
bash
sudo apt update && sudo apt install -y grafana
bash
sudo systemctl enable --now grafana-server
bash
sudo systemctl status grafana-server
Connections -> Data Source -> Add new data source ->  
prometheus -> URL: http://localhost:9090 -> Save & Test
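The data source can also be registered without the UI, through Grafana's provisioning files. The sketch below prints a minimal provisioning YAML for the local Prometheus. `prometheus_datasource_yaml` is a hypothetical helper, the data source name is arbitrary, and the target directory mentioned afterwards is assumed to be the default for the apt package.

```shell
# Prints a Grafana data source provisioning file that points at the
# local Prometheus instance configured earlier in this post.
prometheus_datasource_yaml() {
  cat <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

One way to apply it, under the assumption above: prometheus_datasource_yaml | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml >/dev/null, followed by sudo systemctl restart grafana-server.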

Import dashboard ID 1860 from the Grafana Dashboards page
https://grafana.com/grafana/dashboards/1860-node-exporter-full/

Install NVIDIA DCGM and NVIDIA DCGM Exporter


If the CUDA network repository and GPG key are not installed yet

bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb && \
  sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
  rm cuda-keyring_1.1-1_all.deb && \
  sudo apt update

Check the CUDA version

bash
nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p'
bash
sudo apt install -y datacenter-gpu-manager-4-cuda12
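The two commands above are linked: the major version printed by the sed filter decides which DCGM package to install. The sketch below makes that link explicit. It assumes the package suffix tracks the CUDA major version, as the cuda12 in the install command suggests; `dcgm_package_name` is a hypothetical helper.

```shell
# Reads `nvidia-smi` output on stdin and prints the DCGM 4 package name
# for the reported CUDA major version (naming scheme is an assumption).
dcgm_package_name() {
  local major
  major="$(sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p' | head -n 1)"
  if [ -z "$major" ]; then
    echo "CUDA version not found in input" >&2
    return 1
  fi
  printf 'datacenter-gpu-manager-4-cuda%s\n' "$major"
}
```

Usage: sudo apt install -y "$(nvidia-smi | dcgm_package_name)".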
bash
sudo systemctl enable nvidia-dcgm && sudo systemctl start nvidia-dcgm

If a daemon-reload is required

bash
sudo systemctl daemon-reload
bash
nvidia-smi && dcgmi discovery -l
bash
sudo apt install -y datacenter-gpu-manager-exporter
bash
sudo systemctl enable nvidia-dcgm-exporter && sudo systemctl start nvidia-dcgm-exporter
/etc/prometheus/prometheus.yml
scrape_configs:
  ...
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['localhost:9400']
bash
sudo systemctl restart prometheus

Import dashboard ID 12239 from the Grafana Dashboards page
https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/

Install Prometheus AlertManager


Packages managed by apt are also available (listed by apt list -a prometheus*), but their versions may be outdated

bash
cd /tmp && curl -OL https://github.com/prometheus/alertmanager/releases/download/v0.28.0/alertmanager-0.28.0.linux-amd64.tar.gz && tar -xvf alertmanager-*.tar.gz
bash
sudo mv alertmanager-*/alertmanager /usr/local/bin/ && sudo mv alertmanager-*/amtool /usr/local/bin/
bash
alertmanager --version
bash
sudo mkdir -p /etc/alertmanager && sudo mkdir -p /var/lib/alertmanager
bash
sudo useradd --system --no-create-home --shell /sbin/nologin prometheus
bash
sudo chown -R prometheus:prometheus /etc/alertmanager /var/lib/alertmanager
bash
sudo vi /etc/alertmanager/alertmanager.yml

When delegating alert rules and notification destinations to Grafana Alerting

/etc/alertmanager/alertmanager.yml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'grafana'
 
receivers:
  - name: 'grafana'
 
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
bash
sudo vi /etc/systemd/system/alertmanager.service
/etc/systemd/system/alertmanager.service
[Unit]
Description=AlertManager for Prometheus
After=network.target
 
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager
 
[Install]
WantedBy=multi-user.target
bash
sudo systemctl daemon-reload
bash
sudo systemctl enable alertmanager && sudo systemctl start alertmanager
bash
sudo systemctl status alertmanager
Verify in a browser
http://localhost:9093
bash
sudo vi /etc/prometheus/prometheus.yml
/etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - "localhost:9093"
/etc/prometheus/prometheus.yml
rule_files:
  - "/etc/prometheus/alert.rules.yml"
bash
sudo vi /etc/prometheus/alert.rules.yml
/etc/prometheus/alert.rules.yml
groups:
  - name: CPU High Usage
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "High CPU usage detected on instance {{ $labels.instance }}"
          description: "CPU usage is above 90% for more than 1 minute."
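To make the alert expression concrete: it averages the per-CPU idle rate (a fraction between 0 and 1) for each instance, converts it to a percentage, and subtracts that from 100. The sketch below reproduces only this arithmetic with awk. `cpu_usage_percent` is a hypothetical helper and the input fractions are made-up illustration values, not measurements.

```shell
# cpu_usage_percent <idle_fraction>...
# Mirrors the alert arithmetic: 100 - (avg(idle_rate) * 100)
cpu_usage_percent() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" |
    awk '{ sum += $1; n++ } END { printf "%.1f\n", 100 - (sum / n) * 100 }'
}
```

With idle fractions 0.05 and 0.07, cpu_usage_percent 0.05 0.07 prints 94.0. That is above the threshold of 90, so the alert would go pending and fire once the condition has held for the 1m set in for.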
bash
sudo systemctl restart prometheus
Verify in a browser
http://localhost:9090/alerts
Connections -> Data Source -> Add new data source ->  
alertmanager -> URL: http://localhost:9093 ->  
Receive Grafana Alerts: ON -> Save & Test
Alerting -> Settings ->  
Other Alertmanagers -> alertmanager: enable

Endpoints


Service               URL
Prometheus            http://localhost:9090
NodeExporter          http://localhost:9100
NVIDIA DCGM Exporter  http://localhost:9400
AlertManager          http://localhost:9093
Grafana               http://localhost:3000
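The endpoints listed above can be probed in one pass. The sketch below is a hypothetical `check_endpoints` helper that requests one path per service with curl. The /-/healthy paths (Prometheus, AlertManager) and /api/health (Grafana) are their health endpoints. The /metrics paths for the two exporters and the 3 second timeout are assumptions made for this example.

```shell
# Probes each service and reports whether it answers over HTTP.
check_endpoints() {
  local name url
  while read -r name url; do
    if curl -fsS -o /dev/null --max-time 3 "$url"; then
      echo "$name OK"
    else
      echo "$name FAILED"
    fi
  done <<'EOF'
Prometheus http://localhost:9090/-/healthy
NodeExporter http://localhost:9100/metrics
DCGM-Exporter http://localhost:9400/metrics
AlertManager http://localhost:9093/-/healthy
Grafana http://localhost:3000/api/health
EOF
}
```

Running check_endpoints prints one line per service, which makes it quick to spot a service that did not start.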

Install Loki & Promtail


bash
sudo apt install -y loki promtail

Depending on the version available from the apt repository, configuration-related errors may occur

/var/log/syslog
line 44: field enabled not found in type aggregation.Config. Use `-config.expand-env=true` flag if you want to expand environment variables in your config file

As of February 2025, the enabled key under the metric_aggregation entry was not recognized as a valid config field, so it is commented out

GitHub - metric_aggregation_enabled and metric_aggregation settings do not work in Loki 3.1.2

bash
sudo vi /etc/loki/config.yml
/etc/loki/config.yml
...
pattern_ingester:
  enabled: true
  metric_aggregation:
    # enabled: true
    loki_address: localhost:3100
...
bash
sudo systemctl restart loki.service

Specify the paths of the logs that Promtail should collect

bash
sudo vi /etc/promtail/config.yml
/etc/promtail/config.yml
...
scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      #NOTE: Need to be modified to scrape any additional logs of the system.
      __path__: /var/log/*.log
...
bash
sudo systemctl restart promtail.service
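The `__path__` value is a glob, so it is worth checking which files it actually covers. The sketch below lists the files a pattern expands to using the shell's own globbing, assuming Promtail's `*` behaves the same way within a single directory (Promtail uses its own glob implementation, so treat this as an approximation). `list_matches` is a hypothetical helper.

```shell
# list_matches <glob>
# Prints the existing regular files that a glob pattern expands to.
list_matches() {
  local pattern="$1" f
  # Intentionally unquoted so the shell performs pathname expansion.
  for f in $pattern; do
    [ -f "$f" ] && echo "$f"
  done
  return 0
}
```

For example, list_matches '/var/log/*.log' does not list /var/log/syslog, because that file has no .log suffix. Under the assumption above, collecting it would need an additional path entry.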
Verify in a browser
http://localhost:9080
Connections -> Data Source -> Add new data source ->  
loki -> URL: http://localhost:3100 -> Save & Test

Miscellaneous


Where Prometheus stores collected data

bash
ps aux | grep prometheus

The --storage.tsdb.path=/var/lib/prometheus option should appear in the output

Prometheus API

bash
curl http://localhost:9090/api/v1/targets | jq
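The raw targets response is verbose. As a sketch, the jq filter below reduces it to one tab-separated line per active target (job, scrape URL, health), which is usually enough to see whether the jobs configured in prometheus.yml are being scraped. `summarize_targets` is a hypothetical helper name.

```shell
# Reads the /api/v1/targets JSON on stdin and prints
# "<job> <scrape URL> <health>" (tab-separated) for each active target.
summarize_targets() {
  jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.scrapeUrl)\t\(.health)"'
}
```

Usage: curl -s http://localhost:9090/api/v1/targets | summarize_targets.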

Grafana API

bash
curl -u username:password http://localhost:3000/api/datasources | jq
bash
curl http://localhost:3000/api/health | jq