Prerequisites
Reference for checking the current Prometheus / NodeExporter / AlertManager versions
- https://prometheus.io/download/
- https://prometheus.io/download/#node_exporter
- https://prometheus.io/download/#alertmanager
NodeExporter / AlertManager / Loki documentation
- https://prometheus.io/docs/guides/node-exporter/
- https://prometheus.io/docs/alerting/latest/alertmanager/
- https://grafana.com/docs/loki/latest/setup/install/local/
Dashboards shared by the Grafana community
Install Prometheus
Packages managed by apt are also available (listed by apt list -a 'prometheus*'), but they may be outdated
cd /tmp && curl -OL https://github.com/prometheus/prometheus/releases/download/v3.2.0-rc.1/prometheus-3.2.0-rc.1.linux-amd64.tar.gz && tar xvf prometheus-*.tar.gz
sudo mv prometheus-*/prometheus /usr/local/bin/ && sudo mv prometheus-*/promtool /usr/local/bin/
sudo mkdir /etc/prometheus && sudo mkdir /var/lib/prometheus
sudo mv prometheus-*/{console*,prometheus.yml} /etc/prometheus/
sudo vi /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
sudo vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=root
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090
Restart=always
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable prometheus && sudo systemctl start prometheus
Install NodeExporter
cd /tmp && curl -OL https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz && tar xvf node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/
sudo vi /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=root
ExecStart=/usr/local/bin/node_exporter
Restart=always
[Install]
WantedBy=default.target
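The notes stop at the unit file. A minimal sketch of the remaining step, assuming it mirrors the Prometheus service above: reload systemd, enable and start the unit, then confirm the exporter answers on port 9100. The helper name enable_and_check is hypothetical.

```shell
# Hypothetical helper: reload systemd, enable and start a unit,
# then print the first lines of its metrics page.
#   $1 = unit name, $2 = metrics URL
enable_and_check() {
  sudo systemctl daemon-reload &&
  sudo systemctl enable --now "$1" &&
  # -f makes curl fail on HTTP errors instead of printing an error page
  curl -fsS "$2" | head -n 5
}

# Usage (assumes the unit file above is in place):
#   enable_and_check node_exporter http://localhost:9100/metrics
```

If the last command prints metric lines, the node-exporter job in prometheus.yml should be able to scrape localhost:9100.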
Install Grafana
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg >/dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server
Connections -> Data Source -> Add new data source ->
prometheus -> URL: http://localhost:9090 -> Save & Test
In Grafana, import the dashboard with ID 1860
https://grafana.com/grafana/dashboards/1860-node-exporter-full/
Install NVIDIA DCGM and NVIDIA DCGM Exporter
If the CUDA network repository and its GPG key are not installed yet
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb && \
sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
rm cuda-keyring_1.1-1_all.deb && \
sudo apt update
Check the CUDA major version
nvidia-smi | sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p'
sudo apt install -y datacenter-gpu-manager-4-cuda12
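The package name above hard-codes cuda12. As a sketch, the major version printed by the sed command can be fed into the package name instead. The helper names cuda_major and dcgm_package are hypothetical, and this assumes a datacenter-gpu-manager-4-cuda<major> package exists in the repository for the detected version.

```shell
# Extract the CUDA major version from nvidia-smi output read on stdin.
cuda_major() {
  sed -E -n 's/.*CUDA Version: ([0-9]+)[.].*/\1/p'
}

# Build the DCGM 4 package name for a given CUDA major version.
# Assumption: a package with this name exists for that version.
dcgm_package() {
  printf 'datacenter-gpu-manager-4-cuda%s\n' "$1"
}

# Usage:
#   sudo apt install -y "$(dcgm_package "$(nvidia-smi | cuda_major)")"
```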
sudo systemctl enable nvidia-dcgm && sudo systemctl start nvidia-dcgm
If a daemon-reload is required
sudo systemctl daemon-reload
nvidia-smi && dcgmi discovery -l
sudo apt install -y datacenter-gpu-manager-exporter
sudo systemctl enable nvidia-dcgm-exporter && sudo systemctl start nvidia-dcgm-exporter
scrape_configs:
  ...
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['localhost:9400']
sudo systemctl restart prometheus
In Grafana, import the dashboard with ID 12239
https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
Install Prometheus AlertManager
Packages managed by apt are also available (listed by apt list -a 'prometheus*'), but they may be outdated
cd /tmp && curl -OL https://github.com/prometheus/alertmanager/releases/download/v0.28.0/alertmanager-0.28.0.linux-amd64.tar.gz && tar -xvf alertmanager-*.tar.gz
sudo mv alertmanager-*/alertmanager /usr/local/bin/ && sudo mv alertmanager-*/amtool /usr/local/bin/
alertmanager --version
sudo mkdir -p /etc/alertmanager && sudo mkdir -p /var/lib/alertmanager
sudo useradd --system --no-create-home --shell /sbin/nologin prometheus
sudo chown -R prometheus:prometheus /etc/alertmanager /var/lib/alertmanager
sudo vi /etc/alertmanager/alertmanager.yml
When delegating alert rules and notification targets to Grafana Alerting
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'grafana'
receivers:
  - name: 'grafana'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
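Because indentation mistakes in this file only surface when the service starts, it may help to validate it first. A minimal sketch using amtool check-config, which was installed alongside alertmanager above. The wrapper name check_alertmanager_config is hypothetical.

```shell
# Validate an Alertmanager config file.
# Defaults to the path used in these notes when no argument is given.
check_alertmanager_config() {
  amtool check-config "${1:-/etc/alertmanager/alertmanager.yml}"
}

# Usage:
#   check_alertmanager_config
```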
sudo vi /etc/systemd/system/alertmanager.service
[Unit]
Description=AlertManager for Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable alertmanager && sudo systemctl start alertmanager
sudo systemctl status alertmanager
http://localhost:9093
sudo vi /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"
rule_files:
  - "/etc/prometheus/alert.rules.yml"
sudo vi /etc/prometheus/alert.rules.yml
groups:
  - name: CPU High Usage
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 1m
        labels:
          severity: "critical"
        annotations:
          summary: "High CPU usage detected on instance {{ $labels.instance }}"
          description: "CPU usage is above 90% for more than 1 minute."
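Both files can be checked with promtool before Prometheus is restarted, so a syntax error does not take the service down. A minimal sketch, assuming the paths used in these notes. The function name validate_prometheus_files is hypothetical.

```shell
# Validate the rule file first, then the main config.
# Returns non-zero as soon as one check fails.
#   $1 = rule file (optional), $2 = config file (optional)
validate_prometheus_files() {
  promtool check rules "${1:-/etc/prometheus/alert.rules.yml}" &&
  promtool check config "${2:-/etc/prometheus/prometheus.yml}"
}

# Usage:
#   validate_prometheus_files && echo "safe to restart"
```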
sudo systemctl restart prometheus
http://localhost:9090/alerts
Connections -> Data Source -> Add new data source ->
alertmanager -> URL: http://localhost:9093 ->
Receive Grafana Alerts: ON -> Save & Test
Alerting -> Settings ->
Other Alertmanagers -> alertmanager: enable
Endpoints
Service | URL |
---|---|
Prometheus | http://localhost:9090 |
NodeExporter | http://localhost:9100 |
NVIDIA DCGM Exporter | http://localhost:9400 |
AlertManager | http://localhost:9093 |
Grafana | http://localhost:3000 |
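A quick way to confirm that every service in the table is listening is to request a lightweight path on each and print the HTTP status. This is a sketch: the helper names are hypothetical, and the paths (/-/healthy for Prometheus and AlertManager, /metrics for the exporters, /api/health for Grafana) are chosen here for illustration rather than taken from the table.

```shell
# Print the HTTP status code for one URL ("000" if the connection fails).
http_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Probe each service from the table and print "<status> <url>" per line.
probe_endpoints() {
  for url in \
    http://localhost:9090/-/healthy \
    http://localhost:9100/metrics \
    http://localhost:9400/metrics \
    http://localhost:9093/-/healthy \
    http://localhost:3000/api/health
  do
    printf '%s %s\n' "$(http_status "$url")" "$url"
  done
}

# Usage:
#   probe_endpoints
```

A status of 200 on every line suggests all five services are up; 000 usually means nothing is listening on that port.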
Install Loki & Promtail
sudo apt install -y loki promtail
Depending on the version available from the apt repository, Loki may fail to start with a config error such as
line 44: field enabled not found in type aggregation.Config. Use `-config.expand-env=true` flag if you want to expand environment variables in your config file
As of 2025/02, the enabled field under metric_aggregation was not recognized as a valid config key, so it is commented out
GitHub - metric_aggregation_enabled and metric_aggregation settings do not work in Loki 3.1.2
sudo vi /etc/loki/config.yml
...
pattern_ingester:
  enabled: true
  metric_aggregation:
    # enabled: true
    loki_address: localhost:3100
...
sudo systemctl restart loki.service
Specify the paths of the logs Promtail should collect
sudo vi /etc/promtail/config.yml
...
scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      #NOTE: Need to be modified to scrape any additional logs of the system.
      __path__: /var/log/*.log
...
sudo systemctl restart promtail.service
http://localhost:9080
Connections -> Data Source -> Add new data source ->
loki -> URL: http://localhost:3100 -> Save & Test
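To check from the command line that Promtail is actually shipping logs, Loki's query_range HTTP API can be queried for the job label set in the Promtail config. A minimal sketch, assuming Loki listens on localhost:3100 as above. The function name loki_recent is hypothetical.

```shell
# Fetch recent log lines for a job label from Loki.
#   $1 = job label value (default: varlogs, as set in the Promtail config)
#   $2 = maximum number of entries (default: 5)
loki_recent() {
  curl -G -s "http://localhost:3100/loki/api/v1/query_range" \
    --data-urlencode "query={job=\"${1:-varlogs}\"}" \
    --data-urlencode "limit=${2:-5}"
}

# Usage:
#   loki_recent | jq
```

An empty result list right after installation may simply mean nothing has been written to /var/log/*.log since Promtail started.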
Miscellaneous
Where Prometheus stores collected data
ps aux | grep prometheus
The --storage.tsdb.path=/var/lib/prometheus option should appear in the output
Prometheus API
curl http://localhost:9090/api/v1/targets | jq
Grafana API
curl -u username:password http://localhost:3000/api/datasources | jq
curl http://localhost:3000/api/health | jq