ZeroLoom

How to Set Up the GPU Operator

September 30, 2024

System Requirements


Software          Version
Ubuntu Server     22.04
kubectl           1.27.3
kubeadm           1.27.3
containerd (CRI)  1.6.8
Calico (CNI)      3.25.1
gpu-operator      23.3.2

Hardware  Spec                    Count
CPU       16 core                 2
RAM       32 GB                   1
Storage   128 GB                  1
GPU       GeForce RTX 2070 SUPER  2

Prerequisites


  1. Confirm that the GPU is recognized on the server running the worker node
bash
lspci | grep -i nvidia
sudo lshw -C display
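If you also want to see which kernel driver is currently bound to the GPU (for example, nouveau before the NVIDIA driver is installed), lspci can show it. A quick check, assuming the device listed above is the NVIDIA GPU:
bash
# Show the NVIDIA devices together with the kernel driver currently in use
lspci -nnk | grep -A 3 -i nvidia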

Install Helm & gpu-operator


  1. Install Helm (the Kubernetes package manager)
bash
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
Result
Downloading https://get.helm.sh/helm-v3.12.2-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
  2. Add the NVIDIA Helm repository
bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update
Result
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
  3. Install gpu-operator
bash
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator

When specifying a values.yaml (a minimal sketch follows the result below)

bash
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  -f values.yaml
Result
NAME: gpu-operator-1689295438
LAST DEPLOYED: <timestamp>
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
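For reference, a minimal values.yaml sketch. Only driver.enabled and toolkit.enabled are shown, since they are the same chart values used with --set later in this post; check anything beyond this against the chart's default values before relying on it.
values.yaml
# Set to false when the NVIDIA driver is already installed on the host
driver:
  enabled: true
# Set to false when the NVIDIA Container Toolkit is already installed on the host
toolkit:
  enabled: true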
  4. Confirm that the operator's Pods have been scheduled
bash
kubectl get pod,svc -A -o wide
kubectl get pod -n gpu-operator
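To confirm that the node actually advertises GPUs as a schedulable resource, you can inspect its allocatable resources. A quick check (substitute your node name, and widen -A if the list is longer on your node):
bash
# nvidia.com/gpu should appear under Allocatable once the operator is ready
kubectl describe node <node_name> | grep -A 10 -i "allocatable"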
  5. Create a Pod / Container

Ubuntu

bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-server
spec:
  containers:
    - name: ubuntu-server
      image: ubuntu:latest
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      resources:
        limits:
          nvidia.com/gpu: 2
EOF

CUDA Toolkit

bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.0.0-devel-ubuntu22.04
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      resources:
        limits:
          nvidia.com/gpu: 2
EOF

CUDA Toolkit - Local Image

bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
    - name: cuda-app
      image: <image_name>:<tag_name>
      imagePullPolicy: Never
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      resources:
        limits:
          nvidia.com/gpu: 2
EOF

CUDA Toolkit - Local Image & Mount

bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: <image_name>:<tag_name>
      imagePullPolicy: Never
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      volumeMounts:
        - mountPath: <path>
          name: <volume_name>
      resources:
        limits:
          nvidia.com/gpu: 2
  volumes:
    - name: <volume_name>
      hostPath:
        path: <path>
        type: Directory
EOF
  6. Connect to the Pod / Container
bash
kubectl exec -it <pod_name> -c <container_name> -- /bin/bash
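Once inside a Pod that requested nvidia.com/gpu, the GPUs should be visible to the container. A quick check using the cuda-app Pod from the examples above, without opening an interactive shell:
bash
# Run nvidia-smi inside the cuda-app Pod
kubectl exec cuda-app -c cuda-app -- nvidia-smi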
  7. Create a Deployment
bash
cat << EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: <image_name>:<tag_name>
          imagePullPolicy: Never
          command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 2
EOF
  8. Create a Service
bash
kubectl expose deployment <deployment_name> --type=NodePort --port=8080
  9. Check the externally exposed port
bash
kubectl describe svc <service_name>
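If you only need the assigned NodePort number (for example, to script a health check), a jsonpath query can pull it directly. A sketch assuming the Service exposes a single port:
bash
# Print only the NodePort assigned by --type=NodePort
kubectl get svc <service_name> -o jsonpath='{.spec.ports[0].nodePort}'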

If the setup does not complete with the steps above


When gpu-operator is run with the option that has it manage the driver, the following error may occur.

The installation can fail because the apt package update runs out of space, either on the volume assigned to the node or on the system as a whole.

Log
========== NVIDIA Software Installer ==========
 
Starting installation of NVIDIA driver version 525.105.17 for Linux kernel version 5.15.0-76-generic
 
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
/usr/local/bin/nvidia-driver: line 105: cd: /usr/src/nvidia-525.105.17/kernel: No such file or directory
Checking NVIDIA driver packages...
Updating the package cache...
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (28: No space left on device)
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
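Since the error above is "No space left on device", it may be worth checking the node's free space before working through the steps below. A quick check, assuming containerd stores images under the default /var/lib/containerd:
bash
df -h /
# Container images are often the main consumer of space on a worker node
sudo du -sh /var/lib/containerd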
  1. If the Nouveau driver is enabled, disable it
bash
lsmod | grep nouveau
sudo sh -c "echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nouveau.conf"
sudo sh -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot
  2. Install the NVIDIA driver on the host machine

Change the version as appropriate.

bash
sudo apt install -y nvidia-driver-530
  3. Check the GPU information with the following commands
bash
nvidia-smi
modinfo nvidia
  4. Install the CUDA Toolkit (runfile, local)

Change the version as appropriate.

bash
wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda_12.0.0_525.60.13_linux.run
chmod +x cuda_12.0.0_525.60.13_linux.run
mkdir -p $HOME/files/tmp
sudo ./cuda_12.0.0_525.60.13_linux.run --tmpdir=$HOME/files/tmp --toolkit --silent --override
  5. Set the PATH
bash
vi $HOME/.bashrc
.bashrc
PATH="/usr/local/cuda-12.0/bin:$PATH"
LD_LIBRARY_PATH="/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH"
bash
source $HOME/.bashrc
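After reloading .bashrc, the nvcc binary from the runfile install should be on the PATH; a quick sanity check:
bash
# Should report the CUDA 12.0 toolchain installed above
nvcc --version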
  6. Install the CUDA Toolkit (apt package)
bash
sudo apt install -y nvidia-cuda-toolkit
  7. If the module is not loaded even after a reboot, run the following command
bash
sudo modprobe nvidia
  8. Install the NVIDIA Container Toolkit
bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Since apt-key is deprecated, the following may be preferable:

bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
bash
sudo apt update && sudo apt install -y nvidia-container-toolkit
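As an alternative to editing config.toml by hand in the next step, recent versions of the NVIDIA Container Toolkit ship nvidia-ctk, which can patch the containerd configuration for you. A hedged sketch; the manual edit below achieves the same result:
bash
# Register the nvidia runtime in /etc/containerd/config.toml
# (--set-as-default also marks it as the default runtime, matching the manual edit below)
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd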
  9. Update containerd's default runtime to nvidia
bash
sudo vi /etc/containerd/config.toml
/etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
bash
sudo systemctl restart containerd
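To confirm that containerd picked up the change after the restart, the merged configuration can be dumped and filtered:
bash
# Should print: default_runtime_name = "nvidia"
sudo containerd config dump | grep default_runtime_name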
  10. Install gpu-operator with driver and toolkit management disabled
bash
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false
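
Finally, a minimal smoke test that schedules one GPU and runs nvidia-smi once. This reuses the CUDA image shown earlier in this post; adjust the image and GPU count to your environment:
bash
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-smoke-test
      image: nvidia/cuda:12.0.0-devel-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
# Once the Pod has completed, its log should contain the usual nvidia-smi table
kubectl logs gpu-smoke-test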