System Requirements
SW | Version
---|---
Ubuntu Server | 22.04
kubectl | 1.27.3
kubeadm | 1.27.3
containerd (CRI) | 1.6.8
Calico (CNI) | 3.25.1
gpu-operator | 23.3.2

HW | Spec | Count
---|---|---
CPU | 16 core | 2
RAM | 32 GB | 1
Storage | 128 GB | 1
GPU | GeForce RTX 2070 SUPER | 2
Prerequisites
- Confirm that the GPUs are recognized on the server running the worker node
lspci | grep -i nvidia
sudo lshw -C display
Install Helm & gpu-operator
- Install Helm (the Kubernetes package manager)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.12.2-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
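To confirm the installation, you can print the Helm client version (the exact version string depends on when you run the install script):
helm version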
- Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
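If you want to pin the chart to a specific release (for example the 23.3.2 listed in the requirements above), list the versions available in the repository and pass --version to helm install:
helm search repo nvidia/gpu-operator --versions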
- Install gpu-operator
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
To specify a values.yaml file (a minimal sketch is shown after the install output below):
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
-f values.yaml
NAME: gpu-operator-1689295438
LAST DEPLOYED: <timestamp>
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
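If you take the values.yaml route shown above, the sketch below covers only the two keys this guide actually uses later (driver.enabled and toolkit.enabled, set to false when the driver and container toolkit are managed on the host, as in the troubleshooting section at the end); any other keys depend on the chart version you install:
# values.yaml (minimal sketch)
driver:
  enabled: false
toolkit:
  enabled: false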
- Confirm that the Pods have been scheduled
kubectl get pod,svc -A -o wide
kubectl get pod -n gpu-operator
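Once the operator's device plugin is running, the worker node should also advertise nvidia.com/gpu in its allocatable resources; a quick way to check (replace <node_name> with the worker node's name):
kubectl describe node <node_name> | grep -A 8 Allocatable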
- Create a Pod / Container
Ubuntu
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-server
spec:
  containers:
    - name: ubuntu-server
      image: ubuntu:latest
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      resources:
        limits:
          nvidia.com/gpu: 2
EOF
CUDA Toolkit
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.0.0-devel-ubuntu22.04
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      resources:
        limits:
          nvidia.com/gpu: 2
EOF
CUDA Toolkit - Local Image
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
    - name: cuda-app
      image: <image_name>:<tag_name>
      imagePullPolicy: Never
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      resources:
        limits:
          nvidia.com/gpu: 2
EOF
CUDA Toolkit - Local Image & Mount
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: <image_name>:<tag_name>
      imagePullPolicy: Never
      command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
      volumeMounts:
        - mountPath: <path>
          name: <volume_name>
      resources:
        limits:
          nvidia.com/gpu: 2
  volumes:
    - name: <volume_name>
      hostPath:
        path: <path>
        type: Directory
EOF
- Connect to a Pod / Container
kubectl exec -it <pod_name> -c <container_name> -- /bin/bash
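To verify that the GPUs are actually visible from inside a container, you can also run a one-off command instead of an interactive shell; the example assumes the cuda-app Pod created above:
kubectl exec cuda-app -- nvidia-smi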
- Create a Deployment
cat << EOF | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: <image_name>:<tag_name>
          imagePullPolicy: Never
          command: ["/bin/bash", "-c", "while true; do sleep 1000; done"]
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 2
EOF
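Note that each replica requests nvidia.com/gpu: 2, so with the two GPUs listed in the hardware table and the default device-plugin behavior at most one replica can be scheduled at a time; adjust the limit or the replica count accordingly, for example:
kubectl scale deployment <deployment_name> --replicas=<count>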
- Create a Service
kubectl expose deployment <deployment_name> --type=NodePort --port=8080
- Check the externally exposed port
kubectl describe svc <service_name>
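Assuming the container actually listens on port 8080 (the sample Deployment above only runs a sleep loop, so treat this as a sketch), the Service can then be reached from outside the cluster through the NodePort reported by the command above:
curl http://<node_ip>:<node_port>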
If the setup does not complete with the steps above
When gpu-operator is run with the option that has it manage the driver, the following error may occur.
The installation can fail because the volume allocated to the node, or the system as a whole, runs out of space during the apt package update.
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 525.105.17 for Linux kernel version 5.15.0-76-generic
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
/usr/local/bin/nvidia-driver: line 105: cd: /usr/src/nvidia-525.105.17/kernel: No such file or directory
Checking NVIDIA driver packages...
Updating the package cache...
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (28: No space left on device)
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
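Before switching to host-managed drivers, it is worth confirming that disk space really is the problem and freeing what you can on the worker node: df -h shows free space, apt clean drops the local package cache, and journalctl --vacuum-size shrinks the systemd journal. The exact cleanup depends on your environment, but these are a safe starting point:
df -h
sudo apt clean
sudo journalctl --vacuum-size=200M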
- Disable the Nouveau driver if it is enabled
lsmod | grep nouveau
sudo sh -c "echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nouveau.conf"
sudo sh -c "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nouveau.conf"
sudo update-initramfs -u
sudo reboot
- Install the NVIDIA driver on the host machine
Adjust the version as needed.
sudo apt install -y nvidia-driver-530
- Check the GPU information with the following commands
nvidia-smi
modinfo nvidia
- Install the CUDA Toolkit (runfile local installer)
Adjust the version as needed.
wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda_12.0.0_525.60.13_linux.run
chmod +x cuda_12.0.0_525.60.13_linux.run
mkdir -p $HOME/files/tmp
sudo ./cuda_12.0.0_525.60.13_linux.run --tmpdir=$HOME/files/tmp --toolkit --silent --override
- Set PATH (add the following two export lines to $HOME/.bashrc, then reload it)
vi $HOME/.bashrc
export PATH="/usr/local/cuda-12.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH"
source $HOME/.bashrc
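To confirm that the toolkit is now on the PATH, nvcc (part of the CUDA Toolkit installed above) should report the 12.0 release:
nvcc --version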
- Install the CUDA Toolkit (apt package)
sudo apt install -y nvidia-cuda-toolkit
- If the module is not loaded even after a reboot, run the following command
sudo modprobe nvidia
- Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Since apt-key is deprecated, the following approach is preferable:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
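To check that the package is in place, the toolkit ships the nvidia-ctk CLI, whose version can be printed (the output format varies by toolkit release):
nvidia-ctk --version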
- Change containerd's default runtime to nvidia
sudo vi /etc/containerd/config.toml
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
sudo systemctl restart containerd
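Recent NVIDIA Container Toolkit releases can also generate this containerd configuration for you: nvidia-ctk runtime configure --runtime=containerd adds the nvidia runtime entry to /etc/containerd/config.toml (whether it also sets the default runtime depends on the toolkit version, so check the resulting file), and containerd must be restarted afterwards as above:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd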
- Install gpu-operator (with the operator's driver and toolkit management disabled)
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false