본문으로 건너뛰기

MPIJob

MPIJob

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
spec:
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
containers:
- name: mpi-launcher
image: ""
command: [mpirun]
args:
- -n
- "2"
- --bind-to
- none
- --map-by
- slot
- --mca
- pml
- ob1
- --mca
- btl
- ^openib
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- python3
- train.py
Worker:
replicas: 2
template:
spec:
containers:
- name: mpi-worker
image: ""
  • sshAuthMountPath: <dir>
    • 생성된 SSH Secret의 공개키와 개인키가 각각 authorized_keys, id_rsa로 마운트됩니다.
    • 기본값은 /root/.ssh입니다.
  • launcherCreationPolicy: AtStartup|WaitForWorkersReady
  • runLauncherAsWorker: false
  • slotsPerWorker: 1
    • hostfile에 설정될 slot 수입니다.
  • runPolicy
    • backoffLimit: <number>
      • Failed로 간주하기 전에 Launcher를 재시도하는 횟수입니다.
      • 기본값은 6입니다.
    • ttlSecondsAfterFinished: <seconds>
      • Launcher가 완료된 후 삭제되기까지 대기하는 시간입니다.
      • 기본값은 무한입니다.
    • cleanPodPolicy: Running|All|None
      • Launcher가 완료된 후 Pod을 정리하는 정책입니다.
      • Running: Running 상태인 Pod을 정리합니다. 기본값입니다.
    • suspend: true|false
      • Launcher, Worker를 중지할지 여부입니다.
  • mpiReplicaSpecs
    • Launcher
      • Job이 생성됩니다.
      • replicas: <replicas>
        • 기본값은 1 입니다.
      • restartPolicy: Always|OnFailure|Never
        • 기본값은 Nerver입니다.
      • template: <core/v1.PodTemplateSpec>
    • Worker
      • Pod이 생성됩니다.
      • template
        • spec
          • containers
            • command: []
              • 기본 값으로 [/usr/sbin/sshd, -De]가 설정됩니다.
경고

Launcher.template 설정을 변경하면 생성된 launcher Job을 삭제해야 변경된 설정이 적용됩니다.

Container Base

build/base/sshd_config
PidFile /home/mpiuser/sshd.pid
HostKey /home/mpiuser/.ssh/id_rsa
StrictModes no
build/base/Dockerfile
FROM debian:trixie

ARG port=2222

RUN apt update && apt install -y --no-install-recommends \
openssh-server \
openssh-client \
libcap2-bin \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Add priviledge separation directoy to run sshd as root.
RUN mkdir -p /var/run/sshd
# Add capability to run sshd as non-root.
RUN setcap CAP_NET_BIND_SERVICE=+eip /usr/sbin/sshd
RUN apt remove libcap2-bin -y

# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.
RUN sed -i "s/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g" /etc/ssh/ssh_config \
&& echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
&& sed -i "s/[ #]\(.*Port \).*/ \1$port/g" /etc/ssh/ssh_config \
&& sed -i "s/#\(StrictModes \).*/\1no/g" /etc/ssh/sshd_config \
&& sed -i "s/#\(Port \).*/\1$port/g" /etc/ssh/sshd_config

RUN useradd -m mpiuser
WORKDIR /home/mpiuser
# Configurations for running sshd as non-root.
COPY --chown=mpiuser sshd_config .sshd_config
RUN echo "Port $port" >> /home/mpiuser/.sshd_config

이를 바탕으로 이미지를 만드는 경우 MPIJob에는 아래와 같은 설정들이 반영되어야 합니다.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
spec:
sshAuthMountPath: /home/mpiuser/.ssh
mpiReplicaSpecs:
Launcher:
template:
spec:
containers:
- name: mpi-launcher
securityContext:
runAsUser: 1000
Worker:
template:
spec:
containers:
- name: mpi-worker
securityContext:
runAsUser: 1000
command:
- /usr/sbin/sshd
args:
- -De
- -f
- /home/mpiuser/.sshd_config