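Run tests on GPU using Ray

As part of Ludwig's GitHub Actions PR checks, all Ludwig tests must pass both with and without GPUs available.

To debug a specific test on GPU, it can be useful to run the Ludwig GPU tests with Ray.

Setup

1. Set up an AWS AMI with GPUs

Contact your AWS account administrator, or set up an account yourself.

2. Check that the AWS CLI is installed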

aws s3 ls

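If it is not installed, install it by following the AWS CLI installation instructions.

3. Set up AWS keys

AWS credentials (Ray needs these to authenticate you)

See "How to create an AWS access key ID".

After creating the access key, download it so you can refer to it later.
Run aws configure to configure the AWS CLI with your access credentials.

See "Configuration and credential file settings - AWS Command Line Interface".
(Optional) Get an AWS PEM file

This file is not needed for unit tests on GPU, because they do not launch new nodes. It is needed if you want Ray to be able to launch new nodes.

See "Amazon EC2 key pairs and Linux instances - Amazon Elastic Compute Cloud".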
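Since the cluster config later mounts the local .aws directory into the container, it may help to confirm that a credentials file is actually there. The sketch below is a minimal check, assuming the default location ~/.aws/credentials that aws configure normally writes to; the function name has_aws_credentials is hypothetical and not part of any tool.

```shell
# Hypothetical helper: succeed if a non-empty credentials file exists
# in the given directory (defaults to ~/.aws).
has_aws_credentials() {
    aws_dir="${1:-$HOME/.aws}"
    [ -s "$aws_dir/credentials" ]
}

# Example usage:
#   has_aws_credentials && echo "credentials file found"
# To confirm the keys are accepted by AWS, one option is:
#   aws sts get-caller-identity
```

The file check only shows that something was written locally; the aws sts get-caller-identity call in the comment is what actually exercises the keys against AWS.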

4. Get Ray

Install Ray locally:

pip install -U "ray[default]" boto3
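At this point both the aws and ray command-line tools should be on your PATH. A small sketch for confirming that before continuing is shown below; check_tools is a hypothetical helper name, and it only tests that the commands can be found, not that they are correctly configured.

```shell
# Hypothetical helper: report whether each named command is on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage:
#   check_tools aws ray
```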

5. Set up the Ray config

vim $HOME/.clusters/cluster.yaml

Copy the sample Ray config below and edit all <...> values to match your local development environment.

cluster_name: <$USER>-ludwig-ray-g4dn

max_workers: 3

docker:
  image: "ludwigai/ludwig-ray-gpu:master"
  container_name: "ray_container"
  pull_before_run: True
  run_options: # Extra options to pass into "docker run"
    - --ulimit nofile=65536:65536

provider:
  type: aws
  region: <us-east-2>
  availability_zone: <us-east-2a>

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: g4dn.4xlarge
      ImageId: latest_dlami
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 100
  ray.worker.default:
    min_workers: 0
    max_workers: 0
    resources: {}
    node_config:
      InstanceType: g4dn.4xlarge
      ImageId: latest_dlami

head_node_type: ray.head.default

file_mounts:
  {
    /home/ubuntu/ludwig/: </Users/$USER/ludwig>,  # Ludwig Repo.
    /home/ray/.aws: </Users/$USER/.aws>,  # AWS credentials.
  }

rsync_exclude:
  - "**/.git"
  - "**/.git/**"

rsync_filter:
  - ".gitignore"

setup_commands:
  - pip uninstall -y ludwig && pip install -e /home/ubuntu/ludwig/.
  - pip install s3fs==2021.10.0 aiobotocore==1.4.2 boto3==1.17.106
  - pip install pandas==1.1.4
  - pip install hydra-core --upgrade

head_start_ray_commands:
  - ray stop --force
  - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - ray stop --force
  - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
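Because every <...> value in the sample config has to be replaced by hand, a quick scan for leftovers can catch a missed one before the cluster is launched. The following is a rough sketch; find_placeholders is a hypothetical helper, and it assumes the edited config contains no legitimate text of the form <...>.

```shell
# Hypothetical helper: print (with line numbers) every line of the given file
# that still contains a <...> placeholder.
# Returns 0 if none remain, 1 if any were found, 2 if the file is missing.
find_placeholders() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 2
    fi
    if grep -n '<[^<>]*>' "$1"; then
        return 1
    fi
    return 0
}

# Example usage:
#   find_placeholders "$HOME/.clusters/cluster.yaml" && echo "no placeholders left"
```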

Set an environment variable that points to the location of the cluster config file (a relative path also works):

export CLUSTER="$HOME/.clusters/cluster.yaml"

Developer workflow

(One-time) Launch the Ray cluster

export CLUSTER="$HOME/cluster_g4dn.yaml"
export CLUSTER_CPU="$HOME/cluster_cpu.yaml"
ray up $CLUSTER

Make local changes

Run the tests locally.

pytest tests/...

Rsync local changes to the Ray GPU cluster

ray rsync_up $CLUSTER -A "/Users/$USER/ludwig/" '/home/ubuntu/ludwig'
ray rsync_up $CLUSTER_CPU -A "/Users/$USER/ludwig/" '/home/ubuntu/ludwig'

Warning

The trailing slash / on the source path is important!
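The warning most likely reflects the usual rsync convention, assuming ray rsync_up passes the paths through to rsync unchanged: a source path ending in / means "copy the contents of this directory", while a source without it copies the directory itself into the target, which here would presumably leave the repo nested one level too deep. A small sketch for normalizing a path before syncing follows; with_trailing_slash is a hypothetical helper name.

```shell
# Hypothetical helper: print the given path with exactly one trailing slash added
# if it does not already end in one.
with_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# Example usage:
#   ray rsync_up "$CLUSTER" -A "$(with_trailing_slash "/Users/$USER/ludwig")" '/home/ubuntu/ludwig'
```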

Run the tests on the GPU cluster from the Ray-mounted ludwig directory

ray exec $CLUSTER "cd /home/ubuntu/ludwig && pytest tests/"
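When debugging a single test, running the whole tests/ directory is usually more than needed. The sketch below builds the remote command for a narrower pytest target; remote_pytest_cmd is a hypothetical helper, and the test path in the usage comment is a made-up placeholder rather than a real file in the Ludwig repo.

```shell
# Hypothetical helper: print the command to run on the head node for a given
# pytest target (defaults to the whole tests/ directory).
remote_pytest_cmd() {
    target="${1:-tests/}"
    printf 'cd /home/ubuntu/ludwig && pytest %s' "$target"
}

# Example usage (the test path is a placeholder):
#   ray exec "$CLUSTER" "$(remote_pytest_cmd tests/some_dir/test_example.py)"
```

Keeping the directory change and the pytest call in one quoted string matters because ray exec runs the whole string as a single command on the head node.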

You can also attach directly to a terminal on the cluster head node:

ray attach $CLUSTER