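I'm finding it very hard to follow the Ray guides for running a Docker image on a Ray cluster in order to execute a Python script. I can't find a simple working example.
So I have the simplest possible Dockerfile: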
FROM rayproject/ray
WORKDIR /usr/src/app
COPY . .
CMD ["step_1.py"]
ENTRYPOINT ["python3"]
I use it to build an image and push it to Docker Hub ("myimage" is just an example name):
docker build -t myimage .
docker push myimage
"step_1.py" prints hello once per second for 200 seconds:
import time

# Print "hello" once per second for 200 seconds
for i in range(200):
    time.sleep(1)
    print("hello")
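A minimal sketch of the same script, assuming its output is read through `docker logs` or another pipe rather than an interactive terminal. In that case Python buffers stdout, so the lines may only show up late or all at once; passing `flush=True` to `print` avoids that. The function name `say_hello` and its parameters `count`, `delay`, and `out` are hypothetical and not part of my original script.

```python
import sys
import time


def say_hello(count=200, delay=1.0, out=sys.stdout):
    """Print "hello" `count` times, sleeping `delay` seconds before each line."""
    for _ in range(count):
        time.sleep(delay)
        # flush=True writes each line immediately instead of waiting
        # for Python's stdout buffer to fill up.
        print("hello", file=out, flush=True)
```

Calling `say_hello()` with the defaults reproduces the 200-second run described above; smaller values of `count` and `delay` are handy for a quick check inside the container.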
Here is my config.yaml. Again, very simple:
cluster_name: simple-1
min_workers: 0
max_workers: 2
docker:
    image: "myimage"
    container_name: "my_simple_docker_container"
    pull_before_run: True
idle_timeout_minutes: 5
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a
file_mounts_sync_continuously: False
auth:
    ssh_user: ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem
head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxx826a6b31fd2c
    KeyName: aws_ubuntu_test
    BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
              VolumeSize: 200
worker_nodes:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxx826a6b31fd2c
    KeyName: aws_ubuntu_test
    InstanceMarketOptions:
        MarketType: spot
head_setup_commands:
    - pip install boto3==1.4.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
In the terminal I run:
ray up simple1.yaml
and every time I get this error:
shared connection to x.x.xx.119 closed.
"docker cp" requires exactly 2 arguments.
See 'docker cp --help'.
Usage: docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH|-
docker cp [OPTIONS] SRC_PATH|- CONTAINER:DEST_PATH
Copy files/folders between a container and the local filesystem
Shared connection to x.x.xx.119 closed.
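I do not know which step of `ray up` issues this `docker cp` call. As a general illustration only (this is not Ray's actual code), the sketch below shows one way such a message can arise: if a command string is assembled from a source and a destination and one of them is empty, the shell ends up passing a single argument to `docker cp`. The helper names `build_cp_command` and `docker_cp_args` and the path `/tmp/some_file` are hypothetical.

```python
import shlex


def build_cp_command(src, dest):
    """Naively format a `docker cp` command line (hypothetical helper)."""
    # An empty src or dest leaves nothing behind but whitespace.
    return f"docker cp {src} {dest}"


def docker_cp_args(command):
    """Return the arguments a POSIX shell would hand to `docker cp`."""
    tokens = shlex.split(command)
    # tokens[0] is "docker", tokens[1] is "cp"; the rest are its arguments.
    return tokens[2:]


container_path = "my_simple_docker_container:/tmp/some_file"

# Both paths present: two arguments, which is what `docker cp` expects.
good = docker_cp_args(build_cp_command("/tmp/some_file", container_path))

# Empty source path: only one argument survives word splitting.
bad = docker_cp_args(build_cp_command("", container_path))

print(len(good), len(bad))
```

If something like this is happening, the thing to look for would be a path or container name that ends up empty when the command is built.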
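The Docker image runs fine on any other remote machine, just not on a Ray cluster.
If someone could help me I would be forever grateful, and I even promise to write a tutorial on Medium once I get through this struggle.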