IPC: sharing memory across Python scripts in separate Docker containers

2024-04-15

Question

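I have written a neural network classifier that takes in massive images (~1-3 GB apiece), splits them into patches, and passes the patches through the network individually. Training was going really slowly, so I benchmarked it and found that it took ~50 seconds to load the patches from one image into memory (using the OpenSlide library https://openslide.org/api/python/), and only ~0.5 seconds to pass them through the model.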

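However, I'm working on a supercomputer with 1.5 TB of RAM, of which only ~26 GB is in use. The dataset is ~500 GB in total. My thinking is that if we could load the entire dataset into memory, it would speed up training tremendously. But I am working with a research team, and we are running experiments across multiple Python scripts. So ideally, I would like to load the entire dataset into memory in one script and be able to access it from all the scripts.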

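More details:

We run our individual experiments in separate Docker containers (on the same machine), so the dataset has to be accessible across multiple containers.
The dataset is the Camelyon16 dataset https://camelyon16.grand-challenge.org/; images are stored in .tif format.
We only need to read the images, never write to them.
We only need to access small portions of the dataset at a time.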

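Possible solutions

I have found many posts about how to share Python objects or raw in-memory data across multiple Python scripts:

Sharing Python data across scripts

Server processes with SyncManager and BaseManager in the multiprocessing module | Example 1 https://stackoverflow.com/questions/2545961/how-to-synchronize-a-python-dict-with-multiprocessing/2556974#2556974 | Example 2 https://stackoverflow.com/questions/1171767/comparison-of-the-multiprocessing-module-and-pyro/1955757#1955757 | Docs - Server processes https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes | Docs - SyncManager https://docs.python.org/3/library/multiprocessing.html#managers

Positives: can be shared by processes on different computers over a network (can it be shared by multiple containers?)
Possible issue: slower than using shared memory, according to the docs. If we share memory across multiple containers using a client/server setup, will that be any faster than having every script read from disk?
Possible issue: according to this answer https://stackoverflow.com/questions/47837206/sharing-a-complex-python-object-in-memory-between-separate-processes, the Manager object pickles objects before sending them, which could slow things down.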
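To get a rough feel for the pickling concern raised above, the serialization round trip can be timed in isolation. This is only a sketch under stated assumptions: the zero-filled byte buffer is a hypothetical stand-in for a batch of patches, and the helper name `time_pickle_round_trip` is made up for illustration. It measures `pickle` alone, not the socket transfer a Manager would add on top.

```python
import pickle
import time


def time_pickle_round_trip(payload):
    """Return (elapsed_seconds, restored_object) for one dumps/loads round trip."""
    start = time.perf_counter()
    blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    return elapsed, restored


if __name__ == "__main__":
    # Hypothetical stand-in for image patches: 16 MiB of zero bytes.
    fake_patches = bytes(16 * 1024 * 1024)
    seconds, restored = time_pickle_round_trip(fake_patches)
    print("round trip of {} bytes took {:.4f} s".format(len(restored), seconds))
```

Comparing that figure against the time needed to read the same number of bytes from disk would give a first indication of whether a Manager-based approach can pay off at all.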

The mmap https://stackoverflow.com/questions/34819892/share-variable-data-from-file-among-multiple-python-scripts-with-not-loaded-du module | Docs https://docs.python.org/2/library/mmap.html

Possible issue: mmap maps the file to virtual memory, not physical memory https://stackoverflow.com/questions/14289421/how-to-use-mmap-in-python-when-the-whole-file-is-too-big - it creates a temporary file.
Possible issue: because we only use a small portion of the dataset at a time, virtual memory puts the entire dataset on disk, we run into thrashing https://computer.howstuffworks.com/virtual-memory.htm problems, and the program slows to a crawl.
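For reference, reading a small region of a file through a read-only memory map looks roughly like the sketch below. The helper name `read_region` and the throwaway temporary file are assumptions for illustration; a real .tif would still need decoding by OpenSlide or a similar library, and the sketch only shows the access pattern without making any claim about performance.

```python
import mmap
import os
import tempfile


def read_region(path, offset, length):
    """Return `length` bytes starting at `offset`, read via a read-only mmap."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    # Hypothetical stand-in for an image file: 1 MiB of repeating byte values.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 4096)
    try:
        print(read_region(tmp.name, 256, 4))
    finally:
        os.remove(tmp.name)
```

Because the mapping is read-only, several processes can map the same file at once without coordinating writes.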

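Pyro4 https://stackoverflow.com/questions/1171767/comparison-of-the-multiprocessing-module-and-pyro/1955757#1955757 (client-server for Python objects) | Docs https://pythonhosted.org/Pyro4/intro.html

The sysv_ipc http://semanchuk.com/philip/sysv_ipc/ module for Python. This demo https://github.com/osvenskan/sysv_ipc/tree/develop/demos/sem_and_shm looks promising.

Possible issue: maybe just a lower-level exposure http://semanchuk.com/philip/PythonIpc/ of things already available in the built-in multiprocessing module?

I also found this list https://docs.python.org/3/library/ipc.html of options for IPC/networking in Python.

Some discuss server-client setups, and some discuss serialization/deserialization, which I'm afraid will take longer than just reading from disk. None of the answers I've found address my question: whether any of these would actually improve I/O performance.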

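Sharing memory across Docker containers

Not only do we need to share Python objects/memory across scripts; we need to share them across Docker containers.

The Docker documentation https://docs.docker.com/engine/reference/run/#ipc-settings---ipc explains the --ipc flag pretty well. Based on the documentation, what makes sense to me is running: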

docker run -d --ipc=shareable data-server
docker run -d --ipc=container:data-server data-client

But when I run my client and server in separate containers with the --ipc connection set up as described above, they are unable to communicate with each other. The SO questions I've read (1 https://stackoverflow.com/questions/37701203/shared-memory-across-docker-containers, 2 https://stackoverflow.com/questions/44029035/ipc-communication-between-docker-containers, 3 https://stackoverflow.com/questions/29173193/shared-memory-with-docker-containers-docker-version-1-4-1, 4 https://stackoverflow.com/questions/23889187/is-it-possible-to-share-memory-between-docker-containers) don't address integrating shared memory between Python scripts in separate Docker containers.

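My questions:

1: Would any of these provide faster access than reading from disk? Is it even reasonable to think that sharing in-memory data across processes/containers would improve performance?
2: Which would be the most appropriate solution for sharing in-memory data across multiple Docker containers?
3: How can Python's memory-sharing solutions be integrated with docker run --ipc=<mode>? (Is a shared IPC namespace even the best way to share memory across Docker containers?)
4: Is there a better solution than these to fix our problem of large I/O overhead?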

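Minimal working example - updated. Requires no external dependencies!

This is my naive approach to sharing memory between Python scripts in separate containers. It works when the Python scripts run in the same container, but not when they run in separate containers.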

server.py

from multiprocessing.managers import SyncManager
import multiprocessing

patch_dict = {}

image_level = 2
image_files = ['path/to/normal_042.tif']
region_list = [(14336, 10752),
               (9408, 18368),
               (8064, 25536),
               (16128, 14336)]

def load_patch_dict():

    for i, image_file in enumerate(image_files):
        # We would load the image files here. As a placeholder, we just add `1` to the dict
        patches = 1
        patch_dict.update({'image_{}'.format(i): patches})

def get_patch_dict():
    return patch_dict

class MyManager(SyncManager):
    pass

if __name__ == "__main__":
    load_patch_dict()
    port_num = 4343
    MyManager.register("patch_dict", get_patch_dict)
    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    # Set the authkey because it doesn't set properly when we initialize MyManager
    multiprocessing.current_process().authkey = b"password"
    manager.start()
    input("Press Enter to kill server".center(50, "-"))
    manager.shutdown()

client.py

from multiprocessing.managers import SyncManager
import multiprocessing
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("patch_dict")

if __name__ == "__main__":
    port_num = 4343

    manager = MyManager(("127.0.0.1", port_num), authkey=b"password")
    multiprocessing.current_process().authkey = b"password"
    manager.connect()
    patch_dict = manager.patch_dict()

    keys = list(patch_dict.keys())
    for key in keys:
        image_patches = patch_dict.get(key)
        # Do NN stuff (irrelevant)

These scripts share the images without problems when they run in the same container. But when they run in separate containers, like this:

# Run the container for the server
docker run -it --name cancer-1 --rm --cpus=10 --ipc=shareable cancer-env
# Run the container for the client
docker run -it --name cancer-2 --rm --cpus=10 --ipc=container:cancer-1 cancer-env

I get the following error:

Traceback (most recent call last):
  File "patch_client.py", line 22, in <module>
    manager.connect()
  File "/usr/lib/python3.5/multiprocessing/managers.py", line 455, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 487, in Client
    c = SocketClient(address)
  File "/usr/lib/python3.5/multiprocessing/connection.py", line 614, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

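I recommend you try tmpfs http://man7.org/linux/man-pages/man5/tmpfs.5.html.

It is a Linux feature that lets you create a virtual file system whose contents are all stored in RAM. This allows very fast file access and takes as little as one bash command to set up.

In addition to being very fast and straightforward, it has many advantages in your case:

No need to touch the current code - the structure of the dataset stays the same
No extra work to create the shared dataset - just cp the dataset into the tmpfs
Generic interface - being a filesystem, the in-RAM dataset integrates easily with other components in your system that aren't necessarily written in Python. For example, it is easy to use inside your containers: just pass the mount directory into them.
Will fit other environments - if your code has to run on a different server, tmpfs can adapt and swap pages to the hard drive. If you have to run this on a server with no free RAM, you can keep all your files on the hard drive with a normal filesystem and not touch your code at all.

Steps to use:

Create a tmpfs - sudo mount -t tmpfs -o size=600G tmpfs /mnt/mytmpfs
Copy the dataset - cp -r dataset /mnt/mytmpfs
Change all references from the current dataset to the new dataset
Enjoy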
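The copy and "change all references" steps above can also be sketched in Python, which may be convenient if the experiments already have a setup script. Everything named here is an assumption for illustration: the helpers `stage_dataset` and `dataset_root`, and the `DATASET_ROOT` environment variable, are hypothetical and not part of any library. The tmpfs itself is assumed to be mounted already, as in the first step.

```python
import os
import shutil


def stage_dataset(src_dir, ram_mount):
    """Copy src_dir into ram_mount (e.g. a tmpfs mount) and return the new root."""
    # Total size of all files under src_dir, to fail early if it cannot fit.
    needed = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, files in os.walk(src_dir)
        for name in files
    )
    free = shutil.disk_usage(ram_mount).free
    if needed > free:
        raise OSError("dataset needs {} bytes, only {} free".format(needed, free))
    target = os.path.join(ram_mount, os.path.basename(os.path.normpath(src_dir)))
    shutil.copytree(src_dir, target)  # raises if target already exists
    return target


def dataset_root(default):
    """Resolve the dataset location from a hypothetical DATASET_ROOT variable."""
    return os.environ.get("DATASET_ROOT", default)
```

With something like this, each experiment script would open files under `dataset_root("path/to/dataset")`, and switching between the on-disk copy and the in-RAM copy becomes a matter of setting one environment variable per container rather than editing paths in code.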

Edit:

ramfs might be faster than tmpfs in some cases, as it doesn't implement page swapping. To use it, just replace tmpfs with ramfs in the instructions above.
