When I run an MPI job over InfiniBand, I get the following warning. We use the Torque resource manager.
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel module
parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: host1
Registerable memory: 65536 MiB
Total memory: 196598 MiB
Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
I have read the page linked in the warning message, and this is what I have done so far:
- Append
  options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
  to /etc/modprobe.d/mlx4_en.conf.
- Make sure the following lines are present in /etc/security/limits.conf:
  * soft memlock unlimited
  * hard memlock unlimited
- Append
  session required pam_limits.so
  to /etc/pam.d/sshd.
- Make sure
  ulimit -c unlimited
  is uncommented in /etc/init.d/pbs_mom.
Can anyone help me find out what I am missing?
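Your mlx4_core parameters allow registering only 2^20 * 2^4 * 4 KiB = 64 GiB. Each node has 192 GiB of physical memory, and it is recommended that the registerable memory be at least twice the RAM, so you should set log_num_mtt to 23. That raises the limit to 2^23 * 2^4 * 4 KiB = 512 GiB, the smallest power of two that is greater than or equal to twice the amount of RAM (2 * 192 GiB = 384 GiB). Be sure to reboot the node, or unload and then reload the kernel module, for the change to take effect.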
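The arithmetic above can be checked with a small shell sketch. The function name reg_gib is made up for illustration, and the 4 KiB page size is an assumption carried over from the calculation above; adjust it if your nodes use a different page size.

```shell
#!/bin/sh
# Registerable memory in GiB for given mlx4_core parameters:
#   2^log_num_mtt * 2^log_mtts_per_seg * page size
# reg_gib is a hypothetical helper; page size defaults to 4 KiB (assumption).
reg_gib() {
    log_num_mtt=$1
    log_mtts_per_seg=$2
    page_kib=${3:-4}
    # Pages times KiB per page, then converted from KiB to GiB.
    echo $(( (1 << (log_num_mtt + log_mtts_per_seg)) * page_kib / 1024 / 1024 ))
}

reg_gib 20 4   # current setting, prints 64
reg_gib 23 4   # suggested setting, prints 512
```

With 192 GiB of RAM per node, only the second setting exceeds the 384 GiB target.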
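You should also submit a simple Torque job script that runs ulimit -l, to check the locked-memory limit as seen by jobs and make sure it is unlimited. Note that ulimit -c unlimited does not remove the limit on the amount of locked memory; it removes the limit on the size of core dump files.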
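A minimal job script for that check might look like the sketch below. The job name and the nodes=1:ppn=1 resource request are placeholders, and memlock_limit is a hypothetical helper; adapt them to your cluster.

```shell
#!/bin/sh
#PBS -N memlock-check
#PBS -l nodes=1:ppn=1
#PBS -j oe

# Print the locked-memory limit seen by processes started under pbs_mom.
# The value is in KiB, or "unlimited" when no limit applies.
memlock_limit() {
    ulimit -l
}

echo "host: $(uname -n)"
echo "max locked memory: $(memlock_limit)"
```

Submit it with qsub and look at the job's output file. If it reports a number rather than unlimited, the limit set in /etc/security/limits.conf is not reaching processes started by pbs_mom.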