[ModelArts Series] Training a YOLOv3 Model in a Huawei ModelArts Notebook (Development Environment)

2023-11-07

1. References

2. Overview

This walkthrough runs a ModelZoo model in a ModelArts notebook, using YOLOv3 as the example and COCO2014 as the training set.

Runtime environment: ModelArts notebook
Model: ModelZoo, YOLOv3
Dataset: COCO2014
Image: tensorflow1.15-mindspore1.5.1-cann5.0.2-euler2.8-aarch64
Flavor: Ascend: 1*Ascend-910(32GB) | ARM: 24 cores, 96GB

Before deleting a notebook, back up its files to OBS so that no work is lost.

3. Key Steps

3.1 Prepare the Dataset

Related: copying files from Huawei OBS to a notebook and from a notebook back to OBS

Download the COCO2014 dataset from:

Link: https://pan.baidu.com/s/16sxIpFs-hd-6FzN2rSHgqA
Extraction code: 1234
Upload the dataset to OBS

Use the obs-browser client to upload the COCO2014 dataset to OBS.
Copy the dataset from OBS to the notebook

Run the following in the notebook:

import moxing as mox

# COCO_COCO_2014_Train_Val_annotations.zip
mox.file.copy_parallel('obs://liulingjun-demo/datasets/COCO2014/COCO_COCO_2014_Train_Val_annotations.zip','/home/ma-user/work/COCO_COCO_2014_Train_Val_annotations.zip')
print('Copy procedure is completed !')

# COCO_COCO_2014_Val_Images.zip
mox.file.copy_parallel('obs://liulingjun-demo/datasets/COCO2014/COCO_COCO_2014_Val_Images.zip','/home/ma-user/work/COCO_COCO_2014_Val_Images.zip')
print('Copy procedure is completed !')

# COCO_COCO_2014_Train_Images.zip
mox.file.copy_parallel('obs://liulingjun-demo/datasets/COCO2014/COCO_COCO_2014_Train_Images.zip','/home/ma-user/work/COCO_COCO_2014_Train_Images.zip')
print('Copy procedure is completed !')
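The three copy calls above differ only in the file name, so they can also be written as one loop. A minimal sketch, assuming the same bucket path and archive names as above; `build_copy_plan` and `run_copy` are hypothetical helper names, and inside ModelArts the copy function passed in is `mox.file.copy_parallel`, the same call used above (outside ModelArts the script only prints a message, since moxing is not installed there):

```python
"""Copy the COCO2014 archives from OBS to the notebook in one loop (sketch)."""
from typing import Callable, List, Tuple

OBS_PREFIX = "obs://liulingjun-demo/datasets/COCO2014"  # bucket path used above
LOCAL_DIR = "/home/ma-user/work"
ARCHIVES = [
    "COCO_COCO_2014_Train_Val_annotations.zip",
    "COCO_COCO_2014_Val_Images.zip",
    "COCO_COCO_2014_Train_Images.zip",
]


def build_copy_plan(obs_prefix: str, local_dir: str,
                    names: List[str]) -> List[Tuple[str, str]]:
    """Pair each OBS object with its destination path in the notebook."""
    obs_prefix = obs_prefix.rstrip("/")
    local_dir = local_dir.rstrip("/")
    return [(f"{obs_prefix}/{n}", f"{local_dir}/{n}") for n in names]


def run_copy(plan: List[Tuple[str, str]],
             copy_fn: Callable[[str, str], None]) -> None:
    """Call copy_fn(src, dst) for every pair and report progress."""
    for src, dst in plan:
        copy_fn(src, dst)
        print(f"Copied {src} -> {dst}")


if __name__ == "__main__":
    try:
        import moxing as mox  # only available inside ModelArts images
    except ImportError:
        print("moxing is not installed; run this inside a ModelArts notebook.")
    else:
        run_copy(build_copy_plan(OBS_PREFIX, LOCAL_DIR, ARCHIVES),
                 mox.file.copy_parallel)
```

Passing the copy function in as a parameter keeps the path logic testable on a machine without moxing.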

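Unzip the dataset

Run the following in a notebook terminal:

cd /home/ma-user/work

unzip COCO_COCO_2014_Train_Val_annotations.zip -d ./COCO2014
unzip COCO_COCO_2014_Val_Images.zip -d ./COCO2014
unzip COCO_COCO_2014_Train_Images.zip -d ./COCO2014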

[ma-user COCO2014]$ll
total 6780
drwx------ 2 ma-user ma-group    4096 Jun 18 08:43 annotations
drwxrwxr-x 2 ma-user ma-group 4620288 Aug 16  2014 train2014
drwxrwxr-x 2 ma-user ma-group 2310144 Aug 16  2014 val2014
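If you prefer to stay in Python rather than the terminal, the standard-library `zipfile` module can do the same extraction, followed by a quick check of the resulting layout. This is a sketch that assumes the archives sit in /home/ma-user/work and unpack into the three directories shown in the listing; `extract_all` and `missing_dirs` are hypothetical names:

```python
"""Extract the COCO2014 archives and check the directory layout (sketch)."""
import os
import zipfile
from typing import Iterable, List

EXPECTED_DIRS = ("annotations", "train2014", "val2014")


def extract_all(archives: Iterable[str], dest: str) -> None:
    """Unzip each archive into dest (like `unzip X.zip -d dest`)."""
    os.makedirs(dest, exist_ok=True)
    for path in archives:
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest)


def missing_dirs(dest: str, expected: Iterable[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected sub-directories that are not present under dest."""
    return [d for d in expected if not os.path.isdir(os.path.join(dest, d))]


if __name__ == "__main__":
    work = "/home/ma-user/work"
    names = [
        "COCO_COCO_2014_Train_Val_annotations.zip",
        "COCO_COCO_2014_Val_Images.zip",
        "COCO_COCO_2014_Train_Images.zip",
    ]
    archives = [os.path.join(work, n) for n in names]
    target = os.path.join(work, "COCO2014")
    if all(os.path.exists(p) for p in archives):
        extract_all(archives, target)
    print("missing directories:", missing_dirs(target))
```

An empty list from `missing_dirs` means the layout matches the `ll` output above.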

(Optional) Copy the extracted dataset back to OBS

import moxing as mox
mox.file.copy_parallel('/home/ma-user/work/COCO2014', 'obs://liulingjun-demo/yolov3/dataset')
print('Copy procedure is completed !')

3.2 Prepare the Pretrained Model

Copy the model files under YOLOv3_TensorFlow_1.6_model/single/ckpt to YoloV3_for_TensorFlow_1.6_code/data/darknet_weights and rename them to darknet53.ckpt, the path that restore_path in args_single.py points to.
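A TF1 checkpoint is not a single file: it is a group of files sharing one prefix (a `.index` file, one or more `.data-XXXXX-of-XXXXX` shards, and usually a `.meta` graph file), and restoring uses that prefix. Renaming to darknet53.ckpt therefore means renaming the prefix on every file, for example to darknet53.ckpt.index. A minimal sketch; the source prefix `model.ckpt-XXXX` is a placeholder for whatever name actually appears in the ckpt folder, and `copy_checkpoint` is a hypothetical helper:

```python
"""Copy a TF1 checkpoint under the new prefix darknet53.ckpt (sketch)."""
import os
import shutil
from typing import List

SRC_DIR = "YOLOv3_TensorFlow_1.6_model/single/ckpt"
DST_DIR = "YoloV3_for_TensorFlow_1.6_code/data/darknet_weights"
SRC_PREFIX = "model.ckpt-XXXX"  # hypothetical: use the prefix found in SRC_DIR


def copy_checkpoint(src_dir: str, src_prefix: str, dst_dir: str,
                    dst_prefix: str = "darknet53.ckpt") -> List[str]:
    """Copy every <src_prefix>.* file to <dst_dir>/<dst_prefix>.* and return the new names."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        if not name.startswith(src_prefix + "."):
            continue  # skips the `checkpoint` state file and unrelated files
        new_name = dst_prefix + name[len(src_prefix):]  # keep .index / .data-* / .meta
        shutil.copy2(os.path.join(src_dir, name), os.path.join(dst_dir, new_name))
        copied.append(new_name)
    if not copied:
        raise FileNotFoundError(f"no files with prefix {src_prefix!r} in {src_dir}")
    return copied


if __name__ == "__main__":
    if os.path.isdir(SRC_DIR):
        print(copy_checkpoint(SRC_DIR, SRC_PREFIX, DST_DIR))
```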

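3.3 Prepare the Source Code

Download the source code to your local machine (laptop).
Unzip and modify the source code.
Prepare the txt annotation files

Based on the actual location of the COCO2014 dataset, use coco_trainval_anns.py and coco_minival_anns.py to generate the training and validation annotation files coco2014_trainval.txt and coco2014_minival.txt, and place them in the YoloV3_for_TensorFlow_1.6_code/data directory.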
```
# 1. Edit the dataset paths in both scripts

# 2. Run coco_trainval_anns.py
python coco_trainval_anns.py

# 3. Run coco_minival_anns.py
python coco_minival_anns.py
```
Fix the paths in the txt annotation files

Replace /opt/npu/dataset/coco/coco2014/ with /home/ma-user/work/COCO2014/.
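The replacement can be scripted. This sketch assumes each line of the two txt files contains the absolute image path, so a plain string substitution of the prefix is enough; `rewrite_paths` is a hypothetical helper that edits a file in place and reports how many lines changed:

```python
"""Rewrite the image-path prefix in the annotation txt files (sketch)."""
import os

OLD_ROOT = "/opt/npu/dataset/coco/coco2014/"
NEW_ROOT = "/home/ma-user/work/COCO2014/"


def rewrite_paths(txt_path: str, old: str = OLD_ROOT, new: str = NEW_ROOT) -> int:
    """Replace `old` with `new` in txt_path in place; return the number of changed lines."""
    with open(txt_path, encoding="utf-8") as f:
        lines = f.readlines()
    changed = 0
    for i, line in enumerate(lines):
        if old in line:
            lines[i] = line.replace(old, new)
            changed += 1
    with open(txt_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
    return changed


if __name__ == "__main__":
    for name in ("coco2014_trainval.txt", "coco2014_minival.txt"):
        path = os.path.join("YoloV3_for_TensorFlow_1.6_code", "data", name)
        if os.path.exists(path):
            print(name, "lines updated:", rewrite_paths(path))
```

A changed-line count of 0 usually means the files were already generated with the notebook path.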

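Modify the training configuration (train.py)

According to the train.py source, the default training mode is single, which loads its parameters from args_single.py, so only args_single.py needs to be edited.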

train.py

parser.add_argument("--mode", type=str, default='single', help="setting train mode of training.")

if args_input.mode == 'single':
    import args_single as args

args_single.py

Most parameters can be left at their defaults.

### Some paths
train_file =        os.path.join(work_path, './data/coco2014_trainval.txt')  # The path of the training txt file.
val_file =          os.path.join(work_path, './data/coco2014_minival.txt')  # The path of the validation txt file.
restore_path =      os.path.join(work_path, './data/darknet_weights/darknet53.ckpt')  # The path of the weights to restore.
anchor_path =       os.path.join(work_path, './data/yolo_anchors.txt')  # The path of the anchor txt file.
class_name_path =   os.path.join(work_path, './data/coco.names')  # The path of the class names.

...
...
...

### other training strategies
multi_scale_train = False  # Whether to apply multi-scale training strategy. Image size varies from [320, 320] to [640, 640] by default.
use_label_smooth = False # Whether to use class label smoothing strategy.
use_focal_loss = False  # Whether to apply focal loss on the conf loss.
use_mix_up = False  # Whether to use mix up data augmentation strategy.
use_warm_up = True  # whether to use warm up strategy to prevent from gradient exploding.
warm_up_epoch = min(total_epoches*0.1, 3)  # Warm up training epoches. Set to a larger value if gradient explodes.
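For reference, `warm_up_epoch` takes 10% of `total_epoches` and caps it at 3 epochs. The sketch below evaluates that expression and pairs it with a linear warm-up ramp; the ramp is a common scheme shown only for illustration and is an assumption, not necessarily the schedule this repository implements. `warm_up_epochs` and `linear_warmup_lr` are hypothetical names:

```python
"""Evaluate warm_up_epoch and illustrate a linear warm-up ramp (sketch)."""


def warm_up_epochs(total_epoches: int) -> float:
    """Same expression as args_single.py: 10% of training, capped at 3 epochs."""
    return min(total_epoches * 0.1, 3)


def linear_warmup_lr(step: int, warmup_steps: int, base_lr: float) -> float:
    """Illustrative ramp from 0 to base_lr; returns base_lr once warm-up is over."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr  # the main learning-rate schedule takes over from here
    return base_lr * step / warmup_steps


if __name__ == "__main__":
    for epochs in (10, 20, 100):
        print(f"total_epoches={epochs} -> warm_up_epoch={warm_up_epochs(epochs)}")
```

With total_epoches=10 the expression gives 1.0 warm-up epoch; once total_epoches reaches 30 the value is capped at 3.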

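Compress the source code and upload it to OBS

Compress the modified source code into a zip file and upload it to OBS.

Copy and unzip the source code

Copy the source code from OBS to the notebook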
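One way to build the archive is `shutil.make_archive`. In this sketch `root_dir` and `base_dir` are set so the zip contains the top-level YoloV3_for_TensorFlow_1.6_code/ folder, which is what the later `unzip` in /home/ma-user/work expects; `zip_source` is a hypothetical helper:

```python
"""Zip the modified source tree so it unpacks into its own folder (sketch)."""
import os
import shutil

SOURCE_FOLDER = "YoloV3_for_TensorFlow_1.6_code"


def zip_source(parent_dir: str, folder: str = SOURCE_FOLDER) -> str:
    """Create <parent_dir>/<folder>.zip whose entries start with <folder>/; return its path."""
    base_name = os.path.abspath(os.path.join(parent_dir, folder))
    # root_dir is where archiving starts; base_dir is the folder stored inside the zip.
    return shutil.make_archive(base_name, "zip", root_dir=parent_dir, base_dir=folder)


if __name__ == "__main__":
    if os.path.isdir(SOURCE_FOLDER):
        print("created", zip_source("."))
```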

import moxing as mox

# YoloV3_for_TensorFlow_1.6_code.zip
mox.file.copy_parallel('obs://liulingjun-demo/cache/YoloV3_for_TensorFlow_1.6_code.zip','/home/ma-user/work/YoloV3_for_TensorFlow_1.6_code.zip')
print('Copy procedure is completed !')

Unzip the source code

cd /home/ma-user/work
unzip YoloV3_for_TensorFlow_1.6_code.zip

3.4 Train the Model

Create a notebook in /home/ma-user/work/YoloV3_for_TensorFlow_1.6_code and run the following command to start training:

!python train.py
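`!python train.py` keeps the notebook cell busy for the whole run. As an alternative, this sketch starts the script with `subprocess.Popen` and appends its output to a log file so the cell returns immediately. The log name train.log and the helper `launch` are hypothetical; `sys.executable` is the interpreter of the current notebook kernel, and `-u` makes Python write output unbuffered so the log updates promptly:

```python
"""Start train.py in the background with output going to a log file (sketch)."""
import os
import subprocess
import sys
from typing import List, Optional

WORK_DIR = "/home/ma-user/work/YoloV3_for_TensorFlow_1.6_code"


def launch(cmd: List[str], log_path: str,
           cwd: Optional[str] = None) -> subprocess.Popen:
    """Run cmd with stdout and stderr appended to log_path; return the process."""
    with open(log_path, "a", encoding="utf-8") as log:
        # The child gets its own copy of the file descriptor, so closing the
        # parent's handle when this block ends does not stop the logging.
        return subprocess.Popen(cmd, cwd=cwd, stdout=log, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    if os.path.isdir(WORK_DIR):
        proc = launch([sys.executable, "-u", "train.py"],
                      os.path.join(WORK_DIR, "train.log"), cwd=WORK_DIR)
        print("training started, PID", proc.pid)
```

Progress can then be followed with `tail -f train.log` in a terminal.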

Training output is written to /home/ma-user/work/YoloV3_for_TensorFlow_1.6_code/training/.
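To locate the most recent checkpoint during or after training, the sketch below scans the output folder for `.index` files. It assumes training/ holds TF1 checkpoints; inside TensorFlow, `tf.train.latest_checkpoint(dir)` serves the same purpose by reading the `checkpoint` state file. `newest_checkpoint` is a hypothetical helper:

```python
"""Find the newest checkpoint prefix under the training output folder (sketch)."""
import glob
import os
from typing import Optional

TRAIN_DIR = "/home/ma-user/work/YoloV3_for_TensorFlow_1.6_code/training"


def newest_checkpoint(train_dir: str) -> Optional[str]:
    """Return the prefix of the most recently modified *.index file, or None."""
    pattern = os.path.join(train_dir, "**", "*.index")
    index_files = glob.glob(pattern, recursive=True)
    if not index_files:
        return None
    newest = max(index_files, key=os.path.getmtime)
    return newest[: -len(".index")]  # strip the suffix to get the checkpoint prefix


if __name__ == "__main__":
    print(newest_checkpoint(TRAIN_DIR))
```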

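3.5 Successful Run

The log below was captured from a run whose code lived in /home/ma-user/work/yolov3-tensorflow/code, so its paths differ from the directory name used above.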

Thu, 28 Jul 2022 09:52:13 INFO shuffle seed_0 args.
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/tensorflow_core/python/data/util/random_seed.py:58: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Thu, 28 Jul 2022 09:52:13 WARNING Entity <function <lambda> at 0xffff6711b560> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Str'
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:139: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.
    
Thu, 28 Jul 2022 09:52:13 WARNING Entity <function valid_shape at 0xffff6711b950> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: module 'gast' has no attribute 'Index'
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:173: The name tf.data.Iterator is deprecated. Please use tf.compat.v1.data.Iterator instead.

Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:173: DatasetV1.output_types (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(dataset)`.
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:173: DatasetV1.output_shapes (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(dataset)`.
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_types(iterator)`.
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_shapes(iterator)`.
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.data.get_output_classes(iterator)`.
Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:187: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

Thu, 28 Jul 2022 09:52:13 WARNING From /home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
Thu, 28 Jul 2022 09:52:18 WARNING From /home/ma-user/work/yolov3-tensorflow/code/utils/layer_utils.py:114: The name tf.image.resize_nearest_neighbor is deprecated. Please use tf.compat.v1.image.resize_nearest_neighbor instead.

Thu, 28 Jul 2022 09:52:19 WARNING From /home/ma-user/work/yolov3-tensorflow/code/model.py:336: The name tf.log is deprecated. Please use tf.math.log instead.

Thu, 28 Jul 2022 09:52:20 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:190: The name tf.losses.get_regularization_loss is deprecated. Please use tf.compat.v1.losses.get_regularization_loss instead.

Thu, 28 Jul 2022 09:52:20 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:193: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

Thu, 28 Jul 2022 09:52:21 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:197: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

Thu, 28 Jul 2022 09:52:21 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:230: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

Thu, 28 Jul 2022 09:52:21 INFO total_steps: 200000
Thu, 28 Jul 2022 09:52:21 INFO warmup_steps: 3000
Thu, 28 Jul 2022 09:52:21 WARNING From /home/ma-user/work/yolov3-tensorflow/code/utils/misc_utils.py:184: The name tf.train.MomentumOptimizer is deprecated. Please use tf.compat.v1.train.MomentumOptimizer instead.

Thu, 28 Jul 2022 09:52:21 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:247: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Thu, 28 Jul 2022 09:52:21 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:247: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

Thu, 28 Jul 2022 09:52:21 DEBUG compute_gradients...
Thu, 28 Jul 2022 09:52:31 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:253: The name tf.train.get_global_step is deprecated. Please use tf.compat.v1.train.get_global_step instead.

Thu, 28 Jul 2022 09:52:31 WARNING From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/npu_loss_scale_optimizer.py:159: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Thu, 28 Jul 2022 09:52:37 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:262: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

Thu, 28 Jul 2022 09:52:37 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:295: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

Thu, 28 Jul 2022 09:52:37 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:297: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

Thu, 28 Jul 2022 09:52:37 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:297: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

Thu, 28 Jul 2022 09:53:11 INFO Restoring parameters from /home/ma-user/work/yolov3-tensorflow/code/./data/darknet_weights/darknet53.ckpt
Thu, 28 Jul 2022 09:53:18 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:306: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

Thu, 28 Jul 2022 09:53:18 WARNING From /home/ma-user/work/yolov3-tensorflow/code/train.py:307: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Thu, 28 Jul 2022 09:54:14 WARNING From /home/ma-user/anaconda3/envs/TensorFlow-1.15.0/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Thu, 28 Jul 2022 09:54:14 WARNING From /usr/local/Ascend/tfplugin/latest/tfplugin/python/site-packages/npu_bridge/estimator/npu/util.py:206: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
Thu, 28 Jul 2022 09:58:08 INFO Epoch: 0, global_step: 9 fps: 0.75 lr: 0.000007 | loss: total: 6.61, xy: 0.31, wh: 0.97, conf: 2.96, class: 2.38 | 
Thu, 28 Jul 2022 09:58:09 INFO Epoch: 0, global_step: 19 fps: 168.07 lr: 0.000015 | loss: total: 14.65, xy: 0.72, wh: 2.12, conf: 8.35, class: 3.46 | 
...
...
...
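The INFO lines that report `global_step` follow a fixed layout, so the loss values and learning rate can be extracted for monitoring. A sketch with a regular expression written against the two sample step lines above; the keys of the returned dict are hypothetical labels, and a different log format would need a different pattern:

```python
"""Extract step, learning rate and losses from the training log (sketch)."""
import re
from typing import Dict, Iterable, List, Optional

NUM = r"([-+\w.]+)"  # matches values such as 0.75, 0.000007, 1e-05 or nan
STEP_RE = re.compile(
    r"Epoch: (\d+), global_step: (\d+) "
    rf"fps: {NUM} lr: {NUM} \| "
    rf"loss: total: {NUM}, xy: {NUM}, wh: {NUM}, conf: {NUM}, class: {NUM}"
)
FIELDS = ("epoch", "global_step", "fps", "lr", "total", "xy", "wh", "conf", "class")


def parse_step(line: str) -> Optional[Dict[str, float]]:
    """Return the numeric fields of one step line, or None for any other line."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    record = {key: float(value) for key, value in zip(FIELDS, match.groups())}
    record["epoch"] = int(record["epoch"])
    record["global_step"] = int(record["global_step"])
    return record


def parse_log(lines: Iterable[str]) -> List[Dict[str, float]]:
    """Keep only the lines that describe a training step."""
    return [r for r in map(parse_step, lines) if r is not None]


if __name__ == "__main__":
    sample = ("Thu, 28 Jul 2022 09:58:08 INFO Epoch: 0, global_step: 9 fps: 0.75 "
              "lr: 0.000007 | loss: total: 6.61, xy: 0.31, wh: 0.97, conf: 2.96, class: 2.38 | ")
    print(parse_step(sample))
```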

3.6 Resource Usage

CPU usage

1*Ascend 910, 24-core CPU, 96 GiB memory (CUE score 1943) (modelarts.kat1.xlarge)

8*Ascend 910, 192-core CPU, 720 GiB memory (CUE score 15544) (modelarts.kat1.8xlarge)

Memory usage
NPU usage

4. FAQ

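Q: Changing the notebook flavor fails with an "invalid" error

ModelArts.6405: RUNNING status not allowed update flavor, "modelarts.kat1.xlarge" is invalid

Downgrading from 8*Ascend910 to 1*Ascend910 failed. The error indicates that the flavor cannot be changed while the notebook is RUNNING, so stop the notebook first and then change the flavor.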

Q: Uploading a file larger than 100 MB to the notebook fails

Solution:
Upload the file to OBS first, then copy it from OBS into the notebook.

Q: Insufficient disk space

Solution:
Expand the storage capacity.
If expansion is not possible, create a new notebook with a larger storage capacity.
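Before copying or unzipping large archives, it helps to check the free space and see what is using it (for example, .zip archives that already have a copy in OBS). A minimal sketch with `shutil.disk_usage` and `os.walk`; the required size is something you estimate yourself, and `free_gib`, `has_room` and `largest_files` are hypothetical helpers:

```python
"""Check free disk space and list the largest files in the work directory (sketch)."""
import heapq
import os
import shutil
from typing import List, Tuple

WORK_DIR = "/home/ma-user/work"


def free_gib(path: str) -> float:
    """Free space on the file system that holds path, in GiB."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_room(path: str, needed_gib: float) -> bool:
    """True if at least needed_gib GiB are free at path."""
    return free_gib(path) >= needed_gib


def largest_files(root: str, n: int = 5) -> List[Tuple[int, str]]:
    """Return up to n (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # the file disappeared or cannot be read
    return heapq.nlargest(n, sizes)


if __name__ == "__main__":
    if os.path.isdir(WORK_DIR):
        print(f"free: {free_gib(WORK_DIR):.1f} GiB")
        for size, path in largest_files(WORK_DIR):
            print(f"{size / 1024 ** 3:8.2f} GiB  {path}")
```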
