The link below collects all of my personal notes on YOLO V3. If you spot any mistakes, please point them out and I will correct them right away. If you are interested, add me on WeChat (17575010159) to discuss the technology.
Object Detection 0-00: YOLO V3 Series Index (the complete table of contents)
1. Source Directory Overview
tensorflow-yolov3-master-bk
├── checkpoint
├── convert_weight.py
├── core
│ ├── backbone.py
│ ├── common.py
│ ├── config.py
│ ├── dataset.py
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── backbone.cpython-36.pyc
│ │ ├── common.cpython-36.pyc
│ │ ├── config.cpython-36.pyc
│ │ ├── dataset.cpython-36.pyc
│ │ ├── __init__.cpython-36.pyc
│ │ ├── utils.cpython-36.pyc
│ │ └── yolov3.cpython-36.pyc
│ ├── utils.py
│ └── yolov3.py
├── data
│ ├── anchors
│ │ ├── basline_anchors.txt
│ │ └── coco_anchors.txt
│ ├── classes
│ │ ├── coco.names
│ │ └── voc.names
│ ├── dataset
│ │ ├── voc_test.txt
│ │ └── voc_train.txt
│ └── log
│ └── events.out.tfevents.1564706916.WIN-RCRPPSUQJFP
├── docs
│ ├── Box-Clustering.ipynb
│ ├── images
│ │ ├── 611_result.jpg
│ │ ├── darknet53.png
│ │ ├── iou.png
│ │ ├── K-means.png
│ │ ├── levio.jpeg
│ │ ├── probability_extraction.png
│ │ ├── road.jpeg
│ │ ├── road.mp4
│ │ └── yolov3.png
│ └── requirements.txt
├── evaluate.py
├── freeze_graph.py
├── image_demo.py
├── LICENSE
├── LICENSE.fuck
├── mAP
│ ├── extra
│ │ ├── class_list.txt
│ │ ├── convert_gt_xml.py
│ │ ├── convert_gt_yolo.py
│ │ ├── convert_keras-yolo3.py
│ │ ├── convert_pred_darkflow_json.py
│ │ ├── convert_pred_yolo.py
│ │ ├── find_class.py
│ │ ├── intersect-gt-and-pred.py
│ │ ├── README.md
│ │ ├── remove_class.py
│ │ ├── remove_delimiter_char.py
│ │ ├── remove_space.py
│ │ ├── rename_class.py
│ │ └── result.txt
│ ├── __init__.py
│ └── main.py
├── README.md
├── scripts
│ ├── show_bboxes.py
│ └── voc_annotation.py
├── train.py
└── video_demo.py
The entries above without comments will be filled in later. Everything below is based on the VOC dataset.
2. YOLO V3 Input
When we first pick up a network, the quickest way to understand it is to pin down its inputs and outputs. Following the original author's instructions and running scripts/voc_annotation.py, we obtain the voc_train and voc_test files under data/dataset, with the following format:
../VOC/train/VOCdevkit/VOC2007\JPEGImages\000005.jpg 263,211,324,339,8 165,264,253,372,8 241,194,295,299,8
First comes the image path; after it, every group of five numbers describes one box: the coordinates of its top-left and bottom-right corners followed by the index of its class. With this information we can start training, i.e. run train.py.
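As a quick sanity check, here is a minimal sketch that parses one such line (parse_line is a hypothetical helper, and the sample path is normalized to forward slashes for illustration):

# Minimal sketch: parse one annotation line of the form
#   <image_path> xmin,ymin,xmax,ymax,class_id [xmin,ymin,xmax,ymax,class_id ...]
def parse_line(line):
    fields = line.strip().split()
    image_path = fields[0]
    # every remaining field holds 5 comma-separated integers describing one box
    boxes = [list(map(int, f.split(','))) for f in fields[1:]]
    return image_path, boxes

path, boxes = parse_line(
    "../VOC/train/VOCdevkit/VOC2007/JPEGImages/000005.jpg 263,211,324,339,8 165,264,253,372,8")
print(path)   # ../VOC/train/VOCdevkit/VOC2007/JPEGImages/000005.jpg
print(boxes)  # [[263, 211, 324, 339, 8], [165, 264, 253, 372, 8]]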
Before going further we need a rough picture of YOLO V3's structure. Unlike YOLO v1 and v2, it borrows the image-pyramid idea and downsamples the input image at three strides: 8, 16 and 32. This implies the input size must be a multiple of 32, otherwise the 32x downsampling is impossible. If YOLO v1 and v2 are unfamiliar, the following article is a good refresher:
The evolution from YOLOv1 to YOLOv3, with a detailed explanation of each algorithm
In short, the idea is to divide an image into many grid cells. Take YOLO v1 and suppose we have the image below:
This image is (448, 448) and is divided into a (7, 7) grid, so each grid cell covers a (64, 64) field of view. YOLO v1 predicts two boxes for every cell. Each box carries its coordinates (4 numbers) and an objectness confidence (1 number), and each cell additionally predicts one probability per class (20 numbers for VOC; note that in YOLO v1 the class probabilities belong to the cell, not to each box). Describing a whole image therefore takes a (7, 7, (4+1)*2+20) = (7, 7, 30) tensor, as shown below:
That was YOLO v1. YOLO v3 works on the same principle but applies the pyramid idea at three scales. For a (416, 416) input, the three downsamplings (8, 16, 32) produce feature maps of scale [(52, 52), (26, 26), (13, 13)]. In other words, the same image is described in three ways: in the 8x-downsampled map, each grid cell corresponds to an 8-pixel receptive field in the original image; in the 16x map, 16 pixels; and in the 32x map, 32 pixels.
By the same reasoning, with three ways of partitioning the image, YOLO v3 makes three predictions for every grid cell at every scale. Each prediction, just as in YOLO v1, consists of the box coordinates (4 numbers), an objectness confidence (1 number) and, this time attached to each box, the per-class probabilities. The three scales therefore require the shapes listed below (verified by the short sketch after the list):
Scale 1 (8x downsampling): [52,52,3*((4+1)+20)] = [52,52,75]
Scale 2 (16x downsampling): [26,26,3*((4+1)+20)] = [26,26,75]
Scale 3 (32x downsampling): [13,13,3*((4+1)+20)] = [13,13,75]
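To make the arithmetic concrete, a small sketch (plain NumPy, not repo code) deriving the three output shapes from the input size and strides:

import numpy as np

input_size = 416
strides = np.array([8, 16, 32])   # the three downsampling factors
num_classes = 20                  # VOC
anchors_per_scale = 3

grid_sizes = input_size // strides                     # [52, 26, 13]
channels = anchors_per_scale * (4 + 1 + num_classes)   # 75
for g in grid_sizes:
    print((g, g, channels))   # (52, 52, 75), (26, 26, 75), (13, 13, 75)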
That covers the essentials. For a more detailed treatment, these two articles are worth reading (with thanks to their authors):
Object detection: the YoloV1 paper and its tensorflow implementation
This one article is all you need to understand YOLO v1
Now let's see what the network takes as input. In train.py we find the following:
self.input_data   = tf.placeholder(dtype=tf.float32, name='input_data')   # input image batch
self.label_sbbox  = tf.placeholder(dtype=tf.float32, name='label_sbbox')  # labels, small-object scale (stride 8)
self.label_mbbox  = tf.placeholder(dtype=tf.float32, name='label_mbbox')  # labels, medium-object scale (stride 16)
self.label_lbbox  = tf.placeholder(dtype=tf.float32, name='label_lbbox')  # labels, large-object scale (stride 32)
self.true_sbboxes = tf.placeholder(dtype=tf.float32, name='sbboxes')      # raw xywh boxes per scale
self.true_mbboxes = tf.placeholder(dtype=tf.float32, name='mbboxes')
self.true_lbboxes = tf.placeholder(dtype=tf.float32, name='lbboxes')
self.trainable    = tf.placeholder(dtype=tf.bool,    name='training')     # train / inference switch
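For orientation, here is a rough sketch of how these placeholders get fed during a training step. The names model and train_op are assumptions for illustration; the real loop lives in train.py:

# Hypothetical training step (sketch): every batch produced by core/dataset.py
# is fed into the placeholders above. `model` and `train_op` are assumed to
# exist and are not defined in this snippet.
trainset = Dataset('train')
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch in trainset:
        (image, l_sbbox, l_mbbox, l_lbbox, sboxes, mboxes, lboxes) = batch
        sess.run(train_op, feed_dict={
            model.input_data:   image,
            model.label_sbbox:  l_sbbox,
            model.label_mbbox:  l_mbbox,
            model.label_lbbox:  l_lbbox,
            model.true_sbboxes: sboxes,
            model.true_mbboxes: mboxes,
            model.true_lbboxes: lboxes,
            model.trainable:    True,
        })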
By now you are probably quite curious about the anchors (cfg.YOLO.ANCHORS in the config). What exactly are they? If you can't wait, have a look at:
Setting the anchor parameters of the YOLO-v3 model
An anchor is simply a prior box. What is a prior box? In short, we do not want the trained network to predict boxes of completely arbitrary shapes, so we hand it a set of reference boxes to learn against: the boxes it predicts should stay close to these priors. Each prior specifies only a width and a height. This will be explained in detail later; keep reading and it will fall into place.
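To see what these priors look like, here is a small sketch that loads data/anchors/basline_anchors.txt from the directory tree above. The file format, one line of nine comma-separated (width, height) pairs, is an assumption based on how core/utils.py consumes it:

import numpy as np

# Assumed format: one line of 18 comma-separated numbers,
# i.e. nine (width, height) pairs, three anchors for each of the three scales.
with open("./data/anchors/basline_anchors.txt") as f:
    anchors = np.array(f.readline().split(','), dtype=np.float32).reshape(3, 3, 2)

print(anchors.shape)  # (3, 3, 2): [scale, anchor, (w, h)]
print(anchors[0])     # the three (w, h) priors for the 8x-downsampled scale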
These placeholders are just slots in the graph, so where does their data actually come from? Starting from pbar = tqdm(self.trainset) in train.py and tracing downward, we find that all the preprocessing happens in core/dataset.py. The annotated code follows:
import os
import cv2
import random
import numpy as np
import tensorflow as tf
import core.utils as utils
from core.config import cfg

# note: newer NumPy versions reject threshold=np.nan; use sys.maxsize there instead
np.set_printoptions(suppress=True, threshold=np.nan)
class Dataset(object):
    """implement Dataset here"""

    def __init__(self, dataset_type):
        self.annot_path = cfg.TRAIN.ANNOT_PATH if dataset_type == 'train' else cfg.TEST.ANNOT_PATH
        self.input_sizes = cfg.TRAIN.INPUT_SIZE if dataset_type == 'train' else cfg.TEST.INPUT_SIZE
        self.batch_size = cfg.TRAIN.BATCH_SIZE if dataset_type == 'train' else cfg.TEST.BATCH_SIZE
        self.data_aug = cfg.TRAIN.DATA_AUG if dataset_type == 'train' else cfg.TEST.DATA_AUG

        self.train_input_sizes = cfg.TRAIN.INPUT_SIZE      # candidate sizes for multi-scale training
        self.strides = np.array(cfg.YOLO.STRIDES)          # the three downsampling strides, [8, 16, 32]
        self.classes = utils.read_class_names(cfg.YOLO.CLASSES)
        self.num_classes = len(self.classes)
        self.anchors = np.array(utils.get_anchors(cfg.YOLO.ANCHORS))
        self.anchor_per_scale = cfg.YOLO.ANCHOR_PER_SCALE  # 3 anchors per scale
        self.max_bbox_per_scale = 150

        self.annotations = self.load_annotations(dataset_type)
        self.num_samples = len(self.annotations)
        self.num_batchs = int(np.ceil(self.num_samples / self.batch_size))
        self.batch_count = 0
    def load_annotations(self, dataset_type):
        """
        Read every annotation line from the train/test txt file (self.annot_path),
        drop lines that contain no boxes, and shuffle the result.
        """
        with open(self.annot_path, 'r') as f:
            txt = f.readlines()
            annotations = [line.strip() for line in txt if len(line.strip().split()[1:]) != 0]
        np.random.shuffle(annotations)
        return annotations
    def __iter__(self):
        return self

    def __next__(self):
        """
        The core of the data pipeline: builds one batch of images together
        with the matching labels for all three scales.
        """
        with tf.device('/cpu:0'):
            # multi-scale training: pick a random input size for this batch
            self.train_input_size = random.choice(self.train_input_sizes)
            self.train_output_sizes = self.train_input_size // self.strides

            batch_image = np.zeros((self.batch_size, self.train_input_size, self.train_input_size, 3))

            batch_label_sbbox = np.zeros((self.batch_size, self.train_output_sizes[0], self.train_output_sizes[0],
                                          self.anchor_per_scale, 5 + self.num_classes))
            batch_label_mbbox = np.zeros((self.batch_size, self.train_output_sizes[1], self.train_output_sizes[1],
                                          self.anchor_per_scale, 5 + self.num_classes))
            batch_label_lbbox = np.zeros((self.batch_size, self.train_output_sizes[2], self.train_output_sizes[2],
                                          self.anchor_per_scale, 5 + self.num_classes))

            batch_sbboxes = np.zeros((self.batch_size, self.max_bbox_per_scale, 4))
            batch_mbboxes = np.zeros((self.batch_size, self.max_bbox_per_scale, 4))
            batch_lbboxes = np.zeros((self.batch_size, self.max_bbox_per_scale, 4))

            num = 0
            if self.batch_count < self.num_batchs:
                while num < self.batch_size:
                    index = self.batch_count * self.batch_size + num
                    if index >= self.num_samples: index -= self.num_samples
                    annotation = self.annotations[index]
                    image, bboxes = self.parse_annotation(annotation)
                    label_sbbox, label_mbbox, label_lbbox, sbboxes, mbboxes, lbboxes = self.preprocess_true_boxes(bboxes)
                    # debugging aid left in by the author:
                    """
                    print(label_sbbox.shape)
                    print(label_mbbox.shape)
                    print(label_lbbox.shape)
                    print(sbboxes.shape)
                    print(mbboxes.shape)
                    print(lbboxes.shape)
                    print(label_sbbox)
                    print(label_mbbox)
                    print(label_lbbox)
                    #print(sbboxes)
                    #print(mbboxes)
                    #print(lbboxes)
                    print(annotation)
                    """
                    batch_image[num, :, :, :] = image
                    batch_label_sbbox[num, :, :, :, :] = label_sbbox
                    batch_label_mbbox[num, :, :, :, :] = label_mbbox
                    batch_label_lbbox[num, :, :, :, :] = label_lbbox
                    batch_sbboxes[num, :, :] = sbboxes
                    batch_mbboxes[num, :, :] = mbboxes
                    batch_lbboxes[num, :, :] = lbboxes
                    num += 1
                self.batch_count += 1
                return batch_image, batch_label_sbbox, batch_label_mbbox, batch_label_lbbox, \
                       batch_sbboxes, batch_mbboxes, batch_lbboxes
            else:
                # epoch finished: reset, reshuffle, and signal the end of iteration
                self.batch_count = 0
                np.random.shuffle(self.annotations)
                raise StopIteration
    def random_horizontal_flip(self, image, bboxes):
        if random.random() < 0.5:
            _, w, _ = image.shape
            image = image[:, ::-1, :]
            bboxes[:, [0, 2]] = w - bboxes[:, [2, 0]]
        return image, bboxes
    def random_crop(self, image, bboxes):
        if random.random() < 0.5:
            h, w, _ = image.shape
            # smallest rectangle that encloses every ground-truth box
            max_bbox = np.concatenate([np.min(bboxes[:, 0:2], axis=0), np.max(bboxes[:, 2:4], axis=0)], axis=-1)

            max_l_trans = max_bbox[0]
            max_u_trans = max_bbox[1]
            max_r_trans = w - max_bbox[2]
            max_d_trans = h - max_bbox[3]

            crop_xmin = max(0, int(max_bbox[0] - random.uniform(0, max_l_trans)))
            crop_ymin = max(0, int(max_bbox[1] - random.uniform(0, max_u_trans)))
            # the original uses max() here, which can never crop the right/bottom
            # edges; min() keeps the crop window inside the image as intended
            crop_xmax = min(w, int(max_bbox[2] + random.uniform(0, max_r_trans)))
            crop_ymax = min(h, int(max_bbox[3] + random.uniform(0, max_d_trans)))

            image = image[crop_ymin : crop_ymax, crop_xmin : crop_xmax]

            bboxes[:, [0, 2]] = bboxes[:, [0, 2]] - crop_xmin
            bboxes[:, [1, 3]] = bboxes[:, [1, 3]] - crop_ymin
        return image, bboxes
    def random_translate(self, image, bboxes):
        if random.random() < 0.5:
            h, w, _ = image.shape
            max_bbox = np.concatenate([np.min(bboxes[:, 0:2], axis=0), np.max(bboxes[:, 2:4], axis=0)], axis=-1)

            max_l_trans = max_bbox[0]
            max_u_trans = max_bbox[1]
            max_r_trans = w - max_bbox[2]
            max_d_trans = h - max_bbox[3]

            tx = random.uniform(-(max_l_trans - 1), (max_r_trans - 1))
            ty = random.uniform(-(max_u_trans - 1), (max_d_trans - 1))

            M = np.array([[1, 0, tx], [0, 1, ty]])
            image = cv2.warpAffine(image, M, (w, h))

            bboxes[:, [0, 2]] = bboxes[:, [0, 2]] + tx
            bboxes[:, [1, 3]] = bboxes[:, [1, 3]] + ty
        return image, bboxes
    def parse_annotation(self, annotation):
        line = annotation.split()
        image_path = line[0]
        if not os.path.exists(image_path):
            raise KeyError("%s does not exist ... " % image_path)
        image = np.array(cv2.imread(image_path))
        # every remaining field is "xmin,ymin,xmax,ymax,class_id"
        bboxes = np.array([list(map(int, box.split(','))) for box in line[1:]])

        if self.data_aug:
            image, bboxes = self.random_horizontal_flip(np.copy(image), np.copy(bboxes))
            image, bboxes = self.random_crop(np.copy(image), np.copy(bboxes))
            image, bboxes = self.random_translate(np.copy(image), np.copy(bboxes))

        # letterbox-resize to train_input_size and rescale the boxes accordingly
        image, bboxes = utils.image_preporcess(np.copy(image), [self.train_input_size, self.train_input_size], np.copy(bboxes))
        return image, bboxes
    def bbox_iou(self, boxes1, boxes2):
        """IoU for boxes given in (cx, cy, w, h) form; broadcasts over leading dims."""
        boxes1 = np.array(boxes1)
        boxes2 = np.array(boxes2)

        boxes1_area = boxes1[..., 2] * boxes1[..., 3]
        boxes2_area = boxes2[..., 2] * boxes2[..., 3]

        # convert (cx, cy, w, h) -> (xmin, ymin, xmax, ymax)
        boxes1 = np.concatenate([boxes1[..., :2] - boxes1[..., 2:] * 0.5,
                                 boxes1[..., :2] + boxes1[..., 2:] * 0.5], axis=-1)
        boxes2 = np.concatenate([boxes2[..., :2] - boxes2[..., 2:] * 0.5,
                                 boxes2[..., :2] + boxes2[..., 2:] * 0.5], axis=-1)

        left_up = np.maximum(boxes1[..., :2], boxes2[..., :2])
        right_down = np.minimum(boxes1[..., 2:], boxes2[..., 2:])

        inter_section = np.maximum(right_down - left_up, 0.0)
        inter_area = inter_section[..., 0] * inter_section[..., 1]
        union_area = boxes1_area + boxes2_area - inter_area
        return inter_area / union_area
    def preprocess_true_boxes(self, bboxes):
        # one label tensor per scale: [grid, grid, anchors, 4 coords + 1 conf + classes]
        label = [np.zeros((self.train_output_sizes[i], self.train_output_sizes[i], self.anchor_per_scale,
                           5 + self.num_classes)) for i in range(3)]
        bboxes_xywh = [np.zeros((self.max_bbox_per_scale, 4)) for _ in range(3)]
        bbox_count = np.zeros((3,))

        for bbox in bboxes:
            bbox_coor = bbox[:4]
            bbox_class_ind = bbox[4]

            # label smoothing: mix the one-hot vector with a uniform distribution
            onehot = np.zeros(self.num_classes, dtype=float)
            onehot[bbox_class_ind] = 1.0
            uniform_distribution = np.full(self.num_classes, 1.0 / self.num_classes)
            deta = 0.01
            smooth_onehot = onehot * (1 - deta) + deta * uniform_distribution

            # (xmin, ymin, xmax, ymax) -> (cx, cy, w, h), then scale to each grid
            bbox_xywh = np.concatenate([(bbox_coor[2:] + bbox_coor[:2]) * 0.5, bbox_coor[2:] - bbox_coor[:2]], axis=-1)
            bbox_xywh_scaled = 1.0 * bbox_xywh[np.newaxis, :] / self.strides[:, np.newaxis]

            iou = []
            exist_positive = False
            for i in range(3):
                # place the three anchors of scale i at the cell containing the box center
                anchors_xywh = np.zeros((self.anchor_per_scale, 4))
                anchors_xywh[:, 0:2] = np.floor(bbox_xywh_scaled[i, 0:2]).astype(np.int32) + 0.5
                anchors_xywh[:, 2:4] = self.anchors[i]

                iou_scale = self.bbox_iou(bbox_xywh_scaled[i][np.newaxis, :], anchors_xywh)
                iou.append(iou_scale)
                iou_mask = iou_scale > 0.3

                if np.any(iou_mask):
                    xind, yind = np.floor(bbox_xywh_scaled[i, 0:2]).astype(np.int32)

                    label[i][yind, xind, iou_mask, :] = 0
                    label[i][yind, xind, iou_mask, 0:4] = bbox_xywh
                    label[i][yind, xind, iou_mask, 4:5] = 1.0
                    label[i][yind, xind, iou_mask, 5:] = smooth_onehot

                    bbox_ind = int(bbox_count[i] % self.max_bbox_per_scale)
                    bboxes_xywh[i][bbox_ind, :4] = bbox_xywh
                    bbox_count[i] += 1

                    exist_positive = True

            # no anchor at any scale cleared the IoU threshold:
            # fall back to the single best-matching anchor overall
            if not exist_positive:
                best_anchor_ind = np.argmax(np.array(iou).reshape(-1), axis=-1)
                best_detect = int(best_anchor_ind / self.anchor_per_scale)
                best_anchor = int(best_anchor_ind % self.anchor_per_scale)
                xind, yind = np.floor(bbox_xywh_scaled[best_detect, 0:2]).astype(np.int32)

                label[best_detect][yind, xind, best_anchor, :] = 0
                label[best_detect][yind, xind, best_anchor, 0:4] = bbox_xywh
                label[best_detect][yind, xind, best_anchor, 4:5] = 1.0
                label[best_detect][yind, xind, best_anchor, 5:] = smooth_onehot

                bbox_ind = int(bbox_count[best_detect] % self.max_bbox_per_scale)
                bboxes_xywh[best_detect][bbox_ind, :4] = bbox_xywh
                bbox_count[best_detect] += 1

        label_sbbox, label_mbbox, label_lbbox = label
        sbboxes, mbboxes, lbboxes = bboxes_xywh
        return label_sbbox, label_mbbox, label_lbbox, sbboxes, mbboxes, lbboxes
    def __len__(self):
        return self.num_batchs  # fixed: the attribute defined in __init__ is num_batchs
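With the class in hand, a quick way to check the shapes it produces (a sketch; it assumes the config files are set up and the image paths in voc_train.txt are valid):

from core.dataset import Dataset

trainset = Dataset('train')
batch = next(iter(trainset))
(image, label_sbbox, label_mbbox, label_lbbox,
 sbboxes, mbboxes, lbboxes) = batch

print(image.shape)        # e.g. (batch_size, 416, 416, 3); the input size is random per batch
print(label_sbbox.shape)  # (batch_size, 52, 52, 3, 25) for a 416 input
print(label_lbbox.shape)  # (batch_size, 13, 13, 3, 25)
print(sbboxes.shape)      # (batch_size, 150, 4)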
From the annotated source we can now pin down exactly what label_sbbox, label_mbbox, label_lbbox, sbboxes, mbboxes, lbboxes are (shapes given for a 416 input):
label_sbbox, label_mbbox, label_lbbox: the per-scale training labels, of shape [52, 52, 3, 25], [26, 26, 3, 25] and [13, 13, 3, 25] respectively (grid, grid, 3 anchors, 4 box coordinates + 1 confidence + 20 classes).
sbboxes, mbboxes, lbboxes: the raw (cx, cy, w, h) boxes assigned to each scale, each of shape [150, 4] (max_bbox_per_scale = 150).
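To tie this back to preprocess_true_boxes: the cell that "owns" a ground-truth box at each scale is simply the box center divided by that scale's stride. A standalone sketch of the same arithmetic (using raw VOC coordinates for illustration; in the real pipeline the boxes have already been rescaled to the network input size by utils.image_preporcess):

import numpy as np

strides = np.array([8, 16, 32])
# one VOC box from the annotation line earlier: xmin, ymin, xmax, ymax
box = np.array([263, 211, 324, 339], dtype=np.float32)

# (cx, cy, w, h), exactly like bbox_xywh in preprocess_true_boxes
xywh = np.concatenate([(box[2:] + box[:2]) * 0.5, box[2:] - box[:2]])

for i, s in enumerate(strides):
    cx, cy = xywh[:2] / s
    xind, yind = int(np.floor(cx)), int(np.floor(cy))
    print("scale %d (stride %2d): cell (x=%d, y=%d)" % (i, s, xind, yind))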
Now that we know the network's input, let's analyze its output.
3. YOLO V3 Output
To understand the output we naturally need to look at the network structure; here is a good article on the topic:
An analysis of the YOLO v3 network structure
From the analysis so far, given a [416, 416] input the network emits three feature tensors (for the 20-class VOC setting):
[52,52,3*((4+1)+20)] = [52,52,75]
[26,26,3*((4+1)+20)] = [26,26,75]
[13,13,3*((4+1)+20)] = [13,13,75]
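One useful consequence of these shapes: the number of candidate boxes per image is fixed. A quick check:

# 3 anchors per cell across all three grids
total = sum(g * g * 3 for g in (52, 26, 13))
print(total)  # 10647 candidate boxes per 416x416 image, before score filtering and NMS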
In the next section we will walk through the YOLO V3 loss function; the heart of any network lies in its loss.