Abstract
Intersection over Union (IoU) is the most popular evaluation metric used in object detection benchmarks. However, there is a gap between optimizing the commonly used distance losses for regressing the parameters of a bounding box and maximizing this metric value. The optimal objective for a metric is the metric itself. In the case of axis-aligned 2D bounding boxes, it can be shown that IoU can be directly used as a regression loss. However, IoU has a plateau, making it infeasible to optimize in the case of non-overlapping bounding boxes. In this paper, we address the weaknesses of IoU by introducing a generalized version as both a new loss and a new metric. By incorporating this generalized IoU (GIoU) as a loss into state-of-the-art object detection frameworks, we show a consistent improvement in their performance using both the standard IoU-based and the new GIoU-based performance measures on popular object detection benchmarks such as PASCAL VOC and MS COCO.
Introduction
Bounding box regression is one of the most fundamental components in many 2D/3D computer vision tasks. Tasks such as object localization, multiple object detection, object tracking and instance-level segmentation rely on accurate bounding box regression. The dominant trend for improving the performance of applications utilizing deep neural networks is to propose either a better architecture backbone [15, 13] or a better strategy to extract reliable local features [6]. However, one opportunity for improvement that is widely ignored is the replacement of surrogate regression losses, such as the $\ell_1$ and $\ell_2$ norms, with a metric loss calculated based on Intersection over Union (IoU).
IoU, also known as the Jaccard index, is the most commonly used metric for comparing the similarity between two arbitrary shapes. IoU encodes the shape properties of the objects under comparison, e.g. the widths, heights and locations of two bounding boxes, into the region property and then calculates a normalized measure that focuses on their areas (or volumes). This property makes IoU invariant to the scale of the problem under consideration. Due to this appealing property, all performance measures used to evaluate segmentation [2, 1, 25, 14], object detection [14, 4], and tracking [11, 10] rely on this metric.
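The scale-invariance property can be checked directly. Below is a minimal sketch (the helper name `iou` and the sample coordinates are ours, chosen for illustration) that computes IoU for two axis-aligned boxes given as $(x_1, y_1, x_2, y_2)$ corners and verifies that scaling both boxes by the same factor leaves the value unchanged:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

a, b = (0, 0, 2, 2), (1, 1, 3, 3)
print(iou(a, b))  # 1/7 ≈ 0.1429: intersection 1, union 4 + 4 - 1 = 7

# Scale invariance: multiplying every coordinate by 10 leaves IoU unchanged.
a10 = tuple(10 * v for v in a)
b10 = tuple(10 * v for v in b)
print(iou(a10, b10))  # same value
```

An $\ell_2$ distance between the boxes, by contrast, grows tenfold under the same scaling, which is exactly the mismatch the next paragraph discusses.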
However, it can be shown that there is not a strong correlation between minimizing the commonly used losses, e.g. $\ell_n$-norms, defined on a parametric representation of two bounding boxes in 2D/3D, and improving their IoU values. For example, consider the simple 2D scenario in Fig. 1 (a), where the predicted bounding box (black rectangle) and the ground truth box (green rectangle) are represented by their top-left and bottom-right corners, i.e. $(x_1, y_1, x_2, y_2)$. For simplicity, let's assume that the distance, e.g. the $\ell_2$-norm, between one of the corners of the two boxes is fixed. Therefore any predicted bounding box whose second corner lies on a circle with a fixed radius centered on the second corner of the green rectangle (shown by a gray dashed circle) will have exactly the same $\ell_2$-norm distance from the ground truth box; however, their IoU values can be significantly different (Fig. 1 (a)). The same argument can be extended to any other representation and loss, e.g. Fig. 1 (b). It is intuitive that a good local optimum for these types of objectives may not necessarily be a local optimum for IoU. Moreover, in contrast to IoU, $\ell_n$-norm objectives defined on the aforementioned parametric representations are not invariant to the scale of the problem. Consequently, several pairs of bounding boxes with the same level of overlap, but different scales due to e.g. perspective, will have different objective values. In addition, some representations may suffer from a lack of regularization between the different types of parameters used for the representation. For example, in the center-and-size representation, $(x_c, y_c)$ is defined on the location space while $(w, h)$ belongs to the size space. Complexity increases as more parameters are incorporated, e.g. rotation, or when adding more dimensions to the problem.
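The corner-on-a-circle argument above can be checked numerically. The sketch below (the helper name `iou` and the specific coordinates are ours, chosen for illustration) builds three predictions that share the ground truth's top-left corner and whose bottom-right corners all sit at the same $\ell_2$ distance from the ground truth's bottom-right corner, yet whose IoU values clearly differ:

```python
import math

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (0, 0, 4, 4)  # ground-truth box
# Each prediction's bottom-right corner lies at l2-distance 1 from the
# ground truth's bottom-right corner (4, 4), i.e. on the "dashed circle".
s = 1 / math.sqrt(2)
preds = [(0, 0, 5, 4), (0, 0, 3, 4), (0, 0, 4 + s, 4 + s)]
for p in preds:
    dist = math.hypot(p[2] - gt[2], p[3] - gt[3])
    print(f"corner distance {dist:.3f}  IoU {iou(gt, p):.3f}")
# All three corner distances are 1.0, but IoU is 0.800, 0.750 and ~0.722.
```

An $\ell_2$ loss on the corner representation treats these three predictions as equally good, while IoU does not.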
To alleviate some of the aforementioned problems, state-of-the-art object detectors introduce the concept of an anchor box [22] as a hypothetically good initial guess. They also define a non-linear representation [19, 5] to naively compensate for the scale changes. Even with these handcrafted changes, there is still a gap between optimizing the regression losses and IoU values.
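The generalized IoU mentioned in the abstract closes this gap by subtracting from IoU the fraction of the smallest enclosing box that is covered by neither input box. A minimal sketch, following that definition and assuming axis-aligned corner-coordinate boxes (coordinates chosen by us for illustration), shows that GIoU keeps changing on the plateau where IoU is identically zero:

```python
def giou(box_a, box_b):
    """Generalized IoU: IoU minus the fraction of the smallest enclosing
    box C that is covered by neither input box."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Smallest axis-aligned box C enclosing both inputs.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c - union) / c

a = (0, 0, 1, 1)
# Both predictions are disjoint from a, so IoU = 0 for each (the plateau),
# but GIoU still distinguishes the nearer box from the farther one.
print(giou(a, (2, 0, 3, 1)))  # -1/3: enclosing box area 3, union 2
print(giou(a, (4, 0, 5, 1)))  # -0.6: enclosing box area 5, union 2
```

Because GIoU decreases monotonically as disjoint boxes move apart, $1 - \mathrm{GIoU}$ gives a usable gradient signal even where an IoU loss is flat.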