State-of-the-Art Deep Learning: An Introduction to Mask R-CNN

2023-11-19

Introduction

From my experience as a time traveller, I can confidently say that autonomous driving is/was/will be all the rage. Mathematically, the hype around computer vision grows exponentially as a function of the number of Planck time steps elapsed. Just kidding.

Anyways, in this post, we’ll dive into some of the more recent developments in computer vision with deep learning, and eventually build up to a model called “Mask R-CNN”. This post should be fairly intuitive, but I expect you to know some of the more basic models for computer vision. If you think you’re ready, let’s begin.

Object detection vs. Semantic segmentation vs. Instance segmentation

In this post, I’m assuming that you are comfortable with basic deep learning tasks and models specific to computer vision, such as convolutional neural networks (CNN), image classification etc. If these terms sound like jargon to you, go ahead and read this post.

Ok, now let’s move on to the fun stuff. Besides the traditional dog vs. cat classifier that most of us would have built on our deep learning journey, there is a whole lot more we can do with the very same idea of a neural network.

For example, instead of just telling us what’s in an image, we can train our CNN to tell us which part of the image convinced it to make that decision. To see why this is useful, be sure to check out this TED talk. This can be done by asking the CNN to draw a box around the object, as in the image below:

In deep learning language, this task is called object detection, and a basic version is quite easy to implement. First, when preparing our data, we need to draw bounding boxes around the objects in each image, which is easy to do with free online tools. Then, we change the final/output layer of the CNN to one with 4 + k outputs: the x-coordinate of the bounding box, the y-coordinate of the bounding box, the height of the bounding box, the width of the bounding box, and class scores for k classes. Only the k class scores are passed through a softmax to become probabilities; the 4 box values are regressed directly.
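
To make the 4 + k output concrete, here is a minimal sketch in plain Python. The function names and the example numbers are made up for illustration, and it assumes the box values are used as-is while only the class scores go through a softmax. It is not the code of any particular detector.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def split_detection_output(raw_output, num_classes):
    """Split a flat vector of 4 + k numbers into a box and k class probabilities."""
    if len(raw_output) != 4 + num_classes:
        raise ValueError("expected 4 + k outputs")
    x, y, height, width = raw_output[:4]    # box values are used as-is (regression)
    class_probs = softmax(raw_output[4:])   # only the k class scores go through softmax
    return (x, y, height, width), class_probs

# Hypothetical raw output for k = 2 classes (say, cat and dog).
box, probs = split_detection_output([50.0, 40.0, 20.0, 30.0, 2.0, 0.0], num_classes=2)
```

Here `box` holds the four box numbers and `probs` holds one probability per class, with the first class winning because its raw score is higher.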

The first thing you might ask is why we choose weird things to learn, like the x, y coordinates and the height and width. Can’t we just learn the (x, y) coordinates of each corner of the box? Well, we can. However, a box has 4 corners, so we would have to learn 4 pairs of variables, 8 numbers in total, to represent it. Since the box is an upright rectangle, those 8 numbers are redundant: one point plus a height and a width pins down all four corners, so we only need to learn 4.
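
As a quick sanity check that 4 numbers really are enough, the sketch below recovers all four corners from them. It assumes (x, y) is the centre of the box, which the post does not specify; if (x, y) were the top-left corner instead, only the arithmetic would change. The function name is hypothetical.

```python
def box_to_corners(x, y, height, width):
    """Recover the four corners of an upright box from its centre and size."""
    left, right = x - width / 2, x + width / 2
    top, bottom = y - height / 2, y + height / 2
    # Corners listed clockwise, starting from the top-left.
    return [(left, top), (right, top), (right, bottom), (left, bottom)]

corners = box_to_corners(x=50.0, y=40.0, height=20.0, width=30.0)
```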

Another interesting task we can solve is semantic segmentation. Again, this is just a fancy word for what is essentially colouring an image like in a children’s colouring book.

Similar to the case of object detection, a free tool can be used to colour an image manually, and the coloured image serves as the ground truth example when preparing a dataset. Here, our neural network is trained to map each pixel in the input image to a particular class. Crudely, this can be done by using something called a fully convolutional network (FCN). This network is just a series of convolutional layers, with no fully connected layers at the end.

So, the FCN would learn (through the mystic dark art of deep learning) the mapping from an input image to a “coloured in” version of it that highlights the different classes in the image.
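
The "colouring in" boils down to picking one class per pixel. The sketch below assumes the network has already produced a list of class scores for every pixel (the scores and the class list here are invented) and simply takes the highest-scoring class at each position.

```python
def segment(class_scores):
    """Map every pixel to the index of its highest-scoring class.

    class_scores[row][col] is a list holding one score per class.
    """
    return [
        [max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
        for row in class_scores
    ]

CLASSES = ["background", "cow"]  # hypothetical class list

# A tiny 2x3 "image" with two class scores per pixel.
scores = [
    [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]],
    [[0.8, 0.2], [0.4, 0.6], [0.9, 0.1]],
]
label_map = segment(scores)
```

Each entry of `label_map` is an index into `CLASSES`, so it has the same height and width as the input and can be rendered by giving each class its own colour.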

An important thing to note is that semantic segmentation does not highlight individual instances of a class differently. For example, if there were 3 cows in an image, the model would highlight the area they occupy, but it would not be able to distinguish one cow from another. If we want to add this functionality, we need to extend the task and introduce another term to complicate the already enormously large vocabulary of deep learning: instance segmentation.

Ok, it wasn’t really all that bad, was it? The term is pretty much self-explanatory. Our goal is to segment or separate each “instance” of a class in an image. This should help you visualize what we are trying to achieve:

The model we use to solve this problem is much simpler than you might think. Instance segmentation can essentially be solved in 2 steps:

1. Perform a version of object detection to draw bounding boxes around each instance of a class
2. Perform semantic segmentation on each of the bounding boxes
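
The two steps above can be sketched in a few lines of plain Python. Everything here is a stand-in: the boxes are assumed to come from a detector that has already run, they are written as (top, left, bottom, right) purely to make slicing easy, and step 2 is faked with a simple threshold on made-up foreground scores instead of a trained network.

```python
def crop(image, box):
    """Cut the pixels inside box = (top, left, bottom, right) out of a 2-D grid."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def segment_foreground(patch, threshold=0.5):
    """Stand-in for step 2: label each pixel 1 (object) or 0 (background)."""
    return [[1 if value > threshold else 0 for value in row] for row in patch]

def instance_segmentation(image, boxes):
    """Step 1 is assumed done (boxes are given); run step 2 once per box."""
    return [segment_foreground(crop(image, box)) for box in boxes]

# A 4x6 "image" of foreground scores containing two separate blobs.
image = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.9, 0.7, 0.0, 0.0, 0.6, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
boxes = [(0, 0, 2, 3), (1, 3, 3, 6)]  # hypothetical detector output
masks = instance_segmentation(image, boxes)
```

The result is one small binary mask per box, which is what separates one instance from another even when both belong to the same class.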

This amazingly simple model actually performs extremely well. It works because, if we assume step 1 has high accuracy, then the semantic segmentation in step 2 is given a set of image crops that are each expected to contain only 1 instance of the main class. The job of the model in step 2 is just to take in a crop with exactly 1 main object and predict which pixels correspond to that object (cat/dog/human etc.) and which pixels correspond to the background.

Another interesting fact is that if we are able to solve the multi bounding box problem and semantic segmentation problem independently, we have also essentially solved the task of instance segmentation! The good news is that very powerful models have been built to solve both of these problems, and putting the 2 together is a relatively trivial task.

This particular model has a name: Mask R-CNN (R-CNN is short for “region-based convolutional neural network”). It was built by the Facebook AI Research team (FAIR), and the paper first appeared on arXiv in March 2017.

The working principle of Mask R-CNN is again quite simple. All they (the researchers) did was stitch 2 previously existing state-of-the-art models together and play around with the linear algebra (deep learning research in a nutshell). The model can be roughly divided into 2 parts: a region proposal network (RPN) and a binary mask classifier.

Step one is to get a set of bounding boxes that could possibly contain an object of relevance. This is the job of the region proposal network (RPN). The RPN works on the principles of object detection (discussed above, did you forget already?), but it outputs multiple possible bounding boxes rather than a single definite one. These boxes are refined using another regression model, which we will not discuss here. The fancy word of the day here is RoI Align: the layer that extracts a fixed-size feature map for each proposed box before it is handed to the next stage. More details on RoI Align can be found here.

The second stage is to actually do the colouring. Contrary to what one might think, this step is also quite easy! All you need to do is apply the existing state-of-the-art model for semantic segmentation to each bounding box. The cool part is that, since each box is assumed to contain at most 1 object of interest, we just need to train our semantic segmentation model like a binary classifier, meaning we only need to learn the mapping from each input pixel to a 1 or a 0. 1 represents the presence of the object, and 0 represents the background. Then, for added flair, we can colour each of the pixels that map to 1 to get funky-looking results like this:
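
Here is a rough sketch of that last bit: turning per-pixel scores inside a box into a 1/0 mask, then "colouring" the full image by writing a different instance id for each box. The raw scores, box positions, and function names are invented for illustration, and a sigmoid with a 0.5 cut-off is assumed as the binary decision rule.

```python
import math

def binary_mask(logits, threshold=0.5):
    """Squash each raw score with a sigmoid and threshold it to 1 or 0."""
    return [
        [1 if 1.0 / (1.0 + math.exp(-z)) > threshold else 0 for z in row]
        for row in logits
    ]

def paint(canvas, mask, top, left, instance_id):
    """Write instance_id into the canvas wherever the box-local mask is 1."""
    for r, row in enumerate(mask):
        for c, is_object in enumerate(row):
            if is_object:
                canvas[top + r][left + c] = instance_id
    return canvas

canvas = [[0] * 5 for _ in range(3)]  # 0 means background
mask_a = binary_mask([[2.0, -1.0], [3.0, 0.5]])   # raw scores for the first box
mask_b = binary_mask([[-2.0, 4.0], [1.5, 2.5]])   # raw scores for the second box
paint(canvas, mask_a, top=0, left=0, instance_id=1)
paint(canvas, mask_b, top=1, left=3, instance_id=2)
```

Mapping each instance id in `canvas` to its own colour gives the kind of overlay described above.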

Conclusion

The applications of this technology are far-reaching. Some of the more lucrative use-cases include motion capture, autonomous driving and surveillance systems. But we’ll leave all the applications of this technology in the minds of the reader.

For the most part, instance segmentation is now quite achievable, and it’s time to start thinking about innovative ways of using this idea of running computer vision algorithms at a pixel-by-pixel level. A good example is a cool new algorithm called DensePose. For some unknown reason, this model is not getting a lot of press. But its potential is greater than the gravitational pull of a black hole!

Put simply, think of DensePose as Kinect on the cheap. It can essentially do whatever an advanced motion capture system can do, for a fraction of a fraction of the cost. In theory, you could run this model on a $10 device like a Raspberry Pi!

From a theoretical perspective (aka the cooler one), this technology could be extended to other types of neural nets (other than CNNs and FCNs). The main idea here is that of taking the most elementary portion of some data (a pixel in this case), and deciding how it contributes to the overall structure.

Hypothetically, we could classify each individual sample of an audio signal and decide how it contributes to a piece of music. If we’re even more ambitious, we could identify which parts of a piece of music are most appealing and combine them with appealing parts from other songs, resulting in a novel way to fuse our favorite songs together!

On a more serious note, we could use the same technique for more important data. For example, we could train a Mask R-CNN model to highlight which exact areas of an MRI scan correlate to certain behavioral/psychological patterns, or which sub-sequences of DNA correspond to some particular traits, potentially resulting in breakthroughs in medical AI.

The Mask R-CNN model, at its core, is about breaking data into its most fundamental building blocks. As humans, we have inherent biases in the way we look at the world. AI, on the other hand, has the potential to look at the world in ways we humans couldn’t even comprehend, and as it was once said by a man who mastered the art of looking for the most fundamental truths:

Translated from: https://www.freecodecamp.org/news/mask-r-cnn-explained-7f82bec890e3/
