Learning ST-GCN (Part 1): Paper Analysis

2023-11-20

ST-GCN (AAAI 2018)

Authors: Sijie Yan, Yuanjun Xiong and Dahua Lin
Paper: Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Introduction

In this work, we propose a novel model of dynamic skeletons called Spatial Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.

So what are dynamic skeletons? And where do "automatically" and "spatial and temporal" show up in the model?

Multiple modalities of human action

Human action recognition has become an active research area in recent years, as it plays a significant role in video understanding. In general, human action can be recognized from multiple modalities(Simonyan and Zisserman 2014; Tran et al. 2015; Wang, Qiao, and Tang 2015; Wang et al. 2016; Zhao et al. 2017), such as appearance, depth, optical flows, and body skeletons (Du, Wang, and Wang 2015; Liu et al. 2016).

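For single-image algorithms, the ultimate goal is image understanding and reasoning; video algorithms likewise aim, in the end, at understanding the video.
Video modeling can be roughly divided into four categories by input modality: 2D images, depth images, optical flow, and human skeletons.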

The weakness of previous methods

The dynamic skeleton modality can be naturally represented by a time series of human joint locations, in the form of 2D or 3D coordinates. Human actions can then be recognized by analyzing the motion patterns thereof. Earlier methods of using skeletons for action recognition simply employ the joint coordinates at individual time steps to form feature vectors, and apply temporal analysis thereon (Wang et al. 2012; Fernando et al. 2015). The capability of these methods is limited as they do not explicitly exploit the spatial relationships among the joints, which are crucial for understanding human actions.

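Earlier methods simply stacked the joint coordinates of each frame into a vector and then learned over these vectors along the time axis. They did not take the connectivity between the joints of the human body into account.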

This work’s advantages

There are two types of edges, namely the spatial edges that conform to the natural connectivity of joints and the temporal edges that connect the same joints across consecutive time steps. Multiple layers of spatial temporal graph convolution are constructed thereon, which allow information to be integrated along both the spatial and the temporal dimension.
The hierarchical nature of ST-GCN eliminates the need of hand-crafted part assignment or traversal rules.

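How should we understand "hierarchical nature"? Put differently, it is the hierarchical nature that makes the "automatically learning" possible.
Read the other way round, the sentence implies that other methods need hand-crafted part assignment or traversal rules.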

This (hierarchical nature…) not only leads to greater expressive power and thus higher performance (as shown in our experiments), but also makes it easy to generalize to different contexts. Upon the generic GCN formulation, we also study new strategies to design graph convolution kernels, with inspirations from image models.

So the kernel design of the GCN here will differ from that of the GCN papers I read earlier.


Related work

The Introduction covers the human action recognition task fairly broadly; Related work narrows the focus to skeleton-based methods. Are both essentially survey-style sections?

Neural Networks on Graphs

The principle of constructing GCNs on graph generally follows two streams:

the spectral perspective, where the locality of the graph convolution is considered in the form of spectral analysis (Henaff, Bruna, and LeCun 2015; Duvenaud et al. 2015; Li et al. 2016; Kipf and Welling 2017);
the spatial perspective, where the convolution filters are applied directly on the graph nodes and their neighbors (Bruna et al. 2014; Niepert, Ahmed, and Kutzkov 2016). This work follows the spirit of the second stream. We construct the CNN filters on the spatial domain, by limiting the application of each filter to the 1-neighbor of each node.

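Earlier research on graph neural networks falls into two groups. The first takes the spectral (frequency-domain) perspective; see:
How to understand Graph Convolutional Network (GCN)?
GCN (Graph Convolutional Network) explained
The second takes the spatial perspective (I have not read these yet and will fill this in later).
This paper builds its convolution directly in the spatial domain.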

Skeleton Based Action Recognition

The recent success of deep learning has lead to the surge of deep learning based skeleton modeling methods. These works have been using recurrent neural networks (Shahroudy et al. 2016; Zhu et al. 2016; Liu et al. 2016; Zhang, Liu, and Xiao 2017) and temporal CNNs (Li et al. 2017; Ke et al. 2017; Kim and Reiter 2017) to learn action recognition models in an end-to-end manner. Among these approaches, many have emphasized the importance of modeling the joints within parts of human bodies. But these parts are usually explicitly assigned using domain knowledge.

Deep learning approaches, e.g. RNNs and temporal CNNs.

Spatial Temporal Graph ConvNet

When performing activities, human joints move in small local groups, known as “body parts”. Existing approaches for skeleton based action recognition have verified the effectiveness of introducing body parts in the modeling(Shahroudy et al. 2016; Liu et al. 2016; Zhang, Liu, and Xiao 2017).

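"Body parts" are clearly a higher-level concept than individual keypoints. Network architecture design matters, but it is worth noting that it often ends up introducing redundant design.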

We argue that the improvement is largely due to that parts restrict the modeling of joints trajectories within “local regions” compared with the whole skeleton, thus forming a hierarchical representation of the skeleton sequences.

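In other words, the authors argue that body parts improve the model mainly because they confine the modeling of joint trajectories to "local regions" rather than the whole skeleton, which yields a hierarchical representation of the skeleton sequence.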

In tasks such as image object recognition, the hierarchical representation and locality are usually achieved by the intrinsic properties of convolutional neural networks (Krizhevsky, Sutskever, and Hinton 2012), rather than manually assigning object parts. It motivates us to introduce the appealing property of CNNs to skeleton based action recognition. The result of this attempt is the ST-GCN model.

This passage points to the hierarchical representation and locality of CNNs.

Skeleton based data can be obtained from motion-capture devices or pose estimation algorithms from videos. Usually the data is a sequence of frames, each frame will have a set of joint coordinates. Given the sequences of body joints in the form of 2D or 3D coordinates, we construct a spatial temporal graph with the joints as graph nodes and natural connectivities in both human body structures and time as graph edges. The input to the ST-GCN is therefore the joint coordinate vectors on the graph nodes. This can be considered as an analog to image based CNNs where the input is formed by pixel intensity vectors residing on the 2D image grid. Multiple layers of spatial-temporal graph convolution operations will be applied on the input data and generating higher-level feature maps on the graph. It will then be classified by the standard SoftMax classifier to the corresponding action category. The whole model is trained in an end-to-end manner with backpropagation. We will now go over the components in the ST-GCN model.

The model is explained by analogy with CNNs.

Skeleton Graph Construction

A skeleton sequence is usually represented by 2D or 3D coordinates of each human joint in each frame. Previous work using convolution for skeleton action recognition (Kim and Reiter 2017) concatenates coordinate vectors of all joints to form a single feature vector per frame. In our work, we utilize the spatial temporal graph to form hierarchical representation of the skeleton sequences.
Particularly, we construct an undirected spatial temporal graph $G = (V, E)$ on a skeleton sequence with N joints and T frames featuring both intra-body and inter-frame connection. In this graph, the node set $V=\{v_{ti} \mid t=1,...,T,\ i=1,...,N\}$ includes all the joints in a skeleton sequence. As ST-GCN’s input, the feature vector on a node $F(v_{ti})$ consists of coordinate vectors, as well as estimation confidence. We construct the spatial temporal graph on the skeleton sequences in two steps. First, the joints within one frame are connected with edges according to the connectivity of human body structure, which is illustrated in Fig. 1. Then each joint will be connected to the same joint in the consecutive frame. The connections in this setup are thus naturally defined without the manual part assignment.

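"First" and "Then" are presumably the two steps mentioned. Does "in this setup" cover both steps, or only the second one? From the later text it seems to cover both (although the code appears to include only the second step?).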

This also enables the network architecture to work on datasets with different number of joints or joint connectivities. For example, on the Kinetics dataset, we use the 2D pose estimation results from the OpenPose (Cao et al. 2017b) toolbox which outputs 18 joints, while on the NTURGB+D dataset (Shahroudy et al. 2016) we use 3D joint tracking results as input, which produces 25 joints. The STGCN can operate in both situations and provide consistent superior performance. An example of the constructed spatial temporal graph is illustrated in Fig. 1.

ST-GCN works with either skeleton source: OpenPose output on Kinetics (18 joints) or the NTU-RGB+D joint tracking results (25 joints).

Formally, the edge set E is composed of two subsets, the first subset depicts the intra-skeleton connection at each frame, denoted as $E_s=\{v_{ti}v_{tj} \mid (i,j) \in H\}$, where H is the set of naturally connected human body joints. The second subset contains the inter-frame edges, which connect the same joints in consecutive frames as $E_F=\{v_{ti}v_{(t+1)i}\}$. Therefore all edges in $E_F$ for one particular joint $i$ will represent its trajectory over time.

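This defines the edge set E of the graph: edges between different joints within the same frame, and edges between the same joint across consecutive frames.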
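To make the two edge subsets concrete, here is a minimal Python sketch that enumerates them for a toy skeleton. The function name `build_st_edges` and the 5-joint bone list are illustrative assumptions, not taken from the paper or its code; a real dataset would supply its own 18-joint or 25-joint connectivity H. Frames and joints are 0-indexed here.

```python
def build_st_edges(num_joints, num_frames, bones):
    """Enumerate the spatial edges E_s and temporal edges E_F over nodes (t, i)."""
    for i, j in bones:
        if not (0 <= i < num_joints and 0 <= j < num_joints):
            raise ValueError(f"bone ({i}, {j}) refers to a joint outside 0..{num_joints - 1}")
    # E_s: the same bone list H is repeated inside every frame t.
    spatial = [((t, i), (t, j)) for t in range(num_frames) for (i, j) in bones]
    # E_F: every joint is linked to itself in the next frame.
    temporal = [((t, i), (t + 1, i)) for t in range(num_frames - 1) for i in range(num_joints)]
    return spatial, temporal


# Hypothetical toy skeleton: joint 0 is the torso, joints 1..4 hang off it.
TOY_BONES = [(0, 1), (0, 2), (0, 3), (0, 4)]
spatial, temporal = build_st_edges(num_joints=5, num_frames=3, bones=TOY_BONES)
print(len(spatial), len(temporal))  # 12 10
```

Following the temporal edges of one joint, e.g. `((0, 2), (1, 2))` then `((1, 2), (2, 2))`, traces that joint's trajectory over time.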

Spatial Graph Convolutional Neural Network

This is the core of the paper: how did the authors design ST-GCN?

Before we dive into the full-fledged ST-GCN, we first look at the graph CNN model within one single frame. In this case, on a single frame at time $\tau$, there will be N joint nodes $V_t$, along with the skeleton edges $E_s(\tau)=\{v_{ti}v_{tj} \mid t=\tau,\ (i,j)\in H\}$.
Recall the definition of convolution operation on the 2D natural images or feature maps, which can be both treated as 2D grids. The output feature map of a convolution operation is again a 2D grid. With stride 1 and appropriate padding, the output feature maps can have the same size as the input feature maps. We will assume this condition in the following discussion. Given a convolution operator with the kernel size of K×K, and an input feature map $f_{in}$ with the number of channels c, the output value for a single channel at the spatial location x can be written as
$$f_{out}(x)=\sum_{h=1}^K\sum_{w=1}^K f_{in}(p(x,h,w))\cdot w(h,w) \tag{1}$$
Here p is the sampling function and w is the weight function.

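What does this formula express? Considering a single output channel, given an input feature map and a location x, the neighborhood centered on x (including all of its channels) is combined once with the convolution kernel to produce one output value.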
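The split of a convolution into a sampling function and a weight function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `sample` and `conv_at` are made up here, indices are 0-based (the formula counts h and w from 1), and positions outside the image are treated as zero padding.

```python
def sample(x, h, w, K):
    """Sampling function p(x, h, w): the pixel at kernel offset (h, w) in the K x K window centred on x."""
    row, col = x
    return row + h - K // 2, col + w - K // 2


def conv_at(f_in, weight, x):
    """Output value for one output channel at location x.

    f_in[r][c] is a list of C channel values; weight[h][w] is a list of C weights.
    """
    K = len(weight)
    height, width = len(f_in), len(f_in[0])
    total = 0.0
    for h in range(K):
        for w in range(K):
            r, c = sample(x, h, w, K)
            if 0 <= r < height and 0 <= c < width:  # zero padding outside the image
                total += sum(a * b for a, b in zip(f_in[r][c], weight[h][w]))
    return total


# 3 x 3 single-channel image of ones and an all-ones 3 x 3 kernel.
IMAGE = [[[1.0] for _ in range(3)] for _ in range(3)]
KERNEL = [[[1.0] for _ in range(3)] for _ in range(3)]
print(conv_at(IMAGE, KERNEL, (1, 1)))  # 9.0, all nine pixels are inside the image
print(conv_at(IMAGE, KERNEL, (0, 0)))  # 4.0, only four pixels are inside the image
```

On a graph there is no such fixed window, which is why both `sample` and the indexing of `weight` have to be redefined.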

The convolution operation on graphs is then defined by extending the formulation above to the cases where the input features map resides on a spatial graph. That is, the feature map has a vector on each node of the graph. The next step of the extension is to redefine the sampling function p and the weight function w.

So the next step is to redefine the sampling function p and the weight function w.

Sampling function

On images, the sampling function $p(h,w)$ is defined on the neighboring pixels with respect to the center location x. On graphs, we can similarly define the sampling function on the neighbor set $B(v_{ti})=\{v_{tj} \mid d(v_{tj},v_{ti})\leq D\}$ of a node $v_{ti}$. Here $d(v_{tj},v_{ti})$ denotes the minimum length of any path from $v_{tj}$ to $v_{ti}$. Thus the sampling function $p: B(v_{ti})\rightarrow V$ can be written as
$$p(v_{ti},v_{tj})=v_{tj} \tag{2}$$
In this work we use D = 1 for all cases, that is, the 1-neighbor set of joint nodes. The higher number of D is left for future works.

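In this paper the sampling function therefore only picks the root node and the nodes directly adjacent to it in the graph as the neighbor set.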
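As a hedged illustration of the neighbor set $B(v_{ti})$, the sketch below computes hop distances with a breadth-first search and keeps the joints within D hops. The helper names `hop_distances` and `neighbor_set` and the 4-joint chain are assumptions made for this example; the paper only uses D = 1.

```python
from collections import deque


def hop_distances(adjacency, root):
    """Minimum path length d(v_j, v_root) for every reachable joint, via BFS."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def neighbor_set(adjacency, root, D=1):
    """B(root): all joints within D hops of root, including root itself (d = 0)."""
    dist = hop_distances(adjacency, root)
    return sorted(j for j, d in dist.items() if d <= D)


# Hypothetical chain skeleton within one frame: 0 - 1 - 2 - 3
CHAIN = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(neighbor_set(CHAIN, 1))       # [0, 1, 2]
print(neighbor_set(CHAIN, 1, D=2))  # [0, 1, 2, 3]
```

Since the sampling function of Eq. 2 is the identity on this set, sampling amounts to reading the features of exactly these joints.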

Weight function

Compared with the sampling function, the weight function is trickier to define. In 2D convolution, a rigid grid naturally exists around the center location. So pixels within the neighbor can have a fixed spatial order. The weight function can then be implemented by indexing a tensor of (c, K, K) dimensions according to the spatial order. For general graphs like the one we just constructed, there is no such implicit arrangement. The solution to this problem is first investigated in Niepert, Ahmed, and Kutzkov 2016, where the order is defined by a graph labeling process in the neighbor graph around the root node. We follow this idea to construct our weight function.

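Niepert, Ahmed, and Kutzkov 2016 deserves a careful read.
It is the source of the idea behind this paper's weight function, which is built by labeling the nodes around a root node; for the details, read the original paper.
We can also see that, in the authors' view, the main obstacle to graph convolution is how to give the nodes an organization comparable to the 2D grid of an image.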

Instead of giving every neighbor node a unique labeling, we simplify the process by partitioning the neighbor set $B(v_{ti})$ of a joint node $v_{ti}$ into a fixed number of K subsets, where each subset has a numeric label. Thus we can have a mapping $l_{ti}:B(v_{ti})\rightarrow \{0,1,...,K-1\}$ which maps a node in the neighborhood to its subset label. The weight function $w(v_{ti},v_{tj}):B(v_{ti})\rightarrow R^c$ can be implemented by indexing a tensor of (c, K) dimension or
$$w(v_{ti},v_{tj})=w'(l_{ti}(v_{tj})) \tag{3}$$
We will discuss several partitioning strategies after.

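The neighbor nodes are partitioned into a fixed number K of subsets, and each subset gets a numeric label. The map $l_{ti}$ assigns each neighbor of node $i$ to the subset with the corresponding label.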

Spatial Graph Convolution

With the refined sampling function and weight function, we now rewrite Eq. 1 in terms of graph convolution as
$$f_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})}\frac{1}{Z_{ti}(v_{tj})}f_{in}(p(v_{ti},v_{tj}))\cdot w(v_{ti},v_{tj}) \tag{4}$$
where the normalizing term $Z_{ti}(v_{tj}) = \lvert \{ v_{tk} \mid l_{ti}(v_{tk}) = l_{ti}(v_{tj}) \} \rvert$ equals the cardinality of the corresponding subset. This term is added to balance the contributions of different subsets to the output. Substituting Eq. 2 and Eq. 3 into Eq. 4, we arrive at
$$f_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})}\frac{1}{Z_{ti}(v_{tj})}f_{in}(v_{tj})\cdot w(l_{ti}(v_{tj})) \tag{5}$$
It is worth noting this formulation can resemble the standard 2D convolution if we treat an image as a regular 2D grid. For example, to resemble a 3 × 3 convolution operation, we have a neighbor of 9 pixels in the 3 × 3 grid centered on a pixel. The neighbor set should then be partitioned into 9 subsets, each having one pixel.

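Z is the size of the subset, i.e. its cardinality in set-theory terms.
Why divide by Z? The paper's stated reason is to balance the contributions of the different subsets to the output. My own guess is that it also helps against vanishing gradients during training as the convolution layers get deeper.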
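Eq. 5 can be written out directly for a single root node and a single output channel. The following is a minimal sketch, not the authors' implementation: `graph_conv_at` and its argument layout are assumptions for illustration, and the labels are passed in as a ready-made dictionary standing in for the map $l_{ti}$.

```python
from collections import Counter


def graph_conv_at(features, neighbors, labels, weights):
    """Eq. 5 for one root node and one output channel.

    features:  dict node -> list of C input channel values
    neighbors: nodes in B(root), root included
    labels:    dict node -> subset label in 0..K-1 (the map l_ti)
    weights:   list of K weight vectors, each of length C
    """
    subset_size = Counter(labels[j] for j in neighbors)  # Z: cardinality of each subset
    out = 0.0
    for j in neighbors:
        k = labels[j]
        inner = sum(f * w for f, w in zip(features[j], weights[k]))
        out += inner / subset_size[k]
    return out


# Root joint 1 with neighbors 0 and 2; label 0 for the root, label 1 for the others.
FEATURES = {0: [1.0], 1: [2.0], 2: [3.0]}
LABELS = {0: 1, 1: 0, 2: 1}
WEIGHTS = [[10.0], [1.0]]  # hypothetical weights: one vector per subset
print(graph_conv_at(FEATURES, [0, 1, 2], LABELS, WEIGHTS))  # 22.0
```

The root contributes 2 × 10 = 20 on its own, while the two neighbors share one weight and are averaged: (1 + 3) / 2 = 2. Without the division by Z, a subset with many nodes would dominate the sum simply because it is larger.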

Spatial Temporal Modeling

Having formulated spatial graph CNN, we now advance to the task of modeling the spatial temporal dynamics within skeleton sequence. Recall that in the construction of the graph, the temporal aspect of the graph is constructed by connecting the same joints across consecutive frames. This enables us to define a very simple strategy to extend the spatial graph CNN to the spatial temporal domain. That is, we extend the concept of neighborhood to also include temporally connected joints as
$$B(v_{ti})=\{v_{qj} \mid d(v_{tj},v_{ti})\leq K,\ |q-t|\leq \lfloor \Gamma/2 \rfloor \} \tag{6}$$

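The key point here is that the definition of $B(v_{ti})$ is extended along the time axis. $\Gamma$ is the temporal span, i.e. how many frames the neighbor graph covers. In one sentence: the set consists of the joints $j$ that lie within spatial distance $K$ of joint $i$, taken at every frame $q$ that is at most $\lfloor \Gamma/2 \rfloor$ frames away from $t$.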

The parameter $\Gamma$ controls the temporal range to be included in the neighbor graph and can thus be called the temporal kernel size. To complete the convolution operation on the spatial temporal graph, we also need the sampling function, which is the same as the spatial only case, and the weight function, or in particular, the labeling map $l_{ST}$. Because the temporal axis is well-ordered, we directly modify the label map $l_{ST}$ for a spatial temporal neighborhood rooted at $v_{ti}$ to be
$$l_{ST}(v_{qj})=l_{ti}(v_{tj})+(q-t+\lfloor \Gamma/2 \rfloor) \times K \tag{7}$$
where $l_{ti}(v_{tj})$ is the label map for the single frame case at $v_{ti}$. In this way, we have a well-defined convolution operation on the constructed spatial temporal graphs.

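The point is to assign the labels of the map $l$ across different frames. Compared with the single-frame map $l_{ti}$ defined earlier, the second term of the formula is what distinguishes the frames $q$. The labels thus change from the range $0 \rightarrow K-1$ to the range $0 \rightarrow \Gamma K-1$.
In other words, the formula modifies the original $l_{ti}$: the first frame of the window gets the labels $0 \rightarrow K-1$, and each later frame is shifted by $K$ relative to the previous one.
It is worth noting that, so far, the computation never shrinks the graph: no matter how many convolutions are applied, the graph keeps its original size. (The node features are treated as channels, so the number of channels can still change.)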

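At this point the spatial temporal graph convolution has been fully described. To recap: take the input graph G at time $t$, pick a root node $i$ and a temporal span $\Gamma$, and find the nodes $v_{tj}$ adjacent to $i$. For every frame $q$ inside the temporal window, partition these neighbor nodes $j$ into subsets and label each subset according to Eq. (7) (the label of a subset determines which weight $w(l_{ti}(v_{tj}))$ is used). Finally, apply the convolution of Eq. (5) and output the result. The number of nodes in the output graph is unchanged (i.e. the size of the graph does not change).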
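The label shift of Eq. 7 is easy to check numerically. Below is a small sketch; the function name `st_label` and the example values K = 3, Γ = 3, t = 10 are chosen only for illustration.

```python
def st_label(spatial_label, q, t, gamma, K):
    """Eq. 7: l_ST(v_qj) = l_ti(v_tj) + (q - t + floor(gamma / 2)) * K."""
    half = gamma // 2
    if abs(q - t) > half:
        raise ValueError("frame q lies outside the temporal window around t")
    return spatial_label + (q - t + half) * K


K, GAMMA, t = 3, 3, 10
for q in (9, 10, 11):
    print(q, [st_label(label, q, t, GAMMA, K) for label in range(K)])
# 9 [0, 1, 2]
# 10 [3, 4, 5]
# 11 [6, 7, 8]
```

Each frame in the window receives its own block of K labels, so a window of Γ frames uses Γ × K distinct labels in total and therefore Γ × K weight vectors.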

Partition Strategies

Given the high-level formulation of spatial temporal graph convolution, it is important to design a partitioning strategy to implement the label map $l$. In this work we explore several partition strategies. For simplicity, we only discuss the cases in a single frame because they can be naturally extended to the spatial-temporal domain using Eq. 7.

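Next we discuss the rule by which the map $l$ partitions the neighbor set into subsets (and thereby fixes K). We only need to discuss partitioning within a single frame, since the labels along the time axis follow directly from Eq. (7).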

Uni-labeling

The simplest and most straightforward partition strategy is to have a single subset, which is the whole neighbor set itself. In this strategy, feature vectors on every neighboring node will have an inner product with the same weight vector. Actually, this strategy resembles the propagation rule introduced in (Kipf and Welling 2017). It has an obvious drawback that in the single frame case, using this strategy is equivalent to computing the inner product between the weight vector and the average feature vector of all neighboring nodes. This is suboptimal for skeleton sequence classification as the local differential properties could be lost in this operation. Formally, we have $K = 1$ and $l_{ti}(v_{tj})=0,\ \forall i,j \in V$.

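The first strategy is the simplest: all $v_{tj}$ go into a single subset, so they all share the same weight. Drawback: for skeleton data, local differential properties may be lost.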

Distance partitioning

Another natural partitioning strategy is to partition the neighbor set according to the nodes’ distance to the root node. In this work, because we set D = 1, the neighbor set will then be separated into two subsets, where d = 0 refers to the root node itself and remaining neighbor nodes are in the d = 1 subset. Thus we will have two different weight vectors and they are capable of modeling local differential properties such as the relative translation between joints. Formally, we have K = 2 and $l_{ti}(v_{tj})=d(v_{tj},v_{ti})$.

As the name suggests, the partition is by distance. Since D = 1 in this paper, the neighbor set splits into two subsets: d = 0 and d = 1.

Spatial configuration partitioning

Since the body skeleton is spatially localized, we can still utilize this specific spatial configuration in the partitioning process. We design a strategy to divide the neighbor set into three subsets: 1) the root node itself; 2) centripetal group: the neighboring nodes that are closer to the gravity center of the skeleton than the root node; 3) otherwise the centrifugal group. Here the average coordinate of all joints in the skeleton at a frame is treated as its gravity center. This strategy is inspired by the fact that motions of body parts can be broadly categorized as concentric and eccentric motions. Formally, we have
$$l_{ti}(v_{tj})=\begin{cases} 0 & \text{if } r_j=r_i \\ 1 & \text{if } r_j<r_i \\ 2 & \text{if } r_j>r_i \end{cases} \tag{8}$$
where $r_i$ is the average distance from gravity center to joint i over all frames in the training set. Visualization of the three partitioning strategies is shown in Fig. 3. We will empirically examine the proposed partitioning strategies on skeleton based action recognition experiments. It is expected that a more advanced partitioning strategy will lead to better modeling capacity and recognition performance.

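The neighbors are divided into three groups according to their distance from the body's center of gravity, using the average distances $r$ defined above. The figure below (Fig. 3 in the paper) illustrates the three partitioning strategies.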
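The three label maps can be compared side by side on a toy neighborhood. This is a sketch under assumptions: the function names are made up, the radii in `RADIUS` are invented numbers standing in for the average distance of each joint to the gravity center, and D = 1 is assumed so that every non-root neighbor has distance 1.

```python
def uni_labeling(root, node, radius):
    """K = 1: every neighbor falls into the same subset."""
    return 0


def distance_partitioning(root, node, radius):
    """K = 2 with D = 1: label equals the hop distance, 0 for the root and 1 otherwise."""
    return 0 if node == root else 1


def spatial_configuration(root, node, radius):
    """K = 3: 0 for equal radius (the root), 1 for centripetal, 2 for centrifugal."""
    if radius[node] == radius[root]:
        return 0
    return 1 if radius[node] < radius[root] else 2


# Hypothetical radii for the chain 0 - 1 - 2 (joint 0 is closest to the gravity center).
RADIUS = {0: 0.1, 1: 0.4, 2: 0.9}
NEIGHBORS_OF_1 = [0, 1, 2]
for strategy in (uni_labeling, distance_partitioning, spatial_configuration):
    print(strategy.__name__, {j: strategy(1, j, RADIUS) for j in NEIGHBORS_OF_1})
# uni_labeling {0: 0, 1: 0, 2: 0}
# distance_partitioning {0: 1, 1: 0, 2: 1}
# spatial_configuration {0: 1, 1: 0, 2: 2}
```

Only the third strategy separates joint 0 from joint 2, which is what lets the network learn different weights for motion toward and away from the body center.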

[Figure: the three partitioning strategies (Fig. 3 in the paper)]

Learnable edge importance weighting

Although joints move in groups when people are performing actions, one joint could appear in multiple body parts. These appearances, however, should have different importance in modeling the dynamics of these parts. In this sense, we add a learnable mask M on every layer of spatial temporal graph convolution. The mask will scale the contribution of a node’s feature to its neighboring nodes based on the learned importance weight of each spatial graph edge in $E_s$. Empirically we find adding this mask can further improve the recognition performance of ST-GCN. It is also possible to have a data dependent attention map for this sake. We leave this to future works.

A learnable mask is added; this is essentially an attention-like mechanism.

Implementing ST-GCN

The implementation of graph-based convolution is not as straightforward as 2D or 3D convolution. Here we provide details on implementing ST-GCN for skeleton based action recognition.
We adopt a similar implementation of graph convolution as in (Kipf and Welling 2017). The intra-body connections of joints within a single frame are represented by an adjacency matrix A and an identity matrix I representing self-connections. In the single frame case, ST-GCN with the first partitioning strategy can be implemented with the following formula:
$$f_{out}=\Lambda^{-\frac{1}{2}}(A+I)\Lambda^{-\frac{1}{2}}f_{in}W \tag{9}$$
where $\Lambda^{ii} = \sum_{j}(A^{ij}+I^{ij})$. Here the weight vectors of multiple output channels are stacked to form the weight matrix W. In practice, under the spatial temporal cases, we can represent the input feature map as a tensor of (C, V, T) dimensions. The graph convolution is implemented by performing a 1 × Γ standard 2D convolution and multiplying the resulting tensor with the normalized adjacency matrix $\Lambda^{-\frac{1}{2}}(A+I)\Lambda^{-\frac{1}{2}}$ on the second dimension.

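$\Lambda$ here should be the degree matrix; Eq. 9 is in fact borrowed from Kipf and Welling's 2017 paper on semi-supervised graph convolutional networks.
(C, V, T) = (channels, nodes, time)
The normalized adjacency matrix plays the role of Z in the derivation above. I believe it also helps against vanishing gradients during training.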
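For a single frame, Eq. 9 can be sketched with plain Python lists. The helper names `normalized_adjacency` and `uni_label_graph_conv` and the 3-joint chain are assumptions for illustration; a real implementation would use a tensor library and also carry the time dimension.

```python
import math


def normalized_adjacency(A):
    """Lambda^{-1/2} (A + I) Lambda^{-1/2}, with Lambda^{ii} = sum_j (A^{ij} + I^{ij})."""
    n = len(A)
    A_hat = [[float(A[i][j]) + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def uni_label_graph_conv(A, f_in, W):
    """Eq. 9 for one frame: f_in has shape (V, C_in), W has shape (C_in, C_out)."""
    A_norm = normalized_adjacency(A)
    n, c_in, c_out = len(A), len(W), len(W[0])
    # Aggregate neighbor features, then mix channels with W.
    agg = [[sum(A_norm[i][k] * f_in[k][c] for k in range(n)) for c in range(c_in)] for i in range(n)]
    return [[sum(agg[i][c] * W[c][o] for c in range(c_in)) for o in range(c_out)] for i in range(n)]


# Hypothetical chain skeleton 0 - 1 - 2
CHAIN_A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_norm = normalized_adjacency(CHAIN_A)
print(A_norm[0][0])  # 0.5, joint 0 has degree 2 once the self-connection is counted
f_out = uni_label_graph_conv(CHAIN_A, [[1.0], [2.0], [3.0]], [[1.0, -1.0]])
print(len(f_out), len(f_out[0]))  # 3 2
```

The number of nodes is unchanged by the operation; only the channel dimension changes, from 1 to 2 in this example.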

For partitioning strategies with multiple subsets, i.e., distance partitioning and spatial configuration partitioning, we again utilize this implementation. But note now the adjacency matrix is dismantled into several matrixes $A_j$. For example in the distance partitioning strategy, $A_0=I$, $A_1=A$. Then Eq. 9 is transformed into
$$f_{out}=\sum_j\Lambda_j^{-\frac{1}{2}}A_j\Lambda_j^{-\frac{1}{2}}f_{in}W_j \tag{10}$$

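Note the difference between Eq. (10) and Eq. (9). Each subset produced by the partitioning strategy has its own weight W and its own degree matrix, so Eq. (10) amounts to applying Eq. (9) to each subset separately and summing the results.
How is this form realized? It builds on the earlier definition $\Lambda^{ii} = \sum_{j}(A^{ij}+I^{ij})$.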

where similarly $\Lambda_j^{ii}= \sum_k(A_j^{ik})+\alpha$. Here we set α = 0.001 to avoid empty rows in $A_j$.
It is straightforward to implement the learnable edge importance weighting. For each adjacency matrix, we accompany it with a learnable weight matrix M. And we substitute the matrix A + I in Eq. 9 and $A_j$ in Eq. 10 with $(A+I)\otimes M$ and $A_j\otimes M$, respectively. Here $\otimes$ denotes element-wise product between two matrixes. The mask M is initialized as an all-one matrix.

A learnable edge weight matrix is added; it corresponds to the mask M mentioned above.
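Eq. 10 together with the mask can be sketched as a sum over partitions. Several things here are assumptions rather than facts from the paper: the helper names, the use of one mask per partition, the toy weights, and the choice to compute $\Lambda_j$ from the unmasked $A_j$ and apply the mask afterwards (with a fixed $\Lambda_j$ the element-wise product commutes with the diagonal scaling, so the order does not change the result).

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def normalize(A_j, alpha=0.001):
    """Lambda_j^{-1/2} A_j Lambda_j^{-1/2}, with Lambda_j^{ii} = sum_k A_j^{ik} + alpha."""
    n = len(A_j)
    deg = [sum(row) + alpha for row in A_j]
    return [[A_j[i][k] / math.sqrt(deg[i] * deg[k]) for k in range(n)] for i in range(n)]


def partitioned_graph_conv(A_parts, masks, f_in, W_parts):
    """Eq. 10 with A_j replaced by its masked version: sum over j of norm(A_j) * M_j @ f_in @ W_j."""
    num_nodes, c_out = len(f_in), len(W_parts[0][0])
    f_out = [[0.0] * c_out for _ in range(num_nodes)]
    for A_j, M_j, W_j in zip(A_parts, masks, W_parts):
        A_norm = normalize(A_j)
        masked = [[a * m for a, m in zip(row_a, row_m)] for row_a, row_m in zip(A_norm, M_j)]
        term = matmul(matmul(masked, f_in), W_j)
        f_out = [[o + t for o, t in zip(row_o, row_t)] for row_o, row_t in zip(f_out, term)]
    return f_out


# Distance partitioning on the hypothetical chain 0 - 1 - 2: A_0 = I, A_1 = A.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
ONES = [[1.0] * 3 for _ in range(3)]   # mask M initialised to all ones
ZEROS = [[0.0] * 3 for _ in range(3)]  # a mask that switches every edge off
f_in = [[1.0], [2.0], [3.0]]           # one input channel per node
W_root, W_neigh = [[1.0]], [[0.5]]     # hypothetical 1 x 1 weight matrices

f_out = partitioned_graph_conv([I3, A], [ONES, ONES], f_in, [W_root, W_neigh])
root_only = partitioned_graph_conv([I3, A], [ONES, ZEROS], f_in, [W_root, W_neigh])
print(f_out)
print(root_only)
```

With the all-ones mask both subsets contribute; zeroing the mask of $A_1$ leaves only the root term, which shows how learned mask entries can scale the influence of individual edges.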

Experiments

Network architecture and training

Since the ST-GCN shares weights on different nodes, it is important to keep the scale of input data consistent on different joints. In our experiments, we first feed input skeletons to a batch normalization layer to normalize data. The ST-GCN model is composed of 9 layers of spatial temporal graph convolution operators (ST-GCN units). The first three layers have 64 channels for output. The following three layers have 128 channels for output. And the last three layers have 256 channels for output. These layers have 9 temporal kernel size. The Resnet mechanism is applied on each ST-GCN unit. And we randomly dropout the features at 0.5 probability after each ST-GCN unit to avoid overfitting. The strides of the 4-th and the 7-th temporal convolution layers are set to 2 as pooling layer. After that, a global pooling was performed on the resulting tensor to get a 256 dimension feature vector for each sequence. Finally, we feed them to a SoftMax classifier. The models are learned using stochastic gradient descent with a learning rate of 0.01. We decay the learning rate by 0.1 after every 10 epochs.

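There are 9 spatial temporal convolution layers in total, each called an ST-GCN unit. The temporal kernel size is 9, and each unit uses a residual connection. Dropout with probability 0.5 follows each unit. The temporal convolutions of the 4th and 7th layers use stride 2 so that they act as pooling layers. Global pooling finally yields a 256-dimensional feature vector, which is classified with softmax.
batch normalization layer $\rightarrow$ three layers (64-channel output) $\rightarrow$ three layers (128-channel output) $\rightarrow$ three layers (256-channel output)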
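The layer layout can be traced as tensor shapes in the (C, V, T) convention used earlier. This sketch assumes "same" temporal padding, so a stride-s layer maps T frames to ceil(T / s) frames; the input size of 3 channels, 18 joints and 300 frames is only an illustrative choice.

```python
# (output channels, temporal stride) for the 9 ST-GCN units described above:
# units 4 and 7 are the ones with temporal stride 2.
LAYERS = [(64, 1)] * 3 + [(128, 2)] + [(128, 1)] * 2 + [(256, 2)] + [(256, 1)] * 2


def trace_shapes(in_channels, num_joints, num_frames, layers=LAYERS):
    """Return the (C, V, T) shape before the first unit and after each unit."""
    shapes = [(in_channels, num_joints, num_frames)]
    frames = num_frames
    for out_channels, stride in layers:
        frames = (frames + stride - 1) // stride  # ceil(T / stride), assuming 'same' padding
        shapes.append((out_channels, num_joints, frames))
    return shapes


shapes = trace_shapes(in_channels=3, num_joints=18, num_frames=300)
for index, shape in enumerate(shapes):
    print(index, shape)
print(shapes[-1])  # (256, 18, 75)
```

The joint dimension V never changes, matching the earlier remark that the graph keeps its size; global pooling over V and T then leaves the 256-dimensional vector that goes to the classifier.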

To avoid overfitting, we perform two kinds of augmentation to replace dropout layers when training on the Kinetics dataset (Kay et al. 2017). First, to simulate the camera movement, we perform random affine transformations on the skeleton sequences of all frames. Particularly, from the first frame to the last frame, we select a few fixed angle, translation and scaling factors as candidates and then randomly sampled two combinations of three factors to generate an affine transformation. This transformation is interpolated for intermediate frames to generate a effect as if we smoothly move the view point during playback. We name this augmentation as random moving. Second, we randomly sample fragments from the original skeleton sequences in training and use all frames in the test. Global pooling at the top of the network enables the network to handle the input sequences with indefinite length.

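Two techniques other than dropout are proposed to improve generalization. The first applies random affine transformations to the skeleton sequence across all frames. The second randomly samples fragments of the original skeleton sequence during training and uses all frames at test time.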
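Both augmentations can be sketched for 2D skeletons as follows. The function names, the candidate angle, scale and shift values, and the linear interpolation between the first and last frame are assumptions made for this example; the paper does not list its candidate values in the passage quoted above.

```python
import math
import random


def random_moving(seq, angles=(-10.0, 0.0, 10.0), scales=(0.9, 1.0, 1.1),
                  shifts=(-0.2, 0.0, 0.2), rng=None):
    """Apply an affine transform that drifts smoothly from a random start to a random end setting.

    seq is a list of frames; each frame is a list of (x, y) joint coordinates.
    """
    rng = rng or random.Random()
    # Two random combinations of (angle in degrees, scale, x shift, y shift).
    start = (rng.choice(angles), rng.choice(scales), rng.choice(shifts), rng.choice(shifts))
    end = (rng.choice(angles), rng.choice(scales), rng.choice(shifts), rng.choice(shifts))
    out = []
    for t, frame in enumerate(seq):
        a = t / (len(seq) - 1) if len(seq) > 1 else 0.0
        angle, scale, dx, dy = (s + a * (e - s) for s, e in zip(start, end))
        cos_a = math.cos(math.radians(angle)) * scale
        sin_a = math.sin(math.radians(angle)) * scale
        out.append([(cos_a * x - sin_a * y + dx, sin_a * x + cos_a * y + dy) for x, y in frame])
    return out


def random_fragment(seq, length, rng=None):
    """Pick a random contiguous fragment of the given length (whole sequence if it is shorter)."""
    rng = rng or random.Random()
    if length >= len(seq):
        return list(seq)
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


SEQ = [[(1.0, 2.0), (3.0, 4.0)] for _ in range(6)]  # 6 frames, 2 joints each
moved = random_moving(SEQ, rng=random.Random(0))
clip = random_fragment(SEQ, 4, rng=random.Random(0))
print(len(moved), len(moved[0]), len(clip))  # 6 2 4
```

Because global pooling at the top of the network removes the time dimension, fragments of different lengths can be used in training while the full sequence is used at test time.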

Comparison with State of the Arts

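Conclusion: in action recognition performance there is still a gap to the best RGB and optical-flow methods.
Thought: on the other hand, it removes the need for hand-crafted features and so reduces manual effort.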

Summary

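This paper builds on earlier graph convolution work and improves on it. Its main features are: 1. it defines neighbors along the time axis (see Eq. 6), which leads to ST-GCN; 2. it adds learnable parameters so that different nodes get different weights (similar to an attention mechanism); 3. it tries several neighbor partitioning strategies; 4. it achieves good classification results.
That concludes the analysis of the paper. In the next post I plan to write about its code implementation.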
