Authors
- Jiaqi Geng jiaqigen@andrew.cmu.edu Carnegie Mellon University Pittsburgh, PA, USA
- Dong Huang donghuang@cmu.edu Carnegie Mellon University Pittsburgh, PA, USA
- Fernando De la Torre ftorre@cs.cmu.edu Carnegie Mellon University Pittsburgh, PA, USA
Bibtex
@article{geng2022densepose,
  title={DensePose From WiFi},
  author={Geng, Jiaqi and Huang, Dong and De la Torre, Fernando},
  journal={arXiv preprint arXiv:2301.00250},
  year={2022}
}
0. ABSTRACT
Advances in computer vision and machine learning techniques have led to significant development in 2D and 3D human pose estimation from RGB cameras, LiDAR, and radars. However, human pose estimation from images is adversely affected by occlusion and lighting, which are common in many scenarios of interest. Radar and LiDAR technologies, on the other hand, need specialized hardware that is expensive and power-intensive. Furthermore, placing these sensors in non-public areas raises significant privacy concerns.
To address these limitations, recent research has explored the use of WiFi antennas (1D sensors) for body segmentation and key-point body detection. This paper further expands on the use of the WiFi signal in combination with deep learning architectures, commonly used in computer vision, to estimate dense human pose correspondence. We developed a deep neural network that maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions. The results of the study reveal that our model can estimate the dense pose of multiple subjects, with comparable performance to image-based approaches, by utilizing WiFi signals as the only input. This paves the way for low-cost, broadly accessible, and privacy-preserving algorithms for human sensing.
Keywords
Pose Estimation, Dense Body Pose Estimation, WiFi Signals, Keypoint Estimation, Body Segmentation, Object Detection, UV Coordinates, Phase and Amplitude, Phase Sanitization, Channel State Information, Domain Translation, Deep Neural Network, Mask-RCNN
1. INTRODUCTION
Much progress has been made in human pose estimation using 2D [7, 8, 12, 22, 28, 33] and 3D [17, 32] sensors in the last few years (e.g., RGB sensors, LiDARs, radars), fueled by applications in autonomous driving and augmented reality. These traditional sensors, however, are constrained by both technical and practical considerations. LiDAR and radar sensors are frequently seen as being out of reach for the average household or small business due to their high cost. For example, the median price of one of the most common COTS LiDARs, the Intel L515, is around 700 dollars, and the prices for ordinary radar detectors range from 200 to 600 dollars. In addition, these sensors are too power-consuming for daily and household use. As for RGB cameras, a narrow field of view and poor lighting conditions, such as glare and darkness, can have a severe impact on camera-based approaches. Occlusion is another obstacle that prevents camera-based models from generating reasonable pose predictions in images. This is especially worrisome for indoor scenarios, where furniture typically occludes people.
More importantly, privacy concerns prevent the use of these technologies in non-public places. For instance, most people are uncomfortable with having cameras recording them in their homes, and in certain areas (such as the bathroom) it will not be feasible to install them. These issues are particularly critical in healthcare applications, which are increasingly shifting from clinics to homes, where people are monitored with the help of cameras and other sensors. Resolving the aforementioned problems is important for better assisting the aging population, which is the most susceptible (especially during COVID) and has a growing need to keep living independently at home.
We believe that WiFi signals can serve as a ubiquitous substitute for RGB images for human sensing in certain instances. Illumination and occlusion have little effect on WiFi-based solutions used for interior monitoring. In addition, they protect individuals’ privacy and the required equipment can be bought at a reasonable price. In fact, most households in developed countries already have WiFi at home, and this technology may be scaled to monitor the well-being of elder people or just identify suspicious behaviors at home.
The issue we are trying to solve is depicted in Fig. 1 (first row): given three WiFi transmitters and three aligned receivers, can we detect and recover dense human pose correspondence in cluttered scenarios with multiple people (Fig. 1, fourth row)? It should be noted that many WiFi routers, such as the TP-Link AC1750, come with 3 antennas, so our method only requires 2 of these routers. Each of these routers costs around 30 dollars, which means our entire setup is still far cheaper than LiDAR and radar systems. Many factors make this a difficult task to solve.
First of all, WiFi-based perception [11, 30] is based on channel state information (CSI), which represents the ratio between the transmitted signal wave and the received signal wave. CSIs are sequences of complex numbers that have no direct correspondence to spatial locations, such as image pixels.
Secondly, classic techniques rely on accurate measurement of the time-of-flight and angle-of-arrival of the signal between the transmitter and receiver [13, 26]. These techniques only locate the object's center; moreover, the localization accuracy is only around 0.5 meters, due to the random phase shift allowed by the IEEE 802.11n/ac WiFi communication standard and potential interference from electronic devices in a similar frequency range, such as microwave ovens and cellphones.
To address these issues, we derive inspiration from recently proposed deep learning architectures in computer vision, and propose a neural network architecture that can perform dense pose estimation from WiFi. Fig. 1 (bottom row) illustrates how our algorithm is able to estimate dense pose using only the WiFi signal in scenarios with occlusion and multiple people.
Figure 1: The first row illustrates the hardware setup. The second and third rows are the clips of amplitude and phase of the input WiFi signal. The fourth row contains the dense pose estimation of our algorithm from only the WiFi signal.
2 RELATED WORK
This section briefly describes existing work on dense estimation from images and human sensing from WiFi.
Our research aims to conduct dense pose estimation via WiFi. In computer vision, the subject of dense pose estimation from images and video has received a lot of attention [6, 8, 18, 40]. This task consists of finding the dense correspondence between image pixels and the dense vertex indexes of a 3D human body model. The pioneering work of Güler et al. [8] mapped human images to dense correspondences of a human mesh model using deep networks. DensePose is based on instance segmentation architectures such as Mask-RCNN [9], and predicts body-wise UV maps for each pixel, where UV maps are flattened representations of 3D geometry, with coordinate points usually corresponding to the vertices of a 3D object. In this work, we borrow the same architecture as DensePose [8]; however, our input is not an image or video. Instead, we use 1D WiFi signals to recover the dense correspondence.
Recently, many extensions of DensePose have been proposed, especially in 3D human reconstruction with dense body parts [3, 35, 37, 38]. The work of Shapovalov et al. [24] focused on lifting dense pose surface maps to 3D human models without 3D supervision. Their network demonstrates that the dense correspondence alone (without using full 2D RGB images) contains sufficient information to generate a posed 3D human body. Compared to previous works on reconstructing 3D humans with sparse 2D keypoints, DensePose annotations are much denser and provide information about the 3D surface instead of 2D body joints.
While there is an extensive literature on detection [19, 20], tracking [4, 34], and dense pose estimation [8, 18] from images and videos, human pose estimation from WiFi or radar is a relatively unexplored problem. At this point, it is important to differentiate the current work on radar-based systems from WiFi. The work of Adib et al. [2] proposed a Frequency Modulated Continuous Wave (FMCW) radar system (broad bandwidth from 5.56GHz to 7.25GHz) for indoor human localization. A limitation of this system is the specialized hardware for synchronizing the transmission, refraction, and reflection needed to compute the Time-of-Flight (ToF). The system reached a resolution of 8.8 cm for body localization. In the following work [1], they improved the system by focusing on a moving person and generated a rough single-person outline with depth maps. Recently, they applied deep learning approaches to do fine-grained human pose estimation using a similar system, named RF-Pose [39]. These systems do not work under the IEEE 802.11n/ac WiFi communication standard (40MHz bandwidth centered at 2.4GHz). They rely on additional high-frequency and high-bandwidth electromagnetic fields, which require specialized technology not available to the general public. Recently, significant improvements have been made to radar-based human sensing systems. mmMesh [36] generates a 3D human mesh from commercially portable millimeter-wave devices. This system can accurately localize the vertices on the human mesh with an average error of 2.47 cm. However, mmMesh does not work well with occlusions, since high-frequency radio waves cannot penetrate objects.
Unlike the above radar systems, the WiFi-based solutions [11, 30] used off-the-shelf WiFi adapters and 3dB omnidirectional antennas. The signals propagate as IEEE 802.11n/ac WiFi data packets transmitted between antennas, which does not introduce additional interference. However, WiFi-based person localization using the traditional time-of-flight (ToF) method is limited by the signal's wavelength and signal-to-noise ratio. Most existing approaches only conduct center-of-mass localization [5, 27] and single-person action classification [25, 29]. Recently, Fei Wang et al. [31] demonstrated that it is possible to detect 17 2D body joints and produce a 2D semantic body segmentation mask using only WiFi signals. In this work, we go beyond [31] by estimating dense body pose, with much better accuracy than the 0.5 m that the WiFi signal can theoretically provide. Our dense pose outputs surpass WiFi's signal constraint on body localization, paving the road toward complete dense 2D, and possibly 3D, human body perception through WiFi. To achieve this, instead of directly training a randomly initialized WiFi-based model, we explored rich supervision information to improve both performance and training efficiency, such as utilizing the CSI phase, adding a keypoint detection branch, and transfer learning from an image-based model.
3 METHODS
Our approach produces UV coordinates of the human body surface from WiFi signals using three components: first, the raw CSI signals are cleaned by amplitude and phase sanitization. Then, a two-branch encoder-decoder network performs domain translation from sanitized CSI samples to 2D feature maps that resemble images. The 2D features are then fed to a modified DensePose-RCNN architecture [8] to estimate the UV map, a representation of the dense correspondence between 2D and 3D humans. To improve the training of our WiFi-input network, we conduct transfer learning, where we minimize the differences between the multi-level feature maps produced by images and those produced by WiFi signals before training our main network.
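The three stages can be sketched at the level of tensor shapes (a schematic with numpy placeholders, not the paper's implementation; the intermediate feature-map shape is our own assumption, and the real components are deep networks):

```python
import numpy as np

def sanitize(amplitude, phase):
    """Stage 1 (schematic): clean the raw CSI. The real step unwraps and
    stabilizes the phase; here the tensors simply pass through."""
    return amplitude, phase  # each 150 x 3 x 3

def domain_translation(amplitude, phase):
    """Stage 2 (schematic): two-branch encoder-decoder mapping CSI tensors
    to an image-like 2D feature map. The output shape is a placeholder."""
    return np.zeros((128, 24, 24), dtype=np.float32)  # assumed C x H x W

def densepose_head(features):
    """Stage 3 (schematic): DensePose-RCNN-style head producing keypoint
    heatmaps (17 x 56 x 56) and UV maps (25 x 112 x 112)."""
    keypoints = np.zeros((17, 56, 56), dtype=np.float32)
    uv_maps = np.zeros((25, 112, 112), dtype=np.float32)
    return keypoints, uv_maps

amp = np.random.rand(150, 3, 3)
pha = np.random.rand(150, 3, 3)
kp, uv = densepose_head(domain_translation(*sanitize(amp, pha)))
print(kp.shape, uv.shape)  # (17, 56, 56) (25, 112, 112)
```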
The raw CSI data are sampled at 100Hz as complex values over 30 subcarrier frequencies (linearly spaced within 2.4GHz±20MHz) transmitting among 3 emitter antennas and 3 reception antennas (see Figure 2). Each CSI sample contains a 3×3 real integer matrix and a 3×3 imaginary integer matrix. The inputs of our network are 5 consecutive CSI samples under 30 frequencies, organized as a 150×3×3 amplitude tensor and a 150×3×3 phase tensor respectively. Our network outputs include a 17×56×56 tensor of keypoint heatmaps (one 56×56 map for each of the 17 keypoints) and a 25×112×112 tensor of UV maps (one 112×112 map for each of the 24 body parts with one additional map for the background).
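Concretely, the two input tensors can be assembled from 5 consecutive complex CSI samples as follows (a minimal numpy sketch; the variable names and the random data are ours):

```python
import numpy as np

# 5 consecutive CSI samples, each complex-valued over 30 subcarriers
# and a 3 x 3 grid of transmitter-receiver antenna pairs.
csi = np.random.randn(5, 30, 3, 3) + 1j * np.random.randn(5, 30, 3, 3)

# Amplitude and phase of every complex entry.
amplitude = np.abs(csi)   # 5 x 30 x 3 x 3
phase = np.angle(csi)     # 5 x 30 x 3 x 3, wrapped to (-pi, pi]

# Merge the (sample, subcarrier) axes: 5 * 30 = 150 channels.
amplitude = amplitude.reshape(150, 3, 3)
phase = phase.reshape(150, 3, 3)

print(amplitude.shape, phase.shape)  # (150, 3, 3) (150, 3, 3)
```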
3.1 Phase Sanitization
The raw CSI samples are noisy, with random phase drift and flips (see Figure 3(b)). Most WiFi-based solutions disregard the phase of CSI signals and rely only on their amplitude (see Figure 3(a)). As shown in our experimental validation, discarding the phase information has a negative impact on the performance of our model. In this section, we perform sanitization to obtain stable phase values and enable full use of the CSI information.
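As an illustration of the general idea, a common recipe for stabilizing CSI phase is to unwrap it across subcarriers and then subtract a best-fit line, removing the linear drift introduced by timing and frequency offsets. The sketch below shows that standard recipe; it is our assumption for illustration, not necessarily the paper's exact sanitization algorithm:

```python
import numpy as np

def sanitize_phase(raw_phase):
    """Stabilize the CSI phase of one antenna pair.

    raw_phase: shape (n_subcarriers,), wrapped to (-pi, pi].
    Unwraps 2*pi jumps across subcarriers, then subtracts the best-fit
    line to remove linear drift. A common recipe, assumed here.
    """
    unwrapped = np.unwrap(raw_phase)
    idx = np.arange(len(unwrapped))
    slope, intercept = np.polyfit(idx, unwrapped, 1)
    return unwrapped - (slope * idx + intercept)

# Example: a pure linear phase ramp sanitizes to (numerically) zero.
wrapped = np.angle(np.exp(1j * (0.4 * np.arange(30) + 1.0)))
clean = sanitize_phase(wrapped)
print(np.allclose(clean, 0, atol=1e-6))  # True
```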
Figure 2: CSI samples from WiFi. (a) The layout of WiFi devices and human bodies. (b) The 3×3 tensor dimension corresponds to the 3×3 transmitter-receiver antenna pairs; for instance, E1 denotes the first emitter and R1 denotes the first receiver, etc. By incorporating the 5 consecutive complex-valued CSI samples (100 samples/second) under 30 subcarrier frequencies, the two input tensors to our network are a 150×3×3 amplitude tensor and a 150×3×3 phase tensor.
In the raw CSI samples (5 consecutive samples are visualized in Figure 3(a-b)), the amplitude A and phase Φ of each complex element z = a + bi are computed using the formulation A = √(a² + b²) and Φ = arctan(b/a).
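As a quick numerical check of that formulation (note that in code one uses arctan2(b, a) rather than arctan(b/a), which would lose the quadrant of z), the amplitude and phase of a 3×3 complex CSI matrix can be computed element-wise; the integer parts below are illustrative:

```python
import numpy as np

# A 3 x 3 complex CSI matrix z = a + bi with illustrative integer parts.
real = np.array([[3, -1, 0], [2, 5, -4], [1, 1, 2]], dtype=float)
imag = np.array([[4, 2, -1], [0, -3, 4], [1, -2, 2]], dtype=float)
z = real + 1j * imag

A = np.abs(z)      # amplitude: sqrt(a^2 + b^2), element-wise
phi = np.angle(z)  # phase: arctan2(b, a), keeps the correct quadrant

print(A[0, 0])  # 5.0 (sqrt(3^2 + 4^2))
print(np.all((phi > -np.pi) & (phi <= np.pi)))  # True
```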