Weak Perspective Projection in 3D Human Pose Estimation Methods Such as SPIN and VIBE

2023-05-16

Weak Perspective Projection

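Weak perspective projection assumes that the focal length and the object distance are both sufficiently large, so that the variation of the object along the $z$ axis (the optical axis) can be neglected.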
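
To make this assumption concrete, here is a small sketch in plain Python. It is not taken from SPIN or VIBE; the function names and the sample numbers are illustrative assumptions. It compares full perspective projection with the weak-perspective approximation for points whose depth varies by at most 1 around the object distance.

```python
def perspective(x, z, f):
    """Full perspective projection of one coordinate: u = f * x / z."""
    return f * x / z


def weak_perspective(x, z_mean, f):
    """Weak perspective: every point shares one depth z_mean,
    so the projection is a pure scaling, u = (f / z_mean) * x."""
    return f * x / z_mean


def max_relative_error(t_z, f, depth_range=1.0, x=1.0):
    """Largest relative gap between the two models for points whose depth
    lies at the ends of [t_z - depth_range, t_z + depth_range]."""
    exact = [perspective(x, t_z + dz, f) for dz in (-depth_range, depth_range)]
    approx = weak_perspective(x, t_z, f)
    return max(abs(e - approx) / abs(e) for e in exact)


if __name__ == "__main__":
    for t_z in (5.0, 50.0, 500.0):
        # Keep f / t_z fixed so the object stays the same size in the image.
        f = 100.0 * t_z
        print(t_z, max_relative_error(t_z, f))
```

Under these assumptions the relative error works out to depth_range / t_z, so it shrinks in proportion as the object is moved farther away while the focal length is scaled up with it.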

Weak Perspective Projection in SPIN, VIBE, and Similar 3D Human Pose Estimation Methods

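First, the 3D keypoints already lie inside a $[-1, 1]^3$ cube. The camera sits at the center of the cube (the origin of the world coordinate system), and the camera coordinate system is fully aligned with the world coordinate system, as shown in the figure below: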
Figure 1. Initial state

To perform weak perspective projection, the object distance has to be increased. It is set according to
$$t_z = \frac{2 \times f}{Res \times s}$$
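where $f$ is the focal length; $Res$ is the size of the image after cropping and resizing, i.e. the network input size, which is usually set to 224 in these papers; and $s$ is one of the cam parameters predicted by the network, $s, t_x, t_y = cam$ (this is the order in which the code below indexes them). $t_x$ and $t_y$ are the offsets by which the keypoints should be shifted inside the $[-1, 1]^3$ cube, and $s$ is the scale of the human body relative to the $224 \times 224$ image. The formula for $t_z$ can be understood from the figure below.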
Figure 2. Illustration of how the object distance $t_z$ is computed
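
One way to see where this formula comes from is a short derivation under the pinhole model, assuming the depth spread of the body is negligible compared with $t_z$ (the weak-perspective assumption). A keypoint $(X, Y, Z)$ in the cube, once shifted by $(t_x, t_y, t_z)$, projects to

$$u = f \, \frac{X + t_x}{Z + t_z} \approx f \, \frac{X + t_x}{t_z} = \frac{Res \times s}{2} \, (X + t_x)$$

Dividing by $Res / 2$ to express the result in normalized image coordinates gives

$$\frac{u}{Res / 2} \approx s \, (X + t_x)$$

and likewise for $v$ with $Y$ and $t_y$. In other words, this choice of $t_z$ makes the pinhole camera behave like a scaling by $s$ followed by nothing else. With $f = 5000$, $Res = 224$ (the constants that appear in the code in this post) and $s = 1$, the distance is $t_z = 10000 / 224 \approx 44.6$, which is large compared with the cube's half-width of 1.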

Projection steps

1. Translate the keypoints by $[t_x, t_y, t_z]$.
2. Construct the camera intrinsic matrix and apply it to the keypoints to obtain pixel coordinates.

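Note: judging from the code, there is only one difference from the procedure above: $u_0$ and $v_0$ are both set to 0 in the code. This is because the GT 2D joints already take the center of the cropped image as the origin and are normalized to $[-1, 1]$, which is also why the projected keypoints are divided by $224 / 2$ at the end.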
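
The sketch below illustrates this convention in plain Python. It is an illustration under stated assumptions rather than code from either repository: the helper names `pixel_to_normalized`, `normalized_to_pixel` and `project_centered` are hypothetical, and the mapping $2p / Res - 1$ is simply the usual way to move the origin to the crop center and rescale to $[-1, 1]$.

```python
RES = 224.0  # crop size used throughout this post


def pixel_to_normalized(p, res=RES):
    """Map a pixel coordinate in [0, res] (origin at the top-left corner)
    to [-1, 1] (origin at the crop center)."""
    return 2.0 * p / res - 1.0


def normalized_to_pixel(n, res=RES):
    """Inverse mapping: [-1, 1] back to pixel coordinates in [0, res]."""
    return (n + 1.0) * res / 2.0


def project_centered(x, z, f, res=RES):
    """Pinhole projection with u0 = 0, followed by division by res / 2.
    The result is directly comparable to pixel_to_normalized(...)."""
    return (f * x / z) / (res / 2.0)
```

For example, with $f = 5000$ and $s = 1$ the distance is `t_z = 2 * 5000 / 224`, and `project_centered(0.5, t_z, 5000.0)` lands at the same normalized position as `pixel_to_normalized(168.0)`. No principal-point offset is needed because both sides already measure from the crop center.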

import torch


def projection(pred_joints, pred_camera):
    # pred_camera holds (s, tx, ty); build the translation (tx, ty, tz)
    # with tz = 2 * f / (Res * s), where f = 5000 and Res = 224.
    pred_cam_t = torch.stack([pred_camera[:, 1],
                              pred_camera[:, 2],
                              2 * 5000. / (224. * pred_camera[:, 0] + 1e-9)], dim=-1)
    batch_size = pred_joints.shape[0]
    # Principal point (u0, v0) is zero; create it on the same device as the joints.
    camera_center = torch.zeros(batch_size, 2, device=pred_joints.device)
    pred_keypoints_2d = perspective_projection(pred_joints,
                                               rotation=torch.eye(3).unsqueeze(0).expand(batch_size, -1, -1).to(pred_joints.device),
                                               translation=pred_cam_t,
                                               focal_length=5000.,
                                               camera_center=camera_center)
    # Normalize keypoints to [-1,1]
    pred_keypoints_2d = pred_keypoints_2d / (224. / 2.)
    return pred_keypoints_2d

def perspective_projection(points, rotation, translation,
                           focal_length, camera_center):
    """
    This function computes the perspective projection of a set of points.
    Input:
        points (bs, N, 3): 3D points
        rotation (bs, 3, 3): Camera rotation
        translation (bs, 3): Camera translation
        focal_length (bs,) or scalar: Focal length
        camera_center (bs, 2): Camera center
    """
    batch_size = points.shape[0]
    K = torch.zeros([batch_size, 3, 3], device=points.device)
    K[:,0,0] = focal_length
    K[:,1,1] = focal_length
    K[:,2,2] = 1.
    K[:,:-1, -1] = camera_center

    # Transform points
    points = torch.einsum('bij,bkj->bki', rotation, points)
    points = points + translation.unsqueeze(1)

    # Apply perspective distortion
    projected_points = points / points[:,:,-1].unsqueeze(-1)  # divide by Z_c before applying the intrinsics

    # Apply camera intrinsics
    projected_points = torch.einsum('bij,bkj->bki', K, projected_points)

    return projected_points[:, :, :-1]
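
As a quick sanity check on the two functions above, the following plain-Python sketch repeats the same steps for a single 3D point and compares the result with the weak-perspective closed form $s \, (X + t_x)$. It does not use torch; the function names `weak_perspective_project` and `closed_form` and the sample numbers are illustrative assumptions, and the rotation is taken to be the identity with a zero camera center, as in `projection`.

```python
FOCAL_LENGTH = 5000.0  # same constant as in projection()
RES = 224.0            # same crop size as in projection()


def weak_perspective_project(point, cam):
    """Project one 3D point (x, y, z) using cam = (s, tx, ty).

    Mirrors projection() for a single point: translate by (tx, ty, tz),
    divide by depth, scale by the focal length, normalize by RES / 2.
    """
    s, tx, ty = cam
    tz = 2.0 * FOCAL_LENGTH / (RES * s + 1e-9)
    x, y, z = point
    xc, yc, zc = x + tx, y + ty, z + tz  # step 1: translation
    u = FOCAL_LENGTH * xc / zc           # step 2: intrinsics with u0 = v0 = 0
    v = FOCAL_LENGTH * yc / zc
    return u / (RES / 2.0), v / (RES / 2.0)


def closed_form(point, cam):
    """Weak-perspective limit: ignore z, return (s * (x + tx), s * (y + ty))."""
    s, tx, ty = cam
    x, y, _ = point
    return s * (x + tx), s * (y + ty)


if __name__ == "__main__":
    cam = (0.9, 0.05, -0.1)
    for point in [(0.3, -0.2, 0.0), (0.3, -0.2, 0.1)]:
        print(weak_perspective_project(point, cam), closed_form(point, cam))
```

For a point with $z = 0$ the two results agree up to floating-point error. For nonzero $z$ the gap is on the order of $z / t_z$, which stays small because $t_z$ is roughly 50 when $s$ is near 0.9.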

References

Beyond Weak Perspective for Monocular 3D Human Pose Estimation
