Despite significant progress in single-image-based 3D human mesh recovery, accurately and smoothly recovering 3D human motion from a video remains challenging. Existing video-based methods generally recover the human mesh by estimating the complex pose and shape parameters from coupled image features, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To alleviate this issue, we introduce the 3D pose as an intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: (1) video-based 3D human pose estimation and (2) mesh vertex regression from the estimated 3D pose and a temporal image feature. Specifically, we propose a two-stream encoder that estimates the mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with image-guided Adaptive Layer Normalization (AdaLN) to make the pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP.
Framework Overview
Framework of the proposed PMCE. Given a video sequence, static image features are extracted by a pre-trained CNN, and 2D poses are detected by an off-the-shelf 2D pose detector. The two-stream encoder uses two parallel modules: one generates a temporal image feature, and the other estimates the mid-frame 3D pose. The co-evolution decoder then regresses the mesh vertices through pose and mesh interactions guided by our proposed AdaLN, which makes the pose and mesh fit the body shape.
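The image-guided AdaLN step can be illustrated with a minimal pure-Python sketch. It is not the released PMCE code. It assumes the common form of adaptive layer normalization, in which a joint or vertex feature is layer-normalized and then rescaled and shifted by values predicted from the temporal image feature through two linear projections. The names `layer_norm`, `linear`, and `ada_ln`, the toy dimensions, and the exact parameterization are all assumptions made for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, f):
    """Compute W f + b, where W is a list of rows."""
    return [sum(w * v for w, v in zip(row, f)) + b for row, b in zip(weights, bias)]


def ada_ln(x, image_feat, scale_params, shift_params):
    """Hypothetical AdaLN: normalize x, then modulate it with image-derived scale and shift.

    scale_params and shift_params are (weights, bias) pairs for the two
    assumed linear projections of the temporal image feature.
    """
    gamma = linear(*scale_params, image_feat)  # per-channel scale
    beta = linear(*shift_params, image_feat)   # per-channel shift
    return [g * n + b for g, n, b in zip(gamma, layer_norm(x), beta)]


if __name__ == "__main__":
    token = [1.0, 2.0, 3.0, 4.0]   # toy joint/vertex feature (4 channels)
    image_feat = [0.5, -0.5]       # toy temporal image feature (2 channels)
    scale = ([[0.2, 0.0]] * 4, [1.0] * 4)
    shift = ([[0.0, 0.3]] * 4, [0.0] * 4)
    print(ada_ln(token, image_feat, scale, shift))
```

With zero projection weights, a scale bias of 1, and a shift bias of 0, the sketch reduces to plain layer normalization. Under this assumed form, the image feature influences the pose and mesh tokens only through the learned scale and shift.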
Results on internet videos
Comparison on Datasets
Visual Comparison
Pink: MPS-Net. Blue: PMCE (ours).
@inproceedings{you2023pmce,
author = {Yingxuan You and Hong Liu and Ti Wang and Wenhao Li and Runwei Ding and Xia Li},
title = {Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages = {14963--14973},
year = {2023}
}