NEURAL IMAGE-BASED AVATARS: GENERALIZABLE RADIANCE FIELDS FOR HUMAN AVATAR MODELING

Abstract

We present a method that synthesizes novel views and novel poses of arbitrary human performers from sparse multi-view images. A key ingredient of our method is a hybrid appearance blending module that combines the advantages of an implicit body NeRF representation and image-based rendering. Existing generalizable human NeRF methods conditioned on the body model have shown robustness to the geometric variation of arbitrary human performers, yet they often exhibit blurry results when generalized to unseen identities. Meanwhile, image-based rendering produces high-quality results when sufficient observations are available, but it suffers from artifacts in sparse-view settings. We propose Neural Image-based Avatars (NIA), which exploits the best of these two methods: it maintains robustness under new articulations and self-occlusions while directly leveraging the available (sparse) source view colors to preserve the appearance details of new subject identities. Our hybrid design outperforms recent methods on both in-domain identity generalization and the more challenging cross-dataset generalization setting. Moreover, in terms of pose generalization, our method outperforms even per-subject optimized animatable NeRF methods.

1. INTRODUCTION

Acquisition of 3D renderable full-body avatars is critical for applications in virtual reality, telepresence, and human modeling. While early solutions required heavy hardware setups such as dense camera rigs or depth sensors, recent neural rendering techniques have made significant progress toward more scalable and low-cost solutions. Notably, neural radiance field (NeRF) based methods facilitated by the parametric body prior Loper et al. (2015) require only sparse camera views to enable visually pleasing free-view synthesis Peng et al. (2021b); Kwon et al. (2021); Zhao et al. (2022); Cheng et al. (2022) or pose animation Peng et al. (2021a); Su et al. (2021) of the human avatar. Still, creating a full-body avatar from sparse images (e.g., three snapshots) of a person is a challenging problem due to the complexity and diversity of possible human appearances and poses. Most existing methods Peng et al. (2021b;a) therefore focus on a person-specific setting, which requires a dedicated model optimization for each new subject. More recent methods explore generalizable human NeRF representations Peng et al. (2021b); Raj et al. (2021b); Kwon et al. (2021); Zhao et al. (2022) by using pixel-aligned features in a data-driven manner. Among them, Kwon et al. (2021) and Chen et al. (2022) specifically exploit a body surface feature conditioned NeRF (i.e., pixel-aligned features anchored at the SMPL vertices), which provides robustness to various articulations while obviating the need for 3D supervision Zhao et al. (2022); Cheng et al. (2022). Nevertheless, these body surface feature conditioned NeRFs still suffer from blur artifacts when generalizing to unseen subject identities with complex poses. In addition, extra video-level feature aggregation is required in Kwon et al. (2021) to compensate for the sparsity of input views.

In this paper, we propose Neural Image-based Avatars (NIA), which generalizes novel view synthesis and pose animation to arbitrary human performers from a sparse set of still images. It is a hybrid framework that combines a body surface feature conditioned NeRF (e.g., Kwon et al. (2021)) with image-based rendering techniques (e.g., Wang et al. (2021)). While the former provides a robust representation of different body shapes and poses, image-based rendering helps preserve the color and texture details of the source images. This complements the NeRF-predicted colors, which are often blurry and inaccurate in generalization settings (cross-identity as well as cross-dataset generalization), as shown in figure 2 and figure 5. To leverage the best of both worlds, we propose a neural appearance blending scheme that learns to adaptively blend the NeRF-predicted colors with the direct source image colors. Last but not least, by deforming the learned NIA representation via skeleton-driven transformations Lewis et al. (2000); Kavan et al. (2007), we enable plausible pose animation of the learned avatar.

To demonstrate the efficacy of our NIA method, we experiment on the ZJU-MoCap Peng et al. (2021b) and MonoCap Habermann et al. (2020; 2021) datasets. First, experiments show that our method outperforms the state-of-the-art Neural Human Performer Kwon et al. (2021) and GP-NeRF Chen et al. (2022) on the novel view synthesis task. Furthermore, we study the more challenging cross-dataset generalization setting by evaluating zero-shot performance on the MonoCap Habermann et al. (2020; 2021) datasets, where we clearly outperform previous methods. Finally, we evaluate on the pose animation task, where our NIA, tested on unseen subjects, achieves better pose generalization than the per-subject trained A-NeRF Su et al. (2021) and Animatable-NeRF Peng et al. (2021a) tested on their seen training subjects. Ablation studies demonstrate that the proposed modules of our NIA collectively contribute to high-quality rendering of arbitrary human subjects.

2. RELATED WORK

Combined with Neural Radiance Fields (NeRF) Mildenhall et al. (2020), human reconstruction research has shown unprecedented development Pumarola et al. (2020); Park et al. (2021a;b). Human priors are utilized to enable robust reconstruction of faces and bodies Gao et al. (2020); Gafni et al. (2021); Peng et al. (2021b). However, these methods are per-subject optimized and cannot model motions that are not seen during training. Therefore, subsequent works have focused on generalization in two directions: pose and subject identity.

Pose generalization. Su et al. (2021) utilize a joint-relative encoding for dynamic articulations. Noguchi et al. (2021) explicitly associate 3D points with body parts. Chen et al. (2021) and Peng et al. (2021a) deform target pose space queries into the canonical space to obtain color and density values. Liu et al. (2021) leverage a normal map as a dense pose cue. Xu et al. (2021) learn a deformable signed distance field. Weng et al. (2022) and Peng et al. (2022) decompose human deformation into articulation-driven and non-rigid deformations. Su et al. (2022) and Zheng et al. (2022) utilize joint-specific local radiance fields. Despite the significant progress in pose generalization, these methods still focus on a subject-specific setting, which requires training a single model for each subject. In this paper, we tackle generalization across both poses and subject identities.

Subject identity generalization. The use of image-conditioned or pixel-aligned features Yu et al. (2020); Wang et al. (2021) has enabled generalized neural human representations from sparse camera views. Raj et al. (2021b) use camera-encoded pixel-aligned features for face view synthesis. Kwon et al. (2021) aggregate temporal features by anchoring them onto the SMPL body vertices to complement the sparse input views. Chen et al. (2022) also leverage body surface features to enable full-body synthesis. Zhao et al. (2022) and Cheng et al. (2022) likewise propose neural blending of the NeRF prediction with source view colors. Specifically, they use a pixel-aligned-feature conditioned NeRF as their implicit body representation. However, pixel-aligned features alone are prone to errors under complex poses, as reported in Kwon et al. (2021). Therefore, their methods require 3D supervision (e.g., depth, visibility) or per-subject finetuning. In contrast to these methods, we aim at generalizing human modeling without relying on 3D supervision or per-subject finetuning, using only RGB supervision.
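To make the hybrid appearance idea concrete, the following is a minimal numpy sketch of blending a NeRF-predicted color with colors gathered from sparse source views. All function and variable names here are ours for illustration; in the actual method the per-view weights and the blend weight are predicted by networks, and sampling is bilinear rather than nearest-neighbor.

```python
import numpy as np

def project(point, K, R, t):
    """Project a 3D world point into a source view with intrinsics K and pose (R, t)."""
    cam = R @ point + t                       # world -> camera coordinates
    uv = K @ cam
    return uv[:2] / uv[2]                     # perspective division -> pixel coords

def sample_color(image, uv):
    """Nearest-neighbor color lookup (a full system would use bilinear sampling)."""
    h, w, _ = image.shape
    x = int(np.clip(np.rint(uv[0]), 0, w - 1))
    y = int(np.clip(np.rint(uv[1]), 0, h - 1))
    return image[y, x]

def hybrid_color(point, views, c_nerf, view_logits, blend_logit):
    """Blend a NeRF-predicted color with an image-based-rendered color.

    views: list of (image, K, R, t) for each sparse source camera.
    view_logits: per-view scores (network-predicted in the real method).
    blend_logit: scalar score trading off NeRF color vs. source colors.
    """
    colors = np.stack([sample_color(img, project(point, K, R, t))
                       for img, K, R, t in views])
    w = np.exp(view_logits - view_logits.max())
    w = w / w.sum()                           # softmax over source views
    c_ibr = (w[:, None] * colors).sum(axis=0)  # image-based-rendered color
    alpha = 1.0 / (1.0 + np.exp(-blend_logit))  # sigmoid blend weight
    return alpha * c_nerf + (1.0 - alpha) * c_ibr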

Other (non NeRF-based) methods. Deferred neural rendering based methods Thies et al. (2019); Raj et al. (2021a); Grigorev et al. (2021) enable fast synthesis of person-specific avatars by combining the traditional graphics pipeline with neural texture maps. Grigorev et al. (2021) further finetune to deal with unseen subjects; in contrast, we focus on generalization without any finetuning. Saito et al. (2019; 2020) and He et al. (2021) leverage pixel-aligned features to enable 3D reconstruction from a single image; however, they require 3D ground-truth supervision. Habermann et al. (2019); Liu et al. (2020); Habermann et al. (2021); Bagautdinov et al. (2021) generate high-quality non-rigid deformations, but they require template mesh optimization. Aliev et al. (2019); Wu et al. (2020) anchor features on point clouds and render them with a differentiable rasterizer. However, they
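The skeleton-driven transformations used for pose animation in the introduction refer to linear blend skinning Lewis et al. (2000); Kavan et al. (2007). A minimal numpy sketch of the formula, with function names of our own choosing, is:

```python
import numpy as np

def linear_blend_skinning(vertex, weights, rotations, translations):
    """Deform a rest-pose vertex by a weighted sum of per-joint rigid transforms.

    weights: per-joint skinning weights (summing to 1), e.g. from the SMPL model.
    rotations, translations: per-joint rigid transforms for the target pose.
    Computes v' = sum_j w_j * (R_j @ v + t_j).
    """
    out = np.zeros(3)
    for w, R, t in zip(weights, rotations, translations):
        out += w * (R @ vertex + t)
    return out
```

With identity rotations and zero translations the vertex is left unchanged; a single joint with a 90-degree rotation rotates the vertex rigidly. Blending several joint transforms in this way is what lets a learned representation anchored to the body surface follow new articulations.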

