NEURAL IMAGE-BASED AVATARS: GENERALIZABLE RADIANCE FIELDS FOR HUMAN AVATAR MODELING

Abstract

We present a method that synthesizes novel views and novel poses of arbitrary human performers from sparse multi-view images. A key ingredient of our method is a hybrid appearance blending module that combines the advantages of an implicit body NeRF representation and image-based rendering. Existing generalizable human NeRF methods that are conditioned on a body model have shown robustness to the geometric variation of arbitrary human performers, yet they often produce blurry results when generalizing to unseen identities. Meanwhile, image-based rendering yields high-quality results when sufficient observations are available, but suffers from artifacts in sparse-view settings. We propose Neural Image-based Avatars (NIA), which exploits the best of both approaches: it maintains robustness under new articulations and self-occlusions while directly leveraging the available (sparse) source-view colors to preserve the appearance details of new subject identities. Our hybrid design outperforms recent methods on both in-domain identity generalization and challenging cross-dataset generalization settings. In terms of pose generalization, our method outperforms even per-subject optimized animatable NeRF methods.

1. INTRODUCTION

Acquisition of 3D renderable full-body avatars is critical for applications in virtual reality, telepresence, and human modeling. While early solutions required heavy hardware setups such as dense camera rigs or depth sensors, recent neural rendering techniques have made significant progress toward a more scalable and low-cost solution. Notably, neural radiance field (NeRF)-based methods, facilitated by the parametric body prior of Loper et al. (2015), require only sparse camera views to enable visually pleasing free-view synthesis Peng et al. (2021b); Kwon et al. (2021); Zhao et al. (2022); Cheng et al. (2022) or pose animation Peng et al. (2021a); Su et al. (2021) of a human avatar. Still, creating a full-body avatar from sparse images (e.g., three snapshots) of a person is a challenging problem due to the complexity and diversity of possible human appearances and poses. Most existing methods Peng et al. (2021b;a) therefore focus on a person-specific setting, which requires a dedicated model optimization for each new subject encountered. More recent methods explore generalizable human NeRF representations Peng et al. (2021b); Raj et al. (2021b); Kwon et al. (2021); Zhao et al. (2022) by using pixel-aligned features in a data-driven manner. Among them, Kwon et al. (2021) and Chen et al. (2022) specifically exploit a body-surface-feature-conditioned NeRF (i.e., pixel-aligned features anchored at the SMPL vertices), which improves robustness to various articulations while obviating the need for 3D supervision Zhao et al. (2022); Cheng et al. (2022). Nevertheless, these body-surface-feature-conditioned NeRFs still suffer from blur artifacts when generalizing to unseen subject identities with complex poses. Moreover, Kwon et al. (2021) requires extra video-level feature aggregation to compensate for the sparsity of input views.

In this paper, we propose Neural Image-based Avatars (NIA), which generalizes novel view synthesis and pose animation to arbitrary human performers from a sparse set of still images. It is a hybrid framework that combines a body-surface-feature-conditioned NeRF (e.g., Kwon et al. (2021)) with image-based rendering techniques (e.g., Wang et al. (2021)).
While the former provides a robust representation of different body shapes and poses, image-based rendering helps preserve the color and texture details of the source images, which can complement the NeRF-predicted colors.
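To make the hybrid idea concrete, the following is a minimal sketch of such an appearance blending step, not the authors' actual implementation: all function and variable names here are our own, and the per-view scores and branch weight are assumed to come from learned networks that are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_blend(c_nerf, c_src, logits, alpha):
    """Blend NeRF-predicted colors with source-view colors per sample point.

    c_nerf: (P, 3)    colors from the body-conditioned NeRF branch
    c_src:  (P, V, 3) colors fetched from V sparse source views
                      (image-based rendering branch)
    logits: (P, V)    per-view blending scores (e.g., from visibility or
                      feature similarity; assumed to be network outputs)
    alpha:  (P, 1)    learned weight trading off the two branches
    """
    w = softmax(logits, axis=-1)[..., None]         # (P, V, 1) view weights
    c_ibr = (w * c_src).sum(axis=1)                 # (P, 3) image-based color
    return alpha * c_nerf + (1.0 - alpha) * c_ibr   # (P, 3) blended color
```

With equal view scores and `alpha = 0.5`, the output is simply the average of the NeRF color and the mean source-view color, illustrating how the image-based branch can pull the result toward the observed pixel colors.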

