IS ATTENTION ALL THAT NERF NEEDS?

Abstract

We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalize across scenes by using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views using attention to decode the features from the view transformer along the sampled points during ray marching. Our experiments demonstrate that, when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, thanks to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the learned attention maps, which can be used to infer depth and occlusion, indicates that attention enables learning a physically-grounded rendering procedure. Our results show the promise of transformers as a universal modeling tool for graphics.

1. INTRODUCTION

Neural Radiance Field (NeRF) (Mildenhall et al., 2020) and its follow-up works (Barron et al., 2021; Zhang et al., 2020; Chen et al., 2022) have achieved remarkable success on novel view synthesis, generating photo-realistic, high-resolution, and view-consistent scenes. Two key ingredients in NeRF are: (1) a coordinate-based neural network that maps each spatial position to its corresponding color and density, and (2) a differentiable volumetric rendering pipeline that composes the color and density of points along each ray cast from the image plane to generate the target pixel color. Optimizing a NeRF can be regarded as an inverse imaging problem that fits a neural network to satisfy the observed views. Such training leads to a major limitation of NeRF: a time-consuming optimization process must be repeated for each scene (Chen et al., 2021a; Wang et al., 2021b; Yu et al., 2021). Recent works NeuRay (Liu et al., 2022), IBRNet (Wang et al., 2021b), and PixelNeRF (Yu et al., 2021) go beyond the coordinate-based network and rethink novel view synthesis as a cross-view image-based interpolation problem. Unlike the vanilla NeRF that tediously fits each scene, these methods synthesize a generalizable 3D representation by aggregating image features extracted from seen views according to camera and geometry priors. However, despite showing large performance gains, they without exception decode the feature volume into a radiance field and rely on classical volume rendering (Max, 1995; Levoy, 1988) to generate images. Note that the volume rendering equation adopted in NeRF over-simplifies the optical modeling of solid surfaces (Yariv et al., 2021; Wang et al., 2021a), reflectance (Chen et al., 2021c; Verbin et al., 2021; Chen et al., 2022), inter-surface scattering, and other effects. This implies that radiance fields combined with volume rendering are not a universal imaging model, which may have limited the generalization ability of NeRFs as well.
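For concreteness, the classical volume rendering formula that these methods rely on (Max, 1995), in the discrete form used by NeRF, composes the sampled colors $\mathbf{c}_i$ and densities $\sigma_i$ along a ray $\mathbf{r}$ as

```latex
\hat{C}(\mathbf{r}) \;=\; \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
\qquad
T_i \;=\; \exp\!\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big),
```

where $\delta_i$ is the distance between adjacent samples and $T_i$ is the accumulated transmittance. It is exactly this fixed alpha-compositing rule that GNT replaces with a learned, attention-based aggregation along the ray.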
In this paper, we first consider the problem of transferable novel view synthesis as a two-stage information aggregation process: multi-view image feature fusion, followed by sampling-based rendering integration. Our key contributions come from using transformers (Vaswani et al., 2017) for both of these stages. Transformers have had resounding success in language modeling (Devlin et al., 2018) and computer vision (Dosovitskiy et al., 2020), and their "self-attention" mechanism can be thought of as a universal trainable aggregation function. In our case, for volumetric scene representation, we train a view transformer to aggregate pixel-aligned image features (Saito et al., 2019) from corresponding epipolar lines to predict coordinate-wise features. For rendering a novel view, we develop a ray transformer that composes the coordinate-wise point features along a traced ray via the attention mechanism. These two form the Generalizable NeRF Transformer (GNT). GNT simultaneously learns to represent scenes from source view images and to perform scene-adaptive ray-based rendering using the learned attention mechanism. Remarkably, GNT predicts novel views from the captured images without per-scene fitting. Our promising results suggest that transformers are strong, scalable, and versatile learning backbones for graphical rendering (Tewari et al., 2020). Our key contributions are:
1. A view transformer that aggregates multi-view image features complying with epipolar geometry and infers coordinate-aligned features.
2. A ray transformer that performs learned ray-based rendering to predict the target color.
3. Experiments demonstrating that GNT's fully transformer-based architecture achieves state-of-the-art results on complex scenes and in cross-scene generalization.
4. Analysis of the attention modules showing that GNT learns to be depth- and occlusion-aware.
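The ray transformer stage described above can be illustrated with a toy, single-head sketch: given per-point features along one ray (as the view transformer would produce), self-attention mixes information across the samples, and a pooled readout maps the result to a pixel color. All function names, weight shapes, and the mean-pool readout here are illustrative assumptions for exposition, not GNT's actual architecture or trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ray_attention_render(point_feats, Wq, Wk, Wv, Wc):
    """Toy single-head attention over samples along one ray.

    point_feats: (N, d) coordinate-wise features of N samples on the ray.
    Wq, Wk, Wv:  (d, d) projection matrices (random here; learned in practice).
    Wc:          (d, 3) readout mapping pooled features to RGB.
    Returns an RGB estimate of shape (3,) in (0, 1).
    """
    q, k, v = point_feats @ Wq, point_feats @ Wk, point_feats @ Wv
    # Each sample attends to every other sample on the same ray.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))      # (N, N)
    fused = attn @ v                                    # (N, d)
    # Simple mean-pool readout stands in for the learned ray aggregation.
    pooled = fused.mean(axis=0)                         # (d,)
    return 1.0 / (1.0 + np.exp(-(pooled @ Wc)))         # sigmoid -> RGB
```

The point is that nothing in this pipeline hard-codes alpha compositing: the per-ray weighting is produced by attention and is free to be learned end-to-end from data.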
Overall, our combined Generalizable NeRF Transformer (GNT) demonstrates that many of the inductive biases that were thought necessary for view synthesis (e.g. persistent 3D model, hard-coded rendering equation) can be replaced with attention/transformer mechanisms.

2. RELATED WORK

Transformers (Vaswani et al., 2017) have emerged as a ubiquitous learning backbone that captures long-range correlations in sequential data. They have shown remarkable success in language understanding (Devlin et al., 2018; Dai et al., 2019; Brown et al., 2020), computer vision (Dosovitskiy et al., 2020; Liu et al., 2021), speech (Gulati et al., 2020), and even protein structure prediction (Jumper et al., 2021), amongst others. In computer vision, Dosovitskiy et al. (2020) successfully demonstrated Vision Transformers (ViT) for image classification. Subsequent works extended ViT to other vision tasks, including object detection (Carion et al., 2020), segmentation (Chen et al., 2021b; Wang et al., 2021c), video processing (Zhou et al., 2018a; Arnab et al., 2021), and 3D instance processing (Guo et al., 2021; Lin et al., 2021). In this work, we apply transformers to view synthesis by learning to reconstruct neural radiance fields and render novel views.

Neural Radiance Fields (NeRF), introduced by Mildenhall et al. (2020), synthesize consistent and photorealistic novel views by fitting each scene as a continuous 5D radiance field parameterized by an MLP. Since then, several works have improved NeRFs further, for example Mip-NeRF (Barron et al., 2021).

Transformer Meets Radiance Fields. Most similar to our work are NeRF methods that apply transformers for novel view synthesis and generalize across scenes. IBRNet (Wang et al., 2021b) processes sampled points on the ray using an MLP to predict color values and density features, which are then input to a transformer to predict density. Recently, NeRFormer (Reizenstein et al., 2021) and Wang et al. (2022) use attention modules to aggregate source views and construct a feature volume under epipolar geometry constraints. However, a key difference with our work is that all of them decode the aggregated features into an explicit radiance field and still rely on classical volume rendering to generate images.

Availability: https://vita-group.github.io/GNT/

