SWITCH-NERF: LEARNING SCENE DECOMPOSITION WITH MIXTURE OF EXPERTS FOR LARGE-SCALE NEURAL RADIANCE FIELDS

Abstract

Neural Radiance Fields (NeRF) have recently been applied to reconstruct building-scale and even city-scale scenes. To model a large-scale scene efficiently, a dominant strategy is a divide-and-conquer paradigm based on scene decomposition, which splits a complex scene into parts that are further processed by different sub-networks. Existing large-scale NeRFs mainly use heuristic hand-crafted scene decomposition, with regular 3D-distance-based or physical-street-block-based schemes. Although achieving promising results, such hand-crafted schemes limit the capabilities of NeRF in large-scale scene modeling in several aspects. Manually designing a universal scene decomposition rule for different complex scenes is challenging, leading to adaptation issues across scenarios. The decomposition procedure is not learnable, hindering the network from jointly optimizing the scene decomposition and the radiance fields in an end-to-end manner. The different sub-networks are typically optimized independently, so hand-crafted rules are required to composite them for better consistency. To tackle these issues, we propose Switch-NeRF, a novel end-to-end large-scale NeRF with learning-based scene decomposition. We design a gating network to dispatch 3D points to different NeRF sub-networks. By a design based on the Sparsely-Gated Mixture of Experts (MoE), the gating network can be optimized together with the NeRF sub-networks for different scene partitions. The outputs from different sub-networks are also fused in a learnable way within the unified framework to effectively guarantee the consistency of the whole scene. Furthermore, the proposed MoE-based Switch-NeRF model is carefully implemented and optimized to achieve both high-fidelity scene reconstruction and efficient computation. Our method establishes clear state-of-the-art performance on several large-scale datasets.
To the best of our knowledge, we are the first to propose an applicable end-to-end sparse NeRF network with learning-based decomposition for large-scale scenes. Codes are released at https://github.com/MiZhenxing/Switch-NeRF. Along the lines of scene decomposition and sparse neural networks, the recent Mega-NeRF (Turki et al., 2022) and Block-NeRF (Tancik et al., 2022) have extended NeRF to building-scale and even city-scale scenes based on heuristic hand-crafted scene decomposition. As depicted in Fig. 1, Mega-NeRF and Block-NeRF simply use 3D sampling distances or street blocks to decompose the scene and train different NeRF models separately. Despite promising results on large-scale scenes, their hand-crafted scene decomposition methods still lead to several issues. Large-scale scenes are essentially complex and irregular, and designing a universal scene decomposition rule for different scenes by hand is extremely challenging. This accordingly brings adaptation issues for distinct scenarios in the real world. Hand-crafted rules require rich priors on the target scene, such as the structure of the scene to deploy the partition centroids as in Mega-NeRF, or the physical distribution of the scene images as in Block-NeRF. These priors may not be available in practical applications. Moreover, the hand-crafted decomposition is not learnable, hindering the network from jointly optimizing the scene decomposition and the radiance fields in an end-to-end manner. The gaps between the decomposition, the composition and the NeRF optimization may lead to sub-optimal results. Besides, the different sub-networks are typically trained separately, leading to possible inconsistency among them. To handle this problem, these methods usually introduce overlaps between adjacent partitions in training and use hand-crafted rules in inference to composite the results from different sub-networks (Tancik et al., 2022; Turki et al., 2022).
To address the above-mentioned issues, in this paper we make the following contributions. An end-to-end framework for joint learning of scene decomposition and NeRF. We present Switch-NeRF, an end-to-end sparse neural network framework which jointly learns the scene decomposition and the NeRF. As shown in Fig. 1c, we propose a learnable gating network for scene decomposition. It dynamically selects and sparsely activates a sub-network for each 3D point. The overall network is trained end-to-end without any heuristic intervention. We do not require any priors on the 3D scene shape or the distribution of scene images, leading to a generic framework for large-scale scenes. Since the selection of sub-networks in training is a discrete operation, a critical problem is how to back-propagate gradients into the gating network. We use the strategy from the Sparsely-Gated Mixture-of-Experts (MoE) (Shazeer et al., 2017) to deal with this problem. We structure our sub-networks as NeRF experts for different scene partitions. 3D points are dispatched to different NeRF experts by the gating network.

1. INTRODUCTION

The Neural Radiance Fields (NeRF) method (Mildenhall et al., 2020) has gained wide popularity in novel-view synthesis and 3D reconstruction due to its high quality and simplicity. It encodes a 3D scene from multiple posed 2D images. The original NeRF typically targets small scenes or objects, while in real-world applications such as autonomous driving and augmented reality (AR) / virtual reality (VR), building NeRF models that effectively handle large-scale scenes is critically important. The problem of a large-scale NeRF is that more data typically requires a higher network capacity (i.e. number of network parameters). A naïve solution is to densely increase the network width and depth. However, this also greatly increases the computation for each sample and makes the network harder to optimize. A more applicable network should have a large capacity while maintaining an almost constant computational cost per sample. Therefore, building an applicable large-scale NeRF can be considered as building a sparse neural network. The core of the design is to select different network parameters (i.e. sub-networks) for different inputs. This procedure can be formulated as a scene decomposition problem in the NeRF task. Each sub-network handles a different part of the scene.

Figure 1: Different kinds of decomposition methods. The dotted lines denote non-differentiable operations. The solid lines denote differentiable operations that can be trained by back-propagation. Mega-NeRF (Turki et al., 2022) clusters pixels by 3D sampling distances to centroids in training. Block-NeRF (Tancik et al., 2022) clusters images by dividing the whole scene according to street blocks. The sub-networks in both methods are trained separately. Our Switch-NeRF learns to decompose the 3D points by a trainable gating network, and the whole network is trained end-to-end.
Besides the gating network, we also design a head to unify the predictions of the multiple NeRF experts, which aligns the high-level implicit features from different NeRF experts to effectively address the inconsistency problem. Efficient network design and implementation. Given the framework design of Switch-NeRF, optimizing and implementing it efficiently and stably is not trivial. In NeRF, the number of 3D points in a forward pass is orders of magnitude larger than the number of input tokens in NLP and vision tasks within an MoE framework. Dispatching samples to different NeRF experts inevitably introduces large computation and memory usage. Therefore, we dispatch 3D points only once with an effective gating network, and we design a deeper gating network to guarantee enough parameters to boost the accuracy of scene rendering. Another common design in MoE implementations (Hwang et al., 2022) is to define a capacity factor to limit the number of tokens dispatched to each expert. This dynamically drops overflow 3D points for our NeRF experts. It works well when training the network but significantly harms testing accuracy. To address this issue, we implement a full dispatch operation in CUDA based on Tutel (Hwang et al., 2022) to significantly improve the testing performance while avoiding unnecessary memory allocation. High-quality results. Extensive experiments are conducted on challenging benchmarks with large-scale scenes. Qualitative results demonstrate that our network can learn a reasonable decomposition of large-scale complex scenes. Our model also establishes state-of-the-art performance, showing clearly superior results with far fewer network parameters compared to hand-crafted decomposition counterparts.

2. RELATED WORK

Neural Radiance Field. The Neural Radiance Field (NeRF) is proposed by Mildenhall et al. (2020) to use volumetric rendering for novel view synthesis from posed images. It encodes a 3D scene into a multilayer perceptron (MLP), which is simple and requires very limited priors on the scene. Due to the success of NeRF in high-quality rendering and 3D reasoning, many works have been proposed to improve its efficiency (Reiser et al., 2021; Yu et al., 2021; Müller et al., 2022) and accuracy (Barron et al., 2021; Verbin et al., 2022), and to apply it to challenging scenes (Zhang et al., 2020; Martin-Brualla et al., 2021; Xiangli et al., 2022) and 3D reconstruction tasks (Wang et al., 2021). We pay particular attention to the closely related works, i.e. NeRF methods for large-scale scenes. As shown in Fig. 1a, Mega-NeRF (Turki et al., 2022) uses a simple 3D-distance-based method to cluster training pixels into parts that can be trained separately by different NeRF models. It samples centroids uniformly in the 3D scene and groups 3D points accordingly in testing. Block-NeRF (Tancik et al., 2022) scales NeRF to city-level scenes. As shown in Fig. 1b, it divides the whole scene based on the physical distribution of the scene images, i.e. partitioning by street blocks. The scene images in different street blocks are trained separately by sub-networks. In testing, both methods perform offline fusion of the predictions from different sub-networks. In contrast to Mega-NeRF and Block-NeRF, which use hand-crafted scene decomposition, our Switch-NeRF jointly learns the scene decomposition and a large-scale NeRF in an end-to-end manner. Our method does not depend on any priors on the 3D shape or the physical image distribution of a target scene. Therefore, our method is more generic for arbitrary large-scale scenes.

Mixture of Experts. The Sparsely-Gated Mixture-of-Experts layer is proposed by Shazeer et al. (2017) in place of the feed-forward network (i.e. FFN or MLP) in a language model.
It designs a vanilla Top-k gating network to dispatch samples into k experts, and proposes an auxiliary loss to balance the training of different experts. MoE has been widely used in Natural Language Processing (NLP) (Lepikhin et al., 2021; Fedus et al., 2022) and Vision (Riquelme et al., 2021). Switch Transformer (Fedus et al., 2022) shows that Top-2 gating is not necessary: it trains a high-quality network with Top-1 gating, largely reducing the dispatch computation and communication. Besides, different gating mechanisms have also been proposed, such as Hash Routing (Roller et al., 2021) and BASE (Lewis et al., 2021). There are popular implementations of MoE in Mesh-TensorFlow (Shazeer et al., 2018), DeepSpeed (Rajbhandari et al., 2022) and Tutel (Hwang et al., 2022), which focus on improving large-scale distributed training. To the best of our knowledge, none of the existing works considers developing a Mixture-of-NeRF-Experts (MoNE) for large-scale scenes. We design an effective MoNE structure for jointly learning scene decomposition and NeRF, and also improve the efficiency of MoNE by handling the issue of dispatching large-scale 3D points to different NeRF experts.

3. METHOD

[Figure 2: Overview of Switch-NeRF. A 3D point encoding PE(x) is sent to the gating network, which produces a gate value for each NeRF expert (Expert 1-4). Only the Top-1 expert (here, Expert 2) is activated; its output feature, scaled by the gate value of Expert 2, is combined with the direction encoding PE(d) for the RGB prediction and volume rendering.]

Both the 3D point x and the view direction d are processed by the positional encoding PE(·) (Mildenhall et al., 2020). We omit PE(·) in our following equations for simplicity. Given an input (x, d), we first send x into the gating network and obtain the gate values G(x).
Then, we apply a Top-1 operation on G(x) to determine which expert should be activated. As shown in Fig. 2, only one expert (i.e. Expert 2) is activated. The point x is dispatched to this selected expert, and the other experts do not participate in processing x. The output feature of the expert, E(x), is multiplied by the gate value corresponding to this expert. This allows the gating network to be trained jointly with the expert networks. After that, the feature is used to predict the density σ and the color c together with d and the appearance embedding AE. In the following, we first introduce the details of the trainable gating and our gating network architecture in Sec. 3.1. Then we introduce our expert and head network architectures in Sec. 3.2. We discuss the capacity factor and full dispatch in Sec. 3.3. We finally formulate the rendering procedure and our loss functions in Sec. 3.4.

3.1. SPARSE GATING IN SWITCH-NERF

The sparse gating network plays an important role in our Switch-NeRF because it determines the optimization routes of different NeRF experts. In Switch-NeRF, we only use one gating network because a very large number of 3D points requires gating and dispatching in NeRF optimization, and multiple gating operations would largely decrease the training and testing speed. Trainable gating in Switch-NeRF. The trainable gating in our network follows the Top-1 mechanism of Switch Transformer (Fedus et al., 2022). The gate value G(x) is an n-dimensional vector normalized via Softmax, in which G(x)_i represents the probability of selecting the i-th NeRF expert. We apply a Top-1 function on G(x) to sparsely select only one expert E_s from the set of NeRF experts {E_i}, i = 1, ..., n, for each 3D point. The input x is dispatched to the selected expert E_s to obtain an output E_s(x). The final output Ẽ(x) is the output of E_s multiplied by the corresponding gate value: Ẽ(x) = G(x)_s E_s(x). (1) As the predicted gate values are multiplied with the corresponding outputs of the NeRF experts, the gating network can be optimized together with the NeRF experts in the backward pass. This enables the network to directly learn scene decomposition during training. Our network structure is highly sparse because we use Top-1 gating to select only one expert for each sample.
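As a minimal illustration of Eq. (1), the Top-1 routing can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's implementation: the random linear maps below substitute for the real gating and expert networks.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_experts, dim = 4, 8
W_gate = rng.normal(size=(dim, n_experts))                          # stand-in gating map
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]   # stand-in experts

def switch_forward(x):
    """Top-1 gating: route x to one expert and scale its output by the gate value."""
    g = softmax(x @ W_gate)      # G(x), probabilities over the n experts
    s = int(np.argmax(g))        # Top-1 expert index
    e_out = x @ experts[s]       # E_s(x): only the selected expert runs
    return g[s] * e_out, s       # Ẽ(x) = G(x)_s * E_s(x), as in Eq. (1)

x = rng.normal(size=(dim,))
out, s = switch_forward(x)
```

Because the gate value g[s] multiplies the expert output, gradients of the rendering loss flow back into `W_gate`, which is what makes the discrete routing trainable.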

Figure 3: The architectures of our gating network, one expert, and the head network. It shows the forward pass of a point: the point goes through the gating network and a selected expert, and is then passed to a shared prediction head.

Gating network architecture.

In previous MoE methods (Shazeer et al., 2017; Lepikhin et al., 2021), the gating network is typically a simple linear mapping. This choice is reasonable because they usually stack multiple MoE layers. The input of the gating network in a deep MoE layer can be high-level features from previous layers, so these gating networks share much information with the main network. Let the original input be the 3D point x. Typically, x first goes through a sub-network S with several layers, yielding an internal feature S(x). The real gating operation of a deep MoE layer can thus be written as: G(x) = Softmax(Linear(S(x))). (2) Therefore, the gating networks in these methods actually share a sub-network S with the main network. Considering that we only use one gating network in our Switch-NeRF for efficiency, allocating more parameters to our gating network is necessary to learn a powerful gating function. We could put the gating network closer to the prediction head so that it shares more parameters with the main network. However, this would also make each sample point share more layers before the gating network. In this case, the capacity of the whole network shrinks, as the network capacity is controlled by the number of layers in each unshared expert. Therefore, as shown in Fig. 3, we put the gating network at the beginning of Switch-NeRF to maximize the number of layers in the expert networks. In order to allocate more parameters to the gating network, we use a shallow MLP instead of a linear mapping. Our gating network G(x) consists of 4 Linear layers and 1 LayerNorm, as shown in Fig. 3. Our design balances the efficiency, the network sparsity and the number of parameters of the gating network.
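A rough sketch of such a gating head follows, written in plain NumPy. The 4 Linear layers and the single LayerNorm match the description above, but the layer widths, the ReLU activations and the placement of the LayerNorm before the final gating Linear are our assumptions for illustration.

```python
import numpy as np

def layernorm(h, eps=1e-5):
    mu, var = h.mean(-1, keepdims=True), h.var(-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def relu(h):
    return np.maximum(h, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
dim, hidden, n_experts = 3, 16, 8  # illustrative sizes (3D point in, 8 experts out)
Ws = [rng.normal(size=s) * 0.1 for s in
      [(dim, hidden), (hidden, hidden), (hidden, hidden), (hidden, n_experts)]]

def gate(x):
    """S(x): three Linear+ReLU layers with a LayerNorm, then Linear+Softmax -> G(x)."""
    h = relu(x @ Ws[0])
    h = relu(h @ Ws[1])
    h = layernorm(relu(h @ Ws[2]))
    return softmax(h @ Ws[3])  # per-point probability over the n experts

probs = gate(rng.normal(size=(5, dim)))  # gate values for a batch of 5 points
```

Each row of `probs` is a distribution over the experts; its argmax is the Top-1 expert used for dispatch.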

3.2. EXPERT AND HEAD NETWORKS

NeRF Expert Network. The set of n NeRF experts {E_i} in Switch-NeRF contains most of the network parameters. As shown in Fig. 3, each expert in Switch-NeRF is a deep MLP with a skip connection. The structure and depth of the experts are aligned with the main structure of the vanilla NeRF. Each expert only processes a part of the 3D points, determined by the gating network. The output of the selected expert is multiplied by the corresponding gate value to obtain a feature vector Ẽ(x). This feature vector is then used to predict the density σ and the direction-dependent color c. By increasing the number of NeRF experts n, we can easily scale up the network capacity. Unified NeRF Head. The head network H makes the final predictions for each input sample. It is shared by all the samples, as shown in Fig. 3. After obtaining the expert output Ẽ(x), we use a Linear layer with a Softplus activation (Zheng et al., 2015) to predict σ. The use of Softplus follows Mip-NeRF (Barron et al., 2021) for a stable prediction of σ. Then, Ẽ(x) goes through a Linear layer and is concatenated with PE(d) and an appearance embedding AE. The color c is predicted from the concatenated feature by an MLP. The appearance embedding AE is a trainable vector that captures image-level photometric and environmental variations (Martin-Brualla et al., 2021).
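The unified head can be sketched as follows, again as a toy NumPy version: the feature, encoding and hidden dimensions are illustrative assumptions, a Sigmoid is assumed for the final RGB as in vanilla NeRF, and the real head is of course a trained MLP.

```python
import numpy as np

def softplus(z):
    # numerically stable softplus: log(1 + exp(z)), always > 0
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
feat_dim, dir_dim, ae_dim, hid = 256, 27, 48, 128  # illustrative dimensions
W_sigma = rng.normal(size=(feat_dim, 1)) * 0.05
W_feat = rng.normal(size=(feat_dim, hid)) * 0.05
W_c1 = rng.normal(size=(hid + dir_dim + ae_dim, hid)) * 0.05
W_c2 = rng.normal(size=(hid, 3)) * 0.05

def head(expert_feat, pe_d, ae):
    """Shared head: Softplus density from the expert feature; color from feature + PE(d) + AE."""
    sigma = softplus(expert_feat @ W_sigma)                       # density, non-negative
    h = np.concatenate([expert_feat @ W_feat, pe_d, ae], axis=-1)  # concat with dir + AE
    c = sigmoid(np.maximum(h @ W_c1, 0.0) @ W_c2)                  # RGB in (0, 1)
    return sigma, c

sigma, c = head(rng.normal(size=(4, feat_dim)),
                rng.normal(size=(4, dir_dim)),
                rng.normal(size=(4, ae_dim)))
```

Because every expert's output passes through this one head, the experts are pushed to produce features in a shared space, which is the mechanism behind the consistency argument above.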

3.3. CAPACITY FACTOR AND FULL DISPATCH

Capacity factor for training. In our Switch-NeRF, it is important to dispatch 3D points to different experts efficiently. Einops-based dispatch (Lepikhin et al., 2021) causes memory overflow due to the large number of 3D points. In our training, we use the CUDA-based fast dispatch in Tutel (Hwang et al., 2022). Following previous MoE methods (Lepikhin et al., 2021), we set a capacity factor for each NeRF expert in training. It caps the number of sample points dispatched to each NeRF expert, leading to uniform tensor shapes and balanced computation and communication. Let B be the batch size of the whole network, C_f be the capacity factor, and n be the number of NeRF experts. Then the maximum number of sample points going into each NeRF expert is B_e = ceil(k · B · C_f / n), where k is the number of experts selected per point (k = 1 for our Top-1 gating). The dispatch with a capacity factor can be called a uniform dispatch. Fig. 4 shows the uniform dispatch with C_f = 1.0. Overflow points are dropped. If the capacity is not fully used, it is zero-padded. A larger capacity factor decreases the dropping ratio but increases the memory and computation. In our network, we set the capacity factor to 1.0 without requiring extra memory. We use Batch Prioritized Routing (Riquelme et al., 2021) to improve the training under low expert capacity. Full dispatch for testing.
Previous MoE methods usually use the uniform dispatch for both training and testing (Lepikhin et al., 2021; Fedus et al., 2022), which inevitably drops sample points. In our Switch-NeRF, although the uniform dispatch works well in training, we observe that dropping sample points can significantly decrease the test accuracy. A possible reason is that, to maintain network efficiency, we do not use stacked MoE layers with skip connections between them as previous MoE methods usually do. To improve the testing accuracy, we implement an efficient full dispatch strategy based on PyTorch, CUDA, and Tutel. It dispatches all the points to their corresponding experts with only a slight memory increase. The strategy is depicted in Fig. 4. With this efficient full dispatch, we largely improve the test accuracy.
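The difference between the two dispatch modes can be sketched on the CPU (a simplified toy version; the paper's implementation is a CUDA kernel built on Tutel): uniform dispatch caps each expert at B_e = ceil(k · B · C_f / n) and drops overflow points, while full dispatch keeps every point in variable-size buckets.

```python
import math

def uniform_dispatch(assign, n_experts, C_f=1.0, k=1):
    """Cap each expert at B_e = ceil(k * B * C_f / n); overflow points are dropped."""
    B = len(assign)
    B_e = math.ceil(k * B * C_f / n_experts)
    kept = {e: [] for e in range(n_experts)}
    dropped = []
    for idx, e in enumerate(assign):
        if len(kept[e]) < B_e:
            kept[e].append(idx)   # within capacity: dispatched to expert e
        else:
            dropped.append(idx)   # overflow under the capacity cap: dropped
    return B_e, kept, dropped

def full_dispatch(assign, n_experts):
    """Group every point index by its Top-1 expert; nothing is dropped or padded."""
    buckets = {e: [] for e in range(n_experts)}
    for idx, e in enumerate(assign):
        buckets[e].append(idx)
    return buckets

assign = [2, 0, 2, 2, 2, 1, 0, 2]  # Top-1 expert index for each of B = 8 points
B_e, kept, dropped = uniform_dispatch(assign, n_experts=4, C_f=1.0)  # B_e = 2
buckets = full_dispatch(assign, n_experts=4)
```

In this example expert 2 is popular, so uniform dispatch drops three of its points while full dispatch keeps all eight; this is exactly the gap that hurts test-time rendering quality.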

3.4. VOLUME RENDERING AND LOSSES

Our volume rendering procedure follows the vanilla NeRF (Mildenhall et al., 2020). The training data of Switch-NeRF consists of multiple posed images. For each pixel in the training images, we back-project a camera ray r into 3D space. A set of N samples is taken along the ray. For each sample (x, d), our network F_Θ predicts a volume density σ and a color c = (r, g, b). Let δ_i be the distance between adjacent samples. The expected color Ĉ(r) of this pixel is synthesized along the ray by the volume rendering function of NeRF (Mildenhall et al., 2020):

Ĉ(r) = Σ_{i=1}^{N} T_i (1 - exp(-σ_i δ_i)) c_i, where T_i = exp(-Σ_{j=1}^{i-1} σ_j δ_j). (3)

Rendering loss. The main loss of Switch-NeRF is the rendering loss L_r. After rendering the color Ĉ(r) of a ray from our network through Equation 3, we compute L_r for supervision:

L_r = Σ_{r∈R} ||C(r) - Ĉ(r)||², (4)

where R is the set of sampled rays and C(r) is the ground-truth color of ray r in the training images. Auxiliary loss tackling imbalanced optimization. One problem of training MoE-based networks is that the gating network can favor only a few experts (Shazeer et al., 2017). The optimization and utilization of the NeRF experts will thus be imbalanced. This can even leave several experts untrained, so that the whole network converges to a sub-optimal solution and cannot scale well, since part of the network capacity is not utilized. Following Shazeer et al. (2017) and Lepikhin et al. (2021), we use an auxiliary loss L_a to regularize the gating network and balance the utilization of NeRF experts. Specifically, following the differentiable load-balancing loss proposed in GShard (Lepikhin et al., 2021), we define our auxiliary loss as follows. Given n NeRF experts and a batch B of N sample points, let c_i be the number of points dispatched to expert E_i by Top-1. We first compute the total gating value m_i assigned to E_i as m_i = Σ_{x∈B} G(x)_i.
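As a sanity check of the balance property of L_a (it equals 1 under perfectly balanced dispatch), the statistics c_i and m_i and the resulting loss can be computed on synthetic gate values; this is a toy sketch, not the training code.

```python
import numpy as np

def aux_loss(gate_probs):
    """L_a = (n / N^2) * sum_i c_i * m_i for N points and n experts.

    c_i: number of points whose Top-1 expert is i; m_i: summed gate values for expert i.
    (In training, the m_i term is what carries gradients back to the gating network.)
    """
    N, n = gate_probs.shape
    top1 = gate_probs.argmax(axis=-1)
    c = np.bincount(top1, minlength=n).astype(float)  # Top-1 dispatch counts per expert
    m = gate_probs.sum(axis=0)                        # total gate value per expert
    return (n / N**2) * np.sum(c * m)

n, N = 4, 8
balanced = np.zeros((N, n))
balanced[np.arange(N), np.arange(N) % n] = 1.0  # each expert gets N/n points, gate value 1
skewed = np.zeros((N, n))
skewed[:, 0] = 1.0                              # all points routed to expert 0
L_balanced, L_skewed = aux_loss(balanced), aux_loss(skewed)  # 1.0 and 4.0
```

The balanced case evaluates to exactly 1 while collapsing all points onto one expert gives n, so minimizing L_a pushes the gating toward uniform expert utilization.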
The auxiliary loss L_a is then computed as

$L_a = \frac{n}{N^2} \sum_{i=1}^{n} c_i m_i.$

This loss encourages balanced gating because it is minimized when the dispatching is perfectly balanced: in that case c_i and m_i are both expected to be N/n, so $\sum_{i=1}^{n} c_i m_i = N^2/n$ and L_a = 1. The total loss L is the weighted sum of the two losses:

$L = L_r + \lambda L_a,$

where λ is the weight of the auxiliary loss. We set λ = 5 × 10^{-4} for all our main results, which is sufficient to balance the utilization of the experts.

Datasets. We use the datasets from Mega-NeRF (Turki et al., 2022) and the Residence, Sci-Art, and Campus datasets from UrbanScene3D (Liu et al., 2021) to evaluate our Switch-NeRF. Each scene contains thousands of high-resolution images. The camera parameters of all images are the same as in Mega-NeRF (Turki et al., 2022). These datasets cover large areas while still being manageable on consumer-level workstations with commonly used GPUs.

Metrics. We use PSNR, SSIM (Wang et al., 2004) (both higher is better) and the VGG implementation of LPIPS (Zhang et al., 2018) (lower is better) to quantitatively evaluate our results on novel view synthesis. PSNR measures the mean squared error between two images in logarithmic space, SSIM focuses on structural similarity, and LPIPS measures perceptual similarity.

Visualization. Besides the rendered images, we visualize the 3D radiance fields. We sample 3D points along rays and use α_i = 1 − exp(−σ_i δ_i) as the opacity of each 3D point. We use the Point Cloud Library (Rusu & Cousins, 2011) to show the color and opacity of each 3D point.
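To make Section 3.4 concrete, the volume-rendering quadrature and the auxiliary load-balancing loss defined there can be sketched in NumPy. This is a toy, single-ray reconstruction from the formulas, not the batched PyTorch training code; all function names are ours:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """C(r) = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i along one ray.

    sigmas: (N,) densities; colors: (N, 3); deltas: (N,) inter-sample distances.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                      # per-sample opacity
    accum = np.concatenate([[0.0], np.cumsum(sigmas * deltas)])  # running optical depth
    trans = np.exp(-accum[:-1])                                  # T_i up to (not including) sample i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)               # expected pixel color

def aux_load_balance_loss(gates):
    """GShard-style balance loss L_a = (n / N^2) * sum_i c_i * m_i.

    gates: (N, n) gating values G(x) for a batch of N points and n experts.
    """
    N, n = gates.shape
    c = np.bincount(gates.argmax(axis=1), minlength=n).astype(float)  # Top-1 counts per expert
    m = gates.sum(axis=0)                                             # gate mass per expert
    return n / (N * N) * float((c * m).sum())
```

Consistent with the analysis above, a perfectly balanced batch of gates yields L_a = 1, while routing everything to a single one of two experts doubles it.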

4.2. SETTING

Similar to NeRF++ (Zhang et al., 2020) and Mega-NeRF, the 3D scene space is split into a foreground and a background. We use 8 experts with Top-1 gating and C_f = 1.0 in the foreground to learn the scene decomposition end-to-end, and one original NeRF for the background. This differs from Mega-NeRF, which uses a foreground and a background NeRF for each of its 8 sub-networks, requiring more network parameters. Each of our NeRF experts contains 7 layers with 256 channels per layer. We sample 256 coarse and 512 fine points per ray in the main network and 128/256 samples in the background network. We use 8 NVIDIA RTX 3090 GPUs for distributed data-parallel training and sample 1024 rays per GPU. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate decaying exponentially from 5×10^-4 to 5×10^-5, and use bfloat16 in training and float16 in testing to reduce memory and time. We train 500k iterations on each dataset and test on the validation images.

Unified head. We remove the shared unified head and instead add a separate head for each NeRF expert. Table 2 shows that without the unified head the accuracy decreases significantly. We also observe that in this case the network is not robust to mixed-precision training and testing. A possible reason is that the gating value is multiplied directly into the predictions rather than into high-level features, which makes the gating and prediction unstable in training and testing. This verifies our motivation for designing a unified head shared by the NeRF experts.

Table 6 shows that the results of the uniform dispatch with C_f = 2.0 degrade largely. Although the results with C_f = 4.0 improve, they are still far below the full dispatch. Moreover, it costs more time than the full dispatch because of the large C_f with zero padding. Fig. 4 shows that images rendered with the uniform dispatch contain many more artifacts than the full dispatch, proving the effectiveness of the full dispatch in Switch-NeRF.
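For reference, the exponential learning-rate decay described above can be written as follows. This is one common parameterization that matches the stated endpoints; the exact schedule form is our assumption, as only the start and end values are given:

```python
def lr_at(step, total_steps=500_000, lr_start=5e-4, lr_end=5e-5):
    """Exponential decay from lr_start at step 0 to lr_end at total_steps."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

With these values the rate decays by a constant factor per step, reaching 5×10^-5 exactly at the final iteration.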

5. CONCLUSION

In this paper, we present Switch-NeRF. To the best of our knowledge, it is the first sparse large-scale NeRF with learnable scene decomposition. We propose an end-to-end mixture-of-NeRF-experts framework to learn the scene decomposition jointly with the NeRF. We further design and implement an efficient network architecture and a full dispatch strategy to boost accuracy. Extensive experiments demonstrate that our network learns a more reasonable scene decomposition and shows state-of-the-art accuracy on scene synthesis compared to hand-crafted decomposition methods.

C TWO GATING OPERATIONS

We add an additional ablation study on two gating operations. We train a Switch-NeRF with 2 gating networks, each acting as a gating operation. The first gating network is added after the first linear layer of the main network, and the second gating network is added after the fourth linear layer. The total number of expert layers is the same as in the Switch-NeRF with 1 gating network, so that the two models have similar capacity. The results in terms of both accuracy and efficiency are shown in Table 9. From the results, we can clearly see that two gating operations cost more training time, training memory, and testing time, while achieving accuracy similar to the model with one gating operation. This suggests that the number of gating operations is not critical for model performance, and one gating operation already shows sufficient capability in dispatching points.

D MORE EXPERTS

The number of experts directly controls the network capacity of Switch-NeRF. For larger-scale scenes we may need to increase the number of experts to get better results. In this section we add more experiments on the number of experts. In Table 10, the results show that more experts consistently produce better results on larger-scale datasets. We additionally visualize the 3D radiance fields of a Switch-NeRF trained with 16 experts on the UrbanScene3D-Sci-Art dataset in Figure 7. Compared with Figure 6 in the main paper, we can see that our network still roughly specializes to different fine-grained semantic parts of the scene.



(a) Learning after distance-based decomposition (e.g. Mega-NeRF). (b) Learning after physical-distribution-based decomposition (e.g. Block-NeRF). (c) Learning with scene decomposition (ours).


Figure 2: The framework of our Switch-NeRF. A 3D point x first goes through a gating network and is then dispatched to only one expert according to the gating network output. The expert output is multiplied by the corresponding gate value and sent to a head for density σ and color c prediction with the direction d and an appearance embedding. The rendering loss is used for supervision. The images on the left of each expert visualize the 3D radiance fields handled by the different experts.

3. SWITCH-NERF

Our Switch-NeRF is a sparse Neural Radiance Field network targeting large-scale scenes. A framework overview of Switch-NeRF is depicted in Fig. 2. We represent a large-scale 3D scene as a sparse 5D radiance function F_Θ: (x, d) → (c, σ), where (x, d) are a 3D point and its view direction, (c, σ) are the predicted color and density, and Θ represents the network parameters. F_Θ sparsely activates only a part of its parameters for each input x. The overall structure of our network mainly consists of a gating network G, a set of n experts {E_i}_{i=1}^{n}, and a shared prediction head H for generating σ and c. The actual inputs of our network are the positional encodings (PE(x), PE(d)) of (x, d), as in the vanilla NeRF (Mildenhall et al., 2020). We omit PE(·) in the following equations for simplicity.
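To make the routing concrete, here is a toy single-point forward pass in NumPy mirroring this structure (gating network → Top-1 expert → gate-scaled features → shared head). The layer shapes, the single-linear-layer gate, the one-layer ReLU experts, and the linear head are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SwitchNeRFSketch:
    def __init__(self, d_in=63, d_hid=32, n_experts=8):
        self.gate_w = rng.normal(size=(n_experts, d_in)) * 0.1   # gating network G
        self.experts = [rng.normal(size=(d_hid, d_in)) * 0.1     # experts E_1..E_n
                        for _ in range(n_experts)]
        self.head_w = rng.normal(size=(4, d_hid)) * 0.1          # shared head H -> (sigma, r, g, b)

    def forward(self, pe_x):
        """pe_x: positional encoding PE(x). The direction d and appearance
        embedding also enter the real head; omitted here for brevity."""
        gates = softmax(self.gate_w @ pe_x)
        e = int(gates.argmax())                                    # Top-1: only one expert runs
        feat = gates[e] * np.maximum(self.experts[e] @ pe_x, 0.0)  # expert features scaled by gate value
        sigma_raw, *rgb = self.head_w @ feat
        return max(sigma_raw, 0.0), np.array(rgb)                  # density kept non-negative
```

The key property is sparsity: for any input, only one expert's parameters are evaluated, so capacity grows with n while per-point compute stays roughly constant.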


Figure 4: In the uniform dispatch, each expert has the same tensor shape: overflow tokens are dropped and unused slots are padded. The full dispatch ensures every point is processed by an expert. The images rendered by the uniform dispatch have apparent artifacts.

Figure 5: The comparison of the rendered images from Mega-NeRF and our Switch-NeRF. Our method renders more details and tiny structures than Mega-NeRF. Please zoom in to see the details.

The testing results of our Switch-NeRF on large-scale datasets. Our method achieves state-of-the-art accuracy compared to the dense NeRF, NeRF++, and the sparse Mega-NeRF.

Comparisons of parameter count, time, memory, and FLOPs. Although it requires more training time and memory, our network uses less testing memory and has far fewer parameters while achieving better accuracy.

The accuracy and testing time of the uniform dispatch with C_f = 2.0 and 4.0, and of the full dispatch. The full dispatch clearly performs better with less time usage.

Gating network design. We analyze different designs of our gating network. As shown in Table 2, the Linear gating simply feeds PE(x) into a trainable linear layer. The w/o Norm version has no LayerNorm. We also evaluate the gating without the auxiliary loss L_a. The Linear gating, with fewer parameters, does not generate satisfactory results. The MLP+Norm boosts the performance of the whole network. The network without the auxiliary loss L_a does not even converge. This shows that L_a is vital for the success of MoE methods, consistent with the observations in Shazeer et al. (2017).

Table 3 shows the accuracy and efficiency of Switch-NeRF with different numbers of experts. With only 4 experts, we already achieve accuracy similar to Mega-NeRF with 8 sub-networks. Compared to the model with 8 experts, the model with 16 experts increases the performance remarkably, without a large increase in testing memory and time. Compared to Mega-NeRF with 16 sub-networks, our Switch-NeRF with 16 experts scales much better, with higher accuracy, fewer parameters, and a smaller memory footprint in testing. All of this proves the scalability of our Switch-NeRF: it obtains much better results by increasing network capacity while maintaining almost constant computational and memory costs. Effect of Top-2 and capacity factor. Table 4 studies using Top-2 and C_f when training Switch-NeRF. Increasing C_f from 1.0 to 1.5 or using Top-2 in training can improve the accuracy while increasing the training time and memory, especially for Top-2. With our design of Switch-NeRF, Top-1 with C_f = 1.0 already obtains good results with acceptable efficiency, suggesting that increasing C_f rather than k is the better choice, because C_f in training does not affect testing with the full dispatch. Efficiency. In Table 5, we compare the efficiency of Switch-NeRF with Mega-NeRF. Our model achieves better accuracy using only half the parameters of Mega-NeRF. Mega-NeRF uses a background NeRF and a separate version of the appearance embeddings (AE) for each sub-network, leading to many more parameters. As Switch-NeRF is trained end-to-end, it only uses one background NeRF and one version of the AE. Our network is more parameter-efficient and more accurate. Since it has a gating network and is trained end-to-end, it reasonably requires more floating-point operations (FLOPs) per point and costs slightly more memory and time in training. Notably, compared to Mega-NeRF, it uses around 20% less testing memory, with only a very minor increase in testing time.
Our Switch-NeRF achieves good efficiency together with better accuracy. Full dispatch. We test our trained model with the uniform dispatch and with our full dispatch discussed in Sec. 3.3. A capacity factor C_f of 2.0 and 4.0 with Batch Prioritized Routing (Riquelme et al., 2021) is used in the uniform dispatch.
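For contrast with the full dispatch sketched earlier, the capacity-limited uniform dispatch can be reconstructed as below. Each expert gets a fixed buffer of ceil(C_f · N / n) slots; overflow points are dropped and unused slots zero-padded. This is our own illustrative reconstruction of the standard MoE scheme, using first-come-first-served routing rather than Batch Prioritized Routing, and not Tutel's actual kernels:

```python
import numpy as np

def uniform_dispatch(points, top1, n_experts, capacity_factor=1.0):
    """Capacity-limited dispatch. Returns per-expert buffers of identical shape
    and a boolean mask of which points were kept (not dropped)."""
    N, D = points.shape
    cap = int(np.ceil(capacity_factor * N / n_experts))  # slots per expert
    buffers = np.zeros((n_experts, cap, D))              # zero padding for empty slots
    kept = np.zeros(N, dtype=bool)
    fill = np.zeros(n_experts, dtype=int)
    for i in range(N):
        e = top1[i]
        if fill[e] < cap:
            buffers[e, fill[e]] = points[i]
            fill[e] += 1
            kept[i] = True
        # else: the point overflows its expert's capacity and is dropped
    return buffers, kept
```

The fixed buffer shapes are what make this scheme friendly to batched hardware, but any routing imbalance beyond the capacity silently drops points, which is exactly the failure mode the full dispatch avoids at test time.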

Accuracy of Switch-NeRF on the San Francisco Mission Bay dataset proposed by Block-NeRF. We train Switch-NeRF with limited computing resources, half-float precision, and a much smaller batch size compared to Block-NeRF. Besides, Switch-NeRF does not utilize the important dynamic object masks used by Block-NeRF, as they are not available in the released training data.

Accuracy and efficiency with different numbers of gating operations. Switch-NeRF with two gating operations and the same number of expert layers does not improve the rendering accuracy over the model with one gating operation, while consuming more memory and time in training and testing.

The accuracy of different numbers of experts. Increasing the number of experts consistently improves the accuracy.

ACKNOWLEDGEMENTS

This research is supported in part by HKUST-SAIL joint research funding, the Early Career Scheme of the Research Grants Council (RGC) of the Hong Kong SAR under grant No. 26202321 and HKUST Startup Fund No. R9253.


The main quantitative results are reported in Table 1 and qualitative results in Fig. 5. The statistics of NeRF (Mildenhall et al., 2020), NeRF++ (Zhang et al., 2020), SVS (Riegler & Koltun), DeepView (Flynn et al., 2019), and Mega-NeRF are quoted from Mega-NeRF. The MLP width of both NeRF and NeRF++ is set to 2048 to obtain a capacity similar to Mega-NeRF and Switch-NeRF. As shown in Table 1, our Switch-NeRF achieves state-of-the-art accuracy on all the datasets in terms of PSNR, SSIM, and LPIPS. It produces much better accuracy than NeRF and NeRF++. This confirms the effectiveness of scaling NeRF with a sparse Mixture of Experts instead of densely expanding the network. Our method also outperforms the sparse Mega-NeRF, showing the advantage of learned decomposition over hand-crafted scene decomposition. Fig. 5 presents the rendered images of Mega-NeRF and Switch-NeRF. Our method renders more tiny structures and details than Mega-NeRF. These main results demonstrate the overall performance of our method.

The random decomposition cannot fully utilize the additional parameters in the network. The distance-based decomposition performs better and is similar to Mega-NeRF. Our Switch-NeRF with learned decomposition achieves the best accuracy. NeRF expert specialization. We visualize the 3D radiance fields of UrbanScene3D-Sci-Art for Mega-NeRF and Switch-NeRF in Fig. 6. Mega-NeRF partitions the scene into regular parts.

4.4. MODEL ANALYSIS

In Switch-NeRF, the experts have been roughly specialized to different semantic parts of the scene. Experts 1 and 2 focus on different buildings. Expert 3 focuses on the green grass and trees on the ground. Expert 5 specializes in other parts of the ground. Our Switch-NeRF achieves a more reasonable and semantically meaningful scene decomposition in a learnable way.

B BLOCK-NERF DATASET

We train a Switch-NeRF on the San Francisco Mission Bay dataset of Block-NeRF (Tancik et al., 2022). It consists of 12,000 images with an image resolution of around 1200x900. Although the data volume is similar to the datasets used by Mega-NeRF and Switch-NeRF, the training setting of Block-NeRF is quite different from ours. Block-NeRF uses much larger computing and memory resources: it trains each scene block with 32 TPU v3 cores offering 512 GB of memory in total, which is hard to reproduce on common workstations. The batch size of Block-NeRF for each block is 16384, significantly larger than the average batch size of 1024 for each sub-network in Mega-NeRF and Switch-NeRF. The networks of Block-NeRF are trained in full precision while ours use bfloat16 precision. Besides, an important part of the training data, i.e., the masks for dynamic objects in the scenes used by Block-NeRF, is not included in the released training data. Block-NeRF also does not open-source its code, which makes it difficult for us to align its hyper-parameters on their training data. Due to our lack of comparable computing resources and the unavailability of the same training data used in Block-NeRF, it is not feasible for us to train a Switch-NeRF under the same setting as Block-NeRF for a direct and fair comparison. We thus train our network on the Block-NeRF dataset with a setting similar to that used in our paper. We use 8 NVIDIA RTX 3090 GPUs to train a Switch-NeRF model with 8 experts, and sample 1664 rays per GPU, resulting in a batch size of 1664 on average for each expert. This is far less than the batch size of 16384 for each block in Block-NeRF. The width of each layer is set to 512 as in Block-NeRF. We also use the positional encoding of Mip-NeRF as used by Block-NeRF. We sample 256 points per ray for the coarse and fine networks in training, and 512 points per ray in testing.
We use the validation dataset of Block-NeRF to evaluate the performance with the PSNR, SSIM and LPIPS metrics. The experimental results are shown in Table 8 . 

E FAST RENDERING

We explore the capability of Switch-NeRF to be integrated into octree-based fast rendering techniques, similar to the interactive exploration used in Mega-NeRF-Dynamic (Turki et al., 2022). As the detailed evaluation protocol for the faster rendering is not provided by Mega-NeRF-Dynamic, we define a protocol to evaluate both methods on the validation images of each dataset. Specifically, we first convert the trained model into a coarse octree. Given a viewpoint, we render a coarse image directly from the coarse octree. Then, we follow the dynamic octree refinement strategy proposed by Mega-NeRF-Dynamic to refine the octree for the current viewpoint over several rounds by querying the trained model; here we use 16 rounds. We finally render a refined image from the refined octree. We evaluate both the Mega-NeRF model and the Switch-NeRF model under this same fast-rendering protocol, and report the PSNR and the average octree refinement time in Table 11, and images in Figure 8. As shown in Table 11, since the octrees before refinement are very coarse, the coarse images for both Mega-NeRF and Switch-NeRF are of low quality. After the dynamic octree refinement, the image quality improves. Since the resolution of the refined octrees is still limited, the quality is not comparable to the original model. However, our Switch-NeRF obtains similar or better results on the validation datasets compared to Mega-NeRF. It should be noted that the octrees do not handle the background, and thus the rendered images may have black regions in both Mega-NeRF and Switch-NeRF, which decreases the PSNR values. However, as shown in Figure 8, our method renders refined images of good quality in the foreground. This shows that our Switch-NeRF can be flexibly and effectively integrated into existing fast rendering techniques.

