ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (2024)

Sergey Zakharov, Katherine Liu, Adrien Gaidon, and Rareș Ambruș
Toyota Research Institute, Los Altos, USA
{sergey.zakharov, katherine.liu, adrien.gaidon.ctr, rares.ambrus}@tri.global


Abstract.

State-of-the-art methods for multi-shape representation (a single model "packing" multiple objects) commonly trade modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset. Project page: https://zakharos.github.io/projects/refine/

compression, neural fields, level of detail, recursion, self-similarity

Journal year: 2024. Copyright: ACM licensed. Conference: SIGGRAPH Conference Papers '24, July 27-August 1, 2024, Denver, CO, USA. DOI: 10.1145/3641519.3657526. ISBN: 979-8-4007-0525-0/24/07. Submission ID: 1267. CCS Concepts: Computing methodologies - Machine learning; Computer graphics; Computer vision representations.

1. Introduction

Neural fields that encode scene properties at arbitrary resolutions using neural networks have reached unprecedented levels of detail. Typically using fully-connected multi-layer perceptrons (MLPs) to predict continuous field values, they have been used to represent geometry and appearance with applications in computer vision (Tancik et al., 2022), robotics (Rashid et al., 2023), and computer graphics (Mitra et al., 2019). However, most high-fidelity methods are limited to single scenes (Müller et al., 2022; Takikawa et al., 2021, 2022a) and overfit to the target geometry or appearance (Müller et al., 2022), while methods that capture multiple shapes typically sacrifice high-frequency details (Jang and Agapito, 2021; Park et al., 2019; Mescheder et al., 2019), limiting their utility for applications such as streaming and representation learning. We would like to enable the compression of multiple complex shapes, each into a single vector, with a single neural network, while maintaining the ability to reconstruct high-frequency geometric and textural information.

Global conditioning methods (sit, 2019; Jang and Agapito, 2021; Park et al., 2019) (i.e., one latent vector per shape) are capable of learning latent spaces over large numbers of shapes but require ground-truth 3D supervision and suffer when representing high-frequency details. Conversely, locally-conditioned methods partition the implicit function by leveraging hybrid discrete-continuous neural scene representations, effectively blurring the line between classical data structures and neural representations and allowing for more precise reconstructions by handling scenes as collections of localized primitives. These methods typically encode single scenes and leverage a secondary data structure (Takikawa et al., 2021; Zakharov et al., 2022; Müller et al., 2022), trading off additional memory for a less complex neural function mapping feature vectors to the target signal. Recently, (Zakharov et al., 2022) proposed to take advantage of both global and local conditioning via a recursive octree formulation, but the approach only captures geometry and outputs oriented point clouds that do not allow for continuous querying of the underlying implicit function, precluding the application of techniques such as ray-tracing.

In this work, we propose to encode many scenes represented as fields in a single network, where each scene is denoted by a single latent vector in a high-dimensional space. We show how entire datasets of colored shapes can be encoded into a single neural network without sacrificing high-frequency details (color or geometry) and without incurring a high memory cost. Key to our approach is a recursive formulation that allows us to effectively combine local and global conditioning. Our main motivation for a recursive structure comes from the observation that natural objects are self-similar (Shechtman and Irani, 2007), that is, they are similar to a part of themselves at different scales. This property is famously exploited by fractal compression methods (Jacquin, 1990). Our method effectively extends prior work to the continuous setting, which allows us to recover geometry and color information with a higher degree of fidelity than previously possible. Our novel formulation allows us to learn from direct 3D supervision (SDF plus optionally RGB), as well as from continuous-valued fields (NeRFs). We also investigate the properties of the resulting latent space, and our results suggest the emergence of structure based on shape and appearance similarity. We address the limitations of related methods for representing multiple 3D shapes through ReFiNe: Recursive Field Networks. Our contributions are:

  • A novel implicit representation parameterized by a recursive function that efficiently combines global and local conditioning, allowing continuous spatial interpolation and multi-scale feature aggregation.

  • Thanks to its recursive formulation, ReFiNe scales to multiple 3D assets represented as fields without having to maintain auxiliary data structures, leading to a compact and efficient network structure. We demonstrate a single network representing more than 1000 objects with high quality while reducing the memory required by 99.8%.

  • ReFiNe is cross-modal, i.e., it supports various output 3D geometry and color representations (e.g., SDF, SDF+Color, and NeRF), and its output can be rendered with sphere raytracing (SDF), iso-surface projection (SDF), or volumetric rendering (NeRF).

2. Related Work


2.1. Neural Fields for Representing Shapes

Neural fields have emerged as powerful learners thanks to their ability to encode any continuous function up to an arbitrary level of resolution. For a survey of recent progress please refer to (Xie et al., 2021). Shapes are typically represented as Signed Distance Functions (Park et al., 2019; Sitzmann et al., 2020a, b) or by occupancy probabilities (Mescheder et al., 2019; Peng et al., 2020; Chen and Zhang, 2019), with the encoded mesh extracted through methods such as sphere tracing (Liu et al., 2020b). Hybrid discrete-continuous data structures have enabled encoding single objects to a very high degree of accuracy (Takikawa et al., 2021, 2022a; Müller et al., 2022; Wang et al., 2022; Kim et al., 2024; Yi et al., 2023), and extensions have been proposed to model articulated (Deng et al., 2020; Mu et al., 2021) and deformable (Deng et al., 2021; Palafox et al., 2021) objects. Alternatively, training on multiple shapes leads to disentangled latent spaces (Park et al., 2019; Chen and Zhang, 2019; Tang et al., 2021), which can be used for differentiable shape optimization (Zakharov et al., 2021; Irshad et al., 2022), shape generation (Chen and Zhang, 2019; Yang et al., 2019; Cai et al., 2020; Zeng et al., 2022), interpolation (Williams et al., 2022), and completion (Zhou et al., 2021). A number of methods have been proposed which continuously model and update scene geometry within the context of Simultaneous Localization and Mapping (SLAM) (Sucar et al., 2021; Ortiz et al., 2022). Some methods also leverage recursion to improve the reconstruction accuracy of neural fields (Yang et al., 2022; Zakharov et al., 2022). The recently proposed ROAD (Zakharov et al., 2022) is most similar to ours, as it also uses a recursive octree structure and can represent the surfaces of multiple objects with a single network. However, it does not encode color, and it outputs a discrete fixed-resolution reconstruction, making it unsuitable for applications that require volumetric rendering or ray-tracing. In contrast, ReFiNe outputs continuous feature fields that can be used to represent various continuous representations, such as (but not limited to) colored SDFs and NeRFs.

2.2. Differentiable Rendering Advances

Advances in differentiable rendering (Kato et al., 2020; Tewari et al., 2021) through techniques such as volume rendering (Lombardi et al., 2019) or ray marching (Niemeyer et al., 2020) have led to methods that learn to represent geometry, appearance, and other scene properties from image inputs without needing direct 3D supervision. Leveraging ray marching, (sit, 2019) regresses RGB colors at surface intersections, allowing it to learn from multi-view images, while (Niemeyer et al., 2020) couples an implicit shape representation with differentiable rendering. Building on (Lombardi et al., 2019), Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) regress density and color values along directed rays (5D coordinates) instead of regressing SDF or RGB values at 3D coordinates. This simple yet effective representation boosted interest in implicit volumetric rendering and resulted in a multitude of works tackling problems from training and rendering time performance (Rebain et al., 2021; Lindell et al., 2021; Tancik et al., 2021; Liu et al., 2020a), to dynamic scenes (Park et al., 2021; Pumarola et al., 2021; Xian et al., 2021), scene relighting (Martin-Brualla et al., 2021; Bi et al., 2020; Srinivasan et al., 2021), and composition (Ost et al., 2021; Yuan et al., 2021; Niemeyer and Geiger, 2021). To achieve competitive results, NeRF-style methods require a large number of input views and perform poorly in the low-data regime (Zhang et al., 2020), which can be improved by leveraging external depth supervision (Neff et al., 2021; Wei et al., 2021; Deng et al., 2022). Image supervision has also been used to learn 3D object-centric models without any additional information (Stelzner et al., 2021; Yu et al., 2022; Sajjadi et al., 2022a), through a combination of Slot Attention (Locatello et al., 2020) and volumetric rendering. Alternatively, a number of methods train generalizable priors over multiple scenes (sit, 2019; Yu et al., 2021; Jang and Agapito, 2021; Sajjadi et al., 2022b; Guizilini et al., 2022). In (Jang and Agapito, 2021) the authors learn a prior over objects that are represented as radiance fields via MLPs and parameterized by appearance and shape codes. As we show through experiments, the design of our recursive neural 3D representation leads to a latent space that promotes reusability of color and geometric primitives across shapes, enabling higher-accuracy reconstructions than previously possible.

3. Methodology


We would like to learn to represent a set of objects $\mathcal{O}=\{O_1,\dots,O_K\}$. In particular, we are interested in representing objects as fields, where each object is a mapping from a 3D coordinate in space to a value of dimension $F$, i.e., $O_k:\mathbb{R}^3\to\mathbb{R}^F$. Examples of common fields are Signed Distance Fields (where $F=1$ and the value of the field indicates the distance to the nearest surface) and radiance fields (where $F=4$, representing RGB and density values). For each object, we assume supervision in the form of $N_k$ coordinate and field value tuples $\{\bm{x}_j, f_j\}_{j=0}^{N_k}$, where $\bm{x}\in\mathbb{R}^3$ and $f\in\mathbb{R}^F$ is the field value.
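As a concrete reading of this setup, a per-object training sample could be typed roughly as follows; this is a minimal sketch, and the `FieldSamples` name and layout are illustrative rather than part of any released code.

```python
from dataclasses import dataclass
import torch


@dataclass
class FieldSamples:
    """Supervision for one object O_k: N_k query coordinates with target field values."""
    coords: torch.Tensor  # (N_k, 3) query points x_j
    values: torch.Tensor  # (N_k, F) field values f_j, e.g. F=1 for an SDF, F=4 for a radiance field
```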

3.1. ReFiNe

Our method represents each shape $O_k$ with a $D$-dimensional latent vector $\bm{z}^0$ that is recursively expanded into an octree with a maximum Level-of-Detail (LoD) $M$. Each level of the octree corresponds to a feature volume. We then perform both spatial and hierarchical feature aggregation before decoding into field values. Crucially, the expansion of each latent vector into an octree-based neural field is achieved via the same simple MLP for each LoD, and decoders are shared across all objects in $\mathcal{O}$. Once optimized, ReFiNe represents all $K$ objects with a set of $K$ latent vectors, a recursive autodecoder for octree expansion, an occupancy prediction network, and field-specific decoders (i.e., for RGB, SDF, etc.). Figure 1 illustrates how, after training, our method can extract neural fields given different optimized LoD 0 latents, where we have dropped the superscript for readability. Figure 2 shows a more detailed overview of a reconstruction given a single input latent.

3.1.1. Recursive Subdivision & Pruning

Given a latent vector $\bm{z}^m\in\mathbb{R}^D$ from LoD $m$, our recursive autodecoder subdivision network $\phi:\mathbb{R}^D\to\mathbb{R}^{8D}$ traverses an octree by latent subdivision:

(1)  $\phi(\bm{z}^m)\to\{\bm{z}^{m+1}_i\}_{i=0}^{7}$

Thus, a latent is divided into 8 cells, each with an associated child latent that is positioned at the cell’s center. Cell locations are defined by the Morton space-filling curve(Morton, 1966).

Each child latent is then further decoded to an occupancy value $o$ using occupancy network $\omega:\mathbb{R}^D\to\mathbb{R}$. Rather than continuing to expand the tree for all child latents, ReFiNe selects a subset based on the predicted occupancy value:

(2)  $\mathcal{Z}^{m+1}=\{\bm{z}^{m+1}\in\phi(\bm{z}^m)\mid\omega(\bm{z}^{m+1})>0.5\},$

where $\mathcal{Z}^{m+1}$ is the set of child latents from a particular parent latent $\bm{z}^m$ with predicted occupancies above a threshold of $0.5$, from which the next set of children will be recursed. This process can be seen in the left inset of Fig. 2. To supervise occupancy predictions, we further assume access to the structure of the ground-truth octree during training, i.e., annotations of which voxels at each LoD are occupied. If a voxel is predicted to be more likely unoccupied during reconstruction, we prune it from the octree structure.

To build the set of latents at a particular LoD, the latent expansion process described by Equations 1 and 2 for a single latent is applied to all unpruned child latents from the previous LoD. In this way, ReFiNe recursively expands a latent octree from a single root latent $\bm{z}^0$ to a set of latents at the desired LoD.
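To make the recursion concrete, the following PyTorch sketch mirrors Equations 1 and 2 under simplifying assumptions: `subdivide` and `occupancy` are hypothetical stand-ins for $\phi$ and $\omega$, plain ReLU MLPs replace the SIREN networks used in the paper, and voxel positions and Morton indexing are omitted.

```python
import torch
import torch.nn as nn

D = 64  # latent dimension (example value)

# Hypothetical stand-ins for the subdivision network phi and occupancy network omega;
# the actual ReFiNe networks are SIREN-based and trained jointly with the latents.
subdivide = nn.Sequential(nn.Linear(D, 1024), nn.ReLU(), nn.Linear(1024, 8 * D))
occupancy = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 1))


def expand_octree(z0: torch.Tensor, max_lod: int, threshold: float = 0.5):
    """Recursively expand a root latent into per-LoD latent sets, pruning children
    whose predicted occupancy falls below the threshold (cf. Eqs. 1 and 2)."""
    latents_per_lod = []
    current = z0.unsqueeze(0)  # (1, D) root latent at LoD 0
    for _ in range(max_lod):
        children = subdivide(current).reshape(-1, D)          # phi: D -> 8D, split into 8 children
        occ = torch.sigmoid(occupancy(children)).squeeze(-1)  # omega: predicted occupancy per child
        current = children[occ > threshold]                   # keep only (predicted) occupied cells
        latents_per_lod.append(current)
        if current.numel() == 0:                              # nothing left to expand
            break
    return latents_per_lod
```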

3.1.2. Multiscale Feature Fusion

Once an octree is constructed, it can be decoded to various outputs depending on the desired field parametrization. As mentioned, we use $\omega$ to decode each recursively extracted latent vector to occupancy. However, to model more complex signals with high-frequency details (e.g., SDF or RGB), we found that directly decoding latents positioned at voxel centers results in coarse approximations at low octree LoDs and is directly tied to the voxel size, presenting challenges in scaling to high resolutions and/or complex scenes. Instead, we approximate latents at sampled locations by performing trilinear interpolation over the spatially surrounding latents at the same LoD. We repeat this at every LoD except the first and then fuse the resulting intermediate latents, as shown in Fig. 2, into a new latent $\bm{\bar{z}}\in\mathbb{R}^{\bar{D}}$, where the dimension $\bar{D}$ of the fused latent depends on whether a concatenation or summation scheme is used. In the summation scheme, the latent size remains unchanged, i.e., $\bar{D}=D$, whereas in the concatenation scheme it is equal to the original latent size $D$ multiplied by the maximum LoD $M$.
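The sketch below illustrates the interpolation-and-fusion step on dense per-LoD feature volumes; the dense grids, the `fuse_features` name, and the use of `torch.nn.functional.grid_sample` are our simplifying assumptions, whereas ReFiNe interpolates over sparse octree latents and skips LoD 0 during fusion.

```python
import torch
import torch.nn.functional as F


def fuse_features(grids, points, mode="concat"):
    """Trilinearly interpolate a feature for each query point at every LoD, then fuse across scales.

    grids  : list of dense per-LoD feature volumes, each of shape (1, D, R, R, R)
             (a dense stand-in for ReFiNe's sparse octree latents)
    points : (P, 3) query coordinates in [-1, 1]
    """
    grid = points.view(1, -1, 1, 1, 3)  # grid_sample expects 5D sampling locations
    per_lod = []
    for vol in grids:
        # mode="bilinear" performs trilinear interpolation for 5D inputs
        f = F.grid_sample(vol, grid, mode="bilinear", align_corners=True)
        per_lod.append(f.view(vol.shape[1], -1).t())  # (P, D) interpolated latents at this LoD
    if mode == "concat":
        return torch.cat(per_lod, dim=-1)             # D_bar = D * number of fused LoDs
    return torch.stack(per_lod, dim=0).sum(dim=0)     # summation scheme: D_bar = D
```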


3.1.3. Geometry Extraction and Rendering

Similar to (Zakharov et al., 2022), once the feature octree has been extracted for a given object, we can decode the voxel centers into field values. However, our resulting representation can also be used to differentiably render images via volumetric rendering. We first estimate AABB intersections with voxels at the highest LoD. Given entry and exit points for each voxel, we then sample points within the voxel volume, enabling rendering via methods such as sphere ray tracing and volumetric compositing.

3.2. Field Specific Details

To demonstrate the utility and flexibility of ReFiNe, we focus in this work on two popular choices of object fields: Signed Distance Fields (SDF) (Park et al., 2019) for representing surfaces and Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) for volumetric rendering and view synthesis. ReFiNe regresses field-specific signals via neural mappings that map regressed latents, and optionally the viewing direction, to the desired output (e.g., SDF, SDF and RGB, or density and RGB). We denote the neural mapping responsible for geometry as $\psi$ and the neural mapping responsible for appearance as $\xi$, and discuss specific instantiations below.

3.2.1. SDF

Each fused latent $\bm{\bar{z}}$, regressed via spatial interpolation over the octree and fused over multiple LoDs, is given to network $\psi:\mathbb{R}^{\bar{D}}\to\mathbb{R}$ to estimate an SDF value $s$ corresponding to the distance to the closest surface, with positive and negative values representing exterior and interior regions, respectively. When dealing with colored objects, we introduce network $\xi:\mathbb{R}^{\bar{D}}\to\mathbb{R}^3$ to estimate a 3D vector $\textbf{c}=(r,g,b)$ representing RGB color.

To quickly extract points on the surface of the object, we can simply decode $s$ for the coordinate of each occupied voxel at the highest LoD and calculate the point normal by taking the derivative of $s$ w.r.t. the spatial coordinates. If more points are desired, we can additionally sample within occupied voxels to obtain more surface points. Given further computation time, we may also render the encoded scene via sphere ray tracing, i.e., at each step we query an SDF value within the occupied voxels, which defines the sphere radius for the next step, and repeat the process until we reach the surface. The latents at the surface points are then used to estimate color values. Figures 3 and 4 show qualitative examples of iso-surface projection and sphere ray tracing, respectively.
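For illustration, a minimal sphere-tracing loop consistent with the description above might look as follows; `sdf_fn`, the fixed step count, and the convergence threshold are assumptions rather than the paper's settings.

```python
import torch


def sphere_trace(sdf_fn, origins, dirs, n_steps=64, eps=1e-4):
    """March each ray forward by the queried SDF value until the surface is reached.

    sdf_fn  : maps (P, 3) points to (P,) SDF values; in ReFiNe this would be the
              fused-latent decoder psi, queried only inside occupied voxels.
    origins : (P, 3) ray origins; dirs: (P, 3) unit ray directions.
    """
    t = torch.zeros(origins.shape[0])
    hit = torch.zeros(origins.shape[0], dtype=torch.bool)
    for _ in range(n_steps):
        points = origins + t.unsqueeze(-1) * dirs
        s = sdf_fn(points)
        hit |= s.abs() < eps               # close enough to the surface
        t = torch.where(hit, t, t + s)     # the SDF value is a safe step size (sphere radius)
    return origins + t.unsqueeze(-1) * dirs, hit
```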

3.2.2. NeRF

When representing neural radiance fields, each fused multiscale feature is given to networks $\xi:\mathbb{R}^{\bar{D}+3}\to\mathbb{R}^3$ and $\psi:\mathbb{R}^{\bar{D}}\to\mathbb{R}$ to estimate a 4D vector $(\textbf{c},\sigma)$, where $\textbf{c}=(r,g,b)$ is the RGB color and $\sigma$ is the density per point. When trained on NeRFs, our color network additionally takes a 3-channel view direction vector $\bm{d}$, and the corresponding annotation $\mathcal{D}$ is augmented accordingly.

To render an image, each pixel value in the desired image frame is generated by compositing $K$ color predictions along the viewing ray via:

(3)  $\hat{\textbf{c}}_{ij}=\sum_{k=1}^{K} w_k \hat{\textbf{c}}_k,$

where the weights $w_k$ and accumulated transmittances $T_k$, given intervals $\delta_k=t_{k+1}-t_k$, are defined as follows:

(4)  $w_k = T_k\left(1-\exp(-\sigma_k\delta_k)\right)$
(5)  $T_k = \exp\left(-\sum_{k'=1}^{k-1}\sigma_{k'}\delta_{k'}\right)$

and $\{t_k\}_{k=0}^{K-1}$ are adaptively sampled depth values. Example visualizations of NeRF-based volumetric rendering can be seen in Fig. 5.
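The compositing in Equations 3-5 corresponds to standard NeRF-style alpha compositing, sketched below for a single ray; the function and variable names are illustrative.

```python
import torch


def composite(rgb, sigma, t_vals):
    """Alpha-composite per-sample colors along one ray (cf. Eqs. 3-5).

    rgb    : (K, 3) predicted colors; sigma: (K,) predicted densities;
    t_vals : (K + 1,) sample depths along the ray.
    """
    delta = t_vals[1:] - t_vals[:-1]                    # interval lengths delta_k
    alpha = 1.0 - torch.exp(-sigma * delta)             # per-interval opacity
    # accumulated transmittance T_k: exclusive prefix sum of sigma_k * delta_k
    trans = torch.exp(-torch.cumsum(torch.cat([torch.zeros(1), sigma * delta])[:-1], dim=0))
    weights = trans * alpha                             # w_k
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)     # composited pixel color
```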


3.3. Architecture and Training

The functions $\phi$, $\omega$, $\psi$, and $\xi$ are parameterized with single SIREN-based (Sitzmann et al., 2020b) MLPs using periodic activation functions, allowing high-frequency details to be resolved. We refer to these components together as the ReFiNe network.
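For reference, a sine-activated layer in the style of SIREN might be written as follows; the $\omega_0=30$ frequency and the initialization follow the original SIREN paper and are not necessarily ReFiNe's exact settings.

```python
import torch
import torch.nn as nn


class SirenLayer(nn.Module):
    """A sine-activated layer in the style of SIREN (Sitzmann et al., 2020b).
    omega_0 = 30 and the uniform weight initialization follow the original SIREN
    paper and are not necessarily the settings used by ReFiNe."""

    def __init__(self, in_dim: int, out_dim: int, omega_0: float = 30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)
        with torch.no_grad():
            bound = (6.0 / in_dim) ** 0.5 / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.omega_0 * self.linear(x))
```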

Our supervision objective consists of three terms: a binary cross-entropy occupancy loss $\mathcal{L}_o$, a geometry loss $\mathcal{L}_g$, and a color loss $\mathcal{L}_c$, minimizing the $\ell_2$ distance between the respective predictions and the ground-truth values in each object's field annotation $\mathcal{D}$.

The final loss is formulated as:

(6)  $\mathcal{L}=w_o\mathcal{L}_o+w_g\mathcal{L}_g+w_c\mathcal{L}_c,$

where $w_o=2$, $w_g=10$, $w_c=1$ for SDF, and $w_o=2$, $w_g=1$, $w_c=1$ for NeRF. The color loss is dropped entirely when training on purely geometric SDFs. During training, we optimize the parameters of the recursive autodecoder $\phi$, the occupancy prediction network $\omega$, the decoding networks $\xi,\psi$, as well as the set of $K$ LoD 0 latent variables $\{\bm{z}^0_i\}_{i=1}^{K}$, where each latent represents a single object in $\mathcal{O}$.
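A minimal sketch of the combined objective is given below, assuming pre-sigmoid occupancy logits and mean-squared error for the $\ell_2$ terms; the function signature is illustrative.

```python
import torch
import torch.nn.functional as F


def refine_loss(occ_logits, occ_gt, geo_pred, geo_gt, rgb_pred, rgb_gt,
                w_o=2.0, w_g=10.0, w_c=1.0):
    """Weighted training objective (cf. Eq. 6): binary cross-entropy on occupancy plus
    squared-error terms on geometry (SDF or density) and color. The weights shown are
    the SDF setting; for NeRF the paper uses w_g = 1, and the color term is dropped
    for geometry-only SDFs."""
    l_o = F.binary_cross_entropy_with_logits(occ_logits, occ_gt)
    l_g = F.mse_loss(geo_pred, geo_gt)
    l_c = F.mse_loss(rgb_pred, rgb_gt)
    return w_o * l_o + w_g * l_g + w_c * l_c
```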

All our networks are trained on a single NVIDIA A100 GPU until convergence. The convergence time varies based on the number and complexity of objects to be encoded, as well as the network’s configuration. It ranges from 10 hours for smaller datasets (Thingi32 and SRN Cars) to 40 hours for larger datasets (GSO and RTMV).

4. Experiments

To demonstrate the utility of our method, we perform experiments across a variety of datasets (Thingi32, ShapeNet150, SRN Cars, GSO, and RTMV) and field representations (SDF, SDF+RGB, and NeRF). We highlight that our method encodes entire datasets within a single neural network, and thus we aim to compare with baselines that focus on the same task and require the same kind of supervision, as opposed to methods that overfit to single shapes or scenes.

4.1. Network Details

For experiments on Thingi32 and ShapeNet150, ReFiNe's recursive autodecoder network $\phi$ consists of a single 1024-dimensional layer, and all decoding networks $\omega$, $\psi$, and $\xi$ use two layers of 256 fully connected units each. For the SRN Cars experiment we use a smaller-capacity network with two-layer decoding networks of 128 units. For GSO and RTMV, we increase the capacity of the ReFiNe network such that $\phi$ consists of a single 4096-dimensional layer and all decoding networks use two layers of 512 fully connected units each. We use the Adam solver (Kingma and Ba, 2014) with a learning rate of $2\times10^{-5}$ to optimize the weights of our networks and a learning rate of $1\times10^{-4}$ for the latent vectors. In general, when reporting network sizes we do not include the storage cost of latent vectors.

Throughout the experiments, we employ either concatenation (Tables 1 and 3) or summation latent fusion (Table 2). The summation fusion scheme preserves the network size across different possible LoDs by keeping input sizes constant for the decoder networks. On the other hand, the concatenation scheme comes at a higher storage cost, as the corresponding decoding networks must have larger input layers, but it results in improved reconstruction quality. For an ablation comparing the fusion schemes, please refer to the supplemental material.

4.2. Training Data Generation

For object datasets represented as meshes, we normalize meshes to a unit sphere and additionally scale by a factor of 0.9. We first generate an octree of the desired LoD covering the mesh. We then perform dilation to secure a sufficient feature margin for trilinear interpolation. Finally, we sample points around the surface and compute the respective SDF values. For colored shapes, we also sample points on the surface and store the respective RGB values.
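The preprocessing could look roughly like the following sketch, which uses trimesh as an assumed tooling choice; the band width, point count, and helper name are illustrative, while the 0.9 scale factor follows the text.

```python
import numpy as np
import trimesh


def prepare_sdf_samples(path, scale=0.9, n_points=100_000, band=0.02):
    """Normalize a mesh to the unit sphere (scaled by 0.9) and sample near-surface
    points with signed distances. The paper samples 10^6 points; a smaller default
    is used here to keep the example fast."""
    mesh = trimesh.load(path, force="mesh")
    center = (mesh.vertices.max(axis=0) + mesh.vertices.min(axis=0)) / 2.0
    mesh.apply_translation(-center)
    mesh.apply_scale(scale / np.linalg.norm(mesh.vertices, axis=1).max())

    surface = trimesh.sample.sample_surface(mesh, n_points)[0]
    points = surface + np.random.normal(scale=band, size=surface.shape)  # perturb around the surface
    # trimesh returns positive values inside the mesh; flip the sign to match the
    # paper's convention (positive exterior, negative interior).
    sdf = -trimesh.proximity.signed_distance(mesh, points)
    return points.astype(np.float32), sdf.astype(np.float32)
```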

Table 1. Surface reconstruction on Thingi32 and ShapeNet150.

Method               |            Thingi32              |           ShapeNet150
                     | CD↓    NC↑    gIoU↑  s↓    MB↓   | CD↓    NC↑    gIoU↑  s↓    MB↓
DeepSDF              | 0.088  0.941  96.4   0.14  7.4   | 0.250  0.933  90.2   0.12  7.4
Curriculum DeepSDF   | 0.102  0.941  96.3   0.14  7.4   | 0.214  0.903  93.3   0.12  7.4
ROAD / LoD6          | 0.138  0.959  96.4   0.03  3.2   | 0.175  0.928  86.3   0.01  3.8
ROAD / LoD7          | 0.045  0.969  98.4   0.03  3.2   | 0.067  0.936  94.2   0.01  3.8
ROAD / LoD8          | 0.022  0.971  98.7   0.04  3.2   | 0.041  0.935  94.9   0.02  3.8
ROAD / LoD9          | 0.017  0.970  98.7   0.08  3.2   | 0.036  0.931  94.9   0.06  3.8
ReFiNe / LoD4        | 0.023  0.980  98.8   0.07  3.1   | 0.041  0.945  96.6   0.04  3.7
ReFiNe / LoD5        | 0.022  0.981  99.1   0.07  3.1   | 0.036  0.944  96.5   0.05  3.8
ReFiNe / LoD6        | 0.019  0.981  99.4   0.07  3.2   | 0.027  0.954  97.4   0.05  3.8

To efficiently train ReFiNe on NeRFs, we first overfit single-scene NeRFs (Müller et al., 2022) on separate scenes. Each neural field can be constructed from a collection of RGB images $\{I_i\}_{i=0}^{N-1}$, where the camera intrinsic parameters $\textbf{K}_i\in\mathbb{R}^{3\times 3}$ as well as the extrinsics in $\mathbb{R}^{4\times 4}$ are assumed to be known. If ground-truth depth maps are provided (RTMV), then the octree structure for each scene is computed and subsequently used to supervise our recursive autodecoder $\phi$. If depth maps are not available (SRN Cars), we instead use adaptive pruning as implemented in (Müller et al., 2022). We then densely sample points, augmented with viewpoints, inside the octree and store ground-truth density and color values for later supervision of the geometry network $\psi$ and the color network $\xi$.

4.3. Reconstruction Benchmarks

4.3.1. Thingi32 / ShapeNet150 (SDF)

In the first benchmark we evaluate our method's ability to represent and reconstruct object surfaces in the form of an SDF. We follow the experimental setup of (Takikawa et al., 2021; Zakharov et al., 2022) and train two networks: one on a subset of 32 objects from Thingi10K (Zhou and Jacobson, 2016), denoted Thingi32, and another on a subset of 150 objects from ShapeNet (Chang et al., 2015), denoted ShapeNet150. We use a latent dimension of 64 for Thingi32 and a latent dimension of 80 for ShapeNet150. We compute the commonly used Chamfer distance (CD), gIoU, and normal consistency (NC) metrics to evaluate surface reconstruction, and we also record the memory footprint and inference time for each baseline. To extract a point cloud from ReFiNe, we utilize the zero-isosurface projection discussed in Section 3. Following ROAD's (Zakharov et al., 2022) setup, gIoU is computed by recovering the object mesh using Poisson surface reconstruction (Kazhdan et al., 2006). We compare to DeepSDF (Park et al., 2019) and Curriculum DeepSDF (Duan et al., 2020), using both methods' open-sourced implementations for data generation and training, with some minor hyper-parameter tuning to improve performance. Further details can be found in the supplemental material.


Table 2. Novel view synthesis on SRN Cars.

Method         | View Synthesis          | Runtime | Size
               | PSNR↑   SSIM↑   LPIPS↓  | s↓      | MB↓
SRN            | 28.02   0.95    0.06    | 0.03    | 198
CodeNeRF       | 27.87   0.95    0.08    | 0.17    | 2.8
ReFiNe / LoD4  | 28.19   0.95    0.08    | 0.03    | 2.6
ReFiNe / LoD5  | 29.80   0.96    0.06    | 0.04    | 2.6
ReFiNe / LoD6  | 30.19   0.96    0.06    | 0.04    | 2.6

Our results are summarized in Table 1. We note that our method outperforms the other SDF-based baselines with respect to Chamfer distance and gIoU while having the smallest storage requirements. Figure 3 qualitatively shows that DeepSDF (Park et al., 2019) and Curriculum DeepSDF (Duan et al., 2020) have difficulties reconstructing high-frequency details. ROAD (Zakharov et al., 2022), on the other hand, can recover high-frequency details but is discrete and outputs oriented point clouds with a fixed number of points at each level of detail. While ReFiNe has an analogous recursive backbone, it additionally performs multi-scale spatial feature interpolation and instead models the object as a continuous field. ReFiNe outperforms ROAD on the ShapeNet150 dataset and performs on par on Thingi32 while only needing to traverse the octree to LoD6, as opposed to the expensive traversal to LoD9 for ROAD. Additionally, we compare the average surface extraction times for all baselines. Notably, both our method and ROAD are significantly faster than DeepSDF and Curriculum DeepSDF. While ROAD demonstrates faster runtimes at the same LoD, it cannot sample values continuously and is limited to the extracted discrete cell centers. ReFiNe shows extraction times competitive with ROAD, and already at LoD6 it outperforms ROAD's LoD9 thanks to its ability to sample values continuously.

Table 3. Novel view synthesis on RTMV with varying latent sizes.

Method           | View Synthesis          | Runtime | Size
                 | PSNR↑   SSIM↑   LPIPS↓  | s↓      | MB↓
ReFiNe / Lat 32  | 24.18   0.83    0.23    | 1.19    | 8.4
ReFiNe / Lat 64  | 25.29   0.85    0.21    | 1.57    | 13.7
ReFiNe / Lat 128 | 25.96   0.86    0.20    | 2.34    | 24.3
ReFiNe / Lat 256 | 26.72   0.87    0.19    | 3.89    | 45.6


4.3.2. SRN Cars (NeRF)

In the next benchmark, we evaluate ReFiNe on another popular representation: Neural Radiance Fields (NeRFs). We use a feature dimension of 64 and compare our method against CodeNeRF (Jang and Agapito, 2021) and SRN (sit, 2019) on a subset of the SRN dataset consisting of 32 cars. We use 45 images for training and 5 non-overlapping images for testing on the task of novel view synthesis. As seen in Table 2, our representation outperforms both the SRN and CodeNeRF baselines. Fig. 6 shows that ReFiNe does better when it comes to reconstructing high-frequency details. To compare inference time for NeRF-based baselines, we compute the average rendering time over the test images of the SRN benchmark. Our method demonstrates runtimes similar to those of SRN, with both being significantly faster than CodeNeRF.

4.4. Scaling to Larger Datasets

Next, we demonstrate our model’s ability to scale to larger multi-modal datasets. For the experiments in this section, we use a latent size of 256.

4.4.1. GSO (SDF+RGB)

In the first experiment, we train ReFiNe to output a colored SDF field on the large Google Scanned Objects (GSO) dataset (Downs et al., 2022), containing 1030 diverse colored household objects targeting robotics applications. Despite the high complexity both in terms of geometry and color, our method achieves 0.044 Chamfer distance and 25.36 3D PSNR using a single network of size 45.6 MB together with a list of 256-dimensional latent vectors totaling 1.05 MB. Our method achieves a compression rate above 99.8% compared to storing the original meshes (1.5 GB) and corresponding textures (24.2 GB). Qualitative results are shown in Fig. 4 and demonstrate the reconstruction quality of our approach.

4.4.2. RTMV (NeRF)

In this experiment we demonstrate that our method is not limited to reconstructing objects and is able to cover diverse scenes of much higher complexity. We evaluate ReFiNe on the RTMV view synthesis benchmark (Tremblay et al., 2022), which consists of 40 scenes from 4 different environments (10 scenes each). Each scene comprises 150 unique views, with 100 views used for training, 5 for validation, and 45 for testing.

As the results in Fig. 5 show, ReFiNe is able to faithfully reconstruct the encoded scenes while storing all of them within a single network with low storage requirements and without specifically optimizing for compression. We attribute this to the recursive nature of our method, which splits scene space into primitives at each recursive step. As we show in Table 3, our most lightweight network is only 8.36 MB, resulting in an average storage requirement of 210 KB per scene while still achieving an acceptable reconstruction quality of 24.2 PSNR. Similar to the SRN benchmark, we also compute the average rendering time over the test images, observing a gradual increase in runtime with larger latent sizes. Additionally, we perform an ablation on the effect of the latent size on the final reconstruction. We report results in Table 3 and Fig. 7 and note that performance gradually degrades as the latent size is lowered, while storage requirements decrease at the same time.

5. Limitations and Future Work

Our representation is currently limited to bounded scenes. This limitation can potentially be resolved by introducing an inverted sphere scene model for backgrounds from (Zhang etal., 2020). We would also like to leverage diffusion-based generative models to explore the task of 3D synthesis conditioned on various modalities such as text, images, and depth maps.

Acknowledgements.

We would like to thank Prof. Greg Shakhnarovich for his valuable feedback and help with reviewing the draft for this paper.

References

  • sit (2019) Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. 2019. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In NeurIPS.
  • Adamkiewicz etal. (2022)Michal Adamkiewicz, Timothy Chen, Adam Caccavale, Rachel Gardner, Preston Culbertson, Jeannette Bohg, and Mac Schwager. 2022.Vision-only robot navigation in a neural radiance world.RA-L (2022).
  • Barron etal. (2021)JonathanT Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and PratulP Srinivasan. 2021.Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV.
  • Bi etal. (2020)Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. 2020.Neural reflectance fields for appearance acquisition.arXiv (2020).
  • Breyer etal. (2021)Michel Breyer, JenJen Chung, Lionel Ott, Roland Siegwart, and Juan Nieto. 2021.Volumetric grasping network: Real-time 6 dof grasp detection in clutter. In CoRL.
  • Cai etal. (2020)Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. 2020.Learning gradient fields for shape generation. In ECCV.
  • Chang etal. (2015)AngelX Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, etal. 2015.ShapeNet: An information-rich 3D model repository.arXiv (2015).
  • Chen and Zhang (2019)Zhiqin Chen and Hao Zhang. 2019.Learning implicit fields for generative shape modeling. In CVPR.
  • Davies etal. (2020)Thomas Davies, Derek Nowrouzezahrai, and Alec Jacobson. 2020.Overfit neural networks as a compact shape representation.arXiv (2020).
  • Deng etal. (2020)Boyang Deng, JP Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. 2020.Neural Articulated Shape Approximation. In ECCV.
  • Deng etal. (2022)Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. 2022.Depth-supervised nerf: Fewer views and faster training for free. In CVPR.
  • Deng etal. (2021)Yu Deng, Jiaolong Yang, and Xin Tong. 2021.Deformed implicit field: Modeling 3d shapes with learned dense correspondence. In CVPR.
  • Downs etal. (2022)Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, ThomasB McHugh, and Vincent Vanhoucke. 2022.Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items.arXiv (2022).
  • Duan etal. (2020)Yueqi Duan, Haidong Zhu, He Wang, Li Yi, Ram Nevatia, and LeonidasJ Guibas. 2020.Curriculum deepsdf. In ECCV.
  • FujiTsang etal. (2022)Clement FujiTsang, Maria Shugrina, JeanFrancois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, KrishnaMurthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. 2022.Kaolin: A Pytorch Library for Accelerating 3D Deep Learning Research.https://github.com/NVIDIAGameWorks/kaolin.
  • Guizilini etal. (2022)Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, MatthewR Walter, and Adrien Gaidon. 2022.Depth field networks for generalizable multi-view scene representation. In ECCV.
  • Hodan etal. (2018)Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders GlentBuch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, etal. 2018.Bop: Benchmark for 6d object pose estimation. In ECCV.
  • Ichnowski etal. (2022)Jeffrey Ichnowski, Yahav Avigal, Justin Kerr, and Ken Goldberg. 2022.Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects. In CoRL.
  • Irshad etal. (2022)MuhammadZubair Irshad, Sergey Zakharov, Rares Ambrus, Thomas Kollar, Zsolt Kira, and Adrien Gaidon. 2022.ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization. In ECCV.
  • Jacquin (1990)ArnaudE Jacquin. 1990.Fractal image coding based on a theory of iterated contractive image transformations. In VCIP.
  • Jang and Agapito (2021)Wonbong Jang and Lourdes Agapito. 2021.Codenerf: Disentangled neural radiance fields for object categories. In ICCV.
  • Kaskman etal. (2019)Roman Kaskman, Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. 2019.Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. In ICCV Workshops.
  • Kato etal. (2020)Hiroharu Kato, Deniz Beker, Mihai Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. 2020.Differentiable rendering: A survey.arXiv (2020).
  • Kazhdan etal. (2006)Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. 2006.Poisson surface reconstruction. In SGP.
  • Kim etal. (2024)Doyub Kim, Minjae Lee, and Ken Museth. 2024.Neuralvdb: High-resolution sparse volume representation using hierarchical neural networks.TOG (2024).
  • Kingma and Ba (2014)DiederikP Kingma and Jimmy Ba. 2014.Adam: A method for stochastic optimization.arXiv (2014).
  • Lindell etal. (2021)DavidB. Lindell, JulienN.P. Martel, and Gordon Wetzstein. 2021.AutoInt: Automatic Integration for Fast Neural Volume Rendering. In CVPR.
  • Liu etal. (2020a)Lingjie Liu, Jiatao Gu, KyawZaw Lin, Tat-Seng Chua, and Christian Theobalt. 2020a.Neural Sparse Voxel Fields. In NeurIPS.
  • Liu etal. (2020b)Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. 2020b.Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In CVPR.
  • Locatello etal. (2020)Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. 2020.Object-centric learning with slot attention. In NeurIPS.
  • Lombardi etal. (2019)Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. 2019.Neural volumes: learning dynamic renderable volumes from images.TOG (2019).
  • Martin-Brualla etal. (2021)Ricardo Martin-Brualla, Noha Radwan, Mehdi S.M. Sajjadi, JonathanT. Barron, Alexey Dosovitskiy, and Daniel Duckworth. 2021.NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR.
  • Mescheder etal. (2019)Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019.Occupancy networks: Learning 3d reconstruction in function space. In CVPR.
  • Mildenhall etal. (2020)Ben Mildenhall, PratulP Srinivasan, Matthew Tancik, JonathanT Barron, Ravi Ramamoorthi, and Ren Ng. 2020.Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV.
  • Mitra etal. (2019)NiloyJ. Mitra, Iasonas Kokkinos, Paul Guerrero, Nils Thuerey, Vladimir Kim, and Leonidas Guibas. 2019.CreativeAI: Deep Learning for Graphics. In SIGGRAPH 2019 Courses.
  • Morton (1966)GuyM Morton. 1966.A computer oriented geodetic data base and a new technique in file sequencing.(1966).
  • Mu etal. (2021)Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. 2021.A-sdf: Learning disentangled signed distance functions for articulated shape representation. In ICCV.
  • Müller etal. (2022)Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022.Instant Neural Graphics Primitives with a Multiresolution Hash Encoding.TOG (2022).
  • Neff etal. (2021)Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, JoergH Mueller, Chakravarty RAlla Chaitanya, Anton Kaplanyan, and Markus Steinberger. 2021.DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks. In Computer Graphics Forum.
  • Niemeyer and Geiger (2021)Michael Niemeyer and Andreas Geiger. 2021.Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR.
  • Niemeyer etal. (2020)Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2020.Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR.
  • Ortiz etal. (2022)Joseph Ortiz, Alexander Clegg, Jing Dong, Edgar Sucar, David Novotny, Michael Zollhoefer, and Mustafa Mukadam. 2022.iSDF: Real-Time Neural Signed Distance Fields for Robot Perception. In RSS.
  • Ost etal. (2021)Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. 2021.Neural scene graphs for dynamic scenes. In CVPR.
  • Palafox etal. (2021)Pablo Palafox, Aljaž Božič, Justus Thies, Matthias Nießner, and Angela Dai. 2021.Npms: Neural parametric models for 3d deformable shapes. In ICCV.
  • Park etal. (2019)JeongJoon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019.DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In CVPR.
  • Park etal. (2021)Keunhong Park, Utkarsh Sinha, JonathanT. Barron, Sofien Bouaziz, DanB Goldman, StevenM. Seitz, and Ricardo Martin-Brualla. 2021.Nerfies: Deformable Neural Radiance Fields. In ICCV.
  • Peng etal. (2020)Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020.Convolutional occupancy networks. In ECCV.
  • Pumarola etal. (2021)Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2021.D-NeRF: Neural Radiance Fields for Dynamic Scenes. In CVPR.
  • Rashid etal. (2023)Adam Rashid, Satvik Sharma, ChungMin Kim, Justin Kerr, LawrenceYunliang Chen, Angjoo Kanazawa, and Ken Goldberg. 2023.Language embedded radiance fields for zero-shot task-oriented grasping. In CoRL.
  • Rebain etal. (2021)Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, KwangMoo Yi, and Andrea Tagliasacchi. 2021.Derf: Decomposed radiance fields. In CVPR.
  • Sajjadi etal. (2022a)MehdiSM Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd VanSteenkiste, Filip Pavetic, Mario Lucic, LeonidasJ Guibas, Klaus Greff, and Thomas Kipf. 2022a.Object scene representation transformer. In NeurIPS.
  • Sajjadi etal. (2022b)MehdiSM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, etal. 2022b.Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In CVPR.
  • Shechtman and Irani (2007)Eli Shechtman and Michal Irani. 2007.Matching local self-similarities across images and videos. In CVPR.
  • Sitzmann etal. (2020a)Vincent Sitzmann, Eric Chan, Richard Tucker, Noah Snavely, and Gordon Wetzstein. 2020a.Metasdf: Meta-learning signed distance functions. In NeurIPS.
  • Sitzmann etal. (2020b)Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. 2020b.Implicit neural representations with periodic activation functions. In NeurIPS.
  • Srinivasan etal. (2021)PratulP. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and JonathanT. Barron. 2021.NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis. In CVPR.
  • Stelzner etal. (2021)Karl Stelzner, Kristian Kersting, and AdamR Kosiorek. 2021.Decomposing 3d scenes into objects via unsupervised volume segmentation.arXiv (2021).
  • Sucar etal. (2021)Edgar Sucar, Shikun Liu, Joseph Ortiz, and AndrewJ Davison. 2021.iMAP: Implicit mapping and positioning in real-time. In ICCV.
  • Takikawa etal. (2022a)Towaki Takikawa, Alex Evans, Jonathan Tremblay, Thomas Müller, Morgan McGuire, Alec Jacobson, and Sanja Fidler. 2022a.Variable bitrate neural fields. In SIGGRAPH.
  • Takikawa etal. (2021)Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. 2021.Neural geometric level of detail: Real-time rendering with implicit 3D shapes. In CVPR.
  • Takikawa etal. (2022b)Towaki Takikawa, Or Perel, ClementFuji Tsang, Charles Loop, Joey Litalien, Jonathan Tremblay, Sanja Fidler, and Maria Shugrina. 2022b.Kaolin Wisp: A PyTorch Library and Engine for Neural Fields Research.https://github.com/NVIDIAGameWorks/kaolin-wisp.
  • Tancik etal. (2022)Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, PratulP Srinivasan, JonathanT Barron, and Henrik Kretzschmar. 2022.Block-nerf: Scalable large scene neural view synthesis. In CVPR.
  • Tancik etal. (2021)Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, PratulP Srinivasan, JonathanT Barron, and Ren Ng. 2021.Learned initializations for optimizing coordinate-based neural representations. In CVPR.
  • Tang etal. (2021)Jia-Heng Tang, Weikai Chen, Jie Yang, Bo Wang, Songrun Liu, Bo Yang, and Lin Gao. 2021.OctField: Hierarchical Implicit Functions for 3D Modeling. In NeurIPS.
  • Tewari etal. (2021)Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, etal. 2021.Advances in neural rendering.arXiv (2021).
  • Tremblay etal. (2022)Jonathan Tremblay, Moustafa Meshry, Alex Evans, Jan Kautz, Alexander Keller, Sameh Khamis, Charles Loop, Nathan Morrical, Koki Nagano, Towaki Takikawa, and Stan Birchfield. 2022.RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis.ECCV Workshops.
  • Wang etal. (2022)Yifan Wang, Lukas Rahmann, and Olga Sorkine-Hornung. 2022.Geometry-consistent neural shape representation with implicit displacement fields. In ICLR.
  • Wei etal. (2021)Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. 2021.Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In ICCV.
  • Williams etal. (2022)Francis Williams, Zan Gojcic, Sameh Khamis, Denis Zorin, Joan Bruna, Sanja Fidler, and Or Litany. 2022.Neural fields as learnable kernels for 3d reconstruction. In CVPR.
  • Xian etal. (2021)Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. 2021.Space-time neural irradiance fields for free-viewpoint video. In CVPR.
  • Xie etal. (2021)Yiheng Xie, Towaki Takikawa, Shunsuke Saito, Or Litany, Shiqin Yan, Numair Khan, Federico Tombari, James Tompkin, Vincent Sitzmann, and Srinath Sridhar. 2021.Neural Fields in Visual Computing and Beyond.arXiv (2021).
  • Yang etal. (2019)Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019.Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV.
  • Yang etal. (2022)Guo-Wei Yang, Wen-Yang Zhou, Hao-Yang Peng, Dun Liang, Tai-Jiang Mu, and Shi-Min Hu. 2022.Recursive-nerf: An efficient and dynamically growing nerf.TVCG (2022).
  • Yi etal. (2023)Brent Yi, Weijia Zeng, Sam Buchanan, and Yi Ma. 2023.Canonical factors for hybrid neural fields. In ICCV.
  • Yu etal. (2021)Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. 2021.pixelnerf: Neural radiance fields from one or few images. In CVPR.
  • Yu etal. (2022)Hong-Xing Yu, LeonidasJ. Guibas, and Jiajun Wu. 2022.Unsupervised Discovery of Object Radiance Fields. In ICLR.
  • Yuan etal. (2021)Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, and Steven Lovegrove. 2021.STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering. In CVPR.
  • Zakharov etal. (2022)Sergey Zakharov, Rares Ambrus, Katherine Liu, and Adrien Gaidon. 2022.ROAD: Learning an Implicit Recursive Octree Auto-Decoder to Efficiently Encode 3D Shapes. In CoRL.
  • Zakharov etal. (2021)Sergey Zakharov, RaresAndrei Ambrus, VitorCampagnolo Guizilini, Dennis Park, Wadim Kehl, Fredo Durand, JoshuaB Tenenbaum, Vincent Sitzmann, Jiajun Wu, and Adrien Gaidon. 2021.Single-Shot Scene Reconstruction. In CoRL.
  • Zakharov etal. (2020)Sergey Zakharov, Wadim Kehl, Arjun Bhargava, and Adrien Gaidon. 2020.Autolabeling 3D Objects with Differentiable Rendering of SDF Shape Priors. In CVPR.
  • Zeng etal. (2022)Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022.LION: Latent Point Diffusion Models for 3D Shape Generation. In NeurIPS.
  • Zhang etal. (2020)Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. 2020.Nerf++: Analyzing and improving neural radiance fields.arXiv (2020).
  • Zhou etal. (2021)Linqi Zhou, Yilun Du, and Jiajun Wu. 2021.3d shape generation and completion through point-voxel diffusion. In ICCV.
  • Zhou and Jacobson (2016)Qingnan Zhou and Alec Jacobson. 2016.Thingi10k: A dataset of 10,000 3d-printing models.arXiv (2016).

Supplementary Material

Appendix A Training Details

A.1. ReFiNe Training Data

ReFiNe’s training data consists of a ground-truth octree structure covering the mesh at the desired LoD, together with densely sampled coordinates and their respective ground-truth field values (SDF, RGB, density). We sample $10^6$ points within two bands around the surface - a smaller one (LoD-1) and a larger one (LoD+1) - to ensure sufficient coverage for recovering high-frequency details, and store the respective supervision values (e.g., SDF, RGB, density).
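As a concrete illustration of this sampling step, the following is a minimal sketch using trimesh. The function name, band widths, and even split between the two bands are illustrative assumptions rather than the exact configuration used for ReFiNe (which also stores RGB/density where available):

```python
import numpy as np
import trimesh

def sample_sdf_supervision(mesh_path, n_points=10**6, narrow_band=0.01, wide_band=0.05):
    """Sketch of two-band point sampling for SDF supervision.

    Band widths and the 50/50 split between bands are illustrative choices,
    not the paper's exact settings.
    """
    mesh = trimesh.load(mesh_path, force="mesh")

    # Start from uniformly sampled surface points and perturb them with
    # Gaussian noise of two different magnitudes (narrow and wide band).
    surface_pts, _ = trimesh.sample.sample_surface(mesh, n_points)
    half = n_points // 2
    noise = np.empty((n_points, 3))
    noise[:half] = np.random.normal(scale=narrow_band, size=(half, 3))
    noise[half:] = np.random.normal(scale=wide_band, size=(n_points - half, 3))
    query_pts = surface_pts + noise

    # Signed distances used as supervision targets.
    # Note: trimesh returns positive values inside the mesh, hence the sign flip.
    sdf = -trimesh.proximity.signed_distance(mesh, query_pts)
    return query_pts.astype(np.float32), sdf.astype(np.float32)
```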

Following (FujiTsang et al., 2022), our octree is represented as a tensor of bytes, where each bit stores the binary occupancy of a cell, with cells sorted in Morton order. The Morton order defines a space-filling curve that provides a bijective mapping between 1D indices and 3D coordinates. As a result, we do not need to store indirection pointers and can access the tree efficiently. We additionally dilate our octree with a simple $3\times3\times3$ dilation kernel to guarantee a sufficient feature margin for trilinear interpolation.
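To make the bit interleaving behind Morton ordering concrete, below is a minimal pure-Python sketch of 3D Morton encoding. The actual implementation follows the structured-octree routines of (FujiTsang et al., 2022); the axis-to-bit convention used here is an illustrative assumption:

```python
def part1by2(x: int) -> int:
    """Spread the lower 10 bits of x so that two zero bits separate consecutive bits."""
    x &= 0x000003FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton_encode(x: int, y: int, z: int) -> int:
    """Interleave the bits of integer voxel coordinates into a 1D Morton code.

    Which axis occupies the lowest bit is a convention and may differ from the
    library actually used.
    """
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# The eight children of a voxel map to consecutive codes 0..7 at the next level:
print([morton_encode(x, y, z) for z in (0, 1) for y in (0, 1) for x in (0, 1)])
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```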

All our networks are trained on a single NVIDIA A100 GPU.

A.2. Baseline Method Details

DeepSDF

We use the open-source implementation of DeepSDF (Park et al., 2019). To generate training data, we preprocess models from Thingi32 and ShapeNet150 with the provided code and parameters, which aim to generate approximately 500k training points. We improve results in the overfitting scenario by setting the dropout rate to zero and removing the latent regularization. We use a learning rate of 0.001 for the decoder network parameters and 0.002 for the latents, with a decay factor of 0.75 every 500 steps, and train until convergence (about 20k epochs). For the experiments on Thingi32 we use a batch size of 32 objects; for ShapeNet150 we use a batch size of 64 objects. All other parameters are left as provided by the example implementation (i.e., a code length of 256 and an unchanged network architecture).
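For clarity, the learning-rate schedule above corresponds to the following minimal PyTorch sketch; the decoder and latent table below are placeholders, not the DeepSDF code itself:

```python
import torch
import torch.nn as nn

# Placeholder decoder and latent table; stand-ins for the DeepSDF auto-decoder setup.
decoder = nn.Sequential(nn.Linear(256 + 3, 512), nn.ReLU(), nn.Linear(512, 1))
latent_codes = nn.Embedding(num_embeddings=32, embedding_dim=256)  # one code per object

optimizer = torch.optim.Adam([
    {"params": decoder.parameters(),      "lr": 1e-3},  # decoder learning rate
    {"params": latent_codes.parameters(), "lr": 2e-3},  # latent learning rate
])
# Decay both learning rates by a factor of 0.75 every 500 steps, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.75)
```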

Curriculum DeepSDF

We also use the open-source implementation of Curriculum-DeepSDF (Duan et al., 2020). For consistency, we apply the same parameter changes as for DeepSDF and use the same training data. We do not modify the curriculum proposed in (Duan et al., 2020) other than lengthening the last stage of training. We observe that the proposed curriculum yields quantitative reconstruction gains on ShapeNet150 but not on Thingi32, suggesting that a different curriculum might improve results for the latter dataset. However, searching for the optimal curriculum is expensive, so we report results based on the baseline curriculum given in the open-source implementation.

SRN & CodeNeRF

We use the open-source implementations with default configurations for SRN (sit, 2019) and CodeNeRF (Jang and Agapito, 2021) and train both methods on our subset of the SRN dataset as described in Section 4.3 of the main paper. Both baselines use the default latent code size of 256; CodeNeRF additionally uses two 256-dimensional codes per object, one for geometry and one for appearance. In Table 2 and Fig. 6 of the main paper we demonstrate that our method outperforms both baselines while using a more lightweight architecture and a latent code size of 64.

Table 4. Multi-scale feature fusion ablation on HomebrewedDB (SDF+RGB).

Fusion      | CD↓   | 3D PSNR↑ | Runtime (s)↓ | Size (MB)↓
Sum         | 0.046 | 33.61    | 0.11         | 3.2
Concatenate | 0.046 | 34.89    | 0.12         | 3.8


Appendix B Evaluation Details

To calculate the Chamfer distance for DeepSDF and Curriculum DeepSDF we first extract surface points following the protocol of (Irshad et al., 2022). In particular, we define a coarse voxel grid at LoD 2 and estimate SDF values for each of its points using the pretrained SDF network. Voxels whose SDF values are larger than their size are pruned, and the remaining voxels are propagated to the next level via subdivision. When the desired LoD is reached, we use zero-isosurface projection to extract surface points from the predicted SDF values and estimated surface normals. Finally, we use the Chamfer distance implementation from (Takikawa et al., 2022b) to compare our prediction against a ground-truth point cloud of $2^{17}$ points sampled from the original mesh. When reconstructing SDF + color, we additionally use PSNR to evaluate the RGB values regressed at the same $2^{17}$ points. To compute gIoU, we first reconstruct a mesh using Poisson surface reconstruction (Kazhdan et al., 2006) and then compare against $2^{17}$ ground-truth values computed at points randomly sampled from the original mesh.
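For reference, a minimal sketch of a symmetric Chamfer distance between two point sets is given below; it follows one common convention (mean of nearest-neighbor distances in both directions) and is not the exact implementation of (Takikawa et al., 2022b), which may differ in scaling or squaring:

```python
import torch

def chamfer_distance(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    A simple O(N*M) reference; a KD-tree or chunking is advisable for large clouds.
    """
    d = torch.cdist(pred, gt)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Example with random stand-in point clouds (subsampled to keep memory modest):
pred = torch.rand(2**17, 3)
gt = torch.rand(2**17, 3)
print(chamfer_distance(pred[:4096], gt[:4096]))
```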


Appendix C Additional Results

Multiscale Feature Interpolation

In Table 4 we perform an ablation studying how the multi-scale feature fusion scheme affects final reconstruction quality. For this experiment, we use a latent size of 64, our recursive autodecoder network $\phi$ consists of a single 1024-dimensional layer, and all decoding networks $\omega$, $\psi$ and $\xi$ use two layers of 256 fully connected units each. We use HomebrewedDB (Kaskman et al., 2019), a 6D pose estimation dataset from the BOP benchmark (Hodan et al., 2018) comprising 33 colored meshes (17 toy, 8 household, and 8 industry-relevant) of varying geometric and color complexity. We consider two ways of fusing the features interpolated at multiple LoDs: Sum, where the latents are simply added together, and Concatenate, where the interpolated latents from each LoD are concatenated. Both variants are trained to encode the full dataset of 33 objects. As shown in Table 2 of the main paper, the Sum scheme keeps the network size fixed across different LoDs, because it does not change the input size of the respective decoder networks and we have a single recursive network $\phi$ by design. The Concatenate scheme, on the other hand, comes at a higher storage cost since the corresponding decoding networks require larger input layers, but yields an improved 3D PSNR, as shown in Table 4. As can be seen in Fig. 9, while both schemes faithfully represent object geometry, the Concatenate scheme better preserves high-frequency color details.
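Schematically, the two fusion schemes amount to the following; the tensor shapes and function name are illustrative, not taken from the ReFiNe codebase:

```python
import torch

def fuse_lod_features(per_lod_feats, mode="sum"):
    """Combine features trilinearly interpolated at several LoDs for the same
    query points. `per_lod_feats` is a list of (N, C) tensors, one per LoD."""
    if mode == "sum":
        # Decoder input stays C-dimensional regardless of the number of LoDs,
        # so the decoder (and hence storage) size is unchanged.
        return torch.stack(per_lod_feats, dim=0).sum(dim=0)
    if mode == "concatenate":
        # Decoder input grows to C * num_lods, costing extra parameters/storage
        # but preserving more high-frequency detail (Table 4, Fig. 9).
        return torch.cat(per_lod_feats, dim=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: features from three LoDs for 1024 query points, 64 channels each.
feats = [torch.randn(1024, 64) for _ in range(3)]
print(fuse_lod_features(feats, "sum").shape)          # torch.Size([1024, 64])
print(fuse_lod_features(feats, "concatenate").shape)  # torch.Size([1024, 192])
```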

Latent Space Interpolation and Clustering

We present a qualitative analysis of our latent space on the ShapeNet150 and SRN Cars datasets. Since our method outputs a continuous feature field, it can be used to interpolate in latent space between objects of similar geometry. Figure 10 shows an example of such an interpolation between two objects of different classes. In addition, we project the latent spaces of Thingi32, ShapeNet150, and Google Scanned Objects, each represented by its respective network, using principal component analysis (see Fig. 8). The projections suggest that ReFiNe's latent space clusters similar objects, defined either by geometry (Thingi32, ShapeNet150) or by geometry and color (Google Scanned Objects), pointing to potential utility for classification.
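The projection in Fig. 8 amounts to a standard two-component PCA over the learned per-object codes. A minimal sketch with random stand-in latents and placeholder class labels (the real codes come from the trained networks) is shown below:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# `latents` stands in for the learned per-object codes of one trained network
# (here random 64-d vectors for 150 objects); `labels` are placeholder classes.
latents = np.random.randn(150, 64)
labels = np.repeat(np.arange(3), 50)

xy = PCA(n_components=2).fit_transform(latents)  # project 64-d codes to 2D
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=10)
plt.title("PCA projection of per-object latent codes")
plt.show()
```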

Single-Scene Baselines

Our main paper features baselines carefully selected to adhere to our key paradigm of representing an entire dataset with a single network, where each object or scene is represented by a single compact latent vector. However, it is also useful to evaluate how our results fare against single-scene methods that use one network per object or scene. In this section, we compare our results with single-scene methods on SDF and NeRF benchmarks. All storage sizes include both network and latent vector sizes. Table 5 shows results on the ShapeNet150 and Thingi32 SDF benchmarks, where we compare against two baselines: Neural Implicits (Davies et al., 2020) and NGLOD (Takikawa et al., 2021). Our method outperforms the single-scene baselines on ShapeNet150 both in reconstruction quality and storage, and demonstrates comparable performance on Thingi32. Similarly, Table 6 shows results on the RTMV benchmark, where we compare against mip-NeRF (Barron et al., 2021) and SVLF (Tremblay et al., 2022). As the results show, ReFiNe approaches the performance of single-scene methods while storing all 40 scenes within a single network, providing substantially lower storage requirements without specifically optimizing for compression. We attribute this to the recursive nature of our method, which splits the scene space into primitives at each recursive step.

Table 5. Comparison with single-scene SDF methods on ShapeNet150 and Thingi32.

Method           | Type        | ShapeNet150: CD↓ / gIoU↑ / MB↓ | Thingi32: CD↓ / gIoU↑ / MB↓
Neural Implicits | Per-Scene   | 0.500 / 82.2 / 4.4             | 0.092 / 96.0 / 0.9
NGLOD            | Per-Scene   | 0.062 / 91.7 / 185.4           | 0.027 / 99.4 / 39.6
ReFiNe/LoD6      | Per-Dataset | 0.019 / 99.4 / 3.9             | 0.027 / 97.4 / 3.2

Table 6. Comparison with single-scene NeRF methods on RTMV.

Method      | Type        | PSNR↑ | SSIM↑ | LPIPS↓ | Storage (MB)↓
mip-NeRF    | Per-Scene   | 30.53 | 0.91  | 0.06   | 7.4 * 40
SVLF        | Per-Scene   | 28.83 | 0.91  | 0.069  | 47 * 40
ReFiNe/LoD6 | Per-Dataset | 26.72 | 0.87  | 0.194  | 5.6

SIREN vs ReLU

Our recursive subdivision network $\phi$ and all decoding networks $\omega$, $\psi$ and $\xi$ are parametrized as SIREN-based MLPs with periodic activation functions. In this ablation, we evaluate how replacing the SIREN-based MLPs with standard ReLU-based MLPs affects reconstruction metrics across different field representations. To do so, we select a single object from each modality (the T-Rex from Thingi32 for SDF, the Dog from HB for SDF+RGB, and a car from SRN Cars for NeRF) and overfit a single MLP to each modality. All baselines use a latent size of 64, a single 1024-dimensional layer for the recursive subdivision network $\phi$, and 256-dimensional two-layer decoding networks $\omega$, $\psi$, and $\xi$. Our results in Table 7 demonstrate that a naive ReLU-based MLP implementation performs worse overall and especially struggles to reconstruct high-frequency details and colors.
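The difference between the two variants is simply the activation inside each MLP layer. A minimal sketch follows; the full SIREN recipe also prescribes a specific weight initialization, which is omitted here, and the layer sizes are illustrative rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """SIREN-style layer: a linear map followed by sin(w0 * x)."""
    def __init__(self, in_dim: int, out_dim: int, w0: float = 30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.w0 = w0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.w0 * self.linear(x))

# The ablation swaps the periodic activation for a plain ReLU:
siren_block = nn.Sequential(SineLayer(64, 256), SineLayer(256, 256))
relu_block = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
print(siren_block(torch.randn(8, 64)).shape, relu_block(torch.randn(8, 64)).shape)
```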

Network Size vs Reconstruction Quality

Similar to the latent size experiments in Table 3 of the main paper, in this ablation we study how changing the hidden dimension of our recursive subdivision network $\phi$ affects reconstruction quality. We train four baselines with hidden dimensions of 128, 256, 512, and 1024 for $\phi$. The remaining parameters are consistent across all four networks: a latent size of 64, and each of the decoding networks $\omega$, $\psi$, and $\xi$ uses two layers of 256 fully connected units. All baselines are trained on the HB dataset (SDF+RGB). As shown in Fig. 11, quality degrades gracefully as network capacity decreases.

Qualitative Results

In Figs. 12 and 13, we present additional qualitative results comparing ReFiNe against the DeepSDF, Curriculum DeepSDF, and ROAD baselines. We also show reconstructions of the HomebrewedDB dataset in Fig. 14 and additional RTMV qualitative results in Fig. 15.

Table 7. SIREN vs. ReLU activations on single-object overfitting.

Activation | T-Rex (SDF): CD↓ | Dog (SDF+RGB): CD↓ / 3D PSNR↑ | Car (NeRF): PSNR↑ / SSIM↑
ReLU       | 0.026            | 0.021 / 34.08                 | 28.170 / 0.951
SIREN      | 0.025            | 0.020 / 42.16                 | 29.130 / 0.962

Appendix D Applications

In recent years, neural fields have found use in various domains, including robotics and graphics. In robotics, they are actively employed to represent 3D geometry and appearance, with applications in object pose estimation and refinement (Zakharov et al., 2020; Irshad et al., 2022), grasping (Breyer et al., 2021; Ichnowski et al., 2022), and trajectory planning (Adamkiewicz et al., 2022). In graphics, they have been successfully used for object reconstruction from sparse and noisy data (Williams et al., 2022) and for representing high-quality 3D assets (Takikawa et al., 2022a).

ReFiNe employs a recursive hierarchical formulation that leverages object self-similarity, resulting in a highly compressed and efficient shape latent space. We demonstrate that our method achieves strong results in SDF-based reconstruction (Table 1 of the main paper) and NeRF-based novel-view synthesis (Tables 2 and 3 of the main paper), and features well-clustered latent spaces that allow for smooth interpolation (Figs. 8 and 10). We believe these properties will accelerate the adoption of neural fields in real-world tasks, particularly those involving compression.
