ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (2024)

Sergey ZakharovToyota Research InstituteLos AltosUSAsergey.zakharov@tri.global, Katherine LiuToyota Research InstituteLos AltosUSAkatherine.liu@tri.global, Adrien GaidonToyota Research InstituteLos AltosUSAadrien.gaidon.ctr@tri.global and Rareș AmbrușToyota Research InstituteLos AltosUSArares.ambrus@tri.global

(2024)

Abstract.

The common trade-offs of state-of-the-art methods for multi-shape representation (a single model ”packing” multiple objects) involve trading modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing.In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.Project page:https://zakharos.github.io/projects/refine/

compression, neural fields, level of detail, recursion, self-similarity

^†^†journalyear: 2024^†^†copyright: acmlicensed^†^†conference: Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ’24; July 27-August 1, 2024; Denver, CO, USA^†^†booktitle: Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ’24 (SIGGRAPH Conference Papers ’24), July 27-August 1, 2024, Denver, CO, USA^†^†doi: 10.1145/3641519.3657526^†^†isbn: 979-8-4007-0525-0/24/07^†^†submissionid: 1267^†^†ccs: Computing methodologies Machine learning^†^†ccs: Computing methodologies Computer graphics^†^†ccs: Computing methodologies Computer vision representations

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (1)

1. Introduction

Neural fields that encode scene properties at arbitrary resolutions using neural networks have reached unprecedented levels of detail. Typically using fully-connected multi-layer perceptrons (MLPs) to predict continuous field values, they have been used to represent geometry and appearance with applications in computer vision (Tancik et al., 2022), robotics (Rashid et al., 2023), and computer graphics (Mitra et al., 2019). However, most high-fidelity methods are limited to single scenes (Müller et al., 2022; Takikawa et al., 2021, 2022a) and overfit to the target geometry or appearance (Müller et al., 2022), while methods that capture multiple shapes typically sacrifice high-frequency details (Jang and Agapito, 2021; Park et al., 2019; Mescheder et al., 2019), limiting utility for applications such as streaming and representation learning. We would like to enable the compression of multiple complex shapes into single vectors with a single neural network and while maintaining the ability reconstruct high frequency geometric and textural information.

Global conditioning methods (sit, 2019; Jang and Agapito, 2021; Park et al., 2019) (i.e. one latent vector per shape) are capable of learning latent spaces over large numbers of shapes but require ground truth 3D supervision and suffer when representing high frequency details. Conversely, locally-conditioned methods partition the implicit function by leveraging hybrid discrete-continuous neural scene representations, effectively blurring the line between classical data structures and neural representations and allowing for more precise reconstructions by handling scenes as collections of localized primitives. These methods typically encode single scenes and leverage a secondary data structure (Takikawa et al., 2021; Zakharov et al., 2022; Müller et al., 2022), trading off additional memory for a less complex neural function mapping feature vectors to the target signal. Recently, (Zakharov et al., 2022) proposed to take advantage of both global and local conditioning via a recursive octree formulation, but the approach only captures geometry and outputs oriented point clouds that do not allow for continuous querying of the underlying implicit function, precluding the application of techniques such as ray-tracing.

In this work, we propose to encode many scenes represented as fields in a single network, where each scene is denoted by a single latent vector in a high dimensional space. We show how entire datasets of colored shapes can be encoded into a single neural network without sacrificing high frequency details (color or geometry) and without incurring a high memory cost. Key to our approach is a recursive formulation that allows us to effectively combine local and global conditioning. Our main motivation for a recursive structure comes from the observation that natural objects are self-similar (Shechtman and Irani, 2007), that is they are similar to a part of themselves at different scales. This property is famously used in the Fractal compression methods (Jacquin, 1990). Our method effectively extends prior work to the continuous setting, which allows us to recover geometry and color information with a higher degree of fidelity than previously possible. Our novel formulation allows us to learn from direct 3D supervision (SDF plus optionally RGB), as well as from continuous valued fields (NeRFs). We also investigate the properties of the resulting latent space and our results suggest the emergence of structure based on shape and appearance similarity. We address the limitations of related methods for representing multiple 3D shapes through ReFiNe: Recursive Field Networks and our contributions are:

•
A novel implicit representation parameterized by a recursive function that efficiently combines global and local conditioning, allowing continuous spatial interpolation and multi-scale feature aggregation.
•
Thanks to its recursive formulation, ReFiNe scales to multiple 3D assets represented as fields without having to maintain auxiliary data structures, leading to a compact and efficient network structure. We demonstrate a single network representing more than 1000 objects with high quality and reducing the memory needed by 99.8%.
•
ReFiNe is cross-modal, i.e., it supports various output 3D geometry and color representations (e.g., SDF, SDF+Color, and NeRF) and its output can be rendered either with sphere raytracing (SDF), iso-surface projection (SDF) or volumetric rendering (NeRF).

2. Related Work

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (2)

2.1. Neural Fields for Representing Shapes

Neural fields have emerged as powerful learners thanks to their ability to encode any continuous function up to an arbitrary level of resolution. For a survey of recent progress please refer to (Xie et al., 2021). Shapes are typically represented as Signed Distance Functions (Park et al., 2019; Sitzmann et al., 2020a, b) or by occupancy probabilities (Mescheder et al., 2019; Peng et al., 2020; Chen and Zhang, 2019), with the encoded mesh extracted through methods such as sphere tracing (Liu et al., 2020b). Hybrid discrete-continuous data structures have enabled encoding single objects to a very high degree of accuracy (Takikawa et al., 2021, 2022a; Müller et al., 2022; Wang et al., 2022; Kim et al., 2024; Yi et al., 2023) and extensions have been proposed to model articulated (Deng et al., 2020; Mu et al., 2021) and deformable (Deng et al., 2021; Palafox et al., 2021) objects. Alternatively, training on multiple shapes leads to disentangled latent spaces (Park et al., 2019; Chen and Zhang, 2019; Tang et al., 2021) which can be used for differentiable shape optimization (Zakharov et al., 2021; Irshad et al., 2022) shape generation (Chen and Zhang, 2019; Yang et al., 2019; Cai et al., 2020; Zeng et al., 2022), interpolation (Williams et al., 2022) and completion (Zhou et al., 2021). A number of methods have been proposed which continuously model and update scene geometry within the context of Simultaneous Localization and Mapping (SLAM) (Sucar et al., 2021; Ortiz et al., 2022).Some methods also leverage recursion to improve the reconstruction accuracy of neural fields (Yang et al., 2022; Zakharov et al., 2022). The recently proposed method ROAD (Zakharov et al., 2022) is most similar to ours as it also uses a recursive Octree structure and can represent the surface of multiple objects with a single network. However, it does not encode color and it outputs a discrete fixed-resolution reconstruction, making it unsuitable for applications that require volumetric rendering or ray-tracing. In contrast, ReFiNe outputs continuous feature fields that can be used to represent various continuous representations, such as (but not limited to) colored SDFs and NeRFs.

2.2. Differentiable Rendering Advances

(Kato et al., 2020; Tewari et al., 2021) through techniques such as volume rendering (Lombardi et al., 2019) or ray marching (Niemeyer et al., 2020) have led to methods that learn to represent geometry, appearance and as well as other scene properties from image inputs and without needing direct 3D supervision. Leveraging ray marching, (sit, 2019) regresses RGB colors at surface intersection allowing it to learn from multi-view images, while (Niemeyer et al., 2020) couples an implicit shape representation with differentiable rendering. Building on (Lombardi et al., 2019), Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) regress density and color values along directed rays (5D coordinates) instead of of regressing SDF or RGB values at 3D coordinates. This simple and yet very convincingly effective representation boosted interest in implicit volumetric rendering and resulted in a multitude of works tackling problems from training and rendering time performance (Rebain et al., 2021; Lindell et al., 2021; Tancik et al., 2021; Liu et al., 2020a), to covering dynamic scenes (Park et al., 2021; Pumarola et al., 2021; Xian et al., 2021), scene relighting (Martin-Brualla et al., 2021; Bi et al., 2020; Srinivasan et al., 2021), and composition (Ost et al., 2021; Yuan et al., 2021; Niemeyer and Geiger, 2021). To achieve competitive results, NeRF-style methods require a large number of input views, with poor performance in the low data regime (Zhang et al., 2020) which can be improved by leveraging external depth supervision (Neff et al., 2021; Wei et al., 2021; Deng et al., 2022). Image supervision has also been used to learn 3D object-centric models without any additional information (Stelzner et al., 2021; Yu et al., 2022; Sajjadi et al., 2022a), through a combination of Slot Attention (Locatello et al., 2020) and volumetric rendering. Alternatively, a number of methods train generalizable priors over multiple scenes (sit, 2019; Yu et al., 2021; Jang and Agapito, 2021; Sajjadi et al., 2022b; Guizilini et al., 2022). In (Jang and Agapito, 2021) the authors learn a prior over objects that are represented as radiance fields via MLPs and parameterized by appearance and shape codes.As we show through experiments, the design of our recursive neural 3D representation leads to a latent space that promotes reusability of color and geometric primitives across shapes, enabling higher accuracy recostructions than previously possible.

3. Methodology

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (3)

We would like to learn to represent a set of objects $\mathcal{O}=\{O_{1},\dots,O_{K}\}$ . In particular, we are interested in representing objects as fields, where each object is a mapping from a 3D coordinate in space to a value of dimension $F$ , i.e., $O_{k}:\mathbb{R}^{3}\to\mathbb{R}^{F}$ . Examples of common fields are Signed Distance Fields (where $F=1$ and the value of the field indicates the distance to the nearest surface) and radiance fields (where $F=4$ , representing RGB and density values). For each object, we assume supervision in the form of $N_{k}$ coordinate and field value tuples of $\{\bm{x}_{j},f_{j}\}_{j=0}^{N_{k}}$ , where $\bm{x}\in\mathbb{R}^{3}$ and $f\in\mathbb{R}^{F}$ is the field value.

3.1. ReFiNe

Our method represents each shape $O_{k}$ with a $D$ -dimensional latent vector $\bm{z}^{0}$ that is recursively expanded into an octree with a maximum Level-of-Detail (LoD) $M$ . Each level of the octree corresponds to a feature volume. We then perform both spatial and hierarchical feature aggregations before decoding into field values. Crucially, the expansion of each latent vector into an Octree-based neural field is achieved via the same simple MLP for each LoD and decoders are shared across all objects in $\mathcal{O}$ . Once optimized, ReFiNe represents all $K$ objects in a set of $K$ latent vectors, a recursive autodecoder for octree expansion, an occupancy prediction network, and field-specific decoders (i.e., for RGB, SDF, etc). Figure 1 illustrates how after training, our method can extract neural fields given different optimized LoD 0 latents, where we have dropped the superscipt for readability. Figure 2 shows a more detailed overview of a reconstruction given a single input latent.

3.1.1. Recursive Subdivision & Pruning

Given a latent vector $\bm{z}^{m}\in\mathbb{R}^{D}$ from LoD $m$ , our recursive autodecoder subdivision network $\phi:\mathbb{R}^{D}\to\mathbb{R}^{8D}$ traverses an octree by latent subdivision:

(1)

\phi(\bm{z}^{m})\to\{\bm{z}^{m+1}_{i}\}_{i=0}^{7}

Thus, a latent is divided into 8 cells, each with an associated child latent that is positioned at the cell’s center. Cell locations are defined by the Morton space-filling curve (Morton, 1966).

Each child latent is then further decoded to occupancy values $o$ using occupancy network $\omega:\mathbb{R}^{D}\to\mathbb{R}^{1}$ . Rather than continuing to expand the tree for all child latents, $R e F i N e$ selects a subset based on the predicted occupancy value:

(2)

\mathcal{Z}^{m+1}=\{\bm{z}^{m+1}\in\phi(\bm{z}^{m})\mid\omega(\bm{z}^{m+1})>0.%5\},

where $\mathcal{Z}^{m+1}$ is the set of children latents from a particular parent latent $\bm{z}^{m}$ having predicted occupancies above a threshold of $0.5$ , from which the next set of children will be recursed. This process can be seen in the left inset of Fig. 2.To supervise occupancy predictions, we further assume access to the structure of the ground-truth octree during training, i.e., annotations of which voxels at each LoD are occupied. If a voxel is predicted to be more likely unoccupied during reconstruction, we prune it from the octree structure.

To build the set of latents at a particular LoD, the latent expansion process described by Equations 1 and 2 for a single latent is applied to all unpruned children latents from the previous LoD. In this way, $R e F i N e$ recursively expands a latent octree from a single root latent $\bm{z}^{0}$ to a set of latents at the desired LoD.

3.1.2. Multiscale Feature Fusion

Once an octree is constructed, it can be decoded to various outputs depending on the desired field parametrization. As mentioned, we use $\omega$ and decode each recursively extracted latent vector to occupancy. However, to model more complex signals with high-frequency details (e.g. SDF or RGB) we found that directly decoding latents positioned at voxel centers results in coarse approximations at low octree LoDs and is directly tied to the voxel size, presenting challenges in scaling to high resolutions and/or complex scenes.Instead, we approximate latents at sampled locations by performing trilinear interpolation given spatially surrounding latents at the same LoD. We repeat this at every LoD except the first and then fuse resulting intermediate latents as shown in Fig. 2 into a new latent $\bm{\bar{z}}\in\mathbb{R}^{\bar{D}}$ , where the dimension ${\bar{D}}$ of the fused latent varies based on whether a concatenation or summation scheme is used. In the summation scheme, the latent size remains unchanged, i.e., $\bar{D}=D$ , whereas in the concatenation scheme, it is equal to the original latent size $D$ multiplied by the maximum LoD $M$ .

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (4)

3.1.3. Geometry Extraction and Rendering

Similar to (Zakharov et al., 2022), once the feature octree has been extracted for a given object we can decode the voxel centers into field values. However, our resulting representation can also be used to differentiably render images via volumetric rendering. We first estimate AABB intersections with voxels at the highest LoD. Given enter and exit points for each voxel, we then sample points within the voxel volume, enabling rendering via methods such as sphere ray tracing and volumetric compositing.

3.2. Field Specific Details

To demonstrate the utility and flexibility of ReFiNe, we focus in this work on two popular choices of object fields: Signed Distance Fields (SDF) (Park et al., 2019) for representing surfaces and Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) for volumetric rendering and view synthesis. ReFiNe regresses field specific signals via neural mappings that map regressed latents, and optionally viewing direction to the desired output (e.g. SDF, SDF and RGB or density and RGB). We denote the neural mapping responsible for geometry as $\psi$ and the neural mapping responsible for appearance as $\xi$ and discuss specific instantiations below.

3.2.1. SDF

Each fused latent $\bm{\bar{z}}$ regressed via spatial interpolation over the octree and fused over multiple LoDs is given to network $\psi:\mathbb{R}^{\bar{D}}\to\mathbb{R}^{1}$ to estimate an SDF value $s$ corresponding to a distance to the closest surface, with positive and negative values representing exterior and interior areas respectively. When dealing with colored objects, we introduce network $\xi:\mathbb{R}^{\bar{D}}\to\mathbb{R}^{3}$ to estimate a 3D vector $\textbf{c}=(r,g,b)$ that represents RGB colors.

To quickly extract points on the surface of the object, we can simply decode $s$ for the coordinate of each occupied voxel at the highest LoD and calculate the normal of the point by taking the derivative w.r.t to the spatial coordinates. If more points are desired, we can additionally sample within occupied voxels to obtain more surface points. Given further computation time, we may also render the encoded scene via sphere ray tracing, i.e. at each step querying a SDF value within voxels that defines a sphere radius for the next step. We repeat the process until we reach the surface. The latents at the surface points are then used to estimate color values. Figures 3 and 4 show qualitative examples of iso-surface projection and sphere ray tracing, respectively.

3.2.2. NeRF

When representing neural radiance fields each fused multiscale feature is given to networks $\xi:\mathbb{R}^{{\bar{D}}+3}\to\mathbb{R}^{3}$ and $\psi:\mathbb{R}^{\bar{D}}\to\mathbb{R}^{1}$ to estimate a 4D vector $(\textbf{c},\sigma)$ , where $\textbf{c}=(r,g,b)$ are RGB colors and $\sigma$ are densities per point. When trained on NeRF, our color network additionally takes a 3-channel view direction vector $d$ and the corresponding annotation $\mathcal{D}$ is augmented accordingly.

To render an image, each pixel value in the desired image frame is generated by compositing $K$ color predictions along the viewing ray via:

(3)

\small\hat{\textbf{c}}_{ij}=\sum_{k=1}^{K}w_{k}\hat{\textbf{c}}_{k},

where weights $w_{k}$ and accumulated densities $T_{k}$ , provided intervals $\delta_{k}=t_{k+1}-t_{k}$ , are defined as follows:

(4)		$\displaystyle\small w_{k}$	$\displaystyle=T_{k}\Big{(}1-\exp(-\sigma_{k}\delta_{k})\Big{)}$
(5)		$\displaystyle T_{k}$	$\displaystyle=\exp\Big{(}-\sum_{k^{\prime}=1}^{K}\sigma_{k^{\prime}}\delta_{k^%{\prime}}\Big{)}$

and $\{t_{k}\}_{k=0}^{K-1}$ are sampled adaptive depth values. Example visualizations of NeRF-based volumetric rendering can be seen in Fig. 5.

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (5)

3.3. Architecture and Training

The functions $\phi$ , $\omega$ , $\psi$ , and $\xi$ are parameterized with single SIREN-based (Sitzmann et al., 2020b) MLPs using periodic activation functions allowing to high-frequency details to be resolved. We refer to these components together as the ReFiNe network.

Our supervision objective consists of three terms: a binary cross entropy occupancy loss $\mathcal{L}_{o}$ , geometry loss $\mathcal{L}_{g}$ and color loss $\mathcal{L}_{c}$ minimizing the $l_{2}$ distance between respective predictions and ground truth values in each object’s field annotation $\mathcal{D}$ .

The final loss is formulated as:

(6)

\mathcal{L}=w_{o}\mathcal{L}_{o}+w_{g}\mathcal{L}_{g}+w_{c}\mathcal{L}_{c},

where $w_{o}=2,w_{g}=10,w_{c}=1$ for SDF, and $w_{o}=2,w_{g}=1,w_{c}=1$ for NeRF. The color loss value is dropped entirely when training on purely geometric SDFs. During training, we optimize the parameters of the recursive autodecoder $\phi$ , occupancy prediction network $\omega$ , decoding networks $\xi,\psi$ as well as the set of $K$ LoD 0 latent variables $\bm{z}^{0}_{i\in K}$ , where each latent represents a single object in $\mathcal{O}$ .

All our networks are trained on a single NVIDIA A100 GPU until convergence. The convergence time varies based on the number and complexity of objects to be encoded, as well as the network’s configuration. It ranges from 10 hours for smaller datasets (Thingi32 and SRN Cars) to 40 hours for larger datasets (GSO and RTMV).

4. Experiments

To demonstrate the utility of our method, we perform experiments across a variety of datasets (Thingi32, ShapeNet150, SRN Cars, GSO, and RTMV) and field representations (SDF, SDF+RGB, and NeRF). We highlight that our method encodes entire datasets within a single neural network, and thus we aim to compare with baselines that focus on the same task and require the same kind of supervision, as opposed to methods that overfit to single shapes or scenes.

4.1. Network Details

For experiments on Thingi32 and ShapeNet150 ReFiNe’s recursive autodecoder network $\phi$ consists of a single 1024-dimensional layer, and all decoding networks $\omega$ , $\psi$ and $\xi$ use two-layers of 256 fully connected units each. For the SRN Cars experiment we use a smaller capacity network featuring 128 two-layer decoding networks.For GSO and RTMV, we increase the capacity of the ReFiNe network, such that $\phi$ consists of a single 4096-dimensional layer, and all decoding networks use two-layers of 512 fully connected units each.We use the Adam solver (Kingma and Ba, 2014) with a learning rate of $2\times 10^{-5}$ to optimize the weights of our networks and a learning rate of $1\times 10^{-4}$ for latent vectors. In general, when reporting network sizes we do not include the storage cost of latent vectors.

Throughout the experiments, we employ either concatenation (Tables 1 and 3) or summation latent fusion (Table 2). The summation fusion scheme preserves the network size across different possible LoDs by keeping input sizes constant for decoder networks. On the other hand, the concatenation scheme comes at a higher storage cost as the corresponding decoding networks must have larger input layers, but it results in improved reconstruction quality. For an ablation comparing the fusion schemes, please refer to the supplemental material.

4.2. Training Data Generation

For object datasets represented as meshes, we normalize meshes to a unit sphere and additionally scale by a factor of 0.9. We first generate an octree of a desired LoD covering the mesh. We then perform dilation to secure a sufficient feature margin for trilinear interpolation. Finally, we sample points around the surface and compute respective SDF values.For colored shapes, we also sample points on the surface and store respective RGB values.

Method	Thingi32					ShapeNet150
Method	CD $\downarrow$	NC $\uparrow$	gIoU $\uparrow$	s $\downarrow$	MB $\downarrow$	CD $\downarrow$	NC $\uparrow$	gIoU $\uparrow$	s $\downarrow$	MB $\downarrow$
DeepSDF	0.088	0.941	96.4	0.14	7.4	0.250	0.933	90.2	0.12	7.4
Curriculum DeepSDF	0.102	0.941	96.3	0.14	7.4	0.214	0.903	93.3	0.12	7.4
ROAD / LoD6	0.138	0.959	96.4	0.03	3.2	0.175	0.928	86.3	0.01	3.8
ROAD / LoD7	0.045	0.969	98.4	0.03	3.2	0.067	0.936	94.2	0.01	3.8
ROAD / LoD8	0.022	0.971	98.7	0.04	3.2	0.041	0.935	94.9	0.02	3.8
ROAD / LoD9	0.017	0.970	98.7	0.08	3.2	0.036	0.931	94.9	0.06	3.8
ReFiNe / LoD4	0.023	0.980	98.8	0.07	3.1	0.041	0.945	96.6	0.04	3.7
ReFiNe / LoD5	0.022	0.981	99.1	0.07	3.1	0.036	0.944	96.5	0.05	3.8
ReFiNe / LoD6	0.019	0.981	99.4	0.07	3.2	0.027	0.954	97.4	0.05	3.8

To efficiently train ReFiNe on NeRFs, we first overfit single-scene NeRFs (Müller et al., 2022) on separate scenes.Each neural field can be constructed from a collection of RGB images $\{I_{i}\}_{i=0}^{N-1}$ , where camera intrinsic parameters $\textbf{K}_{i}\in\mathbb{R}^{3\times 3}$ as well as extrinsics $\mathbb{R}^{4\times 4}$ are assumed to be known.If ground truth depth maps are provided (RTMV), then the octree structure for each scene is computed and subsequently used to supervise our recursive autodecoder $\phi$ . If depth maps are not available (SRN Cars), we instead use adaptive pruning as implemented in (Müller et al., 2022). Then, we also densely sample points augmented with viewpoints inside the octree to store groundtruth density and color values for later supervision of geometry network $\psi$ and color network $\xi$ .

4.3. Reconstruction Benchmarks

4.3.1. Thingi32 / ShapeNet150 (SDF)

In the first benchmark we evaluate our method’s ability to represent and reconstruct object surfaces in the form of a SDF. We follow the experimental setup of (Takikawa et al., 2021; Zakharov et al., 2022) and train two networks: one on a subset of 32 objects from Thingi10K (Zhou and Jacobson, 2016) denoted Thingi32, and another on a subset of 150 objects from ShapeNet (Chang et al., 2015) denoted ShapeNet150. We use a latent dimension of 64 for Thingi32 and a latent dimension of 80 for ShapeNet150. We compute the commonly used Chamfer (CD), gIoU, and normal consistency (NC) metrics to evaluate surface reconstruction and we also record a memory footprint and inference time for each baseline. To extract a pointcloud from ReFiNe, we utilize the zero isosurface projection discussed in Section 3. Following ROAD’s (Zakharov et al., 2022) setup, gIoU is computed by recovering the object mesh using Poisson surface reconstruction (Kazhdan et al., 2006). We compare to DeepSDF (Park et al., 2019), Curriculum DeepSDF (Duan et al., 2020), using both methods’ open-sourced implementations for data generation and training with some minor hyper-parameter tuning to improve performance. Further details can be found in the supplemental material.

4.3.2. SRN Cars (NeRF)

In the next benchmark, we evaluate ReFiNe on another popular representation - Neural Radiance Fields (NeRFs). We use a feature dimension of 64 and compare our method against CodeNeRF (Jang and Agapito, 2021) and SRN (sit, 2019) on a subset of the SRN dataset consisting of 32 cars. We use 45 images for training and 5 non-overlapping images for testing on the task of novel view synthesis. As seen in Table 2, our representation outperforms both SRN and CodeNeRF baselines. Fig. 6 shows that ReFiNe does better when it comes to reconstructing high-frequency details.To compare inference time for NeRF-based baselines, we compute the average rendering time over the test images of the SRN benchmark. Our method demonstrates runtimes similar to those of SRN, with both significantly faster than CodeNeRF.

4.4. Scaling to Larger Datasets

Next, we demonstrate our model’s ability to scale to larger multi-modal datasets. For the experiments in this section, we use a latent size of 256.

4.4.1. GSO (SDF+RGB)

In the first experiment, we train ReFiNe to output a colored SDF field on the large Google Scanned Objects (GSO) dataset (Downs et al., 2022) containing 1030 diverse colored household objects targeting robotics applications. Despite the high complexity both in terms of geometry as well as color, our method achieves 0.044 Chamfer and 25.36 3D PSNR using a single network of size 45.6 MB together with a list of 256 dimensional latent vectors of 1.05 MB. Our method achieves a compression rate above $99.8\%$ compared to storing the original meshes (1.5 GB) and corresponding textures (24.2 GB). Qualitative results are shown in Fig. 4 and demonstrate the reconstruction quality of our approach.

4.4.2. RTMV (NeRF)

In this experiment we want to demonstrate that our method is not limited to reconstructing objects and is able to cover diverse scenes of a much higher complexity. We evaluate ReFiNe on the RTMV view synthesis benchmark (Tremblay et al., 2022) which consists of 40 scenes from 4 different environments (10 scenes each). Each scene comprises 150 unique views, with 100 views used for training, 5 views for validation, and 45 for testing.

As results in Fig. 5 show, ReFiNe is able to faithfully reconstruct the encoded scenes while storing all of them within a single network with low storage requirements and without specifically optimizing for compression. We attribute this to the recursive nature of our method splitting scene space into primitives at each recursive step. As we show in Table 3, our most lightweight network is only 8.36 MB, resulting in an average storage requirement of 210 KB per scene while still achieving an acceptable reconstruction quality of 24.2 PSNR. Similar to the SRN benchmark, we also compute the average rendering time over the test images, observing a gradual increase in runtime with larger latent sizes. Additionally we perform an ablation testing the effect of changing the latent size on the final reconstruction. We report results in Table 3 and Fig. 7 and note that performance gradually degrades when lowering the latent size, while at the same time decreasing storage requirements.

5. Limitations and Future Work

Our representation is currently limited to bounded scenes. This limitation can potentially be resolved by introducing an inverted sphere scene model for backgrounds from (Zhang et al., 2020). We would also like to leverage diffusion-based generative models to explore the task of 3D synthesis conditioned on various modalities such as text, images, and depth maps.

Acknowledgements.

We would like to thank Prof. Greg Shakhnarovich for his valuable feedback and help with reviewing the draft for this paper.

References

(1)
sit (2019)2019.Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations, In Sitzmann, Vincent and Zollhoefer, Michael and Wetzstein, Gordon.NeurIPS.
Adamkiewicz et al. (2022)Michal Adamkiewicz, Timothy Chen, Adam Caccavale, Rachel Gardner, Preston Culbertson, Jeannette Bohg, and Mac Schwager. 2022.Vision-only robot navigation in a neural radiance world.RA-L (2022).
Barron et al. (2021)Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. 2021.Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV.
Bi et al. (2020)Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. 2020.Neural reflectance fields for appearance acquisition.arXiv (2020).
Breyer et al. (2021)Michel Breyer, Jen Jen Chung, Lionel Ott, Roland Siegwart, and Juan Nieto. 2021.Volumetric grasping network: Real-time 6 dof grasp detection in clutter. In CoRL.
Cai et al. (2020)Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. 2020.Learning gradient fields for shape generation. In ECCV.
Chang et al. (2015)Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. 2015.ShapeNet: An information-rich 3D model repository.arXiv (2015).
Chen and Zhang (2019)Zhiqin Chen and Hao Zhang. 2019.Learning implicit fields for generative shape modeling. In CVPR.
Davies et al. (2020)Thomas Davies, Derek Nowrouzezahrai, and Alec Jacobson. 2020.Overfit neural networks as a compact shape representation.arXiv (2020).
Deng et al. (2020)Boyang Deng, JP Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. 2020.Neural Articulated Shape Approximation. In ECCV.
Deng et al. (2022)Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. 2022.Depth-supervised nerf: Fewer views and faster training for free. In CVPR.
Deng et al. (2021)Yu Deng, Jiaolong Yang, and Xin Tong. 2021.Deformed implicit field: Modeling 3d shapes with learned dense correspondence. In CVPR.
Downs et al. (2022)Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. 2022.Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items.arXiv (2022).
Duan et al. (2020)Yueqi Duan, Haidong Zhu, He Wang, Li Yi, Ram Nevatia, and Leonidas J Guibas. 2020.Curriculum deepsdf. In ECCV.
Fuji Tsang et al. (2022)Clement Fuji Tsang, Maria Shugrina, Jean Francois Lafleche, Towaki Takikawa, Jiehan Wang, Charles Loop, Wenzheng Chen, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Or Perel, Tianchang Shen, Jun Gao, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebaredian. 2022.Kaolin: A Pytorch Library for Accelerating 3D Deep Learning Research.https://github.com/NVIDIAGameWorks/kaolin.
Guizilini et al. (2022)Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, Matthew R Walter, and Adrien Gaidon. 2022.Depth field networks for generalizable multi-view scene representation. In ECCV.
Hodan et al. (2018)Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders GlentBuch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, et al. 2018.Bop: Benchmark for 6d object pose estimation. In ECCV.
Ichnowski et al. (2022)Jeffrey Ichnowski, Yahav Avigal, Justin Kerr, and Ken Goldberg. 2022.Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects. In CoRL.
Irshad et al. (2022)Muhammad Zubair Irshad, Sergey Zakharov, Rares Ambrus, Thomas Kollar, Zsolt Kira, and Adrien Gaidon. 2022.ShAPO: Implicit Representations for Multi-Object Shape Appearance and Pose Optimization. In ECCV.
Jacquin (1990)Arnaud E Jacquin. 1990.Fractal image coding based on a theory of iterated contractive image transformations. In VCIP.
Jang and Agapito (2021)Wonbong Jang and Lourdes Agapito. 2021.Codenerf: Disentangled neural radiance fields for object categories. In ICCV.
Kaskman et al. (2019)Roman Kaskman, Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. 2019.Homebreweddb: Rgb-d dataset for 6d pose estimation of 3d objects. In ICCV Workshops.
Kato et al. (2020)Hiroharu Kato, Deniz Beker, Mihai Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien Gaidon. 2020.Differentiable rendering: A survey.arXiv (2020).
Kazhdan et al. (2006)Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. 2006.Poisson surface reconstruction. In SGP.
Kim et al. (2024)Doyub Kim, Minjae Lee, and Ken Museth. 2024.Neuralvdb: High-resolution sparse volume representation using hierarchical neural networks.TOG (2024).
Kingma and Ba (2014)Diederik P Kingma and Jimmy Ba. 2014.Adam: A method for stochastic optimization.arXiv (2014).
Lindell et al. (2021)David B. Lindell, Julien N.P. Martel, and Gordon Wetzstein. 2021.AutoInt: Automatic Integration for Fast Neural Volume Rendering. In CVPR.
Liu et al. (2020a)Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. 2020a.Neural Sparse Voxel Fields. In NeurIPS.
Liu et al. (2020b)Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. 2020b.Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In CVPR.
Locatello et al. (2020)Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. 2020.Object-centric learning with slot attention. In NeurIPS.
Lombardi et al. (2019)Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. 2019.Neural volumes: learning dynamic renderable volumes from images.TOG (2019).
Martin-Brualla et al. (2021)Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. 2021.NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR.
Mescheder et al. (2019)Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019.Occupancy networks: Learning 3d reconstruction in function space. In CVPR.
Mildenhall et al. (2020)Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2020.Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV.
Mitra et al. (2019)Niloy J. Mitra, Iasonas Kokkinos, Paul Guerrero, Nils Thuerey, Vladimir Kim, and Leonidas Guibas. 2019.CreativeAI: Deep Learning for Graphics. In SIGGRAPH 2019 Courses.
Morton (1966)Guy M Morton. 1966.A computer oriented geodetic data base and a new technique in file sequencing.(1966).
Mu et al. (2021)Jiteng Mu, Weichao Qiu, Adam Kortylewski, Alan Yuille, Nuno Vasconcelos, and Xiaolong Wang. 2021.A-sdf: Learning disentangled signed distance functions for articulated shape representation. In ICCV.
Müller et al. (2022)Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022.Instant Neural Graphics Primitives with a Multiresolution Hash Encoding.TOG (2022).
Neff et al. (2021)Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H Mueller, Chakravarty R Alla Chaitanya, Anton Kaplanyan, and Markus Steinberger. 2021.DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks. In Computer Graphics Forum.
Niemeyer and Geiger (2021)Michael Niemeyer and Andreas Geiger. 2021.Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR.
Niemeyer et al. (2020)Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. 2020.Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR.
Ortiz et al. (2022)Joseph Ortiz, Alexander Clegg, Jing Dong, Edgar Sucar, David Novotny, Michael Zollhoefer, and Mustafa Mukadam. 2022.iSDF: Real-Time Neural Signed Distance Fields for Robot Perception. In RSS.
Ost et al. (2021)Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. 2021.Neural scene graphs for dynamic scenes. In CVPR.
Palafox et al. (2021)Pablo Palafox, Aljaž Božič, Justus Thies, Matthias Nießner, and Angela Dai. 2021.Npms: Neural parametric models for 3d deformable shapes. In ICCV.
Park et al. (2019)Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019.DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In CVPR.
Park et al. (2021)Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2021.Nerfies: Deformable Neural Radiance Fields. In ICCV.
Peng et al. (2020)Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. 2020.Convolutional occupancy networks. In ECCV.
Pumarola et al. (2021)Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2021.D-NeRF: Neural Radiance Fields for Dynamic Scenes. In CVPR.
Rashid et al. (2023)Adam Rashid, Satvik Sharma, Chung Min Kim, Justin Kerr, Lawrence Yunliang Chen, Angjoo Kanazawa, and Ken Goldberg. 2023.Language embedded radiance fields for zero-shot task-oriented grasping. In CoRL.
Rebain et al. (2021)Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. 2021.Derf: Decomposed radiance fields. In CVPR.
Sajjadi et al. (2022a)Mehdi SM Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd Van Steenkiste, Filip Pavetic, Mario Lucic, Leonidas J Guibas, Klaus Greff, and Thomas Kipf. 2022a.Object scene representation transformer. In NeurIPS.
Sajjadi et al. (2022b)Mehdi SM Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lučić, Daniel Duckworth, Alexey Dosovitskiy, et al. 2022b.Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In CVPR.
Shechtman and Irani (2007)Eli Shechtman and Michal Irani. 2007.Matching local self-similarities across images and videos. In CVPR.
Sitzmann et al. (2020a)Vincent Sitzmann, Eric Chan, Richard Tucker, Noah Snavely, and Gordon Wetzstein. 2020a.Metasdf: Meta-learning signed distance functions. In NeurIPS.
Sitzmann et al. (2020b)Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. 2020b.Implicit neural representations with periodic activation functions. In NeurIPS.
Srinivasan et al. (2021)Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. 2021.NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis. In CVPR.
Stelzner et al. (2021)Karl Stelzner, Kristian Kersting, and Adam R Kosiorek. 2021.Decomposing 3d scenes into objects via unsupervised volume segmentation.arXiv (2021).
Sucar et al. (2021)Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davison. 2021.iMAP: Implicit mapping and positioning in real-time. In ICCV.
Takikawa et al. (2022a)Towaki Takikawa, Alex Evans, Jonathan Tremblay, Thomas Müller, Morgan McGuire, Alec Jacobson, and Sanja Fidler. 2022a.Variable bitrate neural fields. In SIGGRAPH.
Takikawa et al. (2021)Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. 2021.Neural geometric level of detail: Real-time rendering with implicit 3D shapes. In CVPR.
Takikawa et al. (2022b)Towaki Takikawa, Or Perel, Clement Fuji Tsang, Charles Loop, Joey Litalien, Jonathan Tremblay, Sanja Fidler, and Maria Shugrina. 2022b.Kaolin Wisp: A PyTorch Library and Engine for Neural Fields Research.https://github.com/NVIDIAGameWorks/kaolin-wisp.
Tancik et al. (2022)Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. 2022.Block-nerf: Scalable large scene neural view synthesis. In CVPR.
Tancik et al. (2021)Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P Srinivasan, Jonathan T Barron, and Ren Ng. 2021.Learned initializations for optimizing coordinate-based neural representations. In CVPR.
Tang et al. (2021)Jia-Heng Tang, Weikai Chen, Jie Yang, Bo Wang, Songrun Liu, Bo Yang, and Lin Gao. 2021.OctField: Hierarchical Implicit Functions for 3D Modeling. In NeurIPS.
Tewari et al. (2021)Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, et al. 2021.Advances in neural rendering.arXiv (2021).
Tremblay et al. (2022)Jonathan Tremblay, Moustafa Meshry, Alex Evans, Jan Kautz, Alexander Keller, Sameh Khamis, Charles Loop, Nathan Morrical, Koki Nagano, Towaki Takikawa, and Stan Birchfield. 2022.RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis.ECCV Workshops.
Wang et al. (2022)Yifan Wang, Lukas Rahmann, and Olga Sorkine-Hornung. 2022.Geometry-consistent neural shape representation with implicit displacement fields. In ICLR.
Wei et al. (2021)Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. 2021.Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In ICCV.
Williams et al. (2022)Francis Williams, Zan Gojcic, Sameh Khamis, Denis Zorin, Joan Bruna, Sanja Fidler, and Or Litany. 2022.Neural fields as learnable kernels for 3d reconstruction. In CVPR.
Xian et al. (2021)Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. 2021.Space-time neural irradiance fields for free-viewpoint video. In CVPR.
Xie et al. (2021)Yiheng Xie, Towaki Takikawa, Shunsuke Saito, Or Litany, Shiqin Yan, Numair Khan, Federico Tombari, James Tompkin, Vincent Sitzmann, and Srinath Sridhar. 2021.Neural Fields in Visual Computing and Beyond.arXiv (2021).
Yang et al. (2019)Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019.Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV.
Yang et al. (2022)Guo-Wei Yang, Wen-Yang Zhou, Hao-Yang Peng, Dun Liang, Tai-Jiang Mu, and Shi-Min Hu. 2022.Recursive-nerf: An efficient and dynamically growing nerf.TVCG (2022).
Yi et al. (2023)Brent Yi, Weijia Zeng, Sam Buchanan, and Yi Ma. 2023.Canonical factors for hybrid neural fields. In ICCV.
Yu et al. (2021)Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. 2021.pixelnerf: Neural radiance fields from one or few images. In CVPR.
Yu et al. (2022)Hong-Xing Yu, Leonidas J. Guibas, and Jiajun Wu. 2022.Unsupervised Discovery of Object Radiance Fields. In ICLR.
Yuan et al. (2021)Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, and Steven Lovegrove. 2021.STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering. In CVPR.
Zakharov et al. (2022)Sergey Zakharov, Rares Ambrus, Katherine Liu, and Adrien Gaidon. 2022.ROAD: Learning an Implicit Recursive Octree Auto-Decoder to Efficiently Encode 3D Shapes. In CoRL.
Zakharov et al. (2021)Sergey Zakharov, Rares Andrei Ambrus, Vitor Campagnolo Guizilini, Dennis Park, Wadim Kehl, Fredo Durand, Joshua B Tenenbaum, Vincent Sitzmann, Jiajun Wu, and Adrien Gaidon. 2021.Single-Shot Scene Reconstruction. In CoRL.
Zakharov et al. (2020)Sergey Zakharov, Wadim Kehl, Arjun Bhargava, and Adrien Gaidon. 2020.Autolabeling 3D Objects with Differentiable Rendering of SDF Shape Priors. In CVPR.
Zeng et al. (2022)Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022.LION: Latent Point Diffusion Models for 3D Shape Generation. In NeurIPS.
Zhang et al. (2020)Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. 2020.Nerf++: Analyzing and improving neural radiance fields.arXiv (2020).
Zhou et al. (2021)Linqi Zhou, Yilun Du, and Jiajun Wu. 2021.3d shape generation and completion through point-voxel diffusion. In ICCV.
Zhou and Jacobson (2016)Qingnan Zhou and Alec Jacobson. 2016.Thingi10k: A dataset of 10,000 3d-printing models.arXiv (2016).

Supplementary Material

Appendix A Training Details

A.1. ReFiNe Training Data

ReFiNe’s training data consists of a ground truth octree structure covering the mesh at a desired LoD and densely sampled coordinates together with respective GT values (SDF, RGB, density). We sample $10^{6}$ points within 2 bands - a smaller one (LoD-1) and a larger one (LoD+1) to ensure sufficient coverage for recovering high frequency details and store respective supervision values (e.g. SDF, RGB, density).

Following (Fuji Tsang et al., 2022), our octree is represented as a tensor of bytes, where each bit stands for the binary occupancy sorted in Morton order. The Morton order defines a space-filling curve, which provides a bijective mapping to 3D coordinates from 1D coordinates. As a result, this frees us from storing indirection pointers and allows efficient tree access. We additionally dilate our octree using a simple $3\times 3\times 3$ dilation kernel to secure a sufficient feature margin for trilinear interpolation.

All our networks are trained on a single NVIDIA A100 GPU.

A.2. Baseline Method Details

DeepSDF

We use the open-source implementation of DeepSDF (Park et al., 2019). To generate training data, we preprocess models from Thingi32 and ShapeNet150 via the provided code and parameters, which aims to generate approximately 500k training points. We improve results on the overfitting scenario by setting the dropout rate to zero and removing the latent regularization. We use a learning rate of $0.001$ for the decoder network parameters and $0.002$ for latents as well as a decay factor of $0.75$ every 500 steps, training the methods until convergence (about 20k epochs). For the experiments on Thingi32 we use a batch size of 32 objects and for ShapeNet150 we use a batch size of 64 objects. All other parameters we leave as provided by the example implementations (i.e., we used a code length of 256 and keep the neural network architecture unchanged).

Curriculum DeepSDF

We also use the open-source implementation of Curriculum-DeepSDF (Duan et al., 2020). We duplicate the parameter changes made to DeepSDF for consistency, and use the same training data input. We do not modify the curriculum proposed in (Duan et al., 2020) other than lengthening the last stage of training. We observe that the proposed curriculum provided quantitative reconstruction gains for ShapeNet150 and not Thingi32, suggesting that a different curriculum may improve results for the latter dataset. However, searching for the optimal curriculum is expensive and we choose to report results based on the baseline curriculum given in the open-source implementation.

SRN & CodeNeRF

We use the open-source implementations with default configurations for SRN (sit, 2019) and CodeNeRF (Jang and Agapito, 2021) and train both methods on our subset of the SRN dataset as described in Section 4.3 of the main paper. Both baselines use a default latent code size of 256, whereas CodeNeRF uses 2 latent codes of 256 to represent an object - one for geometry, another for appearance. In Table 2 and Fig. 6 of the main paper we demonstrate that our method outperforms both baselines, while using a more lightweight architecture and a latent code size of 64.

Fusion	Reconstruction		Runtime	Size
Fusion	CD $\downarrow$	3D PSNR $\uparrow$	s $\downarrow$	MB $\downarrow$
Sum	0.046	33.61	0.11	3.2
Concatenate	0.046	34.89	0.12	3.8

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (8)

Appendix B Evaluation Details

To calculate the Chamfer distance for DeepSDF and Curriculum DeepSDF we first extract surface points following the protocol of (Irshad et al., 2022). In particular, we define a coarse voxel grid of LoD 2 and estimate SDF values for eachof the points using a pretrained SDF network. The voxels whose SDF values are larger than their size are pruned and the remaining voxels are propagated to the next level via subdivision. When the desired LoD is reached, we use zero isosurface projection to extract surface points using predicted SDF values and estimated surface normals. Finally, we use the Chamfer distance implementation from (Takikawa et al., 2022b) to compare our prediction against a ground truth point cloud of $2^{17}$ points sampled from the original mesh. When reconstructing SDF + Color, we additionally use PSNR to evaluate RGB values regressed from the same $2^{17}$ points.To compute gIoU, we first reconstruct a mesh using Poisson surface reconstruction from (Kazhdan et al., 2006) and then compare against $2^{17}$ ground truth values randomly computed using the original mesh.

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (9)

Appendix C Additional Results

Multiscale Feature Interpolation

In Table 4 we perform an ablation studying how the multi-scale feature fusion scheme affects the final reconstruction quality. For this experiment, we use the latent size of 64, our recursive autodecoder network $\phi$ consists of a single 1024-dimensional layer, and all decoding networks $\omega$ , $\psi$ and $\xi$ use two-layers of 256 fully connected units each. We use HomebrewedDB (Kaskman et al., 2019) - a 6D pose estimation dataset from BOP benchmark (Hodan et al., 2018) comprising 33 colored meshes (17 toy, 8 household and 8 industry-relevant) of various complexity in terms of both geometry and color. Two methods of feature fusion to combine interpolated features from multiple LoDs are considered: Sum, where the latents are simply added together, and Concatenate, where the interpolated latents from each LoD are concatenated together. Both modalities are trained to encode the full dataset consisting of 33 objects. As was shown in Table 2 of the main paper, the Sum fusion scheme preserves the network size across different possible LoDs, because it doesn’t change the input size for the respective decoder networks and we have a single recursive network $\phi$ by design. On the other hand, the Concatenate scheme comes at a higher storage cost as the corresponding decoding networks must have larger input layers, but results in an improved 3D PSNR value as shown in Table 4. As can be seen in Fig. 9, while both schemes manage to faithfully represent object geometry the Concatenate scheme does better when it comes to preserving high-frequency color details.

Latent Space Interpolation and Clustering

We present a qualitative analysis of our latent space conducted on the ShapeNet150 and SRN Cars datasets. As our method outputs a continuous feature field, it can be used for interpolation in the latent space between objects of similar geometry. Figure 10 shows an example of such interpolation between two objects of different classes.In addition, we plot latent spaces of Thingi32, ShapeNet150, and Google Scanned Objects represented by respective networks using the principal component analysis (see Fig. 8). Projected latent spaces suggest that the structure of ReFiNe’s latent space clusters similar objects defined either by geometry (Thingi32, ShapeNet150) or geometry and color (Google Scanned Objects), pointing to potential classification utility.

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (10)

Single-Scene Baselines

Our main paper features baselines that are carefully selected to adhere to our key paradigm of representing the entire dataset with a single network, where each object or scene is represented by a single compact latent vector. However, it is also useful to evaluate how our results fare against single-scene methods that use a single network per object or scene. In this section, we compare our results with single-scene methods on SDF and NeRF benchmarks. All storage sizes include both network and latent vector sizes.Table 5 shows the results on the ShapeNet150 and Thingi32 SDF benchmark. We compare our method against two baselines: mip-Neural Implicits (Davies et al., 2020), and NGLOD (Takikawa et al., 2021). Our method outperforms single-scene baselines on ShapeNet150 both in terms of reconstruction quality and storage and demonstrates a comparable performance on Thingi32.Similarly, Table 6 shows the results on the RTMV benchmark. We compare our method against two baselines: mip-NeRF (Barron et al., 2021), and SVLF (Tremblay et al., 2022). As results show, ReFiNe is able to approach the performance of single-scene methods while storing all 40 scenes within a single network providing substantially lower storage requirements without specifically optimizing for compression. We attribute this to the recursive nature of our method splitting scene space into primitives at each recursive step.

Method	Type	ShapeNet150			Thingi32
Method	Type	CD $\downarrow$	gIoU $\uparrow$	MB $\downarrow$	CD $\downarrow$	gIoU $\uparrow$	MB $\downarrow$
Neural Implicits	Per-Scene	0.500	82.2	4.4	0.092	96.0	0.9
NGLOD	Per-Scene	0.062	91.7	185.4	0.027	99.4	39.6
ReFiNe/LoD6	Per-Dataset	0.019	99.4	3.9	0.027	97.4	3.2

Method	Type	View synthesis			Storage
Method	Type	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	MB $\downarrow$
mip-NeRF	Per-Scene	30.53	0.91	0.06	7.4 * 40
SVLF	Per-Scene	28.83	0.91	0.06	947 * 40
ReFiNe/LoD6	Per-Dataset	26.72	0.87	0.19	45.6

SIREN vs ReLU

Our recursive subdivision network $\phi$ and all decoding networks $\omega$ , $\psi$ and $\xi$ are parametrized with SIREN-based MLPs using periodic activation functions. In this ablation, we evaluate how replacing SIREN-based MLPs with standard vanilla ReLU-based MLPs affects the reconstruction metrics for scenes using different field representations. To accomplish this, we select a single object from each modality (T-Rex from Thingi32 for SDF, Dog from HB for SDF+RGB, car from SRNCars for NeRF) and overfit a single MLP to each of the modalities. All baselines use a latent size of 64, a single 1024-dimensional layer for recursive subdivision network $\phi$ , and 256-dimensional two-layer decoding networks $\omega$ , $\psi$ , and $\xi$ .Our results shown in Table 7 demonstrate that a naive ReLU-based MLP implementation performs worse overall and especially suffers when it comes to reconstructing high frequency details and colors.

Network Size vs Reconstruction Quality

Similar to our latent size experiments in Table 3 of the main paper, in this ablation we study how changing the hidden dimension of our recursive subdivision network $\phi$ affects reconstruction quality. We train four baselines with different sizes for the hidden dimension of the recursive subdivision network, $\phi$ : 128, 256, 512, and 1024. The remaining parameters are consistent across all four networks: a latent size of 64, and each of the decoding networks $\omega$ , $\psi$ , and $\xi$ utilizes two layers of 256 fully connected units. All the baselines are trained on the HB dataset (SDF+RGB). As shown in Fig. 11, we observe a graceful degradation of quality with decreasing network capacity.

ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation (11)

Qualitative Results

In Figs. 12 and 13, we present additional qualitative results comparing ReFiNe against baselines: DeepSDF, Curriculum DeepSDF, and ROAD. We also demonstrate reconstructions of the HomebrewedDB in Fig. 14 and additional RTMV qualitative results in Fig. 15.

Activation	T-Rex (SDF)	Dog (SDF+RGB)		Car (NeRF)
Activation	CD $\downarrow$	CD $\downarrow$	3D PSNR $\uparrow$	PSNR $\uparrow$	SSIM $\uparrow$
ReLU	0.026	0.021	34.08	28.170	0.951
SIREN	0.025	0.020	42.16	29.130	0.962

Appendix D Applications

In recent years neural fields have found its use in various domains including robotics and graphics. In recent years, neural fields have found utility in various domains, including robotics and graphics. In robotics, neural fields are actively employed to represent 3D geometry and appearance, with applications in object pose estimation and refinement (Zakharov et al., 2020; Irshad et al., 2022), grasping (Breyer et al., 2021; Ichnowski et al., 2022), and trajectory planning (Adamkiewicz et al., 2022). In graphics, they have been successfully utilized for object reconstruction from sparse and noisy data (Williams et al., 2022) and for representing high-quality 3D assets (Takikawa et al., 2022a).

ReFiNe employs a recursive hierarchical formulation that leverages object self-similarity, resulting in a highly compressed and efficient shape latent space. We demonstrate that our method achieves impressive results in SDF-based reconstruction (Table 1 of the main paper) and NeRF-based novel-view synthesis (Tables 2 and 3 of the main paper), and features well-clustered latent spaces allowing for smooth interpolation (Figs. 8 and 10). We believe that these properties will accelerate the applicability of neural fields in real-world tasks, particularly those involving compression.

Method	View Synthesis			Runtime	Size
Method	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	s $\downarrow$	MB $\downarrow$
ReFiNe / Lat 32	24.18	0.83	0.23	1.19	8.4
ReFiNe / Lat 64	25.29	0.85	0.21	1.57	13.7
ReFiNe / Lat 128	25.96	0.86	0.20	2.34	24.3
ReFiNe / Lat 256	26.72	0.87	0.19	3.89	45.6