3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets

1 Department of Computer Science, University of Oxford
2 Adobe Research

*Work partially done during internship at Adobe Research

We present 3DMiner, a scalable framework designed to associate camera poses with images and reconstruct shapes from diverse, realistic sets of images without any 3D data, pose annotations, camera information, or keypoints.

Abstract

We present 3DMiner, a pipeline for mining 3D shapes from challenging large-scale unannotated image datasets. Unlike other unsupervised 3D reconstruction methods, we assume that, within a large-enough dataset, there must exist images of objects with similar shapes but varying backgrounds, textures, and viewpoints. Our approach leverages recent advances in learning self-supervised image representations to cluster images with geometrically similar shapes and to find common image correspondences between them. We then exploit these correspondences to obtain rough camera estimates as initialization for bundle adjustment. Finally, for every image cluster, we apply a progressive bundle-adjusting reconstruction method to learn a neural occupancy field representing the underlying shape. We show that this procedure is robust to several types of errors introduced in previous steps (e.g., wrong camera poses, images containing dissimilar shapes, etc.), allowing us to obtain shape and pose annotations for images in the wild. When using images from Pix3D chairs, our method produces significantly better results than state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively and qualitatively. Furthermore, we show how 3DMiner can be applied to in-the-wild data by reconstructing shapes present in images from the LAION-5B dataset.

3DMiner Pipeline


Our method starts by grouping images that depict similar 3D shapes, regardless of the shape's texture, the camera viewpoint, or the background. To do so, we perturb each image with various transformations (e.g., color jittering, perspective warps, and rotations) and pool the resulting DINO-ViT features to create a robust image embedding. We then cluster images by running agglomerative clustering on these embeddings. Within each cluster, we find keypoint correspondences using dense DINO-ViT features. We feed these corresponding keypoints to a structure-from-motion algorithm (rigid factorization) to obtain coarse orthographic camera estimates. Finally, we jointly refine the camera parameters and learn an occupancy field to produce the final shape. The two sketches below illustrate the grouping and camera-estimation steps.
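
To make the grouping step concrete, here is a minimal sketch of how one could pool DINO-ViT embeddings over random perturbations and cluster them agglomeratively. The backbone choice (dino_vits16), the augmentation parameters, and the distance threshold are illustrative assumptions, not the exact settings used in 3DMiner.

```python
# Hypothetical sketch of the grouping step: pool DINO-ViT embeddings over
# random augmentations of each image, then run agglomerative clustering on
# the pooled embeddings. Parameters are illustrative, not the paper's.
import torch
import torchvision.transforms as T
from sklearn.cluster import AgglomerativeClustering

device = "cuda" if torch.cuda.is_available() else "cpu"
# Off-the-shelf self-supervised DINO ViT backbone.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()

augment = T.Compose([
    T.Resize((224, 224)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomPerspective(distortion_scale=0.3, p=1.0),
    T.RandomRotation(15),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def pooled_embedding(pil_image, n_augs=8):
    """Average the global DINO embedding over random perturbations so the
    descriptor reflects shape rather than texture, background, or viewpoint."""
    batch = torch.stack([augment(pil_image) for _ in range(n_augs)]).to(device)
    feats = model(batch)                      # (n_augs, embed_dim) CLS features
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return feats.mean(dim=0).cpu().numpy()

def cluster_images(pil_images, distance_threshold=0.3):
    embeddings = [pooled_embedding(img) for img in pil_images]
    clustering = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold pick the count
        distance_threshold=distance_threshold,
        metric="cosine",                      # named 'affinity' on sklearn < 1.2
        linkage="average",
    )
    return clustering.fit_predict(embeddings)
```

Averaging over augmentations makes the embedding less sensitive to texture and background, so cosine distances between pooled embeddings track shape similarity more closely.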

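The coarse camera step can likewise be illustrated with classic Tomasi-Kanade-style rigid factorization under an orthographic camera model. The sketch below assumes complete tracks (every keypoint matched in every image of the cluster); the partial correspondences produced by dense DINO matching in practice would require missing entries to be filled or dropped.

```python
# Minimal Tomasi-Kanade rigid factorization, sketching how coarse
# orthographic cameras could be recovered from keypoint correspondences.
import numpy as np

def rigid_factorization(W):
    """W: (2F, P) matrix; rows 2i and 2i+1 hold the x and y coordinates of
    the P keypoints in frame i. Returns per-frame orthographic camera rows
    M (2F, 3) and 3D points S (3, P), up to a global rotation."""
    F = W.shape[0] // 2
    # Register tracks to their centroid: removes the translation component.
    W_centered = W - W.mean(axis=1, keepdims=True)

    # Rank-3 factorization via SVD: W ~ M_hat @ S_hat, up to an affine ambiguity.
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    M_hat = U[:, :3] * np.sqrt(s[:3])
    S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]

    # Metric upgrade: find symmetric Q = G @ G.T so each frame's two camera
    # rows are orthonormal. Each constraint a^T Q b = c is linear in the six
    # unique entries of Q.
    def row(a, b):
        return np.array([a[0]*b[0], a[1]*b[1], a[2]*b[2],
                         a[0]*b[1] + a[1]*b[0],
                         a[0]*b[2] + a[2]*b[0],
                         a[1]*b[2] + a[2]*b[1]])
    A, c = [], []
    for i in range(F):
        m, n = M_hat[2*i], M_hat[2*i + 1]
        A += [row(m, m), row(n, n), row(m, n)]
        c += [1.0, 1.0, 0.0]
    q = np.linalg.lstsq(np.array(A), np.array(c), rcond=None)[0]
    Q = np.array([[q[0], q[3], q[4]],
                  [q[3], q[1], q[5]],
                  [q[4], q[5], q[2]]])

    # Factor Q = G @ G.T; clamp tiny negative eigenvalues caused by noise.
    w, V = np.linalg.eigh(Q)
    G = V @ np.diag(np.sqrt(np.clip(w, 1e-9, None)))

    M = M_hat @ G                 # metric camera rows, two per frame
    S = np.linalg.inv(G) @ S_hat  # corresponding 3D keypoints
    return M, S
```

These coarse cameras only initialize the subsequent optimization; the joint refinement of camera parameters and the occupancy field corrects residual errors.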
Reconstruction Results


Qualitative Results on Pix3D and LAION-5B. To the best of our knowledge, we are the first to perform 3D reconstruction on the challenging LAION-5B image dataset.



We visualize the clusters of Pix3D chairs obtained with DINO-ViT features and our augmentations.






Quantitative Comparisons against SMR and Unicorn on Pix3D Chairs.




We plot the reprojection error per cluster (averaged over all images in the cluster) in ascending order and show representative reconstructions for four data points. We empirically observe that the reprojection error is a good indicator of reconstruction quality.
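
As a hedged sketch of how this ranking could be computed from the factorization outputs: project each cluster's 3D keypoints with its recovered orthographic cameras, measure the mean pixel residual against the observed tracks, and sort clusters by that score. All names below are illustrative.

```python
# Hypothetical sketch of the filtering metric: average 2D reprojection error
# per cluster under the recovered orthographic cameras, then rank clusters by
# it in ascending order.
import numpy as np

def reprojection_error(M, t, S, W):
    """M: (2F, 3) camera rows, t: (2F,) translations, S: (3, P) 3D points,
    W: (2F, P) observed 2D tracks. Returns the mean pixel error per frame."""
    residual = M @ S + t[:, None] - W                               # (2F, P)
    per_point = np.sqrt(residual[0::2] ** 2 + residual[1::2] ** 2)  # (F, P)
    return per_point.mean(axis=1)                                   # per frame

def rank_clusters(clusters):
    """clusters: dict mapping cluster_id -> (M, t, S, W). Sorts clusters by
    mean reprojection error; low-error clusters tend to reconstruct best."""
    scores = {cid: reprojection_error(*args).mean()
              for cid, args in clusters.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])
```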



BibTeX

BibTeX Code Here