We present a novel approach to mesh shape editing, building on recent progress in 3D reconstruction from multi-view images. We formulate shape editing as a conditional reconstruction problem, where the model must reconstruct the input shape except within a specified 3D region, in which the geometry is instead generated from a conditioning signal. To this end, we train a conditional Large Reconstruction Model (LRM) for masked reconstruction, using multi-view consistent masks rendered from a randomly generated 3D occluder and one clean viewpoint as the conditioning signal. At inference time, we manually define a 3D region to edit and provide an edited image from a canonical viewpoint to fill in that region. We demonstrate that, in just a single forward pass, our method not only preserves the input geometry in the unmasked region, with reconstruction quality on par with SoTA, but is also expressive enough to perform a variety of mesh edits from single-image guidance that past works struggle with, while being 10x faster than the top-performing prior method.
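To make the mask-generation step concrete, here is a minimal PyTorch sketch of producing multi-view consistent masks by projecting a random axis-aligned 3D box into each posed view. The pinhole-camera (OpenCV) convention, the point-splatting rasterization, and all names such as render_box_masks are our illustrative assumptions, not the paper's actual mask renderer.

import torch
import torch.nn.functional as F

def render_box_masks(box_min, box_max, K, world2cam, H, W, n_pts=50000):
    """Splat points sampled inside an axis-aligned 3D box into V binary view masks."""
    pts = box_min + torch.rand(n_pts, 3) * (box_max - box_min)   # [n, 3] world space
    pts_h = torch.cat([pts, torch.ones(n_pts, 1)], dim=1)        # [n, 4] homogeneous
    masks = torch.zeros(world2cam.shape[0], 1, H, W)
    for v in range(world2cam.shape[0]):
        cam = (world2cam[v] @ pts_h.T).T                         # [n, 3] camera space
        cam = cam[cam[:, 2] > 1e-6]                              # keep points in front
        uv = (K[v] @ cam.T).T                                    # perspective projection
        uv = (uv[:, :2] / uv[:, 2:3]).round().long()             # pixel coordinates
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        masks[v, 0, uv[ok, 1], uv[ok, 0]] = 1.0
    # One dilation pass closes pinholes left by the point splatting.
    return F.max_pool2d(masks, kernel_size=3, stride=1, padding=1)

# Usage: one 256x256 view, camera at world z = -2 looking along +z.
K = torch.tensor([[[200., 0., 128.], [0., 200., 128.], [0., 0., 1.]]])
w2c = torch.tensor([[[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 2.]]])
masks = render_box_masks(torch.tensor([-0.3, -0.3, -0.3]),
                         torch.tensor([0.3, 0.3, 0.3]), K, w2c, 256, 256)
print(masks.shape)  # torch.Size([1, 1, 256, 256])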
Our model is a Large Reconstruction Model that takes posed images of an object as input and predicts triplanes, which are decoded into an SDF and RGB colors. In contrast to standard LRMs, we randomly generate rectangular 3D masks during training and render them from the same camera poses as the input images. Patches that contain pixels occluded by these masks are replaced with a learnable token. Through this masking procedure, our LRM learns how to "inpaint" a 3D masked region in the input shape.
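The patch-replacement step can be pictured as a ViT-style patch embedding in which any patch touched by the rendered mask is swapped for a single learnable token. The sketch below is our illustration of that idea under assumed shapes; the class name MaskedPatchEmbed, the parameter mask_token, and the patch size of 16 are placeholders, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedPatchEmbed(nn.Module):
    """ViT-style patch embedding that swaps occluded patches for a learnable token."""

    def __init__(self, in_ch=3, embed_dim=768, patch=16):
        super().__init__()
        self.patch = patch
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        # One learnable embedding shared by every masked patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, imgs, occ_masks):
        # imgs:      [V, 3, H, W] posed input views
        # occ_masks: [V, 1, H, W] 1 where the rendered 3D mask occludes the view
        tokens = self.proj(imgs).flatten(2).transpose(1, 2)      # [V, N, D]
        # A patch counts as masked if any pixel inside it is occluded.
        occluded = F.max_pool2d(occ_masks, self.patch)           # [V, 1, H/p, W/p]
        occluded = occluded.flatten(2).transpose(1, 2) > 0       # [V, N, 1]
        return torch.where(occluded, self.mask_token, tokens)

# Usage: four views with random per-pixel occlusion masks.
embed = MaskedPatchEmbed()
imgs = torch.rand(4, 3, 256, 256)
occ = (torch.rand(4, 1, 256, 256) > 0.99).float()
print(embed(imgs, occ).shape)  # torch.Size([4, 256, 768])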
@article{gao2024meshedit,
  title={3D Mesh Editing using Masked LRMs},
  author={William Gao and Dilin Wang and Yuchen Fan and Aljaž Božič and Tuur Stuyck and Zhengqin Li and Zhao Dong and Rakesh Ranjan and Nikolaos Sarafianos},
  journal={arXiv preprint arXiv:2412.08641},
  year={2024}
}