Paper deep dive
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/22/2026, 6:09:03 AM
Summary
Points-to-3D is a diffusion-based framework that enhances 3D asset and scene generation by incorporating explicit point cloud priors into the latent space of the TRELLIS model. By treating visible-region point clouds as hard structural constraints and employing a structure inpainting network with a two-stage sampling strategy, the method achieves superior geometric fidelity and structural controllability compared to existing state-of-the-art baselines.
Entities (5)
Relation Signals (3)
Points-to-3D → BUILT_ON → TRELLIS
confidence 100% · Built on a latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization
Points-to-3D → EVALUATED_ON → Toys4K
confidence 95% · We evaluate Points-to-3D on object-level (Toys4K [54]) and scene-level (3D-FRONT [14]) benchmarks.
Points-to-3D → USES_INPUT → VGGT
confidence 90% · Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input.
Cypher Suggestions (2)
Identify the base model for the framework · confidence 95% · unvalidated
MATCH (f:Framework {name: 'Points-to-3D'})-[:BUILT_ON]->(m:Model) RETURN m.name
Find all datasets used for evaluation of the framework · confidence 90% · unvalidated
MATCH (f:Framework {name: 'Points-to-3D'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name
Abstract
Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, a visible-region point cloud is easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces the pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to teach global structural inpainting, is then used at inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments in both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.
Tags
Links
- Source: https://arxiv.org/abs/2603.18782v1
- Canonical: https://arxiv.org/abs/2603.18782v1
Full Text
58,579 characters extracted from source content.
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Australian Institute for Machine Learning, University of Adelaide, Australia
Jiatong Xia and Zicheng Duan contributed equally to this work. Corresponding author: Lingqiao Liu (lingqiao.liu@adelaide.edu.au).

Abstract

Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, a visible-region point cloud is easy to obtain—from active sensors such as LiDAR or from feed-forward predictors like VGGT—offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces the pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to teach global structural inpainting, is then used at inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments in both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.

Project page: https://jiatongxia.github.io/points2-3D/

Figure 1: We introduce explicit 3D point cloud priors into a 3D generation framework. Given a pre-existing point cloud or a feed-forward point cloud prediction from an image input, our model generates high-quality 3D assets that faithfully preserve the observed structure while plausibly completing unobserved regions with coherent geometry.

1 Introduction

Advances in 3D generation now allow models to synthesize realistic and diverse 3D assets from single-view images or text prompts. These “foundation” 3D models [30, 78, 70, 20, 67] can produce 3D assets across broad categories, supporting applications in content creation and virtual environments. However, conditioning solely on 2D images or text provides limited geometric controllability: while the output may appear plausible, the model lacks any mechanism to respect real 3D measurements. In practice, partial point clouds from sensors or image-based predictors provide reliable geometry for visible regions, yet current 3D generative pipelines make little use of this readily available structural information. In this work, we address this gap by enabling geometry-controllable 3D generation driven by point cloud priors. We focus on the setting where a visible-region point cloud—captured or predicted—is treated as a hard structural constraint, requiring the generated asset to align with observed geometry while plausibly completing unobserved parts.
Achieving this cannot be done by simply injecting the point cloud as an additional condition; it requires integrating the structural prior into the latent space itself. Recent latent 3D diffusion models [30, 21, 57], represented by TRELLIS [67], factorize 3D generation into two stages: a coarse structural stage operating on a sparse occupancy representation, followed by a semantic and appearance refinement stage. This paradigm offers an explicit structure latent that could be guided by 3D priors. Yet in their default formulation, these structure latents are initialized purely from Gaussian noise and rely only on text or image embeddings, making them unable to anchor structural generation to actual 3D observations.

To overcome this limitation, we introduce Points-to-3D, a point-cloud–guided 3D generation framework that re-defines how the structural latent is initialized and completed. Instead of starting from pure noise, we voxelize the visible point cloud and encode it with the TRELLIS sparse structure VAE to obtain a partially observed latent that directly reflects the measured geometry. Regions supported by observations are preserved as fixed constraints, while unobserved regions remain free to be synthesized. A mask-aware inpainting network completes this mixed latent, enabling the model to generate coherent structures that respect real 3D measurements while plausibly filling missing areas. To support this formulation, we construct a visibility-aware training pipeline that first produces realistic partial–complete structure pairs from ground-truth assets; these pairs supervise the inpainting model to propagate geometric cues from visible regions to invisible ones while maintaining consistency with the input point cloud. During inference, Points-to-3D adopts a lightweight two-stage procedure: it first establishes a globally consistent structure under visibility constraints, and then performs a brief refinement step to enhance boundary quality without disturbing anchored geometry. This design enables controllable and structurally faithful 3D generation from both sensor-captured and image-predicted point clouds.

We evaluate Points-to-3D on object-level (Toys4K [54]) and scene-level (3D-FRONT [14]) benchmarks. Across all settings, our method consistently outperforms TRELLIS [67] and other baselines in rendered-view quality and geometric fidelity. Gains are especially significant in regions covered by point-cloud priors, where Points-to-3D achieves near-perfect alignment while maintaining realistic completions in unseen areas. Furthermore, combining our point-cloud–anchored structure generation with text conditioning enables controllable text-to-3D generation guided by concrete 3D measurements.

2 Related Work

2.1 3D Modeling

Recovering the 3D model of specific scenes or objects is a fundamental problem in computer vision and graphics. Classical 3D reconstruction leverages multiple images to recover geometry, including Structure-from-Motion (SfM) [47], Multi-View Stereo (MVS) [71, 72, 75], SDF-based approaches [42, 52, 76], etc. Radiance-field models like Neural Radiance Fields (NeRF) [39, 1, 40, 28, 4, 2, 66, 26] and 3D Gaussian Splatting (3DGS) [25, 19, 32, 22, 74] further enable high-fidelity reconstruction and novel-view synthesis after scene-specific optimization, and recent feed-forward variants [5, 8, 9] reduce the per-scene cost.
DUSt3R-related feed-forward reconstruction methods [62, 31, 77, 68, 65, 58], exemplified by VGGT [61], predict per-pixel point clouds and implicitly handle camera poses, achieving strong performance even with a single image. However, recovering 3D assets from only one image is still beyond the capabilities of reconstruction methods. 3D generative models [67, 30, 21, 57, 34, 59, 29] effectively address this scenario: conditioned on a single reference image or even a text prompt, they can synthesize plausible 3D assets aligned with the reference content.

2.2 3D Generative Models

Earlier 3D generation relied on GANs [3, 48, 17], which produce convincing results, yet their instability restricts scalability and output diversity. With the advent of diffusion and flow-matching models [18, 53, 35], starting with 2D generation [46, 43, 60], diffusion-based methods have rapidly expanded across a spectrum of 3D representations [30, 21, 57, 36, 49, 51, 56, 78, 70, 20]. Recently, TRELLIS [67] introduced a novel latent representation that enables decoding into versatile 3D output formats, demonstrating strong quality, versatility, and editability, and offering a superior paradigm and framework for 3D generation. Subsequent works [33, 63, 69, 38] have leveraged TRELLIS to implement a wide range of practical applications. Nevertheless, most existing improvements focus on enhancing performance at the reference-conditioning level, while directly embedding 3D priors into the latent initialization to enable more reliable generation remains largely unexplored.

2.3 Point Cloud Priors

Incorporating 3D priors to assist downstream tasks has proven to be an effective strategy, and point clouds stand out as one of the most practical and informative representations. Leveraging point cloud priors has advanced a wide range of 3D perception tasks [50, 80, 64] as well as reconstruction tasks [13, 45]. In particular, visible-region point clouds are easy to obtain from diverse sources, including active sensors such as LiDAR and structured-light depth cameras—now widely accessible even on mobile devices—or from reconstruction approaches such as VGGT [61]. Integrating these easily obtainable point clouds into 3D generative models offers promising potential for explicit geometry control and accurate modeling of complex multi-object scenes. This work seeks to establish a simple yet effective paradigm for incorporating point cloud priors into a diffusion-based 3D generation framework.

2.4 Inpainting

Inpainting is a common paradigm for completing missing content while preserving observed structures. In 2D vision, diffusion-based inpainting methods [37, 23, 6, 73] use spatial masks to guide the synthesis of occluded regions, achieving coherent and controllable image completion. Similar ideas appear in 3D completion [11, 10, 24, 16], where partial scans are used to infer full geometry, but such approaches usually operate as separate completion modules rather than within a generative framework. TRELLIS, however, has the potential to perform inpainting directly within its sparse structured latent space, enabling improved global coherence without the need for external modules. Building on this perspective, we formulate point-cloud–conditioned 3D generation as a latent inpainting problem, allowing observed geometry to be embedded and completed naturally within the generative process without relying on auxiliary completion components.

3 Method
Figure 2: Overall framework. Given point cloud priors—either pre-existing or predicted by VGGT from an input image—we first voxelize and VAE-encode them to obtain an S latent, where the empty regions are filled with random noise and concatenated with an extracted mask to form the input to our model. During training, the input training data is fed into our inpainting flow transformer $G_{inp}$, which is optimized via a conditional flow matching loss. During inference, the input test data is processed by the trained $G_{inp}$ through a two-stage sampling procedure: (1) a structural inpainting stage with $s$ sampling steps to inpaint the global structure, and (2) a boundary refinement stage with the remaining $(t-s)$ steps to refine the inpainting boundaries, yielding the final output S latent.

We seek to achieve geometry-controllable 3D generation by conditioning on point clouds, whether captured by real-world sensors or inferred from a single image via feed-forward prediction. This section first outlines TRELLIS, the baseline that underpins our work, then formalizes the problem and introduces our method, detailing the model architecture, training-data construction, and sampling strategy.

3.1 Preliminaries: TRELLIS

TRELLIS [67] is a recently proposed 3D generation model that produces diverse and high-fidelity 3D assets from image or text prompts. Unlike conventional diffusion models that operate directly in voxel or implicit-function space, TRELLIS performs diffusion in a compact latent space specifically designed to encode 3D structure and appearance. This latent space is learned via a pair of variational autoencoders (VAEs) trained to compress and reconstruct 3D assets. The first VAE encodes voxelized 3D features derived from the original asset into a latent representation called the structured latent (SLAT), denoted as $z = \{(z_i, p_i)\}_{i=1}^{L}$, where $z_i \in \mathbb{R}^{c_{slat}}$ is a local feature attached to voxel position $p_i \in [0, N-1]^3$ with $N = 64$. The SLAT $z$ can be decoded into multiple 3D output formats—Gaussian splats, radiance fields, or meshes—through corresponding decoders, enabling flexible rendering backends. The second VAE $(\mathcal{E}_s, \mathcal{D}_s)$ learns a compact representation of geometry by encoding a binary voxel occupancy grid $M \in \{0,1\}^{N \times N \times N}$—whose occupied positions correspond to $\{p_i\}_{i=1}^{L}$—into a sparse structure (S) latent $q \in \mathbb{R}^{r \times r \times r \times c_s}$ (with $r = 16$), which can be decoded back into $M$.

The generation process in TRELLIS proceeds in two stages following a coarse-to-fine paradigm. In the Structure Generation stage, a Flow Transformer $G_s$ takes Gaussian noise $\epsilon_s \sim \mathcal{N}(0, I)$ and a condition embedding $c$ (from image or text) to sample the S latent $q$, which is then decoded by $\mathcal{D}_s$ into a binary voxel grid $M$, defining the asset's geometric scaffold. In the subsequent Structured Latent Generation stage, a Sparse Flow Transformer $G_l$ takes noise $\epsilon_{slat} \sim \mathcal{N}(0, I)$, the voxel positions $\{p_i\}_{i=1}^{L}$, and the same condition $c$ to generate the SLAT $z$, which is decoded into the final 3D asset with texture and semantics. Overall, TRELLIS establishes a two-level generative hierarchy that first synthesizes a sparse geometric structure and then enriches it with detailed appearance. While the VAEs in TRELLIS possess the intrinsic ability to encode meaningful 3D geometry, the generative process itself is not conditioned on external 3D information. Our approach, Points-to-3D, leverages this encoded structural capability by directly injecting point-cloud priors into the VAE latent space, thereby grounding the diffusion process in explicit 3D observations.
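To make the sparse-structure interface above concrete, the following is a minimal, shape-level sketch of what "voxelize a point cloud, encode it to a $16^3 \times c_s$ S latent, decode back to a $64^3$ occupancy grid" looks like. The `ToyStructureVAE`, the single strided convolution, and the latent channel count are hypothetical stand-ins chosen only to reproduce the tensor shapes; the real TRELLIS VAE is a learned model with a different architecture.

```python
# Shape-level sketch of the sparse-structure (S) latent interface described above.
# The modules here are hypothetical stand-ins, not the TRELLIS API.
import torch
import torch.nn as nn

N, R, C_S = 64, 16, 8  # occupancy resolution, latent resolution, latent channels (assumed)

class ToyStructureVAE(nn.Module):
    """Compresses a binary N^3 occupancy grid M into an R^3 x C_S latent q and back."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv3d(1, C_S, kernel_size=4, stride=4)           # E_s: 64^3 -> 16^3 x C_S
        self.dec = nn.ConvTranspose3d(C_S, 1, kernel_size=4, stride=4)  # D_s: 16^3 x C_S -> 64^3

    def encode(self, occupancy):              # occupancy: (B, 1, N, N, N) in {0, 1}
        return self.enc(occupancy)            # q: (B, C_S, R, R, R)

    def decode(self, q):
        return (self.dec(q) > 0).float()      # reconstructed binary occupancy grid

def voxelize(points, n=N):
    """Map points in [-0.5, 0.5]^3 to a binary n^3 occupancy grid."""
    grid = torch.zeros(1, 1, n, n, n)
    idx = ((points + 0.5) * n).long().clamp(0, n - 1)
    grid[0, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

pts = torch.rand(2048, 3) - 0.5               # stand-in visible point cloud
vae = ToyStructureVAE()
q_vis = vae.encode(voxelize(pts))             # partially observed S latent
print(q_vis.shape)                            # torch.Size([1, 8, 16, 16, 16])
```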
3.2 Problem Formulation

In many real-world settings, we aim to generate 3D assets conditioned on point cloud priors, obtained either via active sensing (e.g., LiDAR) or model prediction (e.g., VGGT). These point clouds typically cover only the visible portion of the scene. In such a case, the goal is to use the visible-region point cloud $P$ as a prior for geometry-controllable 3D asset generation: preserving the observed foreground structure while completing unobserved regions guided by foreground cues. To this end, we cast the task as inpainting conditioned on $P$, inferring missing geometry from the surrounding latent context. Specifically, unlike TRELLIS—which initializes generation from pure noise $\epsilon_s$—our structural inpainting stage begins by voxelizing the visible point-cloud priors $P$ into a binary 3D occupancy grid $M' \in \{0,1\}^{N \times N \times N}$. This voxelized structure is then encoded with the VAE encoder $\mathcal{E}_s$ to obtain the initial S latent $q_{vis} \in \mathbb{R}^{r \times r \times r \times c_s}$ (with $r = 16$), which serves as the generation starting point. Formally:

$$q_{vis} = \mathcal{E}_s(M'). \quad (1)$$

To indicate which S latent regions should be preserved, we derive an occupancy mask $m_s \in \mathbb{R}^{r \times r \times r \times c_m}$ by down-sampling $M'$ to the latent resolution. Then, we preserve the visible-region S latent with $m_s$ and fill the remaining regions with noise to obtain the inpainting input S latent $q_{comb}$:

$$q_{comb} = m_s \odot q_{vis} + (1 - m_s) \odot \epsilon_s. \quad (2)$$

Ultimately, we aim to build an inpainting model $G_{inp}$, based on the structure generation model $G_s$, that takes $q_{comb}$ as input and inpaints the final S latent $q$, facilitating geometry-controllable generation anchored to the visible regions.

3.3 Point Cloud Priors Driven Generative Model

Model design. As shown in the purple box of Fig. 2, to help the inpainting model $G_{inp}$ distinguish the regions to preserve from the regions to generate, we further concatenate the mask $m_s$ to $q_{comb}$ along the channel dimension. This turns $q_{comb}$ into $x_{inp}$:

$$x_{inp} = \mathrm{Concat}[q_{comb},\, m_s], \qquad x_{inp} \in \mathbb{R}^{r \times r \times r \times (c_s + c_m)}. \quad (3)$$

To adapt $G_{inp}$ to the larger number of input channels, we simply replace its input layer, inherited from $G_s$, with a newly registered projection layer of channel dimension $(c_s + c_m)$, and keep all other network structures unchanged. We then fully fine-tune $G_{inp}$ to learn to inpaint a completed sparse structure latent $q_{pred} \in \mathbb{R}^{r \times r \times r \times c_s}$ using the Conditional Flow Matching (CFM) loss, with the ground-truth sparse latent $q_{gt}$ as supervision:

$$\mathcal{L}_{CFM} = \mathbb{E}_{t,\, q_{gt},\, \epsilon} \left\| G_{inp}(x_{inp}, t) - (\epsilon - q_{gt}) \right\|_2^2. \quad (4)$$

Note that the condition $c$ and the time-dependent noise scheduling for $x_{inp}$ are omitted for simplicity.
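The input formulation of Eqs. (1)-(3) amounts to a few tensor operations once $q_{vis}$ is available. The sketch below illustrates them with stand-in tensors; the latent channel count, the single-channel mask, and the max-pooling used to downsample $M'$ to the latent grid are assumptions for illustration, not the authors' implementation.

```python
# Illustrative construction of q_comb (Eq. 2) and x_inp (Eq. 3) with stand-in tensors.
import torch
import torch.nn.functional as F

C_S = 8                                                       # assumed S-latent channel count
occupancy = (torch.rand(1, 1, 64, 64, 64) > 0.97).float()     # stand-in visible-region voxels M'
q_vis = torch.randn(1, C_S, 16, 16, 16)                       # stand-in E_s(M'), Eq. (1)

# Occupancy mask m_s at latent resolution: here a latent cell counts as "visible"
# if any voxel in its 4^3 footprint is occupied (one simple downsampling choice).
m_s = F.max_pool3d(occupancy, kernel_size=4, stride=4)        # (1, 1, 16, 16, 16)

# Eq. (2): keep visible-region latents, fill unobserved regions with Gaussian noise.
eps = torch.randn_like(q_vis)
q_comb = m_s * q_vis + (1.0 - m_s) * eps

# Eq. (3): concatenate the mask along the channel dimension so the inpainting
# model can tell preserved regions from regions it must generate.
x_inp = torch.cat([q_comb, m_s], dim=1)                       # (1, C_S + 1, 16, 16, 16)
print(x_inp.shape)
```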
Figure 3: Training data processing. We preserve the visible portion of the complete point cloud and convert it into training inputs.

Training data from visible point clouds. We construct diverse pairs of training data from visible point clouds together with their corresponding ground-truth sparse structure latents, as illustrated in Fig. 3. The main challenge lies in accurately obtaining the visible-region point clouds corresponding to the input condition images of each 3D asset. To achieve this, we render the depth map $D_t$ of height $H$ and width $W$ from $T$ viewpoints, together with the condition images $I_t$, for each ground-truth 3D asset. For each object, we first uniformly sample $S$ surface points $\hat{P} = \{\hat{p}_i = (u_i, v_i, w_i)\}_{i=1}^{S}$, and given the world-to-camera transformation $T_t = [R_t \mid t_t]$ for the $t$-th view, each point is transformed into the camera frame as:

$$\hat{p}_i^{\,t} = R_t(\hat{p}_i - t_t) = (u_i^t, v_i^t, w_i^t)^\top. \quad (5)$$

The corresponding image-plane projection $u_i \in [1, H] \times [1, W]$ is computed using the intrinsic matrix $K$. We apply an observation mask $O^t$ to indicate which points are considered visible in view $t$: a point is visible if its projected depth $w_i^t$ is consistent with the rendered depth within a tolerance threshold $\tau$:

$$O_i^t = \begin{cases} 1, & \text{if } |D_t(u_i) - w_i^t| < \tau, \\ 0, & \text{otherwise.} \end{cases} \quad (6)$$

The visible point cloud $P_t$ for view $t$ is thus obtained as $P_t = \{\hat{p}_i^{\,t} \mid O_i^t = 1\}$. Each $P_t$ is then voxelized into a sparse structure voxel grid, which is encoded to form the S latent $q_{comb}^t$. Simultaneously, the downsampled occupancy mask $m_s^t$ is obtained to indicate the visible region of the resulting S latent. The ground-truth S latent $q_{gt}$ is derived from the complete 3D structure of the object. Consequently, the samples $(q_{comb}^t, m_s^t, I_t, q_{gt})$ are used to supervise the model $G_{inp}$ (green box in Fig. 2) to learn structure completion from visible-region priors.
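The per-view visibility test of Eqs. (5)-(6) is a standard project-and-compare step. The sketch below shows one way to implement it; the pinhole projection convention, the epsilon guard, and all variable names are illustrative assumptions rather than the authors' data pipeline.

```python
# Illustrative depth-consistency visibility test (Eqs. 5-6).
import numpy as np

def visible_points(points, R, t, K, depth_map, tau):
    """points: (S, 3) world-frame surface samples; R: (3, 3) rotation; t: (3,) translation;
    K: (3, 3) intrinsics; depth_map: (H, W) rendered depth for this view; tau: tolerance.
    Appendix A sets tau to 0.05 times the per-view depth range."""
    H, W = depth_map.shape
    cam = (points - t) @ R.T                             # Eq. (5): p^t = R (p - t)
    z = cam[:, 2]
    proj = cam @ K.T                                     # pinhole projection
    safe_z = np.maximum(z, 1e-8)                         # guard against division by zero
    u = np.round(proj[:, 0] / safe_z).astype(int)        # column index
    v = np.round(proj[:, 1] / safe_z).astype(int)        # row index
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    visible = np.zeros(len(points), dtype=bool)
    rendered = depth_map[v[inside], u[inside]]
    visible[inside] = np.abs(rendered - z[inside]) < tau  # Eq. (6): depth-consistency test
    return cam[visible]                                   # visible points P_t in the camera frame
```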
3.4 Staged Sampling from Point Cloud Priors

During inference, we split the $t$-step generation into two separate sampling stages, namely the structural inpainting stage and the boundary refinement stage. As illustrated in the orange box of Fig. 2, the first stage produces a coarse but globally consistent skeleton structure guided by the visible point clouds via inpainting, while the second stage refines the boundary regions that connect newly generated content to the predefined visible areas. Specifically, in each sampling step of the structural inpainting stage, the trained model outputs $q_{pred}$ and reconstructs the inpainting input $x_{inp}$ for the next iteration by concatenating $q_{pred}$ with the visibility mask $m_s$, following Eq. 3. We repeat this process for $s$ steps to obtain a draft skeleton 3D structure that is mostly coherent with the visible point cloud. However, slight inconsistencies and missing details may appear around the boundary regions between generated and predefined visible areas, mainly due to information loss introduced during down-sampling. To address this, we define the latter $(t - s)$ steps as the boundary refinement stage. Here, we replace the visibility mask $m_s$ with an all-ones mask $m_1$, effectively converting inpainting into standard denoising. This allows the model to refine details on either side of the masked and unmasked regions without drastically modifying the existing global geometry, resulting in a fully completed and high-quality sparse structure.

4 Results

Table 1: Comparison on single-object generation on the Toys4K dataset. We showcase the performance of our method in two scenarios: one where explicit point cloud priors are provided, and another where point clouds are inferred from condition images using VGGT [61]. Rendering metrics: PSNR, SSIM, LPIPS, DINO; geometry metrics: CD, F-Score, PSNR-N, LPIPS-N.

| Method | PSNR↑ | SSIM(%)↑ | LPIPS↓ | DINO(%)↓ | CD↓ | F-Score↑ | PSNR-N↑ | LPIPS-N↓ |
|---|---|---|---|---|---|---|---|---|
| GaussianAnything [30] | 20.08 | 89.31 | 0.183 | 26.74 | 0.084 | 0.513 | 20.99 | 0.199 |
| Real3D [21] | 19.55 | 90.65 | 0.169 | 27.65 | 0.065 | 0.574 | 21.31 | 0.178 |
| LGM [57] | 20.55 | 89.98 | 0.181 | 23.45 | 0.075 | 0.487 | 20.04 | 0.202 |
| VoxHammer [33] (3D Inversion) | 20.51 | 90.01 | 0.123 | 15.10 | 0.046 | 0.724 | 20.28 | 0.158 |
| TRELLIS [67] | 21.94 | 91.46 | 0.105 | 7.82 | 0.034 | 0.832 | 23.81 | 0.105 |
| SAM3D [7] | 22.42 | 91.45 | 0.111 | 8.01 | 0.033 | 0.835 | 23.85 | 0.101 |
| Points-to-3D (Ours-VGGT Esti.) | 22.55 | 92.09 | 0.088 | 7.37 | 0.024 | 0.881 | 24.53 | 0.085 |
| Points-to-3D (Ours-P.C. Priors) | 22.91 | 92.83 | 0.070 | 7.29 | 0.013 | 0.964 | 27.10 | 0.053 |

Figure 4: Single-object generation on Toys4K. For the explicit point cloud priors results, we use point clouds extracted strictly from the visible region of the input images, whereas the “VGGT-estimated” results use point clouds inferred from the condition images by VGGT.

4.1 Experiments Setup

Datasets. We train our model on a combination of three datasets: 3D-FUTURE [15], HSSD [27], and ABO [12]. During training, we render $T = 24$ input views for each object and sample $S = 50{,}000$ points from each object mesh. For each view, we compute the corresponding visible point cloud, which is then processed into an initial S latent used for training. We evaluate our model on two types of test datasets: the single-object dataset Toys4K [54] and the scene dataset 3D-FRONT [14]. Our method is tested under two settings to cover a broader range of application scenarios. In the first setting, we use the sampled visible point cloud from each view as the available input. In the second setting, where no preprocessed point cloud priors are provided, the test-view condition image is fed into VGGT to obtain an estimated point cloud that serves as the initial point cloud prior for our model. We also evaluate our method on several real-world images from the Pix3D [55] dataset.

Evaluation metrics. We evaluate the final generation results in two aspects. For the rendered images of the generated 3D assets, we assess image quality by comparing them with images rendered from the ground-truth 3D assets, using PSNR, SSIM, LPIPS [79], and DINO [41] feature similarity as evaluation metrics. For geometric quality, we employ Chamfer Distance and F-score, as well as PSNR and LPIPS of the rendered normal maps. For text-to-3D generation evaluation, we use the CLIP [44] score to measure the consistency between the generated results and the input text prompts.

Implementation. We train our model for 20k iterations with a batch size of 8 on 4 Nvidia A100 GPUs, following the training settings of TRELLIS [67]'s sparse structure flow transformer. During inference, we use $t = 50$ sampling steps for the trained Sparse Structure Flow Transformer, allocating $s = 25$ steps for structural inpainting and the remaining steps for refinement. For the other comparison methods, we use their official code and settings to reproduce their results. We reproduce the results of the 3D editing method VoxHammer [33] to represent the 3D inversion results. Specifically, we use the structural voxels obtained from the same initial point clouds as in our method to define the “Unedited Region” in VoxHammer, and then apply their pipeline to obtain the final generated results.
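The two-stage schedule described in Sec. 3.4 and configured above ($t = 50$ total steps, $s = 25$ inpainting steps) can be summarized as a single loop that swaps the mask channel halfway through sampling. The following is a schematic sketch only: `ToyGInp`, the linear noise-level schedule, and the Euler update are placeholders standing in for the trained flow transformer and its actual solver, not the released code.

```python
# Schematic two-stage sampler: structural inpainting, then boundary refinement.
import torch

class ToyGInp(torch.nn.Module):
    """Stand-in for the trained inpainting flow transformer G_inp (not the real model)."""
    def __init__(self, c_in=9, c_s=8):
        super().__init__()
        self.net = torch.nn.Conv3d(c_in, c_s, kernel_size=3, padding=1)
    def forward(self, x_inp, noise_level, cond):
        return self.net(x_inp)

def staged_sampling(g_inp, q_vis, m_s, cond, t_steps=50, s_steps=25):
    """Stage 1 (first s_steps): inpainting with the visibility mask. Stage 2: an all-ones
    mask turns inpainting into plain denoising so inpainting boundaries get refined."""
    q = m_s * q_vis + (1.0 - m_s) * torch.randn_like(q_vis)   # q_comb, Eq. (2)
    ones = torch.ones_like(m_s)
    for step in range(t_steps):
        mask = m_s if step < s_steps else ones
        x_inp = torch.cat([q, mask], dim=1)                   # Eq. (3): latent + mask channel
        v = g_inp(x_inp, 1.0 - step / t_steps, cond)          # predicted flow (placeholder schedule)
        q = q - v / t_steps                                   # one Euler step of the learned flow
    return q

q_vis = torch.randn(1, 8, 16, 16, 16)                         # stand-in encoded visible latent
m_s = (torch.rand(1, 1, 16, 16, 16) > 0.5).float()            # stand-in visibility mask
q_final = staged_sampling(ToyGInp(), q_vis, m_s, cond=None)
print(q_final.shape)                                          # torch.Size([1, 8, 16, 16, 16])
```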
Table 2: Comparison on scene-level generation on the 3D-FRONT dataset. Points-to-3D consistently outperforms state-of-the-art multi-object generation methods across all evaluation metrics.

| Method | PSNR↑ | SSIM(%)↑ | LPIPS↓ | DINO(%)↓ | CD↓ | F-Score↑ | PSNR-N↑ | LPIPS-N↓ |
|---|---|---|---|---|---|---|---|---|
| TRELLIS [67] | 18.21 | 83.12 | 0.239 | 12.33 | 0.094 | 0.478 | 18.76 | 0.258 |
| VoxHammer [33] (3D Inversion) | 19.29 | 84.70 | 0.179 | 18.41 | 0.051 | 0.686 | 20.43 | 0.181 |
| SceneGen [38] | 18.32 | 83.35 | 0.231 | 14.43 | 0.086 | 0.485 | 19.08 | 0.229 |
| MIDI [20] | 19.23 | 85.59 | 0.166 | 14.25 | 0.075 | 0.513 | 20.82 | 0.164 |
| Points-to-3D (Ours-VGGT Esti.) | 20.52 | 86.51 | 0.152 | 8.90 | 0.040 | 0.743 | 20.97 | 0.160 |
| Points-to-3D (Ours-P.C. Priors) | 21.63 | 87.73 | 0.124 | 8.29 | 0.025 | 0.886 | 22.38 | 0.124 |

Figure 5: Scene-level generation on 3D-FRONT. The input point cloud priors setting is the same as in Fig. 4.

Table 3: Comparison of visible and overall geometry results on Toys4K. For each method, the upper row (O.) shows the overall results, while the lower row (V.) shows the visible-region results.

| Method | CD↓ | F-Score↑ | PSNR-N↑ | LPIPS-N↓ |
|---|---|---|---|---|
| TRELLIS [67]-O. | 0.034 | 0.832 | 23.81 | 0.105 |
| TRELLIS [67]-V. | 0.032 | 0.854 | 24.77 | 0.093 |
| SAM3D [7]-O. | 0.033 | 0.835 | 23.85 | 0.101 |
| SAM3D [7]-V. | 0.031 | 0.841 | 24.81 | 0.090 |
| Points-to-3D-O. | 0.013 | 0.964 | 27.10 | 0.053 |
| Points-to-3D-V. | 0.007 | 0.998 | 29.00 | 0.036 |

4.2 Main Results

Single-object generation. We first present the results of single-object generation on Toys4K [54]. As shown in Tab. 1, our method consistently outperforms existing approaches across all evaluation metrics, whether using existing point cloud priors or VGGT [61]-predicted point clouds. Notably, in terms of geometric metrics, the results with point cloud priors achieve an F-score of 0.963, demonstrating that our approach produces geometry that closely approximates the ground-truth structure. The significant improvement in geometry also enhances the visual fidelity of the results. As illustrated in Fig. 4, our results better match the overall appearance of the ground truth compared to other methods, and the normal maps further highlight the superior geometric quality achieved by our approach. Notably, while VoxHammer adopts the same 3D priors as ours, the image condition fails to provide cues for the missing parts of the 3D priors, making it difficult for the 3D inversion process to complete the unknown regions. SAM3D [7] also highlights the value of 3D priors and leverages point maps, but it integrates these priors indirectly through the attention mechanism of its flow transformer blocks, which—as also stated in their paper—does not support explicit geometric control and exhibits limited ability to enforce precise geometry compared to our approach. In contrast, our method leverages the trained model's inpainting capability to fully exploit the existing 3D priors and effectively infer the missing geometry.

Scene-level generation. We evaluate our method on the scene-level generation dataset 3D-FRONT [14]. As shown in Tab. 2, incorporating point cloud priors provides substantial guidance for reconstructing overall geometry in complex scene scenarios, which allows our method to achieve significant improvements across all evaluation metrics compared to other methods. The rendered images and normal maps in Fig. 5 further demonstrate that our results better align with the ground-truth scene geometry.
Unlike MIDI [20] or SceneGen [38], which utilize spatial information implicitly, our framework explicitly incorporates geometric priors within the architecture, enabling more direct and effective control over 3D geometry and offering a promising solution for generating complex 3D scenes.

Visible region performance. We also highlight the generation results for the visible regions, i.e., the areas covered by our point cloud priors. As shown in Tab. 3, within these visible regions, our generated results achieve an F-score of 0.998 and a Chamfer distance of 0.007, indicating strong alignment with the ground-truth structure. This demonstrates that our structure generation pipeline effectively preserves the information provided by the point cloud priors while producing high-quality overall geometry. Compared with other methods, both in the visible regions and across the entire structure, our approach achieves substantial improvements in geometric fidelity, fulfilling the primary objectives of our work. SAM3D does not achieve improved geometry even within the regions covered by the input point map (i.e., the visible areas in the table). Our method instead injects 3D priors through a more direct and explicit mechanism, enabling effective and reliable geometric controllability and giving current 3D generation frameworks a stronger opportunity to benefit from sensed 3D priors as well as from future improvements in feed-forward point-map prediction methods.

Table 4: Ablation study. We evaluate the number of inpainting steps (Inp.) and refinement steps (Ref.) in our sampling strategy.

| Inp. | Ref. | CD↓ | F-Score↑ | PSNR-N↑ | LPIPS-N↓ |
|---|---|---|---|---|---|
| 50 | 0 | 0.014 | 0.960 | 25.88 | 0.065 |
| 40 | 10 | 0.013 | 0.962 | 26.49 | 0.059 |
| 30 | 20 | 0.013 | 0.963 | 26.89 | 0.056 |
| 25 | 25 | 0.013 | 0.963 | 27.10 | 0.053 |
| 20 | 30 | 0.013 | 0.962 | 27.03 | 0.055 |
| 10 | 40 | 0.014 | 0.961 | 26.72 | 0.061 |

Figure 6: Ablation study. Allocating the full sampling budget to inpainting (Inp.) results in geometric “holes” along the inpainting edge.

4.3 Ablation Studies

VGGT point cloud estimation. When point clouds are not available as input, our method can also leverage the condition image to predict an initial point cloud using feed-forward methods such as VGGT [61]. We evaluate the generation results based on VGGT-estimated point clouds, as shown in Tab. 1 and Tab. 2. The results with VGGT point clouds exhibit some gap compared to using accurate point cloud priors, largely due to the inherent prediction errors of VGGT. Nevertheless, compared to other existing approaches, this setting still achieves substantial improvements in both geometric accuracy and visual fidelity. These results highlight the strong robustness and flexibility of our pipeline: even in the absence of high-precision priors, our framework can effectively utilize point clouds predicted from image-only inputs to achieve high-quality geometry generation.

Staged sampling strategy. We propose a staged sampling strategy in our pipeline, which uses a limited number of final steps with noise to perform global optimization, effectively addressing the “holes” along inpainting boundaries that are otherwise difficult to avoid. We investigate the effect of refinement step allocation through an ablation study. In Tab. 4, we present generation results under different allocations of inpainting and refinement steps with the same total number of sampling steps. When the entire sampling process is allocated to inpainting, the geometric reconstruction suffers from “holes” along the inpainting edges, as further illustrated in Fig. 6.
By setting the sampling schedule to 25 inpainting steps followed by 25 refinement steps, the geometric metrics reach their best performance, and the previously observed “holes” are effectively eliminated as in Fig. 6, yielding the overall best generation results. 4.4 Real-world Input Examples We further evaluate the robustness of our method on real-world images from the Pix3D [55] dataset. As illustrated in Fig. 7, our approach maintains robust performance on real image inputs, producing geometry that aligns more faithfully with the input images compared to the baseline method. Figure 7: Real-world examples on Pix3D. 5 Conclusion We introduce Points-to-3D, a diffusion-based framework that first leverages explicit 3D point cloud priors as input to enable geometry-controllable 3D asset and scene generation. Built upon the latent 3D diffusion model TRELLIS [67], we investigate a natural way to embed point cloud as initialization within the framework. After training TRELLIS’s structure generation network to acquire inpainting capabilities, we employ a staged sampling strategy—structural inpainting followed by boundary refinement—that reconstructs the global geometry while preserving the input visible regions. Experiments demonstrate the benefits of explicitly embedding 3D priors, highlighting a promising direction for controllable and reliable 3D generation in real-world applications. References [1] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan (2021) Mip-nerf: a multiscale representation for anti-aliasing neural radiance fields. In ICCV, p. 5855–5864. Cited by: §2.1. [2] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman (2023) Zip-nerf: anti-aliased grid-based neural radiance fields. ICCV. Cited by: §2.1. [3] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis, et al. (2022) Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 16123–16133. Cited by: §2.2. [4] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su (2022) Tensorf: tensorial radiance fields. In ECCV, p. 333–350. Cited by: §2.1. [5] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su (2021) MVSNeRF: fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, p. 14124–14133. Cited by: §2.1. [6] X. Chen, L. Huang, Y. Liu, Y. Shen, D. Zhao, and H. Zhao (2024) Anydoor: zero-shot object-level image customization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 6593–6602. Cited by: §2.4. [7] X. Chen, F. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, et al. (2025) Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624. Cited by: §4.2, Table 1, Table 3, Table 3. [8] Y. Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T. Cham, and J. Cai (2024) Mvsplat: efficient 3d gaussian splatting from sparse multi-view images. In ECCV, p. 370–386. Cited by: §2.1. [9] Y. Chen, C. Zheng, H. Xu, B. Zhuang, A. Vedaldi, T. Cham, and J. Cai (2024) MVSplat360: feed-forward 360 scene synthesis from sparse views. Cited by: §2.1. [10] Y. Cheng, H. Lee, S. Tulyakov, A. G. Schwing, and L. Gui (2023) Sdfusion: multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 4456–4465. Cited by: §2.4. [11] R. Chu, E. Xie, S. Mo, Z. Li, M. 
Nießner, C. Fu, and J. Jia (2024) Diffcomplete: diffusion-based generative 3d shape completion. Advances in Neural Information Processing Systems 36. Cited by: §2.4. [12] J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora, et al. (2022) Abo: dataset and benchmarks for real-world 3d object understanding. In CVPR, Cited by: Appendix A, §4.1. [13] K. Deng, A. Liu, J. Zhu, and D. Ramanan (2022-06) Depth-supervised NeRF: fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3. [14] H. Fu, B. Cai, L. Gao, L. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. (2021) 3d-front: 3d furnished rooms with layouts and semantics. In ICCV, Cited by: Appendix A, §1, §4.1, §4.2. [15] H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao (2021) 3d-future: 3d furniture shape with texture. IJCV 129, p. 3313–3337. Cited by: Appendix A, §4.1. [16] J. D. Galvis, X. Zuo, S. Schaefer, and S. Leutengger (2024) SC-diff: 3d shape completion with latent diffusion models. arXiv preprint arXiv:2403.12470. Cited by: §2.4. [17] J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler (2022) Get3d: a generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems 35, p. 31841–31854. Cited by: §2.2. [18] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in neural information processing systems 33, p. 6840–6851. Cited by: §2.2. [19] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao (2024) 2D gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers, Cited by: §2.1. [20] Z. Huang, Y. Guo, X. An, Y. Yang, Y. Li, Z. Zou, D. Liang, X. Liu, Y. Cao, and L. Sheng (2025) Midi: multi-instance diffusion for single image to 3d scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 23646–23657. Cited by: §1, §2.2, §4.2, Table 2. [21] H. Jiang, Q. Huang, and G. Pavlakos (2025) Real3d: scaling up large reconstruction models with real-world images. Cited by: §1, §2.1, §2.2, Table 1. [22] Y. Jiang, J. Tu, Y. Liu, X. Gao, X. Long, W. Wang, and Y. Ma (2024) Gaussianshader: 3d gaussian splatting with shading functions for reflective surfaces. In CVPR, p. 5322–5332. Cited by: §2.1. [23] X. Ju, X. Liu, X. Wang, Y. Bian, Y. Shan, and Q. Xu (2024) Brushnet: a plug-and-play image inpainting model with decomposed dual-branch diffusion. In European Conference on Computer Vision, p. 150–168. Cited by: §2.4. [24] Y. Kasten, O. Rahamim, and G. Chechik (2024) Point cloud completion with pretrained text-to-image diffusion models. Advances in Neural Information Processing Systems 36. Cited by: §2.4. [25] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis (2023-07) 3D gaussian splatting for real-time radiance field rendering. ACM TOG 42 (4). Cited by: §2.1. [26] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik (2023) LERF: language embedded radiance fields. In ICCV, p. 19729–19739. Cited by: §2.1. [27] M. Khanna, Y. Mao, H. Jiang, S. Haresh, B. Shacklett, D. Batra, A. Clegg, E. Undersander, A. X. Chang, and M. Savva (2024) Habitat synthetic scenes dataset (hssd-200): an analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In CVPR, Cited by: Appendix A, §4.1. [28] A. Kurz, T. Neff, Z. Lv, M. Zollhöfer, and M. 
Steinberger (2022) AdaNeRF: adaptive sampling for real-time rendering of neural radiance fields. In ECCV, p. 254–270. Cited by: §2.1. [29] Z. Lai, Y. Zhao, Z. Zhao, H. Liu, Q. Lin, J. Huang, C. Guo, and X. Yue (2025) LATTICE: democratize high-fidelity 3d generation at scale. External Links: 2512.03052, Link Cited by: §2.1. [30] Y. Lan, S. Zhou, Z. Lyu, F. Hong, S. Yang, B. Dai, X. Pan, and C. C. Loy (2025) GaussianAnything: interactive point cloud latent diffusion for 3d generation. In ICLR, Cited by: §1, §1, §2.1, §2.2, Table 1. [31] V. Leroy, Y. Cabon, and J. Revaud (2024) Grounding image matching in 3d with mast3r. In ECCV, Cited by: §2.1. [32] J. Li, J. Zhang, X. Bai, J. Zheng, X. Ning, J. Zhou, and L. Gu (2024) DNGaussian: optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In CVPR, Cited by: §2.1. [33] L. Li, Z. Huang, H. Feng, G. Zhuang, R. Chen, C. Guo, and L. Sheng (2025) Voxhammer: training-free precise and coherent 3d editing in native 3d space. arXiv preprint arXiv:2508.19247. Cited by: §2.2, §4.1, Table 1, Table 2. [34] Y. Li, Z. Zou, Z. Liu, D. Wang, Y. Liang, Z. Yu, X. Liu, Y. Guo, D. Liang, W. Ouyang, et al. (2025) TripoSG: high-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608. Cited by: §2.1. [35] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. arXiv preprint arXiv:2210.02747. Cited by: §2.2. [36] M. Liu, R. Shi, L. Chen, Z. Zhang, C. Xu, X. Wei, H. Chen, C. Zeng, J. Gu, and H. Su (2024) One-2-3-45++: fast single image to 3d objects with consistent multi-view generation and 3d diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 10072–10083. Cited by: §2.2. [37] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J. Zhu, and S. Ermon (2021) Sdedit: guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073. Cited by: §2.4. [38] Y. Meng, H. Wu, Y. Zhang, and W. Xie (2025) Scenegen: single-image 3d scene generation in one feedforward pass. arXiv preprint arXiv:2508.15769. Cited by: §2.2, §4.2, Table 2. [39] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In ECCV, Cited by: §2.1. [40] T. Neff, P. Stadlbauer, M. Parger, A. Kurz, J. H. Mueller, C. R. A. Chaitanya, A. Kaplanyan, and M. Steinberger (2021) DONeRF: towards real-time rendering of compact neural radiance fields using depth oracle networks. In Comput. Graph. Forum, Vol. 40, p. 45–59. Cited by: §2.1. [41] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. (2024) Dinov2: learning robust visual features without supervision. tmlr. Cited by: §4.1. [42] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019-06) DeepSDF: learning continuous signed distance functions for shape representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1. [43] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach (2023) Sdxl: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952. Cited by: §2.2. [44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. 
(2021) Learning transferable visual models from natural language supervision. In icml, Cited by: §4.1. [45] B. Roessle, J. T. Barron, B. Mildenhall, P. P. Srinivasan, and M. Nießner (2022) Dense depth priors for neural radiance fields from sparse input views. In CVPR, p. 12892–12901. Cited by: §2.3. [46] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 10684–10695. Cited by: §2.2. [47] J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In CVPR, p. 4104–4113. Cited by: §2.1. [48] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger (2020) Graf: generative radiance fields for 3d-aware image synthesis. Advances in neural information processing systems 33, p. 20154–20166. Cited by: §2.2. [49] R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su (2023) Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110. Cited by: §2.2. [50] S. Shi, X. Wang, and H. Li (2019-06) PointRCNN: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3. [51] Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang (2023) Mvdream: multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512. Cited by: §2.2. [52] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. Advances in neural information processing systems 33, p. 7462–7473. Cited by: §2.1. [53] J. Song, C. Meng, and S. Ermon (2020) Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502. Cited by: §2.2. [54] S. Stojanov, A. Thai, and J. M. Rehg (2021) Using shape to categorize: low-shot learning with an explicit shape bias. Cited by: Appendix A, §B.1, §B.4, §1, §4.1, §4.2. [55] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman (2018) Pix3D: dataset and methods for single-image 3d shape modeling. In CVPR, Cited by: §4.1, §4.4. [56] S. Szymanowicz, J. Y. Zhang, P. Srinivasan, R. Gao, A. Brussee, A. Holynski, R. Martin-Brualla, J. T. Barron, and P. Henzler (2025-10) Bolt3D: generating 3d scenes in seconds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), p. 24846–24857. Cited by: §2.2. [57] J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu (2024) Lgm: large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, p. 1–18. Cited by: Table 6, §1, §2.1, §2.2, Table 1. [58] Z. Tang, Y. Fan, D. Wang, H. Xu, R. Ranjan, A. Schwing, and Z. Yan (2025) Mv-dust3r+: single-stage scene reconstruction from sparse views in 2 seconds. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 5283–5293. Cited by: §2.1. [59] T. H. Team (2024) Hunyuan3D 1.0: a unified framework for text-to-3d and image-to-3d generation. External Links: 2411.02293 Cited by: §2.1. [60] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. (2025) Wan: open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314. Cited by: §2.2. [61] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny (2025) VGGT: visual geometry grounded transformer. In CVPR, Cited by: §B.1, §B.2, §2.1, §2.3, §4.2, §4.3, Table 1, Table 1. 
[62] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud (2024) Dust3r: geometric 3d vision made easy. In CVPR, Cited by: §2.1. [63] T. Wu, C. Zheng, F. Guan, A. Vedaldi, and T. Cham (2025) Amodal3r: amodal 3d reconstruction from occluded 2d images. arXiv preprint arXiv:2503.13439. Cited by: §2.2. [64] X. Wu, L. Jiang, P. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao (2024) Point transformer v3: simpler faster stronger. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 4840–4851. Cited by: §2.3. [65] J. Xia and L. Liu (2025) Training-free instance-aware 3d scene reconstruction and diffusion-based view synthesis from sparse images. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, Cited by: §2.1. [66] J. Xia, L. Sun, and L. Liu (2025) Enhancing close-up novel view synthesis via pseudo-labeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, p. 8567–8574. Cited by: §2.1. [67] J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang (2025) Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 21469–21480. Cited by: §B.1, Table 5, Table 6, §1, §1, §1, §2.1, §2.2, §3.1, §4.1, Table 1, Table 2, Table 3, Table 3, §5. [68] J. Yang, A. Sax, K. J. Liang, M. Henaff, H. Tang, A. Cao, J. Chai, F. Meier, and M. Feiszli (2025) Fast3r: towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 21924–21935. Cited by: §2.1. [69] Y. Yang, Y. Zhou, Y. Guo, Z. Zou, Y. Huang, Y. Liu, H. Xu, D. Liang, Y. Cao, and X. Liu (2025) Omnipart: part-aware 3d generation with semantic decoupling and structural cohesion. arXiv preprint arXiv:2507.06165. Cited by: §2.2. [70] K. Yao, L. Zhang, X. Yan, Y. Zeng, Q. Zhang, L. Xu, W. Yang, J. Gu, and J. Yu (2025) Cast: component-aligned 3d scene reconstruction from an rgb image. ACM Transactions on Graphics (TOG) 44 (4), p. 1–19. Cited by: §1, §2.2. [71] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), p. 767–783. Cited by: §2.1. [72] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan (2019) Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 5525–5534. Cited by: §2.1. [73] X. Yu, T. Wang, S. Y. Kim, P. Guerrero, X. Chen, Q. Liu, Z. Lin, and X. Qi (2025) Objectmover: generative object movement with video prior. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 17682–17691. Cited by: §2.4. [74] Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger (2024) Mip-splatting: alias-free 3d gaussian splatting. In CVPR, p. 19447–19456. Cited by: §2.1. [75] Z. Yu and S. Gao (2020) Fast-mvsnet: sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 1949–1958. Cited by: §2.1. [76] J. Zhang, Y. Yao, and L. Quan (2021) Learning signed distance field for multi-view surface reconstruction. International Conference on Computer Vision (ICCV). Cited by: §2.1. [77] J. Zhang, C. Herrmann, J. Hur, V. Jampani, T. Darrell, F. Cole, D. Sun, and M. Yang (2025) Monst3r: a simple approach for estimating geometry in the presence of motion. 
In ICLR, Cited by: §2.1. [78] L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu (2024) Clay: a controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG) 43 (4), p. 1–20. Cited by: §1, §2.2. [79] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: Appendix A, §4.1. [80] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun (2021) Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, p. 16259–16268. Cited by: §2.3.

Appendix A Experimental Details

Our training dataset consists of object collections from the 3D-FUTURE [15] (9,472 objects), HSSD [27] (6,670 objects), and ABO [12] (4,485 objects) datasets. For each object, we render images from $T = 24$ views, together with the corresponding depth maps, and extract the visible point cloud for each view by enforcing depth consistency with a threshold $\tau = 0.05$ times the depth range (maximum minus minimum depth) in that view. The visible point cloud is then converted into an initial S latent, which is paired with the original S latent as ground truth to train the sparse structure flow transformer for inpainting. For evaluation, we use randomly sampled subsets of the Toys4K [54] (500 objects) and 3D-FRONT [14] (500 scenes) datasets. For each test object or scene, we render 8 views using cameras with yaw angles $(0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°)$ and a fixed pitch angle of $30°$. The camera is positioned at a radius of 1.8 from the object center. For PSNR, SSIM, and LPIPS [79], we directly compare the rendered images of the generated results with the rendered images of the ground-truth objects and report the average scores. For the DINO-based similarity metric, we report the average discrepancy between the rendered images of the generated and ground-truth assets, quantified as $(1 - S_{DINO})$, where $S_{DINO}$ denotes the DINO similarity score. For the normal-based metrics, we render normal maps from the 8 views and compute the average score between the normal maps of the generated and ground-truth assets. For Chamfer Distance (CD) and F-score, we normalize all objects to the range (-0.5, 0.5) and set the F-score distance threshold to 0.05. During testing, for the point cloud priors input, we align the point cloud to the orientation of the corresponding ground-truth object to ensure that the generation conditioned on this point cloud can be directly evaluated.
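For reference, the geometry metrics above can be computed from two sampled point sets as sketched below. The exact Chamfer convention (sum versus mean of the two directed terms, squared versus unsquared distances) is not spelled out in the paper, so this uses one common variant; the 0.05 F-score threshold and the (-0.5, 0.5) normalization follow Appendix A.

```python
# Illustrative Chamfer distance and F-score on normalized point samples.
import numpy as np

def chamfer_and_fscore(pred, gt, thresh=0.05):
    """pred: (N, 3), gt: (M, 3), both normalized to (-0.5, 0.5).
    Brute-force O(N*M) nearest neighbours; fine for a few thousand points."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # pairwise distances
    d_pred = d.min(axis=1)            # nearest GT point for each predicted point
    d_gt = d.min(axis=0)              # nearest predicted point for each GT point
    chamfer = d_pred.mean() + d_gt.mean()
    precision = (d_pred < thresh).mean()
    recall = (d_gt < thresh).mean()
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore

pred = np.random.rand(2000, 3) - 0.5   # stand-in surface samples from a generated asset
gt = np.random.rand(2000, 3) - 0.5     # stand-in surface samples from the ground truth
print(chamfer_and_fscore(pred, gt))
```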
Appendix B More Results

We provide additional qualitative examples and experimental results to further demonstrate the performance of our method.

Figure 8: Generation results with 3 input views on Toys4K. The first column of our results uses sampled point-cloud priors extracted from the visible regions of the three input images, whereas the “VGGT-estimated” results rely on point clouds inferred from the input images by VGGT.

Figure 9: Input point cloud priors examples. We show examples of the observable point cloud priors for the two single-view input modes considered in this paper, along with their corresponding generation results.

Figure 10: More image-to-3D examples. More single-image-to-3D generation visualization results on Toys4K (rows 1-3) and the 3D-FRONT dataset (rows 4-6).

Table 5: Comparison on single-object generation with 3-view input on the Toys4K dataset.

| Method | PSNR↑ | SSIM(%)↑ | LPIPS↓ | DINO(%)↓ | CD↓ | F-Score↑ | PSNR-N↑ | LPIPS-N↓ |
|---|---|---|---|---|---|---|---|---|
| TRELLIS [67] | 23.19 | 92.63 | 0.075 | 5.79 | 0.025 | 0.904 | 26.22 | 0.066 |
| Points-to-3D (Ours-VGGT Esti.) | 23.44 | 93.21 | 0.057 | 5.58 | 0.015 | 0.971 | 28.35 | 0.035 |
| Points-to-3D (Ours-P.C. Priors) | 23.98 | 94.02 | 0.050 | 5.26 | 0.009 | 0.988 | 30.45 | 0.028 |

B.1 Multi-View Input Generation

Because our flow-based model performs iterative denoising, it can directly incorporate multi-view reference images as conditioning inputs at different denoising steps. For VGGT-estimated point clouds, multi-view inputs produce more accurate predictions, and greater point cloud coverage consistently leads to better reconstruction. We further evaluate the case of three input views on the Toys4K [54] dataset. Specifically, we first feed the multi-view reference images into VGGT [61] to obtain a more complete predicted point cloud. As shown in Tab. 5, while multi-view input naturally improves the baseline TRELLIS [67] geometry, our method achieves substantially higher structural accuracy, consistently maintaining controllable geometry. For accurate point cloud priors, we extract the visible sampled surface point cloud from the three views using depth consistency and use it as the input prior. With these priors, our method produces reconstructions that are very close to the ground truth. Fig. 8 further shows the visual comparisons. These results demonstrate the robustness and effectiveness of our method across different numbers of input images.

Figure 11: More real-world image generation examples.

Figure 12: Text-to-3D generation examples.

B.2 Point Cloud Priors Examples

In Fig. 9, we illustrate examples of the two types of point cloud priors considered in this work, which correspond to the two most common practical scenarios: (1) partial point clouds directly captured by hardware sensors (e.g., LiDAR on an iPhone), and (2) point clouds estimated from input images via feed-forward point-map prediction (e.g., VGGT [61]). This experimental setup enables a comprehensive evaluation of our method over a broader spectrum of practical cases. As shown in Fig. 9, these visible-region priors impose reliable geometric constraints that steer our model toward controllable and faithful 3D generation.

B.3 More Image-to-3D Examples

We provide additional visualization results for image-to-3D generation in Fig. 10, demonstrating the effectiveness of our method. These experiments highlight that our method addresses a major limitation of existing 3D generation frameworks, which struggle to fully incorporate available 3D information, and achieves substantial improvements in both single-object and scene-level generation.

B.4 More Real-World and Text-to-3D Examples

We showcase more real-world image generation results in Fig. 11, demonstrating the robustness of our method in practical scenarios. Moreover, we also assess our model under text-to-3D settings on Toys4K [54], where text prompts and point cloud priors are provided as input. As shown in Tab. 6 and Fig. 12, our method successfully generates geometries that are semantically consistent with the input prompts and structurally well controlled by the given point cloud priors.

Table 6: Comparison of text-to-3D generation on Toys4K.

| Method | CLIP↑ | CD↓ | F-Score↑ | PSNR-N↑ | LPIPS-N↓ |
|---|---|---|---|---|---|
| LGM [57] | 0.247 | 0.086 | 0.412 | 19.55 | 0.223 |
| TRELLIS [67] | 0.298 | 0.047 | 0.639 | 21.25 | 0.159 |
| Points-to-3D | 0.299 | 0.022 | 0.892 | 24.75 | 0.094 |
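The CLIP score used in Table 6 is the standard cosine similarity between image and text embeddings. The paper does not say which CLIP variant is used or how scores are aggregated over rendered views, so the sketch below is a per-image illustration using the Hugging Face transformers implementation; the model checkpoint and file names are assumptions.

```python
# Illustrative per-image CLIP score (cosine similarity of CLIP image and text embeddings).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return float((img_feat * txt_feat).sum())

# Example usage (paths and prompt are placeholders):
# print(clip_score(Image.open("render_000.png"), "a plush toy dinosaur"))
```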