Direct pixel-to-3D mapping
Each voxel samples the input image feature map through projection and bilinear interpolation, reducing ambiguous associations between 2D details and 3D geometry.
Pixal3D is an image-to-3D system for creators and researchers who care about fidelity. It lifts multi-scale image features into a 3D feature volume, so the generated mesh stays visually tied to the source pixels instead of drifting into a generic canonical shape.
This page is built from the arXiv paper, the TencentARC model card, the GitHub repository, the project page, and the Pixal3D-Server Space.
Most 3D-native generators build a shape in canonical space, then inject image cues through attention. Pixal3D changes the contract: it generates in a pixel-aligned space and conditions sparse diffusion with explicit back-projected features (sketched in code after this list).
A pixel-aligned sparse SDF VAE compresses high-resolution geometry into efficient latents without abandoning the source-view coordinate frame.
The pipeline targets complete assets, not only previews: meshes can be paired with PBR texture maps and exported for Blender, Unity, Unreal, Godot, and WebGL.
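The back-projection step can be made concrete. Below is a minimal PyTorch sketch of the idea, assuming a pinhole camera with known intrinsics `K` and voxel centers already expressed in the camera frame; the names and shapes are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def back_project_features(feature_map, voxel_centers, K):
    """Gather per-voxel conditioning features from a 2D feature map.

    feature_map:   (1, C, H, W) image features (e.g. DINOv2-style)
    voxel_centers: (N, 3) voxel centers in the camera coordinate frame
    K:             (3, 3) pinhole intrinsics
    Returns:       (N, C) per-voxel features
    """
    # Project onto the image plane: [u*z, v*z, z] = K @ p, then divide by z.
    uvz = voxel_centers @ K.T
    uv = uvz[:, :2] / uvz[:, 2:3].clamp(min=1e-6)

    # Normalize pixel coordinates to [-1, 1] for grid_sample
    # (align_corners=True maps -1 to pixel 0 and +1 to pixel W-1).
    _, _, H, W = feature_map.shape
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)

    # Bilinear interpolation; voxels projecting outside the image get zeros.
    feats = F.grid_sample(feature_map, grid, mode="bilinear",
                          padding_mode="zeros", align_corners=True)
    return feats.view(feature_map.shape[1], -1).T  # (1, C, 1, N) -> (N, C)
```

Each voxel thus reads image evidence at its own projected location, which is the explicit pixel-to-voxel bridge the list above describes.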
The paper frames Pixal3D as a generation system inspired by reconstruction: keep the input view meaningful, lift features into 3D, generate coarse structure, then decode detailed sparse geometry.
DINOv2-style multi-scale visual features capture object identity, material cues, edges, and fine details from a single reference image.
Projected 3D samples gather image features into a 3D conditioning volume, creating an explicit bridge between image pixels and spatial voxels.
A dense stage predicts occupancy and a sparse latent stage refines SDF detail, giving the model both global structure and local surface precision (see the selection sketch after these steps).
The generated latent is decoded into a mesh. The result is suitable for GLB/glTF workflows and downstream cleanup or retopology.
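The dense-to-sparse handoff can be sketched as a simple selection rule. This is an illustration of the coarse-to-fine idea with hypothetical names, thresholds, and budgets, not the paper's exact procedure.

```python
import torch

def select_sparse_voxels(occ_logits, threshold=0.0, max_voxels=65536):
    """Keep voxels the dense stage marks occupied for sparse SDF refinement.

    occ_logits: (D, D, D) occupancy logits from the dense stage
    Returns:    (M, 3) integer voxel coordinates, M <= max_voxels
    """
    coords = (occ_logits > threshold).nonzero()  # (M, 3) occupied indices
    if coords.shape[0] > max_voxels:
        # Over budget: keep the most confident voxels.
        scores = occ_logits[coords[:, 0], coords[:, 1], coords[:, 2]]
        coords = coords[scores.topk(max_voxels).indices]
    return coords
```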
Pixal3D argues that fidelity is bottlenecked by unclear 2D-to-3D correspondence. Its pixel back-projection conditioner replaces loose cross-attention with a structured feature volume. The same idea extends naturally to multi-view generation by aggregating feature volumes across views.
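A minimal sketch of that multi-view extension, assuming each view's feature volume was built with something like the `back_project_features` sketch above and comes with a validity mask (the voxel projects inside that view's image). The masked mean here is one plausible aggregator, not necessarily the paper's exact fusion rule.

```python
import torch

def aggregate_views(per_view_feats, per_view_valid):
    """Fuse per-view conditioning volumes with a visibility-masked mean.

    per_view_feats: list of (N, C) feature volumes, one per input view
    per_view_valid: list of (N,) bool masks (voxel projects inside that view)
    Returns:        fused (N, C) conditioning volume
    """
    feats = torch.stack(per_view_feats)                        # (V, N, C)
    valid = torch.stack(per_view_valid).unsqueeze(-1).float()  # (V, N, 1)
    summed = (feats * valid).sum(dim=0)
    counts = valid.sum(dim=0).clamp(min=1.0)  # avoid divide-by-zero
    return summed / counts
```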
@article{li2026pixal3d,
  title   = {Pixal3D: Pixel-Aligned 3D Generation from Images},
  author  = {Li, Dong-Yang and Zhao, Wang and Chen, Yuxin and Hu, Wenbo and Guo, Meng-Hao and Zhang, Fang-Lue and Shan, Ying and Hu, Shi-Min},
  journal = {arXiv preprint arXiv:2605.10922},
  year    = {2026}
}
The paper reports single-view normal-map metrics on Toys4K and a 30-participant user study on a harder in-the-wild test set.
| Method | IoU (%) ↑ | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | Fidelity (user study) ↑ |
|---|---|---|---|---|---|
| Pixal3D | 93.57 | 24.21 | 0.897 | 0.108 | 4.91 |
| Hunyuan3D-2.1 | 83.33 | 21.96 | 0.889 | 0.179 | 2.77 |
| TRELLIS | 79.48 | 20.98 | 0.883 | 0.204 | 1.86 |
| Direct3D-S2 | 74.23 | 19.49 | 0.851 | 0.268 | 3.21 |
| TripoSG | 73.54 | 19.73 | 0.873 | 0.250 | 2.25 |
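For orientation, the image-space metrics above follow standard definitions. Here is a small NumPy sketch of PSNR and silhouette IoU as they are commonly computed on rendered normal maps with foreground masks; this is not the paper's evaluation code.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """PSNR between two images (e.g. normal maps) scaled to [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def silhouette_iou(pred_mask, gt_mask):
    """IoU between boolean foreground masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union
```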
Pixal3D is strongest when the input image is clear enough for pixel-level evidence to matter. Treat it as a fast first pass for high-fidelity asset production, then polish for the target runtime.
Use a sharp image with a single main subject, readable silhouette, diffuse lighting, and minimal occlusion. Three-quarter views usually provide more usable geometry than flat front views.
Upload your image in the embedded Space. If the queue is busy, use the official Hugging Face Space page to pick an available instance.
Clone the repository, follow the Trellis.2 environment setup, install requirements, install utils3d, then run inference.py or app.py.
Import GLB/glTF into Blender for scale cleanup, collision proxies, LODs, UV checks, and engine-specific material settings.
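If you script that cleanup, a small snippet run in Blender's scripting workspace might look like this; the file path is an example and the checks are a starting point, not an official workflow.

```python
import bpy

# Import the generated asset (path is an example).
bpy.ops.import_scene.gltf(filepath="output.glb")

# The importer selects what it added; inspect the first imported object.
obj = bpy.context.selected_objects[0]
bpy.context.view_layer.objects.active = obj
print("dimensions:", tuple(obj.dimensions))
print("materials:", [m.name for m in obj.data.materials if m])

# Apply scale so exporters and physics see the true object size.
bpy.ops.object.transform_apply(location=False, rotation=False, scale=True)
```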
Turn concept art and product references into editable meshes for indie games, prototypes, props, and environment pieces.
Create inspectable 3D previews from product photography when a full photogrammetry capture is not available.
Study pixel-aligned conditioning, sparse voxel diffusion, multi-view aggregation, and scene-level object separation.
Generate a high-fidelity base, then voxelize or simplify it in Blender, MagicaVoxel, Blockbench, or a custom pipeline.
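As a starting point for the voxelize-or-simplify route, a short trimesh sketch; the pitch value is arbitrary and the paths are examples.

```python
import trimesh

# GLB files load as scenes by default; force='mesh' flattens to one mesh.
mesh = trimesh.load("output.glb", force="mesh")

# Voxelize at a chosen pitch: smaller pitch means finer blocks.
voxels = mesh.voxelized(pitch=0.05)
print("occupied voxels:", voxels.filled_count)

# Export a blocky preview built from one box per occupied voxel.
voxels.as_boxes().export("output_voxelized.glb")
```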
The browser demo is enough for quick testing. Local use is better for batch processing, reproducibility, and pipeline integration.
```bash
# Clone the repository and enter it
git clone https://github.com/TencentARC/Pixal3D.git
cd Pixal3D

# Install Python dependencies and the prebuilt utils3d wheel
pip install -r requirements.txt
pip install https://github.com/LDYang694/Storages/releases/download/20260430/utils3d-0.0.2-py3-none-any.whl

# Run single-image inference, or launch the local demo app
python inference.py --image assets/test_image/0.png --output output.glb
python app.py
```
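For batch processing, a small driver script can loop the same CLI over a folder. Only the `--image` and `--output` flags shown above are assumed, and the paths are examples.

```python
import subprocess
from pathlib import Path

# Loop the inference CLI over a folder of reference images.
images = sorted(Path("assets/test_image").glob("*.png"))
out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)

for image in images:
    out = out_dir / f"{image.stem}.glb"
    subprocess.run(
        ["python", "inference.py", "--image", str(image), "--output", str(out)],
        check=True,
    )
    print("wrote", out)
```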
Try Pixal3D in the browser demo. Learn how pixel back-projection turns one image into high-fidelity 3D assets with geometry, PBR textures, GLB export, and multi-view extensions.
Is Pixal3D the same as photogrammetry?
No. Photogrammetry reconstructs a scene from many calibrated images. Pixal3D is a generative image-to-3D method that can work from one image, while borrowing reconstruction-style pixel alignment.
Why does back-projection help?
Back-projection gives the 3D generator an explicit route from a voxel to the image evidence that should condition it. That reduces the ambiguity created when image features are only passed through attention.
Does Pixal3D support multiple input views?
Yes. The paper describes aggregating independently back-projected feature volumes across views, and reports stronger multi-view geometry as more views are added.
Who controls the demo's availability?
The Space itself is hosted by Hugging Face. Availability, queue length, and authentication prompts are controlled by Hugging Face and the Space owner.
What should I do with an exported mesh?
Open it in Blender or your engine, verify scale and orientation, inspect materials, create LODs, add collisions, and optimize for your target platform.
Is pixal3d.xyz the official site?
No. pixal3d.xyz is a community information page that links to the official paper, model, demo, and repository.