SIGGRAPH 2026 · Tencent ARC Lab · Tsinghua University

Pixal3D: Pixel-Aligned 3D Generation from Images

Pixal3D is an image-to-3D system for creators and researchers who care about fidelity. It lifts multi-scale image features into a 3D feature volume, so the generated mesh stays visually tied to the source pixels instead of drifting into a generic canonical shape.

  • 93.57 IoU on Toys4K single-view normal evaluation
  • 4.91/5 user-study fidelity score on in-the-wild images
  • 1024 training resolution in the paper schedule
  • GLB game-engine-friendly mesh export
The page opens with the embedded Pixal3D-Server Gradio app; queue state depends on Hugging Face availability. If the embed is busy, open the official Hugging Face Space directly.
Official sources

Research, model, code, and demo links in one place

This page is built from the arXiv paper, the TencentARC model card, the GitHub repository, the project page, and the Pixal3D-Server Space.

Why it matters

Image-to-3D fidelity starts with correspondence

Most 3D-native generators build a shape in canonical space, then inject image cues through attention. Pixal3D changes the contract: it generates in a pixel-aligned space and conditions sparse diffusion with explicit back-projected features.

Direct pixel-to-3D mapping

Each voxel samples the input image feature map through projection and bilinear interpolation, reducing ambiguous associations between 2D details and 3D geometry.
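The sampling step can be sketched in a few lines of plain Python. The tiny feature map and function below are illustrative only; the real model samples learned multi-scale feature maps through a camera projection.

```python
def bilinear_sample(feat, u, v):
    """Sample a 2D feature map (rows of feature vectors) at a
    continuous pixel location (u, v) with bilinear interpolation.
    Stand-in for the per-voxel sampling step described above."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid interpolation range.
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0

    def lerp(a, b, t):
        return [ai + (bi - ai) * t for ai, bi in zip(a, b)]

    top = lerp(feat[y0][x0], feat[y0][x1], fx)
    bot = lerp(feat[y1][x0], feat[y1][x1], fx)
    return lerp(top, bot, fy)
```

Sampling at a fractional location blends the four surrounding feature vectors, so every voxel receives a smoothly varying slice of image evidence rather than a single nearest pixel.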

Structured sparse latents

A pixel-aligned sparse SDF VAE compresses high-resolution geometry into efficient latents without abandoning the source-view coordinate frame.

Texture-ready output

The pipeline targets complete assets, not only previews: meshes can be paired with PBR texture maps and exported for Blender, Unity, Unreal, Godot, and WebGL.

Method

A four-part pipeline from image pixels to usable assets

The paper frames Pixal3D as a generation system inspired by reconstruction: keep the input view meaningful, lift features into 3D, generate coarse structure, then decode detailed sparse geometry.

Encode the image

DINOv2-style multi-scale visual features capture object identity, material cues, edges, and fine details from a single reference image.

Back-project features

Projected 3D samples gather image features into a 3D conditioning volume, creating an explicit bridge between image pixels and spatial voxels.
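A hedged sketch of the gather step, assuming a simple pinhole camera. The intrinsics and the `feat_lookup` callback are placeholders; the paper's actual projection convention may differ.

```python
def project_to_pixel(point, focal, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates
    with a hypothetical pinhole model."""
    x, y, z = point
    assert z > 0, "point must lie in front of the camera"
    return (focal * x / z + cx, focal * y / z + cy)


def build_feature_volume(voxel_centers, feat_lookup, focal, cx, cy):
    """Gather an image feature for each voxel center: project it,
    then look up the image feature at that pixel. feat_lookup stands
    in for bilinear sampling of the encoder's feature map."""
    return [feat_lookup(*project_to_pixel(p, focal, cx, cy))
            for p in voxel_centers]
```

The key property is that each voxel's conditioning feature comes from a deterministic pixel location, not from a learned attention weighting.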

Generate coarse-to-fine

A dense stage predicts occupancy and a sparse latent stage refines SDF detail, giving the model both global structure and local surface precision.
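As a toy illustration of the two-stage split (thresholding a dense occupancy dict is an assumption for clarity, not the paper's exact mechanism):

```python
def coarse_to_fine(occupancy, threshold=0.5):
    """Toy version of the two-stage design: a dense stage predicts
    per-voxel occupancy; only voxels at or above the threshold
    survive as the sparse set the second stage refines with SDF
    detail. occupancy: dict mapping (i, j, k) -> probability."""
    sparse = {idx for idx, p in occupancy.items() if p >= threshold}
    # The sparse latent stage runs only on these coordinates, which
    # is what keeps high-resolution surface refinement affordable.
    return sparse
```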

Decode and export

The generated latent is decoded into a mesh. The result is suitable for GLB/glTF workflows and downstream cleanup or retopology.

Paper notes

What the SIGGRAPH 2026 paper adds

Pixal3D argues that fidelity is bottlenecked by unclear 2D-to-3D correspondence. Its pixel back-projection conditioner replaces loose cross-attention with a structured feature volume. The same idea extends naturally to multi-view generation by aggregating feature volumes across views.

Release status

  • May 2026 — Improved main branch based on the Trellis.2 backbone.
  • May 2026 — Inference code and online Gradio demo released.
  • April 2026 — Paper accepted to SIGGRAPH 2026.

Branches

  • main — Latest implementation with improved Trellis.2-based performance.
  • paper — Original Direct3D-S2-based version for reproducing paper results.
Citation

@article{li2026pixal3d,
  title   = {Pixal3D: Pixel-Aligned 3D Generation from Images},
  author  = {Li, Dong-Yang and Zhao, Wang and Chen, Yuxin and Hu, Wenbo and Guo, Meng-Hao and Zhang, Fang-Lue and Shan, Ying and Hu, Shi-Min},
  journal = {arXiv preprint arXiv:2605.10922},
  year    = {2026}
}
Benchmarks

Reported fidelity gains on Toys4K and in-the-wild images

The paper reports single-view normal-map metrics on Toys4K and a 30-participant user study on a harder in-the-wild test set.

  • 93.57 IoU
  • 24.21 PSNR
  • 0.108 LPIPS
  • 4.91 user fidelity
Method        | IoU ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Fidelity ↑
Pixal3D       | 93.57 | 24.21  | 0.897  | 0.108   | 4.91
Hunyuan3D-2.1 | 83.33 | 21.96  | 0.889  | 0.179   | 2.77
TRELLIS       | 79.48 | 20.98  | 0.883  | 0.204   | 1.86
Direct3D-S2   | 74.23 | 19.49  | 0.851  | 0.268   | 3.21
TripoSG       | 73.54 | 19.73  | 0.873  | 0.250   | 2.25
Workflow

How to get the best result from one image

Pixal3D is strongest when the input image is clear enough for pixel-level evidence to matter. Treat it as a fast first pass for high-fidelity asset production, then polish for the target runtime.

Choose the reference

Use a sharp image with a single main subject, readable silhouette, diffuse lighting, and minimal occlusion. Three-quarter views usually provide more usable geometry than flat front views.

Run the browser demo

Upload in the embedded Space. If the queue is busy, use the official Hugging Face Space page to pick an available instance.

Use locally when needed

Clone the repository, follow the Trellis.2 environment setup, install requirements, install utils3d, then run inference.py or app.py.

Prepare for production

Import GLB/glTF into Blender for scale cleanup, collision proxies, LODs, UV checks, and engine-specific material settings.

Use cases

Where Pixal3D fits

The scenarios below share one constraint: a single reference image is the main input, and the resulting mesh has to slot into an existing pipeline.

Game assets

Turn concept art and product references into editable meshes for indie games, prototypes, props, and environment pieces.

E-commerce and AR

Create inspectable 3D previews from product photography when a full photogrammetry capture is not available.

Research baselines

Study pixel-aligned conditioning, sparse voxel diffusion, multi-view aggregation, and scene-level object separation.

Voxel and stylized art

Generate a high-fidelity base, then voxelize or simplify it in Blender, MagicaVoxel, Blockbench, or a custom pipeline.
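For the voxelization step, a minimal stand-in in plain Python. Cell size is up to the target tool, and real voxelizers like MagicaVoxel also fill interiors and assign palette colors.

```python
import math


def voxelize_points(points, pitch):
    """Snap a point cloud (e.g. sampled from a generated mesh
    surface) onto a voxel grid with cell size `pitch`, returning
    the set of occupied cell indices."""
    return {tuple(math.floor(c / pitch) for c in p) for p in points}
```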

Local install

Run Pixal3D outside the browser

The browser demo is enough for quick testing. Local use is better for batch processing, reproducibility, and pipeline integration.

git clone https://github.com/TencentARC/Pixal3D.git
cd Pixal3D
pip install -r requirements.txt
pip install https://github.com/LDYang694/Storages/releases/download/20260430/utils3d-0.0.2-py3-none-any.whl
python inference.py --image assets/test_image/0.png --output output.glb
python app.py
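For batch processing, a hypothetical dry-run loop over a folder of references, reusing the flags shown above. The folder paths are assumptions; remove `echo` to actually run the commands.

```shell
# Print one inference command per PNG in the assumed input folder.
for img in assets/test_image/*.png; do
  name=$(basename "${img%.png}")
  echo python inference.py --image "$img" --output "outputs/${name}.glb"
done
```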

Practical limits

  • Input images with hidden backsides still require generative completion; inspect the unseen side before shipping.
  • Generated meshes may need retopology, decimation, collision setup, and material cleanup for real-time games.
  • Check the Pixal3D license and the input-image rights before commercial use.
FAQ

Questions people ask before trying Pixal3D


Is Pixal3D the same as photogrammetry?

No. Photogrammetry reconstructs a scene from many calibrated images. Pixal3D is a generative image-to-3D method that can work from one image, while borrowing reconstruction-style pixel alignment.

Why does pixel back-projection improve fidelity?

Back-projection gives the 3D generator an explicit route from a voxel to the image evidence that should condition it. That reduces the ambiguity created when image features are only passed through attention.

Can Pixal3D use more than one input view?

Yes. The paper describes aggregating independently back-projected feature volumes across views, and reports stronger multi-view geometry as more views are added.
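Mean-pooling per-voxel features is one simple way to picture that aggregation. The pooling rule here is an assumption for illustration; the paper only states that per-view volumes are aggregated.

```python
def aggregate_views(volumes):
    """Average independently back-projected per-voxel features
    across views. volumes: list of dicts mapping a voxel index to
    a scalar feature; voxels missing from a view are skipped."""
    merged = {}
    for vol in volumes:
        for idx, f in vol.items():
            merged.setdefault(idx, []).append(f)
    return {idx: sum(fs) / len(fs) for idx, fs in merged.items()}
```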

Who controls the demo's availability?

The Space itself is hosted by Hugging Face. Availability, queue length, and authentication prompts are controlled by Hugging Face and the Space owner.

What should I do with the exported mesh?

Open it in Blender or your engine, verify scale and orientation, inspect materials, create LODs, add collisions, and optimize for your target platform.

Is pixal3d.xyz the official Pixal3D site?

No. pixal3d.xyz is a community information page that links to the official paper, model, demo, and repository.