PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Figure 1: Visualization of images generated by our PixelGen. All images are at a 512x512 resolution.

Abstract

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses that guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines: it achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm.

Motivation

It is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We introduce PixelGen, a simple pixel diffusion framework with perceptual loss. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses that guide the diffusion model toward learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics.


Figure 2: This work shows that pixel diffusion with perceptual loss outperforms latent diffusion. (a) A traditional two-stage latent diffusion model denoises in the latent space and is therefore affected by the artifacts of the VAE. (b) PixelGen introduces perceptual losses to encourage the diffusion model to focus on the perceptual manifold, enabling pixel diffusion to learn a meaningful manifold rather than the complex full image manifold. (c) PixelGen outperforms latent diffusion models using only 80 training epochs on ImageNet without CFG.

Illustration of Perceptual Manifold


Figure 3: Illustration of different manifolds within the pixel space. The image manifold is a large manifold containing both perceptually significant information and imperceptible signals. The perceptual manifold contains the perceptually important signals, providing a better target for pixel-space diffusion. P-DINO and LPIPS are the two complementary perceptual supervision signals used in PixelGen.

Implementation

We propose PixelGen, a simple yet effective pixel diffusion framework with perceptual supervision. PixelGen directly operates in the pixel domain without relying on latent representations, VAEs, or auxiliary stages. Following the x-prediction paradigm, the diffusion model predicts clean images instead of noise or velocity. To retain the benefits of flow matching, the predicted image is converted into velocity, resulting in a flow-matching objective. PixelGen focuses on the perceptual manifold rather than the full image manifold. To this end, we introduce two complementary perceptual losses. An LPIPS loss emphasizes local textures and fine-grained details, while a Perceptual DINO (P-DINO) loss aligns global semantics using patch-level features from a frozen DINOv2 encoder.
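
To make the objective concrete, the following is a minimal PyTorch sketch of this kind of training loss, not the authors' implementation. It assumes a rectified-flow interpolation x_t = (1 - t) x + t eps (clean image at t = 0, pure noise at t = 1), an illustrative model signature model(x_t, t, cond), the off-the-shelf lpips package, and DINOv2 patch features loaded via torch.hub; the loss weights and the cosine feature-matching metric are placeholder assumptions.

# Minimal sketch of a PixelGen-style objective (assumptions noted above; not the authors' code).
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

# Frozen perceptual encoders (LPIPS for local detail, DINOv2 for global semantics).
lpips_net = lpips.LPIPS(net="vgg").eval().requires_grad_(False)
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval().requires_grad_(False)
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)


def dino_patch_features(img):
    # img in [-1, 1]; resize to a multiple of the ViT patch size and ImageNet-normalize.
    img = F.interpolate((img + 1) / 2, size=224, mode="bilinear", align_corners=False)
    img = (img - IMAGENET_MEAN.to(img)) / IMAGENET_STD.to(img)
    return dino.forward_features(img)["x_norm_patchtokens"]  # (B, N_patches, C)


def pixelgen_loss(model, x, cond, w_lpips=1.0, w_pdino=1.0):
    # x: clean images in [-1, 1], shape (B, 3, H, W); cond: conditioning (e.g. class labels).
    b = x.shape[0]
    eps = torch.randn_like(x)
    t = torch.rand(b, device=x.device).view(b, 1, 1, 1)

    # Noisy sample and ground-truth velocity under the assumed interpolation.
    x_t = (1.0 - t) * x + t * eps
    v_target = eps - x

    # x-prediction: the network outputs an estimate of the clean image
    # (the model signature here is illustrative).
    x_pred = model(x_t, t.flatten(), cond)

    # Convert the predicted image into a predicted velocity so the usual
    # flow-matching regression target is retained: v_hat = (x_t - x_hat) / t.
    v_pred = (x_t - x_pred) / t.clamp(min=1e-3)
    loss_fm = F.mse_loss(v_pred, v_target)

    # LPIPS loss: emphasizes local textures and fine-grained details.
    loss_lpips = lpips_net(x_pred.clamp(-1, 1), x).mean()

    # P-DINO loss: align patch-level features of the frozen DINOv2 encoder.
    with torch.no_grad():
        feat_real = dino_patch_features(x)
    feat_pred = dino_patch_features(x_pred)
    loss_pdino = (1 - F.cosine_similarity(feat_pred, feat_real, dim=-1)).mean()

    return loss_fm + w_lpips * loss_lpips + w_pdino * loss_pdino

Converting the x-prediction into a velocity in this way keeps the standard flow-matching regression target, while the perceptual losses act directly on a clean-image estimate rather than on noise or velocity.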


Figure 4: Overview of PixelGen. The diffusion model directly predicts the image x instead of velocity or noise to simplify the prediction target. A flow-matching diffusion loss is retained to keep the advantages of flow matching via velocity conversion. Two complementary perceptual losses are introduced to encourage the diffusion model to focus on the perceptual manifold.

Empirical Analysis

Perceptual supervision improves pixel diffusion by enhancing local details and global semantics. Starting from the JiT baseline, we progressively introduce the LPIPS loss and the P-DINO loss. As shown in Figure 5, the baseline model produces blurry images with weak structural consistency. Adding the LPIPS loss makes local textures sharper and better preserves fine details, indicating that LPIPS effectively emphasizes perceptually important local patterns. When the P-DINO loss is further introduced, the generated images exhibit improved global structure and better semantics. These qualitative improvements are supported by quantitative results: the baseline achieves an FID of 23.67 on ImageNet without classifier-free guidance, adding the LPIPS loss lowers it to 10.00, and adding the P-DINO loss further reduces it to 7.46. This confirms that LPIPS and P-DINO provide complementary supervision: LPIPS focuses on local perceptual fidelity, while P-DINO enhances global semantics. Together, they guide the diffusion model toward a perceptually meaningful manifold.
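
In terms of the pixelgen_loss sketch from the Implementation section, this progression amounts to toggling the two perceptual loss weights; the weight values below are illustrative placeholders, not the paper's settings.

# Hypothetical ablation toggles for the pixelgen_loss sketch above; weights are illustrative.
ablation_settings = {
    "baseline (flow matching only)": dict(w_lpips=0.0, w_pdino=0.0),  # FID 23.67
    "+ LPIPS": dict(w_lpips=1.0, w_pdino=0.0),                        # FID 10.00
    "+ LPIPS + P-DINO": dict(w_lpips=1.0, w_pdino=1.0),               # FID 7.46
}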


Figure 5: Effectiveness of perceptual supervision in PixelGen. LPIPS and P-DINO losses are progressively added to a baseline pixel diffusion model. The LPIPS loss improves local texture fidelity, while P-DINO further enhances global semantics.

Evaluations

Quantitative Results

Qualitative Results

Figure 6: More qualitative results of text-to-image generation at a 512x512 resolution. With the Qwen3 text encoder, PixelGen supports prompts in multiple languages, such as Chinese and English.

Figure 7: Qualitative results of class-to-image generation at a 256x256 resolution.

BibTeX

@misc{ma2025pixelgen,
      title={PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss}, 
      author={Zehong Ma and Ruihan Xu and Shiliang Zhang},
      year={2025},
      eprint={2602.02493},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.02493}, 
    }