Diffusion-LM vs Current Implementation Analysis
Status: Complete
Type: Research
Objective
Comprehensive comparison of our current text diffusion implementation with the approach in the Diffusion-LM paper, to identify architectural differences and potential improvements that could enhance text generation quality.
Background
Following the poor text generation quality observed in our Shakespeare baseline experiments, this research investigates how our approach differs from established methods in the literature, focusing on the Diffusion-LM paper by Li et al. ("Diffusion-LM Improves Controllable Text Generation").
Key Findings
Architectural Differences Identified
1. Token Decoding Strategy
- Current Implementation: Simple cosine similarity + argmax for embedding-to-token conversion
- Diffusion-LM: Learned softmax rounding function trained end-to-end
- Implication: Our decoding bottleneck may be addressable through a learned mapping; see the sketch after this list
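For concreteness, a minimal PyTorch sketch of the two decoding strategies, assuming a `(vocab_size, dim)` embedding matrix and a denoised embedding tensor `x0_hat`; `RoundingHead` is an illustrative name, not part of our current code or the paper's released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def nearest_neighbor_decode(x0_hat, embedding_matrix):
    """Current-style decoding: cosine similarity against the embedding matrix, then argmax.

    x0_hat:           (batch, seq_len, dim) denoised embeddings
    embedding_matrix: (vocab_size, dim) frozen token embeddings
    """
    sims = F.normalize(x0_hat, dim=-1) @ F.normalize(embedding_matrix, dim=-1).T
    return sims.argmax(dim=-1)  # (batch, seq_len) token ids


class RoundingHead(nn.Module):
    """Diffusion-LM-style learned rounding: a trainable softmax over the vocabulary."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, x0_hat):
        return self.proj(x0_hat)  # (batch, seq_len, vocab_size) logits

    def decode(self, x0_hat):
        return self.forward(x0_hat).argmax(dim=-1)
```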
2. Embedding Space / Training Targets
- Current Implementation: Pre-trained embeddings (Gemma-2b-it) as diffusion target
- Diffusion-LM: Custom embedding space learned jointly with diffusion process
- Implication: Trade-off between leveraging pre-trained knowledge and task-specific optimization; the sketch below contrasts the two setups
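A hedged sketch of the two setups, with placeholder shapes standing in for the actual Gemma-2b-it matrix:

```python
import torch
import torch.nn as nn

vocab_size, dim = 32_000, 512                # small illustrative shapes, not Gemma's actual ones
gemma_matrix = torch.randn(vocab_size, dim)  # placeholder for the pre-trained embedding matrix

# Variant A (current implementation): frozen pre-trained embeddings define the diffusion
# target space; the denoiser regresses toward these fixed vectors.
frozen_emb = nn.Embedding.from_pretrained(gemma_matrix, freeze=True)

# Variant B (Diffusion-LM): a compact embedding table trained jointly with the denoiser,
# so the target space itself adapts to the diffusion objective.
learned_emb = nn.Embedding(vocab_size, 128)  # the paper reports small dims (e.g. 16-128)
```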
Critical Insights
Decoding as Primary Bottleneck
Our hypothesis that embedding-to-token decoding is the main quality bottleneck aligns with Diffusion-LM's emphasis on a learned rounding function. The paper's approach suggests that:
- Simple nearest-neighbor decoding loses semantic information
- Learned mappings can preserve the benefits of the diffusion process through to the final tokens
- Training the decoding step end-to-end improves coherence
A sketch of such a joint training objective follows this list.
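As a rough illustration of what "end-to-end training of decoding" could look like, the sketch below combines the embedding regression loss with a rounding cross-entropy term. This is a simplification, not the paper's exact objective (which includes additional terms), and the `rounding_weight` is an assumption:

```python
import torch.nn.functional as F


def e2e_diffusion_loss(x0_hat, x0, logits, token_ids, rounding_weight=1.0):
    """Joint objective sketch: embedding regression plus a rounding cross-entropy term.

    x0_hat:    (B, T, D) denoiser prediction of the clean embeddings
    x0:        (B, T, D) clean token embeddings (the diffusion target)
    logits:    (B, T, V) output of a learned rounding head applied to x0_hat
    token_ids: (B, T)    ground-truth token ids
    """
    diffusion_mse = F.mse_loss(x0_hat, x0)
    rounding_ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), token_ids.reshape(-1)
    )
    return diffusion_mse + rounding_weight * rounding_ce
```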
Embedding Space Considerations
- Advantage of Pre-trained Embeddings: Rich semantic representations, faster convergence
- Advantage of Custom Space: Optimized for diffusion process, potentially better quality
- Research Question: Can we get the best of both worlds by fine-tuning the pre-trained embeddings? A possible starting point is sketched below.
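One possible "best of both worlds" direction, sketched under the assumption that we keep the Gemma matrix as initialization and let it (plus an optional down-projection) train alongside the diffusion model; `init_finetunable_embeddings` is a hypothetical helper, not existing code:

```python
import torch
import torch.nn as nn


def init_finetunable_embeddings(gemma_matrix, target_dim=None):
    """Start from the pre-trained space, then let it adapt during diffusion training.

    gemma_matrix: (vocab_size, dim) pre-trained embedding matrix (e.g. from Gemma-2b-it)
    target_dim:   optional smaller dimension for a diffusion-friendly latent space
    """
    emb = nn.Embedding.from_pretrained(gemma_matrix.clone(), freeze=False)  # trainable copy
    if target_dim is not None:
        # A learned down-projection keeps pre-trained semantics while reducing
        # dimensionality for the diffusion process.
        return nn.Sequential(emb, nn.Linear(gemma_matrix.size(1), target_dim))
    return emb
```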
Potential Improvements for Our Implementation
High-Priority Enhancements
- Learned Rounding Function: Replace cosine-similarity decoding with a trainable softmax mapping
- Custom Embedding Space: Learn, or fine-tune, the embedding space jointly with the diffusion process
Other Enhancements
- Fluency Regularization: Add explicit regularization terms for linguistic coherence
- Gradient-based Control: Implement controllable generation during the diffusion process (see the sketch after this list)
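A simplified sketch of what gradient-based control could look like during sampling, assuming an attribute classifier that operates on the continuous latents; `denoiser` and `classifier_loss_fn` are placeholders for our model and a hypothetical classifier, and the paper additionally combines the control gradient with a fluency regularization term:

```python
import torch


def guided_denoise_step(x_t, t, denoiser, classifier_loss_fn, step_size=0.1, n_grad_steps=3):
    """Nudge the continuous latents toward an attribute before the usual reverse step.

    x_t:                (B, T, D) noisy latents at step t
    denoiser:           callable performing our usual reverse-diffusion update
    classifier_loss_fn: callable returning e.g. -log p(attribute | x_t) as a scalar
    """
    x_t = x_t.detach().requires_grad_(True)
    for _ in range(n_grad_steps):
        loss = classifier_loss_fn(x_t, t)
        grad = torch.autograd.grad(loss, x_t)[0]
        x_t = (x_t - step_size * grad).detach().requires_grad_(True)
    with torch.no_grad():
        return denoiser(x_t, t)  # continue the standard reverse-diffusion update
```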
Implementation Complexity Analysis
- Learned Rounding: Medium complexity, high potential impact
- Gradient Control: High complexity, medium potential impact
- Custom Embedding Space: High complexity, uncertain impact given our pre-trained approach
Research Questions Raised
- Pre-trained vs Custom Embeddings: Should we abandon the Gemma embeddings in favor of a task-specific embedding space?
Next Steps
- Begin Phase 1 experiments with learned rounding function implementation
- Establish better evaluation metrics for text diffusion quality
- Create systematic comparison framework for different approaches
Related GitHub Issue: #12 - Comparison with Diffusion-LM paper approach