Shakespeare Text Diffusion Baseline
Status: Complete
Type: Baseline
Objective
Implement an initial text diffusion model in embedding space using the Shakespeare corpus. Explore whether standard diffusion approaches can work for text generation through continuous embeddings.
Configuration
- Model: TinyTransformer for embedding space diffusion
- Training: Pure diffusion in embedding space (see the sketch after this list)
- Architecture: Transformer encoder adapted for diffusion
- Dataset: Shakespeare corpus, tokenized and embedded
- Decoding: Cosine similarity between generated embeddings and token embeddings
- Hardware: T4
- Git Commit: 4422ce927fbf61e226157e4a3f2ac8de91b583bb
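A minimal sketch of the setup, assuming a PyTorch implementation. The TinyTransformer internals, dimensions, and the q_sample helper below are illustrative, not the exact code at the commit above:

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Illustrative stand-in: a small transformer encoder that takes noisy
    token embeddings plus a timestep and predicts the clean embeddings."""
    def __init__(self, dim=128, n_heads=4, n_layers=4, max_t=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.t_embed = nn.Embedding(max_t, dim)

    def forward(self, x_noisy, t):
        # A learned timestep embedding tells the model the current noise level.
        return self.encoder(x_noisy + self.t_embed(t)[:, None, :])

def q_sample(x0, t, alphas_cumprod):
    """Forward (noising) process: interpolate clean embeddings toward Gaussian
    noise according to the cumulative schedule at timestep t."""
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise
```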
Hypothesis
Text diffusion in embedding space should be possible, though the continuous-to-discrete mapping (embeddings to tokens) may present challenges for generation quality.
Results
Quantitative
- Training converges and loss decreases as expected
- The model learns to denoise embeddings progressively
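The objective behind that loss curve is plain denoising regression; a sketch under the same assumptions as above (an x0-prediction parameterization, which may differ from what the commit actually uses):

```python
def train_step(model, x0, alphas_cumprod, optimizer):
    # Sample a random timestep per sequence, noise the embeddings, and
    # regress the model's output back onto the clean embeddings with MSE.
    t = torch.randint(0, len(alphas_cumprod), (x0.size(0),), device=x0.device)
    x_t, _ = q_sample(x0, t, alphas_cumprod)
    loss = torch.nn.functional.mse_loss(model(x_t, t), x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```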
Qualitative
- Generated text quality is poor (samples stored in samples/bad_text/)
- Text lacks coherence and often produces nonsensical sequences that read like random characters
- Clear disconnect between continuous embedding space and discrete token outputs
Key Learnings
- Embedding space diffusion is technically feasible: the mathematical framework works
- Decoding is the major bottleneck: the cosine-similarity approach has significant limitations
- The continuous-discrete gap is challenging: moving from smooth embeddings to sharp token decisions loses information
- A better bridging strategy is needed: simple nearest-neighbor decoding is insufficient for quality text (see the sketch after this list)
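For concreteness, the nearest-neighbor decoding step identified as the bottleneck looks roughly like this (a sketch; embedding_table is an assumed name for the token-embedding matrix):

```python
import torch.nn.functional as F

def decode(x_gen, embedding_table):
    """Map each generated embedding to the token whose embedding is most
    cosine-similar. This hard argmax is exactly where off-manifold
    embeddings collapse to arbitrary tokens and information is lost."""
    sims = F.normalize(x_gen, dim=-1) @ F.normalize(embedding_table, dim=-1).T
    return sims.argmax(dim=-1)  # (batch, seq_len) token ids
```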
Next Steps
- Experiment with guided generation combining autoregressive and diffusion approaches
- Investigate better decoding strategies beyond cosine similarity
- Consider hybrid approaches that maintain some discrete structure
- Explore different pre-trained embedding models
Sample Generation
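The stored samples live in samples/bad_text/ and are not reproduced here. A sketch of the reverse-diffusion sampling loop, reusing the assumed pieces above:

```python
@torch.no_grad()
def sample(model, embedding_table, alphas_cumprod, seq_len=64, dim=128):
    # Start from pure Gaussian noise in embedding space and denoise stepwise.
    x_t = torch.randn(1, seq_len, dim)
    for t in reversed(range(len(alphas_cumprod))):
        t_batch = torch.full((1,), t, dtype=torch.long)
        x0_hat = model(x_t, t_batch)  # predicted clean embeddings
        if t > 0:
            # Crude ancestral-style step: re-noise the estimate to t-1.
            a_prev = alphas_cumprod[t - 1]
            x_t = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * torch.randn_like(x0_hat)
        else:
            x_t = x0_hat
    # Nearest-neighbor decode back to discrete token ids.
    return decode(x_t, embedding_table)
```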
Training
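Illustrative wiring of the training loop; the noise schedule, hyperparameters, and the random stand-in batches below are assumptions, not values recorded from the run:

```python
model = TinyTransformer(dim=128)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
betas = torch.linspace(1e-4, 0.02, 1000)            # linear schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

for step in range(1000):
    # Random tensors stand in for batches of embedded Shakespeare text.
    x0 = torch.randn(32, 64, 128)  # (batch, seq_len, dim)
    loss = train_step(model, x0, alphas_cumprod, optimizer)
```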