Einsum Optimization and Deployment¶

Returning to this project after three months with the help of claude code. My goal is to demonstrate true cooperative multi-agent behavior when trained using adversarial self-play.

But before this, I wanted to make sure that I was really getting great performance during training. Recall that back in April, I had tensorized the game environment so that the training loop could run without passing data between CPU and GPU. I was able to simulate approximately 256 games per second (batch size of 512 running every 2 seconds).

I knew I still had plenty of room to optimize, and the easiest way to do that is to look for places to use Einsum notation, which is often a more optimized way of expressing operations that I had previously expressed with clumsy chains of sums, products and reductions.

Einsum notation was invented by Einstein more than a hundred years ago to help him represent the higher order matrix operations that he was using in his calculations. Fast forward to the present day and it can help us reduce intermediate tensor memory allocations, and leverage highly optimized einsum kernels to squeeze the last bit of performance from the compute graph optimizer.

I also wanted to streamline my deployment process so that I could run experiments without having it take up too much mental bandwidth. Inspired by my workflow in TinyDiffusionModels, I created a lightweight deployment harness with special instructions so that Claude Code could operate it at my behest. I've been finding this much more engoyable to operate than my old ML Ops tangles of second order configuratios.

Results were very positive, showing a 2x speedup in batch iteration speed and very quick convergence.