V1 Development and Experimentation

Functional Core

My initial goal was to set up the environment with a pythonic functional core. I created a GameState object that represented all the data at a given point in time, as well as PlayerAction objects. A pure function of the game state and the player actions (from all players on both teams) determined the GameState at the next timestep.
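A minimal sketch of what that might look like; the field names, the `advance` function, and its internals are illustrative assumptions rather than the actual model (pydantic v2 syntax):

```python
from pydantic import BaseModel


class PlayerAction(BaseModel):
    # Illustrative fields: a 2D movement vector and a kick strength.
    move_x: float
    move_y: float
    kick: float


class GameState(BaseModel):
    # Illustrative fields only; the real model has more detail.
    player_positions: list[tuple[float, float]]
    ball_position: tuple[float, float]
    score: tuple[int, int]
    timestep: int


def advance(state: GameState, actions: list[PlayerAction]) -> GameState:
    """Pure transition function: takes the current state and all players'
    actions and returns a brand-new GameState, with no mutation or side effects."""
    # ...apply movement, ball physics, and scoring rules here...
    return state.model_copy(update={"timestep": state.timestep + 1})
```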

Since the code was pure, functional Python/pydantic, it was easy to test rigorously. I also created a visualizer function that converted an array of game states into an MP4 video. All of my tests wrote out videos so that I could eyeball the logic as needed.

Environment

Inspired by OpenAI's Gym (now maintained as Gymnasium), I created an environment with a .step() method that followed the standard interface: it takes in the actions for the current step and produces the observations for the next timestep, along with a reward and basic information about the environment state.

The actions and observations are both flat numpy ndarray types. Even though I was aiming to develop an adversarial multi-agent game, I didn't have any special handling for the two teams in the interface: the actions of each team were simply concatenated together, and the observations were, of course, identical for both. I would have to handle all the multi-agent logic in my training script later on.

Note that, since I had defined all my game logic in the GameState functional data model, the environment itself was just a thin layer that mediated between ndarray types (that my neural networks would work with) and the native pydantic objects of the data model.
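A sketch of how such a thin wrapper might look; the class name, the episode length, and the encode/decode helpers are hypothetical stand-ins for whatever translation layer the real environment uses:

```python
import numpy as np

MAX_TIMESTEPS = 1000  # placeholder episode length


class SoccerEnv:  # hypothetical name
    def __init__(self):
        self.state = initial_game_state()  # assumed helper that builds the starting GameState

    def step(self, actions: np.ndarray):
        # Both teams' actions arrive concatenated in one flat ndarray.
        player_actions = decode_actions(actions)         # ndarray -> list[PlayerAction] (assumed helper)
        prev_state, self.state = self.state, advance(self.state, player_actions)
        obs = encode_observation(self.state)             # GameState -> flat ndarray (assumed helper)
        reward = compute_reward(prev_state, self.state)  # score-based reward, sketched below
        done = self.state.timestep >= MAX_TIMESTEPS
        return obs, reward, done, {}
```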

I'll talk about the reward (returned by the .step() method alongside the observation) more later on, but I started with a simple score-based reward.
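The exact shaping isn't important here, but a score-based reward could be as simple as the change in goal differential from team A's perspective (the score field is an assumption carried over from the GameState sketch above):

```python
def compute_reward(prev: GameState, new: GameState) -> float:
    # Positive when team A's lead grows, negative when team B scores.
    prev_diff = prev.score[0] - prev.score[1]
    new_diff = new.score[0] - new.score[1]
    return float(new_diff - prev_diff)
```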

Policy Network

I created the simplest possible policy network that would support the REINFORCE algorithm: a neural network that outputs a mu and a sigma (each of shape output_size), which serve as the parameters of the Normal distribution used for REINFORCE's Monte Carlo action sampling.
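A minimal sketch of such a network, assuming a small MLP body (the hidden size and activation are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNetwork(nn.Module):
    def __init__(self, obs_size: int, output_size: int, hidden_size: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_size, hidden_size), nn.Tanh())
        self.mu_head = nn.Linear(hidden_size, output_size)
        self.sigma_head = nn.Linear(hidden_size, output_size)

    def forward(self, obs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        h = self.body(obs)
        mu = self.mu_head(h)
        sigma = F.softplus(self.sigma_head(h)) + 1e-5  # keep sigma strictly positive
        return mu, sigma
```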

Training Loop

To begin training, I implemented a basic self-play loop using the REINFORCE algorithm. Two policy networks, one for each team, were initialized independently. On each timestep, both teams produced actions from their policy networks based on the current observation. These actions were sampled from Normal distributions parameterized by the network outputs (mu, sigma), and the log-probabilities of the sampled actions were stored at each timestep to compute the eventual policy-gradient loss.
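A sketch of the per-timestep sampling, reusing the PolicyNetwork sketch above; the rollout bookkeeping in the comments is illustrative:

```python
import torch
from torch.distributions import Normal


def sample_action(policy: PolicyNetwork, obs: torch.Tensor):
    """Sample a continuous action and keep its log-probability for the REINFORCE update."""
    mu, sigma = policy(obs)
    dist = Normal(mu, sigma)
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(-1)  # sum over action dimensions
    return action, log_prob


# Inside an episode rollout (sketch):
#   action_a, logp_a = sample_action(policy_a, obs_tensor)
#   action_b, logp_b = sample_action(policy_b, obs_tensor)
#   obs, reward, done, _ = env.step(torch.cat([action_a, action_b]).numpy())
#   ...store logp_a, logp_b, and reward for the end-of-episode update...
```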

Discounted Reward Computation

The environment produces a reward value every timestep, but we don't use this directly for gradient descent.

Following the REINFORCE approach, I computed discounted rewards at the end of each episode and normalized them, then applied the REINFORCE loss to update each team’s policy network via gradient descent. The discounted reward at any given timestep is the sum of all future rewards, each discounted by a decay factor, gamma (typically 0.99). The intuition here is that an action taken at time t might cause or contribute to a reward at time t+N, and so should be reinforced.
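Concretely, the discounted return at timestep t is G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., which is easiest to compute by iterating backwards over the episode's rewards. A minimal sketch:

```python
import torch


def discounted_returns(rewards: list[float], gamma: float = 0.99) -> torch.Tensor:
    """Compute G_t = sum_k gamma^k * r_{t+k}, then normalize to zero mean and unit variance."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns = torch.tensor(list(reversed(returns)))
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```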

The rewards were handled in a simple zero-sum format: team A received the reward from the environment (after discounted summation and normalization as above), while team B received its negation.
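Putting the pieces together, the per-team update might look like the sketch below; team B simply trains against the negated returns (optimizer setup omitted):

```python
import torch


def reinforce_update(optimizer: torch.optim.Optimizer,
                     log_probs: list[torch.Tensor],
                     returns: torch.Tensor) -> None:
    """One policy-gradient step: push up the log-probability of actions with high return."""
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# At the end of each episode (sketch):
#   returns = discounted_returns(episode_rewards)
#   reinforce_update(optimizer_a, logps_a, returns)    # team A
#   reinforce_update(optimizer_b, logps_b, -returns)   # team B receives the negation (zero-sum)
```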

Compute

The training loop, at 5k iterations with a batch of 64 games, ran in about an hour on my MacBook's CPU.

Why not use GPU hardware? Well, my environment relies on a functional data model in raw Python for the game simulation. In other words, whether the model lives on the GPU or the CPU, something has to happen on the CPU every timestep. Putting the model on the GPU would likely make things slower due to data transfer and synchronization overhead. I put off resolving this until V2.

Results

Because the episode trajectory was driven entirely by the competition between the two agents, their behaviors co-evolved over time. This created an emergent curriculum: early on, the agents moved randomly and struggled to reach the ball, but as training progressed, they began to learn to position, chase, and kick. I rendered short videos every few episodes to qualitatively inspect progress.

Conclusions and Next Steps

My agents had certainly learned something - they were able to chase the ball around and seemed to have developed basic skills for scoring and defense. However, there are several areas I would like to improve.

Policy Network Improvements

  • The model is quite small and is likely reaching some limits in its ability to learn more advanced tactics. Scaling up is an easy next step.
  • The model has a very basic architecture and almost no special handling for team A vs team B. Adding residual layers and other architectural tricks will help it learn faster.

Training Algorithm Improvements

  • The simplicity of the training loop (no critic, no replay buffer, no batch rollout) helped me iterate quickly and focus on the dynamics of multi-agent self-play. But it also exposed some limitations, especially around sample efficiency and variance in learning. Later versions would introduce improvements here.

End-to-end Tensorization

  • As mentioned above, data is being transferred between torch tensors and native Python data types. If I fully tensorize the environment and game physics simulation, I can run the entire training loop in torch, speeding things up dramatically. Note: this is a daunting task that would typically require some serious coffee drinking and squinting at the screen for long periods of time, but it also seems like something AI would be good at!