
V2 Development and Experimentation

Recall that at the end of V1, I had concluded with three areas for improvement:

  • Policy network improvements
  • Training algorithm improvements
  • End-to-end tensorization

End-to-end Tensorization

The training loop of V1 wasn't terribly slow, but I knew that scaling it up in its current form, with either a more powerful policy network or a more sophisticated training algorithm, would put a significant drag on development. End-to-end tensorization, i.e., expressing the entire simulation environment and training algorithm as tensor operations, would solve this with a major speedup. I decided to start here, hoping it would unlock a step change in training speed.

Note

This is something that I wouldn't normally take on in a side project. Tensorization isn't rocket science, but it requires a sustained level of moderately high mental effort: it would be like signing up for three hours of voluntary math homework.

This is a case where having an AI code assistant absolutely helped me achieve something that I wouldn't have taken on at all. All of the cutting-edge AI models were more than capable of converting the environment, and I was freed up to operate as a tech lead: thinking up useful utilities that would support an effective testing harness and keep the visualizations compatible.

In V1 I had implemented a batch training script, where multiple games were played at the same time and update gradients were averaged across the games. It's trivial to generate a batch of predictions from a policy network, but the simulation environment was "vectorized" by simply running multiple games serially. In V2, I wanted the environment to manage the state of a batch of games by holding all of that information in multi-dimensional arrays (in this case, torch tensors).
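
To make "holding all the information in multi-dimensional arrays" concrete, here is a minimal sketch of the shape convention involved; the constants and field names are my own illustration, not the actual V2 code:

import torch

BATCH_SIZE, PLAYERS_PER_TEAM = 64, 3

# One tensor per piece of game state; the leading dimension indexes the game,
# so a single arithmetic op advances every game in the batch at once.
team1_positions = torch.zeros(BATCH_SIZE, PLAYERS_PER_TEAM, 2)  # [batch, players, xy]
team2_positions = torch.zeros(BATCH_SIZE, PLAYERS_PER_TEAM, 2)
ball_position = torch.zeros(BATCH_SIZE, 2)                      # [batch, xy]
ball_velocity = torch.randn(BATCH_SIZE, 2)

DT = 0.1
ball_position += ball_velocity * DT  # updates all 64 balls with no Python loop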

For example, in V1, this is how I handled kicks (the logic is that the ball is kicked by any player within a certain distance of the ball):

# Sum up all kick forces from players within kicking distance
total_kx = sum(
    a.kx for p, a in zip(self.team_a, team_a_actions)
    if self._distance_to_ball(p.position) <= MIN_KICKING_DISTANCE
) + sum(
    a.kx for p, a in zip(self.team_b, team_b_actions)
    if self._distance_to_ball(p.position) <= MIN_KICKING_DISTANCE
)

total_ky = sum(
    a.ky for p, a in zip(self.team_a, team_a_actions)
    if self._distance_to_ball(p.position) <= MIN_KICKING_DISTANCE
) + sum(
    a.ky for p, a in zip(self.team_b, team_b_actions)
    if self._distance_to_ball(p.position) <= MIN_KICKING_DISTANCE
)

In V2, this became:

# Calculate distances from players to ball
ball_expanded = self.ball_position.unsqueeze(1)  # [batch_size, 1, 2]
team1_to_ball = torch.norm(self.team1_positions - ball_expanded, dim=2)
team2_to_ball = torch.norm(self.team2_positions - ball_expanded, dim=2)

# Determine players within kicking distance
team1_can_kick = team1_to_ball < MIN_KICKING_DISTANCE
team2_can_kick = team2_to_ball < MIN_KICKING_DISTANCE

# Calculate new ball velocity based on kicks
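# (team1_kick_vel / team2_kick_vel: each player's kick action,
#  shape [batch_size, players, 2]; defined outside this excerpt)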
team1_kick_mask = team1_can_kick.float().unsqueeze(-1)
team1_total_kicks = torch.sum(team1_kick_mask * team1_kick_vel, dim=1)
team1_kickers_count = torch.sum(team1_kick_mask, dim=(1, 2))

team2_kick_mask = team2_can_kick.float().unsqueeze(-1)
team2_total_kicks = torch.sum(team2_kick_mask * team2_kick_vel, dim=1)
team2_kickers_count = torch.sum(team2_kick_mask, dim=(1, 2))

# Combine team 1 and 2
total_kicks = team1_total_kicks + team2_total_kicks
total_kickers = team1_kickers_count + team2_kickers_count

# Safely average
kicking_mask = (total_kickers > 0).float().unsqueeze(-1)
safe_kickers = torch.clamp(total_kickers.unsqueeze(-1), min=1.0)
averaged_kicks = total_kicks / safe_kickers
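
The final block guards against division by zero: in games where no player is within kicking distance, total_kickers is 0, so the denominator is clamped to 1, and kicking_mask (presumably applied just after this excerpt) zeroes those entries out. A standalone toy example of the same clamp-then-divide pattern:

import torch

# Two games in the batch: one with 2 kickers, one with none.
total_kicks = torch.tensor([[2.0, 4.0], [0.0, 0.0]])  # summed kick vectors, [batch, 2]
total_kickers = torch.tensor([2.0, 0.0])              # kicker counts, [batch]

safe_kickers = torch.clamp(total_kickers.unsqueeze(-1), min=1.0)
averaged_kicks = total_kicks / safe_kickers  # [[1., 2.], [0., 0.]]; no NaNs in the empty game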

This is the un-sexy plumbing work of machine learning, but it resulted in significantly faster operations. Unfortunately, I still ran into trouble when trying to run this on a GPU.
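
Benchmarking the GPU path fairly takes a little care, because CUDA kernels launch asynchronously. Here is a sketch of the kind of wall-clock measurement behind numbers like those below; the step() interface is illustrative, not the actual V2 API:

import time
import torch

def iterations_per_second(env, actions, n_iters=100):
    """Time n_iters batched environment steps and return steps per second."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush queued GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(n_iters):
        env.step(actions)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the last kernel before stopping the clock
    return n_iters / (time.perf_counter() - start)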

Results

Environment   Device   Batch Size   Iterations per Second   Total Games per Second
V1            CPU      1            8                       8
V1            CPU      64           0.5                     32
V2            CPU      64           6                       384
V2            GPU      64           0.5                     32
V2            GPU      512          0.5                     256