import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
Welcome to Dynosaur, a machine learning project I started to learn all about reinforcement learning and beat the Google Dinosaur Jumper Game. To get started, head on over to http://dynosaur.billyz.me
The Google Dinosaur Game is found on Google Chrome's default offline page. The game has a simple objective: maximize your score by avoiding oncoming obstacles. The longer you survive, the higher the score becomes.
The dinosaur has three actions: jump (up arrow or space), duck (down arrow), or neither. When the dinosaur is ducking, its collision box becomes shorter than when it is standing up. When the dinosaur jumps, its collision box maintains the same dimensions while rising upwards using standard projectile physics.
There are two obstacles to avoid in the game: cacti and pterodactyls. Cacti have varying height but are always located on the ground. They can be clustered or singular, which alters the width of their collision boxes. In order to continue, dinosaurs must jump over the cacti. Pterodactyls are always singular and all have the same dimensions. However, pterodactyls can hover at three different heights. At the lowest height, the dinosaur must jump over the pterodactyls to continue. At the second lowest height, the dinosaur can jump over the pterodactyl or duck under in order to continue. At the highest height, the dinosaur can do all three actions (if timed correctly) in order to continue.
In order to make this game accessible to the bot, an open source version of the game, with identical physics, was utilized. The game file was modified to allow the bot to access data to use as inputs. These inputs are
In order to give the dinosaur some way to make decisions on whether to jump or duck, each dinosaur will have a neural network “brain”. Neural Networks consist of artificial neurons that are linked together to model biological neurons.
An artificial neuron applies activation function $\varphi(\nu)$ to Euclidean inner product of inputs and weights $\nu = \sum x_i \omega_i$ to produce output $y$, which is fed into the next neurons. In addition, the inner product may also be shifted by the bias of a node $b$. Thus $y=\varphi(\sum x_i \omega_i-b)$. Artificial neurons model biological neurons, which sum up all stimuli from dendrites and synapses across axon when above a certain threshold.
Activation functions are typically Sigmoid functions to round out extreme inputs and center them around zero.
A network of neurons allows one to create an overall system, in which each neuron’s output feeds into another’s input. Networks consists of input, hidden, and output neurons. Input neurons sense stimuli. Hidden neurons allow for internal abstraction of senses. Output neurons respond with interpretation of stimuli.
The dinosaurs will take the numerical values of the inputs measured above, propagate them through the networks and use the output values to decide whether or not to jump or duck. Between the two output neurons the actions will be decided using the following pseudocode:
if(output2 > outputThreshold2):
duck
else if(output1 > outputThreshold1):
jump
else:
idle
A dinosaur will perform and score based off of what neural network it has. Thus, we can consider a dinosaur's score a function of it's neural network. But because a network is merely an $n$-dimensional vector $v_n$ of weights and biases, the score is actually a function of an $n$-dimensional vector $f(v_n)$. Thus the problem to find an optimal scoring dinosaur is actually a multidimensional optimization problem to find the $n$-dimensional vector (set of weights and biases) that yield the global maximum.
The first method used to mutate the networks is known as neuroevolution. Neuroevolution serves as a heuristic for exploring the multidimensional search space utilizing genetic operators observed in biological evolution.
At its very base, neuroevolution modifies the genomes of a population. After the genomes are altered, the genomes are expressed into neural networks. In this project, the genome of the neuroevolution population is just the neural network. The neural network contains weights and biases all between $-1$ and $1$
In neuroevolution, the shape of the network is fixed--no edges or nodes will be added or removed. In this project, the shape of the neural network is that of a feed-forward network. However, the edges and nodes undergo weight and bias mutations, altering the values of the weights and biases used to calculate outputs.
After a dinosaur runs, it is assigned a fitness score using a fitness function. For standard neuroevolution, the fitness score is determined by $\psi(s) = s^2$ where $s$ is the dinosaur's score. This allows the fitness scale quadratically with score, weighting differences in lower scores less than differences in higher scores.
fig = plt.figure()
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)
x = np.linspace(0, 5)
y = x ** 2
axes.plot(x, y, 'r')
axes.set_xlabel('s')
axes.set_ylabel(r'$\psi$')
axes.set_title(r'$\psi(s) = s^2$')
plt.show()
Selection as a genetic operator is meant to organize the dinosaurs in some way so that they are crossed to encourage certain traits. For example, a random selection would arbitrarily select two dinosaurs to crossover, regardless of their fitness. Of course, this wouldn't be very useful, so a selection that works based on fitness scores was used. During this stage in the project, there weren't very many successful simulations, so a variety of selections were investigated, such as elitism (simply selecting the top $n$ dinosaurs by fitness). However, use of elitism was discouraged because it reduced the amount of exploration neuroevolution did, and quickly located local maxima (continuously jumping), but not the global one. Ultimately, a simple fitness proportionate selection was utilized in order to weight the dinosaurs by their fitness and encourage higher fitness dinosaurs to reproduce with other higher fitness dinosaurs, without reducing the population's diversity too quickly. Roulette wheel selection (the method used) is a specific type of fitness proportionate selection. In it, each dinosaur is given the probability of being selected based on the percentage its fitness constituted the total fitness of the population.
The pseudocode for Roulette Wheel Selection is as follows
Compute total fitness F
Generate random number R between 0 (inclusive) and F (not inclusive)
Compute cumulative fitness
Determine the index I of the cumulative fitness distribution which R is greater but less than the next index
Select I
After the dinosaurs are selected in some order, they are then crossed. Crossover in neuroevolution is quite simple; because each dinosaur has the exact same shape for their neural network as the others, corresponding nodes and edges would be switched depending on a set crossover rate
The pseudocode for crossing is as follows:
for all corresponding nodes and edges:
if(random < crossoverRate):
swap corresponding nodes/edges
Because the function of the game is not necessarily multiplicative in respect to the genomes, it does not necessarily follow that two high fitness dinosaurs will yield a higher or equally as high fitness dinosaur. Likewise for low scoring dinosaurs. There isn't necessarily a correlation between the performance of parent dinosaurs and the offspring.
In the neuroevolution portion of this project, only numerical adjustment mutations of the weights and biases were explored. Mutations occur at a random set rate. At first, the numerical adjustments shifted the weights and biases a fixed value in either direction, however it was noted that this mutated weights and biases too much, often quickly reaching the boundary of possible values. In order to solve this problem, a much more natural version of the numerical adjustment mutation was derived: the magnitude of adjustment was made by sampling a Gaussian Distribution $\mathcal{N}(0,1)$. Gaussian distribution are theorized to represent normal distributions, with high frequency towards more centered values and less chance of outer values being selected.
An issue that is quickly realized is that neuroevolution, although favorable for having fixed, manageable neural network shapes, has an enormous search space. Assuming that each node and bias have values to the nearest hundredth, each numerical value in the neural network can then occupy $200$ possible values. Thus, for a neural network vector of length $n$, there are $200^n$ possible combinations of weights and biases that the neural network could occupy, which is exponential (and very, very bad).
fig = plt.figure()
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)
x = np.linspace(0, 5)
y = 200 ** x
axes.plot(x, y, 'r')
axes.set_xlabel('Dimension')
axes.set_ylabel('Search Space')
plt.show()
The second method tried was based off a paper by Kenneth Stanley. Neuroevolution of Augmenting Topologies (NEAT) was a method commended for having much more success than the standard neuroevolution. Unlike in neuroevolution where the shape of the neural networks were fixed, the topologies of networks in NEAT are openly available to genetic modification. This is what makes it beneficial. Because each network starts initially as an empty network, and grows in degree from there, NEAT is able to explore the search space while minimizing dimensionality (see above)
Sequential NEAT borrows from the algorithms already created in Neuroevolution, however with one major difference: the code is now organized in an object-orientated manner. This makes it much easier to manage individuals within the population.
Because the topology of the networks are now alterable, the genome of the dinosaur can no longer be just the neural network. In order to appropriately manage neurons when crossing and mutating dinosaurs, JSON objects were used to nodes and edges. Node biases were simply indexed by their node number, while edges were indexed by their innovation number (will be explained later). Edges also contain a new attribute: whether or not they are enabled (also will be explained later). Then later, when the dinosaurs are simulated, a neural network is expressed from the genome.
Because the edges can now be from any node to any node (including itself), NEAT dinosaurs have access to networks known as recurrent neural networks. RNNs allow for more specified behavior, as networks now have access to its own values, creating a sense of short term memory.
Now that there's an element of size of a network, the fitness function must be tweaked slightly. As mentioned before, the thing that makes NEAT more favorable than neuroevolution is its ability to work from the bottom up in respect to dimensionality. In order to encourage this, the fitness function will be modified to take into account the size of the graph. It is now given by $\psi(s,N) = \frac{s^2}{\sqrt{N}}$ where $s$ is the score and $N$ is the number of edges in the graph. This encourages smaller networks while the square root penalizes differences in smaller degree networks more than differences in larger degree networks.
fig = plt.figure()
axes = fig.gca(projection='3d')
x = np.linspace(1, 5)
y = np.linspace(1, 5)
x, y = np.meshgrid(x, y)
z = x ** 2 / 1.0 * np.sqrt(y)
surf = axes.plot_surface(x, y, z)
axes.set_xlabel(r'$s$')
axes.set_ylabel(r'$N$')
axes.set_zlabel(r'$\psi$')
axes.set_title(r'$\psi(s,n) = \frac{s^2}{\sqrt{N}}$')
plt.show()
Now that the genome is different, a new crossover algorithm must be devised. The one used in this project is the one proposed by Stanley in his paper. It follows this diagram:
First, an issue that is run into is the naming conventions of edges. In order to solve this problem, a global edge labeling system is proposed by Stanley, known as innovation number. Whenever a novel edge is discovered (between a combination of nodes not recorded before), the total innovations is incremented and the newest edge is given the highest innovation number. These numbers will be used to easily compare the topology of one dinosaur to another. The proposed crossover algorithm lines up the two parents' genomes so they match in innovation number and are in the respective order. This process is known as synapsis, an event that occurs in biological crossover as well, when comparing homologous chromosomes. After the genes are lined up, the offspring inherits genes from the parents according to the following pseudocode
Without Loss of Generality, Assume Parent 1 has higher or equal fitness to Parent 2
If both parents have Gene x (mutual gene):
Inherit Gene x from a parent randomly (according to crossover rate)
If only one parent has Gene x (excess at the end, disjoint otherwise):
If Parents have equal fitness:
Inherit the gene/lack of gene from a parent randomly (according to crossover rate)
else:
Inherit the gene/lack of gene from Parent 1
Now that there are variable shape networks, there are new structural mutations, as well as disabling/enabling mutations. Each mutation's probability of occurring can be adjusted in the configuration files. The new mutations are as follows
Add Node Mutation
Disable Mutation
Unfortunately, the reduction of dimensionality was still not enough to get reasonable results. While better, running NEAT sequentially still requires a lot of time to simulate each dinosaur individually.
This is where the object-oriented data comes in handy. From sequential NEAT to parallel neat, nothing is really different. The only real difference is there are no longer any UI elements because all of the processing power is used to simulate as many dinosaurs at the same time as possible. All the dinosaurs begin at the same time, and they run until they lose, at which point they wait for all of the other dinosaurs to fail. After all the dinosaurs have failed, the genetic operations are carried out and the cycle is repeated. This is good because now the runtime of the program is independent of the size of the population, meaning that much more dinosaurs can be simulated in a much shorter time.
Because of it's parallel set-up, parallel NEAT was able to produce over-1000 scoring dinosaurs after 3-4 hours on average.
The most obvious issue with parallel NEAT is that a lot of runtime is wasted waiting for the highest-scoring dinosaurs to finish.
Now this is where things get good. In order to remedy the problem with parallel NEAT, I designed a variation of parallel NEAT where a new offspring automatically populates the game of a dinosaur that failed. Now there is no time wasted waiting for other dinosaurs to finish.
Under the assumption that when a dinosaur failed, all the other dinosaurs were still running, we run into a major problem: the selection algorithm used to pick dinosaurs to crossover no longer works because it depends on fitnesses and thus scores of the other dinosaurs. However, because all the other dinosaurs had begun at different times, it was impossible to tell which dinosaurs actually had a bad score and which had just started and so on.
The solution to the aforementioned problem that I came up with was to establish a rank priority queue. This FIFO queue would keep track of the dinosaurs in respect to how long they lived. When a dinosaur began running, it would push itself into the queue and when it stopped running, it would delete itself from the queue. In order to find the longest living dinosaur, some method of traversal from the head of the queue towards the tail. So preferably the queue operations would look something like this
Queue Operation | Time Complexity |
---|---|
Push | $O(1)$ |
Delete | $O(1)$ |
Traverse | $O(n)$ |
In order to accomplish this (and because javascript doesn't allow for addressing), I used three arrays to create a pseudo-pointer system. Two of the arrays were used to maintain a doubly-linked list, while one array was used to keep track of the address in the array of each dinosaur.
When a dinosaur fails, the pointer will begin at the head of the priority queue, and will traverse back and select a dinosaur according to a geometric distribution of probability. Thus, longer living dinosaurs are more likely to be selected, however shorter living ones are still allowed to reproduce.
Because continuous NEAT allows each simulation to progress independently of the others, this allows for a variety of evolutionary times. This works to our favor because, the shorter a dinosaur lasts, the more evolutionary cycles it will undergo in a set amount of time. Likewise, the longer a dinosaur lives, the less amount of evolutionary cycles it will undergo. This works almost like a dynamically adjusting learning rate, which is nice.
Continuous NEAT has a high score of 7900 and usually reaches an average score of 1000-2000 within an hour.
Here are some things I plan on working on
Backpropagation is a version of Dynosaur that is supervised, unlike the rest of Dynosaur. It utilizes Long Short Term Memory Architecture to learn how to respond to inputs using the user/players actions as labeled data
Sequential Q is an attempt to utilize neural network approximated Q-learning to learn how to play the game. Currently, I am using the Euclidean Distance from the dinosaur to the first obstacle as a measure of reward. This seems like it should work intuitively because we would want the dinosaur to be incentivized to stay away from the obstacle ahead. The network used to decide whether or not to act is a perceptron, but currently, the dinosaur's movement is very finnicky
Parallel Q is just Sequential Q's that all train a single neural network.
Cloud Dynosaur is a version of dynosaur that is primarily cloud-based. A backend server maintains a global population of dinosaurs. Meanwhile, users can come to the site and volunteer their hardware to train a sample of the existing dinosaurs and/or train new dinosaurs. After the dinosaurs are trained, they will be reuploaded back to the cloud.
Just as a side note, Parallel NEAT, Continuous NEAT, Sequential Q, and Parallel Q all don't have user interfaces (for the aforementioned reasons). Here's how to use them if you're curious. Just open up the console and type the following commands.
//n is how many dinosaurs you want to evolve. Be careful!
neat.init(n);
//Press Spacebar and Immediately Click Away
neat.startEvolution();
//n is how many dinosaurs you want to evolve. Be careful!
neat.init(n);
neat.startEvolution();
//Press Spacebar on Screen
//Press Spacebar on Screen
//n is how many dinosaurs you want to evolve. Be careful!
init(n);
//Press Spacebar on Screen