April 6, 2022

Genetic Algorithms vs Reinforcement Learning

Kriti Singh

Consider a recent decision you grappled with. Perhaps you struggled with it because you didn't know where to begin. Perhaps you had a haphazard and arbitrary solution in mind but were having trouble optimizing it.

Reinforcement Learning or Genetic Algorithms might be the answer to these roadblocks for you!

Reinforcement Learning (RL) is a decision-making science. It's about figuring out how to behave optimally in a given situation to maximize reward. This ideal behavior is acquired by encounters with the environment and observations of how it reacts, similar to how toddlers explore their surroundings and learn the skills which help them realize a goal.

Genetic Algorithms is a search heuristic that starts with a random set of possible (lousy) solutions, modifies it at random, and chooses the best amongst them eventually reaching the required precision.

This blog aims to discover the difference between the two when to use which approach and the exciting domain of combining the best of both worlds.

So, what is Reinforcement Learning?

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Reinforcement Learning is based on the premise that an agent will learn from their environment by interacting with it and obtaining incentives for doing actions.

Value-based, policy-based, and model-based are the three approaches to solve a Reinforcement Learning problem.

Value Based

The objective of value-based RL is to maximize the value function V(s). The value function is a function that tells us the maximum expected future reward the agent will get at each state. The value of each state is the total amount of the reward an agent can expect to accumulate over the future, starting at that state. At each step, the agent will utilize the value function to determine which state to pick. The agent takes the state with the biggest value.

Policy Based

In policy-based RL, we want to directly optimize the policy function π(s) without using a value function. The policy is what defines the agent's behavior at a given time.

Model Based:

In the model-based method, a virtual framework is designed for the environment, and the agent explores that environment to learn it. Because the model representation differs depending on the environment, there is no specific solution or algorithm for this approach.

We have two types of policy:

Deterministic: a policy at a given state will always return the same action.

Stochastic: output a distribution probability over actions.

For a deeper introduction into the theory of Reinforcement Learning – make sure to read Thomas Simonini's articles on the same.

A genetic algorithm is a search heuristic centered on Charles Darwin's natural selection hypothesis. This algorithm mimics natural selection, in which the fittest individuals are chosen for reproduction in order to create the following generation's offspring.

The process of natural selection starts with the selection of fittest individuals from a population. They generate offspring which inherit the parents' traits and are passed on to the next generation. If parents have better fitness, their offspring will possibly be better than parents and have a better chance at surviving This process will continue to iterate until a generation of the fittest individuals is discovered.

A genetic algorithm takes into account five steps.

Initialization. It is necessary for evolution to be seeded with an initial solution. The process begins with a set of individuals which is called a Population. Each individual is a solution to the problem you want to solve.

Mutation. This step merges the genetic information of two parents to produce new offspring. It ensures genetic diversity in a population from generation to generation.

Evaluation (Fitness Function) The fitness function determines an individual's level of fitness (the ability of an individual to compete with other individuals). It assigns each individual a fitness score. The fitness score determines the likelihood of an individual being chosen for reproduction.

Selection. The selection phase's goal is to choose the fittest individuals and let them transmit their genes on to the next generation. Based on their fitness rankings, two pairs of individuals (parents) are chosen. Individuals with high fitness have a better probability of being chosen for reproduction.

Output. Evolution is an iterative process that may be terminated at any stage to produce a better (or unchanged) version of the preceding generation. We may choose to terminate if we hit generation limit (number of iterations) or have reached the declared threshold.

If you would like to try the power of genetic algorithms from the comfort of your browser or you're just interested to see lanky humanoids getting out of a bar at 2 am OR misshapen cars trying their best on a racing course, and would eventually like to implement them – keep reading on!

In this blog, we aim to find out which approach — GA or RL — would be better to tackle the following problem statement:

Given a canvas of any color, other than black and white, the goal is to draw two shapes, a rectangle and an ellipse with three variables — length, width and angle of rotation (theta), in such a way that the number of black pixels and the number of white pixels in the image are equal.

The constraints here are:

The shapes should be aligned at the center of the canvas.

No shape can have a dimension set as zero in the initial environment declaration.

We implement both from scratch, without using any libraries.

Differences between GA and RL:

RL only depends on the initial environment provided to it. GA is dependent completely on parents, i.e., its initial population.

When we create the offspring population by inheriting traits from the parents, there is no way of knowing whether the best traits from the parents are getting inherited. This is completely random. This in turn gives no guarantee that during mutation, the best offspring are created.

RL actually learns and converges to a solution, whereas GA is based on probabilistic selection.

These differences are better explained with an experiment.

P = [(length of rectangle, width of rectangle), (length of ellipse, width of ellipse), (rotation of rectangle, rotation of ellipse)]

To visualize better, we have included the images generated for our initial parent population.

The initial population then undergoes a fitness evaluation, where we pass the generated images into our loss function, the loss is calculated as follows:

[katex]loss = \left(1 – \frac(number\: of\: black\: pixels)(number\: of \:white\: pixels)\right) ^2[/katex]

The respective losses are then appended to a list. This loss list is sorted in ascending order and we pick those parents from our population which correspond to the least two losses.

These two parents are selected as the parents that will mutate to form the offspring population. The parents are passed to the mutate function which outputs the list of mutated children. Each individual of the offspring population is created by inheriting some traits from the parent. To ensure that we do not lose the best properties of the parent, we also append the parents to the list of mutated children. This guarantees that the loss from our fitness function does not increase.

The following mutation functions takes in three parameters, two parents (the ones with the least loss) and the number of children in the offspring population we want to generate. We opted for 100 children in our offspring population to conduct our experiments.

We can visualize that the mutations are completely random, at complete contrast to RL, which learns iteratively.

We again evaluate this new list having both parents and the mutated children by passing it through the fitness function. The respective losses are logged again and we repeat the same process listed above until we reach our threshold.

We terminate the process if we reach the threshold or reach the condition where the loss becomes constant. This happens when both parents (the ones with the least loss) become the same, i.e., they have the same shape parameters — width, height and theta. This results in no further mutation which further results in a constant loss.

The shape parameters are generated the same way as they are in GA (mentioned above).

The parameters are then used to generate a following image with shapes having the exactly same dimensions and rotations as provided by the respective parent.

Since RL can learn we chose the following hyperparameters:

Learning Rate: The learning rate is a hyperparameter that controls how much to change the model in response to the estimated loss each iteration.

Choosing too small a learning rate could result in a very long training process that could get stuck. Choosing too big a value for learning rate could result in an unstable learning rate as well as the model being unable to fit due to the rapid decay in loss, resulting in the model being unable to learn effectively.

For our model, we concluded that the following hyperparameters suited best:

Policy reached threshold 0.1 in 15 steps.

Policy reached threshold 0.01 in 19 steps.

Policy reached threshold 0.001 in 21 steps.

Policy reached threshold 0.0001 in 32 steps.

Policy reached threshold 1e-05 in 51 steps.

Policy reached threshold 1e-06 in 51 steps.

Current loss: 0.05022826366423828

Current loss: 0.008961643926504285

Current loss: 0.0004327729057949906

Current loss: 8.309693633799824e-05

Current loss: 6.323353269732713e-07

Current loss: 6.323353269732713e-07

We evaluate both of their performances on an adaptive threshold scale from 1e-01 to 1e-06.

From our experiments, we realize it is very necessary to pick a good starting point, i.e., the initial environment, because they lead to drastically different outcomes. GA is completely dependent on its parents, and can only select the best ones. There is a probability factor in GA where we have no idea if the mutations that occurred are the best ones. We can see that in the following case where the loss decreases to a whopping 1e-8 in the second iteration for Seed 1 but for Seed 2, the training stops altogether (Loss not decreasing, training stopped).

Seed 1

Seed 2

Loss Graph of Seed 1

Loss Graph of Seed 2

In our experiments we ran into the possibility of both parents becoming the same, i.e., after the mutation step, the least two losses have the same dimensions (same parents). This results in no further mutation which further results in a constant loss thus resulting in a constant graph (see fig.). To understand this better, one can hypothetically imagine the progeny of two same parents — the child will look no different to the parent. To prevent this one can introduce more rigorous mutations in each generation to thoroughly explore the suite of solutions.

To not go into an infinite loop we checked if the consequent losses in our list was equal to the previous loss and terminated if it were the case. GA could not guarantee reaching the declared threshold.

However, RL guarantees to converge to a given threshold through incrementing/decrementing the dimensions by a product of learning rate and alpha iteratively. (declared above). This particular seed converged to a 1e-08 threshold in 3053 steps. This can be reduced by tuning chosen hyper parameters.

GA is adaptive through selection and reaches a global minimum. RL, on the other hand, possibly converges to a local minimum.

GA does exploration and is not bounded by learning.

As programmers, we are used to writing precise set of instructions to address an issue. Often we come across problems that are too complex or we have no idea how to solve it. How can we write a solution to a problem if we don't know how to solve the problem in the first place? This is where we believe genetic algorithms can help. Evolution's goal is to enhance the quality of a bad answer by randomly changing it until it solves the issue with the required precision. If you don't know what to do – GA is your answer.

The Nondominated Sorting Genetic Technique II (NSGAII), is a prominent policy search algorithm. NSGAII has been proven to converge faster than other algorithms in many applications and to maintain a high degree of variety, avoiding being stuck in a local minimum. The algorithm excels at generating a wide range of results.

For our current problem statement, we just had one objective: to have equal number of black and white pixels. However, in most real world applications, there are more than one objectives, otherwise known as multi-objective problems. Inevitably, these objectives are often at conflict with each other — compromises have to be made to amongst objectives. It is unlikely to find a single, "best" solution in such cases, and policymakers have to choose the strategy that best suits the welfare of everyone involved. Multi-objective optimization algorithms can help policymakers in drawing strategies for complex real world issues and be more aware about the tradeoffs that come with different solutions.

In a perfect world, we would combine the best of reinforcement learning and the best of genetic algorithms. By ensuring there is enough diversity in the mutations produced by genetic algorithms, we can then choose the best solution and introduce it to RL's environment and further tune it. We aim to do the same in future blogs.

Reinforcement Learning or Genetic Algorithms might be the answer to these roadblocks for you!

Reinforcement Learning (RL) is a decision-making science. It's about figuring out how to behave optimally in a given situation to maximize reward. This ideal behavior is acquired by encounters with the environment and observations of how it reacts, similar to how toddlers explore their surroundings and learn the skills which help them realize a goal.

Genetic Algorithms is a search heuristic that starts with a random set of possible (lousy) solutions, modifies it at random, and chooses the best amongst them eventually reaching the required precision.

This blog aims to discover the difference between the two when to use which approach and the exciting domain of combining the best of both worlds.

Reinforcement Learning

You may have seen that computers can now learn to play ATARI games (from raw game pixels! ), that they can defeat world champions at Go, that simulated quadrupeds can run and jump, and that robots can learn to execute complicated manipulation tasks that defy explicit programming. It turns out that all of these advancements fit within the tent of Reinforcement Learning research. So, what is Reinforcement Learning?

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Reinforcement Learning is based on the premise that an agent will learn from their environment by interacting with it and obtaining incentives for doing actions.

Value-based, policy-based, and model-based are the three approaches to solve a Reinforcement Learning problem.

Value Based

The objective of value-based RL is to maximize the value function V(s). The value function is a function that tells us the maximum expected future reward the agent will get at each state. The value of each state is the total amount of the reward an agent can expect to accumulate over the future, starting at that state. At each step, the agent will utilize the value function to determine which state to pick. The agent takes the state with the biggest value.

Policy Based

In policy-based RL, we want to directly optimize the policy function π(s) without using a value function. The policy is what defines the agent's behavior at a given time.

Model Based:

In the model-based method, a virtual framework is designed for the environment, and the agent explores that environment to learn it. Because the model representation differs depending on the environment, there is no specific solution or algorithm for this approach.

We have two types of policy:

Deterministic: a policy at a given state will always return the same action.

Stochastic: output a distribution probability over actions.

For a deeper introduction into the theory of Reinforcement Learning – make sure to read Thomas Simonini's articles on the same.

Genetic Algorithm

You've probably heard of Darwin's "survival of the fittest" theory. The underlying concept is straightforward: survival is a difficult undertaking, and those that succeed have a better chance of reproducing. The environment automatically picks the animals that best meet its needs, eliminating those who do not. A genetic algorithm is a search heuristic centered on Charles Darwin's natural selection hypothesis. This algorithm mimics natural selection, in which the fittest individuals are chosen for reproduction in order to create the following generation's offspring.

The process of natural selection starts with the selection of fittest individuals from a population. They generate offspring which inherit the parents' traits and are passed on to the next generation. If parents have better fitness, their offspring will possibly be better than parents and have a better chance at surviving This process will continue to iterate until a generation of the fittest individuals is discovered.

A genetic algorithm takes into account five steps.

Initialization. It is necessary for evolution to be seeded with an initial solution. The process begins with a set of individuals which is called a Population. Each individual is a solution to the problem you want to solve.

Mutation. This step merges the genetic information of two parents to produce new offspring. It ensures genetic diversity in a population from generation to generation.

Evaluation (Fitness Function) The fitness function determines an individual's level of fitness (the ability of an individual to compete with other individuals). It assigns each individual a fitness score. The fitness score determines the likelihood of an individual being chosen for reproduction.

Selection. The selection phase's goal is to choose the fittest individuals and let them transmit their genes on to the next generation. Based on their fitness rankings, two pairs of individuals (parents) are chosen. Individuals with high fitness have a better probability of being chosen for reproduction.

Output. Evolution is an iterative process that may be terminated at any stage to produce a better (or unchanged) version of the preceding generation. We may choose to terminate if we hit generation limit (number of iterations) or have reached the declared threshold.

If you would like to try the power of genetic algorithms from the comfort of your browser or you're just interested to see lanky humanoids getting out of a bar at 2 am OR misshapen cars trying their best on a racing course, and would eventually like to implement them – keep reading on!

In this blog, we aim to find out which approach — GA or RL — would be better to tackle the following problem statement:

Given a canvas of any color, other than black and white, the goal is to draw two shapes, a rectangle and an ellipse with three variables — length, width and angle of rotation (theta), in such a way that the number of black pixels and the number of white pixels in the image are equal.

The constraints here are:

The shapes should be aligned at the center of the canvas.

No shape can have a dimension set as zero in the initial environment declaration.

We implement both from scratch, without using any libraries.

Differences between GA and RL:

RL only depends on the initial environment provided to it. GA is dependent completely on parents, i.e., its initial population.

When we create the offspring population by inheriting traits from the parents, there is no way of knowing whether the best traits from the parents are getting inherited. This is completely random. This in turn gives no guarantee that during mutation, the best offspring are created.

RL actually learns and converges to a solution, whereas GA is based on probabilistic selection.

These differences are better explained with an experiment.

GA Implementation

We initialize the experiment with declaration of an initial set of shape parameters, referred to as initial population. These initial sets of shape parameters are randomly generated using Numpy's random integer generator. Each parent declared has the following set of parameters:P = [(length of rectangle, width of rectangle), (length of ellipse, width of ellipse), (rotation of rectangle, rotation of ellipse)]

`def get_parents(ip_seed): `

np.random.seed(ip_seed)

rect_shape = list(np.random.randint(10, 900, size =2))

ellipse_shape = list(np.random.randint(10, 900, size =2))

theta = list(np.random.randint(0,361, size =2))

return rect_shape, ellipse_shape, theta

These parameters are then used to generate a respective image with shapes having the exactly same dimensions and rotations as provided by the respective parent. These sets of images constitute the initial population and their respective parents are the properties of the initial population.To visualize better, we have included the images generated for our initial parent population.

Initial parent population

[katex]loss = \left(1 – \frac(number\: of\: black\: pixels)(number\: of \:white\: pixels)\right) ^2[/katex]

The respective losses are then appended to a list. This loss list is sorted in ascending order and we pick those parents from our population which correspond to the least two losses.

These two parents are selected as the parents that will mutate to form the offspring population. The parents are passed to the mutate function which outputs the list of mutated children. Each individual of the offspring population is created by inheriting some traits from the parent. To ensure that we do not lose the best properties of the parent, we also append the parents to the list of mutated children. This guarantees that the loss from our fitness function does not increase.

The following mutation functions takes in three parameters, two parents (the ones with the least loss) and the number of children in the offspring population we want to generate. We opted for 100 children in our offspring population to conduct our experiments.

We can visualize that the mutations are completely random, at complete contrast to RL, which learns iteratively.

`def mutate_parents(parent_1, parent_2, number_of_children): `

list_of_mutated_children = [parent_1, parent_2]

for i in range(0,number_of_children):

np.random.seed()

binary_mask_1 = np.random.randint
(0,2,size=np.array(parent_1).shape)

binary_mask_2 = np.logical_not(binary_mask_1)

new_child = np.add(np.multiply
(binary_mask_1,parent_1),np.multiply
(binary_mask_2,parent_2))

new_child = new_child.tolist()

list_of_mutated_children.
append(new_child)

return list_of_mutated_children

Mutations to create offspring population

We again evaluate this new list having both parents and the mutated children by passing it through the fitness function. The respective losses are logged again and we repeat the same process listed above until we reach our threshold.

We terminate the process if we reach the threshold or reach the condition where the loss becomes constant. This happens when both parents (the ones with the least loss) become the same, i.e., they have the same shape parameters — width, height and theta. This results in no further mutation which further results in a constant loss.

RL Implementation

GA needs an initial population to begin, RL requires an initial environment. We provide GA with an initial population of ten parents. This initial population is completely random. To compare the two approaches on the same subset, we run the GA experiment and get the best parent out of the initial population of ten. We then pass this best parent as the seed in RL to optimize on, thus giving RL the best initial environment. In this way we ensure that the comparisons drawn are fair.The shape parameters are generated the same way as they are in GA (mentioned above).

The parameters are then used to generate a following image with shapes having the exactly same dimensions and rotations as provided by the respective parent.

Since RL can learn we chose the following hyperparameters:

Learning Rate: The learning rate is a hyperparameter that controls how much to change the model in response to the estimated loss each iteration.

Choosing too small a learning rate could result in a very long training process that could get stuck. Choosing too big a value for learning rate could result in an unstable learning rate as well as the model being unable to fit due to the rapid decay in loss, resulting in the model being unable to learn effectively.

For our model, we concluded that the following hyperparameters suited best:

`learning_rate = 0.01 `

alpha = 100

We pass the dimensions through an update function that increments/decrements the dimensions by a product of learning rate and alpha basis random probability`def update(loss, white, black, img, rect_shape, ellipse_shape, theta): `

prob = np.random.randint(0,2)

r_theta = theta[0] + 5

e_theta = theta[1] + 5

img_e = img_with_theta(rect_shape, ellipse_shape,[theta[0], e_theta])

img_r = img_with_theta(rect_shape, ellipse_shape, [r_theta,theta[1]])

loss_r, white_r, black_r = cal_loss(img_r)

loss_e, white_e, black_e = cal_loss(img_e)

if min(loss, loss_r, loss_e) == loss_e or min(loss, loss_r, loss_e) == loss_r:

if min(loss_r, loss_e) == loss_e:

if e_theta>=360:

e_theta = e_theta - 360

return loss_e, white_e, black_e, img_e, rect_shape, ellipse_shape, [theta[0],e_theta]

else:

if r_theta>=360:

r_theta = r_theta - 360

return loss_r, white_r, black_r, img_r, rect_shape, ellipse_shape, [r_theta,theta[1]]

if white > black and prob == 0:

rect_shape[0] = rect_shape[0] + (learning_rate * alpha)

rect_shape[1] = rect_shape[1] + (learning_rate * alpha)

elif white > black and prob == 1:

ellipse_shape[0] = ellipse_shape[0] - (learning_rate * alpha)

ellipse_shape[1] = ellipse_shape[1] - (learning_rate * alpha)

elif black > white and prob == 0:

rect_shape[0] = rect_shape[0] - (learning_rate * alpha)

rect_shape[1] = rect_shape[1] - (learning_rate * alpha)

elif black > white and prob == 1:

ellipse_shape[0] = ellipse_shape[0] + (learning_rate * alpha)

ellipse_shape[1] = ellipse_shape[1] + (learning_rate * alpha)

img = img_with_theta(rect_shape, ellipse_shape, theta)

loss, white, black = cal_loss(img)

return loss, white, black, img, rect_shape, ellipse_shape, theta

Policy reached threshold 0.1 in 15 steps.

Policy reached threshold 0.01 in 19 steps.

Policy reached threshold 0.001 in 21 steps.

Policy reached threshold 0.0001 in 32 steps.

Policy reached threshold 1e-05 in 51 steps.

Policy reached threshold 1e-06 in 51 steps.

Current loss: 0.05022826366423828

Current loss: 0.008961643926504285

Current loss: 0.0004327729057949906

Current loss: 8.309693633799824e-05

Current loss: 6.323353269732713e-07

Current loss: 6.323353269732713e-07

Experiments

We initialize GA with np.random.seed to get consistent results. GA has the advantage of choosing the best parent out of ten options. To keep things fair, we gave the best parent seed out of 10 from GA as the seed for RL to optimize on. We evaluate both of their performances on an adaptive threshold scale from 1e-01 to 1e-06.

GA vs RL Conclusions:

We notice that GA can effectively drop to a very low threshold (even if a higher threshold is declared). This can be observed in the case of Seed 1. This is in sharp contrast to RL which would iteratively reach that threshold through learning and take time.From our experiments, we realize it is very necessary to pick a good starting point, i.e., the initial environment, because they lead to drastically different outcomes. GA is completely dependent on its parents, and can only select the best ones. There is a probability factor in GA where we have no idea if the mutations that occurred are the best ones. We can see that in the following case where the loss decreases to a whopping 1e-8 in the second iteration for Seed 1 but for Seed 2, the training stops altogether (Loss not decreasing, training stopped).

Seed 1

Seed 2

Loss Graph of Seed 1

Loss Graph of Seed 2

In our experiments we ran into the possibility of both parents becoming the same, i.e., after the mutation step, the least two losses have the same dimensions (same parents). This results in no further mutation which further results in a constant loss thus resulting in a constant graph (see fig.). To understand this better, one can hypothetically imagine the progeny of two same parents — the child will look no different to the parent. To prevent this one can introduce more rigorous mutations in each generation to thoroughly explore the suite of solutions.

To not go into an infinite loop we checked if the consequent losses in our list was equal to the previous loss and terminated if it were the case. GA could not guarantee reaching the declared threshold.

However, RL guarantees to converge to a given threshold through incrementing/decrementing the dimensions by a product of learning rate and alpha iteratively. (declared above). This particular seed converged to a 1e-08 threshold in 3053 steps. This can be reduced by tuning chosen hyper parameters.

GA does exploration and is not bounded by learning.

As programmers, we are used to writing precise set of instructions to address an issue. Often we come across problems that are too complex or we have no idea how to solve it. How can we write a solution to a problem if we don't know how to solve the problem in the first place? This is where we believe genetic algorithms can help. Evolution's goal is to enhance the quality of a bad answer by randomly changing it until it solves the issue with the required precision. If you don't know what to do – GA is your answer.

Improvements

Since we developed a genetic algorithm from scratch, it often converged to a local minima. However, by introducing various types of mutations, currently outside the scope of this blog, one can ensure preserving a good level of diversity, and rigorously explore the whole domain of possible solutions. The Nondominated Sorting Genetic Technique II (NSGAII), is a prominent policy search algorithm. NSGAII has been proven to converge faster than other algorithms in many applications and to maintain a high degree of variety, avoiding being stuck in a local minimum. The algorithm excels at generating a wide range of results.

For our current problem statement, we just had one objective: to have equal number of black and white pixels. However, in most real world applications, there are more than one objectives, otherwise known as multi-objective problems. Inevitably, these objectives are often at conflict with each other — compromises have to be made to amongst objectives. It is unlikely to find a single, "best" solution in such cases, and policymakers have to choose the strategy that best suits the welfare of everyone involved. Multi-objective optimization algorithms can help policymakers in drawing strategies for complex real world issues and be more aware about the tradeoffs that come with different solutions.

In a perfect world, we would combine the best of reinforcement learning and the best of genetic algorithms. By ensuring there is enough diversity in the mutations produced by genetic algorithms, we can then choose the best solution and introduce it to RL's environment and further tune it. We aim to do the same in future blogs.

Share

Tags