---
title: Long-term bacterial adaptation to a constant environment as a model for genetic algorithm design
author:
institute:
bibliography: paper.bib
---
Evolutionary algorithms are a class of computational techniques for problem solving based on Darwinian evolution. These methods are centered on increasing the "fitness" of a population of candidate problem solutions through successive generations exposed to the forces of selection, mutation, and recombination until a solution is found. Using experimental evolutionary results from bacterial populations adapting to a constant environment, a novel genetic algorithm experimental model was tested, yielding an 18% reduction in the amount of computation required to converge on an optimal solution for a baseline problem. These results support the applicability of experimental evolution to evolutionary algorithm research and warrant further investigation.
Algorithm development is one of the central research areas of computer science. One frequent source of inspiration for algorithm design is biology in general and evolution in particular. Indeed, some computationally intractable problems can be approximately solved using evolutionarily inspired methods, broadly termed evolutionary algorithms.
One specific and widely used evolutionary algorithm method is genetic algorithms [@hollandAdaptationNaturalArtificial1992a]. To imitate evolution, genetic algorithms attempt to solve problems by increasing the fitness of (i.e., adapting) a population of breeding individuals, subject to the effects of mutation, recombination, and directional selection, over multiple generations. A genetic algorithm is centered on the optimization (minimization or maximization) of a real-valued fitness function $f \colon S \to \mathbb{R}$ defined over the space $S$ of candidate solutions.
Even in its basic formulation, a genetic algorithm leaves a variety of parameters to the user, such as the fitness function, population size, genetic representation, selection function, mutation function and rate, recombination function and rate, and stopping criterion. Furthermore, due to the inherent randomness of the initial population, multiple populations must often be evolved independently to yield an acceptable solution. A genetic algorithm optimization experiment therefore often consists of multiple randomly initialized populations evolving in parallel, increasing the chances of finding an acceptable solution.
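To make these moving parts concrete, the following minimal sketch shows the shape of a generic genetic algorithm loop. The function names, signatures, and parameter defaults here are illustrative choices of ours, not those of any particular library (the appendix uses pyeasyga):

```python
import random

def genetic_algorithm(fitness, random_individual, mutate, recombine,
                      population_size=100, max_generations=1000):
    # Randomly initialize the population of candidate solutions
    population = [random_individual() for _ in range(population_size)]
    for _ in range(max_generations):
        # Directional selection: rank by fitness, keep the top half as parents
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        # Refill the population with mutated recombinants of the parents
        children = []
        while len(children) < population_size:
            mother, father = random.sample(parents, 2)
            children.append(mutate(recombine(mother, father)))
        population = children
    # The stopping criterion here is simply a fixed generation budget
    return max(population, key=fitness)
```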
Unfortunately, genetic algorithms are not reflective of biology in natura. Biological systems generally do not evolve in a controlled, unchanging environment. Furthermore, populations' genomes are not completely random, but rather highly interrelated and conserved. Indeed, this discrepancy has been one of the major criticisms of genetic algorithms: for example, @skienaAlgorithmDesignManual2008 described their use as "pseudobiology" and "voodoo". Additionally, the often-large number of generations required to converge on a solution means that numerous invocations of the fitness function are required; if the fitness function is costly to compute, genetic algorithms can be prohibitively slow.
However, there have been biological experiments which very closely mirror the setup of genetic algorithms, namely the long-term evolution experiment (LTEE) and its offshoots [@lenskiDynamicsAdaptationDiversification1994]. In the LTEE, twelve identical populations of asexual *Escherichia coli* have been allowed to evolve in parallel for tens of thousands of generations in a medium containing limited nutrients to study bacterial adaptation to a constant environment. Each day, 1% of each population is transferred to a flask containing replenished, albeit equally limited, nutrients. Throughout the course of the experiment, viable samples have been frozen at 500-generation intervals, allowing biological "time travel" [@goodDynamicsMolecularEvolution2017]. One population of *E. coli* evolved the ability to metabolize citrate, a nutrient whose nonmetabolizability is a hallmark of the species, thereby dramatically increasing its fitness. While a full overview of the experiment is beyond the scope of this paper, one key insight from the LTEE is that identical starting populations can, without recombination, yield populations with significantly diverging fitnesses. In addition, the fitnesses of the populations have continued improving without bound for 50,000 generations [@wiserLongTermDynamicsAdaptation2013], seeming to follow a power-law relation.
The implications of this insight are twofold with respect to genetic algorithms. First, the ability to yield different results from the same starting population means that repeating the evolution process on the same population should be investigated, in contrast to the standard genetic algorithm approach of using numerous completely randomized initial populations. Second, by caching the fitness of each individual evaluated, the optimization of costly fitness functions can potentially be improved by reducing the amount of duplicated computation.
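As a minimal sketch of the caching idea, assuming genomes are hashable sequences, the `cached_fitness` wrapper below is illustrative and not part of any library used here (its `(individual, data)` signature mirrors the pyeasyga fitness convention used in the appendix):

```python
def cached_fitness(fitness):
    # Memoize fitness by genome: evaluating a previously seen genome
    # costs one hash lookup instead of a full fitness recomputation
    cache = {}
    def wrapper(individual, data=None):
        key = tuple(individual)
        if key not in cache:
            cache[key] = fitness(individual, data)
        return cache[key]
    return wrapper
```

When replays begin from the same starting population, early-generation genomes recur across replays, so the cache absorbs a growing share of the fitness evaluations.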
As an experimental test of the applicability of the LTEE results to evolutionary computation, we used the counting ones problem to examine the behavior of genetic algorithms. The counting ones problem is a model optimization problem in which the goal is to evolve a genome consisting entirely of 1s. The fitness function is formally defined as $f(x) = \sum_{i=1}^{n} x_i$, where $x \in \{0, 1\}^n$ is a genome of length $n$.
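Concretely, as in the appendix code, the fitness of a genome is simply the sum of its bits:

```python
def fitness(individual):
    # Counting ones: the score is the number of 1 genes in the genome
    return sum(individual)

assert fitness([1, 0, 1, 1]) == 3
assert fitness([1] * 100) == 100  # the optimum for a 100-gene genome
```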
Three experimental groups were tested, each of which contained twenty populations of one hundred individuals with one hundred genes ($n = 100$): a control group, in which each of the twenty populations was randomly initialized independently; an LTEE group, in which a single randomly initialized population served as the starting point for all twenty populations; and an LTEE-subpopulations group, in which five randomly initialized populations each served as the starting point for four populations. Recombination was disabled in all groups, and each group was run at mutation rates of 0.01, 0.02, 0.03, and 0.04, with twenty trials per mutation rate. In each trial, every population was evolved until an optimal individual emerged, and the number of generations, unique fitness function invocations, and total fitness function invocations were recorded.
To analyze the results, a one-way ANOVA was carried out to determine whether the groups had the same mean total fitness function invocation count and mean unique fitness function invocation count. Then, a Tukey HSD test [@tukeyComparingIndividualMeans1949] was carried out on the unique invocation counts with a significance threshold of $\alpha = 0.05$ (the statsmodels default).
The experiments were carried out using the pyeasyga genetic algorithm library version 0.3.0 (https://github.com/remiomosowon/pyeasyga) and Python 3.7.3. The data were analyzed using the pandas data analysis library version 0.24.2 (https://github.com/pandas-dev/pandas), the scipy scientific computing library version 1.2.1 (https://www.scipy.org), the numpy scientific computing library version 1.16.2 (https://www.numpy.org), and the statsmodels statistical analysis library version 0.9.0. The full experimental and analytical code is included in the appendix.
The LTEE-based models both performed comparably to the baseline in terms of the mean fitness function invocation count required for convergence to the optimum while requiring significantly fewer unique fitness function invocations on the counting ones problem. Averaged across mutation rates, the control required a mean of 11,359 unique fitness function invocations, compared to 9,255 for the LTEE group (an 18% reduction) and 9,835 for the LTEE-subpopulations group (a 13% reduction). Full results are shown in the table below.
| Group | Mutation rate | Mean fitness function invocation count (std) | Mean unique fitness function invocation count (std) |
|---|---|---|---|
| Control | 0.01 | 1,024,180 (50,925) | 9,850 (244) |
| Control | 0.02 | 577,885 (32,931) | 10,959 (228) |
| Control | 0.03 | 413,755 (14,768) | 11,826 (211) |
| Control | 0.04 | 336,090 (13,158) | 12,799 (272) |
| LTEE | 0.01 | 994,680 (49,232) | 7,680 (314) |
| LTEE | 0.02 | 565,600 (26,572) | 8,855 (330) |
| LTEE | 0.03 | 407,780 (23,970) | 9,763 (367) |
| LTEE | 0.04 | 337,775 (13,086) | 10,722 (363) |
| LTEE-subpopulations | 0.01 | 991,435 (52,354) | 8,210 (269) |
| LTEE-subpopulations | 0.02 | 576,615 (29,386) | 9,385 (228) |
| LTEE-subpopulations | 0.03 | 419,345 (15,192) | 10,426 (215) |
| LTEE-subpopulations | 0.04 | 337,185 (12,625) | 11,319 (299) |
The one-way ANOVA analysis found no significant difference among the groups' mean total fitness function invocation counts, but a significant difference among their mean unique fitness function invocation counts; the subsequent Tukey HSD test attributed this difference to both LTEE-based groups requiring significantly fewer unique invocations than the control.
The results show that the LTEE-based models are able to match the baseline performance in terms of mean number of generations required to converge while requiring significantly fewer unique fitness function invocations. This is an important consideration because, as the cost of calculating the fitness function increases, the importance of reducing the number of unique fitness function invocations grows faster than the importance of the number of generations required for convergence. This is because fetching a cached genome fitness requires far fewer computational resources (an amortized $O(1)$ hash lookup in this implementation) than recomputing the fitness function.
Furthermore, this experiment demonstrates the applicability of long-term bacterial evolution experimental results to genetic algorithm research. One area for future investigation is dynamic population sizes such that, as the fitness of a population increases, so does its size. One potential way to grow the population is to add clones of individuals in proportion to the individuals' fitnesses, as sketched below. In addition, the effects of random selection as compared to tournament selection would also be worth reinvestigating, as random selection would then select fit individuals at a greater rate by virtue of their overabundance in the population.
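A minimal sketch of such a scheme follows; this is untested speculation rather than part of the experiments above, and the `scale` parameter is an arbitrary choice of ours:

```python
def grow_population(population, fitness, scale=0.01):
    # Hypothetical dynamic sizing: append copies of each individual in
    # proportion to its fitness, so fitter populations grow larger
    clones = []
    for individual in population:
        for _ in range(int(scale * fitness(individual))):
            clones.append(list(individual))  # independent copy per clone
    return population + clones
```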
One limitation of this study is that it eliminated recombination, which is often used to speed up convergence. In the long-term bacterial evolution experiments, recombination has been shown to be able to overwhelm selection [@maddamsettiAnalysisBacterialGenomes]. This does not generally appear to be the case in genetic algorithm populations, although further investigation into the effect of recombination is warranted, particularly into whether there are types of fitness function for which recombination degrades the performance of genetic algorithms.
The mutation rate is of particular biological interest since, unlike the fitness function, it has a direct parallel to an actual, measurable biological quantity. There are a variety of open research questions about genetic algorithm mutation rates, such as what the optimal mutation rate for a genetic algorithm optimization problem is [@thierensAdaptiveMutationRate2002]. Furthermore, should this rate vary throughout the course of the experiment? If so, how? The answers to these questions could potentially be elucidated by examining the LTEE, in which some populations became hypermutators [@goodDynamicsMolecularEvolution2017]. One area for experimentation could be to use multiple (and adaptive) mutation rates in different populations when conducting a genetic algorithm optimization experiment.
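One way such an experiment might be set up is sketched below, under the assumption that each population carries its own mutation rate; the drift step and bounds are arbitrary illustrative values, loosely inspired by the hypermutator populations:

```python
import random

def drift_mutation_rate(rate, step=0.005, minimum=0.001, maximum=0.1):
    # Hypothetical hypermutator-style drift: nudge a population's
    # mutation rate up or down each epoch within a plausible range
    return min(max(rate + random.choice([-step, step]), minimum), maximum)

# For example, five populations, each with an independently drifting rate
rates = [0.01] * 5
rates = [drift_mutation_rate(rate) for rate in rates]
```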
As the LTEE progresses and more results accumulate, further inspiration for genetic algorithm development may emerge, allowing continued refinement of this valuable model of computation.
This paper introduces two key findings for evolutionary algorithm research. First, it connects the computational model of evolution used in genetic algorithms to long-term bacterial evolution experiments, which it mirrors more closely than evolution in natural environments. This development provides a clear body of experimental data from which to identify sources of inspiration that may be more readily translated into computational advances. Second, using the long-term bacterial evolution experiment model as a basis for computational experimentation, a novel method for reduction in the required number of unique fitness function invocations was identified and validated using the counting ones problem.
The project's software repository and raw data may be viewed at https://github.com/Benjamin-Lee/oeb278-paper.
Experimental code:
#!/usr/bin/env python
# coding: utf-8

from pyeasyga.pyeasyga import GeneticAlgorithm
import random

genome_size = 100
population_size = 100


def create_individual(data):
    # Each individual is a random binary genome of genome_size genes
    return [random.randint(0, 1) for _ in range(genome_size)]
def basic_population(mutation_probability):
    # Control: twenty independently randomized populations
    results = []
    cached = set()

    def fitness(individual, data):
        # Record each distinct genome evaluated; len(cached) gives the
        # unique fitness function invocation count
        if tuple(individual) not in cached:
            cached.add(tuple(individual))
        return sum(individual)

    for _ in range(20):
        ga = GeneticAlgorithm(create_individual(None),
                              crossover_probability=0,
                              population_size=population_size,
                              mutation_probability=mutation_probability)
        ga.fitness_function = fitness
        ga.create_individual = create_individual
        generations = 1
        ga.create_first_generation()
        # Evolve until the optimal all-ones genome appears
        while ga.best_individual()[0] < genome_size:
            generations += 1
            ga.create_next_generation()
        results.append(generations)
    return results, len(cached), sum(results) * population_size
def ltee_population(mutation_probability):
    # LTEE model: a single randomly initialized population is replayed
    # twenty times, mirroring the LTEE's identical starting populations
    results = []
    cached = set()

    def fitness(individual, data):
        if tuple(individual) not in cached:
            cached.add(tuple(individual))
        return sum(individual)

    ga = GeneticAlgorithm(create_individual(None),
                          crossover_probability=0,
                          population_size=population_size,
                          mutation_probability=mutation_probability)
    ga.fitness_function = fitness
    ga.create_individual = create_individual
    ga.create_first_generation()
    starting_population = ga.current_generation

    for _ in range(20):
        ga = GeneticAlgorithm(create_individual(None),
                              crossover_probability=0,
                              population_size=population_size,
                              mutation_probability=mutation_probability)
        ga.fitness_function = fitness
        ga.create_individual = create_individual
        generations = 1
        # Every replay starts from the same initial generation
        ga.current_generation = starting_population
        while ga.best_individual()[0] < genome_size:
            generations += 1
            ga.create_next_generation()
        results.append(generations)
    return results, len(cached), sum(results) * population_size
def ltee_subpopulations(mutation_probability):
    # Intermediate model: five randomly initialized populations,
    # each replayed four times (twenty populations in total)
    results = []
    cached = set()

    def fitness(individual, data):
        if tuple(individual) not in cached:
            cached.add(tuple(individual))
        return sum(individual)

    for i in range(5):
        ga = GeneticAlgorithm(create_individual(None),
                              crossover_probability=0,
                              population_size=population_size,
                              mutation_probability=mutation_probability)
        ga.fitness_function = fitness
        ga.create_individual = create_individual
        ga.create_first_generation()
        starting_population = ga.current_generation
        for _ in range(4):
            ga = GeneticAlgorithm(create_individual(None),
                                  crossover_probability=0,
                                  population_size=population_size,
                                  mutation_probability=mutation_probability)
            ga.fitness_function = fitness
            ga.create_individual = create_individual
            generations = 1
            ga.current_generation = starting_population
            while ga.best_individual()[0] < genome_size:
                generations += 1
                ga.create_next_generation()
            results.append(generations)
    return results, len(cached), sum(results) * population_size
# Run every method at every mutation rate for twenty trials each
results = []
trials = 20
methods = [basic_population, ltee_population, ltee_subpopulations]
mutation_rates = [0.01, 0.02, 0.03, 0.04]
for mutation_rate in mutation_rates:
    print(f"Mutation rate {mutation_rates.index(mutation_rate) + 1}/{len(mutation_rates)}")
    for i in range(trials):
        print(f"\tTrial {i + 1}/{trials}")
        for method in methods:
            print(f"\t\tMethod {methods.index(method) + 1}/{len(methods)}")
            result = method(mutation_rate)
            results.append(dict(method=method.__name__,
                                mutation_rate=mutation_rate,
                                generations=result[0],
                                unique_individuals=result[1],
                                total_fitness_invocations=result[2]))

import pandas as pd

df = pd.DataFrame.from_dict(results)
df.to_csv("results.csv")
Analysis code:
import ast

import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import MultiComparison

# DataFrame.from_csv is deprecated; read_csv is the supported replacement
df = pd.read_csv("results.csv", index_col=0)

# The generations column round-trips through CSV as a string, so parse
# it back into a list of per-population generation counts
df.generations = df.generations.apply(ast.literal_eval)

# One-way ANOVA on total fitness function invocation counts
print(f_oneway(df.loc[df.method == "basic_population"].total_fitness_invocations,
               df.loc[df.method == "ltee_population"].total_fitness_invocations,
               df.loc[df.method == "ltee_subpopulations"].total_fitness_invocations))

# One-way ANOVA on unique fitness function invocation counts
print(f_oneway(df.loc[df.method == "basic_population"].unique_individuals,
               df.loc[df.method == "ltee_population"].unique_individuals,
               df.loc[df.method == "ltee_subpopulations"].unique_individuals))

# Tukey HSD post hoc test on unique invocation counts (alpha defaults to 0.05)
print(MultiComparison(df.unique_individuals.to_numpy(), df.method.to_numpy()).tukeyhsd())