HELLO WORLD!

Guess my number

Let’s begin by learning a little bit about genetic algorithms. Reach way back in your memories to a game we played as kids. It is a simple game for two people where one picks a secret number between 1 and 10 and the other has to guess that number.

Is it 2?  No
Is it 3?  No
Is it 7?  No
Is it 1?  Yes

That works reasonably well for 1..10 but quickly becomes frustrating or boring as we increase the range to 1..100 or 1..1000. Why? Because we have no way to improve our guesses. There’s no challenge. The guess is either right or wrong, so it quickly becomes a mechanical process.

Is it 1?  No
Is it 2?  No
Is it 3?  No
Is it 4?  No
Is it 5?  No
...

So, to make it more interesting, instead of no let’s say higher or lower.

1?  Higher
7?  Lower
6?  Lower
5?  Lower
4?  Correct

That might be reasonably interesting for a while for 1..10 but soon you’ll increase the range to 1..100. Because people are competitive, the next revision is to see who is a better guesser by trying to find the number in the fewest guesses. At this point the person who evolves the most efficient guessing strategy wins.

However, one thing we automatically do when playing the game is make use of domain knowledge. For example, after this sequence:

1?  Higher
7?  Lower

Why wouldn’t we guess 8, 9, or 10? The reason is, of course, because we know that those numbers are not 'lower' than 7. Why wouldn’t we guess 1? Because we already tried it. We use our memory of what we’ve tried, our successes and failures, and our knowledge of the domain, number relationships, to improve our guesses.

A genetic algorithm does not know what lower means. It has no intelligence. It does not learn. It will make the same mistakes every time. It will only be as good at solving a problem as the person who writes the code. And yet, it can be used to find solutions to problems that humans would struggle to solve or could not solve at all. How is that possible?

Genetic algorithms use random exploration of the problem space combined with evolutionary processes like mutation and crossover (exchange of genetic information) to improve guesses. But also, because they have no experience in the problem domain, they try things a human would never think to try. Thus, a person using a genetic algorithm may learn more about the problem space and potential solutions. This gives them the ability to make improvements to the algorithm, in a virtuous cycle.

What can we learn from this?

Guess the Password

Now let’s see how this applies to guessing a password. Start with a randomly generated initial sequence of letters, then mutate/change one random letter at a time until the sequence of letters is "Hello World!". Conceptually:

pseudo code
_letters = [a..zA..Z !]
target = "Hello World!"
guess = get 12 random letters from _letters
while guess != target:
	index = get random value from [0..length of target]
	guess[index] = get 1 random value from _letters

If you try this in your favorite programming language you’ll find that it performs worse than playing the number guessing game with only yes and no answers because it cannot tell when one guess is better than another.

One solution is to help it make an informed guess by telling it how many of the letters from the guess are in the correct locations. For example "World!Hello?" would get 2 because only the 4th letter of each word is correct. The 2 indicates how close the answer is to correct. This is called the fitness value. "hello world?" would get a fitness value of 9 because 9 letters are correct. Only the h, w, and question mark are wrong.

First Program

It is time to start writing Python. By the way, if you do not already have a favorite Python development environment, I highly recommend JetBrains' PyCharm IDE.

Genes

To begin with, the genetic algorithm needs a gene set to use for building guesses. For this project that will be a generic set of letters. It also needs a target password to guess:

guessPassword.py
geneSet = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!."
target = "Hello World!"

Generate a guess

Next the algorithm needs a way to generate a random string from the gene set.

import random

def generate_parent(length):
    genes = []
    while len(genes) < length:
        sampleSize = min(length - len(genes), len(geneSet))
        genes.extend(random.sample(geneSet, sampleSize))
    return ''.join(genes)

Fitness

The fitness value the genetic algorithm provides is the only feedback the engine gets to guide it toward a solution. In this project the fitness value is the total number of letters in the guess that match the letter in the same position of the password.

def get_fitness(guess):
    return sum(1 for expected, actual in zip(target, guess)
               if expected == actual)

Mutation

Next, the engine needs a way to produce a new guess by mutating the current one. The following implementation converts the parent string to an array with list(parent), then replaces 1 letter in the array with a randomly selected one from geneSet, and finally recombines the result into a string with ''.join(childGenes).

def mutate(parent):
    index = random.randrange(0, len(parent))
    childGenes = list(parent)
    newGene, alternate = random.sample(geneSet, 2)
    childGenes[index] = alternate if newGene == childGenes[index] else newGene
    return ''.join(childGenes)

This implementation uses an alternate replacement if the randomly selected newGene is the same as the one it is supposed to replace, which can prevent a significant number of wasted guesses.

Display

Next, it is important to monitor what is happening so that the engine can be stopped if it gets stuck. Having a visual representation of the gene sequence, which may not be the literal gene sequence, is often critical to identifying what works and what does not so that the algorithm can be improved.

Normally the display function also outputs the fitness value and how much time has elapsed.

import datetime
...
def display(guess):
    timeDiff = datetime.datetime.now() - startTime
    fitness = get_fitness(guess)
    print("{}\t{}\t{}".format(guess, fitness, timeDiff))

Main

The main program begins by initializing bestParent to a random sequence of letters and calling the display function.

random.seed()
startTime = datetime.datetime.now()
bestParent = generate_parent(len(target))
bestFitness = get_fitness(bestParent)
display(bestParent)

The final piece is the heart of the genetic engine. It is a loop that:

  • generates a guess,
  • requests the fitness for that guess, then
  • compares the fitness to that of the previous best guess, and
  • keeps the guess with the better fitness.

This cycle repeats until a stop condition occurs, in this case when all the letters in the guess match those in the target.

while True:
    child = mutate(bestParent)
    childFitness = get_fitness(child)
    if bestFitness >= childFitness:
        continue
    display(child)
    if childFitness >= len(bestParent):
        break
    bestFitness = childFitness
    bestParent = child

Run the code and you’ll see output similar to the following. Success!

ftljCDPvhasn	1	0:00:00
ftljC Pvhasn	2	0:00:00
ftljC Pohasn	3	0:00:00.001000
HtljC Pohasn	4	0:00:00.002000
HtljC Wohasn	5	0:00:00.004000
Htljo Wohasn	6	0:00:00.005000
Htljo Wohas!	7	0:00:00.008000
Htljo Wohls!	8	0:00:00.010000
Heljo Wohls!	9	0:00:00.013000
Hello Wohls!	10	0:00:00.013000
Hello Wohld!	11	0:00:00.013000
Hello World!	12	0:00:00.015000

Extract a reusable engine

We have a working engine but it is currently tightly coupled to the Password project, so the next task is to extract the genetic engine code from that specific to guessing the password so it can be reused for other projects. Start by creating a new file named genetic.py.

Next move the mutate and generate_parent functions to the new file and rename them to _mutate and _generate_parent. This is how protected functions are named in Python. Protected functions are only accessible to other functions in the same module.

Generation and Mutation

Future projects will need to be able to customize the gene set, so that needs to become a parameter to _generate_parent and _mutate.

import random

def _generate_parent(length, geneSet):
    genes = []
    while len(genes) < length:
        sampleSize = min(length - len(genes), len(geneSet))
        genes.extend(random.sample(geneSet, sampleSize))
    return ''.join(genes)
def _mutate(parent, geneSet):
    index = random.randrange(0, len(parent))
    childGenes = list(parent)
    newGene, alternate = random.sample(geneSet, 2)
    childGenes[index] = alternate if newGene == childGenes[index] else newGene
    return ''.join(childGenes)

get_best

The next step is to move the main loop into a new public function named get_best in the genetic module. Its parameters are:

  • the function it calls to request the fitness for a guess,
  • the number of genes to use when creating a new gene sequence,
  • the optimal fitness value,
  • the set of genes to use for creating and mutating gene sequences, and
  • the function it should call to display, or report, each improvement found.
def get_best(get_fitness, targetLen, optimalFitness, geneSet, display):
    random.seed()
    bestParent = _generate_parent(targetLen, geneSet)
    bestFitness = get_fitness(bestParent)
    display(bestParent)
    if bestFitness >= optimalFitness:
        return bestParent

    while True:
        child = _mutate(bestParent, geneSet)
        childFitness = get_fitness(child)

        if bestFitness >= childFitness:
            continue
        display(child)
        if childFitness >= optimalFitness:
            return child
        bestFitness = childFitness
        bestParent = child

Notice that display and get_fitness are called with only one parameter - the child gene sequence. This is because a generic engine does not need access to the target value, and it does not care about how much time has passed, so those are not passed to it.

The result is a reusable module named genetic that can be used in other programs via import genetic.

Use the genetic module

The code remaining in guessPassword.py is specific to the password guessing project. To get it working again first import the genetic module.

guessPassword.py
import datetime
import genetic

Next, helper functions that take only one parameter must be defined so they are compatible with what the genetic engine expects. Each helper function will take the candidate gene sequence it receives and call the local functions with additional required parameters as necessary.

def test_Hello_World():
    target = "Hello World!"
    guess_password(target)


def guess_password(target):
    geneset = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!."
    startTime = datetime.datetime.now()

    def fnGetFitness(genes): 
        return get_fitness(genes, target)

    def fnDisplay(genes): 
        display(genes, target, startTime)

    optimalFitness = len(target)
    genetic.get_best(fnGetFitness, len(target), optimalFitness, geneset,
                     fnDisplay)

Display

Now change display to take the target password as a parameter. It could remain a global variable in the algorithm file but this change facilitates trying different passwords without side effects.

def display(genes, target, startTime):
    timeDiff = datetime.datetime.now() - startTime
    fitness = get_fitness(genes, target)
    print("{}\t{}\t{}".format(genes, fitness, timeDiff))

Fitness

The fitness function also needs to receive the target password as a parameter.

def get_fitness(genes, target):
    return sum(1 for expected, actual in zip(target, genes)
               if expected == actual)

Main

There are many ways to structure the main code, the most flexible is as a unit test. To start that transition first rename guessPassword.py to guessPasswordTests.py. Next, to make it possible to execute the code from the command line add:

guessPasswordTests.py
if __name__ == '__main__':
    test_Hello_World()

If you are following along in an editor be sure to run your code to verify it works at this point.

Use Python’s unittest framework

The next step is to make the code work with Python’s built in test framework.

import unittest

To do that the main test function must be moved into a class that inherits from unittest.TestCase. The other functions can be moved into the class as well if you want, but if they are then self must be added as the first parameter to each because they will then belong to the test class.

class GuessPasswordTests(unittest.TestCase):
    geneset = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,"

    def test_Hello_World(self):
        target = "Hello World!"
        self.guess_password(target)

    def guess_password(self, target):
...
        optimalFitness = len(target)
        best = genetic.get_best(fnGetFitness, len(target), optimalFitness,
                                self.geneset, fnDisplay)
        self.assertEqual(best, target)

When the unittest module’s main function is called, it automatically executes each function whose name starts with "test".

if __name__ == '__main__':
    unittest.main()

This allows the test to be run from the command line and, incidentally, without its display function output.

python -m unittest -b guessPasswordTests
.
----------------------------------------
Ran 1 test in 0.020s

OK

A longer password

"Hello World!" doesn’t sufficiently demonstrate the power of the genetic engine so try a longer password:

def test_For_I_am_fearfully_and_wonderfully_made(self):
    target = "For I am fearfully and wonderfully made."
    self.guess_password(target)

Run

...
ForMI am feabaully and wWndNyfulll made.	33	0:00:00.047094
For I am feabaully and wWndNyfulll made.	34	0:00:00.047094
For I am feabfully and wWndNyfulll made.	35	0:00:00.053111
For I am feabfully and wondNyfulll made.	36	0:00:00.064140
For I am feabfully and wondNyfully made.	37	0:00:00.067148
For I am feabfully and wondeyfully made.	38	0:00:00.095228
For I am feabfully and wonderfully made.	39	0:00:00.100236
For I am fearfully and wonderfully made.	40	0:00:00.195524

Nice!

Introduce a Chromosome class

The next change is to introduce a Chromosome object that has Genes and Fitness attributes. This will make the genetic engine more flexible by making it possible to pass those values around as a unit.

genetic.py
class Chromosome:
    def __init__(self, genes, fitness):
        self.Genes = genes
        self.Fitness = fitness
def _mutate(parent, geneSet, get_fitness):
    index = random.randrange(0, len(parent.Genes))
    childGenes = list(parent.Genes)
...
    genes = ''.join(childGenes)
    fitness = get_fitness(genes)
    return Chromosome(genes, fitness)
def _generate_parent(length, geneSet, get_fitness):
...
    genes = ''.join(genes)
    fitness = get_fitness(genes)
    return Chromosome(genes, fitness)
def get_best(get_fitness, targetLen, optimalFitness, geneSet, display):
    random.seed()
    bestParent = _generate_parent(targetLen, geneSet, get_fitness)
    display(bestParent)
    if bestParent.Fitness >= optimalFitness:
        return bestParent

    while True:
        child = _mutate(bestParent, geneSet, get_fitness)

        if bestParent.Fitness >= child.Fitness:
            continue
        display(child)
        if child.Fitness >= optimalFitness:
            return child
        bestParent = child

It does require compensating changes to the algorithm file functions but those changes remove some double work.

guessPasswordTests.py
def display(candidate, startTime):
    timeDiff = datetime.datetime.now() - startTime
    print("{}\t{}\t{}".format(
        candidate.Genes, candidate.Fitness, timeDiff))
class GuessPasswordTests(unittest.TestCase):
...
    def guess_password(self, target):
...
        def fnDisplay(candidate):
            display(candidate, startTime) 

        optimalFitness = len(target)
        best = genetic.get_best(fnGetFitness, len(target), optimalFitness,
                                self.geneset, fnDisplay)
        self.assertEqual(best.Genes, target) 

Benchmarking

The next improvement is to add support for benchmarking to genetic because it is useful to know how long the engine takes to find a solution on average and the standard deviation. That can be done with another class as follows:

genetic.py
class Benchmark:
    @staticmethod
    def run(function):
        timings = []
        for i in range(100):
            startTime = time.time()
            function()
            seconds = time.time() - startTime
            timings.append(seconds)
            mean = statistics.mean(timings)
            print("{} {:3.2f} {:3.2f}".format(
                1 + i, mean,
                statistics.stdev(timings, mean) if i > 1 else 0))

That requires the following imports:

genetic.py
import statistics
import time

Now, to use the benchmarking capability simply add a test and pass the function to be benchmarked.

guessPasswordTests.py
def test_benchmark(self):
    genetic.Benchmark.run(self.test_For_I_am_fearfully_and_wonderfully_made)

When run, this function works great but is a bit chatty because it also shows the display output for all 100 runs. That can be fixed in the run function by temporarily redirecting standard output to nowhere while running the function being benchmarked.

genetic.py
import sys
...
class Benchmark:
    @staticmethod
    def run(function):
...
        timings = []
        stdout = sys.stdout 
        for i in range(100):
            sys.stdout = None 
            startTime = time.time()
            function()
            seconds = time.time() - startTime
            sys.stdout = stdout 
            timings.append(seconds)
...

The output can also be improved by only displaying statistics for the first ten runs and then every 10th run after that.

genetic.py
...
            timings.append(seconds)
            mean = statistics.mean(timings)
            if i < 10 or i % 10 == 9:
                print("{} {:3.2f} {:3.2f}".format(
                    1 + i, mean,
                    statistics.stdev(timings, mean) if i > 1 else 0))

Now the benchmark test output looks like the following.

sample output
1 0.19 0.00
2 0.17 0.00
3 0.18 0.02
...
9 0.17 0.03
10 0.17 0.03
20 0.18 0.04
...
90 0.16 0.05
100 0.16 0.05

This means that, averaging 100 runs, it takes .16 seconds to guess the password, and 68 percent of the time (one standard deviation) it takes between .11 (.16 - .05) and .21 (.16 + .05) seconds. Unfortunately that is probably too fast to determine if a change is due to a code improvement or due to something else running on the computer at the same time. That problem can be solved by making the genetic algorithm guess a random sequence that takes 1-2 seconds to run. Your processor likely is different from mine so adjust the length as necessary.

guessPasswordTests.py
import random
...
    def test_Random(self):
        length = 150
        target = ''.join(random.choice(self.geneset) for _ in range(length))
        self.guess_password(target)

    def test_benchmark(self):
        genetic.Benchmark.run(self.test_Random)

On my system that results in:

               Benchmark
average (seconds)  |  standard deviation
----------------------------------------
      1.46         |         0.35

Summary

In this chapter we built a simple genetic engine that makes use of random mutation to produce better results. This engine was able to guess a secret password given only its length, a set of characters that might be in the password, and a fitness function that returns a count of the number characters in the guess that match the secret. This is a good benchmark project for the engine because as the target string gets longer the engine wastes more and more guesses trying to change positions that are already correct. As the engine evolves in later projects, we’ll try to keep this benchmark fast. Also, as you work your way through this book you’ll learn ways to improve the performance of the code in this project.

Final Code

The final code for this chapter is available from:

https://github.com/handcraftsman/GeneticAlgorithmsWithPython/tree/master/ch01