EQUATION GENERATION

So far we’ve only used genes as data elements that are applied to an external problem. Our next project, equation generation, will introduce us to a new way of using genes called genetic programming. The essence of genetic programming is to overlay some kind of grammar on the genes, which may or may not include embedded data. These operation-genes can then be evaluated to produce a result.

Example

When using genetic programming it is important to understand the characteristics of both the embedded data, if any, and how we expect the operations to interact with each other and/or the environment. For example, let’s say we are asked to find some combination of the numbers 1 to 7 and addition and subtraction operations that, when evaluated, produces the numeric value 29. There are many ways to solve this problem. For example: 7+7+7+7+7-6

We could simply use positive and negative integers between -7 and 7 because they essentially have the addition and subtraction operations built in, but we need to learn how to evaluate operations as independent genes, so we’ll make '+' and '-' separate tokens. That means the equation has 11 tokens or genes:

 7  +  7  +  7  +  7  +  7  -  6
 |  |  |  |  |  |  |  |  |  |  |
 1  2  3  4  5  6  7  8  9 10 11

When viewed this way we can easily see the pattern, or grammar, of alternating operations and numbers. This should be easy to enforce using an odd/even check when creating and mutating chromosomes.

Next, if we visualize of the equation from the point of view of the operations:

ch14.equation.generation.1.epub

It is easy to see that each operation has 2 parameters because each acts upon 2 other genes which can be either a number or another operation.

Evaluate

The heart of a genetic programming algorithm is the evaluate function. It knows how to treat the problem-specific gene sequence like a program. The evaluate function for this project will apply the operation genes to the neighboring numbers, rolling up to get the final result.

def evaluate(genes):
    result = genes[0]
    for i in range(1, len(genes), 2):
        operation = genes[i]
        nextValue = genes[i + 1]
        if operation == '+':
            result += nextValue
        elif operation == '-':
            result -= nextValue
    return result

The function starts by initializing result with the value of the first numeric gene. It then applies the operations, left to right, to the current result and the value-gene that follows the operation. The output of each operation is stored back into result.

Test and genes

We’re ready to write the rest of the genetic algorithm according to the usual pattern, starting with the gene set.

equationGenerationTests.py
import unittest
import datetime
import genetic
import random

class EquationGenerationTests(unittest.TestCase):
    def test(self):
        numbers = [1, 2, 3, 4, 5, 6, 7]
        operations = ['+', '-']

Create

Now we can immediately see that we’ll need a custom create function, for two reasons. First, we don’t know how many symbols we’ll need to produce a particular result, and second, we need to alternate numbers and operations.

def create(numbers, operations, minNumbers, maxNumbers):
    genes = [random.choice(numbers)]
    count = random.randint(minNumbers, 1 + maxNumbers)
    while count > 1:
        count -= 1
        genes.append(random.choice(operations))
        genes.append(random.choice(numbers))
    return genes

This implementation prevents the creation of gene sequences that have multiple numbers or operations in a row, so we don’t have to detect that situation in the fitness function.

Mutate

We also need a custom mutate function that knows how to append an operation-number pair to the gene sequence,

def mutate(genes, numbers, operations, minNumbers, maxNumbers):
    numberCount = (1 + len(genes)) / 2
    adding = numberCount < maxNumbers and random.randint(0, 100) == 0
    if adding:
        genes.append(random.choice(operations))
        genes.append(random.choice(numbers))
        return

remove an operation-number pair,

    removing = numberCount > minNumbers and random.randint(0, 20) == 0
    if removing:
        index = random.randrange(0, len(genes) - 1)
        del genes[index]
        del genes[index]
        return

and mutate an operation or number.

    index = random.randrange(0, len(genes))
    genes[index] = random.choice(operations) \
        if (index & 1) == 1 else random.choice(numbers)

Here we use the parity of the index to determine whether we’re replacing a number or an operation.

Display

In display we can simply write the genes separated by a space to see the equation.

def display(candidate, startTime):
    timeDiff = datetime.datetime.now() - startTime
    print("{}\t{}\t{}".format(
        (' '.join(map(str, [i for i in candidate.Genes]))),
        candidate.Fitness,
        timeDiff))
sample output
6 + 3 + 3 - 4 + 6 - 3 - 6 - 6 - 3 - 1	-5	0:00:00.001003

Fitness

That just leaves the fitness function where we’ll use the evaluate function to get the result of the equation, then compare it to the expected result. This time, instead of using a Fitness class we’ll use a 2-stage fitness function. The first stage tracks how far the result of evaluating the genes is from the expected total. If the result is correct then the second fitness range is used. This allows us to prefer shorter gene sequences, meaning equations with fewer operations.

def get_fitness(genes, expectedTotal):
    result = evaluate(genes)

    if result != expectedTotal:
        fitness = expectedTotal - abs(result - expectedTotal)
    else:
        fitness = 1000 - len(genes)

    return fitness

Test

The following full test harness should hold no surprises.

    def test(self):
        numbers = [1, 2, 3, 4, 5, 6, 7]
        operations = ['+', '-']
        expectedTotal = 29
        optimalLengthSolution = [7, '+', 7, '+', 7, '+', 7, '+', 7, '-', 6]
        minNumbers = (1 + len(optimalLengthSolution)) / 2
        maxNumbers = 6 * minNumbers
        startTime = datetime.datetime.now()

        def fnDisplay(candidate):
            display(candidate, startTime)

        def fnGetFitness(genes):
            return get_fitness(genes, expectedTotal)

        def fnCreate():
            return create(numbers, operations, minNumbers, maxNumbers)

        def fnMutate(child):
            mutate(child, numbers, operations, minNumbers, maxNumbers)

        optimalFitness = fnGetFitness(optimalLengthSolution)
        best = genetic.get_best(fnGetFitness, None, optimalFitness, None,
                                fnDisplay, fnMutate, fnCreate, maxAge=50)
        self.assertTrue(not optimalFitness > best.Fitness)

Run

When we run the test it finds an equivalent solution.

sample output
5 + 3 + 2 + 4 + 6 + 2 + 1	23	0:00:00
5 + 3 + 2 + 4 + 6 + 6 + 1	27	0:00:00
5 + 3 + 3 + 4 + 6 + 6 + 1	28	0:00:00
6 + 3 + 3 + 4 + 6 + 6 + 1	987	0:00:00
6 + 5 + 3 + 3 + 6 + 6	989	0:00:00.000998

Nice. But having to add 6 three times and 3 two times is boring. If we introduce a multiplication operation it might find 7 * 3 + 4 * 2, which saves an operation, or even 7 * 4 + 1, which saves two. Let’s try it.

Support multiplication

First we have to add the multiplication token (*) to the list of operation genes.

    def test(self):
...
        operations = ['+', '-', '*']

Next we have to implement multiplication in evaluate. The problem is that for the math to be correct we have to perform all multiplication operations before we perform any addition or subtraction. No problem, we can just group the operations by priority.

    def test(self):
...
        operations = ['+', '-', '*']
        prioritizedOperations = [['*'], ['+', '-']]
...
        def fnEvaluate(genes):
            return evaluate(genes, prioritizedOperations)

Then we’ll evaluate all operations in priority order rather than simultaneously, which means we’ll iterate over the array once for each priority group.

def evaluate(genes, prioritizedOperations):
    equation = genes[:]
    for operationSet in prioritizedOperations:
        iOffset = 0
        for i in range(1, len(equation), 2):
            i += iOffset
            opToken = equation[i]
            if opToken in operationSet:
                leftOperand = equation[i - 1]
                rightOperand = equation[i + 1]

                if opToken == '+':
                    leftOperand += rightOperand
                elif opToken == '-':
                    leftOperand -= rightOperand
                elif opToken == '*':
                    leftOperand *= rightOperand
                equation[i - 1] = leftOperand
                del equation[i + 1]
                del equation[i]
                iOffset -= 2
    return equation[0]

In the implementation above we make a copy of the genes so we can modify it each time through the loop. Each operation stores its output in the array location to its left then deletes itself and its second operand from the array. After all operations have been evaluated the result is in array index zero.

Note, since we changed the function to pass in the list of operations, we also need to change what we pass to get_fitness

def get_fitness(genes, expectedTotal, fnEvaluate):
    result = fnEvaluate(genes)
...

and update the function call in the test harness.

    def test(self):
...
        def fnGetFitness(genes):
            return get_fitness(genes, expectedTotal, fnEvaluate)

Run

This code finds a solution, but not a minimal length solution, because the defined optimalLengthSolution in the test is based on addition and subtraction so the engine stops as soon as it beats that solution.

...
4 - 3 * 3 - 7 * 2 + 7 + 6 + 5 * 7	983	0:00:00.103305
5 + 2 - 7 * 2 + 6 + 5 + 5 * 5	985	0:00:00.115337
5 - 4 - 3 + 6 + 5 + 4 * 5	987	0:00:00.124361
5 - 4 - 3 + 6 + 5 * 5	989	0:00:00.128372

Extract solve

To fix that we need to extract a solve function so we can pass in the list of operations with the optimal solution for that set of operations.

    def test_addition(self):
        operations = ['+', '-']
        prioritizedOperations = [['+', '-']]
        optimalLengthSolution = [7, '+', 7, '+', 7, '+', 7, '+', 7, '-', 6]
        self.solve(operations, prioritizedOperations, optimalLengthSolution)

    def solve(self, operations, prioritizedOperations,
              optimalLengthSolution):
        numbers = [1, 2, 3, 4, 5, 6, 7]
        expectedTotal = evaluate(optimalLengthSolution,
                                 prioritizedOperations)

Test multiplication

Now we can add a test for multiplication, and give it a more difficult equation to generate.

    def test_multiplication(self):
        operations = ['+', '-', '*']
        prioritizedOperations = [['*'], ['+', '-']]
        optimalLengthSolution = [6, '*', 3, '*', 3, '*', 6, '-', 7]
        self.solve(operations, prioritizedOperations, optimalLengthSolution)

Run 2

When we run that test it too can find an optimal length solution:

sample output
....
7 * 5 * 6 - 3 + 7 * 2 * 7 + 4 * 3	983	0:00:00.109255
7 * 7 * 7 - 4 * 4 - 6 - 5 + 1	985	0:00:00.141350
7 * 7 * 7 - 7 * 2 - 3 * 4	987	0:00:00.150363
5 * 7 * 7 + 6 * 2 * 6	989	0:00:00.158386
5 * 4 * 4 * 4 - 3	991	0:00:02.141662

Cool. When we review that solution, however, it too seems repetitive. It would be much nicer if we could reduce the equation to 5 * 43 - 3. Let’s do that.

Refactoring

But first, have you noticed that each time we add a new operation we have to change evaluate? That’s an indication of a poor design that violates the Open-Closed Principal We can fix that by extracting the operation implementations to separate functions:

def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def multiply(a, b):
    return a * b

Then make the prioritized operation list contain dictionaries where each dictionary key is the token for an operation and the value is a function that implements the operation.

    def test_addition(self):
        operations = ['+', '-']
        prioritizedOperations = [{'+': add,
                                  '-': subtract}]
...
    def test_multiplication(self):
        operations = ['+', '-', '*']
        prioritizedOperations = [{'*': multiply},
                                 {'+': add,
                                  '-': subtract}]
...

Finally, use the operation function in evaluate.

def evaluate(genes, prioritizedOperations):
    equation = genes[:]
    for operationSet in prioritizedOperations:
        iOffset = 0
        for i in range(1, len(equation), 2):
            i += iOffset
            opToken = equation[i]
            if opToken in operationSet:
                leftOperand = equation[i - 1]
                rightOperand = equation[i + 1]
                equation[i - 1] = operationSet[opToken](leftOperand,
                                                        rightOperand)
                del equation[i + 1]
                del equation[i]
                iOffset += -2
    return equation[0]

Supporting Exponents

Now we can easily add support for a new operation, like exponentiation, without changing any other code:

    def test_exponent(self):
        operations = ['^', '+', '-', '*']
        prioritizedOperations = [{'^': lambda a, b: a ** b},
                                 {'*': multiply},
                                 {'+': add,
                                  '-': subtract}]
        optimalLengthSolution = [6, '^', 3, '*', 2, '-', 5]
        self.solve(operations, prioritizedOperations, optimalLengthSolution)
sample output
...
4 * 3 ^ 4 - 3 - 1 - 7 + 4 * 2 * 7 * 2 + 2	979	0:00:00.030079
4 * 3 ^ 4 - 2 - 7 + 4 * 2 * 7 * 2	983	0:00:00.035093
6 * 7 * 5 * 2 + 5 + 3 + 3 - 4	985	0:00:00.070215
6 * 6 * 6 * 2 + 1 - 6	989	0:00:00.083221
7 * 5 * 6 * 2 + 7	991	0:00:00.108286
6 ^ 3 * 2 - 5	993	0:00:13.129462

Improve performance

Since it is starting to take tens of seconds to find the solution, let’s add a loop to mutate so that it can make a random number of changes instead of just one.

def mutate(genes, numbers, operations, minNumbers, maxNumbers, fnGetFitness):
    count = random.randint(1, 10)
    initialFitness = fnGetFitness(genes)
    while count > 0:
        count -= 1
        if fnGetFitness(genes) > initialFitness:
            return 
        numberCount = (1 + len(genes)) / 2
...
...
            genes.append(random.choice(numbers))
            continue 

        removing = numberCount > minNumbers and random.randint(0, 20) == 0
...
...
            del genes[index]
            continue 

        index = random.randrange(0, len(genes))
...

Now that we’re checking the fitness in mutate, we also need to pass in fnGetFitness:

    def solve(self, operations, prioritizedOperations,
              optimalLengthSolution):
...
        def fnMutate(child):
            mutate(child, numbers, operations, minNumbers, maxNumbers,
                   fnGetFitness)
...

Run

As expected, it finds the solution much faster.

sample output
2 + 3 ^ 4 * 5 + 3 * 6	425	0:00:00.014006
3 + 3 ^ 4 * 5 + 3 * 6	426	0:00:00.016041
4 + 3 ^ 4 * 5 + 3 * 6	989	0:00:00.024067
6 * 2 * 5 * 7 + 7	991	0:00:00.067148
2 * 6 ^ 3 - 5	993	0:00:00.567508

Fantastic!

Benchmarks

We’ll benchmark this project with the exponent test.

    def test_benchmark(self):
        genetic.Benchmark.run(self.test_exponent)
               Benchmark
average (seconds)  |  standard deviation
----------------------------------------
      0.49         |         0.36

Summary

This chapter was our first introduction to genetic programming and it opened up several really interesting problems for exploration. We also learned one way to work with both data and operations in a gene sequence. This gave us experience with symbols and grammar and "program" evaluation.

Final Code

The final code for this chapter is available from:

https://github.com/handcraftsman/GeneticAlgorithmsWithPython/tree/master/ch14