REGULAR EXPRESSIONS

The next project is to build a genetic algorithm that can evolve a regular expression (regex) that matches all items in a set of wanted strings without matching any items in a set of unwanted strings.

Test

As you probably know, meta characters in regular expressions have special meanings, for example * means repeat zero or more times, + means repeat at least once, etc. The base gene set for the genetic algorithm will include four of these, which will be defined in global variables within the genetic algorithm file.

regexTests.py
repeatMetas = {'?', '*', '+'}
startMetas = {'|'}
allMetas = repeatMetas | startMetas

Next, the strings the generated regex should match, and those it should not match, are defined in a test. The genetic algorithm is also given a specific regex length to achieve. This keeps it from simply using the | meta character to concatenate all the wanted strings.

import unittest

class RegexTests(unittest.TestCase):
    def test_two_digits(self):
        wanted = {"01", "11", "10"}
        unwanted = {"00", ""}
        self.find_regex(wanted, unwanted, 7)

Fitness

The fitness function first compiles the regex to see if it is valid or not. If not it returns a comparably low fitness value.

import re
...
def get_fitness(genes, wanted, unwanted):
    pattern = ''.join(genes)
    length = len(pattern)

    try:
        re.compile(pattern)
    except re.error:
        return Fitness(0, len(wanted), len(unwanted), length)

Otherwise, it determines the fitness of the generated regex by counting the number of wanted and unwanted strings the regex matches exactly.

    numWantedMatched = sum(1 for i in wanted if re.fullmatch(pattern, i))
    numUnwantedMatched = sum(1 for i in unwanted if re.fullmatch(pattern, i))
    return Fitness(numWantedMatched, len(wanted), numUnwantedMatched, length)

As usual the fitness function has a related helper function in the test harness:

    def find_regex(self, wanted, unwanted, expectedLength):
        def fnGetFitness(genes):
            return get_fitness(genes, wanted, unwanted)

This genetic algorithm uses a Fitness object because there are multiple objectives. They are:

  • to maximize the number of wanted strings that match,
  • to minimize the number of unwanted strings that match, and
  • to minimize the length of the regex
class Fitness:
    def __init__(self, numWantedMatched, totalWanted, numUnwantedMatched,
                 length):
        self.NumWantedMatched = numWantedMatched
        self._totalWanted = totalWanted
        self.NumUnwantedMatched = numUnwantedMatched
        self.Length = length

The comparison function first combines the number of wanted strings that were not matched with the number of unwanted strings that were matched. When that value differs, the algorithm should keep the chromosome with the smallest total. That achieves the first two objectives while allowing the matched wanted and unwanted counts to vary.

    def __gt__(self, other):
        combined = (self._totalWanted - self.NumWantedMatched) \
                   + self.NumUnwantedMatched
        otherCombined = (other._totalWanted - other.NumWantedMatched) \
                        + other.NumUnwantedMatched
        if combined != otherCombined:
            return combined < otherCombined

When the regex fails to match one or more wanted strings, or matches one or more unwanted strings, the algorithm should keep the newer one. This should prevent the algorithm from hanging on a particularly bad chromosome.

        success = combined == 0
        otherSuccess = otherCombined == 0
        if success != otherSuccess:
            return success
        if not success:
            return False

Otherwise the shortest regex is chosen.

        return self.Length < other.Length

The output of the str function makes the values easy to read when displayed.

    def __str__(self):
        return "matches: {} wanted, {} unwanted, len {}".format(
            "all" if self._totalWanted == self.NumWantedMatched else self.NumWantedMatched,
            self.NumUnwantedMatched,
            self.Length)

Display

import datetime
...
def display(candidate, startTime):
    timeDiff = datetime.datetime.now() - startTime
    print("{}\t{}\t{}".format(
        ''.join(candidate.Genes), candidate.Fitness, timeDiff))
expected output
01|11?*+	matches 0 wanted 2 unwanted, len 8	0:00:00

Next is the helper function for the test harness.

    def find_regex(self, wanted, unwanted, expectedLength):
        startTime = datetime.datetime.now()

        def fnDisplay(candidate):
            display(candidate, startTime)
...

Mutation

Since the genetic algorithm needs to be able to change the length of the regex, a custom mutation function is used. The mutation function has a loop that breaks early if it produces a gene sequence with an improved fitness.

import random
...
def mutate(genes, fnGetFitness, mutationOperators, mutationRoundCounts):
    initialFitness = fnGetFitness(genes)

The number of times the loop executes is adaptive based on previously successful loop counts.

    count = random.choice(mutationRoundCounts)
    for i in range(1, count + 2):

The mutation function receives an array of operators that can be used to modify the gene sequence. Each time through the loop it makes a temporary copy of the operators then picks one and uses it to try to modify the gene sequence. The operator returns a Boolean value indicating whether it was able to make a change or not. If not, that operator is removed from the temporary copy so it will not be tried again in that round.

        copy = mutationOperators[:]
        func = random.choice(copy)
        while not func(genes):
            copy.remove(func)
            func = random.choice(copy)

When an operator does make a change, the fitness of the changed genes is compared with the initial fitness. If the fitness improves then the adaptive loop count is updated and the loop halts.

        if fnGetFitness(genes) > initialFitness:
            mutationRoundCounts.append(i)
            return

Mutation Operators

The default mutation operators are: add, remove, replace, swap and move.

def mutate_add(genes, geneset):
    index = random.randrange(0, len(genes) + 1) if len(genes) > 0 else 0
    genes[index:index] = [random.choice(geneset)]
    return True
def mutate_remove(genes):
    if len(genes) < 1:
        return False
    del genes[random.randrange(0, len(genes))]
    if len(genes) > 1 and random.randint(0, 1) == 1:
        del genes[random.randrange(0, len(genes))]
    return True
def mutate_replace(genes, geneset):
    if len(genes) < 1:
        return False
    index = random.randrange(0, len(genes))
    genes[index] = random.choice(geneset)
    return True
def mutate_swap(genes):
    if len(genes) < 2:
        return False
    indexA, indexB = random.sample(range(len(genes)), 2)
    genes[indexA], genes[indexB] = genes[indexB], genes[indexA]
    return True
def mutate_move(genes):
    if len(genes) < 3:
        return False
    start = random.choice(range(len(genes)))
    stop = start + random.randint(1, 2)
    toMove = genes[start:stop]
    genes[start:stop] = []
    index = random.choice(range(len(genes)))
    if index == start:
        index += 1
    genes[index:index] = toMove
    return True

Notice that two of the functions have multiple parameters but the mutation function only calls them with genes. The function-specific parameters, if any, are provided when the list of mutation operators is created in the test harness, through the use of partial.

from functools import partial
...
    def find_regex(self, wanted, unwanted, expectedLength):
...
        mutationRoundCounts = [1]

        mutationOperators = [
            partial(mutate_add, geneset=fullGeneset),
            partial(mutate_replace, geneset=fullGeneset),
            mutate_remove,
            mutate_swap,
            mutate_move,
        ]

        def fnMutate(genes):
            mutate(genes, fnGetFitness, mutationOperators,
                   mutationRoundCounts)

Test Harness

The test harness starts by adding the unique letters from each of the wanted strings, as well as the wanted strings themselves, to the set of gene tokens the algorithm can use to build the regex. That way the algorithm doesn’t have to struggle to reassemble long sequences or words.

import genetic
...
    def find_regex(self, wanted, unwanted, expectedLength):
        startTime = datetime.datetime.now()
        textGenes = wanted | set(c for w in wanted for c in w)
        fullGeneset = [i for i in allMetas | textGenes]

Next come the helper functions, given previously, and finally the call to run the engine.

        optimalFitness = Fitness(len(wanted), len(wanted), 0, expectedLength)

        best = genetic.get_best(fnGetFitness,
                                max(len(i) for i in textGenes),
                                optimalFitness, fullGeneset, fnDisplay,
                                fnMutate, poolSize=10)
        self.assertTrue(not optimalFitness > best.Fitness)

Run

sample result
+10	matches 0 wanted 2 unwanted, len 3	0:00:00.001003
111	matches 0 wanted 0 unwanted, len 3	0:00:00.001003
10	matches 1 wanted 0 unwanted, len 2	0:00:00.001003
10?1?	matches 2 wanted 0 unwanted, len 5	0:00:00.017079
11|10|01	matches all wanted 0 unwanted, len 8	0:00:00.317880
1*0?10*	matches all wanted 0 unwanted, len 7	0:00:01.137309

It worked! It found a successful regex in a fraction of a second but then struggled for a comparatively long time to reduce it to the requested length. A benchmark function will show long it takes on average.

    def test_benchmark(self):
        genetic.Benchmark.run(self.test_two_digits)
            Sample Benchmark
average (seconds)  |  standard deviation
----------------------------------------
      0.70         |         1.02

Performance improvement

Improving the performance of this genetic algorithm requires revisiting the try..except block in the fitness function, the one that detects invalid regular expressions. The following temporary code change will make it possible to measure how often that happens.

total = invalid = 0

def get_fitness(genes, wanted, unwanted):
    global total
    total += 1
...
    except re.error:
        global invalid
        invalid += 1
        return Fitness(0, len(wanted), len(unwanted), length)
    def find_regex(self, wanted, unwanted, expectedLength):
...
        self.assertTrue(not optimalFitness > best.Fitness)
        print("{}% of {} generated regexes were invalid".format(
            int(100 * invalid / total), total
        ))

Running the test a few times gets results like the following:

sample results
18% of 82018 generated regexes were invalid
21% of 57325 generated regexes were invalid
23% of 212732 generated regexes were invalid
25% of 216453 generated regexes were invalid
29% of 24124 generated regexes were invalid
34% of 2734 generated regexes were invalid

Ouch. A lot of opportunities to find an improvement are lost due to the generated regex being invalid. What could make the regex invalid? To find out, replace the measurement code with code that captures the details of the error in a global variable, with preference for the shortest example of the given error.

regexErrorsSeen = {}



def get_fitness(genes, wanted, unwanted):
...
    except re.error as e:
        key = str(e)
        key = key[:key.index("at position")]
        info = [str(e),
                "genes = ['{}']".format("', '".join(genes)),
                "regex: " + pattern]
        if key not in regexErrorsSeen or len(info[1]) < len(
                regexErrorsSeen[key][1]):
            regexErrorsSeen[key] = info
        return Fitness(0, len(wanted), len(unwanted), length)
...

Then print all the errors at the end of the test harness.

    def find_regex(self, wanted, unwanted, expectedLength):
...
        for info in regexErrorsSeen.values():
            print("")
            print(info[0])
            print(info[1])
            print(info[2])

Now run the test to get the error samples:

sample errors
nothing to repeat at position 0
genes = ['?']
regex: ?

multiple repeat at position 2
genes = ['0', '?', '*']
regex: 0?*

Run the test a few times to see variations. There are two situations: The first is when the regex has a repeat-type meta character with no text before it. The other is when two repeat-type meta characters are adjacent. One solution is to repair the regex.

Regex repair

def get_fitness(genes, wanted, unwanted):
    pattern = repair_regex(genes)
    length = len(pattern)
...

The repair function can be built iteratively by running the test and adding code that detects and corrects the errors found. Eventually all the regexes can be repaired. Here’s one possible implementation:

def repair_regex(genes):
    result = []
    f = repair_ignore_repeat_metas
    for token in genes:
        f = f(token, result)
    return ''.join(result)
def repair_ignore_repeat_metas(token, result):
    if token in repeatMetas:
        return repair_ignore_repeat_metas
    result.append(token)
    return repair_ignore_repeat_metas_following_repeat_or_start_metas
def repair_ignore_repeat_metas_following_repeat_or_start_metas(token, result):
    last = result[-1]
    if token not in repeatMetas:
        result.append(token)
    elif last in startMetas:
        pass
    elif token == '?' and last == '?' and len(result) > 2 and result[
        -2] in repeatMetas:
        pass
    elif last in repeatMetas:
        pass
    else:
        result.append(token)
    return repair_ignore_repeat_metas_following_repeat_or_start_metas

Because the regex repair function does not change the original genes it must also be called from the display function.

def display(candidate, startTime):
    timeDiff = datetime.datetime.now() - startTime
    print("{}\t{}\t{}".format(
        repair_regex(candidate.Genes), candidate.Fitness, timeDiff))

The genetic algorithm now finds the solution much faster on average.

               Benchmark
average (seconds)  |  standard deviation
----------------------------------------
       0.30        |         0.63

A side effect of repairing the regex is the genes that are removed during the repair become latent. They can be activated if a mutation operator affects the gene to their left. This unexpectedly gives the genetic algorithm an additional tool.

Groups

The second test regex will require the use of the group-type meta characters '(' and ')' so support for those must be added.

startMetas = {'|', '('}
endMetas = {')'}
allMetas = repeatMetas | startMetas | endMetas

Repair

Now running test_two_digits produces examples of the next set of regex issues to repair:

sample errors
missing ), unterminated subpattern at position 0
genes = ['(']
regex: (

unbalanced parenthesis at position 0
genes = [')']
regex: )

The first issue can be solved by appending missing end-group ')' meta characters to the final regex.

def repair_regex(genes):
    result = []
    finals = [] 
    f = repair_ignore_repeat_metas
    for token in genes:
        f = f(token, result, finals) 
    result.extend(reversed(finals)) 
    return ''.join(result)
def repair_ignore_repeat_metas(token, result, finals):
    if token in repeatMetas or token in endMetas:
        return repair_ignore_repeat_metas
    if token == '(': 
        finals.append(')')
    result.append(token)
    return repair_ignore_repeat_metas_following_repeat_or_start_metas

The second issue can be resolved by prefixing the final regex with the missing start-group '(' meta character - (6) in the code listing below.

def repair_ignore_repeat_metas_following_repeat_or_start_metas(token, result, finals):
    last = result[-1]
    if token not in repeatMetas:
        if token == '(': 
            finals.append(')')
        elif token == ')':
            match = ''.join(finals).rfind(')')
            if match != -1:
                del finals[match]
            else:
                result[0:0] = ['('] 
        result.append(token)
    elif last in startMetas:
...

New test

Once all the regex problems have been repaired, the new test can be added and run.

    def test_grouping(self):
        wanted = {"01", "0101", "010101"}
        unwanted = {"0011", ""}
        self.find_regex(wanted, unwanted, 5)
sample result
01010101|0101()	matches 1 wanted 0 unwanted, len 15	0:00:00
(0101)|01+	matches 2 wanted 0 unwanted, len 10	0:00:00.003005
(01??11?|010101)+	matches all wanted 0 unwanted, len 17	0:00:00.106251
(01|)+01	matches all wanted 0 unwanted, len 8	0:00:00.108257
(01)+	matches all wanted 0 unwanted, len 5	0:00:00.130313

Nice!

Character-sets

The next test regex will require the use of the character-set-type meta characters '[' and ']'. To support those first add them to the meta global variables.

startMetas = {'|', '(', '['}
endMetas = {')', ']'}

As before, next run the existing tests to produce error samples.

sample errors
missing ), unterminated subpattern at position 0
genes = ['[', '*', ')', ']', '*', '0']
regex: ([)]*0

unbalanced parenthesis at position 5
genes = ['[', '(', ']', '*', '0']
regex: [(]*0)

unterminated character set at position 0
genes = ['[']
regex: [

The first two are caused by the group completion code added in the previous section completing groups that begin or end inside a character-set.

Repair

def repair_regex(genes):
    result = []
    finals = []
    f = repair_ignore_repeat_metas
    for token in genes:
        f = f(token, result, finals)
    if ']' in finals and result[-1] == '[': 
        del result[-1]
    result.extend(reversed(finals))
    return ''.join(result)
def repair_ignore_repeat_metas(token, result, finals):
    if token in repeatMetas or token in endMetas:
        return repair_ignore_repeat_metas
    if token == '(':
        finals.append(')')
    result.append(token)
    if token == '[': 
        finals.append(']')
        return repair_in_character_set
    return repair_ignore_repeat_metas_following_repeat_or_start_metas
def repair_ignore_repeat_metas_following_repeat_or_start_metas(token, result, finals):
    last = result[-1]
    if token not in repeatMetas:
        if token == '[': 
            result.append(token)
            finals.append(']')
            return repair_in_character_set
        if token == '(':
...
def repair_in_character_set(token, result, finals):
    if token == ']':
        if result[-1] == '[':
            del result[-1]
        result.append(token)
        match = ''.join(finals).rfind(']')
        if match != -1:
            del finals[match]
        return repair_ignore_repeat_metas_following_repeat_or_start_metas
    elif token == '[':
        pass
    else:
        result.append(token)
    return repair_in_character_set

New test

    def test_state_codes(self):
        wanted = {"NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND"}
        unwanted = {"N" + l for l in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    if "N" + l not in wanted}
        self.find_regex(wanted, unwanted, 11)
sample result
...
NM|NC|NV|ND|NJ|NE|NY|NH	matches all wanted 0 unwanted, len 23	0:00:08.716123
NY|NE|NC|NH|NV?J*D*M*	matches all wanted 0 unwanted, len 21	0:00:15.928484
NH|NE|NV|NC*D?J*Y?M?	matches all wanted 0 unwanted, len 20	0:00:52.029423
N[D(ECYM??JVYHJD]	matches all wanted 0 unwanted, len 17	0:01:51.952601
N[D(ECYM?JVHYJD]	matches all wanted 0 unwanted, len 16	0:01:51.957615
N[DECYM?JJVHYD]	matches all wanted 0 unwanted, len 15	0:01:51.987693
N[VJYM?HDCYED]	matches all wanted 0 unwanted, len 14	0:01:52.168333
N[VMCDJYCHED]	matches all wanted 0 unwanted, len 13	0:01:52.249548
N[VMCJYHED]	matches all wanted 0 unwanted, len 11	0:01:52.254562

The genetic algorithm succeeds but it can take a long time to discover the character-set solution if it has already found a successful regex. That is because removing repeated items from a character-set, or moving wanted items into a character-set, usually only negatively impact the fitness at this stage. Conversely, those actions may not affect the fitness at all before a working solution is found. That means introducing a character-set that improves the fitness requires multiple sequential steps, that’s not something genetic algorithms do well. They are much more successful when they can find ways to make incremental improvements. It would be much more likely to succeed if it had access to a character-set specific operator. That kind of mutation operator might only be useful to particular regexes.

Supporting custom operators

First add an optional parameter to the find function and append its contents to the array of mutation operators.

    def find_regex(self, wanted, unwanted, expectedLength,
                   customOperators=None):
...
        mutationOperators = [
...
        ]
        if customOperators is not None:
            mutationOperators.extend(customOperators)

Then define the character-set operator.

def mutate_to_character_set_left(genes, wanted):
    if len(genes) < 4:
        return False
    ors = [i for i in range(-1, len(genes) - 3)
           if (i == -1 or genes[i] in startMetas) and
           len(genes[i + 1]) == 2 and
           genes[i + 1] in wanted and
           (len(genes) == i + 1 or genes[i + 2] == '|' or genes[
               i + 2] in endMetas)]
    if len(ors) == 0:
        return False
    lookup = {}
    for i in ors:
        lookup.setdefault(genes[i + 1][0], []).append(i)
    min2 = [i for i in lookup.values() if len(i) > 1]
    if len(min2) == 0:
        return False
...

It finds all the two-character wanted strings that have a | meta character, or end-of-array, on both sides. If there are at least two that have the same first character, for example MA| and |ME| then they become candidates to be replaced with a character-set, i.e. M[AE]. Pick a candidate, add the character-set and remove the parts from the regex.

    choice = random.choice(min2)
    characterSet = ['|', genes[choice[0] + 1][0], '[']
    characterSet.extend([genes[i + 1][1] for i in choice])
    characterSet.append(']')
    for i in reversed(choice):
        if i >= 0:
            genes[i:i + 2] = []
    genes.extend(characterSet)
    return True

Next use the new operator in the test.

...
        customOperators = [
            partial(mutate_to_character_set_left, wanted=wanted),
        ]
        self.find_regex(wanted, unwanted, 11, customOperators)

Now run the test again and it can usually find the regex in a few seconds. Additionally, the situation where it first finds the solution that does not use a character-set becomes rare, and when it does it occur the genetic algorithm can easily find the character-set solution.

sample result
NY|NM|NH*(V*(J|C*D*|E))	matches all wanted 0 unwanted, len 23	0:00:07.173342
NY|NM|NH*V*(J|C*D*|E)	matches all wanted 0 unwanted, len 21	0:00:07.252551
N[)EJ+YMCVDH+VV]	matches all wanted 0 unwanted, len 16	0:00:08.928685
N[)VYMCV)EVHJD]	matches all wanted 0 unwanted, len 15	0:00:08.997869
N[VMVC)HEJYD]	matches all wanted 0 unwanted, len 13	0:00:09.035971
N[VJVCHEMYD]	matches all wanted 0 unwanted, len 12	0:00:09.053016
N[JVCEHDMY]	matches all wanted 0 unwanted, len 11	0:00:09.183363

Repetition

The next regex will use a repetition-type meta token like {2} or {2,}. First add them to the meta global variable.

repeatMetas = {'?', '*', '+', '{2}', '{2,}'}

Run the existing tests and there should be no errors to correct.

New test

    def test_even_length(self):
        wanted = {"00", "01", "10", "11", "0000", "0001", "0010", "0011",
                  "0100", "0101", "0110", "0111", "1000", "1001", "1010",
                  "1011", "1100", "1101", "1110", "1111"}
        unwanted = {"0", "1", "000", "001", "010", "011", "100", "101",
                    "110", "111", ""}
        self.find_regex(wanted, unwanted, 10)
sample result
...
([1110]{2})?[10]{2}	matches all wanted 0 unwanted, len 19	0:00:13.188051
([10]{2})?[10]{2}	matches all wanted 0 unwanted, len 17	0:00:15.622529
([1101]{2}|11)+	matches all wanted 0 unwanted, len 15	0:00:24.155222
([110101]{2})+	matches all wanted 0 unwanted, len 14	0:00:26.304940
([01]{2})+	matches all wanted 0 unwanted, len 10 0:00:26.494444

The genetic algorithm is slow to find this regex and it can also stall if it finds this particular solution:

(00|10|01|11)+	matches all wanted 0 unwanted, len 14	0:00:12.010740

The following custom mutation operator resolves the problem. It finds all the cases where a '|' meta character has non-meta characters on both sides. If any are found it picks one and replaces the three genes with a character-set containing only the unique characters from the two non-meta character genes.

Example: ['00', '|', '01'] becomes ['[', '0', '1', ']'].

def mutate_to_character_set(genes):
    if len(genes) < 3:
        return False
    ors = [i for i in range(1, len(genes) - 1)
           if genes[i] == '|' and
           genes[i - 1] not in allMetas and
           genes[i + 1] not in allMetas]
    if len(ors) == 0:
        return False
    shorter = [i for i in ors
               if sum(len(w) for w in genes[i - 1:i + 2:2]) >
               len(set(c for w in genes[i - 1:i + 2:2] for c in w))]
    if len(shorter) == 0:
        return False
    index = random.choice(ors)
    distinct = set(c for w in genes[index - 1:index + 2:2] for c in w)
    sequence = ['['] + [i for i in distinct] + [']']
    genes[index - 1:index + 2] = sequence
    return True

Then change the test to use the new mutation operator.

    def test_even_length(self):
...
        customOperators = [
            mutate_to_character_set,
        ]
        self.find_regex(wanted, unwanted, 10, customOperators)

Run the test now and it can find the solution every time.

sample result
.....
([0*1000]{2})+	matches all wanted 0 unwanted, len 14	0:00:05.221497
([1000]{2})+	matches all wanted 0 unwanted, len 12	0:00:05.379950
([01][10])+	matches all wanted 0 unwanted, len 11	0:00:05.519291
([01]{2})+	matches all wanted 0 unwanted, len 10	0:00:06.595626

State codes

The final regex test in this chapter is to find a reasonably short regex for all 50 U.S. state codes.

    def test_50_state_codes(self):
        wanted = {"AL", "AK", "AZ", "AR", "CA",
                  "CO", "CT", "DE", "FL", "GA",
                  "HI", "ID", "IL", "IN", "IA",
                  "KS", "KY", "LA", "ME", "MD",
                  "MA", "MI", "MN", "MS", "MO",
                  "MT", "NE", "NV", "NH", "NJ",
                  "NM", "NY", "NC", "ND", "OH",
                  "OK", "OR", "PA", "RI", "SC",
                  "SD", "TN", "TX", "UT", "VT",
                  "VA", "WA", "WV", "WI", "WY"}
        unwanted = {a + b for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    for b in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    if a + b not in wanted} | \
                   set(i for i in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
        customOperators = [
            partial(mutate_to_character_set_left, wanted=wanted),
            mutate_to_character_set,
        ]
        self.find_regex(wanted, unwanted, 120, customOperators)
sample result
RI|DE|VA|CO|SC|CA|PA|LA|SD|TX|WY|GA|WI|HI|M[IASDETNO]|N[CMDJYMEVH]|VT+|CT|TN|UT|WA|WV|FL|A[RLZK]|K[YS]|O[RHK]|I[LNDA]	matches all wanted 0 unwanted, len 117	2:44:17.252814

It produces a good result but doing so can take hours.

Now add the following custom mutation operator.

def mutate_add_wanted(genes, wanted):
    index = random.randrange(0, len(genes) + 1) if len(genes) > 0 else 0
    genes[index:index] = ['|'] + [random.choice(wanted)]
    return True

It shortcuts the process of getting all the wanted sequences into the regex by inserting a | meta character followed by a random wanted string. That allows the algorithm to quickly switch its focus to reducing the length of the regex.

    def test_50_state_codes(self):
...
        customOperators = [
            partial(mutate_to_character_set_left, wanted=wanted),
            mutate_to_character_set,
            partial(mutate_add_wanted, wanted=[i for i in wanted]),
        ]
        self.find_regex(wanted, unwanted, 120, customOperators)

Run 2

Now the genetic algorithm can find a successful regex in a couple of minutes, but the new mutation operator makes it grow faster than it can be reduced. This results in very long regexes like the one below that then have to be reduced after the fact, a much slower process that can take tens of minutes.

|ILKYORNV|||RI|WA|[|AK{2,}{2}MTSLCNV]]LCTDEVA|ME||VA|AK|UT|HI|LA|[FNJ]LANV|MO|AZ|FL|TX|||GA|MA[SNCM]|AR||MA|MDLANVAZ|TN|MD|NE[HMSTN)SLD]X|PAN|SCI][NH||RIN|NEVHNV]VT|GA[J|CAJI]||MNUWY|CA|NC|WIVDENH]NC|WV|ME|MS|NE|VT|MT|ILLDE|IA|LAWILNV|WY|VA|PA|KY|UT|IL|WVTW|PCAMNNC|ID|||AL|VAOK[UV]|NJ|MI|CT|OK|]|NJ|IN||CO[D|UI|MDA]ME]|UTMNC|NV|KS|NY|CO|WI|CAMT[LNCNNKSHA|OH|||OWGA]CT+ND|SD|OHSDAAL|K]ARVTMAWICA[NYWVWIWYCOOHO|WIZLM|RINH*]ANDILMSN?TN||INAMSWV|MN|WVNJ[OICM][|AVFL]WIMDI]A|N[CMEEDEH]|SC|DE|O[HR]	matches all wanted 0 unwanted, len 493	0:02:41.872429

The growth of the regex can be controlled through the introduction of a static configuration setting on the Fitness object.

class Fitness:
    UseRegexLength = False
...

Then update the comparison function.

    def __gt__(self, other):
...
        if not success:
            return self.Length <= other.Length if Fitness.UseRegexLength else False
        return self.Length < other.Length

This works by controlling when the engine switches to a new genetic line. When the setting is True then, all other things being equal, the engine will only switch if the other regex is longer.

Finally, enable the setting in the test.

    def test_50_state_codes(self):
        UseGeneLength = True
...

Now the genetic algorithm can find the final regex in just a few minutes.

sample result
FL||LA|GA|HI|UT|CT|DE|OK|RI|A[LKRZ]|O[RH]|S[DC]|K[YS]|]|I[NL]|PA|V[AT]|C[OA]|I[AD]|M[SETNODIA]|W[VIAY]|T[XN]|N[VCYJEMDH]	matches all wanted 0 unwanted, len 120	0:03:13.662164

Excellent! Unsurprisingly test_state_codes benefits from the use of this setting too.

Exercise

Notice in the final regex above that there are still characters that could be removed, for example in |K[YS]|]|. Also, CT could be merged into C[OA]. Those would eventually be discovered by the genetic algorithm if a shorter goal length was used. However, there are other potential savings that it is unlikely to find. For example, LA, GA, and PA could be combined if there was a mutation operator for creating character-sets with a common ending letter. Try adding one.

This is a great project for experimentation and there are many more regex meta characters to play with.

Summary

This chapter introduced two new concepts. The first is the idea of repairing the chromosome, which is useful when some phenotype values are incompatible. The second concept is using the change in the number of genes to control chromosome growth.

Final Code

The final code for this chapter is available from:

https://github.com/handcraftsman/GeneticAlgorithmsWithPython/tree/master/ch17