GRAPH COLORING

In this chapter we will try a type of problem known as graph coloring. Variations involve using the fewest number of colors while making each node a unique color, trying to use an equal number of each color, etc. By experimenting we can see that a shape surrounded by an even number of neighbors can generally be filled with only 3 colors.

While a shape surrounded by an odd number of neighbors requires 4 colors.

We will to try to use only 4 colors to color a map of the United States with the constraint that no adjacent states have the same color. To do this by hand, color any state then use alternating colors on its neighboring states, and only introduce a new color when necessary.

Now when we do this in code we don’t care about the visual representation, just the physical relationships between the states, which we can encode as a graph, or more simply as a set of rules indicating which states cannot have the same color.

Data

We’ll start off with a file containing a list of states and the set of states that are adjacent to each. Each row of the file contains the information for one state. There are 2 fields on each line, separated by a comma. The first field contains a unique code for the state. The second field contains a semicolon-separated list of the codes for its neighboring states. I used 2 columns in the table below to keep the list together.

AK,                        | MT,ID;ND;SD;WY
AL,FL;GA;MS;TN             | NC,GA;SC;TN;VA
AR,LA;MO;MS;OK;TN;TX       | ND,MN;MT;SD
AZ,CA;NM;NV;UT             | NE,CO;IA;KS;MO;SD;WY
CA,AZ;NV;OR                | NH,MA;ME;VT
CO,KS;NE;NM;OK;UT;WY       | NJ,DE;NY;PA
CT,MA;NY;RI                | NM,AZ;CO;OK;TX
DC,MD;VA                   | NV,AZ;CA;ID;OR;UT
DE,MD;NJ;PA                | NY,CT;MA;NJ;PA;VT
FL,AL;GA                   | OH,IN;KY;MI;PA;WV
GA,AL;FL;NC;SC;TN          | OK,AR;CO;KS;MO;NM;TX
HI,                        | OR,CA;ID;NV;WA
IA,IL;MN;MO;NE;SD;WI       | PA,DE;MD;NJ;NY;OH;WV
ID,MT;NV;OR;UT;WA;WY       | RI,CT;MA
IL,IA;IN;KY;MO;WI          | SC,GA;NC
IN,IL;KY;MI;OH             | SD,IA;MN;MT;ND;NE;WY
KS,CO;MO;NE;OK             | TN,AL;AR;GA;KY;MO;MS;NC;VA
KY,IL;IN;MO;OH;TN;VA;WV    | TX,AR;LA;NM;OK
LA,AR;MS;TX                | UT,AZ;CO;ID;NV;WY
MA,CT;NH;NY;RI;VT          | VA,DC;KY;MD;NC;TN;WV
MD,DC;DE;PA;VA;WV          | VT,MA;NH;NY
ME,NH                      | WA,ID;OR
MI,IN;OH;WI                | WI,IA;IL;MI;MN
MN,IA;ND;SD;WI             | WV,KY;MD;OH;PA;VA
MO,AR;IA;IL;KS;KY;NE;OK;TN | WY,CO;ID;MT;NE;SD;UT
MS,AL;AR;LA;TN

Reading the file

We’ll use the csv module to read the file and split the lines on comma. Then we’ll manually split the adjacent state list on semicolon and link the state and its list of adjacent states in a key-value table (dict).

graphColoringTests.py

import csv

def load_data(localFileName):
    """ expects: AA,BB;CC where BB and CC are the
        initial column values in other rows
    """
    with open(localFileName, mode='r') as infile:
        reader = csv.reader(infile)
        lookup = {row[0]: row[1].split(';') for row in reader if row}
    return lookup

Rule

Now that we’ve read the data we need to build the rules. A Rule connects two states indicating that they are adjacent. When we create the rule we always sort the state and adjacent codes alphabetically. This makes it possible to eliminate duplicates.

class Rule:
    def __init__(self, node, adjacent):
        if node < adjacent:
            node, adjacent = adjacent, node
        self.Node = node
        self.Adjacent = adjacent

    def __eq__(self, other):
        return self.Node == other.Node and self.Adjacent == other.Adjacent

    def __hash__(self):
        return hash(self.Node) * 397 ^ hash(self.Adjacent)

We may also want to be able to display a rule so we’ll add a str implementation.

    def __str__(self):
        return self.Node + " -> " + self.Adjacent

State adjacency Rules

Next we will build the set of Rules. While we’re doing so we will perform a sanity check on the data as follows: Whenever a key state says it is adjacent to another state, the adjacent state’s rules should also say the key state is adjacent to it. We do that by keeping a count of the number of times we’ve seen each state pair. We should see each twice, so if we see it only once we know we have a data problem.

def build_rules(items):
    rulesAdded = {}

    for state, adjacent in items.items():
        for adjacentState in adjacent:
            if adjacentState == '':
                continue
            rule = Rule(state, adjacentState)
            if rule in rulesAdded:
                rulesAdded[rule] += 1
            else:
                rulesAdded[rule] = 1

    for k, v in rulesAdded.items():
        if v != 2:
            print("rule {} is not bidirectional".format(k))

    return rulesAdded.keys()

We now have the ability to convert a file of node relationships to a set of adjacency Rules.

Test class

Next we will build the code used by the genetic engine.

import unittest
import datetime
import genetic
...
class GraphColoringTests(unittest.TestCase):
    def test(self):

Test

We’ll start by loading the states from the file and building the rules. Since the expected optimal situation will be that all adjacent states have different colors, we can set the optimal value to the number of rules. Our Chromosome will have 50 genes, one for each state in alphabetical order. This lets us use the index into its genes as an index into a list of sorted state codes.

def test(self):
    states = load_data("adjacent_states.csv")
    rules = build_rules(states)
    optimalValue = len(rules)
    stateIndexLookup = {key: index
                        for index, key in enumerate(sorted(states))}
...

Genes

Next, since we want to four-color the 50 states, our genotype can be four color codes (the first letter of each color name).

...
    colors = ["Orange", "Yellow", "Green", "Blue"]
    colorLookup = {color[0]: color for color in colors}
    geneset = list(colorLookup.keys())
...

Now add the usual function definitions and call get_best

...
    startTime = datetime.datetime.now()

    def fnDisplay(candidate):
        display(candidate, startTime)

    def fnGetFitness(genes):
        return get_fitness(genes, rules, stateIndexLookup)

    best = genetic.get_best(fnGetFitness, len(states), optimalValue,
                            geneset, fnDisplay)
    self.assertTrue(not optimalValue > best.Fitness)
...

And at the end we can write out the color of each state.

...
    keys = sorted(states.keys())
    for index in range(len(states)):
        print(keys[index] + " is " + colorLookup[best.Genes[index]])

Display

As for display, it should be sufficient to output the genes because they are also color codes.

def display(candidate, startTime):
    timeDiff = datetime.datetime.now() - startTime
    print("{}\t{}\t{}".format(
        ''.join(map(str, candidate.Genes)),
        candidate.Fitness,
        timeDiff))

This results in output like the following. The number to the right of the gene sequence will indicate how many rules this gene sequence satisfies.

YGGBOOGOOBBYGGYYYYGBGYOOGBOYGGOOOYBOYBBGGOBYOGOGOGG	74	0:00:00.001000

Fitness

Finally, we need a fitness function that counts how many constraints the gene sequence passed.

def get_fitness(genes, rules, stateIndexLookup):
    rulesThatPass = sum(1 for rule in rules
                        if rule.IsValid(genes, stateIndexLookup))
    return rulesThatPass

This uses a new function in Rule:

class Rule:
...
    def IsValid(self, genes, nodeIndexLookup):
        index = nodeIndexLookup[self.Node]
        adjacentStateIndex = nodeIndexLookup[self.Adjacent]
        return genes[index] != genes[adjacentStateIndex]

Run

Now when we run our main test function we get output like the following:

...
GOGOYYGOBGYOYYOOOGOBYOGOYYGOBBYOGGYYBBGOBYBYBGOOBOO	103	0:00:00.014000
YOGOYYGOBGYYGYBOOGOBYGGOYYGOBBYOGGYYBBGOBYBYBYOOBOO	105	0:00:00.017001
YOGOYYGOBGYYGYBOOGOBYGGOYYGOBBYOGGYYBBGOBYBYBYOOYBO	106	0:00:00.017001
GOGOYYOGYBYGGYBOOGOBBBBOYYGGBBYOGGYYBBGYOYBYBYOOYOO	107	0:00:00.040002
AK is Green
AL is Orange
AR is Green
AZ is Orange
CA is Yellow
CO is Yellow
CT is Orange
DC is Green
DE is Yellow
FL is Blue
GA is Yellow
HI is Green
IA is Green
ID is Yellow
IL is Blue
IN is Orange
KS is Orange
KY is Green
LA is Orange
MA is Blue
MD is Blue
ME is Blue
MI is Blue
MN is Orange
MO is Yellow
MS is Yellow
MT is Green
NC is Green
ND is Blue
NE is Blue
NH is Yellow
NJ is Orange
NM is Green
NV is Green
NY is Yellow
OH is Yellow
OK is Blue
OR is Blue
PA is Green
RI is Yellow
SC is Orange
SD is Yellow
TN is Blue
TX is Yellow
UT is Blue
VA is Yellow
VT is Orange
WA is Orange
WI is Yellow
WV is Orange
WY is Orange

Benchmarking

To benchmark this project we’re going to use R100_1gb, one of a set of graph coloring problem sets available from https://turing.cs.hbg.psu.edu/txn131/graphcoloring.html The file format is line based, with each line prefixed with its type. The lines we’re interested in start with e, for edge, followed by the IDs of the two nodes connected by that edge, as follows:

e 16 10

If we change the format of our adjacent state file to the same format, then we’ll be able to reuse a lot of code. We’ll start by adding a utility test-function that reads our current file and outputs the contents in the new format.

Convert the state file

@staticmethod
def test_convert_file():
    states = load_data("adjacent_states.csv")
    output = []
    nodeCount = edgeCount = 0
    for state, adjacents in states.items():
        nodeCount += 1
        for adjacent in adjacents:
            if adjacent == '':
                output.append("n {} 0".format(state))
            else:
                output.append("e {} {}".format(state, adjacent))
                edgeCount += 1
    with open('./adjacent_states.col', mode='w+') as outfile:
        print("p edge {} {}".format(nodeCount, edgeCount), file=outfile)
        for line in sorted(output):
            print(line, file=outfile)

When we run this test we end up with a file whose content looks like the following:

p edge 51 214
e AL FL
e AL GA
...
e WY UT
n AK 0
n HI 0

Alaska and Hawaii have no edges, so the only way we know about them is from their (n)ode records.

The utility function and import csv can be removed after the file is converted.

Read the new file format

Next we need to update load_data to read the new file format.

def load_data(localFileName):
    """ expects: T D1 [D2 ... DN]
        where T is the record type
        and D1 .. DN are record-type appropriate data elements
    """
    rules = set()
    nodes = set()
    with open(localFileName, mode='r') as infile:
        content = infile.read().splitlines()
    for row in content:
        if row[0] == 'e':  # e aa bb, aa and bb are node ids
            nodeIds = row.split(' ')[1:3]
            rules.add(Rule(nodeIds[0], nodeIds[1]))
            nodes.add(nodeIds[0])
            nodes.add(nodeIds[1])
            continue
        if row[0] == 'n':  # n aa ww, aa is a node id, ww is a weight
            nodeIds = row.split(' ')
            nodes.add(nodeIds[1])
    return rules, nodes

Extract parameters

We’ll also take this opportunity to rename test to color and make the file name and color list into parameters. Plus we’ll also add a test_states function that uses the color function.

    def test_states(self):
        self.color("adjacent_states.col",
                   ["Orange", "Yellow", "Green", "Blue"])

    def color(self, file, colors):
        rules, nodes = load_data(file)
...

And remove the following line from the color function since the list of colors is now a parameter.

colors = ["Orange", "Yellow", "Green", "Blue"]

Node indexes

The next change to the color function affects how we get the node indexes.

    rules, nodes = load_data(file)
    optimalValue = len(rules)
    colorLookup = {color[0]: color for color in colors}
    geneset = list(colorLookup.keys())
    startTime = datetime.datetime.now()
    nodeIndexLookup = {key: index
                       for index, key in enumerate(sorted(nodes))} ①

    def fnDisplay(candidate):
        display(candidate, startTime)

    def fnGetFitness(genes):
        return get_fitness(genes, rules, nodeIndexLookup) ②

Update the final output

Finally, we have to update the way we write out the results at the end of the test function:

    best = genetic.get_best(fnGetFitness, len(nodes), optimalValue,
                            geneset, fnDisplay)
    self.assertTrue(not optimalValue > best.Fitness)

    keys = sorted(nodes) ③
    for index in range(len(nodes)): ④
        print(keys[index] + " is " + colorLookup[best.Genes[index]])

Be sure to run the state test to verify that it still works.

Add the benchmark test

Next we can add a test for R100_1gb as follows:

def test_R100_1gb(self):
    self.color("R100_1gb.col",
               ["Red", "Orange", "Yellow", "Green", "Blue", "Indigo"])

Benchmarks

When we run the test it is able to find a way to color the graph using 6 colors relatively quickly. It can also find a solution using 5 colors but that takes longer than we want to spend on a benchmark, so we’ll stay with 6 colors. We’ll use this test as our benchmark for graph coloring.

def test_benchmark(self):
    genetic.Benchmark.run(lambda: self.test_R100_1gb())

average (seconds)  |  standard deviation
----------------------------------------
      0.74         |         0.42

Summary

In this chapter we read from a file and built constraints that can be applied to each candidate to determine its fitness. This was also the first project where we used test data from a standard set. Reading about others' solutions to the same standard problem is a good way to learn advanced techniques and to get inspiration for improving your fitness function and genotype.

Final Code

The state files and final code for this chapter are available from:

https://github.com/handcraftsman/GeneticAlgorithmsWithPython/tree/master/ch05