But Does It Work?

Let’s see where we are. We’ve implemented four GenServers and two supervisors. When the application starts, it will start the top-level supervisor, which in turn starts Results, PathFinder, WorkerSupervisor, and Gatherer.

When Gatherer starts (and it will start last), it tells the worker supervisor to start a number of workers. When each worker starts, it gets a path to process from PathFinder, hashes the corresponding file, and passes the result to Gatherer, which stores the path and the hash in the Results server. When there are no more files to process, each worker sends a :done message to the gatherer. When the last worker is done, the gatherer reports the results.
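The last step of that flow, turning a pile of path/hash pairs into a report of duplicates, can be sketched in a few lines. This is a minimal illustration and not the Results server itself: the module name DupSketch and both function names are hypothetical, and the hashes are small integers standing in for real file hashes.

```elixir
defmodule DupSketch do
  # Fold {path, hash} pairs into a map of hash => [paths],
  # much as the gatherer hands each pair to the results store.
  def add_all(pairs) do
    Enum.reduce(pairs, %{}, fn {path, hash}, acc ->
      Map.update(acc, hash, [path], &[path | &1])
    end)
  end

  # Keep only the groups where more than one path shares a hash.
  def duplicates(by_hash) do
    for {_hash, paths} <- by_hash, length(paths) > 1, do: Enum.sort(paths)
  end
end

# Stand-in data: integers play the role of file hashes.
pairs = [{"a.jpg", 1}, {"b.jpg", 2}, {"c.jpg", 1}]

pairs
|> DupSketch.add_all()
|> DupSketch.duplicates()
|> IO.inspect()
```

Each inner list in the output is one set of files with identical content, which is the shape of the report the gatherer prints.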

Everything seems to be wired up. Let’s try it:

 $ mix run
 Compiling 7 files (.ex)
 Generated duper app
 $

Hmm…that’s strange. No output.

The first time this happened to me, I wasted most of a day working it out. And the problem is obvious once you know what’s happening.

The mix run command runs your application. Once it has it running, mix exits: mission accomplished.

But your application never finished; it just got started and mix went away. We have to tell mix not to exit.
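A toy model may make the race easier to see. It is only an analogy, not how mix is implemented: the module StartSketch is hypothetical, a spawned process stands in for our application, and the 50 ms sleep stands in for the file hashing.

```elixir
defmodule StartSketch do
  # Like an application's start callback: kick off the work
  # in another process and return immediately.
  def start(parent) do
    spawn(fn ->
      Process.sleep(50)        # pretend to hash some files
      send(parent, :results)   # report back when finished
    end)
  end
end

pid = StartSketch.start(self())

# start/1 has already returned, but the work is still in progress.
IO.puts("started; worker still running? #{Process.alive?(pid)}")

# A caller that quit at this point (as plain `mix run` does) would never
# see the results. Waiting around (as --no-halt does) lets them arrive.
receive do
  :results -> IO.puts("got results")
after
  1_000 -> IO.puts("timed out")
end
```

The point of the sketch is simply that "started" and "finished" are different events, and whoever launched the work has to stay alive long enough for the second one.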

 $ mix run --no-halt
 Results:
 
 ["./_build/dev/lib/dir_walker/.compile.elixir_scm",
  "./_build/test/lib/dir_walker/.compile.elixir_scm"]
 ["./_build/dev/lib/dir_walker/.compile.elixir",
  "./_build/test/lib/dir_walker/.compile.elixir"]
 ["./_build/dev/lib/dir_walker/.compile.xref",
  "./_build/dev/lib/duper/.compile.xref",
  "./_build/test/lib/dir_walker/.compile.xref"]
 ["./deps/dir_walker/.fetch",
  "./_build/dev/lib/dir_walker/.compile.lock",
  "./_build/dev/lib/dir_walker/.compile.fetch",
  "./_build/test/lib/dir_walker/.compile.lock",
  "./_build/test/lib/dir_walker/.compile.fetch"]
 ["./_build/dev/lib/dir_walker/ebin/dir_walker.app",
  "./_build/test/lib/dir_walker/ebin/dir_walker.app"]
 $

Much better. Even inside our Elixir project we have duplicated files, mostly between the test and dev environments.

Let’s Play with Timing

Our lib/duper/application.ex file contains parameters that tell the app where to search and how many workers to use when searching. (We’ll see in the next chapter how to move those values out of code and onto the command line.)
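Until then, one way to make the two parameters easier to change between runs is to build the child list from a small function. This is a sketch under assumptions: the module DuperParams and its children/2 function are hypothetical and not part of the Duper code, and the list it returns simply mirrors the children list in application.ex.

```elixir
defmodule DuperParams do
  # Build the supervisor's child list from the two tunable values:
  # the directory to search and the number of workers to start.
  def children(root, worker_count)
      when is_binary(root) and is_integer(worker_count) and worker_count > 0 do
    [
      Duper.Results,
      {Duper.PathFinder, root},
      Duper.WorkerSupervisor,
      {Duper.Gatherer, worker_count}
    ]
  end
end

# Building the list does not start anything; the module names are just atoms here.
IO.inspect(DuperParams.children(".", 2))
```

The order of the entries matters, because the supervisor starts its children in list order and the gatherer needs the other three to be running already.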

Let’s change these parameters. My ~/Pictures folder used 30 GB to store about 6,000 old pictures from when I used iPhoto. Let’s look for duplicates in that folder with one worker, two workers, and so on, recording elapsed time.

Here are the parameters for using a single worker:

 children = [
  Duper.Results,
  { Duper.PathFinder, "/Users/dave/Pictures" },
  Duper.WorkerSupervisor,
  { Duper.Gatherer, 1 },
 ]

Run it:

 $ time mix run --no-halt >dups
  87.57 real 58.81 user 23.44 sys
 
 $ wc -l dups
  1869 dups

We found nearly 1,900 duplicated photos in about 88 seconds. The Elixir runtime used about 98% of one of my cores during this process.

Let’s try with two workers. Alter application.ex, and run this:

 $ time mix run --no-halt >dups
  48.58 real 58.33 user 17.98 sys

Nice! It ran almost twice as fast, even though the total CPU time (user plus sys) barely changed. That means I’m successfully overlapping the IO and the hashing.
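The arithmetic behind "almost twice as fast" is easy to check from the two time outputs. The helper below is only a sketch: the module name Timing and its functions are hypothetical, and the figures are the ones from the two runs shown earlier.

```elixir
defmodule Timing do
  # Wall-clock speedup of a run relative to a baseline run.
  def speedup(baseline_real, real), do: baseline_real / real

  # Average number of cores kept busy: CPU time (user + sys)
  # divided by elapsed time.
  def cores_busy(%{real: real, user: user, sys: sys}), do: (user + sys) / real
end

# Figures copied from the `time` output of the two runs.
one = %{real: 87.57, user: 58.81, sys: 23.44}
two = %{real: 48.58, user: 58.33, sys: 17.98}

IO.puts("speedup with two workers: #{Float.round(Timing.speedup(one.real, two.real), 2)}")
IO.puts("cores busy, one worker:   #{Float.round(Timing.cores_busy(one), 2)}")
IO.puts("cores busy, two workers:  #{Float.round(Timing.cores_busy(two), 2)}")
```

The speedup comes out at roughly 1.8, and the second run keeps about one and a half cores busy on average, which is what you would expect if the workers spend part of their time waiting on the disk.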

To cut a long story short, here are the results for 1..5, 10, and 50 workers.

images/duper_perf.png

As my machine has only two physical cores (it reports four, but two of those are just hyperthreading), that’s about as good as I could expect.