Once the slow function has been found, it is sometimes necessary to do more profiling work that tests only a part of the program. This is done by manually instrumenting selected parts of the code in a speed test.
For instance, the cProfile module can be used in the form of a decorator, as in the following example:
import time
import tempfile
import cProfile
import pstats


def profile(column='time', limit=3):
    def parametrized_decorator(function):
        def decorated(*args, **kw):
            s = tempfile.mktemp()
            profiler = cProfile.Profile()
            # runcall() returns the wrapped function's result,
            # so the decorator stays transparent to callers
            result = profiler.runcall(function, *args, **kw)
            profiler.dump_stats(s)
            p = pstats.Stats(s)
            print("=" * 5, f"{function.__name__}() profile", "=" * 5)
            p.sort_stats(column).print_stats(limit)
            return result
        return decorated
    return parametrized_decorator


def medium():
    time.sleep(0.01)


@profile('time')
def heavy():
    for i in range(100):
        medium()
        medium()
    time.sleep(2)


@profile('time')
def main():
    for i in range(2):
        heavy()


if __name__ == '__main__':
    main()
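As a side note, since Python 3.8, cProfile.Profile also supports the context manager protocol, which offers a lighter way to instrument a single block of code without writing a decorator or dumping stats to a temporary file. A minimal sketch (the profiled expression is just a placeholder workload):

```python
import cProfile
import io
import pstats

# Since Python 3.8, Profile can be used as a context manager:
# profiling is enabled on entry and disabled on exit.
with cProfile.Profile() as profiler:
    sum(i * i for i in range(10000))  # placeholder workload

# Stats can be built directly from the Profile object,
# without going through a stats file first.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('time').print_stats(3)
print(stream.getvalue())
```

Note that pstats.Stats accepts a Profile instance directly, so the temporary file used in the decorator above is only one of several ways to hand statistics over to pstats.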
This approach allows for testing only selected parts of the application and sharpens the statistics output. This way, you can collect many isolated and precisely targeted profiles on a single application run, as follows:
$ python3 cprofile_decorator.py
===== heavy() profile =====
Wed Apr 10 03:11:53 2019 /var/folders/jy/wy13kx0s7sb1dx2rfsqdvzdw0000gq/T/tmpyi2wejm5
403 function calls in 4.330 seconds
Ordered by: internal time
List reduced from 4 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
201 4.327 0.022 4.327 0.022 {built-in method time.sleep}
200 0.002 0.000 2.326 0.012 cprofile_decorator.py:24(medium)
1 0.001 0.001 4.330 4.330 cprofile_decorator.py:28(heavy)
===== heavy() profile =====
Wed Apr 10 03:11:57 2019 /var/folders/jy/wy13kx0s7sb1dx2rfsqdvzdw0000gq/T/tmp8mubgwjw
403 function calls in 4.328 seconds
Ordered by: internal time
List reduced from 4 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
201 4.324 0.022 4.324 0.022 {built-in method time.sleep}
200 0.002 0.000 2.325 0.012 cprofile_decorator.py:24(medium)
1 0.001 0.001 4.328 4.328 cprofile_decorator.py:28(heavy)
===== main() profile =====
Wed Apr 10 03:11:57 2019 /var/folders/jy/wy13kx0s7sb1dx2rfsqdvzdw0000gq/T/tmp6c0y2oxj
62 function calls in 8.663 seconds
Ordered by: internal time
List reduced from 27 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 8.662 8.662 8.662 8.662 {method 'enable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {built-in method posix.lstat}
8 0.000 0.000 0.000 0.000 /usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/random.py:224(_randbelow)
But at this stage, having the full list of callees is probably not that interesting, as the function to optimize has already been identified. The only interesting information now is how fast it is, so that we can enhance it.
timeit is a useful module that provides a simple way to measure the execution time of a small code snippet, using the most precise clock the host system provides (time.perf_counter is the default timer in Python 3), as shown in the following example:
>>> from myapp import light
>>> import timeit
>>> t = timeit.Timer('light()', setup='from myapp import light')
>>> t.timeit(number=5)
5.6196951866149902
This module allows you to repeat the call multiple times, and can be easily used to try out isolated code snippets. This is very useful outside the application context—in a prompt, for instance—but is not really handy to use within an existing application.
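One way to make timeit friendlier inside an existing application is to call it programmatically: timeit.timeit() and timeit.repeat() accept a plain callable (and a globals argument for string statements), so there is no need to build import-setup strings. A minimal sketch, where light() is a made-up stand-in for a function from your application:

```python
import timeit

def light():
    # stand-in for a function from your application
    return sum(range(1000))

# Passing a callable avoids quoting code as a string
duration = timeit.timeit(light, number=1000)
print(f"1000 calls took {duration:.6f} s")

# repeat() runs several independent rounds; the minimum is
# usually the most representative figure
best = min(timeit.repeat(light, number=1000, repeat=3))
```

Taking the minimum of several rounds is a common convention, since higher measurements are usually caused by interference from other processes rather than by the code under test.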
The results of timeit should be used with caution. It is a very good tool to objectively compare two short snippets of code, but it also allows you to easily make dangerous mistakes that will lead you to confusing conclusions. Here, for example, is the comparison of two innocent snippets of code with the timeit module that could make you think that string concatenation by addition is faster than the str.join() method:
$ python3 -m timeit -s 'a = map(str, range(1000))' '"".join(a)'
1000000 loops, best of 3: 0.497 usec per loop
$ python3 -m timeit -s 'a = map(str, range(1000)); s=""' 'for i in a: s += i'
10000000 loops, best of 3: 0.0808 usec per loop
From Chapter 3, Modern Syntax Elements - Below the Class Level, we know that string concatenation by addition is not a good pattern. Despite some minor CPython micro-optimizations that were designed exactly for such use cases, it will eventually lead to quadratic runtime. The problem lies in the nuances of the setup argument of the timeit() call (or the -s parameter on the command line) and in how map() and range() behave in Python 3. I won't discuss the details of the problem here, but will leave it to you as an exercise. Anyway, here is the correct way to compare string concatenation by addition with the str.join() idiom under Python 3:
$ python3 -m timeit -s 'a = [str(i) for i in range(10000)]' 's="".join(a)'
10000 loops, best of 3: 128 usec per loop
$ python3 -m timeit -s 'a = [str(i) for i in range(10000)]' '
s = ""
for i in a:
    s += i
'
1000 loops, best of 3: 1.38 msec per loop
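The same comparison can be reproduced programmatically, which also makes it easy to verify that both snippets actually compute the same string. A minimal sketch:

```python
import timeit

a = [str(i) for i in range(10000)]

def with_join():
    return "".join(a)

def with_addition():
    s = ""
    for i in a:
        s += i
    return s

# Both approaches must produce the same result...
assert with_join() == with_addition()

# ...but str.join() should come out clearly ahead on a list this size
t_join = timeit.timeit(with_join, number=100)
t_add = timeit.timeit(with_addition, number=100)
print(f"join: {t_join:.4f} s, addition: {t_add:.4f} s")
```

Because the setup (building the list) lives outside the timed callables, both measurements cover exactly the concatenation work and nothing else.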
Profiling memory usage is explained in the next section.