APPENDIX D

Performance Tips

The material in this appendix is based on the Julia performance tips detailed in the Julia manual¹ and the material in Goldberg (1991).

D.1 Floating Point Numbers

D.1.1 Do Not Test for Equality

When comparing floating point numbers, check that the absolute value of their difference is less than some tolerance. Floating point operations often involve a tiny loss of precision, making two numbers unequal at the bit level when they are actually equal enough for practical purposes.

a = 3.1415926
tol = 1e-5

## Bad
if a == pi
    println("a is equal to $pi")
end

## Good
if abs(a - pi) < tol
    println("a is equal to $pi")
end

## Better
if isapprox(a, pi, rtol = tol)
    println("a is equal to $pi")
end
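Julia also provides the infix operator ≈ (typed as \approx then tab), which calls isapprox() with default tolerances derived from machine epsilon. Those defaults are stricter than the tol used above, so an explicit tolerance is preferable when the required accuracy is known:

## Equivalent to isapprox(a, pi), with default rtol = sqrt(eps())
a ≈ pi
# false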

D.1.2 Use Logarithms for Division

When dividing very large numbers, it can be advantageous to replace them with their logarithms: the numbers themselves may overflow the floating point range, while their logarithms remain modest in size. The following code block shows how two very large numbers, generated with the gamma() function, can be divided using their logarithms.

using SpecialFunctions

## Two very large numbers: gamma() overflows Float64, so this is Inf / Inf
gamma(210) / gamma(190)
# NaN

## The logarithms are well within range
## (loggamma() is the current SpecialFunctions name for the old lgamma())
exp(loggamma(210) - loggamma(190))
# 9.89070132023e45
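The same idea applies to any product or ratio of very large terms. As a further sketch (logbinom() is a helper name introduced here for illustration, not part of SpecialFunctions), a large binomial coefficient can be evaluated in log space even though the individual gamma values overflow:

## n choose k = Γ(n+1) / (Γ(k+1) Γ(n-k+1)); gamma(1001) alone is Inf
logbinom(n, k) = loggamma(n + 1) - loggamma(k + 1) - loggamma(n - k + 1)

exp(logbinom(1000, 500))
# ≈ 2.7e299, well within Float64 range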

D.1.3 Subtracting Two Nearly Equal Numbers

If two numbers agree to b bits, b bits of precision can be lost when they are subtracted. If the two numbers have identical machine representations, their difference will be zero, despite this being mathematically false. A simple way to illustrate this is to approximate the derivative of exp(x) at x = 1 by a finite difference. The following code block shows what happens as the step size h shrinks. The approximation agrees with exp(1) to the chosen tolerance from h = 10⁻⁵ through h = 10⁻¹⁰; by h = 10⁻¹¹, cancellation error dominates, and at h = 10⁻¹⁶ the two numbers in the numerator are equal to machine precision, so the result is zero. When this occurs, it is often beneficial to reformulate the problem so that it does not require a subtraction, as sketched after the output below.

## Difference quotient for the derivative of exp(x) at x = 1
lim_exp1(h) = (exp(1.0 + h) - exp(1.0)) / h

tol = 1e-5

## h from 10⁻⁴ to 10⁻¹⁶
for i in 4:16
    le1 = lim_exp1(10.0^(-i))
    println("$i : ", le1)
    println("derivative == exp(1): $(isapprox(exp(1), le1, rtol = tol))")
end

# 4 : 2.718417747082924
# derivative == exp(1): false
# 5 : 2.7182954199567173
# derivative == exp(1): true
# ..
# 10 : 2.7182833761685288
# derivative == exp(1): true
# 11 : 2.7183144624132183
# derivative == exp(1): false
# ..
# 16 : 0.0
# derivative == exp(1): false
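Here, the subtraction can be eliminated: exp(1 + h) − exp(1) = exp(1)(exp(h) − 1), and the built-in expm1() computes exp(h) − 1 accurately for small h. A minimal sketch of the reformulation (lim_exp1_stable() is a name introduced here for illustration):

## Reformulated to avoid subtracting nearly equal numbers:
## exp(1 + h) - exp(1) = exp(1) * (exp(h) - 1) = exp(1) * expm1(h)
lim_exp1_stable(h) = exp(1.0) * (expm1(h) / h)

lim_exp1_stable(1.0e-16)
# ≈ 2.718281828459045: full accuracy even at h = 10⁻¹⁶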

D.2 Julia Performance

D.2.1 General Tips

The following three tips will almost without exception improve your Julia code:

1.  Do not use global variables.

2.  Write functions.

3.  Profile your code.

The first tip is important because the Julia compiler has a hard time optimizing global variables: their type can change at any point in the program. If globals must be used, they should be defined as constants using the const keyword, which fixes their type and lets the compiler optimize code that uses them.
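A minimal sketch of the difference, assuming a scale factor used inside a function:

## Slow: g is an untyped global, so its type may change at any time
g = 2.0
f_global(x) = g * x

## Better: a const global has a fixed type the compiler can rely on
const G = 2.0
f_const(x) = G * x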

The second tip follows from the way the compiler works: code inside functions is typically much faster. Functions have well-defined arguments and outputs, which helps the compiler make optimizations. Additionally, functions aid the maintenance and testing of your software, both very important aspects of software development.
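For example, moving top-level loop code into a function lets the compiler specialize on the argument types; a minimal sketch:

## Top-level loop over a global: poorly optimized
s = 0.0
for x in rand(10_000)
    global s += x
end

## The same work inside a function: type-stable and much faster
function total(v)
    t = 0.0
    for x in v
        t += x
    end
    return t
end

total(rand(10_000))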

The third tip is crucial, as it can help data scientists discover problems and improve the performance of their code. The Julia ecosystem has several options for profiling, such as the built-in Profile module, ProfileView.jl, and BenchmarkTools.jl. We will briefly touch on the @time macro, used in the code blocks of Appendix D.2.2: placed before a Julia expression, it reports the time the expression took to execute, the number of allocations it made, and the total number of bytes those allocations used. Large memory allocation is often an indication of a problem in code, typically type instability or the improper use of mutable data structures. Note that the first time an expression is timed, the measurement includes compilation time and should be ignored; in practice, we run @time two to three times before taking note of the information it provides.
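As a brief sketch of one alternative, BenchmarkTools.jl provides the @btime and @benchmark macros, which run an expression many times and report summary statistics, sidestepping the compilation-noise issue of a single @time call:

using BenchmarkTools
A = rand(1000)

## Interpolate globals with $ so they are benchmarked as local values
@btime sum($A)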

D.2.2 Array Processing

In Julia, as in R, arrays are stored in column-major order. When processing two-dimensional arrays, it is therefore much faster to make the column index the outer loop and the row index the inner loop, so that memory is accessed sequentially. The following code block compares two functions that total the entries in a 10000 × 3000 array of random numbers. The function that iterates over columns in its outer loop takes roughly 70% less time.

using Random
Random.seed!(63)

## Processing arrays
A1 = rand(Float64, (10000, 3000))

## Outer loop over rows: strides across memory
function outer_row(A)
    tot = 0
    m, n = size(A)
    for i = 1:m, j = 1:n
        tot += A[i, j]
    end
    return tot
end

## Outer loop over columns: accesses memory sequentially
function outer_col(A)
    tot = 0
    m, n = size(A)
    for j = 1:n, i = 1:m
        tot += A[i, j]
    end
    return tot
end

@time outer_row(A1)
# 0.725792 seconds (5 allocations: 176 bytes)

@time outer_col(A1)
# 0.207118 seconds (5 allocations: 176 bytes)

When writing functions that return an array, it is often advantageous to pre-allocate the memory for the return value. This cuts down on the performance bottlenecks associated with memory allocation and garbage collection, and also provides some control over the type of the object being returned. The following code block details how one might take the mean of each column of an array. This is for illustration only: the mean() function can do this directly when given the correct dims argument, as shown at the end of this subsection.

The first function, colm(), is very inefficient. The returned array is defined at the beginning as the res object. Because no element type is specified, Julia must assume it can hold entries of any type (a Vector{Any}), which prevents effective optimization. At each value of j, colm() copies the array slice, calculates its mean, and concatenates it to the result array res, allocating new arrays as it goes. The result is a function call that makes approximately 91,000 allocations and uses 267 MB of memory.

The second function, colm2(), pre-allocates a 3000-element array of type Float64 to store the column means, since the number of columns of the input is known. Inside the loop, it takes advantage of array views: view() returns a view into the array at the specified indices without copying the slice. These modifications result in an 80% speed improvement and 97% fewer memory allocations.

using Random, Statistics
Random.seed!(65)

## Processing arrays
A1 = rand(Float64, (10000, 3000))

## Inefficient: untyped result, grown by concatenation from copied slices
function colm(A)
    res = Vector()
    m, n = size(A)
    for j = 1:n
        res = vcat(res, mean(A[:, j]))
    end
    return res
end

@time colm(A1) # 3000-element Vector{Any}
# 0.561565 seconds (90.89 k allocations: 266.552 MiB, 21.93% gc time)

## Efficient: pre-allocated Float64 result, no-copy views
function colm2(A)
    m, n = size(A)
    res = Vector{Float64}(undef, n)
    for j = 1:n
        res[j] = mean(view(A, :, j))
    end
    return res
end

@time colm2(A1) # 3000-element Vector{Float64}
# 0.118919 seconds (3.01 k allocations: 164.297 KiB)
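For reference, the built-in route mentioned above: mean() accepts a dims keyword argument, which computes the column means directly. Note that it returns a 1 × 3000 matrix rather than a vector; vec() flattens it:

using Statistics
vec(mean(A1, dims = 1)) # 3000-element Vector{Float64}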

D.2.3 Separate Core Computations

When writing Julia functions, it is recommended that complex and/or repeated computations be implemented as separate functions. This helps the compiler optimize the code, aids debugging, and increases the opportunity for code re-use.

The concept is illustrated in the next code block. The first function, matvec(), performs matrix–vector multiplication by expressing the result as a linear combination of the matrix's columns, with the vector's entries as coefficients. It does this in a for loop and returns the result in a dictionary. The function mixes two concerns: initializing and populating the dictionary, and performing the multiplication. If we split these into the separate functions mvmul() and matvec2(), we gain a 24% speed-up along with more readable and maintainable code.

## Sample matrix and vector
M1 = [1 2 3; 2 4 6] # 2x3
V1 = [1, 2, 2]      # length 3

## Column representation of the multiplication
function matvec(M, v)
    d1 = Dict{String, Vector{Real}}()
    m, n = size(M)
    res2 = zeros(m)
    for i = 1:n
        res2 += view(M, :, i) * v[i]
    end
    return push!(d1, "M1xV1" => res2)
end

@time matvec(M1, V1)
# 0.000021 seconds (21 allocations: 1.672 KiB)

## Separate function to do the core computation;
## dispatch lets it accept Int and Float arguments
function mvmul(M::Matrix{T}, v::Vector{T}) where {T <: Number}
    m, n = size(M)
    res2 = zeros(m)
    for i = 1:n
        res2 += view(M, :, i) * v[i]
    end
    return res2
end

## Calls the computation function
function matvec2(M, v)
    d1 = Dict{String, Vector{Real}}()
    v1 = mvmul(M, v)
    return push!(d1, "M1xV1" => v1)
end

@time matvec2(M1, V1)
# 0.000016 seconds (21 allocations: 1.672 KiB)
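As a quick sanity check, the result can be compared with Julia's built-in matrix–vector product:

## The column-based computation matches the built-in product
mvmul(M1, V1) == M1 * V1
# true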

¹http://docs.julialang.org/en/stable/manual/performance-tips/