Performance Testing with Gatling

Now that we have a working web server, let’s test it out. If you’ve configured the Docker-based environment, port 8080 should be open, and you can run the server like this:

 :scala-native:httpserver ./target/scala-2.11/httpserver-out
 bind returned 0
 listen returned 0
 listening on port 8080

Once it’s running, you should be able to access it at http://localhost:8080. We want to go beyond basic interactive testing, however; we want to stress-test our server’s performance under realistic load.
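That said, a quick one-off request is still worth making first, just to confirm the server is up and answering. Here is a minimal sketch of such a check in plain JVM Scala; it is not part of the book’s code, and plain curl or a browser works just as well:

 // Hypothetical one-off smoke test: fetch the root URL once and print the body.
 import scala.io.Source
 
 object SmokeTest {
   def main(args: Array[String]): Unit = {
     val body = Source.fromURL("http://localhost:8080/").mkString
     println(s"got ${body.length} bytes:")
     println(body)
   }
 }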

Why Measure?

As I noted at the start of the chapter, implementing performant servers is hard. By testing the performance of our server, we can set a baseline for comparison to the more advanced approaches we’ll take in Part II of this book. We can also compare our performance to other languages and platforms to better understand how Scala Native sizes up.

That said, many sites now offer “speed tests” of various web servers, often plotted against one another in a giant bar graph. Personally, I am skeptical of this approach; in my experience, web application performance at scale is too complex to reduce to a single variable.

For the purposes of this book, we’ll instead focus on characterizing the trade-offs among five key performance criteria.

Why Gatling?

To perform the stress test, we’ll use a Scala-based testing framework called Gatling.[29] Gatling excels at generating large amounts of HTTP traffic; it’s also quite good at complex scripting and scenario-based experiments. It doesn’t run on Scala Native at this time, but it’s the best tool for the job, and so I’ve included it in the build environment.

Gatling’s simulation DSL is mostly self-explanatory, but we’ll take a quick look at it here:

HTTPServer/load_simulation.scala
 import io.gatling.core.Predef._
 import io.gatling.http.Predef._
 import scala.concurrent.duration._
 
 class GenericSimulation extends Simulation {
   // Read the test parameters from environment variables.
   val url = System.getenv("GATLING_URL")
   val requests = Integer.parseInt(System.getenv("GATLING_REQUESTS"))
   val users = Integer.parseInt(System.getenv("GATLING_USERS"))
   val reqs_per_user = requests / users
   val rampTime = Integer.parseInt(System.getenv("GATLING_RAMP_TIME"))
 
   // Each simulated user issues GET requests in a loop, accepting 200 or 304.
   val scn = scenario("Test scenario").repeat(reqs_per_user) {
     exec(
       http("Web Server")
         .get(url)
         .check(status.in(Seq(200, 304)))
     )
   }
 
   // Start all users spread over the configured ramp-up window.
   setUp(scn.inject(rampUsers(users) over (rampTime seconds)))
 }

As you can see, the simulation reads its parameters from environment variables: the URL to request, the total number of requests, the number of simultaneous users, and the ramp-up time.

We’ll also need to install Gatling. It’s a Scala application, but it ships with a complex UI and IDE components as well, so I’ve included a helper script called install_gatling.sh in the book’s code directory that will install it locally. Once Gatling is installed, we just need to set the environment variables the simulation expects and invoke Gatling, like this:

 $ export GATLING_URL=http://localhost:8080 GATLING_USERS=10
 $ export GATLING_REQUESTS=500 GATLING_RAMP_TIME=0
 $ gatling.sh

This will hit localhost with 500 requests from ten simultaneous connections, and should produce output like this:

 ================================================================================
 ---- Global Information --------------------------------------------------------
 > request count                                        500 (OK=500    KO=0     )
 > min response time                                      6 (OK=6      KO=-     )
 > max response time                                    266 (OK=266    KO=-     )
 > mean response time                                    42 (OK=42     KO=-     )
 > std deviation                                         65 (OK=65     KO=-     )
 > response time 50th percentile                         35 (OK=35     KO=-     )
 > response time 75th percentile                        130 (OK=130    KO=-     )
 > response time 95th percentile                        140 (OK=140    KO=-     )
 > response time 99th percentile                        171 (OK=171    KO=-     )
 > mean requests/sec                                    125 (OK=125    KO=-     )
 ---- Response Time Distribution ------------------------------------------------
 > t < 800 ms                                           500 (100%)
 > 800 ms < t < 1200 ms                                   0 (  0%)
 > t > 1200 ms                                            0 (  0%)
 > failed                                                 0 (  0%)
 ================================================================================

All of this is interesting data, but with respect to the five performance qualities we care about, we only need three numbers: the 50th percentile response time, the 99th percentile response time, and the error rate (the failed line at the bottom). Then we can put them in a table like this:

# of users | request count | 50th percentile (ms) | 99th percentile (ms) | error rate
10         | 500           | 35                   | 171                  | 0%
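Copying these figures out of the console report by hand gets tedious once we start running many experiments. Here is a rough sketch, not part of the book’s tooling, that scrapes the three values from a saved copy of Gatling’s console output; the file name results.txt is just a placeholder:

 import scala.io.Source
 
 // Rough sketch: pull the 50th/99th percentile and failed-request figures out
 // of a saved copy of Gatling's console report (results.txt is a placeholder).
 object ScrapeReport {
   def main(args: Array[String]): Unit = {
     val lines = Source.fromFile("results.txt").getLines().toList
     // Find the line containing `label`, drop the label, and take the first number.
     def figure(label: String): String =
       lines.find(_.contains(label))
         .flatMap(line => "\\d+".r.findFirstIn(line.replace(label, "")))
         .getOrElse("?")
     val p50    = figure("response time 50th percentile")
     val p99    = figure("response time 99th percentile")
     val failed = figure("> failed")
     println(s"50th=$p50 99th=$p99 failed=$failed")
   }
 }

Note that the failed figure is a raw count; dividing it by the request count gives the error rate for the table.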

Now, by gradually increasing the number of simulated users and requests, we can collect enough data to see the large-scale trends (one way to script this sweep is sketched after the table):

# of users | request count | 50th percentile (ms) | 99th percentile (ms) | error rate
10         | 500           | 35                   | 171                  | 0%
25         | 1250          | 71                   | 161                  | 0%
50         | 2500          | 164                  | 421                  | 0%
75         | 3750          | 222                  | 641                  | 0%
100        | 5000          | 228                  | 701                  | 0%
150        | 7500          | 324                  | 2724                 | 0%
200        | 10000         | 472                  | 4120                 | 0%
250        | 12500         | 506                  | 2777                 | 0%
300        | 15000         | 625                  | 1746                 | 0.4%
350        | 17500         | 528                  | 2592                 | 14%
400        | 20000         | 545                  | 2661                 | 29%
450        | 22500         | 576                  | 2997                 | 29%
500        | 25000         | 424                  | 4735                 | 33%
750        | 37500         | 536                  | 5137                 | 34%
1000       | 50000         | 464                  | 3861                 | 44%
1500       | 75000         | 555                  | 4485                 | 44%
2000       | 100000        | 559                  | 5484                 | 52%
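Here is the sweep-driver sketch mentioned above. It shells out to gatling.sh once per user count via scala.sys.process, holding the ratio at 50 requests per user as in the table. It assumes gatling.sh is on your PATH; depending on your Gatling version, you may need additional flags to select the simulation without prompting.

 import scala.sys.process.Process
 
 // Sketch of a driver for the load sweep: one Gatling run per user count,
 // with requests held at 50 per user to match the table above.
 object Sweep {
   val userCounts = Seq(10, 25, 50, 75, 100, 150, 200, 250, 300,
                        350, 400, 450, 500, 750, 1000, 1500, 2000)
 
   def main(args: Array[String]): Unit =
     for (users <- userCounts) {
       val env = Seq(
         "GATLING_URL"       -> "http://localhost:8080",
         "GATLING_USERS"     -> users.toString,
         "GATLING_REQUESTS"  -> (users * 50).toString,
         "GATLING_RAMP_TIME" -> "0"
       )
       // Run Gatling with the environment variables the simulation expects.
       val exit = Process(Seq("gatling.sh"), None, env: _*).!
       println(s"users=$users -> gatling exited with $exit")
     }
 }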

What can we see here? Real data are always noisy, but there are two clearly visible trends. Our typical performance at the 50th percentile (or median) starts out at 35 milliseconds, but increases almost linearly until it reaches 625ms at 300 users. After 300 users, our median response decreases slightly, and then plateaus around 500ms, but the error rate starts to shoot up very rapidly. If we were optimizing for maximum throughput with minimal errors, we could aim to run this server at a capacity of about 300 users.

However, there’s another trend visible in the 99th percentile response, which is relevant to our tail latency criterion. Especially in distributed systems, tail latencies matter: one slow call can have cascading effects in a complex transaction, and an especially long timeout can trigger the termination of an unresponsive server. Even under the lightest load, we see some ugly tail latencies of 171ms, compared to the median of 35ms. With 50 users, the 99th percentile shoots up to 421ms, and at 150 users it reaches 2724ms.

Thus, we have two different ways we could optimize a deployment of this program at scale. If we were responsible for operating this server in a latency-sensitive situation, we could scale it to around 100 users per instance to keep the tail latencies under control. On the other hand, if we needed to maximize throughput, we could run each instance at around 300 users instead.