Performance Testing with Gatling

Now that we have a working web server, let’s test it out. If you’ve configured the Docker-based environment, port 8080 should be open, and you can run the server like this:

 :scala-native:httpserver ./target/scala-2.11/httpserver-out
 bind returned 0
 listen returned 0
 listening on port 8080

Once it’s running, you should be able to access it at http://localhost:8080. We want to go beyond basic interactive testing, however; we want to stress-test our server’s performance under realistic load.
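That said, a quick one-off request is still worth making first, just to confirm the server is up and answering. Here is a minimal sketch of such a check in plain JVM Scala; it is not part of the book’s code, and plain curl or a browser works just as well:

 // Hypothetical one-off smoke test: fetch the root URL once and print the body.
 import scala.io.Source
 
 object SmokeTest {
   def main(args: Array[String]): Unit = {
     val body = Source.fromURL("http://localhost:8080/").mkString
     println(s"got ${body.length} bytes:")
     println(body)
   }
 }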

Why Measure?

As I noted at the start of the chapter, implementing performant servers is hard. By testing the performance of our server, we can set a baseline for comparison to the more advanced approaches we’ll take in Part II of this book. We can also compare our performance to other languages and platforms to better understand how Scala Native sizes up.

That said, many sites now offer “speed tests” of various web servers, often plotted against one another in a giant bar graph. Personally, I am skeptical of this approach; in my experience, web application performance at scale is too complex to reduce to a single variable.

For the purposes of this book, we’ll instead focus on characterizing the trade-offs among five key performance criteria.

Why Gatling?

To perform the stress test, we’ll use a Scala-based testing framework called Gatling.[29] Gatling excels at generating large amounts of HTTP traffic; it’s also quite good at complex scripting and scenario-based experiments. It doesn’t run on Scala Native at this time, but it’s the best tool for the job, and so I’ve included it in the build environment.

Gatling’s simulation DSL is mostly self-explanatory, but we’ll take a quick look at it here:

HTTPServer/load_simulation.scala
 import io.gatling.core.Predef._
 import io.gatling.http.Predef._
 import scala.concurrent.duration._
 
 class GenericSimulation extends Simulation {
   // Read the test parameters from environment variables.
   val url = System.getenv("GATLING_URL")
   val requests = Integer.parseInt(System.getenv("GATLING_REQUESTS"))
   val users = Integer.parseInt(System.getenv("GATLING_USERS"))
   val reqs_per_user = requests / users
   val rampTime = Integer.parseInt(System.getenv("GATLING_RAMP_TIME"))
 
   // Each simulated user issues GET requests in a loop, accepting 200 or 304.
   val scn = scenario("Test scenario").repeat(reqs_per_user) {
     exec(
       http("Web Server")
         .get(url)
         .check(status.in(Seq(200, 304)))
     )
   }
 
   // Start all users spread over the configured ramp-up window.
   setUp(scn.inject(rampUsers(users) over (rampTime seconds)))
 }

As you can see, the simulation reads its parameters from environment variables: the URL to request, the total number of requests, the number of simultaneous users, and the ramp-up time.

We’ll also need to install Gatling. It’s a Scala application, but it ships with a complex UI and IDE components as well, so I’ve included a helper script called install_gatling.sh in the book’s code directory that will install it locally. Once Gatling is installed, we just need to set the environment variables the simulation expects and invoke Gatling, like this:

 $ export GATLING_URL=http://localhost:8080 GATLING_USERS=10
 $ export GATLING_REQUESTS=500 GATLING_RAMP_TIME=0
 $ gatling.sh

This will hit localhost with 500 requests from ten simultaneous connections, and should produce output like this:

 ================================================================================
 ---- Global Information --------------------------------------------------------
 > request count                                        500 (OK=500    KO=0     )
 > min response time                                      6 (OK=6      KO=-     )
 > max response time                                    266 (OK=266    KO=-     )
 > mean response time                                    42 (OK=42     KO=-     )
 > std deviation                                         65 (OK=65     KO=-     )
 > response time 50th percentile                         35 (OK=35     KO=-     )
 > response time 75th percentile                        130 (OK=130    KO=-     )
 > response time 95th percentile                        140 (OK=140    KO=-     )
 > response time 99th percentile                        171 (OK=171    KO=-     )
 > mean requests/sec                                    125 (OK=125    KO=-     )
 ---- Response Time Distribution ------------------------------------------------
 > t < 800 ms                                           500 (100%)
 > 800 ms < t < 1200 ms                                   0 (  0%)
 > t > 1200 ms                                            0 (  0%)
 > failed                                                 0 (  0%)
 ================================================================================

All of this is interesting data, but with respect to the five performance qualities we care about, we only need three numbers: the 50th percentile response time, the 99th percentile response time, and the error rate (the failed line at the bottom). Then we can put them in a table like this:

# of users | request count | 50th percentile (ms) | 99th percentile (ms) | error rate
10         | 500           | 35                   | 171                  | 0%
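Copying these figures out of the console report by hand gets tedious once we start running many experiments. Here is a rough sketch, not part of the book’s tooling, that scrapes the three values from a saved copy of Gatling’s console output; the file name results.txt is just a placeholder:

 import scala.io.Source
 
 // Rough sketch: pull the 50th/99th percentile and failed-request figures out
 // of a saved copy of Gatling's console report (results.txt is a placeholder).
 object ScrapeReport {
   def main(args: Array[String]): Unit = {
     val lines = Source.fromFile("results.txt").getLines().toList
     // Find the line containing `label`, drop the label, and take the first number.
     def figure(label: String): String =
       lines.find(_.contains(label))
         .flatMap(line => "\\d+".r.findFirstIn(line.replace(label, "")))
         .getOrElse("?")
     val p50    = figure("response time 50th percentile")
     val p99    = figure("response time 99th percentile")
     val failed = figure("> failed")
     println(s"50th=$p50 99th=$p99 failed=$failed")
   }
 }

Note that the failed figure is a raw count; dividing it by the request count gives the error rate for the table.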

Now, by gradually increasing the number of simulated users and requests, we can collect enough data to see the large-scale trends (one way to script this sweep is sketched after the table):

# of users | request count | 50th percentile (ms) | 99th percentile (ms) | error rate
10         | 500           | 35                   | 171                  | 0%
25         | 1250          | 71                   | 161                  | 0%
50         | 2500          | 164                  | 421                  | 0%
75         | 3750          | 222                  | 641                  | 0%
100        | 5000          | 228                  | 701                  | 0%
150        | 7500          | 324                  | 2724                 | 0%
200        | 10000         | 472                  | 4120                 | 0%
250        | 12500         | 506                  | 2777                 | 0%
300        | 15000         | 625                  | 1746                 | 0.4%
350        | 17500         | 528                  | 2592                 | 14%
400        | 20000         | 545                  | 2661                 | 29%
450        | 22500         | 576                  | 2997                 | 29%
500        | 25000         | 424                  | 4735                 | 33%
750        | 37500         | 536                  | 5137                 | 34%
1000       | 50000         | 464                  | 3861                 | 44%
1500       | 75000         | 555                  | 4485                 | 44%
2000       | 100000        | 559                  | 5484                 | 52%
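Here is the sweep-driver sketch mentioned above. It shells out to gatling.sh once per user count via scala.sys.process, holding the ratio at 50 requests per user as in the table. It assumes gatling.sh is on your PATH; depending on your Gatling version, you may need additional flags to select the simulation without prompting.

 import scala.sys.process.Process
 
 // Sketch of a driver for the load sweep: one Gatling run per user count,
 // with requests held at 50 per user to match the table above.
 object Sweep {
   val userCounts = Seq(10, 25, 50, 75, 100, 150, 200, 250, 300,
                        350, 400, 450, 500, 750, 1000, 1500, 2000)
 
   def main(args: Array[String]): Unit =
     for (users <- userCounts) {
       val env = Seq(
         "GATLING_URL"       -> "http://localhost:8080",
         "GATLING_USERS"     -> users.toString,
         "GATLING_REQUESTS"  -> (users * 50).toString,
         "GATLING_RAMP_TIME" -> "0"
       )
       // Run Gatling with the environment variables the simulation expects.
       val exit = Process(Seq("gatling.sh"), None, env: _*).!
       println(s"users=$users -> gatling exited with $exit")
     }
 }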

What can we see here? Real data are always noisy, but there are two clearly visible trends. Our typical performance at the 50th percentile (or median) starts out at 35 milliseconds, but increases almost linearly until it reaches 625ms at 300 users. After 300 users, our median response decreases slightly, and then plateaus around 500ms, but the error rate starts to shoot up very rapidly. If we were optimizing for maximum throughput with minimal errors, we could aim to run this server at a capacity of about 300 users.

However, there’s another trend visible in the 99th percentile response, which is relevant to our tail latency criterion. Especially in distributed systems, tail latencies matter: one slow call can have cascading effects in a complex transaction, and an especially long timeout can trigger the termination of an unresponsive server. Even under the lightest load, we see some ugly tail latencies of 171ms, compared to the median of 35ms. With 50 users, the 99th percentile shoots up to 421ms, and at 150 users it reaches 2724ms.

Thus, we have two different ways we could optimize a deployment of this program at scale. If we were responsible for operating this server in a latency-sensitive situation, we could scale it to around 100 users per instance to keep the tail latencies under control. On the other hand, if we needed to maximize throughput, we could run each instance at around 300 users instead.