In the last chapter, we wrote a program to extract the maximum count from a Google NGrams file in a fraction of the time a traditional Scala program would take. Our goal for this chapter is to read the entire NGrams file, aggregate word occurrences by year so that the counts of each word for all years are grouped together, and output the twenty most frequent words in the file. Before we can do that, we’ll need to solve a few different subproblems, and we’ll need to explore a few more advanced Scala Native techniques, too.
We’ll start by looking at heap memory allocation, which offers a good alternative to the stack for bulk, long-lived data, like our NGram file. I’ll show you how to use structs to model the multiple fields of our NGram records in a way that works well with C libraries and data structures. Then we’ll design an array-based mechanism for storing our data in a contiguous, growing region of memory. With those techniques under our belt, we can read and sort our whole data file. We can then add aggregation as a relatively small modification of our previous code, but with a very noticeable impact on performance.
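As a rough preview of how those pieces fit together, the sketch below pairs a C-style struct with a heap-allocated buffer that doubles in size as it fills. The field layout (`CString` word, `Int` count, `Int` year), the object name, and the loop data are all hypothetical placeholders, and the snippet assumes the Scala Native 0.4 `unsafe` API; the real record type and growth strategy are developed later in the chapter.

```scala
import scala.scalanative.unsafe._
import scala.scalanative.unsigned._
import scala.scalanative.libc.stdlib

object GrowingBufferSketch {
  // Hypothetical record layout: a word, its count, and its year.
  type NGramData = CStruct3[CString, Int, Int]

  def main(args: Array[String]): Unit = {
    var capacity = 2
    var size = 0
    // Allocate an initial block big enough for `capacity` records.
    var block: Ptr[NGramData] = stdlib
      .malloc(sizeof[NGramData] * capacity.toULong)
      .asInstanceOf[Ptr[NGramData]]

    for (year <- 1999 to 2004) {
      // When the block is full, double its capacity with realloc,
      // which preserves the existing contents in place (or moves them).
      if (size >= capacity) {
        capacity *= 2
        block = stdlib
          .realloc(block.asInstanceOf[Ptr[Byte]],
                   sizeof[NGramData] * capacity.toULong)
          .asInstanceOf[Ptr[NGramData]]
      }
      val record = block + size // pointer arithmetic: the size-th record
      record._2 = 1             // count field
      record._3 = year          // year field
      size += 1
    }

    println(s"stored $size records; capacity grew to $capacity")
    stdlib.free(block.asInstanceOf[Ptr[Byte]])
  }
}
```

Because the records live in one contiguous heap block, they survive past any single function call and can be handed directly to C routines such as `qsort`, which is what makes this layout attractive for the bulk data we’re about to read.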
To begin, let’s look at how memory works and how we can have different areas of memory with different properties.