Chapter 6

Using Bulk Data Operations with Collections

IN THIS CHAPTER

Understanding basic stream operations

Examining the stream interface

Filtering and sorting streams

Computing sums, averages, and other values

One of the most common things to do with a collection is to iterate over it, performing some type of operation on all of its elements. For example, you might use a for each loop to print all of the elements. The body of the for each loop might contain an if statement to select which elements to print. Or it might perform a calculation such as accumulating a grand total or counting the number of elements that meet a given condition.

In Java, for each loops are easy to create and can be very powerful. However, they have one significant drawback: They iterate over the collection's elements one at a time, beginning with the first element and proceeding sequentially to the last element. As a result, a for each loop must be executed sequentially within a single thread.

That’s a shame, since modern computers have multicore processors that are capable of doing several things at once. Wouldn’t it be great if you could divide a for each loop into several parts, each of which can be run independently on one of the processor cores? For a small collection, the user probably wouldn’t notice the difference. But if the collection is extremely large (say, a few million elements), unleashing the power of a multicore processor could make the program run much faster.

While you can do that with earlier versions of Java, the programming is tricky. You have to master one of the more difficult aspects of programming in Java: working with threads, which are like separate sections of your program that can be executed simultaneously. You’ll learn the basics of working with threads in Book 5, Chapter 1. For now, take my word for it: Writing programs that work with large collections of data and take advantage of multiple threads was a difficult undertaking. At least until Java 8.

With Java 8 or later, you can use a feature called bulk data operations that’s designed specifically to attack this very problem. When you use bulk data operations, you do not directly iterate over the collection data using a for each loop. Instead, you simply provide the operations that will be done on each of the collection's elements and let Java take care of the messy details required to spread the work over multiple threads.

At the heart of the bulk data operations feature is a new type of object called a stream, defined by the Stream interface. A stream is simply a sequence of elements of any data type which can be processed sequentially or in parallel. The Stream interface provides methods that let you perform various operations such as filtering the elements or performing an operation on each of the elements.

Tip Don't confuse the streams described in this chapter with file streams. File streams are used to read and write data to disk files. The streams described in this chapter are used to process data extracted from collection classes.

Streams rely on the use of lambda expressions to pass the operations that are performed on stream elements. In fact, the primary reason Java’s developers introduced lambda expressions into the Java language was to facilitate streams. If you haven’t yet read Book 3, Chapter 7, I suggest you do so now, before reading further into this chapter. Otherwise you’ll find yourself hopelessly confused by the peculiar syntax of the lambda expressions used throughout this chapter.

In this chapter, you learn the basics of using streams to perform simple bulk data operations.

Looking At a Basic Bulk Data Operation

Suppose you have a list of spells used by a certain wizard who, for copyright purposes, we’ll refer to simply as HP. The spells are represented by a class named Spell, which is defined as follows:

    public class Spell
    {
        public String name;
        public SpellType type;
        public String description;

        public enum SpellType {SPELL, CHARM, CURSE}

        public Spell(String spellName, SpellType spellType,
            String spellDescription)
        {
            name = spellName;
            type = spellType;
            description = spellDescription;
        }

        public String toString()
        {
            return name;
        }
    }

As you can see, the Spell class has three public fields that represent the spell’s name, type (SPELL, CHARM, or CURSE), and description, as well as a constructor that lets you specify the name, type, and description for the spell. Also, the toString() method is overridden to return simply the spell name.

Let’s load a few of HP’s spells into an ArrayList:

    ArrayList<Spell> spells = new ArrayList<>();
    spells.add(new Spell("Aparecium", Spell.SpellType.SPELL,
        "Makes invisible ink appear."));
    spells.add(new Spell("Avis", Spell.SpellType.SPELL,
        "Launches birds from your wand."));
    spells.add(new Spell("Engorgio", Spell.SpellType.CHARM,
        "Enlarges something."));
    spells.add(new Spell("Fidelius", Spell.SpellType.CHARM,
        "Hides a secret within someone."));
    spells.add(new Spell("Finite Incatatum", Spell.SpellType.SPELL,
        "Stops all current spells."));
    spells.add(new Spell("Locomotor Mortis", Spell.SpellType.CURSE,
        "Locks an opponent's legs."));

Now, suppose you want to list the name of each spell on the console. You could do that using a for each loop like this:

    for (Spell spell : spells)
        System.out.println(spell.name);

Written with streams, the code would look like this:

spells.stream().forEach(s -> System.out.println(s));

Here, I first use the stream method of the ArrayList class to convert the ArrayList to a stream. Every class that implements the java.util.Collection interface has a stream method that returns a Stream object. That includes not only ArrayList, but also LinkedList and Stack.

Next, I use the stream's forEach method to iterate the stream, passing a lambda expression that calls System.out.println for each item in the stream. The forEach method processes the entire stream, writing each element to the console.
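
Because the stream method comes from the Collection interface, the same pipeline should work no matter which collection class holds the data. Here is a minimal sketch of that idea. The class name CollectionStreamDemo and the helper method visit are made up for this illustration, and plain strings stand in for Spell objects to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.Stack;

// Hypothetical demo class; not part of the chapter's Spell example.
class CollectionStreamDemo
{
    // Streams any Collection and records each element that forEach visits.
    static List<String> visit(Collection<String> source)
    {
        List<String> seen = new ArrayList<>();
        source.stream().forEach(s -> seen.add(s));
        return seen;
    }

    public static void main(String[] args)
    {
        List<String> names = Arrays.asList("Aparecium", "Avis", "Engorgio");

        LinkedList<String> linked = new LinkedList<>(names);
        Stack<String> stack = new Stack<>();
        stack.addAll(names);

        // The same stream-and-forEach pipeline works for both collections.
        linked.stream().forEach(s -> System.out.println(s));
        stack.stream().forEach(s -> System.out.println(s));
    }
}
```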

Suppose you want to list just the spells, not the charms or curses. Using a traditional for each loop, you'd do it like this:

    for (Spell spell : spells)
    {
        if (spell.type == Spell.SpellType.SPELL)
            System.out.println(spell.name);
    }

Here an if statement selects just the spells so that the charms and curses aren’t listed.

Here’s the same thing using streams:

    spells.stream()
        .filter(s -> s.type == Spell.SpellType.SPELL)
        .forEach(s -> System.out.println(s));

In this example, the stream method converts the ArrayList to a stream. Then the stream’s filter method is used to select just the SPELL items. Finally, the forEach method sends the selected items to the console. Notice that lambda expressions are used in both the forEach method and the filter method.

The filter method of the Stream class returns a Stream object. Thus, it is possible to apply a second filter to the result of the first filter, like this:

    spells.stream()
        .filter(s -> s.type == Spell.SpellType.SPELL)
        .filter(s -> s.name.toLowerCase().startsWith("a"))
        .forEach(s -> System.out.println(s));

In this example, just the spells that start with the letter A are listed.
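
If you want to try the two-filter pipeline yourself, the following sketch pulls the pieces into one small program. It is only one way to arrange things: the wrapper class SpellFilterDemo and the methods sampleSpells and spellsStartingWith are names invented for this illustration, and the Spell class is trimmed to the two fields the filters need.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper class that bundles a trimmed-down Spell with the pipeline.
class SpellFilterDemo
{
    enum SpellType {SPELL, CHARM, CURSE}

    static class Spell
    {
        public String name;
        public SpellType type;

        Spell(String spellName, SpellType spellType)
        {
            name = spellName;
            type = spellType;
        }

        public String toString()
        {
            return name;
        }
    }

    // Builds the same six spells used in the chapter (descriptions omitted).
    static List<Spell> sampleSpells()
    {
        List<Spell> spells = new ArrayList<>();
        spells.add(new Spell("Aparecium", SpellType.SPELL));
        spells.add(new Spell("Avis", SpellType.SPELL));
        spells.add(new Spell("Engorgio", SpellType.CHARM));
        spells.add(new Spell("Fidelius", SpellType.CHARM));
        spells.add(new Spell("Finite Incatatum", SpellType.SPELL));
        spells.add(new Spell("Locomotor Mortis", SpellType.CURSE));
        return spells;
    }

    // Applies both filters and returns the names that survive them.
    static List<String> spellsStartingWith(List<Spell> spells, String prefix)
    {
        List<String> result = new ArrayList<>();
        spells.stream()
            .filter(s -> s.type == SpellType.SPELL)
            .filter(s -> s.name.toLowerCase().startsWith(prefix))
            .forEach(s -> result.add(s.name));
        return result;
    }

    public static void main(String[] args)
    {
        spellsStartingWith(sampleSpells(), "a")
            .forEach(n -> System.out.println(n));
    }
}
```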

Technical stuff The term pipeline is often used to describe a sequence of method calls that start by creating a stream, then manipulate the stream in various ways by calling methods such as filter, and finally end by calling a method that does not return another stream object, such as forEach.

Looking Closer at the Stream Interface

The Stream interface defines about 40 methods. In addition, three related interfaces (DoubleStream, IntStream, and LongStream) define operations that are specific to a single primitive data type: double, int, or long. These interfaces don't extend Stream itself, but they offer many of the same methods. Table 6-1 lists the most commonly used methods of these interfaces.

TABLE 6-1 The Stream and Related Interfaces

| Methods that Return Streams | Explanation |
| --- | --- |
| Stream<T> distinct() | Returns a stream consisting of distinct elements of the input stream. In other words, duplicates are removed. |
| Stream<T> limit(long maxSize) | Returns a stream having no more than maxSize elements derived from the input stream. |
| Stream<T> filter(Predicate<? super T> predicate) | Returns a stream consisting of those elements in the input stream that match the conditions of the predicate. |
| Stream<T> sorted() | Returns the stream elements in sorted order using the natural sorting method for the stream's data type. |
| Stream<T> sorted(Comparator<? super T> comparator) | Returns the stream elements in sorted order using the specified comparator. The comparator's compare method accepts two arguments and returns a negative value if the first is less than the second, zero if they are equal, and a positive value if the first is greater than the second. |

| Mapping Methods | Explanation |
| --- | --- |
| <R> Stream<R> map(Function<? super T,? extends R> mapper) | Returns a stream created by applying the mapper function to each element of the input stream. |
| DoubleStream mapToDouble(ToDoubleFunction<? super T> mapper) | Returns a DoubleStream created by applying the mapper function to each element of the input stream. |
| IntStream mapToInt(ToIntFunction<? super T> mapper) | Returns an IntStream created by applying the mapper function to each element of the input stream. |
| LongStream mapToLong(ToLongFunction<? super T> mapper) | Returns a LongStream created by applying the mapper function to each element of the input stream. |

| Terminal and Aggregate Methods | Explanation |
| --- | --- |
| void forEach(Consumer<? super T> action) | Executes the action against each element of the input stream. |
| void forEachOrdered(Consumer<? super T> action) | Executes the action against each element of the input stream, ensuring that the elements of the input stream are processed in order. |
| long count() | Returns the number of elements in the stream. |
| Optional<T> max(Comparator<? super T> comparator) | Returns the largest element in the stream, as determined by the comparator. The result is empty if the stream is empty. |
| Optional<T> min(Comparator<? super T> comparator) | Returns the smallest element in the stream, as determined by the comparator. The result is empty if the stream is empty. |
| OptionalDouble average() | Returns the average value of the elements in the stream. Valid only for DoubleStream, IntStream, and LongStream. |
| resultType sum() | Returns the sum of the elements in the stream. Result type is double for DoubleStream, int for IntStream, and long for LongStream. |
| resultType summaryStatistics() | Returns a summary statistics object that includes property methods named getCount, getSum, getAverage, getMax, and getMin for the elements in the stream. The result type is IntSummaryStatistics for an IntStream, DoubleSummaryStatistics for a DoubleStream, and LongSummaryStatistics for a LongStream. |

The first group in Table 6-1 consists of methods that return other Stream objects. Each of these methods manipulates the stream in some way, then passes the altered stream down the pipeline to be processed by another operation.

The filter method is one of the most commonly used stream methods. Its argument, called a predicate, is a function that returns a boolean value. The function is called once for every element in the stream and is passed a single argument that contains the element in question. If the function returns true, the element is passed on to the result stream. If it returns false, the element is not passed on.

The easiest way to implement a filter predicate is to use a lambda expression that specifies a conditional expression. For example, the following lambda expression inspects the name field of the stream element and returns true if it begins with the letter a (upper- or lowercase):

s -> s.name.toLowerCase().startsWith("a")
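
A predicate doesn't have to be written inline. As a rough sketch, you can store a lambda in a variable of type Predicate and even combine two predicates with the and method of the Predicate interface before handing the result to filter. The class PredicateDemo and its select method are names I made up for this illustration, and plain strings stand in for Spell objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical demo class showing predicates stored in variables.
class PredicateDemo
{
    // Returns the names for which the predicate returns true.
    static List<String> select(List<String> names, Predicate<String> test)
    {
        List<String> result = new ArrayList<>();
        names.stream().filter(test).forEach(n -> result.add(n));
        return result;
    }

    public static void main(String[] args)
    {
        List<String> names =
            Arrays.asList("Aparecium", "Avis", "Engorgio", "Fidelius");

        // Each predicate is a function that returns a boolean value.
        Predicate<String> startsWithA = s -> s.toLowerCase().startsWith("a");
        Predicate<String> longName = s -> s.length() > 4;

        System.out.println(select(names, startsWithA));
        // and() builds a predicate that is true only when both are true.
        System.out.println(select(names, startsWithA.and(longName)));
    }
}
```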

The other methods in the first group let you remove duplicate elements, limit the number of elements in a stream, or sort the elements of the stream. To sort a stream, you can use the elements' natural sorting order, or you can supply your own comparator, either as a lambda expression or as an object that implements the Comparator interface.
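
The following sketch shows how distinct, sorted, and limit might be chained together, along with a sort that uses a lambda expression as its comparator. The class SortLimitDemo and its two methods are hypothetical names chosen for this illustration, and the list of names (with one deliberate duplicate) is sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class for distinct, sorted, and limit.
class SortLimitDemo
{
    // Removes duplicates, sorts alphabetically, and keeps at most max names.
    static List<String> firstDistinctSorted(List<String> names, long max)
    {
        List<String> result = new ArrayList<>();
        names.stream()
            .distinct()
            .sorted()
            .limit(max)
            .forEach(n -> result.add(n));
        return result;
    }

    // Sorts by name length using a comparator written as a lambda expression.
    static List<String> sortedByLength(List<String> names)
    {
        List<String> result = new ArrayList<>();
        names.stream()
            .sorted((a, b) -> Integer.compare(a.length(), b.length()))
            .forEach(n -> result.add(n));
        return result;
    }

    public static void main(String[] args)
    {
        List<String> names = Arrays.asList(
            "Fidelius", "Avis", "Engorgio", "Avis", "Aparecium");

        System.out.println(firstDistinctSorted(names, 3));
        System.out.println(sortedByLength(names));
    }
}
```

The comparator lambda follows the rule described in Table 6-1: Integer.compare returns a negative value, zero, or a positive value depending on how the two lengths compare.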

The second group of methods in Table 6-1 are called mapping methods because they convert a stream whose elements are of one type to a stream whose elements are of another type. The mapping function, which you must pass as a parameter, is responsible for converting the data from the first type to the second.

One common use for mapping methods is to convert a stream of complex types to a stream of simple numeric values of type double, int, or long, which you can then use to perform an aggregate calculation such as sum or average. For example, suppose HP's spells were for sale and the Spell class included a public field named price. To calculate the average price of all the spells, you would first have to convert the stream of Spell objects to a stream of doubles. To do that, you use the mapToDouble method. The mapping function would simply return the price field:

.mapToDouble(s -> s.price)
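
To see the mapping methods in a complete program, consider this sketch. It assumes a pared-down Spell class with the hypothetical price field mentioned above, and the prices themselves are invented sample values. The class MappingDemo and its methods are likewise names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class for map, mapToDouble, and mapToInt.
class MappingDemo
{
    static class Spell
    {
        public String name;
        public double price;   // assumed field; prices are invented

        Spell(String spellName, double spellPrice)
        {
            name = spellName;
            price = spellPrice;
        }
    }

    // map converts a stream of Spell objects to a stream of String objects.
    static List<String> names(List<Spell> spells)
    {
        List<String> result = new ArrayList<>();
        spells.stream()
            .map(s -> s.name)
            .forEach(n -> result.add(n));
        return result;
    }

    // mapToDouble produces a DoubleStream, which offers the sum method.
    static double totalPrice(List<Spell> spells)
    {
        return spells.stream()
            .mapToDouble(s -> s.price)
            .sum();
    }

    // mapToInt produces an IntStream, here holding the length of each name.
    static int totalNameLength(List<Spell> spells)
    {
        return spells.stream()
            .mapToInt(s -> s.name.length())
            .sum();
    }

    public static void main(String[] args)
    {
        List<Spell> spells = Arrays.asList(
            new Spell("Aparecium", 2.5),
            new Spell("Avis", 4.0),
            new Spell("Engorgio", 1.5));

        System.out.println(names(spells));
        System.out.println("Total price = " + totalPrice(spells));
        System.out.println("Total name length = " + totalNameLength(spells));
    }
}
```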

Methods in the last group in Table 6-1 are called terminal methods because they do not return another stream. As a result, they are always the last methods called in stream pipelines. Note that if you don’t call a terminal method, no data from the stream will be processed — the terminal method is what gets the ball rolling.
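
You can watch this behavior with a small experiment. The sketch below counts how many times a filter predicate is called, first after the pipeline has been built but before any terminal method runs, and then again after forEach is called. The class LazyPipelineDemo and the method countPredicateCalls are hypothetical names, and the AtomicInteger counter is just a convenient way to count from inside a lambda expression.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Hypothetical demo class showing that nothing runs until a terminal method is called.
class LazyPipelineDemo
{
    // Returns the number of predicate calls before and after the terminal method.
    static int[] countPredicateCalls(List<String> names)
    {
        AtomicInteger calls = new AtomicInteger();

        // Building the pipeline does not process any elements yet.
        Stream<String> pending = names.stream()
            .filter(n -> {
                calls.incrementAndGet();
                return n.startsWith("A");
            });
        int before = calls.get();

        // The terminal method is what pulls the elements through the filter.
        pending.forEach(n -> { });
        int after = calls.get();

        return new int[] {before, after};
    }

    public static void main(String[] args)
    {
        int[] counts = countPredicateCalls(
            Arrays.asList("Aparecium", "Avis", "Engorgio"));
        System.out.println("Calls before forEach: " + counts[0]);
        System.out.println("Calls after forEach: " + counts[1]);
    }
}
```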

You have already seen the forEach method in action; it accepts a function that is called once for each element in the stream. Note that in the examples so far, the function to be executed on each element has consisted of just a single method call, so I’ve included it directly in the lambda expression. If the function is more complicated, you can isolate it in its own method. Then the lambda expression should call the method that defines the function.
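
Here is one way that might look. In this sketch, the work of formatting and printing a spell lives in separate methods, and the lambda expression passed to forEach simply calls one of them. The class HelperMethodDemo and the methods formatSpell, printSpell, and printAll are invented for this illustration, and the Spell class is trimmed to a name and a description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class in which the lambda delegates to a helper method.
class HelperMethodDemo
{
    static class Spell
    {
        public String name;
        public String description;

        Spell(String spellName, String spellDescription)
        {
            name = spellName;
            description = spellDescription;
        }
    }

    // Builds the line of text to print for one spell.
    static String formatSpell(Spell s)
    {
        return s.name + ": " + s.description;
    }

    // The more complicated per-element work is isolated here.
    static void printSpell(Spell s)
    {
        System.out.println(formatSpell(s));
    }

    static void printAll(List<Spell> spells)
    {
        // The lambda expression just calls the helper method.
        spells.stream().forEach(s -> printSpell(s));
    }

    public static void main(String[] args)
    {
        printAll(Arrays.asList(
            new Spell("Aparecium", "Makes invisible ink appear."),
            new Spell("Avis", "Launches birds from your wand.")));
    }
}
```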

Aggregate methods perform a calculation on all of the elements in the stream, then return the result. Of the aggregate methods, count is straightforward: It simply returns the number of elements in the stream. The max, min, and average methods need a little explanation because they return an optional data type. An optional data type is an object that might contain a value, or it might not.

For example, the average method calculates the average value of a stream of integers, longs, or doubles and returns the result as an OptionalDouble. If the stream was empty, the average is undefined, so the OptionalDouble contains no value. You can determine if the OptionalDouble contains a value by calling its isPresent method, which returns true if there is a value present. If there is a value, you can get it by calling the getAsDouble method.

Warning Note that getAsDouble throws a NoSuchElementException if no value is present, so you should always call isPresent before you call getAsDouble.

Here’s an example that calculates the average price of spells:

    OptionalDouble avg = spells.stream()
        .mapToDouble(s -> s.price)
        .average();

Here is how you would write the average price to the console:

    if (avg.isPresent())
    {
        System.out.println("Average = "
            + avg.getAsDouble());
    }
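
The sketch below puts the average calculation into a complete program and also shows what happens when the list is empty. As before, it assumes a Spell class with a price field, and the prices are invented sample values. The class AveragePriceDemo and its methods are hypothetical names. The last method uses summaryStatistics from Table 6-1 to get several values in one pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.OptionalDouble;

// Hypothetical demo class for average, isPresent, and summaryStatistics.
class AveragePriceDemo
{
    static class Spell
    {
        public String name;
        public double price;   // assumed field; prices are invented

        Spell(String spellName, double spellPrice)
        {
            name = spellName;
            price = spellPrice;
        }
    }

    static OptionalDouble averagePrice(List<Spell> spells)
    {
        return spells.stream()
            .mapToDouble(s -> s.price)
            .average();
    }

    // Checks isPresent before calling getAsDouble.
    static String report(List<Spell> spells)
    {
        OptionalDouble avg = averagePrice(spells);
        if (avg.isPresent())
            return "Average = " + avg.getAsDouble();
        return "No spells, so there is no average";
    }

    // Gathers the count, sum, average, minimum, and maximum in one pass.
    static DoubleSummaryStatistics priceStats(List<Spell> spells)
    {
        return spells.stream()
            .mapToDouble(s -> s.price)
            .summaryStatistics();
    }

    public static void main(String[] args)
    {
        List<Spell> spells = Arrays.asList(
            new Spell("Aparecium", 2.0),
            new Spell("Avis", 4.0),
            new Spell("Engorgio", 6.0));

        System.out.println(report(spells));
        System.out.println(report(new ArrayList<Spell>()));

        DoubleSummaryStatistics stats = priceStats(spells);
        System.out.println("Lowest price = " + stats.getMin());
        System.out.println("Highest price = " + stats.getMax());
    }
}
```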

Using Parallel Streams

Streams come in two basic flavors: sequential and parallel. Sequential streams are created by the stream method and are processed one element after the next. A parallel stream, in contrast, can take advantage of a multicore processor by breaking its elements into two or more smaller streams, performing operations on them, and then recombining the separate streams to create the final result. Each of the intermediate streams can be processed by a separate thread, which can improve performance for large streams.

By default, streams are sequential. But creating a parallel stream is easy: Just use the parallelStream method instead of the stream method at the beginning of the pipeline.
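
As a rough illustration, the following sketch adds up a million numbers twice, once with a sequential stream and once with a parallel stream. The class ParallelSumDemo and its methods are hypothetical names. The point is only that both pipelines should arrive at the same total; whether the parallel version actually runs faster depends on your computer and the size of the collection, so I make no promises about timing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class comparing sequential and parallel sums.
class ParallelSumDemo
{
    // Builds a list holding the numbers 1 through n.
    static List<Integer> numbers(int n)
    {
        List<Integer> values = new ArrayList<>(n);
        for (int i = 1; i <= n; i++)
            values.add(i);
        return values;
    }

    static long sequentialSum(List<Integer> values)
    {
        return values.stream()
            .mapToLong(i -> i.longValue())
            .sum();
    }

    // Only the first method call differs from the sequential version.
    static long parallelSum(List<Integer> values)
    {
        return values.parallelStream()
            .mapToLong(i -> i.longValue())
            .sum();
    }

    public static void main(String[] args)
    {
        List<Integer> values = numbers(1_000_000);
        System.out.println("Sequential sum = " + sequentialSum(values));
        System.out.println("Parallel sum = " + parallelSum(values));
    }
}
```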

For example, to print all of HP’s spells using a parallel stream, use this code:

    spells.parallelStream()
        .forEach(s -> System.out.println(s));

Note that when you use a parallel stream, you can’t predict the order in which each element of the stream is processed. That’s because when the stream is split and run on two or more threads, the order in which the processor executes the threads is not predictable.

To demonstrate this point, consider this simple example:

    System.out.println("First parallel stream: ");
    spells.parallelStream()
        .forEach(s -> System.out.println(s));
    System.out.println("\nSecond parallel stream: ");
    spells.parallelStream()
        .forEach(s -> System.out.println(s));

When you execute this code, the results will look something like this:

    First parallel stream:
    Fidelius
    Finite Incatatum
    Engorgio
    Locomotor Mortis
    Aparecium
    Avis

    Second parallel stream:
    Fidelius
    Engorgio
    Finite Incatatum
    Locomotor Mortis
    Avis
    Aparecium

Notice that although the same spells are printed for each of the streams, they are printed in a different order.
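
If the order matters, the forEachOrdered method listed in Table 6-1 is worth a look. The sketch below uses it on a parallel stream and records the order in which the elements arrive. The class OrderedParallelDemo and the method inEncounterOrder are names invented for this illustration, and plain strings stand in for Spell objects. Keep in mind that insisting on order may give back some of the speed that a parallel stream can offer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class showing forEachOrdered on a parallel stream.
class OrderedParallelDemo
{
    // Records the elements in the order forEachOrdered delivers them.
    static List<String> inEncounterOrder(List<String> names)
    {
        List<String> seen = new ArrayList<>();
        // forEachOrdered handles one element at a time, in list order.
        names.parallelStream().forEachOrdered(n -> seen.add(n));
        return seen;
    }

    public static void main(String[] args)
    {
        List<String> names = Arrays.asList(
            "Aparecium", "Avis", "Engorgio",
            "Fidelius", "Finite Incatatum", "Locomotor Mortis");

        inEncounterOrder(names).forEach(n -> System.out.println(n));
    }
}
```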