Chapter 1. Data I/O

Events happen all around us, continuously. Occasionally, we make a record of a discrete event at a certain point in time and space. We can then define data as a collection of records that someone (or something) took the time to write down or present in any format imaginable. As data scientists, we work with data in files, databases, web services, and more. Usually, someone has gone through a lot of trouble to define a schema or data model that precisely denotes the names, types, tolerances, and inter-relationships of all the variables. However, it is not always possible to enforce a schema during data acquisition. Real data (even in well-designed databases) often has missing values, misspellings, incorrectly formatted types, duplicate representations for the same value, and the worst: several variables concatenated into one. Although you are probably excited to implement machine-learning algorithms and create stunning graphics, the most important and time-consuming aspect of data science is preparing the data and ensuring its integrity.

What Is Data, Anyway?

Your ultimate goal is to retrieve data from its source, reduce the data via statistical analysis or learning, and then present some kind of knowledge about what was learned, usually in the form of a graph. However, even if your result is a single value such as the total revenue, most engaged user, or a quality factor, you still follow the same protocol: input data → reductive analysis → output data.

Considering that practical data science is driven by business questions, it will be to your advantage to examine this protocol from right to left. First, formalize the question you are trying to answer. For example, do you require a list of top users by region, a prediction of daily revenue for the next week, or a plot of the distribution of similarities between items in inventory? Next, explore the chain of analyses that can answer your questions. Finally, now that you have decided on your approach, exactly what data will you need to accomplish this goal? You may be surprised to find that you do not have the data required. Often you will discover that a much simpler set of analysis tools (than you originally envisioned) will be adequate to achieve the desired output.

In this chapter, you will explore the finer details of reading and writing data from a variety of sources. It is important to ask yourself what data model is required for any subsequent steps. Perhaps it will suffice to build a series of numerical array types (e.g., double[][], int[], String[]) to contain the data. On the other hand, you may benefit from creating a container class to hold each data record, and then populating a List or Map with those objects. Still another useful data model is to formulate each record as a set of key-value pairs in a JavaScript Object Notation (JSON) document. The decision of what data model to choose rests largely on the input requirements of the subsequent data-consuming processes.

Data Models

What form is the data in, and what form do you need to transform it to so you can move forward? Suppose somefile.txt contained rows of id, year, and city data.
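A minimal sketch of a container class for those records might look like this (a sketch; it has the same shape as the Record objects used later in this chapter):

/* a simple container class for one row of somefile.txt */
public class Record {
    public int id;
    public int year;
    public String city;
}

A List<Record> (or a Map keyed on id) can then hold the parsed dataset.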

Dealing with Real Data

Real data is messy, incomplete, incorrect, and sometimes incoherent. If you are working with a “perfect” dataset, it’s because someone else spent a great deal of time and effort in getting it that way. It is also possible that your data is, in fact, not perfect, and you are unwittingly performing analyses on junk data. The only way to be sure is to get data from the source and process it yourself. This way, if there is a mistake, you know who to blame.

Outliers

Now that our data is cleaned and parsed, we can check whether the value is acceptable given our requirements. If we were expecting a value of either 0 or 1 and we get a 2, the value is clearly out of range and we can designate this data point as an outlier. As in the case with nulls and blanks, we can perform a Boolean test on the value to determine whether it is within an acceptable range of values. This is good for numeric types as well as strings and date times.

In the case of checking ranges with numeric types, we need to know the minimum and maximum acceptable values and whether they are inclusive or exclusive. For example, if we set minValue = 1.0 and minValueInclusive = true, all values greater than or equal to 1.0 will pass the test. If we set minValueInclusive = false, only values greater than 1.0 will pass the test. Here is the code:

public boolean checkRange(double value) {
    boolean minBit = (minValueInclusive) ? value >= minValue : value > minValue;
    boolean maxBit = (maxValueInclusive) ? value <= maxValue : value < maxValue;
    return minBit && maxBit;
}

Similar methods can be written for integer types.
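For example, assuming analogous int fields for the bounds, the integer version is nearly identical:

/* integer range check; assumes int minValue/maxValue fields and the same inclusive flags */
public boolean checkRange(int value) {
    boolean minBit = (minValueInclusive) ? value >= minValue : value > minValue;
    boolean maxBit = (maxValueInclusive) ? value <= maxValue : value < maxValue;
    return minBit && maxBit;
}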

We can also check whether a string value is in an acceptable range by setting an enumeration of valid strings. This can be done by creating a Set instance of valid strings called, for example, validItems, where the Set.contains() method can be used to test the validity of an input value:

private boolean checkRange(String value) {
    return validItems.contains(value);
}

For DateTime objects, we can check whether a date is after a minimum date and before a maximum date. In this case, we define the min and max as OffsetDateTime objects and then test whether the input date time is between the min and max. Note that OffsetDateTime.isBefore() and OffsetDateTime.isAfter() are exclusive. If the input date time is equal to either the min or max, the test will fail. Here is the code:

private boolean checkRange(OffsetDateTime odt) {
    return odt.isAfter(minDate) && odt.isBefore(maxDate);
}
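If the bounds should be inclusive instead, a sketch that negates the opposite comparisons will accept a date time equal to either the min or the max:

private boolean checkRangeInclusive(OffsetDateTime odt) {
    /* !isBefore(minDate) means odt >= minDate; !isAfter(maxDate) means odt <= maxDate */
    return !odt.isBefore(minDate) && !odt.isAfter(maxDate);
}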

Managing Data Files

This is where the art of data science begins! How you choose to build a dataset is not only a matter of efficiency, but also one of flexibility. There are many options for reading and writing files. As a bare minimum, the entire contents of the file can be read into a String type by using a FileReader instance, and then the String can be parsed into the data model. For large files, however, memory problems are avoided by using a BufferedReader to read the file one line at a time. The strategy presented here is to parse each line as it is read, keeping only the values that are required and populating a data structure with those records. If there are 1,000 variables per line, and only three are required, there is no need to keep all of them. Likewise, if the data in a particular line does not meet certain criteria, there is also no need to keep it. For large datasets, this conserves resources compared to reading all the lines into a string array (String[]) and parsing it later. The more consideration you put into this step of managing data files, the better off you will be. Every step you take afterward, whether it’s statistics, learning, or plotting, will rely on your decisions when building a dataset. The old adage of “garbage in, garbage out” definitely applies.

Understanding File Contents First

Data files come in a bewildering array of configurations, with some undesirable features as a result. Recall that ASCII files are just a collection of ASCII characters printed to each line. There is no guarantee on the format or precision of a number, the use of single or double quotes, or the inclusion (or exclusion) of numerous space, null, and newline characters. In short, despite your assumptions as to the contents of the file, there can be almost anything on each line. Before reading in the file with Java, take a look at it in a text editor or with the command line. Note the number, position, and type of each item in a line. Pay close attention to how missing or null values are represented. Also note the type of delimiter and any headers describing the data. If the file is small enough, you can scan it visually for missing or incorrectly formatted lines. For example, say we look at the file somefile.txt with the Unix command less in a bash shell:

bash$ less somefile.txt

"id","year","city"
1,2015,"San Francisco"
2,2014,"New York"
3,2012,"Los Angeles"
...

We see a comma-separated values (CSV) dataset with the columns id, year, and city. We can quickly check the number of lines in the file:

bash$ wc -l somefile.txt
1025 somefile.txt

This indicates that there are 1,024 lines of data plus one line more for the header. Other formats are possible, such as tab-separated values (TSV), a “big string” format in which all the values are concatenated together, and JSON. For large files, you may want to take the first 100 or so lines and redirect them to an abridged file for purposes of developing your application:

bash$ head -100 filename > new_filename

In some cases, the data file is just too big for a pair of eyes to scan it for structure or errors. Clearly, you would have trouble examining a data file with 1,000 columns of data! Likewise, you are unlikely to find an error in formatting by scrolling through one million lines of data. In this case it is essential that you have an existing data dictionary that describes the format of the columns and the data types (e.g., integer, float, text) that are expected for each column. You can programmatically check each line of data as you parse the file via Java; exceptions can be thrown, and, perhaps, the entire contents of the offending line printed out so you can examine what went wrong.
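For example, a minimal sketch of such a check, assuming a BufferedReader br like the one introduced in the next section and an expected column count taken from the data dictionary:

/* flag any line whose column count disagrees with the data dictionary */
int expectedColumns = 3; // an assumption; take this from the data dictionary
int lineNumber = 0;
String line;
while ((line = br.readLine()) != null) {
    lineNumber++;
    String[] fields = line.split(",");
    if (fields.length != expectedColumns) {
        /* print the entire offending line for examination */
        System.err.println("malformed line " + lineNumber + ": " + line);
    }
}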

Reading from a Text File

The general approach for reading a text file is to create a FileReader instance surrounded by a BufferedReader that enables reading each line. Here, FileReader takes the argument of String filename, but FileReader can also take a File object as its argument. The File object is useful when filenames and paths are dependent on the operating system. This is the generic form for reading files line by line with a BufferedReader:

try(BufferedReader br = new BufferedReader(new FileReader("somefile.txt")) ) {
    String columnNames = br.readLine(); // ONLY do this if a header line exists
    String line;
    while ((line = br.readLine()) != null) {
        /* parse each line */
        // TODO
    }
} catch (Exception e) {
    System.err.println(e.getMessage()); // or log error
}

We can do the exact same thing if the file exists somewhere remotely:

URL url = new URL("http://storage.example.com/public-data/somefile.txt");        
try(BufferedReader br = new BufferedReader(
    new InputStreamReader(url.openStream())) ) {
    String columnNames = br.readLine(); // ONLY do this if a header line exists
    String line;
    while ((line = br.readLine()) != null) {
        // TODO parse each line
    }
} catch (Exception e) {
    System.err.println(e.getMessage()); // or log error
}

We just have to worry about how to parse each line.

Parsing delimited strings

Considering the popularity of spreadsheets and database dumps, it is highly likely you will be given a CSV dataset at some point. Parsing this kind of file could not be easier! Consider the data in our example formatted as a CSV file:

1,2015,"San Francisco"
2,2014,"New York"
3,2012,"Los Angeles"

Then all we need to do is parse each line with String.split(",") and use String.trim() to remove any pesky leading or trailing whitespace. It will also be necessary to remove any quotes around strings with String.replace("\"", ""):

/* parse each line */
String[] s = line.split(","); 
int id = Integer.parseInt(s[0].trim());
int year = Integer.parseInt(s[1].trim());
String city = s[2].trim().replace("\"", "");

In the next example, the data in somefile.txt has been separated by tabs:

1        2015        "San Francisco"
2        2014        "New York"
3        2012        "Los Angeles"

Splitting tab-delimited data is achieved by replacing code for String.split(",") in the preceding example with this:

String[] s = line.split("\t");

At some point, you will undoubtedly come across CSV files with fields that contain commas. One example is text taken from a user blog. Yet another example occurs when denormalized data is put into a column—for example, “San Francisco, CA” instead of having separate columns for city and state. This is quite tricky to parse and requires regex. Instead, why not use the Apache Commons CSV parser library?

/* parse each line */
CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
for(CSVRecord cr : parser) {
    int id = Integer.parseInt(cr.get(0));   // columns start at 0, and values come back as strings
    int year = Integer.parseInt(cr.get(1));
    String city = cr.get(2);
}

The Apache Commons CSV library also handles common formats including CSVFormat.EXCEL, CSVFormat.MYSQL, and CSVFormat.TDF.
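For example, the tab-delimited data shown earlier could be parsed by swapping in the TDF format (a sketch, assuming the same three columns):

/* parse a tab-delimited line with the TDF format */
CSVParser parser = CSVParser.parse(line, CSVFormat.TDF);
for(CSVRecord cr : parser) {
    int id = Integer.parseInt(cr.get(0));
    int year = Integer.parseInt(cr.get(1));
    String city = cr.get(2);
}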

Parsing JSON strings

JSON is a format derived from the serialization of JavaScript objects and can represent data of all types. This compact, easy-to-read format is ubiquitous in Internet data APIs (in particular, RESTful services) and is the standard format for many NoSQL solutions such as MongoDB and CouchDB. As of version 9.3, the PostgreSQL database offers a JSON data type and can query native JSON fields. The clear advantage is human readability; the structure of the data is readily visible, and with “pretty print,” even more so. In terms of Java, JSON is nothing more than a collection of HashMaps and ArrayLists, in any nested configuration imaginable. Each line of the data from the prior examples can be formatted as a JSON string by placing the values into key-value pairs; strings are in double quotes (not single quotes), and no trailing commas are allowed:

{"id":1, "year":2015, "city":"San Francisco"}
{"id":2, "year":2014, "city":"New York"}
{"id":3, "year":2012, "city":"Los Angeles"}

Note that the entire file itself is not technically a JSON object, and parsing the whole file as such will fail. To be valid JSON, each line would need to be separated by a comma and the entire group enclosed in square brackets, comprising a JSON array. However, writing this kind of structure would be inefficient and not useful. The data is much more convenient and usable as is: a line-by-line stack of JSON objects in string representation. Note that the JSON parser does not know the type of the values in the key-value pairs, so retrieve each value’s String representation and then parse it to its primitive type by using the boxed types (e.g., Integer.parseInt()). It is straightforward to build our dataset now, using the json-simple library (org.json.simple):

/* create JSON parser outside while loop */
JSONParser parser = new JSONParser();
...

/* create an object by casting the parsed string */
JSONObject obj = (JSONObject) parser.parse(line);
int id = Integer.parseInt(obj.get("id").toString());
int year = Integer.parseInt(obj.get("year").toString());
String city = obj.get("city").toString();

Reading from an Image File

When using images as input for learning, we need to convert from the image format (e.g., PNG) to a data structure that is appropriate, such as a matrix or vector. There are several points to consider here. First, an image is a two-dimensional array with coordinates, {x1, x2}, and a set of associated color or intensity values, {y1…}, that may be stored as a single, integer value. If all we want is the raw value stored in a 2D integer array (labeled data here), we read in the buffered image with this:

BufferedImage img = null;
try {
    img = ImageIO.read(new File("Image.png"));
    int height = img.getHeight();
    int width = img.getWidth();
    int[][] data = new int[height][width];
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            int rgb = img.getRGB(j, i); // getRGB takes (x, y); packed ARGB ints are often negative
            data[i][j] = rgb;
        }
    }
} catch (IOException e) {
    // handle exception
}

We may want to convert the integer into its RGB (red, green, blue) components by bit shifting the integer:

int blue = 0x0000ff & rgb;
int green = 0x0000ff & (rgb >> 8);
int red = 0x0000ff & (rgb >> 16);
int alpha = 0x0000ff & (rgb >> 24);

However, for images stored with a three-byte BGR layout (e.g., BufferedImage.TYPE_3BYTE_BGR), we can get this information natively from the raster with this:

byte[] pixels = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
for (int i = 0; i < pixels.length / 3 ; i++) {
    int blue = Byte.toUnsignedInt(pixels[3*i]);
    int green = Byte.toUnsignedInt(pixels[3*i+1]);
    int red = Byte.toUnsignedInt(pixels[3*i+2]);
}

Color may not be important. Perhaps grayscale is really all that’s needed:

//convert RGB to grayscale (0 to 1), where the color channels are on a scale of 0 to 255
double gray = (0.2126 * red + 0.7152 * green + 0.0722 * blue) / 255.0;

Also, in some cases the 2D representation is not necessary. We can convert an m × p matrix to a vector of length n = m × p by concatenating each row of the matrix, one after another, onto the new vector. In the well-known MNIST dataset of handwritten digits, the data has already been corrected (centered and cropped) and then converted into a binary format. So reading in that data requires a special format (see Appendix A), but it is already in vector (1D) as opposed to matrix (2D) format. Learning techniques on the MNIST dataset usually involve this vectorized format.
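A minimal sketch of this row-major flattening (the method name is illustrative):

/* flatten an m x p matrix into a vector of length n = m * p, row by row */
public static double[] flatten(double[][] matrix) {
    int m = matrix.length;
    int p = matrix[0].length;
    double[] vector = new double[m * p];
    for (int i = 0; i < m; i++) {
        System.arraycopy(matrix[i], 0, vector, i * p, p);
    }
    return vector;
}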

Writing to a Text File

Writing data to files has a general form of using the FileWriter class, but once again the recommended practice is to wrap it in a BufferedWriter for efficient, buffered output. The general concept is to format all the data you want to write to the file as a single string. For the three variables in our example, we can do this manually with a delimiter of choice (either a comma or \t):

/* for each instance Record record */
String output = Integer.toString(record.id) + "," +
        Integer.toString(record.year) + "," + record.city;

When using Java 8, the method String.join(delimiter, elements) is convenient!

/* in Java 8 */
String newString = String.join(",", "a", "b", "c");

/* or pass an Iterable, such as a List<String> */
String newString = String.join(",", myList);

Otherwise, you can instead use the Apache Commons Lang StringUtils.join(elements, delimiter) or the native StringBuilder class in a loop:

/* in Java 7 */
String[] strings = {"a", "b", "c"};

/* create a StringBuilder and add the first member */
StringBuilder sb = new StringBuilder();
sb.append(strings[0]);

/* skip the first string since we already have it */
for (int i = 1; i < strings.length; i++) {
    /* choose a delimiter here ... could also be a \t for tabs */
    sb.append(",");
    sb.append(strings[i]);
}

String newString = sb.toString();

Note that the compiler translates successive uses of myString += myString_part into StringBuilder operations anyway, so you might as well use StringBuilder directly. In any case, the strings are written line by line. Keep in mind that the method BufferedWriter.write(String) does not write a new line! You will have to include a call to BufferedWriter.newLine() if you would like each data record to be on its own line:

try(BufferedWriter bw = new BufferedWriter(new FileWriter("somefile.txt")) ) {
    for(String s : myStringList){
        bw.write(s);
        /* don't forget to append a new line! */
        bw.newLine();
    }
} catch (Exception e) {
    System.out.println(e.getMessage());
}

The preceding code overwrites all existing data in the file designated by the filename. In some situations, you will want to append data to an existing file. The FileWriter class takes an optional boolean argument, append, that defaults to false if it is excluded. To open a file for appending at the next available line, use this:

/* setting FileWriter append bit keeps existing data and appends new data */
try(BufferedWriter bw = new BufferedWriter(
    new FileWriter("somefile.txt", true))) {
    for(String s : myStringList){
        bw.write(s);
        /* don't forget to append a new line! */
        bw.newLine();
    }
} catch (Exception e) {
    System.out.println(e.getMessage());
}

Still another option is to use the PrintWriter class, which can wrap a BufferedWriter and has a method println() that appends the native newline character of whatever operating system you are on, so the \n can be excluded from the code. This has the advantage that you don’t have to worry about adding those pesky newline characters. It is also useful if you are generating text files on your own computer (and therefore OS) and will be consuming these files yourself. Here is an example using PrintWriter:

try(PrintWriter pw = new PrintWriter(new BufferedWriter(
    new FileWriter("somefile.txt"))) ) {
    for(String s : myStringList){
        /* adds a new line for you! */
        pw.println(s);
    }
} catch (Exception e) {
    System.out.println(e.getMessage());
}

Any of these methods work just fine with JSON data. Convert each JSON object to a String with the JSONObject.toString() method and write the String. If you are writing one JSON object, such as a configuration file, then it is as simple as this:

JSONObject obj = ...

try(BufferedWriter bw = new BufferedWriter(new FileWriter("somefile.txt")) ) {
    bw.write(obj.toString());
} catch (Exception e) {
    System.out.println(e.getMessage());
}

When creating a JSON data file (a stack of JSON objects), loop through your collection of JSONObjects:

List<JSONObject> dataList = ...

try(BufferedWriter bw = new BufferedWriter(new FileWriter("somefile.txt")) ) {
    for(JSONObject obj : dataList){
        bw.write(obj.toString());
        /* don't forget to append a new line! */
        bw.newLine();
    }
} catch (Exception e) {
    System.out.println(e.getMessage());
}

If this file is accumulative, don’t forget to set the append bit in the FileWriter! You can then add more JSON records to the end of the file:

try(BufferedWriter bw = new BufferedWriter(
    new FileWriter("somefile.txt", true)) ) {
...
}

Mastering Database Operations

The robustness and flexibility of relational databases such as MySQL make them the ideal technology for a wide range of use cases. As a data scientist, you will most likely interact with relational databases in connection to a larger application, or perhaps you will generate tables of condensed and organized data specific to the tasks of the data science group. In either case, mastering the command line, Structured Query Language (SQL), and Java Database Connectivity (JDBC) are critical skills.

Command-Line Clients

The command line is a great environment for managing the database as well as performing queries. As an interactive shell, the client enables rapid iteration of commands useful for exploring the data. After you work out queries on the command line, you can later transfer the SQL to your Java program, where the query can be parameterized for more flexible use. All of the popular databases such as MySQL, PostgreSQL, and SQLite have command-line clients. On systems where MySQL has been installed for development purposes (e.g., your personal computer), you should be able to connect with an anonymous login with an optional database name:

bash$ mysql <database>

However, you might not be able to create a new database. You can log in as the database administrator:

bash$ mysql -u root <database>

Then you can have full access and privileges. In all other cases (e.g., you are connecting to a production machine, remote instance, or cloud-based instance), you will need the following, where the -p option will prompt you for your password:

bash$ mysql -h host -P port -u user -p <database>

Upon connecting, you will be greeted with the MySQL shell, where you can make queries for showing all the databases you have access to, the name of the database you are connected to, and the username:

mysql> SHOW DATABASES;
mysql> SELECT DATABASE();
mysql> SELECT USER();

To switch to another database, use the USE command:

mysql> USE myDB;

You can create tables now:

mysql> CREATE TABLE my_table(id INT PRIMARY KEY, stuff VARCHAR(256));

Even better, if you have those table creation scripts stored away as files, the following will read in and execute the file:

mysql> SOURCE <filename>;

Of course, you may want to know what tables are in your database:

mysql> SHOW TABLES;

You may also want to get a detailed description of a table, including column names, data types, and constraints:

mysql> DESCRIBE <tablename>;

Structured Query Language

Structured Query Language (SQL) is a powerful tool for exploring data. While object-relational mapping (ORM) frameworks have a place in enterprise software applications, you may find them too restrictive for the kinds of tasks you will face as a data scientist. It is a good idea to brush up on your SQL skills and be comfortable with the basics presented here.
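As a quick refresher, the basic operations look like this, assuming a table data(id, yr, city) like the one used later in this chapter (the values are illustrative):

mysql> SELECT id, yr, city FROM data WHERE yr > 2013;
mysql> INSERT INTO data(id, yr, city) VALUES(4, 2016, 'Boston');
mysql> UPDATE data SET city = 'New York' WHERE id = 4;
mysql> DELETE FROM data WHERE id = 4;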

Java Database Connectivity

Java Database Connectivity (JDBC) is an API that connects Java applications with any SQL-compliant database. The JDBC driver for each database vendor exists as a separate JAR that must be included at build time and runtime. The JDBC technology strives for a uniform layer between applications and databases regardless of the vendor.

Connections

Connecting to a database with JDBC is extremely easy and convenient. All you need is a properly formed URI for the database that takes this general form:

String uri = "jdbc:<dbtype>:[location]/<dbname>?<parameters>";
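For example, here are a few illustrative URIs (the hosts, ports, and paths are assumptions; consult your driver’s documentation):

/* illustrative connection URIs; exact parameters vary by vendor and driver */
String mysqlUri    = "jdbc:mysql://localhost:3306/myDB?user=root";
String postgresUri = "jdbc:postgresql://localhost:5432/myDB";
String sqliteUri   = "jdbc:sqlite:/path/to/myDB.sqlite";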

The DriverManager.getConnection() method will throw an exception, and you have two choices for dealing with this. The modern Java way is to put the connection inside the try statement itself, known as a try-with-resources. In this way, the connection will be automatically closed when the block is done executing, so you do not have to explicitly call Connection.close(). Remember that if you instead create the connection inside the try block, you will need to close it explicitly, probably in a finally block:

String uri = "jdbc:mysql://localhost:3306/myDB?user=root";
try(Connection c = DriverManager.getConnection(uri)) {
    // TODO do something here
} catch (SQLException e) {
    System.err.println(e.getMessage());
}

Now that you have a connection, you need to ask yourself two questions: does the statement take variables, and does it return results?

Start by assuming that you will create a Statement. If the statement will take a variable (e.g., if user or application input will be substituted into the SQL), use a PreparedStatement instead. If you do not expect any results back, you can simply execute the statement. If you are expecting results, you need a ResultSet to contain and process them.
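For a static query that returns results, a minimal sketch (assuming the Connection c from above and the data table used later in this section) looks like this:

String selectSQL = "SELECT id, yr, city FROM data";
try (Statement st = c.createStatement();
     ResultSet rs = st.executeQuery(selectSQL)) {
    while (rs.next()) {
        int id = rs.getInt("id");
        int year = rs.getInt("yr");
        String city = rs.getString("city");
        // TODO do something with each record
    }
} catch (SQLException e) {
    System.err.println(e.getMessage());
}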

Prepared statements

You will probably not be hardcoding all your data into an SQL statement. Likewise, you may create a generic update statement for updating a record’s city column given an id by using an SQL WHERE clause. Although you may be tempted to build SQL strings by concatenating them, this is not a recommended practice. Anytime external input is substituted into an SQL expression, there is room for an SQL injection attack. The proper method is to use placeholders (as question marks) in the SQL statement and then use the PreparedStatement class to properly quote the input variables and execute the query. Prepared statements not only have a security advantage but one of speed as well. The PreparedStatement is compiled one time, and for a large number of inserts, this makes the process extremely efficient compared to compiling a new SQL statement for each and every insertion. An INSERT statement for our example data, with the corresponding Java, can be written as follows:

String insertSQL = "INSERT INTO data(id, yr, city) VALUES(?, ?, ?)";
PreparedStatement ps = c.prepareStatement(insertSQL);
/* set the value for each placeholder ? starting with index = 1 */            
ps.setInt(1, 1);
ps.setInt(2, 2015);
ps.setString(3, "San Francisco");
ps.execute();
ps.close();

But what if you have a lot of data and need to loop through a list? This is where you execute in batch mode. For example, suppose you have a List of Record objects obtained from an import of CSV:

String insertSQL = "INSERT INTO data(id, yr, city) VALUES(?, ?, ?)";
PreparedStatement ps = c.prepareStatement(insertSQL);
List<Record> records = FileUtils.getRecordsFromCSV();
for(Record r: records) {
    ps.setInt(1, r.id);
    ps.setInt(2, r.year);
    ps.setString(3, r.city);
    ps.addBatch();
}
ps.executeBatch();
ps.close();

Visualizing Data with Plots

Data visualization is an important and exciting component of data science. The combination of broadly available, interesting data and interactive graphical technologies has led to stunning visualizations, capable of telling complex stories. Many times, our visualizations are the eye candy that everyone has been anticipating. Of utmost importance is to realize that the same source of data can be used to tell completely different stories depending on not only the segment of the data you choose to show, but also the graphical styling utilized.

Keeping in mind that data visualization should always take into consideration the audience, there are roughly three kinds of consumers of a visualization. The first is yourself, the all-knowing expert who is most likely iterating quickly on an analysis or algorithm development. Your requirements are to see the data as plainly and quickly as possible. Things such as setting plot titles, axis labels, smoothing, legends, or date formatting might not be important, because you are intimately aware of what you are looking at. In essence, we often plot data to get a quick overview of the data landscape, without concerning ourselves with how others will view it.

The second consumer of data visualizations is the industry expert. After you have solved a data science problem and you think it’s ready to share, it’s essential to fully label the axes, put a meaningful, descriptive title on the plot, make sure any series of data are described by a legend, and ensure that the graphic you have created can mostly tell a story on its own. Even if it’s not visually stunning, your colleagues and peers will probably not be concerned with eye candy, but rather the message you are trying to convey. In fact, it will be much easier to make a scientific evaluation of the merits of the work if the visualization is clear of graphical widgets and effects. Of course, this format is also essential for archiving your data. One month later, you will not remember what those axes are if you don’t label them now!

The third category of visualization consumer is everybody else. This is the time to get creative and artistic, because a careful choice of colors and styles can make good data seem great. Be cautious, however, of the tremendous amount of time and effort you will spend preparing graphics for this level of consumer. An added advantage of using JavaFX is the interactivity allowed via mouse events. This enables you to build a graphical application similar to many of the web-based dashboards you are accustomed to.

Creating Simple Plots

Java contains native graphics capabilities in the JavaFX package. Since version 1.8, scientific plotting is enabled with charts of many types such as scatter, line, bar, stacked bar, pie, area, stacked area, or bubble via the javafx.scene.chart package. A Chart object is contained in a Scene object, which is contained in a Stage object. The general form is to extend an executable Java class with Application and place all the plotting directives in the overridden method Application.start(). The Application.launch() method must be called in the main method to create and display the chart.

Scatter plots

An example of a simple plot is a scatter chart, which plots a set of x-y pairs of numbers as points on a grid. These charts utilize the javafx.scene.chart.XYChart.Data and javafx.scene.chart.XYChart.Series classes. The Data class is a container that holds any dimension of mixed types of data, and the Series class contains an ObservableList of Data instances. There are factory methods in the javafx.collections.FXCollections class for creating instances of ObservableList directly, should you prefer that route. However, for scatter, line, area, bubble, and bar charts, this is unnecessary because they all utilize the Series class:

public class BasicScatterChart extends Application {

    public static void main(String[] args) {
        launch(args);   
    }

    @Override
    public void start(Stage stage) throws Exception {
        int[] xData = {1, 2, 3, 4, 5};
        double[] yData = {1.3, 2.1, 3.3, 4.0, 4.8};

        /* add Data to a Series */
        Series<Number, Number> series = new Series<>();
        for (int i = 0; i < xData.length; i++) {
            series.getData().add(new Data<>(xData[i], yData[i]));
        }
        
        /* define the axes */
        NumberAxis xAxis = new NumberAxis();
        xAxis.setLabel("x");
        NumberAxis yAxis = new NumberAxis();
        yAxis.setLabel("y");
        
        /* create the scatter chart */
        ScatterChart<Number,Number> scatterChart = 
            new ScatterChart<>(xAxis, yAxis);
        scatterChart.getData().add(series);

        
        /* create a scene using the chart */
        Scene scene  = new Scene(scatterChart, 800, 600);

        /* tell the stage what scene to use and render it! */
        stage.setScene(scene);
        stage.show();
    }
    
}

Figure 1-1 depicts the default graphics window that is displayed when rendering a JavaFX chart for a simple set of data.

Figure 1-1. Scatter plot example

The ScatterChart class can readily be replaced with LineChart, AreaChart, or BubbleChart in the preceding example.

Bar charts

As an x-y chart, the bar chart utilizes the Data and Series classes. In this case, however, the only difference is that the x-axis must be a string type (as opposed to a numeric type) and utilizes the CategoryAxis class instead of the NumberAxis class. The y-axis remains as a NumberAxis. Typically, the categories in a bar chart are something like days of the week or market segments. Note that the BarChart class takes a String, Number pair of types inside the diamonds. These are useful for making histograms, and we show one in Chapter 3:

public class BasicBarChart extends Application {

    public static void main(String[] args) {
        launch(args);   
    }

    @Override
    public void start(Stage stage) throws Exception {

        String[] xData = {"Mon", "Tues", "Wed", "Thurs", "Fri"};
        double[] yData = {1.3, 2.1, 3.3, 4.0, 4.8};

        /* add Data to a Series */
        Series<String, Number> series = new Series<>();
        for (int i = 0; i < xData.length; i++) {
            series.getData().add(new Data<>(xData[i], yData[i]));
        }
        
        /* define the axes */
        CategoryAxis xAxis = new CategoryAxis();
        xAxis.setLabel("x");
        NumberAxis yAxis = new NumberAxis();
        yAxis.setLabel("y");
        
        /* create the bar chart */
        BarChart<String,Number> barChart = new BarChart<>(xAxis, yAxis);
        barChart.getData().add(series);

        
        /* create a scene using the chart */
        Scene scene  = new Scene(barChart, 800, 600);

        /* tell the stage what scene to use and render it! */
        stage.setScene(scene);
        stage.show();
    }
    
}

Plotting Mixed Chart Types

Often we want to display multiple plot types in one graphic—for example, when you want to display the data points as an x-y scatter plot and then overlay a line plot of the best fitted model. Perhaps you will also want to include two more lines to represent the boundary of the model, probably one, two, or three multiples of the standard deviation σ, or the confidence interval 1.96 × σ. Currently, JavaFX does not allow multiple plots of different types to be displayed simultaneously on the same scene. There is a workaround, however! We can plot multiple series on a LineChart and then use CSS to style one series to show only points, one to show only a solid line, and two more to show only a dashed line. Here is the CSS:

.default-color0.chart-series-line {
    -fx-stroke: transparent;
}

.default-color1.chart-series-line {
    -fx-stroke: blue; -fx-stroke-width: 1;
}

.default-color2.chart-series-line {
    -fx-stroke: blue;
    -fx-stroke-width: 1;
    -fx-stroke-dash-array: 1 4 1 4;
}

.default-color3.chart-series-line { 
    -fx-stroke: blue;
    -fx-stroke-width: 1;
    -fx-stroke-dash-array: 1 4 1 4;
}

/*.default-color0.chart-line-symbol { 
    -fx-background-color: white, green; 
}*/

.default-color1.chart-line-symbol { 
    -fx-background-color: transparent, transparent;
}

.default-color2.chart-line-symbol { 
    -fx-background-color: transparent, transparent; 
}

.default-color3.chart-line-symbol { 
    -fx-background-color: transparent, transparent; 
}
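To apply the stylesheet, add it to the scene in the Java code (the filename is illustrative):

/* attach the CSS file so the chart styles take effect */
scene.getStylesheets().add("chart.css");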

The plot looks like Figure 1-2.

Figure 1-2. Plot of mixed line types with CSS