We’ve looked at sequences, such as lists, tuples and arrays. In this section, we’ll discuss time series, which are sequences of values (called observations) associated with points in time. Some examples are daily closing stock prices, hourly temperature readings, the changing positions of a plane in flight, annual crop yields and quarterly company profits. Perhaps the ultimate time series is the stream of time-stamped tweets coming from Twitter users worldwide. In the “Data Mining Twitter” chapter, we’ll study Twitter data in depth.
In this section, we’ll use a technique called simple linear regression to make predictions from time series data. We’ll use the 1895 through 2018 January average high temperatures in New York City to predict future average January high temperatures and to estimate the average January high temperatures for years preceding 1895.
In the “Machine Learning” chapter, we’ll revisit this example using the scikit-learn library. In the “Deep Learning” chapter, we’ll use recurrent neural networks (RNNs) to analyze time series.
In later chapters, we’ll see that time series are popular in financial applications and with the Internet of Things (IoT), which we’ll discuss in the “Big Data: Hadoop, Spark, NoSQL and IoT” chapter.
In this section, we’ll display graphs with Seaborn and pandas, which both use Matplotlib, so launch IPython with Matplotlib support:
ipython --matplotlib
The data we’ll use is a time series in which the observations are ordered by year. Univariate time series have one observation per time, such as the average of the January high temperatures in New York City for a particular year. Multivariate time series have two or more observations per time, such as temperature, humidity and barometric pressure readings in a weather application. Here, we’ll analyze a univariate time series.
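As a rough illustration of the distinction, a univariate series fits naturally in a pandas Series and a multivariate one in a DataFrame. In the sketch below, the temperatures are the 2016 through 2018 values shown later in this section, while the Humidity column is made up purely for illustration:

```python
import pandas as pd

years = [2016, 2017, 2018]

# Univariate: one observation (temperature) per year
univariate = pd.Series([40.8, 42.8, 38.7], index=years, name='Temperature')

# Multivariate: two observations per year; the Humidity values are hypothetical
multivariate = pd.DataFrame(
    {'Temperature': [40.8, 42.8, 38.7],
     'Humidity': [61.0, 58.5, 63.2]},
    index=years)

print(univariate.shape)    # one dimension: a single value per year
print(multivariate.shape)  # two dimensions: several values per year
```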
Two tasks often performed with time series are:
Time series analysis, which looks at existing time series data for patterns, helping data analysts understand the data. A common analysis task is to look for seasonality in the data. For example, in New York City, the monthly average high temperature varies significantly based on the seasons (winter, spring, summer or fall).
Time series forecasting, which uses past data to predict the future.
We’ll perform time series forecasting in this section.
Using a technique called simple linear regression, we’ll make predictions by finding a linear relationship between the months (January of each year) and New York City’s average January high temperatures. Given a collection of values representing an independent variable (the month/year combination) and a dependent variable (the average high temperature for that month/year), simple linear regression describes the relationship between these variables with a straight line, known as the regression line.
To understand the general concept of a linear relationship, consider Fahrenheit and Celsius temperatures. Given a Fahrenheit temperature, we can calculate the corresponding Celsius temperature using the following formula:
c = 5 / 9 * (f - 32)
In this formula, f (the Fahrenheit temperature) is the independent variable, and c (the Celsius temperature) is the dependent variable: each value of c depends on the value of f used in the calculation.
Plotting Fahrenheit temperatures and their corresponding Celsius temperatures produces a straight line. To show this, let’s first create a lambda for the preceding formula and use it to calculate the Celsius equivalents of the Fahrenheit temperatures 0–100 in 10-degree increments. We store each Fahrenheit/Celsius pair as a tuple in temps:
In [1]: c = lambda f: 5 / 9 * (f - 32)
In [2]: temps = [(f, c(f)) for f in range(0, 101, 10)]
Next, let’s place the data in a DataFrame, then use its plot method to display the linear relationship between the Fahrenheit and Celsius temperatures. The plot method’s style keyword argument controls the data’s appearance. The period in the string '.-' indicates that each point should appear as a dot, and the dash indicates that lines should connect the dots. We manually set the y-axis label to 'Celsius' because the plot method shows 'Celsius' only in the graph’s upper-left corner legend, by default.
In [3]: import pandas as pd
In [4]: temps_df = pd.DataFrame(temps, columns=['Fahrenheit', 'Celsius'])
In [5]: axes = temps_df.plot(x='Fahrenheit', y='Celsius', style='.-')
In [6]: y_label = axes.set_ylabel('Celsius')
The points along any straight line (in two dimensions) like those shown in the preceding graph can be calculated with the equation:
y = mx + b
where
m is the line’s slope,
b is the line’s intercept with the y-axis (at x = 0),
x is the independent variable (the date in this example), and
y is the dependent variable (the temperature in this example).
In simple linear regression, y is the predicted value for a given x.
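To connect this equation to the Fahrenheit/Celsius formula, the following sketch rewrites c = 5 / 9 * (f - 32) in slope-intercept form. The names slope, intercept and line are chosen here for illustration only:

```python
# c = 5 / 9 * (f - 32) expands to c = (5 / 9) * f - 160 / 9,
# so m = 5 / 9 and b = -160 / 9 in y = mx + b.
slope = 5 / 9
intercept = -160 / 9

def line(x, m=slope, b=intercept):
    """Return y = mx + b for the given x."""
    return m * x + b

# The freezing and boiling points of water should map to about 0 and 100
for f in (32, 212):
    print(f, round(line(f), 1))
```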
linregress from SciPy’s stats Module
Simple linear regression determines the slope (m) and intercept (b) of a straight line that best fits your data. Consider the following diagram, which shows a few of the time-series data points we’ll process in this section and a corresponding regression line. We added vertical lines to indicate each data point’s distance from the regression line:
Conceptually, the simple linear regression algorithm adjusts the slope and intercept and, for each adjustment, calculates the square of each point’s vertical distance from the line. The “best fit” occurs when the slope and intercept values minimize the sum of those squared distances. This is known as an ordinary least squares calculation.
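For a single independent variable, the least squares slope and intercept also have a well-known closed-form solution, so they can be computed directly. The sketch below applies those formulas to the 1895 through 1899 observations shown later in this section. The function name least_squares_line is our own, and this is an illustration of the calculation, not a description of how any particular library implements it:

```python
def least_squares_line(xs, ys):
    """Return the (slope, intercept) that minimize the sum of squared
    vertical distances between the points and the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The best-fit line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [1895, 1896, 1897, 1898, 1899]
temps = [34.2, 34.7, 35.5, 39.6, 36.4]
slope, intercept = least_squares_line(years, temps)
print(slope, intercept)
```

With only five points, the resulting line differs considerably from the one fitted to all 124 observations.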
The SciPy (Scientific Python) library is widely used for engineering, science and math in Python. This library’s linregress function (from the scipy.stats module) performs simple linear regression for you. After calling linregress, you’ll plug the resulting slope and intercept into the y = mx + b equation to make predictions.
In the three previous Intro to Data Science sections, you used pandas to work with data. You’ll continue using pandas throughout the rest of the book. In this example, we’ll load the data for New York City’s 1895–2018 average January high temperatures from a CSV file into a DataFrame. We’ll then format the data for use in this example.
We’ll use Seaborn to plot the DataFrame’s data with a regression line that shows the average high-temperature trend over the period 1895–2018.
Let’s get the data for our study. The National Oceanic and Atmospheric Administration (NOAA) offers lots of public historical data, including time series for average high temperatures in specific cities over various time intervals.
We obtained the January average high temperatures for New York City from 1895 through 2018 from NOAA’s “Climate at a Glance” time series at:
https://www.ncdc.noaa.gov/cag/
On that web page, you can select temperature, precipitation and other data for the entire U.S., regions within the U.S., states, cities and more. Once you’ve set the area and time frame, click Plot to display a diagram and view a table of the selected data. At the top of that table are links for downloading the data in several formats including CSV, which we discussed in the “Files and Exceptions” chapter. NOAA’s maximum date range available at the time of this writing was 1895–2018. For your convenience, we provided the data in the ch10 examples folder in the file ave_hi_nyc_jan_1895-2018.csv. If you download the data on your own, delete the rows above the line containing "Date,Value,Anomaly".
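If you download the file yourself, one way to skip the descriptive rows is to locate the header line programmatically instead of deleting rows by hand. This is only a sketch: the two lines at the top of sample are invented stand-ins for whatever appears above the header in a real download, and the helper name strip_preamble is hypothetical:

```python
import csv
import io

def strip_preamble(text, header='Date,Value,Anomaly'):
    """Return text starting at the header line, dropping any rows above it."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == header:
            return '\n'.join(lines[i:]) + '\n'
    raise ValueError(f'header line {header!r} not found')

# Made-up preamble followed by the first two data rows shown in this section
sample = (
    'Some title line\n'
    'Some units line\n'
    'Date,Value,Anomaly\n'
    '189501,34.2,-3.2\n'
    '189601,34.7,-2.7\n'
)

cleaned = strip_preamble(sample)
rows = list(csv.DictReader(io.StringIO(cleaned)))
print(rows[0]['Date'], rows[0]['Value'])
```

The cleaned text can also be handed to pd.read_csv by wrapping it in io.StringIO, because read_csv accepts file-like objects as well as file names.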
This data contains three columns per observation:
Date: A value of the form 'YYYYMM' (such as '201801'). MM is always 01 because we downloaded data for only January of each year.
Value: A floating-point Fahrenheit temperature.
Anomaly: The difference between the value for the given date and average values for all dates. We do not use the Anomaly value in this example, so we’ll ignore it.
Loading the Average High Temperatures into a DataFrame
Let’s load and display the New York City data from ave_hi_nyc_jan_1895-2018.csv:
In [7]: nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')
We can look at the DataFrame’s head and tail to get a sense of the data:
In [8]: nyc.head()
Out[8]:
     Date  Value  Anomaly
0  189501   34.2     -3.2
1  189601   34.7     -2.7
2  189701   35.5     -1.9
3  189801   39.6      2.2
4  189901   36.4     -1.0
In [9]: nyc.tail()
Out[9]:
       Date  Value  Anomaly
119  201401   35.5     -1.9
120  201501   36.1     -1.3
121  201601   40.8      3.4
122  201701   42.8      5.4
123  201801   38.7      1.3
We’ll soon use Seaborn to graph the Date-Value pairs and a regression line. When plotting data from a DataFrame, Seaborn labels a graph’s axes using the DataFrame’s column names. For readability, let’s rename the 'Value' column as 'Temperature':
In [10]: nyc.columns = ['Date', 'Temperature', 'Anomaly']
In [11]: nyc.head(3)
Out[11]:
     Date  Temperature  Anomaly
0  189501         34.2     -3.2
1  189601         34.7     -2.7
2  189701         35.5     -1.9
Seaborn labels the tick marks on the x-axis with Date values. Since this example processes only January temperatures, the x-axis labels will be more readable if they do not contain 01 (for January), so we’ll remove it from each Date. First, let’s check the column’s type:
In [12]: nyc.Date.dtype
Out[12]: dtype('int64')
The values are integers, so we can divide by 100 to truncate the last two digits. Recall that each column in a DataFrame is a Series. Calling Series method floordiv performs integer division on every element of the Series:
In [13]: nyc.Date = nyc.Date.floordiv(100)
In [14]: nyc.head(3)
Out[14]:
   Date  Temperature  Anomaly
0  1895         34.2     -3.2
1  1896         34.7     -2.7
2  1897         35.5     -1.9
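For a Series, the // operator gives the same result as floordiv, and the % operator recovers the digits that floor division discards. The following sketch uses a small stand-alone Series rather than the nyc DataFrame:

```python
import pandas as pd

dates = pd.Series([189501, 189601, 201801])  # values of the form YYYYMM

years = dates // 100   # same result as dates.floordiv(100)
months = dates % 100   # the two digits that floor division discards

print(years.tolist())   # [1895, 1896, 2018]
print(months.tolist())  # [1, 1, 1]
```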
For some quick statistics on the dataset’s temperatures, call describe on the Temperature column. We can see that there are 124 observations, the mean value of the observations is 37.60, and the lowest and highest observations are 26.10 and 47.60 degrees, respectively:
In [15]: pd.set_option('display.precision', 2)
In [16]: nyc.Temperature.describe()
Out[16]:
count    124.00
mean      37.60
std        4.54
min       26.10
25%       34.58
50%       37.60
75%       40.60
max       47.60
Name: Temperature, dtype: float64
SciPy’s stats module provides function linregress, which calculates a regression line’s slope and intercept for a given set of data points:
In [17]: from scipy import stats
In [18]: linear_regression = stats.linregress(x=nyc.Date,
...: y=nyc.Temperature)
...:
Function linregress receives two one-dimensional arrays of the same length representing the data points’ x- and y-coordinates. The keyword arguments x and y represent the independent and dependent variables, respectively. The object returned by linregress contains the regression line’s slope and intercept:
In [19]: linear_regression.slope
Out[19]: 0.014771361132966163
In [20]: linear_regression.intercept
Out[20]: 8.694993233674289
We can use these values with the simple linear regression equation for a straight line, y = mx + b, to predict the average January temperature in New York City for a given year. Let’s predict the average Fahrenheit temperature for January of 2019. In the following calculation, linear_regression.slope is m, 2019 is x (the date value for which you’d like to predict the temperature), and linear_regression.intercept is b:
In [21]: linear_regression.slope * 2019 + linear_regression.intercept
Out[21]: 38.51837136113298
We also can approximate what the average temperature might have been in the years before 1895. For example, let’s approximate the average temperature for January of 1890:
In [22]: linear_regression.slope * 1890 + linear_regression.intercept
Out[22]: 36.612865774980335
For this example, we had data for 1895–2018. You should expect that the further you go outside this range, the less reliable the predictions will be.
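Because the same expression is evaluated for every prediction, it can be convenient to wrap it in a small function. In this sketch, predict is a name we introduce here, and the slope and intercept passed to it are round illustrative numbers, not the values computed in the session. In the session you would pass linear_regression.slope and linear_regression.intercept instead:

```python
def predict(year, slope, intercept):
    """Evaluate the regression line y = mx + b at the given year."""
    return slope * year + intercept

# Illustrative line only: 0.5 degrees per year, 10 degrees at year 0
predictions = {year: predict(year, slope=0.5, intercept=10.0)
               for year in (0, 10, 20)}
print(predictions)
```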
Next, let’s use Seaborn’s regplot function to plot each data point with the dates on the x-axis and the temperatures on the y-axis. The regplot function creates the scatter plot or scattergram below in which the scattered blue dots represent the Temperature values for the given Date values, and the straight line displayed through the points is the regression line:
First, close the prior Matplotlib window if you have not done so already; otherwise, regplot will use the existing window that already contains a graph. Function regplot’s x and y keyword arguments are one-dimensional arrays of the same length representing the x-y coordinate pairs to plot. Recall that pandas automatically creates attributes for each column name if the name can be a valid Python identifier:
In [23]: import seaborn as sns
In [24]: sns.set_style('whitegrid')
In [25]: axes = sns.regplot(x=nyc.Date, y=nyc.Temperature)
The regression line’s slope (lower at the left and higher at the right) indicates a warming trend over the last 124 years. In this graph, the y-axis represents a 21.5-degree temperature range between the minimum of 26.1 and the maximum of 47.6, so the data appears to be spread significantly above and below the regression line, making it difficult to see the linear relationship. This is a common issue in data analytics visualizations. When you have axes that reflect different kinds of data (dates and temperatures in this case), how do you reasonably determine their respective scales? In the preceding graph, this is purely an issue of the graph’s height—Seaborn and Matplotlib auto-scale the axes, based on the data’s range of values. We can scale the y-axis range of values to emphasize the linear relationship. Here, we scaled the y-axis from a 21.5-degree range to a 60-degree range (from 10 to 70 degrees):
In [26]: axes.set_ylim(10, 70)
Out[26]: (10, 70)
Here are some popular sites where you can find time series to use in your studies:
https:/ This is the U.S. government’s open data portal. Searching for “time series” yields over 7200 time-series datasets.
https:/ The National Oceanic and Atmospheric Administration (NOAA) Climate at a Glance portal provides both global and U.S. weather-related time series.
https:/ NOAA’s Earth System Research Laboratory (ESRL) portal provides monthly and seasonal climate-related time series.
https:/ Quandl provides hundreds of free financial-related time series, as well as fee-based time series.
https://datamarket.com/data/list/?q=provider:tsdl The Time Series Data Library (TSDL) provides links to hundreds of time series datasets across many industries.
http://archive.ics.uci.edu/ml/datasets.html The University of California Irvine (UCI) Machine Learning Repository contains dozens of time-series datasets for a variety of topics.
http:/ The University of Maryland’s EconData service provides links to thousands of economic time series from various U.S. government agencies.
(Fill-In) Time series _________ looks at existing time series data for patterns, helping data analysts understand the data. Time series _________ uses data from the past to predict the future.
Answer: analysis, forecasting.
(True/False) In the formula c = 5 / 9 * (f - 32), f (the Fahrenheit temperature) is the independent variable and c (the Celsius temperature) is the dependent variable.
Answer: True.
(IPython Session) Assuming that this linear trend continues, based on the slope and intercept values calculated in this section’s interactive session, in what year might the average January temperature in New York City reach 40 degrees Fahrenheit?
Answer:
In [27]: year = 2019
In [28]: slope = linear_regression.slope
In [29]: intercept = linear_regression.intercept
In [30]: temperature = slope * year + intercept
In [31]: while temperature < 40.0:
...: year += 1
...: temperature = slope * year + intercept
...:
In [32]: year
Out[32]: 2120
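Because the regression line is straight, the loop’s result can also be found algebraically by solving target = slope * year + intercept for year and rounding up to the next whole year. The function name first_year_at_or_above is our own, and the numbers in the example are illustrative. With the session’s values you would pass linear_regression.slope and linear_regression.intercept:

```python
import math

def first_year_at_or_above(target, slope, intercept):
    """Return the first whole year at which slope * year + intercept
    reaches or exceeds target. Assumes a rising trend (slope > 0)."""
    if slope <= 0:
        raise ValueError('slope must be positive')
    return math.ceil((target - intercept) / slope)

# Illustrative line only: 0.5 degrees per year, 10 degrees at year 0
print(first_year_at_or_above(40.0, slope=0.5, intercept=10.0))  # 60
```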