Exercises

SQL and RDBMS Exercises

  1. 17.1 (Books Database) In an IPython session, perform each of the following tasks on the books database from Section 17.2 (one possible approach is sketched in the code after this list):

    1. Select all authors’ last names from the authors table in descending order.

    2. Select all book titles from the titles table in ascending order.

    3. Use an INNER JOIN to select all the books for a specific author. Include the title, copyright year and ISBN. Order the information alphabetically by title.

    4. Insert a new author into the authors table.

    5. Insert a new title for an author. Remember that the book must have an entry in the author_ISBN table and an entry in the titles table.
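
    The following sketch uses the sqlite3 module and assumes the books.db schema from Section 17.2 (tables authors, titles and author_ISBN, with columns such as first, last, isbn, title, edition and copyright); adjust the names if your copy of the script differs:

      import sqlite3

      connection = sqlite3.connect('books.db')
      cursor = connection.cursor()

      # (a) authors' last names in descending order
      cursor.execute('SELECT last FROM authors ORDER BY last DESC')
      print(cursor.fetchall())

      # (b) book titles in ascending order
      cursor.execute('SELECT title FROM titles ORDER BY title')
      print(cursor.fetchall())

      # (c) one author's books (author id 1 here) via an INNER JOIN
      cursor.execute("""SELECT title, copyright, titles.isbn
                        FROM titles INNER JOIN author_ISBN
                            ON titles.isbn = author_ISBN.isbn
                        WHERE author_ISBN.id = 1
                        ORDER BY title""")
      print(cursor.fetchall())

      # (d) insert a new author
      cursor.execute(
          "INSERT INTO authors (first, last) VALUES ('Sue', 'Red')")

      # (e) a new title needs rows in both titles and author_ISBN;
      # the id 6 assumes the new author's auto-incremented id
      cursor.execute("""INSERT INTO titles (isbn, title, edition, copyright)
                        VALUES ('0123456789', 'Sample Book', 1, '2020')""")
      cursor.execute(
          "INSERT INTO author_ISBN (id, isbn) VALUES (6, '0123456789')")
      connection.commit()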

  2. 17.2 (Cursor Method fetchall and Attribute description) When you use a sqlite3 Cursor’s execute method to perform a query, the query’s results are stored in the Cursor object. The Cursor attribute description contains metadata about the results stored as a tuple of tuples. Each nested tuple’s first value is a column name in the query results. Cursor method fetchall returns the query result’s data as a list of tuples. Investigate the description attribute and fetchall method. Open the books database and use Cursor method execute to select all the data in the titles table, then use description and fetchall to display the data in tabular format.
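
    A minimal sketch of one way to do this, assuming books.db is in the current folder (in sqlite3, each description entry is a seven-element tuple in which only the first element, the column name, is populated):

      import sqlite3

      connection = sqlite3.connect('books.db')
      cursor = connection.cursor()
      cursor.execute('SELECT * FROM titles')

      # header row built from the column names in cursor.description
      columns = [info[0] for info in cursor.description]
      print(''.join(f'{name:<20}' for name in columns))

      # data rows from fetchall's list of tuples
      for row in cursor.fetchall():
          print(''.join(f'{str(value):<20}' for value in row))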

  3. 17.3 (Contacts Database) Study the books.sql script provided in the ch17 examples folder’s sql subfolder. Save the script as addressbook.sql and modify it to create a single table named contacts. The table should contain an auto-incremented id column and text columns for a person’s first name, last name and phone number. In an IPython session, insert contacts into the database, query the database to list all the contacts and contacts with a specific last name, update a contact and delete a contact.
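
    A sketch of the table-creation step, shown here in Python so the example is self-contained; the column names (first, last, phone) are one reasonable choice, not a requirement:

      import sqlite3

      connection = sqlite3.connect('addressbook.db')
      connection.executescript("""
          DROP TABLE IF EXISTS contacts;
          CREATE TABLE contacts (
              id INTEGER PRIMARY KEY AUTOINCREMENT,
              first TEXT NOT NULL,
              last TEXT NOT NULL,
              phone TEXT NOT NULL);""")

      cursor = connection.cursor()
      cursor.execute("""INSERT INTO contacts (first, last, phone)
                        VALUES ('Pat', 'Jones', '555-555-1234')""")
      cursor.execute("SELECT * FROM contacts WHERE last = 'Jones'")
      print(cursor.fetchall())
      cursor.execute(
          "UPDATE contacts SET phone = '555-555-4321' WHERE id = 1")
      cursor.execute('DELETE FROM contacts WHERE id = 1')
      connection.commit()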

  4. 17.4 (Project: DB Browser for SQLite) Investigate the open-source DB Browser for SQLite (https://sqlitebrowser.org/). This tool provides a graphical user interface in which you can view and interact with a SQLite database. Use the tool to open the books.db database and view the contents of the authors table. In IPython, add a new author, then remove that author, viewing the authors table in DB Browser for SQLite after each change to see the updates.
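
    A sketch of the IPython side; depending on your settings, you may need to refresh the table view in DB Browser for SQLite to see each change:

      import sqlite3

      connection = sqlite3.connect('books.db')
      connection.execute(
          "INSERT INTO authors (first, last) VALUES ('New', 'Author')")
      connection.commit()  # now check the authors table in DB Browser

      connection.execute("DELETE FROM authors WHERE last = 'Author'")
      connection.commit()  # check the authors table again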

  5. 17.5 (Project: MariaDB) Research the MariaDB relational database management system and its Python support, then use it to create a database and reimplement the IPython session in Section 17.2. You may need to update the SQL script that creates the database tables, because the syntax for some features, such as auto-incremented integer primary keys, varies among relational database management systems.
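
    A hedged sketch of connecting from Python, assuming MariaDB Connector/Python (the mariadb module) and a local server on which you've already created a books database; note that MariaDB uses AUTO_INCREMENT where SQLite uses AUTOINCREMENT:

      import mariadb  # MariaDB Connector/Python: pip install mariadb

      # the connection parameters are placeholders; substitute your own
      connection = mariadb.connect(user='user', password='password',
                                   host='localhost', database='books')
      cursor = connection.cursor()
      cursor.execute('SELECT last FROM authors ORDER BY last DESC')
      print(cursor.fetchall())
      connection.close()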

NoSQL Database Exercises

  1. 17.6 (MongoDB Twitter Example Modification: Sentiment Analysis Enhancement) Using the sentiment analysis techniques you learned in the “Natural Language Processing” chapter, modify Section 17.4’s case study as follows. Enable the user to select a senator, then use a pandas DataFrame to summarize that senator’s positive, negative and neutral tweets by state. Create a choropleth that colors each state according to its predominant tweet sentiment (positive, negative or neutral). The popup map markers should show the number of tweets of each sentiment for that state.
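
    How you summarize depends on how you stored the tweets. As a hypothetical sketch, if each tweet record carries state and sentiment fields, a pandas crosstab yields the per-state counts and one basis for the choropleth’s colors:

      import pandas as pd

      # hypothetical sample; in the exercise, load these from MongoDB
      tweets = [{'state': 'SD', 'sentiment': 'positive'},
                {'state': 'SD', 'sentiment': 'negative'},
                {'state': 'VT', 'sentiment': 'neutral'}]
      tweets_df = pd.DataFrame(tweets)

      # rows are states, columns are counts of each sentiment
      summary = pd.crosstab(tweets_df['state'], tweets_df['sentiment'])
      print(summary)

      # each state's predominant sentiment, for coloring the choropleth
      print(summary.idxmax(axis=1))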

  2. 17.7 (Project: Six Degrees of Separation with Neo4j NoSQL Graph Database) The famous “six degrees of separation” problem says that any two people in the world are connected to one another by six or fewer acquaintance connections. A game based on this, “Six Degrees of Kevin Bacon,” connects any two Hollywood movie stars to Kevin Bacon via the roles they’ve played in films (he has appeared in a great many of them). Neo4j’s Cypher language is used to query Neo4j databases. Neo4j’s Guide to Cypher Basics (https://neo4j.com/developer/guide-cypher-basics/) implements “Six Degrees of Kevin Bacon” using a movie database. Install the Neo4j database on your system and implement their solution.
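
    Once the movie database is loaded, here is a sketch of querying it from Python with the neo4j driver (the URI and credentials are placeholders); the shortestPath query follows the guide’s approach:

      from neo4j import GraphDatabase  # pip install neo4j

      driver = GraphDatabase.driver('bolt://localhost:7687',
                                    auth=('neo4j', 'password'))

      # shortest chain of relationships linking Kevin Bacon to another actor
      query = """MATCH p = shortestPath(
                     (bacon:Person {name: 'Kevin Bacon'})-[*]-
                     (other:Person {name: $name}))
                 RETURN p"""

      with driver.session() as session:
          for record in session.run(query, name='Meg Ryan'):
              print(record['p'])

      driver.close()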

Hadoop Exercises

  1. 17.8 (Project: Hadoop on the Desktop with Hortonworks HDP Sandbox) Hortonworks Sandbox (https://hortonworks.com/products/sandbox/) is an open-source desktop platform for Hadoop, Spark and related technologies. Install a desktop version of the Hortonworks Data Platform (HDP) Sandbox, then use it to execute this chapter’s Hadoop MapReduce example. Caution: Before installing HDP Sandbox, ensure that your system meets the substantial disk and memory requirements.

  2. 17.9 (Project: Hadoop on the Desktop with Cloudera CDH Quickstart VM) Cloudera CDH is an open-source desktop platform for Hadoop, Spark and related technologies. Install a Cloudera desktop Quickstart VM (search for “Cloudera CDH Quickstart VM” online), then use it to execute this chapter’s Hadoop MapReduce example. Caution: Before installing a Cloudera CDH Quickstart VM, ensure that your system meets the substantial disk and memory requirements.

  3. 17.10 (Research Project: Apache Tez) Investigate Apache Tez, a high-performance replacement for MapReduce. How does Tez achieve its performance improvement over MapReduce?

Spark Exercises

  1. 17.11 (Project: Spark on the Desktop with Hortonworks HDP Sandbox) Hortonworks Sandbox (https://hortonworks.com/products/sandbox/) is an open-source desktop platform for Hadoop, Spark and related technologies. Install a desktop version of the Hortonworks Data Platform (HDP) Sandbox, then use it to execute this chapter’s Spark examples. Caution: Before installing HDP Sandbox, ensure that your system meets the substantial disk and memory requirements.

  2. 17.12 (Project: Spark on the Desktop with Cloudera Quickstart VM) Cloudera CDH is an open-source desktop platform for Hadoop, Spark and related technologies. Install a Cloudera desktop Quickstart VM (search for “Cloudera CDH Quickstart VM” online), then use it to execute this chapter’s Spark examples. Caution: Before installing a Quickstart VM, ensure that your system meets the substantial disk and memory requirements.

  3. 17.13 (Project: Spark ML) The “Machine Learning” chapter presented several popular machine-learning algorithms. These and many other algorithms are available in Spark via Spark ML and the PySpark library. Research Spark ML in PySpark, then reimplement one of the “Machine Learning” chapter’s examples using the Jupyter pyspark-notebook Docker container.
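
    As a taste of the API, here is a minimal Spark ML sketch that fits a simple linear regression to a few in-memory points; for the exercise, you’d load one of the “Machine Learning” chapter’s datasets instead:

      from pyspark.sql import SparkSession
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.regression import LinearRegression

      spark = SparkSession.builder.appName('SparkMLSketch').getOrCreate()
      data = spark.createDataFrame(
          [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)], ['x', 'y'])

      # Spark ML estimators expect features packed into one vector column
      assembler = VectorAssembler(inputCols=['x'], outputCol='features')
      model = LinearRegression(featuresCol='features', labelCol='y').fit(
          assembler.transform(data))
      print(model.coefficients, model.intercept)
      spark.stop()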

  4. 17.14 (Project: IBM’s Apache Spark Service) Investigate IBM Watson’s Apache Spark service (https://console.bluemix.net/catalog/services/apache-spark), which provides free Lite tier support for Spark streaming and Spark MLlib, then use it to implement one of the machine-learning studies from Chapter 15.

IoT and Pub/Sub Exercises

  1. 17.15 (Watson IoT Platform) Investigate the free Lite tier of the Watson IoT Platform (https://console.bluemix.net/catalog/services/internet-of-things-platform). IBM provides a live demonstration that receives sensor data directly from your smartphone, displays a 3D visualization of your phone and shows the sensor data, updating in real time as you move the phone. See https://developer.ibm.com/iotplatform/2017/12/07/use-device-simulator-watson-iot-platform for more information.

  2. 17.16 (Raspberry Pi and Internet of Things) IOTIFY is an IoT simulation service. Research IOTIFY, then follow their Hello IoT tutorial, which uses a simulated Raspberry Pi device.

  3. 17.17 (Streaming Stock Prices Dashboard with IEX, PubNub and Freeboard.io) Investigate the free stock-quote API provided by IEX (https://iextrading.com/) and the Python modules on GitHub that enable you to use IEX’s APIs in Python applications. Create a Python IEX client that receives quotes for specific companies (you can look up their stock ticker symbols online). Research how to publish to a PubNub channel, then publish the quotes to your channel. Use Freeboard.io to create a dashboard that subscribes to your PubNub channel and visualizes the stock prices as they arrive.
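
    A sketch of the publishing half, using the PubNub Python SDK (pip install pubnub); the keys are placeholders for those from your own PubNub account, and the quote shown is hypothetical stand-in data for what your IEX client would supply:

      from pubnub.pnconfiguration import PNConfiguration
      from pubnub.pubnub import PubNub

      config = PNConfiguration()
      config.publish_key = 'YOUR-PUBLISH-KEY'      # placeholder
      config.subscribe_key = 'YOUR-SUBSCRIBE-KEY'  # placeholder
      config.uuid = 'stock-quote-publisher'
      pubnub = PubNub(config)

      # hypothetical quote; build this from IEX API responses instead
      quote = {'symbol': 'AAPL', 'price': 150.25}
      pubnub.publish().channel('stock-quotes').message(quote).sync()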

  4. 17.18 (Project: Dweet.io and Dweepy) Use Dweet.io and Dweepy to implement a text-based chat client script. Each person running the script should specify their own username. By default, all clients publish and subscribe to the same channel. As an enhancement, enable the user to choose the channel to use.
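
    A sketch of the Dweepy plumbing; a complete chat client would send and receive concurrently (for example, listening on a separate thread), and the channel name here is a hypothetical shared “thing” name:

      import dweepy  # pip install dweepy

      CHANNEL = 'shared-chat-channel-2b18'  # hypothetical "thing" name
      username = input('Enter your username: ')

      # publish one message to the shared channel
      dweepy.dweet_for(CHANNEL, {'user': username, 'message': 'Hello!'})

      # block and print each dweet as it arrives on the channel
      for dweet in dweepy.listen_for_dweets_from(CHANNEL):
          content = dweet['content']
          print(f"{content.get('user', '?')}: {content.get('message', '')}")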

  5. 17.19 (Project: Freeboard on GitHub) Freeboard.io provides a free open-source version (with fewer options) on GitHub. Locate this version, install it on your system and use it to implement the dashboards we showed in Section 17.8.

  6. 17.20 (Project: PubNub and Bokeh) The Bokeh visualization library enables you to create dashboard visualizations from Python. In addition, it provides streaming support for dynamically updating visualizations. Investigate Bokeh’s streaming capabilities, then use them with PubNub’s simulated sensor stream to create a Python client that visualizes the sensor data.
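
    A sketch of the streaming mechanics as a Bokeh server application (run with bokeh serve --show filename.py); random values stand in here for the readings your PubNub subscriber would supply:

      import random
      from bokeh.io import curdoc
      from bokeh.models import ColumnDataSource
      from bokeh.plotting import figure

      source = ColumnDataSource({'x': [], 'y': []})
      plot = figure(title='Simulated sensor readings')
      plot.line(x='x', y='y', source=source)

      def update():
          # in the exercise, append values from your PubNub subscriber
          next_x = source.data['x'][-1] + 1 if len(source.data['x']) else 0
          source.stream({'x': [next_x], 'y': [random.random()]},
                        rollover=100)  # keep only the last 100 points

      curdoc().add_root(plot)
      curdoc().add_periodic_callback(update, 1000)  # once per second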

  7. 17.21 (Research Project: IoT for the Entrepreneur) If you’re inclined to start a company, IoT presents many opportunities. Research IoT opportunities for entrepreneurs and create and describe an original idea for a business.

  8. 17.22 (Research Project: Smart Watches and Activity Trackers) Research the wearable IoT devices Apple Watch and Fitbit. List the sensors each provides, what those sensors can monitor and the dashboards each offers to help you monitor your health.

  9. 17.23 (Research Project: Kafka Publish/Subscribe Messaging) In this chapter you studied streaming and publish/subscribe messaging. Apache Kafka (https://kafka.apache.org) supports real-time messaging, stream processing and storage, and is typically used to transform and process high-volume streaming data, such as website activity and streaming IoT data. Research the applications of Apache Kafka and the platforms that use it.

Platform Exercises

  1. 17.24 (Project: Spark with Databricks Community Edition) Databricks is an analytics platform created by the people who originally created Spark at U.C. Berkeley. In addition to being available through Amazon AWS and Microsoft Azure, they provide a free cloud-based Databricks Community Edition (https://databricks.com/product/faq/community-edition), which runs on AWS and enables you to learn about and experiment with Spark without having to install any software locally. In fact, they implemented all the examples in their book Spark: The Definitive Guide using the free Databricks Community Edition.

    Investigate the Databricks Community Edition’s capabilities and follow their Getting Started with Apache Spark tutorial at https://databricks.com/spark/getting-started-with-apache-spark. Their notebook format and commands are similar but not identical to Jupyter’s. Next, reimplement the Spark examples in Sections 17.6 and 17.7 using the Databricks Community Edition. To install Python modules into your Databricks cluster, follow the instructions at https://docs.databricks.com/user-guide/libraries.html. Like many of the data-science libraries we’ve used in the book, Databricks includes popular datasets you can use when learning Spark:

    https://docs.databricks.com/user-guide/faq/databricks-datasets.html

  2. 17.25 (Project: IBM Watson Analytics Engine) You can access Hadoop, Spark and other tools in the Hadoop ecosystem via IBM’s Watson Analytics Engine. To get started, the Watson Lite tier lets you create one cluster per 30-day period and use it for a maximum of 50 node hours, so you can evaluate the platform or test Hadoop and Spark tasks. IBM also provides a separate Apache Spark service and various other big-data-related services. Research Watson Analytics Engine, then use it to implement and run this chapter’s Hadoop and Spark examples. For a complete list of IBM’s services, see their catalog at: https://console.bluemix.net/catalog/

Other Exercises

  1. 17.26 (Research Project: Big Data in Baseball) Big-data analytics techniques have been employed by some baseball teams and are credited with helping the 2004 Red Sox and the 2016 Cubs win the World Series after long droughts. The books Moneyball and Big Data Baseball chronicle the data-analytics successes of the 2002 Oakland Athletics and the 2013 Pittsburgh Pirates, respectively. On the downside, the Wall Street Journal reported that, as a result of using data analytics, baseball games have become longer on average with less action. Read either or both of those books to gain insights into how big-data analytics are used in sports.

  2. 17.27 (Research Project: NewSQL Databases) Research the NewSQL databases VoltDB, MemSQL, Apache Ignite and Google Spanner and discuss their key features.

  3. 17.28 (Research Project: CRISPR Gene Editing) Research how big data is being used with CRISPR gene editing, then research and discuss the ethical and moral issues it raises.

  4. 17.29 (Research: Big-Data Ethics Conundrum) Suppose big-data analytics predicts that a person with no criminal record has a significant chance of committing a serious crime. Should the police arrest that person? Investigate the ethical issues raised by big data.

  5. 17.30 (Research Project: Privacy and Data Integrity Legislation) In the chapter, we mentioned HIPAA (the Health Insurance Portability and Accountability Act) in the United States and the GDPR (General Data Protection Regulation) in the European Union. Laws like these are becoming more common and stricter. Investigate each of these laws and how they affect big-data analytics.

  6. 17.31 (Research Project: Cross-Referencing Databases) Investigate and comment on the privacy issues caused by cross-referencing facts about individuals among various databases.

  7. 17.32 (Research Project: Personally Identifiable Information) Protecting users’ personally identifiable information (PII) is an important aspect of privacy. Research and comment on this issue in the context of big data.

  8. 17.33 (Research Project: Wikipedia as a Big-Data Source) Wikipedia is a popular big-data source. Investigate the capabilities Wikipedia offers for accessing its information. Be sure to check out the wikipedia Python module, then build an application that uses Wikipedia data.
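
    A minimal sketch of the wikipedia module’s core calls (pip install wikipedia) to get you started:

      import wikipedia

      print(wikipedia.search('Big data'))    # matching article titles
      print(wikipedia.summary('Big data', sentences=2))

      page = wikipedia.page('Big data')      # full article object
      print(page.url)
      print(page.content[:200])              # first 200 characters of text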