17.4 Case Study: A MongoDB JSON Document Database

MongoDB is a document database capable of storing and retrieving JSON documents. Twitter’s APIs return tweets to you as JSON objects, which you can write directly into a MongoDB database. In this section, you’ll stream tweets about the U.S. senators into a MongoDB database, summarize the top-10 senators by tweet count and display the results on an interactive Folium map.

You’ll use a free cloud-based MongoDB Atlas cluster, which requires no installation and currently allows you to store up to 512MB of data. To store more, you can download the MongoDB Community Server from:

https://www.mongodb.com/download-center/community

and run it locally, or you can sign up for MongoDB’s paid Atlas service.

Installing the Python Libraries Required for Interacting with MongoDB

You’ll use the pymongo library to interact with MongoDB databases from your Python code. You’ll also need the dnspython library to connect to a MongoDB Atlas Cluster. To install these libraries, use the following commands:


conda install -c conda-forge pymongo
conda install -c conda-forge dnspython
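
If you’re not using Anaconda, you can install both libraries with pip instead. These commands are equivalent, assuming a standard Python environment:


pip install pymongo dnspython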

keys.py

The ch17 examples folder’s TwitterMongoDB subfolder contains this example’s code and keys.py file. Edit this file to include your Twitter credentials and your OpenMapQuest key from the “Data Mining Twitter” chapter. After we discuss creating a MongoDB Atlas cluster, you’ll also need to add your MongoDB connection string to this file.
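
For reference, here’s a sketch of the variables keys.py defines, based on the names used later in this section. The placeholder values shown are not real credentials, and you’ll add mongo_connection_string’s value after creating your cluster:


# keys.py: replace each placeholder string with your own value
consumer_key = 'YourConsumerKey'
consumer_secret = 'YourConsumerSecret'
access_token = 'YourAccessToken'
access_token_secret = 'YourAccessTokenSecret'
mapquest_key = 'YourOpenMapQuestKey'
mongo_connection_string = 'YourMongoConnectionString'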

17.4.1 Creating the MongoDB Atlas Cluster

To sign up for a free account go to

https://mongodb.com

then enter your email address and click Get started free. On the next page, enter your name and create a password, then read their terms of service. If you agree, click Get started free on this page and you’ll be taken to the screen for setting up your cluster. Click Build my first cluster to get started.

They walk you through the getting started steps with popup bubbles that describe and point you to each task you need to complete. They provide default settings for their free Atlas cluster (M0 as they refer to it), so just give your cluster a name in the Cluster Name section, then click Create Cluster. At this point, they’ll take you to the Clusters page and begin creating your new cluster, which takes several minutes.

Next, a Connect to Atlas popup tutorial will appear, showing a checklist of additional steps required to get you up and running:

  • Create your first database user—This enables you to log into your cluster.

  • Whitelist your IP address—This is a security measure which ensures that only IP addresses you verify are allowed to interact with your cluster. To connect to this cluster from multiple locations (school, home, work, etc.), you’ll need to whitelist each IP address from which you intend to connect.

  • Connect to your cluster—In this step, you’ll locate your cluster’s connection string, which will enable your Python code to connect to the server.

Creating Your First Database User

In the popup tutorial window, click Create your first database user to continue the tutorial, then follow the on-screen prompts to view the cluster’s Security tab and click + ADD NEW USER. In the Add New User dialog, create a username and password. Write these down—you’ll need them momentarily. Click Add User to return to the Connect to Atlas popup tutorial.

Whitelist Your IP Address

In the popup tutorial window, click Whitelist your IP address to continue the tutorial, then follow the on-screen prompts to view the cluster’s IP Whitelist and click + ADD IP ADDRESS. In the Add Whitelist Entry dialog, you can either add your computer’s current IP address or allow access from anywhere, which they do not recommend for production databases, but is OK for learning purposes. Click ALLOW ACCESS FROM ANYWHERE then click Confirm to return to the Connect to Atlas popup tutorial.

Connect to Your Cluster

In the popup tutorial window, click Connect to your cluster to continue the tutorial, then follow the on-screen prompts to view the cluster’s Connect to YourClusterName dialog. Connecting to a MongoDB Atlas database from Python requires a connection string. To get your connection string, click Connect Your Application, then click Short SRV connection string. Your connection string will appear below Copy the SRV address. Click COPY to copy the string. Paste this string into the keys.py file as mongo_connection_string’s value. Replace "<PASSWORD>" in the connection string with your password, and replace the database name "test" with "senators", which will be the database name in this example. At the bottom of the Connect to YourClusterName dialog, click Close. You’re now ready to interact with your Atlas cluster.
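
Your exact string comes from the Atlas UI, but an SRV connection string generally takes a form like the following, in which the username and cluster host are hypothetical. After the edits described above, the database-name segment should read senators:


mongodb+srv://YourUsername:YourPassword@yourcluster-abc12.mongodb.net/senators?retryWrites=true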

17.4.2 Streaming Tweets into MongoDB

First we’ll present an interactive IPython session that connects to the MongoDB database, downloads current tweets via Twitter streaming and summarizes the top-10 senators by tweet count. Next, we’ll present class TweetListener, which handles the incoming tweets and stores their JSON in MongoDB. Finally, we’ll continue the IPython session by creating an interactive Folium map that displays information from the tweets we stored.

Use Tweepy to Authenticate with Twitter

First, let’s use Tweepy to authenticate with Twitter:


In [1]: import tweepy, keys

In [2]: auth = tweepy.OAuthHandler(
   ...:     keys.consumer_key, keys.consumer_secret)
   ...: auth.set_access_token(keys.access_token,
   ...:     keys.access_token_secret)
   ...:

Next, configure the Tweepy API object to wait if our app reaches any Twitter rate limits:


In [3]: api = tweepy.API(auth, wait_on_rate_limit=True,
   ...:                   wait_on_rate_limit_notify=True)
   ...:

Loading the Senators’ Data

We’ll use the information in the file senators.csv (located in the ch17 examples folder’s TwitterMongoDB subfolder) to track tweets to, from and about every U.S. senator. The file contains each senator’s two-letter state code, name, party, Twitter handle and Twitter ID.

Twitter enables you to follow specific users via their numeric Twitter IDs, but these must be submitted as string representations of those numeric values. So, let’s load senators.csv into pandas, convert the TwitterID values to strings (using Series method astype) and display several rows of data. In this case, we set 6 as the maximum number of columns to display. Later we’ll add another column to the DataFrame and this setting will ensure that all the columns are displayed, rather than a few with … in between:


In [4]: import pandas as pd

In [5]: senators_df = pd.read_csv('senators.csv')

In [6]: senators_df['TwitterID'] = senators_df['TwitterID'].astype(str)

In [7]: pd.options.display.max_columns = 6

In [8]: senators_df.head()
Out[8]:
  State            Name Party    TwitterHandle           TwitterID
0    AL  Richard Shelby     R        SenShelby            21111098
1    AL      Doug Jones     D     SenDougJones  941080085121175552
2    AK  Lisa Murkowski     R    lisamurkowski            18061669
3    AK    Dan Sullivan     R   SenDanSullivan          2891210047
4    AZ         Jon Kyl     R        SenJonKyl            24905240

Configuring the MongoClient

To store the tweets’ JSON as documents in a MongoDB database, you must first connect to your MongoDB Atlas cluster via a pymongo MongoClient, which receives your cluster’s connection string as its argument:


In [9]: from pymongo import MongoClient

In [10]: atlas_client = MongoClient(keys.mongo_connection_string)

Now, we can get a pymongo Database object representing the senators database. The following statement creates the database if it does not exist:


In [11]: db = atlas_client.senators
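
pymongo also supports dictionary-style access, which is handy when a database name is not a valid Python identifier. The following statement is equivalent:


db = atlas_client['senators']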

Setting Up the Tweet Stream

Let’s specify the number of tweets to download and create the TweetListener. We pass the db object representing the MongoDB database to the TweetListener so it can write the tweets into the database. Depending on the rate at which people are tweeting about the senators, it may take minutes to hours to get 10,000 tweets. For testing purposes, you might want to use a smaller number:


In [12]: from tweetlistener import TweetListener

In [13]: tweet_limit = 10000

In [14]: twitter_stream = tweepy.Stream(api.auth,
    ...:     TweetListener(api, db, tweet_limit))
    ...:

Starting the Tweet Stream

Twitter live streaming allows you to track up to 400 keywords and follow up to 5,000 Twitter IDs at a time. In this case, let’s track the senators’ Twitter handles and follow the senators’ Twitter IDs. This should give us tweets from, to and about each senator. To show progress, we display the screen name and time stamp for each tweet received, and the total number of tweets so far. To save space, we show here only one of those tweet outputs and replace the user’s screen name with XXXXXXX:


In [15]: twitter_stream.filter(track=senators_df.TwitterHandle.tolist(),
    ...:     follow=senators_df.TwitterID.tolist())
    ...:
    Screen name: XXXXXXX
     Created at: Sun Dec 16 17:19:19 +0000 2018
Tweets received: 1
...

Class TweetListener

For this example, we slightly modified class TweetListener from the “Data Mining Twitter” chapter. Much of the Twitter and Tweepy code shown below is identical to the code you saw previously, so we’ll focus on only the new concepts here:


1 # tweetlistener.py
2 """TweetListener downloads tweets and stores them in MongoDB."""
3 import json
4 import tweepy
5
6 class TweetListener(tweepy.StreamListener):
7     """Handles incoming Tweet stream."""
8
9     def __init__(self, api, database, limit=10000):
10         """Create instance variables for tracking number of tweets."""
11         self.db = database
12         self.tweet_count = 0
13         self.TWEET_LIMIT = limit # 10,000 by default
14         super().__init__(api) # call superclass's init
15
16     def on_connect(self):
17         """Called when your connection attempt is successful, enabling
18         you to perform appropriate application tasks at that point."""
19         print('Successfully connected to Twitter\n')
20
21     def on_data(self, data):
22         """Called when Twitter pushes a new tweet to you."""
23         self.tweet_count += 1 # track number of tweets processed
24         json_data = json.loads(data) # convert string to JSON
25         self.db.tweets.insert_one(json_data) # store in tweets collection
26         print(f'    Screen name: {json_data["user"]["name"]}')
27         print(f'     Created at: {json_data["created_at"]}')
28         print(f'Tweets received: {self.tweet_count}')
29
30         # if TWEET_LIMIT is reached, return False to terminate streaming
31         return self.tweet_count != self.TWEET_LIMIT
32
33     def on_error(self, status):
34         print(status)
35         return True

Previously, TweetListener overrode method on_status to receive Tweepy Status objects representing tweets. Here, we override the on_data method instead (lines 21–31). Rather than Status objects, on_data receives each tweet object’s raw JSON. Line 24 converts the JSON string received by on_data into a Python dictionary. Each MongoDB database contains one or more Collections of documents. In line 25, the expression

self.db.tweets

accesses the Database object db’s tweets Collection, creating it if it does not already exist. Line 25 uses the tweets Collection’s insert_one method to store the JSON object in the tweets collection.
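
To verify that tweets are being stored, you can query the collection interactively. Here’s a quick sanity check, assuming the stream has already run: find_one retrieves a single document (it returns None if the collection is empty), and count_documents with an empty filter counts every document in the collection:


db.tweets.find_one() # returns one stored tweet document, or None
db.tweets.count_documents({}) # total documents stored so far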

Counting Tweets for Each Senator

Next, we’ll perform a full-text search on the collection of tweets and count the number of tweets containing each senator’s Twitter handle. To perform a text search in MongoDB, you must first create a text index for the collection. This specifies which document field(s) to search. Each text index is defined as a tuple containing the field name to search and the index type ('text'). MongoDB’s wildcard specifier ($**) indicates that every text field in a document (a JSON tweet object in our case) should be indexed for a full-text search:


In [16]: db.tweets.create_index([('$**', 'text')])
Out[16]: '$**_text'

Once the index is defined, we can use the Collection’s count_documents method to count the total number of documents in the collection that contain the specified text. Let’s search the database’s tweets collection for every Twitter handle in the senators_df DataFrame’s TwitterHandle column:


In [17]: tweet_counts = []

In [18]: for senator in senators_df.TwitterHandle:
    ...:     tweet_counts.append(db.tweets.count_documents(
    ...:         {"$text": {"$search": senator}}))
    ...:

The JSON object passed to count_documents in this case indicates that we’re using the collection’s text index to search for the value of senator.
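
Note that MongoDB treats a multiword $search string as a match on any of the words. To count only documents containing an exact phrase, you can wrap the phrase in double quotes inside the search string. A small sketch, in which the phrase is just an illustration:


db.tweets.count_documents({'$text': {'$search': '"Lindsey Graham"'}})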

Show Tweet Counts for Each Senator

Let’s create a copy of the DataFrame senators_df that contains the tweet_counts as a new column, then display the top-10 senators by tweet count:


In [19]: tweet_counts_df = senators_df.assign(Tweets=tweet_counts)

In [20]: tweet_counts_df.sort_values(by='Tweets',
    ...:     ascending=False).head(10)
    ...:
Out[20]:
   State                Name Party    TwitterHandle    TwitterID  Tweets
78    SC      Lindsey Graham     R  LindseyGrahamSC    432895323    1405
41    MA    Elizabeth Warren     D        SenWarren    970207298    1249
8     CA    Dianne Feinstein     D     SenFeinstein    476256944    1079
20    HI        Brian Schatz     D      brianschatz     47747074     934
62    NY       Chuck Schumer     D       SenSchumer     17494010     811
24    IL     Tammy Duckworth     D     SenDuckworth   1058520120     656
13    CT  Richard Blumenthal     D    SenBlumenthal    278124059     646
21    HI        Mazie Hirono     D      maziehirono     92186819     628
86    UT         Orrin Hatch     R    SenOrrinHatch    262756641     506
77    RI  Sheldon Whitehouse     D    SenWhitehouse    242555999     350

Get the State Locations for Plotting Markers

Next, we’ll use the techniques you learned in the “Data Mining Twitter” chapter to get each state’s latitude and longitude coordinates. We’ll soon use these to place on a Folium map popup markers that contain the names and numbers of tweets mentioning each state’s senators.

The file state_codes.py contains a state_codes dictionary that maps two-letter state codes to their full state names. We’ll use the full state names with geopy’s OpenMapQuest geocode function to look up the location of each state. First, let’s import the libraries we need and the state_codes dictionary:


In [21]: from geopy import OpenMapQuest

In [22]: import time

In [23]: from state_codes import state_codes

Next, let’s get the geocoder object to translate location names into Location objects:

In [24]: geo = OpenMapQuest(api_key=keys.mapquest_key)

There are two senators from each state, so we can look up each state’s location once and use the Location object for both senators from that state. Let’s get the unique state names, then sort them into ascending order:


In [25]: states = tweet_counts_df.State.unique()

In [26]: states.sort()

The next two snippets use code from the “Data Mining Twitter” chapter to look up each state’s location. In snippet [28], we call the geocode function with the state name followed by ', USA' to ensure that we get United States locations, since there are places outside the United States with the same names as U.S. states. To show progress, we display each new Location object’s string:


In [27]: locations = []

In [28]: for state in states:
    ...:     processed = False
    ...:     delay = .1
    ...:     while not processed:
    ...:         try:
    ...:             locations.append(
    ...:                 geo.geocode(state_codes[state] + ', USA'))
    ...:             print(locations[-1])
    ...:             processed = True
    ...:         except: # timed out, so wait before trying again
    ...:             print('OpenMapQuest service timed out. Waiting.')
    ...:             time.sleep(delay)
    ...:             delay += .1
    ...:
Alaska, United States of America
Alabama, United States of America
Arkansas, United States of America
...

Grouping the Tweet Counts by State

We’ll use the total number of tweets for the two senators in a state to color that state on the map. Darker colors will represent the states with higher tweet counts. To prepare the data for mapping, let’s use the pandas DataFrame method groupby to group the senators by state and calculate the total tweets by state:


In [30]: tweets_counts_by_state = tweet_counts_df.groupby(
    ...:     'State', as_index=False).sum()
    ...:

In [31]: tweets_counts_by_state.head()
Out[31]:
  State  Tweets
0    AK      27
1    AL       2
2    AR      47
3    AZ      47
4    CA    1135

The as_index=False keyword argument in snippet [30] indicates that the state codes should be values in a column of the result, rather than the indices for the rows. The sum method totals each group’s numeric data (the tweets by state). Snippet [31] displays several rows of the resulting DataFrame so you can see some of the results.
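
If the groupby/sum pattern is new to you, here’s a minimal standalone illustration with made-up data showing the effect of as_index=False:


import pandas as pd

df = pd.DataFrame({'State': ['AK', 'AK', 'AL', 'AL'],
                   'Tweets': [10, 17, 1, 1]})
df.groupby('State', as_index=False).sum() # 'State' remains a column
#   State  Tweets
# 0    AK      27
# 1    AL       2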

Creating the Map

Next, let’s create the map. You may want to adjust the zoom. On our system, the following snippet creates a map in which we initially can see only the continental United States. Remember that Folium maps are interactive, so once the map is displayed, you can scroll to zoom in and out or drag to see different areas, such as Alaska or Hawaii:


In [32]: import folium

In [33]: usmap = folium.Map(location=[39.8283, -98.5795],
    ...:                    zoom_start=4, detect_retina=True,
    ...:                    tiles='Stamen Toner')
    ...:

Creating a Choropleth to Color the Map

A choropleth shades areas in a map using the values you specify to determine color. Let’s create a choropleth that colors the states by the number of tweets containing their senators’ Twitter handles. First, download Folium’s us-states.json file from

https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json

and save it to the folder containing this example. This file contains a JSON dialect called GeoJSON (Geographic JSON) that describes the boundaries of shapes—in this case, the boundaries of every U.S. state. The choropleth uses this information to shade each state. You can learn more about GeoJSON at http://geojson.org/. The following snippets create the choropleth, then add it to the map:


In [34]: choropleth = folium.Choropleth(
    ...:     geo_data='us-states.json',
    ...:     name='choropleth',
    ...:     data=tweets_counts_by_state,
    ...:     columns=['State', 'Tweets'],
    ...:     key_on='feature.id',
    ...:     fill_color='YlOrRd',
    ...:     fill_opacity=0.7,
    ...:     line_opacity=0.2,
    ...:     legend_name='Tweets by State'
    ...: ).add_to(usmap)
    ...:
    
In [35]: layer = folium.LayerControl().add_to(usmap)

In this case, we used the following arguments:

  • geo_data='us-states.json'—This is the file containing the GeoJSON that specifies the shapes to color.

  • name='choropleth'—Folium displays the Choropleth as a layer over the map. This is the name for that layer that will appear in the map’s layer controls, which enable you to hide and show the layers. These controls appear when you click the layers icon on the map.

  • data=tweets_counts_by_state—This is a pandas DataFrame (or Series) containing the values that determine the Choropleth colors.

  • columns=['State', 'Tweets']—When the data is a DataFrame, this is a list of two columns representing the keys and the corresponding values used to color the Choropleth.

  • key_on='feature.id'—This is a variable in the GeoJSON file to which the Choropleth binds the values in the columns argument (see the inspection sketch after this list).

  • fill_color='YlOrRd'—This is a color map specifying the colors to use to fill in the states. Folium provides 12 colormaps: 'BuGn', 'BuPu', 'GnBu', 'OrRd', 'PuBu', 'PuBuGn', 'PuRd', 'RdPu', 'YlGn', 'YlGnBu', 'YlOrBr' and 'YlOrRd'. You should experiment with these to find the most effective and eye-pleasing ones for your application(s).

  • fill_opacity=0.7—A value from 0.0 (transparent) to 1.0 (opaque) specifying the opacity of the fill colors displayed in the states.

  • line_opacity=0.2—A value from 0.0 (transparent) to 1.0 (opaque) specifying the opacity of the lines used to delineate the states.

  • legend_name='Tweets by State'—At the top of the map, the Choropleth displays a color bar (the legend) indicating the value range represented by the colors. This legend_name text appears below the color bar to indicate what the colors represent.
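
To see what key_on='feature.id' binds to, you can inspect us-states.json with Python’s json module. Here’s a minimal sketch, assuming the file is in the current folder; in Folium’s sample file, each feature stores a two-letter state code in its 'id' field, which is what gets matched against the values in the DataFrame’s 'State' column:


import json

with open('us-states.json') as f:
    geo_json = json.load(f)

print(geo_json['features'][0]['id']) # a state code, such as 'AL'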

The complete list of Choropleth keyword arguments is documented at:

http://python-visualization.github.io/folium/modules.html#folium.features.Choropleth

Creating the Map Markers for Each State

Next, we’ll create Markers for each state. To ensure that the senators are displayed in descending order by the number of tweets in each state’s Marker, let’s sort tweet_counts_df in descending order by the 'Tweets' column:


In [36]: sorted_df = tweet_counts_df.sort_values(
    ...:     by='Tweets', ascending=False)
    ...:

The loop in the following snippet creates the Markers. First,

sorted_df.groupby('State')

groups sorted_df by 'State'. A DataFrame’s groupby method maintains the original row order in each group. Within a given group, the senator with the most tweets will be first, because we sorted the senators in descending order by tweet count in snippet [36]:


In [37]: for index, (name, group) in enumerate(sorted_df.groupby('State')):
    ...:     strings = [state_codes[name]] # used to assemble popup text
    ...:
    ...:     for s in group.itertuples():
    ...:         strings.append(
    ...:             f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
    ...:
    ...:     text = '<br>'.join(strings)
    ...:     marker = folium.Marker(
    ...:         (locations[index].latitude, locations[index].longitude),
    ...:         popup=text)
    ...:     marker.add_to(usmap)
    ...:
    ...:

We pass the grouped DataFrame to enumerate, so we can get an index for each group, which we’ll use to look up each state’s Location in the locations list. Each group has a name (the state code we grouped by) and a collection of items in that group (the two senators for that state). The loop operates as follows:

  • We look up the full state name in the state_codes dictionary, then store it in the strings list—we’ll use this list to assemble the Marker’s popup text.

  • The nested loop walks through the items in the group collection, returning each as a named tuple that contains a given senator’s data. We create a formatted string for the current senator containing the person’s name, party and number of tweets, then append that to the strings list.

  • The Marker text can use HTML for formatting. We join the strings list’s elements, separating each from the next with an HTML <br> element which creates a new line in HTML.

  • We create the Marker. The first argument is the Marker’s location as a tuple containing the latitude and longitude. The popup keyword argument specifies the text to display if the user clicks the Marker.

  • We add the Marker to the map.

Displaying the Map

Finally, let’s save the map into an HTML file:

In [38]: usmap.save('SenatorsTweets.html')
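
If you’re working in a Jupyter Notebook, you also can render the map inline. Evaluating the Map object in a cell displays it via the object’s HTML representation:


usmap # in a Jupyter Notebook, this displays the interactive map inline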

Open the HTML file in your web browser to view and interact with the map. Recall that you can drag the map to see Alaska and Hawaii. Here we show the popup text for the South Carolina marker:

[Figure: An interactive map of the U.S. showing a marker in each state. The popup text for the South Carolina marker reads: South Carolina; Lindsey Graham (R), Tweets: 1405; Tim Scott (R), Tweets: 11.]

An exercise at the end of the chapter asks you to use the sentiment-analysis techniques you learned in previous chapters to rate as positive, neutral or negative the sentiment expressed by people who send tweets (“tweeters”) mentioning each senator’s handle.

Self Check for Section 17.4

  1. (Write a Statement) Assuming that atlas_client is a pymongo MongoClient that’s connected to a MongoDB Atlas cluster, write a statement that creates a new database with the name football_players and stores the resulting object in db.
    Answer: db = atlas_client.football_players.

  2. (Fill-In) A pymongo Collection object’s       method inserts a new document into the Collection.
    Answer: insert_one.

  3. (True/False) Once you’ve inserted documents in a Collection, you can immediately text-search their contents.
    Answer: False. To perform a text search, you must first create a text index for the collection specifying which document field(s) to search.

  4. (Fill-In) A folium       uses GeoJSON to add color to a map.
    Answer: Choropleth.