MongoDB is a document database capable of storing and retrieving JSON documents. Twitter’s APIs return tweets to you as JSON objects, which you can write directly into a MongoDB database. In this section, you’ll:
use Tweepy to stream tweets about the 100 U.S. senators and store them into a MongoDB database,
use pandas to summarize the top 10 senators by tweet activity and
display an interactive Folium map of the United States with one popup marker per state that shows the state name and both senators’ names, their political parties and tweet counts.
You’ll use a free cloud-based MongoDB Atlas cluster, which requires no installation and currently allows you to store up to 512MB of data. To store more, you can download the MongoDB Community Server from:
https://www.mongodb.com/download-center/community
and run it locally or you can sign up for MongoDB’s paid Atlas service.
You’ll use the pymongo library to interact with MongoDB databases from your Python code. You’ll also need the dnspython library to connect to a MongoDB Atlas cluster. To install these libraries, use the following commands:
conda install -c conda-forge pymongo
conda install -c conda-forge dnspython
keys.py
The ch17 examples folder’s TwitterMongoDB subfolder contains this example’s code and keys.py file. Edit this file to include your Twitter credentials and your OpenMapQuest key from the “Data Mining Twitter” chapter. After we discuss creating a MongoDB Atlas cluster, you’ll also need to add your MongoDB connection string to this file.
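The code in this example expects keys.py to define the following variables. Here is a minimal sketch with placeholder values; replace each placeholder with your own credentials:

# keys.py (sketch): placeholder values only
consumer_key = 'YourConsumerKey'
consumer_secret = 'YourConsumerSecret'
access_token = 'YourAccessToken'
access_token_secret = 'YourAccessTokenSecret'
mapquest_key = 'YourOpenMapQuestKey'
mongo_connection_string = 'YourMongoConnectionString'  # added after cluster setup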
Creating the MongoDB Atlas Cluster
To sign up for a free account, go to https://mongodb.com, then enter your email address and click Get started free. On the next page, enter your name and create a password, then read their terms of service. If you agree, click Get started free on this page and you’ll be taken to the screen for setting up your cluster. Click Build my first cluster to get started.
They walk you through the getting-started steps with popup bubbles that describe and point you to each task you need to complete. They provide default settings for their free Atlas cluster (M0, as they refer to it), so just give your cluster a name in the Cluster Name section, then click Create Cluster. At this point, they’ll take you to the Clusters page and begin creating your new cluster, which takes several minutes.
Next, a Connect to Atlas popup tutorial will appear, showing a checklist of additional steps required to get you up and running:
Create your first database user—This enables you to log into your cluster.
Whitelist your IP address—This is a security measure which ensures that only IP addresses you verify are allowed to interact with your cluster. To connect to this cluster from multiple locations (school, home, work, etc.), you’ll need to whitelist each IP address from which you intend to connect.
Connect to your cluster—In this step, you’ll locate your cluster’s connection string, which will enable your Python code to connect to the server.
In the popup tutorial window, click Create your first database user to continue the tutorial, then follow the on-screen prompts to view the cluster’s Security tab and click + ADD NEW USER. In the Add New User dialog, create a username and password. Write these down—you’ll need them momentarily. Click Add User to return to the Connect to Atlas popup tutorial.
In the popup tutorial window, click Whitelist your IP address to continue the tutorial, then follow the on-screen prompts to view the cluster’s IP Whitelist and click + ADD IP ADDRESS. In the Add Whitelist Entry dialog, you can either add your computer’s current IP address or allow access from anywhere, which they do not recommend for production databases but is OK for learning purposes. Click ALLOW ACCESS FROM ANYWHERE, then click Confirm to return to the Connect to Atlas popup tutorial.
In the popup tutorial window, click Connect to your cluster to continue the tutorial, then follow the on-screen prompts to view the cluster’s Connect to YourClusterName dialog. Connecting to a MongoDB Atlas database from Python requires a connection string. To get your connection string, click Connect Your Application, then click Short SRV connection string. Your connection string will appear below Copy the SRV address. Click COPY to copy the string. Paste this string into the keys.py file as mongo_connection_string’s value. Replace "<PASSWORD>" in the connection string with your password, and replace the database name "test" with "senators", which will be the database name in this example. At the bottom of the Connect to YourClusterName dialog, click Close. You’re now ready to interact with your Atlas cluster.
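For reference, a short SRV connection string has the following general form. This is a hypothetical example reflecting the edits described above; your username, password and cluster host will differ:

# in keys.py; hypothetical values, with the database name set to 'senators'
mongo_connection_string = ('mongodb+srv://myuser:mypassword'
    '@yourcluster-abcde.mongodb.net/senators?retryWrites=true')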
First we’ll present an interactive IPython session that connects to the MongoDB database, downloads current tweets via Twitter streaming and summarizes the top-10 senators by tweet count. Next, we’ll present class TweetListener, which handles the incoming tweets and stores their JSON in MongoDB. Finally, we’ll continue the IPython session by creating an interactive Folium map that displays information from the tweets we stored.
First, let’s use Tweepy to authenticate with Twitter:
In [1]: import tweepy, keys
In [2]: auth = tweepy.OAuthHandler(
...: keys.consumer_key, keys.consumer_secret)
...: auth.set_access_token(keys.access_token,
...: keys.access_token_secret)
...:
Next, configure the Tweepy API object to wait if our app reaches any Twitter rate limits:
In [3]: api = tweepy.API(auth, wait_on_rate_limit=True,
...: wait_on_rate_limit_notify=True)
...:
We’ll use the information in the file senators.csv (located in the ch17 examples folder’s TwitterMongoDB subfolder) to track tweets to, from and about every U.S. senator. The file contains each senator’s two-letter state code, name, party, Twitter handle and Twitter ID.
Twitter enables you to follow specific users via their numeric Twitter IDs, but these must be submitted as string representations of those numeric values. So, let’s load senators.csv into pandas, convert the TwitterID values to strings (using Series method astype) and display several rows of data. In this case, we set 6 as the maximum number of columns to display. Later we’ll add another column to the DataFrame, and this setting will ensure that all the columns are displayed, rather than a few with … in between:
In [4]: import pandas as pd
In [5]: senators_df = pd.read_csv('senators.csv')
In [6]: senators_df['TwitterID'] = senators_df['TwitterID'].astype(str)
In [7]: pd.options.display.max_columns = 6
In [8]: senators_df.head()
Out[8]:
State Name Party TwitterHandle TwitterID
0 AL Richard Shelby R SenShelby 21111098
1 AL Doug Jones D SenDougJones 941080085121175552
2 AK Lisa Murkowski R lisamurkowski 18061669
3 AK Dan Sullivan R SenDanSullivan 2891210047
4 AZ Jon Kyl R SenJonKyl 24905240
MongoClient
To store the tweets’ JSON as documents in a MongoDB database, you must first connect to your MongoDB Atlas cluster via a pymongo MongoClient, which receives your cluster’s connection string as its argument:
In [9]: from pymongo import MongoClient
In [10]: atlas_client = MongoClient(keys.mongo_connection_string)
Now, we can get a pymongo Database object representing the senators database. The following statement creates the database if it does not exist:
In [11]: db = atlas_client.senators
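pymongo also supports dictionary-style access, which produces the same Database object and is useful when a database’s name is not a valid Python identifier:

db = atlas_client['senators']  # equivalent to atlas_client.senators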
Let’s specify the number of tweets to download and create the TweetListener. We pass the db object representing the MongoDB database to the TweetListener so it can write the tweets into the database. Depending on the rate at which people are tweeting about the senators, it may take minutes to hours to get 10,000 tweets. For testing purposes, you might want to use a smaller number:
In [12]: from tweetlistener import TweetListener
In [13]: tweet_limit = 10000
In [14]: twitter_stream = tweepy.Stream(api.auth,
...: TweetListener(api, db, tweet_limit))
...:
Twitter live streaming allows you to track up to 400 keywords and follow up to 5,000 Twitter IDs at a time. In this case, let’s track the senators’ Twitter handles and follow the senators’ Twitter IDs. This should give us tweets from, to and about each senator. To show you progress, we display the screen name and time stamp for each tweet received, and the total number of tweets so far. To save space, we show here only one of those tweet outputs and replace the user’s screen name with XXXXXXX:
In [15]: twitter_stream.filter(track=senators_df.TwitterHandle.tolist(),
...: follow=senators_df.TwitterID.tolist())
...:
Screen name: XXXXXXX
Created at: Sun Dec 16 17:19:19 +0000 2018
Tweets received: 1
...
TweetListener
For this example, we slightly modified class TweetListener from the “Data Mining Twitter” chapter. Much of the Twitter and Tweepy code shown below is identical to the code you saw previously, so we’ll focus only on the new concepts here:
1 # tweetlistener.py
2 """TweetListener downloads tweets and stores them in MongoDB."""
3 import json
4 import tweepy
5
6 class TweetListener(tweepy.StreamListener):
7 """Handles incoming Tweet stream."""
8
9 def __init__(self, api, database, limit=10000):
10 """Create instance variables for tracking number of tweets."""
11 self.db = database
12 self.tweet_count = 0
13 self.TWEET_LIMIT = limit # 10,000 by default
14 super().__init__(api) # call superclass's init
15
16 def on_connect(self):
17 """Called when your connection attempt is successful, enabling
18 you to perform appropriate application tasks at that point."""
19 print('Successfully connected to Twitter\n')
20
21 def on_data(self, data):
22 """Called when Twitter pushes a new tweet to you."""
23 self.tweet_count += 1 # track number of tweets processed
24 json_data = json.loads(data) # convert string to JSON
25 self.db.tweets.insert_one(json_data) # store in tweets collection
26 print(f' Screen name: {json_data["user"]["name"]}')
27 print(f' Created at: {json_data["created_at"]}')
28 print(f'Tweets received: {self.tweet_count}')
29
30 # if TWEET_LIMIT is reached, return False to terminate streaming
31 return self.tweet_count != self.TWEET_LIMIT
32
33 def on_error(self, status):
34 print(status)
35 return True
Previously, TweetListener overrode method on_status to receive Tweepy Status objects representing tweets. Here, we override the on_data method instead (lines 21–31). Rather than Status objects, on_data receives each tweet object’s raw JSON. Line 24 converts the JSON string received by on_data into a Python JSON object. Each MongoDB database contains one or more Collections of documents. In line 25, the expression self.db.tweets accesses the Database object db’s tweets Collection, creating it if it does not already exist. Line 25 uses the tweets Collection’s insert_one method to store the JSON object in the tweets collection.
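Once a few tweets have arrived, you can verify what was stored. A quick sanity check, assuming the db object from the IPython session above (Twitter’s tweet JSON includes a user object and a created_at field):

# fetch one stored tweet document and display a couple of its fields
doc = db.tweets.find_one()
print(doc['user']['screen_name'], doc['created_at'])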
Next, we’ll perform a full-text search on the collection of tweets and count the number of tweets containing each senator’s Twitter handle. To text search in MongoDB, you must create a text index for the collection. This specifies which document field(s) to search. Each text index is defined as a tuple containing the field name to search and the index type ('text'). MongoDB’s wildcard specifier ($**) indicates that every text field in a document (a JSON tweet object in our case) should be indexed for a full-text search:
In [16]: db.tweets.create_index([('$**', 'text')])
Out[16]: '$**_text'
Once the index is defined, we can use the Collection’s count_documents method to count the total number of documents in the collection that contain the specified text. Let’s search the database’s tweets collection for every Twitter handle in the senators_df DataFrame’s TwitterHandle column:
In [17]: tweet_counts = []
In [18]: for senator in senators_df.TwitterHandle:
...: tweet_counts.append(db.tweets.count_documents(
...: {"$text": {"$search": senator}}))
...:
The JSON object passed to count_documents in this case indicates that we’re using the index named text to search for the value of senator.
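If a search unexpectedly finds no documents, you can confirm that the index exists with the Collection’s index_information method. The wildcard text index created in snippet [16] appears under the name '$**_text':

print(db.tweets.index_information())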
Let’s create a copy of the DataFrame senators_df that contains the tweet_counts as a new column, then display the top-10 senators by tweet count:
In [19]: tweet_counts_df = senators_df.assign(Tweets=tweet_counts)
In [20]: tweet_counts_df.sort_values(by='Tweets',
...: ascending=False).head(10)
...:
Out[20]:
State Name Party TwitterHandle TwitterID Tweets
78 SC Lindsey Graham R LindseyGrahamSC 432895323 1405
41 MA Elizabeth Warren D SenWarren 970207298 1249
8 CA Dianne Feinstein D SenFeinstein 476256944 1079
20 HI Brian Schatz D brianschatz 47747074 934
62 NY Chuck Schumer D SenSchumer 17494010 811
24 IL Tammy Duckworth D SenDuckworth 1058520120 656
13 CT Richard Blumenthal D SenBlumenthal 278124059 646
21 HI Mazie Hirono D maziehirono 92186819 628
86 UT Orrin Hatch R SenOrrinHatch 262756641 506
77 RI Sheldon Whitehouse D SenWhitehouse 242555999 350
Next, we’ll use the techniques you learned in the “Data Mining Twitter” chapter to get each state’s latitude and longitude coordinates. We’ll soon use these coordinates to place popup markers on a Folium map showing each state’s senators and the numbers of tweets that mention them.
The file state_codes.py contains a state_codes dictionary that maps two-letter state codes to their full state names. We’ll use the full state names with geopy’s OpenMapQuest geocode function to look up the location of each state. First, let’s import the libraries we need and the state_codes dictionary:
In [21]: from geopy import OpenMapQuest
In [22]: import time
In [23]: from state_codes import state_codes
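The module state_codes.py is not listed in this section. A minimal sketch of its dictionary, assuming the standard two-letter postal codes:

# state_codes.py (sketch): two-letter codes mapped to full state names
state_codes = {
    'AK': 'Alaska',
    'AL': 'Alabama',
    'AR': 'Arkansas',
    # ...entries for the remaining states...
    'WY': 'Wyoming'
}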
Next, let’s get the geocoder object to translate location names into Location objects:
In [24]: geo = OpenMapQuest(api_key=keys.mapquest_key)
There are two senators from each state, so we can look up each state’s location once and use the Location object for both senators from that state. Let’s get the unique state names, then sort them into ascending order:
In [25]: states = tweet_counts_df.State.unique()
In [26]: states.sort()
The next two snippets use code from the “Data Mining Twitter” chapter to look up each state’s location. In snippet [28], we call the geocode function with the state name followed by ', USA' to ensure that we get United States locations, since there are places outside the United States with the same names as U.S. states. To show progress, we display each new Location object’s string:
In [27]: locations = []
In [28]: for state in states:
...: processed = False
...: delay = .1
...: while not processed:
...: try:
...: locations.append(
...: geo.geocode(state_codes[state] + ', USA'))
...: print(locations[-1])
...: processed = True
...: except: # timed out, so wait before trying again
...: print('OpenMapQuest service timed out. Waiting.')
...: time.sleep(delay)
...: delay += .1
...:
Alaska, United States of America
Alabama, United States of America
Arkansas, United States of America
...
We’ll use the total number of tweets for the two senators in a state to color that state on the map. Darker colors will represent the states with higher tweet counts. To prepare the data for mapping, let’s use the pandas DataFrame method groupby to group the senators by state and calculate the total tweets by state:
In [30]: tweets_counts_by_state = tweet_counts_df.groupby(
...: 'State', as_index=False).sum()
...:
In [31]: tweets_counts_by_state.head()
Out[31]:
State Tweets
0 AK 27
1 AL 2
2 AR 47
3 AZ 47
4 CA 1135
The as_index=False keyword argument in snippet [30] indicates that the state codes should be values in a column of the resulting GroupBy object, rather than the indices for the rows. The GroupBy object’s sum method totals the numeric data (the tweets by state). Snippet [31] displays several rows of the result so you can see some of them.
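Note that sum’s behavior on non-numeric columns varies across pandas versions; older versions silently drop them, while newer versions may concatenate the strings or raise an error. Selecting the Tweets column explicitly is a safer equivalent:

# equivalent grouping that restricts the sum to the numeric Tweets column
tweets_counts_by_state = tweet_counts_df.groupby(
    'State', as_index=False)['Tweets'].sum()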
Next, let’s create the map. You may want to adjust the zoom. On our system, the following snippet creates a map in which we initially can see only the continental United States. Remember that Folium maps are interactive, so once the map is displayed, you can scroll to zoom in and out or drag to see different areas, such as Alaska or Hawaii:
In [32]: import folium
In [33]: usmap = folium.Map(location=[39.8283, -98.5795],
...: zoom_start=4, detect_retina=True,
...: tiles='Stamen Toner')
...:
A choropleth shades areas in a map using the values you specify to determine color. Let’s create a choropleth that colors the states by the number of tweets containing their senators’ Twitter handles. First, save Folium’s us-states.json file at
https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json
to the folder containing this example. This file contains a JSON dialect called GeoJSON (Geographic JSON) that describes the boundaries of shapes—in this case, the boundaries of every U.S. state. The choropleth uses this information to shade each state. You can learn more about GeoJSON at http://geojson.org.
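If you prefer to download the file from Python, here is a minimal sketch using only the standard library (same URL as above):

from urllib.request import urlretrieve

# save us-states.json in the current folder, beside this example's code
url = ('https://raw.githubusercontent.com/python-visualization/'
    'folium/master/examples/data/us-states.json')
urlretrieve(url, 'us-states.json')

The following snippets create the choropleth, then add it to the map: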
In [34]: choropleth = folium.Choropleth(
...: geo_data='us-states.json',
...: name='choropleth',
...: data=tweets_counts_by_state,
...: columns=['State', 'Tweets'],
...: key_on='feature.id',
...: fill_color='YlOrRd',
...: fill_opacity=0.7,
...: line_opacity=0.2,
...: legend_name='Tweets by State'
...: ).add_to(usmap)
...:
In [35]: layer = folium.LayerControl().add_to(usmap)
In this case, we used the following arguments:
geo_data='us-states.json'—This is the file containing the GeoJSON that specifies the shapes to color.
name='choropleth'—Folium displays the Choropleth as a layer over the map. This is the name for that layer that will appear in the map’s layer controls, which enable you to hide and show the layers. These controls appear when you click the layers icon on the map.
data=tweets_counts_by_state—This is a pandas DataFrame (or Series) containing the values that determine the Choropleth colors.
columns=['State', 'Tweets']—When the data is a DataFrame, this is a list of two columns representing the keys and the corresponding values used to color the Choropleth.
key_on='feature.id'—This is a variable in the GeoJSON file to which the Choropleth binds the values in the columns argument (see the inspection sketch after this list).
fill_color='YlOrRd'—This is a color map specifying the colors to use to fill in the states. Folium provides 12 colormaps: 'BuGn', 'BuPu', 'GnBu', 'OrRd', 'PuBu', 'PuBuGn', 'PuRd', 'RdPu', 'YlGn', 'YlGnBu', 'YlOrBr' and 'YlOrRd'. You should experiment with these to find the most effective and eye-pleasing ones for your application(s).
fill_opacity=0.7—A value from 0.0 (transparent) to 1.0 (opaque) specifying the transparency of the fill colors displayed in the states.
line_opacity=0.2—A value from 0.0 (transparent) to 1.0 (opaque) specifying the transparency of the lines used to delineate the states.
legend_name='Tweets by State'—At the top of the map, the Choropleth displays a color bar (the legend) indicating the value range represented by the colors. This legend_name text appears below the color bar to indicate what the colors represent.
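To see what key_on='feature.id' binds to, you can inspect the file. A small sketch, assuming us-states.json was saved as described above; in this file, each feature’s id is a two-letter state code, which is why it matches the DataFrame’s State column:

import json

with open('us-states.json') as f:
    states_geo = json.load(f)

# each GeoJSON feature describes one state's boundaries
first = states_geo['features'][0]
print(first['id'])                  # e.g., 'AL', matched against 'State'
print(first['properties']['name'])  # e.g., 'Alabama'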
The complete list of Choropleth keyword arguments is documented at:
http://python-visualization.github.io/folium/modules.html#folium.features.Choropleth
Next, we’ll create Markers for each state. To ensure that the senators are displayed in descending order by the number of tweets in each state’s Marker, let’s sort tweet_counts_df in descending order by the 'Tweets' column:
In [36]: sorted_df = tweet_counts_df.sort_values(
...: by='Tweets', ascending=False)
...:
The loop in the following snippet creates the Markers. First, sorted_df.groupby('State') groups sorted_df by 'State'. A DataFrame’s groupby method maintains the original row order in each group. Within a given group, the senator with the most tweets will be first, because we sorted the senators in descending order by tweet count in snippet [36]:
In [37]: for index, (name, group) in enumerate(
    ...:     sorted_df.groupby('State')):
...: strings = [state_codes[name]] # used to assemble popup text
...:
...: for s in group.itertuples():
...: strings.append(
...: f'{s.Name} ({s.Party}); Tweets: {s.Tweets}')
...:
...: text = '<br>'.join(strings)
...: marker = folium.Marker(
...: (locations[index].latitude, locations[index].longitude),
...: popup=text)
...: marker.add_to(usmap)
...:
...:
We pass the grouped DataFrame to enumerate, so we can get an index for each group, which we’ll use to look up each state’s Location in the locations list. Each group has a name (the state code we grouped by) and a collection of items in that group (the two senators for that state). The loop operates as follows:
We look up the full state name in the state_codes dictionary, then store it in the strings list—we’ll use this list to assemble the Marker’s popup text.
The nested loop walks through the items in the group collection, returning each as a named tuple that contains a given senator’s data. We create a formatted string for the current senator containing the person’s name, party and number of tweets, then append that to the strings list.
The Marker text can use HTML for formatting. We join the strings list’s elements, separating each from the next with an HTML <br> element, which creates a new line in HTML.
We create the Marker. The first argument is the Marker’s location as a tuple containing the latitude and longitude. The popup keyword argument specifies the text to display if the user clicks the Marker.
We add the Marker to the map.
Finally, let’s save the map into an HTML file:
In [38]: usmap.save('SenatorsTweets.html')
Open the HTML file in your web browser to view and interact with the map. Recall that you can drag the map to see Alaska and Hawaii. Clicking a state’s marker displays its popup; for example, the South Carolina marker’s popup shows the state name followed by each senator’s name, party and tweet count.
An exercise at the end of the chapter asks you to use the sentiment-analysis techniques you learned in previous chapters to rate as positive, neutral or negative the sentiment expressed by people who send tweets (“tweeters”) mentioning each senator’s handle.
(Write a Statement) Assuming that atlas_client is a pymongo MongoClient that’s connected to a MongoDB Atlas cluster, write a statement that creates a new database with the name football_players and stores the resulting object in db.
Answer: db = atlas_client.football_players.
(Fill-In) A pymongo Collection object’s ________ method inserts a new document into the Collection.
Answer: insert_one.
(True/False) Once you’ve inserted documents in a Collection, you can immediately text-search their contents.
Answer: False. To perform a text search, you must first create a text index for the collection specifying which document field(s) to search.
(Fill-In) A folium ________ uses GeoJSON to add color to a map.
Answer: Choropleth.