Connected is the word that describes life in the 21st century. Though various factors contribute to this connectedness, one has played a pivotal role: the Web. The Web, which has made distance an irrelevant metric and blurred socio-economic boundaries, is a world in itself, and we are all a part of it. The Web, or the Internet more broadly, has been central to the data-driven revolution. As we have seen in previous chapters, for most modern-day problems it is the Web/Internet (henceforth used interchangeably) that acts as the source of data. Be it e-commerce platforms or the financial domain, the Internet provides us with huge amounts of data every second. There is another ocean of data within this virtual world, one that touches our lives at a very personal level. Social networks, or social media, are a behemoth of information and the topic of this chapter.

In the previous chapter, we covered the financial domain, where we analyzed and predicted credit risk for customers of a certain bank. We now shift gears and move into the realm of social media and see how machine learning and R empower us to uncover insights from this ocean of data.

In this chapter, we will cover the following topics:

We all use social networks day in and day out. There are numerous social networks catering to all sorts of ideologies and philosophies, but Facebook and Twitter (along with a couple of others) have become synonymous with the term social network itself. These two social networks enjoy popularity not only because of their uniqueness and quality of service, but also because they let us interact in a very intuitive way. Like the ideas behind the recommendation engines used in e-commerce websites (see Chapter 4, Building a Product Recommendation System), social networks existed long before Facebook, Twitter, or even the Internet.

Social networks have interested scientists and mathematicians alike. Their study is an interdisciplinary topic that spans, but is not limited to, sociology, psychology, biology, economics, communication studies, and information science. Various theories have been developed to analyze social networks and their impact on human lives through factors influencing economics, demographics, health, language, literacy, crime, and more.

Studies done as early as the late 1800s form the basis of what we today refer to as social networks. A social network, as the name itself says, is a network of connections between nodes or entities, which represent humans and the elements affecting social life. More formally, it is a network depicting relationships and interactions. Hence, it is not surprising to see concepts and algorithms from graph theory being employed to understand social networks. Where the 19th and 20th centuries were limited to theoretical models and painstaking social experiments, the 21st century's technology has opened the doors for these theories to be tested, fine-tuned, and modeled to help understand the dynamics of social interactions. Though the testing of these theories by some social networks (in so-called social experiments) has been caught up in controversy, such topics are beyond the scope of this book. We shall limit ourselves to the algorithmic/data science space and leave the controversies for the experts to discuss.

Before we jump into the specifics, let us understand the reason behind choosing Twitter as our point of analysis for this and the upcoming chapter. Let us begin with what Twitter is and why it is so popular with end users and data scientists alike.

Twitter, as we all know, is a social network/micro-blogging service that enables its users to send and receive tweets of a maximum of 140 characters. But what makes Twitter so popular is the way it caters to basic human instincts. We humans are curious creatures with an incessant need to be heard. It is important for us to have someone or some place to voice our opinions. We love to share our experiences, feats, failures, and ideas. At some level or other, we also want to know what our peers are up to, what's keeping celebrities busy, or simply what's in the news. Twitter addresses just that.

Multiple social networks existed long before Twitter came into existence, so Twitter did not succeed by replacing some other service. In our view, it was the way Twitter organized information and its users that clicked. Its unique Follow model of relationships caters to our curiosity, while its short, free, and high-speed communication platform enables users to speak out and be heard globally. By allowing users to follow a person or an entity of interest, it lets us keep up with their latest happenings without requiring the other user to follow us back. The Follow model makes Twitter's relationships more of an interest graph than the friendship model usually found in social networks such as Facebook.

Twitter is known and used across the globe for the super-fast spread of information (and rumors). It has been used innovatively in circumstances unimaginable before, such as finding people in the aftermath of natural calamities like earthquakes or typhoons. It has been used to spread information so far and so deep that it goes viral. The asymmetric relationships and high-speed information exchange help make Twitter such a dynamic entity. If we closely analyze and study the data and dynamics of this social network, we can uncover many insights. Hence, it is the topic for this chapter.

Let's apply some data science to tweets using #RMachineLearningByExample!