Data mining @social networks

We have traveled quite a distance so far through the chapters of this book, understanding various concepts and learning some amazing algorithms. We have even worked on projects that have applications in our daily lives. In short, we have done data mining without using the term explicitly. Let us now take this opportunity to formally define data mining.

Mining, in the classical sense of the word, refers to the extraction of useful minerals from the Earth (such as coal mining). Put in the context of the information age, mining refers to the extraction of useful information from large pools of data. Thus, if we look carefully, Knowledge Mining or Knowledge Discovery from Data (KDD) seems to be a better representation than the term data mining. As is the case with many keywords, short and sweet catches the attention. Thus, you may find in many places the terms Knowledge Discovery from Data and data mining being used interchangeably, which is rightly so. The process of data mining, analogous to the mining of minerals, involves the following steps:

Data cleansing to remove noise and unwanted data
Data transformation to transform the data into relevant form for analysis
Data/pattern evaluation to uncover interesting insights
Data presentation to visualize knowledge in a useful form

Note

Data mining isn't about using a search engine to get information, say regarding snakes. Rather it is about uncovering hidden insights like snakes are the only creatures found on every continent except Antarctica!

If we take a minute to understand the preceding steps, we can see that we used exactly the same process across our projects. Please keep in mind that we have simply formalized and presented the process we have been following across chapters and not missed or modified any step done in previous chapters.

Mining social network data

Now that we have formally defined data mining and seen the steps involved in transforming data to knowledge, let us focus on data from social networks. While data mining methodology is independent of the source of data, there are certain things to be kept in mind which could lead to better processing and improved results.

Like the mining of any other type of data, domain knowledge is definitely a plus for mining social network data. Even though social network analysis is an interdisciplinary subject (as discussed in the previous section), it primarily involves the analysis of data pertaining to users or entities and their interactions.

In previous chapters, we have seen all sorts of data from e-commerce platforms to banks to data related to the characteristics of flowers. The data we have seen has had different attributes and characteristics. But if we look carefully, the data was a result of some sort of measurement or event capture.

Coming onto the social network's domain, the playground is a little, if not completely different. Unlike what we have seen so far, data from social media platforms is extremely dynamic. When we say dynamic, we refer to the actual content on a data point and not its structure. The data point itself may (or may not) be structured, but the content itself is not.

Let us be specific and talk about data contained in a tweet. A sample tweet looks something like this:

Image source: https://twitter.com/POTUS/status/680464195993911296

A tweet, as we all know, is a 140 character message. Since the message is generated by a user (usually), the actual message may be of a different length, language, and or it may contain images, links, videos, and more. Thus, a tweet is a structured data point which contains the handle of the user (@POTUS), the name of the user (President Obama), the message (From the Obama family...), along with information related to when was it tweeted (26 Dec 2015), the number of likes, and the number of retweets. A tweet may also contain hashtags, hyperlinks, images, and videos embedded within the message. As we will see in the coming sections, a tweet contains tons of metadata (data about the data) apart from the attributes discussed preceding. Similarly, data from other social networks also contains a lot more information than what usually meets the eye.

This much information from a single tweet coupled with millions of users tweeting frantically every second across the globe presents a huge amount of data with interesting patterns waiting to be discovered.

In its true sense, Twitter's data (and of social networks in general) represents the 3 Vs (Volume, Variety, and Velocity) of big data very well.

Note

143,199 tweets per second is a record achieved during the airing of the film Castle in the Sky in Japan on August 3, 2013. The average tweets per second is usually around 5700; the record multiplied it 25 times! Read more about it on the Twitter blog: https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

Thus, the mining of data from a social network involves understanding the structure of the data point, the underlying philosophy or use of the social network (Twitter is used for quick exchange of information, while LinkedIn is used for professional networking), the velocity and volume of the data being generated, along with the thinking cap of a data scientist.

Towards the end of the chapter, we will also touch upon the challenges presented by social networks to the usual mining methodology.