Chapter 13
Data Analysis with Python
In 2001, Gartner defined big data as “data that contains greater variety, arriving in increasing volumes and with ever-higher velocity.” This led to the formulation of the “three V’s.” Big data refers to an avalanche of structured and unstructured data that is endlessly flooding in from an endless variety of data sources. These data sets are too large to be analyzed with traditional analytical tools and technologies, but they have a plethora of valuable insights hiding underneath.
The “Vs” of Big data
Volume – To be classified as big data, the size of the given data set must be substantially larger than that of traditional data sets. These data sets are primarily composed of unstructured data with limited structured and semi-structured data. The unstructured data, or the data with unknown value, can be collected from input sources such as web pages, search history, mobile applications, and social media platforms. The volume of data acquired by a company is usually proportional to the company’s size and customer base.
Velocity – The speed at which data can be gathered and acted upon refers to the velocity of big data. Companies are increasingly using a combination of on-premise and cloud-based servers to increase the speed of their data collection. Modern-day “smart products and devices” require real-time access to consumer data in order to provide a more engaging and enhanced user experience.
Variety – Traditionally, a data set would contain a majority of structured data with a low volume of unstructured and semi-structured data, but the advent of big data has given rise to new informal data types such as video, text, and audio, which require sophisticated tools and technologies to clean and process them in order to extract meaningful insights.
Veracity – Another “V” that must be considered for big data analysis is veracity. This refers to the “trustworthiness or quality” of the data. For example, content from social media platforms like Facebook and Twitter, with blogs and posts full of hashtags, acronyms, and typing errors, can significantly reduce the reliability and accuracy of a data set.
Value – Data has evolved into a currency of its own, with intrinsic value. Just like traditional monetary currencies, the ultimate value of big data is directly proportional to the insight gathered from it.
History of Big Data
The origin of large volumes of data can be traced back to the 1960s and 1970s, when the Third Industrial Revolution had just started to kick in, relational databases were being developed, and the construction of data centers had begun. But the concept of big data has only recently taken center stage, primarily since the arrival of free search engines like Google and Yahoo, free online entertainment services like YouTube, and social media platforms like Facebook. In 2005, businesses started to recognize the incredible amount of user data being generated through these platforms and services, and in the same year an open-source framework called “Hadoop” was developed to gather and analyze the astronomical data dumps available to companies. During the same period, non-relational distributed databases, known as “NoSQL” databases, started to gain popularity due to their ability to store and extract unstructured data. “Hadoop” made it possible for companies to work with big data with relative ease and at a relatively low cost.
Today, with the rise of cutting-edge technology, not only humans but also machines are generating data. Smart device technologies like the “Internet of Things” (IoT) and the “Internet of Systems” (IoS) have caused the volume of big data to skyrocket. Our everyday household objects and smart devices are connected to the Internet, able to track and record our usage patterns as well as our interactions with these products, and feed all of this data directly into the big data pool. The advent of machine learning technology has further increased the volume of data generated on a daily basis. It is estimated that by 2020, “1.7 MB of data will be generated per second per person.” As big data continues to grow, its usability still has many horizons to cross.
Importance of big data
To gain reliable and trustworthy information from a data set, it is essential to have a complete data set, which has been made possible with the use of big data technology. The more data we have, the more information and details can be extracted from it, giving a 360-degree view of a problem and its underlying solutions. The future of big data is auspicious. Here are some examples of the use of big data:
Product development – Large and small e-commerce businesses are increasingly relying upon big data to understand customer demands and expectations. Companies can develop predictive models to launch new products and services by taking the primary characteristics of their past and existing products and services and generating a model describing the relationship of those characteristics with the commercial success of those products and services. For example, Procter & Gamble, a leading fast-moving consumer goods company, extensively uses big data gathered from social media websites, test markets, and focus groups in preparation for new product launches.
Predictive maintenance – To proactively predict potential mechanical and equipment failures, a large volume of unstructured data such as error messages, log entries, and average machine temperatures must be analyzed along with available structured data such as the make and model of the equipment and its year of manufacture. By examining this big data set using the required analytical tools, companies can extend the shelf life of their equipment by preparing for scheduled maintenance ahead of time and predicting future occurrences of potential mechanical failures.
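As a minimal sketch of this idea in Python (the subject of this chapter), the example below trains a simple classifier on a hypothetical equipment log. The file name, column names, and the 30-day failure label are assumptions made for illustration, not a prescribed setup:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical log: one row of readings per machine per day.
logs = pd.read_csv("equipment_logs.csv")

# Structured features (make/model/year) plus numeric sensor readings.
features = pd.get_dummies(
    logs[["make", "model", "year", "avg_temp", "error_count"]])
target = logs["failed_within_30_days"]  # assumed label: 1 if failure followed

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Machines predicted to fail can be scheduled for maintenance ahead of time.
print(classification_report(y_test, model.predict(X_test)))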
Customer experience – The smart customer is aware of all of the technological advancements and is loyal only to the most engaging and enhanced user experience available. This has triggered a race among companies to provide unique customer experiences by analyzing the data gathered from customers’ interactions with the company’s products and services. Providing personalized recommendations and offers helps reduce the customer churn rate and effectively converts prospective leads into paying customers.
Fraud and compliance – Big data helps in identifying data patterns and assessing historical trends from previous fraudulent transactions in order to detect and prevent potentially fraudulent transactions effectively. Banks, financial institutions, and online payment services like PayPal are continually monitoring and gathering customer transaction data to prevent fraud.
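One common pattern-based approach is anomaly detection. The sketch below uses scikit-learn’s IsolationForest on synthetic transaction data; the amounts and hours are invented for illustration, and this is not how any particular bank or payment service implements its checks:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic transactions: [amount, hour_of_day]; values are made up.
normal = np.column_stack([rng.normal(60, 20, 1000), rng.normal(14, 3, 1000)])
fraud = np.column_stack([rng.normal(900, 50, 10), rng.normal(3, 1, 10)])
transactions = np.vstack([normal, fraud])

# Isolation Forest isolates points that deviate from historical patterns.
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(transactions)  # -1 marks a suspected anomaly

print("flagged:", (labels == -1).sum(), "of", len(transactions), "transactions")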
Operational efficiency – With the help of big data predictive analysis, companies can learn about and anticipate future demand and product trends by analyzing production capacity, customer feedback, and data about top-selling items. This will result in improved decision-making and products that are in line with current market trends.
Machine learning – For a machine to be able to learn and train on its own, it requires a humongous volume of data, i.e., big data. A robust training set containing structured, semi-structured, and unstructured data will help the machine develop a multidimensional view of the real world and of the problem it is engineered to resolve.
Drive innovation – By studying and understanding the relationships between humans, their electronic devices, and the manufacturers of these devices, companies can develop improved and innovative products by examining current product trends and meeting customer expectations.
“The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to
find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making.”
- SAS
The functioning of big data
There are three important actions required to gain insights from big data:
Integration – Traditional data integration methods such as ETL (extract, transform, load) are incapable of collating data from the wide variety of unrelated sources and applications that are at the heart of big data. Advanced tools and technologies are required to analyze big data sets that are exponentially larger than traditional data sets. By integrating big data from these disparate sources, companies are able to analyze and extract valuable insights to grow and maintain their businesses.
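At a small, single-machine scale, the ETL idea itself looks like the Python sketch below; the file names and column names are hypothetical, and a genuinely big data set would call for a distributed framework such as Hadoop or Spark rather than pandas on one machine:

import sqlite3
import pandas as pd

# Extract: two hypothetical sources with different shapes and field names.
web_events = pd.read_json("web_events.json")
store_sales = pd.read_csv("store_sales.csv")

# Transform: align the disparate schemas on a common set of columns.
web_events = web_events.rename(columns={"user": "customer_id", "ts": "date"})
web_events["date"] = pd.to_datetime(web_events["date"])
store_sales["date"] = pd.to_datetime(store_sales["date"])
combined = pd.concat([web_events[["customer_id", "date", "amount"]],
                      store_sales[["customer_id", "date", "amount"]]])

# Load: write the integrated data to a single queryable store.
with sqlite3.connect("integrated.db") as conn:
    combined.to_sql("transactions", conn, if_exists="replace", index=False)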
Management – Big data management can be defined as “the organization, administration, and governance of large volumes of both structured and unstructured data.” Big data requires efficient and cheap storage, which can be accomplished using servers that are on-premise, cloud-based, or a combination of both. Companies are able to seamlessly access required data from anywhere in the world and then process it using the required processing engines on an as-needed basis. The goal is to make sure the quality of the data is high and that it can be accessed easily by the required tools and applications. Big data is gathered from all kinds of data sources, including social media platforms, search engine history, and call logs. Big data usually contains large sets of unstructured and semi-structured data, which are stored in a variety of formats. To be able to process and store this complicated data, companies require more powerful and advanced data management software, beyond the traditional relational databases and data warehouse platforms.
Analysis – Once the big data has been collected and is easily accessible, it can be analyzed using advanced analytical tools and technologies. This analysis will provide valuable insight and actionable information. Big data can be explored to make new discoveries and develop data models using artificial intelligence and machine learning algorithms.
Big Data Analytics
The terms big data and big data analytics are often used interchangeably, owing to the fact that the inherent purpose of big data is to be analyzed. “Big data analytics” can be defined as a set of qualitative and quantitative methods that can be employed to examine large amounts of unstructured, structured, and semi-structured data to discover data patterns and valuable hidden insights. Big data analytics is the science of analyzing big data to collect metrics, key performance indicators, and data trends that can easily be lost in the flood of raw data, by using machine learning algorithms and automated analytical techniques. The different steps involved in “big data analysis” are:
Gathering Data Requirements – It is important to understand what information or data needs to be gathered to meet the business objectives and goals. Data organization is also critical for efficient and accurate data analysis. Some of the categories in which the data can be organized are gender, age, demographics, location, ethnicity, and income. A decision must also be made on the required data types (qualitative and quantitative) and data values (which can be numerical or alphanumeric) to be used for the analysis.
Gathering Data – Raw data can be collected from disparate sources such as social media platforms, computers, cameras, other software applications, company websites, and even third-party data providers. Big data analysis inherently requires large volumes of data, the majority of which is unstructured, with a limited amount of structured and semi-structured data.
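To make these first two steps concrete, the Python sketch below gathers two hypothetical raw files and organizes the combined records into the kinds of categories mentioned above (age bands and gender). The file names and column names are assumptions for illustration only:

import pandas as pd

# Hypothetical raw data gathered from two different sources.
survey = pd.read_csv("customer_survey.csv")        # structured data
app_logs = pd.read_json("mobile_app_events.json")  # semi-structured data

# Combine the sources on a shared key (assumed to be "customer_id").
customers = survey.merge(app_logs, on="customer_id", how="left")

# Organize the records into analysis categories such as age bands.
customers["age_group"] = pd.cut(customers["age"],
                                bins=[0, 25, 40, 60, 120],
                                labels=["<25", "25-39", "40-59", "60+"])
print(customers.groupby(["age_group", "gender"], observed=True).size())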