Chapter 1: What is Data Science?
Before anything else, this guidebook needs to cover the basics of data science.
Data science, to keep things simple, is the detailed study of the information that flows out of the huge amounts of data a company has gathered and stored.
It involves extracting meaningful insights from raw and usually unstructured data by processing it with analytical, programming, and business skills.
Many companies spend a lot of time collecting data and trying to use it to learn more about their customers, figure out how to release the best product, and gain a competitive edge over others.
While these are all great goals, just gathering the data is not enough to make them happen.
Instead, we need to take that data, which is usually messy and needs some work, and analyze it so that we can act on what it contains.
The Importance of Data Science
In a world that is moving more and more into the digital space, organizations deal with unprecedented amounts of data, both structured and unstructured, on a daily basis.
Evolving technologies are bringing cost savings and smarter storage options to help us store this critical data.
Currently, no matter what industry we look at or what kind of work a company does, there is already a huge need for skilled and knowledgeable data scientists.
They are some of the highest-paid IT professionals right now, mainly because they provide so much value to the companies they work for, and because there is such a shortage of these professionals.
Demand for data scientists exceeds the current supply by about 50 percent, and the gap is likely to keep growing as more people and companies see what value data science can have for them.
So, why is data becoming so important to these businesses?
In reality, data has always been important, but today, because of the growth of the internet and other sources, there is an unprecedented amount of data to work through.
In the past, companies were able to go through their data manually, and maybe use a few business intelligence tools, to learn more about the customer and make smart decisions.
But this is nearly impossible for any company to do now because of the large amount of data it has to deal with on a regular basis.
In the last few years, there has been huge growth in something known as the Internet of Things (IoT), and it is often estimated that about 90 percent of the data in our current world has been generated during that period.
This sounds even more impressive when we find out that roughly 2.5 quintillion bytes of data are generated each day, and the pace is accelerating with the growth of the IoT.
This data comes to us from a lot of different sources, and where you decide to gather it depends on your goals and what you are hoping to accomplish.
Some of the places where we can gather this kind of data include:
- Sensors used in malls and other shopping locations to gather more information about the people who shop there.
- Posts placed on various social media sites.
- Digital videos and pictures that are captured on our phones.
- Purchase transactions that are made through e-commerce.
These are just a few of the places where we can gather the data we need and put it to use with data science.
And as the IoT grows and more data is created each day, we are likely to find even more sources that help us take on our biggest business problems.
This is why we need data science more than ever.
All of this data, gathered from the sources above and more, is known as big data.
Currently, most companies are flooded and a bit overwhelmed by all of the data coming their way.
This is why it is so important for these companies to have a good idea of what to do with the exploding amount of data and how to utilize it to get ahead.
It is not enough to just gather up the data.
Collecting it may seem like a great idea, but if you just gather that data and don’t learn what is inside of it, then you are setting yourself up for trouble.
Once you learn what information is inside that data and what it all means, it becomes much easier to use that information to gain the competitive advantage you are looking for.
Data science helps us get all of this done.
It is designed to make it easier to take in the big picture and use data for our needs.
It encompasses every part of the process of getting the data to work for us: gathering the data, cleaning and organizing it, analyzing it, creating visuals to help us better understand it, and deciding how to use it.
All of this comes together to help us see what is inside the data, and it is all part of the data science process.
Data science works because it brings together many different skills, like statistics, mathematics, and business domain knowledge, and it can help a company in many ways.
Some of the things that data science can do for a company, when it is used properly, include the following:
- Reduce costs.
- Get the company into a new market.
- Tap into a new demographic to increase its reach.
- Gauge the effectiveness of a marketing campaign.
- Launch a new service or a new product.
And this is just the start of the list.
If you are willing to work with data science and learn the steps that come with it, you will find that it can help your business in many different ways and can be one of the best ways to get ahead in your industry.
How Is Data Science Used?
One of the best ways to learn more about data science and how it works is to look at how some of the top players in the industry are already using it.
Many big-name companies already rely on data science to help them reach their customers better, keep waste and costs down, and much more.
The examples that we are going to look at here are Google, Amazon, and Visa.
As you will see with all of these, one of the biggest deciding factors for an organization is which value it considers most important to extract from its data using analytics, and how it would like to present that information.
Let’s take a look at how each of these companies has used data science to get results.
First on the list is Google.
It is one of the biggest companies currently on a hiring spree for trained data scientists.
Google is driven by data science in much of the work that it does, and it also relies on machine learning and artificial intelligence to reach its customers and to provide them with the best products possible.
Data science and good analysis help it get all of this done effectively.
Next on the list is Amazon.
This is a huge company known around the world, one that many of us use on a daily basis.
It is a cloud computing and e-commerce company that relies heavily on data scientists to help it release new products, keep customer information safe, and even provide recommendations on what to purchase next on the site.
It uses data scientists to find out more about the mindset of the customer and to enhance the geographical reach of its cloud and e-commerce businesses, just to name a few of its current business goals.
And then, we need to look at Visa and what it is doing with the help of data science.
As an online financial gateway for countless other companies, Visa completes hundreds of millions of transactions in a single day, far more than most other companies handle.
Because of this large number of transactions, Visa needs data scientists to help it increase revenue, check for fraudulent transactions, and even customize products and services based on the requirements of the customer.
The Lifecycle of Data Science
We are going to go into more detail about the lifecycle of data science as we progress through this guidebook, but first, we can take a moment to see how it applies to our own needs.
Data science follows our data from the gathering stage all the way through to the point where we use that data to make our big business decisions.
A number of steps show up along the way, and being prepared to handle all of them, and all that they entail, is the challenge that comes with relying on data science.
Some of the basic steps found in the data science lifecycle include:
- Figuring out what business question we would like to answer with this process.
- Collecting raw data for use.
- Cleaning and organizing all unstructured data to be used.
- Preprocessing our data.
- Creating a model with the help of machine learning, then training and testing it to ensure accurate results.
- Running our data through the model to reveal the insights and predictions inside.
- Using visuals to help us better understand the complex relationships found in the data we are analyzing.
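To make these steps a little more concrete, here is a minimal sketch in Python, one of the two languages this guidebook focuses on. It is only an illustration under stated assumptions: the records are made up, the business question (how do sales relate to advertising spend?) is hypothetical, and a hand-written straight-line fit stands in for the machine learning model. All of the function names are invented for this example and do not come from any library.

```python
# Business question (hypothetical): how do sales relate to advertising spend?

# Step: collect raw data. These (ad spend, sales) records are made up.
RAW_RECORDS = [
    ("10", "25"), ("20", "45"), ("", "30"), ("30", "65"),
    ("40", "85"), ("50", None), ("60", "125"), ("70", "145"),
]


def clean(records):
    """Step: cleaning. Drop records with a missing or empty field."""
    return [r for r in records if all(v not in (None, "") for v in r)]


def preprocess(records):
    """Step: preprocessing. Convert text fields to numbers."""
    return [(float(x), float(y)) for x, y in records]


def split(rows, test_size=2):
    """Hold out the last `test_size` rows so the model can be tested."""
    return rows[:-test_size], rows[-test_size:]


def fit_line(rows):
    """Step: modelling. Fit y = slope * x + intercept by least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Step: run new data through the model."""
    slope, intercept = model
    return slope * x + intercept


def mean_abs_error(model, rows):
    """Average size of the prediction error on the given rows."""
    return sum(abs(predict(model, x) - y) for x, y in rows) / len(rows)


rows = preprocess(clean(RAW_RECORDS))
train, test = split(rows)
model = fit_line(train)
error = mean_abs_error(model, test)

print("slope and intercept:", model)
print("mean absolute error on held-out rows:", error)
print("predicted sales at spend 50:", predict(model, 50))

# Step: visualize. A very rough text chart of sales per spend level.
for x, y in rows:
    print(f"{x:>5.0f} | " + "#" * int(y // 10))
```

The made-up records lie exactly on a straight line, so the fit is far cleaner than anything real data would give. In a real project, each of these small functions would be replaced by much more involved work, and we would expect to loop back to earlier steps several times.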
While the steps may sound easy enough, they involve some complexity and a lot of back and forth.
The most important thing is to go in without any preconceived notions of what you would like to see, and not to push your own agenda onto the data.
This is the best way to ensure that you actually learn what is inside that data, which makes it easier to make the right decisions for your needs.
The Components of Data Science
Now, we also need to take some time to look at the components of data science.
A few key components come into play in any data science work, and having them in place makes a big difference in how well we can handle the different parts of our own projects.
Some of the key components of data science include:
- The various types of data: The foundation of any data science project is the raw set of data.
There are a lot of different types.
We can work with structured data, which is mostly found in tabular form, and unstructured data, which includes PDF files, emails, videos, and images.
- Programming: You will need some kind of programming language to get the work done.
Data management and data analysis are both done with computer programming.
Python and R are the two most popular programming languages for this work, and they are the ones we will focus on here.
- Statistics and probability: Data is manipulated in different ways in order to extract good information out of it.
The mathematical foundation of data science is probability and statistics.
Without a good knowledge of probability and statistics, there is a higher chance of misinterpreting the data and reaching conclusions that are not correct.
This is a big reason why probability and statistics are so important in data science.
- Machine learning: As someone who works with data science, you will spend at least a little time working with machine learning algorithms on a daily basis.
These include methods of classification and regression.
It is important for a data scientist to know machine learning, since this is the tool needed to draw valuable predictions and insights from the data that is available.
- Big data: In our current world, raw data is what we use to train and test our models, and then to find the best insights and predictions in that data.
Working with big data helps us figure out what important, although hidden, information is found in our raw data.
There are a lot of different tools that we can use not only to find big data but also to process it.
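The point made under statistics and probability, that weak statistics can lead to wrong conclusions, is easy to see with a tiny example. The sketch below uses Python's built-in statistics module on a made-up list of order values in which one large bulk order distorts the average. The numbers are hypothetical and are only here to show the effect.

```python
import statistics

# Hypothetical order values in dollars; the last one is a single bulk order.
order_values = [20, 22, 19, 21, 23, 20, 500]

mean_value = statistics.mean(order_values)      # pulled up by the bulk order
median_value = statistics.median(order_values)  # the middle value when sorted

print(f"mean:   {mean_value:.2f}")
print(f"median: {median_value:.2f}")
```

Someone who only looks at the mean might conclude that a typical customer spends close to 90 dollars, while the median shows that a typical order is closer to 21 dollars. Knowing which summary to trust, and when, is exactly the kind of judgment that a grounding in statistics provides.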
Many companies are learning the value of data science and all that comes with it.
They like the idea that they can take all of the data they have been collecting for a long period of time and put it to use to grow their business and gain the competitive edge they have been looking for.
In the rest of this guidebook, we are going to spend some time focusing on how to work with data science and all of the different parts that come with it.