Chapter Four: Skills Required for Data Science
If you want to work in data science or become a data scientist, you need a working knowledge of the topics covered in this chapter. It lists the technical and non-technical skills you should develop to improve in this field.
Technical Skills
As a data scientist, you need technical skills that support statistical analysis. You need to learn how to work with different frameworks and software to mine, collect, process, collate, analyze, interpret and visualize large volumes of data, and you need programming skills to carry out these activities. One way to build this foundation is through formal education. Many data scientists have a Ph.D. or master's degree in engineering, statistics or computer science, which gives them the grounding to connect the technical ideas at the core of the practice. Many schools offer such programs to help people pursue data science.
If you do not want to go through hardcore courses, you can look at options, such as:
These programs help you develop a basic understanding of core data science subjects. They also go beyond textbook learning: you are given real-world scenarios and asked to develop models that assess and predict future events. The following are some skills you need to develop.
Understanding Data
Data science is about working with and understanding different types of data. You need to understand and love working with data. The following are some questions you can answer to help you understand whether you love data or not:
The most important question you need to answer is whether you love working with data or not. If yes, you need to obtain certifications, so you develop the skills to become a data scientist.
Algorithms
Algorithms are sets of instructions that you write to tell a machine how to perform specific functions and tasks. Let us write an algorithm that instructs a computer to add two numbers.
  1. Identify two variables and declare them to the machine
  2. Initialize the variables
  3. Ask the user to assign a value to each of these variables
  4. Declare another variable to hold the sum of the first two variables
  5. Calculate the sum of the first two variables and assign that value to the third variable
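The five steps above can be sketched in R roughly as follows. This is only a minimal illustration: the function name add_two_numbers and the sample values are made up for this example, and the values are assigned in the script instead of being typed in by a user so that the snippet runs without interaction.

```r
# Hypothetical helper (not from the text): add two numbers
add_two_numbers <- function(first, second) {
  # Steps 4 and 5: use a third variable to hold the sum, then return it
  total <- first + second
  return(total)
}

# Steps 1 to 3: declare two variables and give them values
# (sample values chosen for illustration only)
a <- 12
b <- 30
print(add_two_numbers(a, b))
```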
You can also use algorithms when you solve puzzles on paper. As a data scientist, you need to understand what algorithms are and how a machine interprets them, because you rely on algorithms to analyze data. You also need to learn how to design algorithms that perform the functions your analysis requires. Suppose you key ten numbers into the system and leave it to the machine to identify the third largest number in the set. To do this, you need to write the logic and develop an algorithm that finds that number.
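One possible way to express the third-largest-number logic in R is to sort the numbers from largest to smallest and pick the third element. The sketch below assumes that repeated values count as separate entries; the function name third_largest and the ten sample numbers are invented for illustration.

```r
# Hypothetical helper: return the third largest value in a numeric vector
third_largest <- function(numbers) {
  if (length(numbers) < 3) {
    stop("need at least three numbers")
  }
  # Sort from largest to smallest, then take the third element
  sorted <- sort(numbers, decreasing = TRUE)
  return(sorted[3])
}

# Ten sample numbers, standing in for values a user might key in
values <- c(14, 3, 27, 8, 19, 42, 5, 31, 11, 23)
print(third_largest(values))
```

Sorting is not the fastest approach for very long lists, but it keeps the logic easy to follow for a small set like this one.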
Programming
It is important to learn programming languages such as Java, C++, R, Python, Perl and SQL. The most commonly used are R and Python. With them you can collect, clean, process, organize and analyze the information in a data set, including unstructured data.
Examples
Example 1
This example deals only with variables and constants: a program that prints Hello World
> # We can use the print() function
> print("Hello World!")
[1] "Hello World!"
> # Quotes can be suppressed in the output
> print("Hello World!", quote = FALSE)
[1] Hello World!
> # If there is more than 1 item, we can concatenate using paste()
> print(paste("How","are","you?"))
[1] "How are you?"
In the program above, we used print(), a built-in function, to print the string Hello World! The quotes that you see are printed by default. To suppress them, pass the argument quote = FALSE. If there is more than one item, you can use paste() to concatenate the strings before printing, or cat() to print several items directly.
Example 2
We can add the elements of the vector by the function sum()
> sum(2,7,5)
[1] 14
> x <- c(2, NA, 3, 1, 4)
> x
[1]  2 NA  3  1  4
> sum(x)    # if any element is NA or NaN, result is NA or NaN
[1] NA
> sum(x, na.rm=TRUE)    # this way we can ignore NA and NaN values
[1] 10
> mean(x, na.rm=TRUE)
[1] 2.5
> prod(x, na.rm=TRUE)
[1] 24
When a vector contains NA (not available) or NaN (not a number), functions such as sum(), mean() and prod() return NA or NaN, respectively, unless you set na.rm = TRUE.
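The difference between NA and NaN can be checked with the built-in functions is.na() and is.nan(). The short sketch below uses two sample vectors made up for this illustration (with_na and with_nan).

```r
# Sample vectors invented for this illustration
with_na  <- c(2, NA, 3, 1, 4)
with_nan <- c(2, NaN, 3, 1, 4)

print(sum(with_na))                 # a missing value makes the sum NA
print(sum(with_nan))                # a NaN element makes the sum NaN
print(is.na(with_na))               # TRUE only at the missing position
print(is.nan(with_na))              # all FALSE: NA is not NaN
print(sum(with_nan, na.rm = TRUE))  # na.rm = TRUE also drops NaN
```

Note that is.na() also returns TRUE for NaN, while is.nan() returns TRUE only for NaN.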
Example 3
This example will deal with an interactive screen, i.e. take inputs from the user.
my.name <- readline(prompt = "Enter name: ")
my.age <- readline(prompt = "Enter age: ")
# convert character into integer
my.age <- as.integer(my.age)
print(paste("Hi,", my.name, "next year you will be", my.age + 1, "years old."))
Output:
Enter name: Mary
Enter age: 17
[1] "Hi, Mary next year you will be 18 years old."
As you can see, we have used the function readline() to get input from the user.
Here, you can see that you can display an appropriate message for the user with the prompt argument.
In the above example, you convert the input age, which is a character vector, into an integer with the function as.integer().
This is necessary for doing further calculations.
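If the user types something that is not a number, as.integer() returns NA and issues a warning, so the calculation would also give NA. One way to guard against this is sketched below. The helper next_year_message and its wording are hypothetical, and the typed text is passed in as an argument so the snippet runs without interaction.

```r
# Hypothetical helper: build the greeting from a name and the age as typed text
next_year_message <- function(name, age_text) {
  # as.integer() yields NA (with a warning) when the text is not a number
  age <- suppressWarnings(as.integer(age_text))
  if (is.na(age)) {
    return(paste("Sorry,", name, "that does not look like a valid age."))
  }
  paste("Hi,", name, "next year you will be", age + 1, "years old.")
}

print(next_year_message("Mary", "17"))
print(next_year_message("Mary", "seventeen"))
```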
Example 4
In this example, we will check whether a year is a leap year by taking input from the user.
# Program to check if the input year is a leap year or not
year <- as.integer(readline(prompt = "Enter a year: "))
if ((year %% 4) == 0) {
  if ((year %% 100) == 0) {
    if ((year %% 400) == 0) {
      print(paste(year, "is a leap year"))
    } else {
      print(paste(year, "is not a leap year"))
    }
  } else {
    print(paste(year, "is a leap year"))
  }
} else {
  print(paste(year, "is not a leap year"))
}
Output 1:
Enter a year: 1900
[1] "1900 is not a leap year"
Output 2:
Enter a year: 2000
[1] "2000 is a leap year"
Here we have used the logic that a year is a leap year if it is exactly divisible by 4, except for century years (those ending in 00). A century year is a leap year only if it is exactly divisible by 400.
Nested if else is used to implement the logic in the above program.
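The same rule can also be collapsed into a single logical expression wrapped in a function, which makes it easy to check several years at once. This is an alternative sketch, not the program shown above; the function name is_leap_year is chosen here for illustration.

```r
# Hypothetical helper: TRUE if the year is a leap year, FALSE otherwise
is_leap_year <- function(year) {
  # divisible by 4 but not a century year, or divisible by 400
  (year %% 4 == 0 && year %% 100 != 0) || (year %% 400 == 0)
}

# Check a few sample years
for (y in c(1900, 2000, 2023, 2024)) {
  print(paste(y, if (is_leap_year(y)) "is a leap year" else "is not a leap year"))
}
```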
Example 5
In this example, we will find the H.C.F. (highest common factor) of two numbers.
# Program to find the H.C.F. of two input numbers
# define a function
hcf <- function(x, y) {
  # choose the smaller number
  if (x > y) {
    smaller <- y
  } else {
    smaller <- x
  }
  # keep the largest i that divides both x and y
  for (i in 1:smaller) {
    if ((x %% i == 0) && (y %% i == 0)) {
      result <- i
    }
  }
  return(result)
}
# take input from the user
num1 <- as.integer(readline(prompt = "Enter first number: "))
num2 <- as.integer(readline(prompt = "Enter second number: "))
print(paste("The H.C.F. of", num1, "and", num2, "is", hcf(num1, num2)))
Output:
Enter first number: 72
Enter second number: 120
[1] "The H.C.F. of 72 and 120 is 24"
This program asks the user to input two integers and then passes them to a function that returns the H.C.F.
The function first determines the smaller of the two input numbers, since the H.C.F. can only be less than or equal to the smaller of the two.
We then use a 'for' loop to go from 1 to that smaller number.
In each iteration we check whether the current number divides both input numbers exactly.
If it does, we store that number as the H.C.F. When the loop completes, we end up with the largest number that divides both numbers exactly.
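The loop above tries every candidate divisor in turn. A different and usually faster technique is the Euclidean algorithm, which repeatedly replaces the pair of numbers with the smaller number and the remainder of dividing the larger by the smaller. The sketch below is an alternative to the program in this example, not a restatement of it; the name gcd_euclid is made up, and non-negative whole-number inputs are assumed.

```r
# Hypothetical helper: H.C.F. using the Euclidean algorithm
# Assumes x and y are non-negative whole numbers
gcd_euclid <- function(x, y) {
  while (y != 0) {
    remainder <- x %% y  # what is left after dividing x by y
    x <- y
    y <- remainder
  }
  return(x)
}

print(gcd_euclid(72, 120))
```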
Example 6
In this example, we will show you how to develop a calculator of your own.
# Program makes a simple calculator that can add, subtract, multiply and divide using functions
add <- function(x, y) {
  return(x + y)
}
subtract <- function(x, y) {
  return(x - y)
}
multiply <- function(x, y) {
  return(x * y)
}
divide <- function(x, y) {
  return(x / y)
}
# take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice <- as.integer(readline(prompt = "Enter choice[1/2/3/4]: "))
num1 <- as.integer(readline(prompt = "Enter first number: "))
num2 <- as.integer(readline(prompt = "Enter second number: "))
operator <- switch(choice, "+", "-", "*", "/")
result <- switch(choice, add(num1, num2), subtract(num1, num2), multiply(num1, num2), divide(num1, num2))
print(paste(num1, operator, num2, "=", result))
Output:
[1] "Select operation."
[1] "1.Add"
[1] "2.Subtract"
[1] "3.Multiply"
[1] "4.Divide"
Enter choice[1/2/3/4]: 4
Enter first number: 20
Enter second number: 4
[1] "20 / 4 = 5"
In this code, we first ask the user which operation they want to carry out. The user is then asked to type in two numbers, and 'switch' branching is used to call the matching function.
The functions add(), subtract(), multiply() and divide() are all user-defined functions.
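The calculator does not check its inputs. A choice outside 1 to 4 makes switch() return nothing, and in R dividing by zero gives Inf instead of an error. One way to add these checks is sketched below; the function calculate and its error messages are hypothetical, and the values are passed as arguments so the snippet runs without interaction.

```r
# Hypothetical helper: validate the inputs, then carry out the chosen operation
calculate <- function(choice, x, y) {
  if (!(choice %in% 1:4)) {
    stop("choice must be 1, 2, 3 or 4")
  }
  if (choice == 4 && y == 0) {
    stop("cannot divide by zero")
  }
  # switch() picks the expression at position 'choice'
  switch(choice, x + y, x - y, x * y, x / y)
}

print(calculate(4, 20, 4))
```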
Analytical Tools
Every data scientist needs to learn how to use different analytical tools and understand the use of Software as a Service (SaaS). This understanding will help you obtain some information from the data you have cleaned and processed using different programming languages and tools, such as Hadoop, R, SAS, Spark, Hive and Pig. Different platforms allow you to improve your skills, and you can obtain certifications that will help you move ahead in your career.
Using Unstructured Data
Data scientists need to collect, process, store, manage, clean, understand and analyze the data collected from various sources. Most of these data are not structured, and they can be collected in different forms, such as images, videos, text, emails and other forms. For instance, if you work with a credit company, you need to learn how to use historical data and patterns to assist the company in identifying the customers they can trust. You also need to learn how to collect, process and analyze the data.
Non-Technical Skills
The following are some non-technical skills you need to develop if you want to be a data scientist. These are personal skills, and they are very different from qualifications and certifications.
Strong Business Acumen
As a data scientist, you need strong business acumen. You need to understand the elements that matter to the business model. Otherwise, you cannot apply the right tools and skills when you analyze the data set. Only when you develop this understanding can you identify the problems the business faces and the solutions that help it make informed decisions, so that it can sustain itself and grow. It also becomes extremely hard to help an organization explore opportunities if you do not know how to code.
Communication Skills
Data scientists often understand the data better than anybody else in the organization. If you want to succeed as a data scientist, you need to learn to communicate your decisions and analysis to the stakeholders, so the organization can make informed decisions. This matters most when the people you report to are non-technical, so develop the communication skills you need to explain your findings to them.
Intuition
Intuition is an important non-technical skill you need to develop as a data scientist. You may already form an impression of the data when you first look at it, but you may not yet be able to perceive the hidden patterns in the data set. You need to know where to look and what to look at, so that you add more value to the analysis.
These are the skills that make you efficient in your work. Do not worry about cramming too much information into a short time. You learn and improve as you gain experience.