“Torture the data, and it will confess to anything.” – Ronald Coase, Nobel Prize Laureate in Economics
According to Drew Conway, a political science Ph.D. student at New York University and a former member of the intelligence community, the skills of a data scientist fall into three categories:
1. Hacking skills
2. Statistical and math knowledge
3. Substantive expertise
Having hacking skills is important because data tends to live in many locations and in different systems, which can make finding and extracting it challenging.
Data hackers typically have a broad set of technical skills, though they are unlikely to be complete experts in any one of them.
That is a lot to learn, so how can a person pick up all of these skills in a reasonable amount of time? The answer is to choose one comprehensive technology stack and then learn everything in that stack.
One such technology stack is the R-Hadoop stack. R is a free, open-source statistical programming language, originally based on the S programming language, and there are several reasons why a person might choose to start with it for data analysis.
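To give a concrete feel for the language, here is a minimal sketch of an R session. It uses only base R and the built-in mtcars dataset, so nothing needs to be installed:

# Quick look at a built-in dataset
data(mtcars)
summary(mtcars$mpg)            # five-number summary of fuel efficiency

# Fit a simple linear model: miles per gallon as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                   # coefficients, R-squared, p-values

# Plot the data with the fitted regression line overlaid
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red")

A few lines are enough to load data, fit a statistical model, and visualize the result, which is exactly the kind of work R was designed for.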
Hadoop is a free, open-source distributed computing framework, typically used across the big data workflow: modeling, databases, and analysis. Many top companies use Hadoop, including LinkedIn, Facebook, and Twitter. Whenever you hear somebody talking about Hadoop, you will probably also hear about MapReduce, a programming model that lets you solve large-scale data problems with clusters of commodity computers. All of this makes Hadoop a good system for starting out with big data.
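To make the MapReduce idea concrete, here is a minimal sketch of a word count, the classic MapReduce example, written as two small R scripts for Hadoop Streaming. Hadoop Streaming lets any program that reads standard input and writes standard output act as a mapper or reducer; the file names here are illustrative assumptions, not anything Hadoop prescribes.

# mapper.R -- hypothetical mapper: emit "word<TAB>1" for every word
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)

# reducer.R -- hypothetical reducer: Hadoop sorts mapper output by key,
# so all counts for a given word arrive consecutively; sum and emit them
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  parts <- strsplit(line, "\t")[[1]]
  if (!identical(parts[1], current)) {
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    current <- parts[1]; total <- 0
  }
  total <- total + as.integer(parts[2])
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)

Hadoop runs many copies of the mapper in parallel across the cluster, shuffles and sorts their output by key, and then runs the reducers; that is how the same two small scripts scale from one machine to many.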
The R-Hadoop stack gives you the ability to do pretty much everything you would need for data hacking.
You can use Hadoop and R on a Windows computer, but they work more naturally on a Unix-based system. There may be a bit of a learning curve, but what you gain from using a Unix system is well worth it, and it will look great on a resume.
Hadoop and R can cover most cases, but there are situations where you may want to reach for a different tool. For example, Python has libraries (such as NLTK) that make text mining more scalable and easier than it is in R. And if you are interested in creating a web app, Shiny may not be flexible enough, so you may want to go with a more traditional web framework. For the most part, though, you should be able to get by with Hadoop and R. Later we will go into more depth about Python, because it is more commonly used for data science than R.
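To give a sense of what the Shiny option mentioned above can do before it runs out of flexibility, here is a minimal sketch of a Shiny app. It assumes the shiny package is installed, for example with install.packages("shiny"), and uses R's built-in faithful dataset:

library(shiny)

# UI: one slider control and one plot
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: re-draws the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)   # launches the app in a browser

For a simple interactive dashboard like this, Shiny is hard to beat; it is usually only when you need custom authentication, complex routing, or a non-R backend that a traditional web framework becomes the better choice.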
You may be wondering why you should stick with learning only one technology stack. Many people believe in using the right tool for each job, and worry that learning only one stack would leave them stuck in one corner of the data science ecosystem. These are fair points, but focusing on a single stack has its advantages, especially when you are just starting out. First, if you switch training paths and resources frequently, you will waste a lot of time. Second, focusing on a single technology is motivating and useful: once you are good at one thing, you can solve problems faster.
In the references chapter you will find links to help pages for Hadoop, R, and several other technologies we discuss in this book.