“Torture the data, and it will confess to anything.” – Ronald Coase, Nobel Prize Laureate in Economics
According to Drew Conway, a political science Ph.D. student at New York University and a former member of the intelligence community, the skills of a data scientist fall into three categories:
1. Hacking skills
2. Statistical and math knowledge
3. Substantive expertise
Having hacking skills is important because data tends to live in many locations and in different systems, which can make finding and extracting it challenging.
Data hackers typically have a broad set of technical skills, though they are unlikely to be complete experts in any one of them.
That is a lot to learn, so how can a person pick up all of these skills in a reasonable amount of time? The answer is to choose one comprehensive technology stack and then learn everything in that stack.
One such technology stack is the R-Hadoop stack. R is a free, open-source statistical programming language, originally based on the S programming language, and there are several reasons why a person might choose to start with it for data analysis.
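To give a concrete feel for the language, here is a minimal sketch of an R session. It uses only base R and the built-in mtcars dataset, so nothing needs to be installed:

# Quick look at a built-in dataset
data(mtcars)
summary(mtcars$mpg)            # five-number summary of fuel efficiency

# Fit a simple linear model: miles per gallon as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)                   # coefficients, R-squared, p-values

# Plot the data with the fitted regression line overlaid
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red")

A few lines are enough to load data, fit a statistical model, and visualize the result, which is exactly the kind of work R was designed for.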
Hadoop is a free, open-source distributed computing framework, typically used across the big data workflow: modeling, databases, and analysis. Many top companies use Hadoop, including LinkedIn, Facebook, and Twitter. Whenever you hear somebody talking about Hadoop, you will probably also hear about MapReduce, a programming model that lets you solve large-scale data problems with clusters of commodity computers. All of this makes Hadoop a good system for starting out with big data.
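To make the MapReduce idea concrete, here is a minimal sketch of a word count, the classic MapReduce example, written as two small R scripts for Hadoop Streaming. Hadoop Streaming lets any program that reads standard input and writes standard output act as a mapper or reducer; the file names here are illustrative assumptions, not anything Hadoop prescribes.

# mapper.R -- hypothetical mapper: emit "word<TAB>1" for every word
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  for (w in words[words != ""]) cat(w, "\t1\n", sep = "")
}
close(con)

# reducer.R -- hypothetical reducer: Hadoop sorts mapper output by key,
# so all counts for a given word arrive consecutively; sum and emit them
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  parts <- strsplit(line, "\t")[[1]]
  if (!identical(parts[1], current)) {
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    current <- parts[1]; total <- 0
  }
  total <- total + as.integer(parts[2])
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)

Hadoop runs many copies of the mapper in parallel across the cluster, shuffles and sorts their output by key, and then runs the reducers; that is how the same two small scripts scale from one machine to many.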
The R-Hadoop stack gives you the ability to do pretty much everything you would need for data hacking.
You can use Hadoop and R on a Windows computer, but they work more naturally on a Unix-based system. There may be a bit of a learning curve, but what you gain from using a Unix system is well worth it, and it will look great on a resume.
Hadoop and R can cover most cases, but there are situations where you may want to reach for a different tool. For example, Python has libraries (such as NLTK) that make text mining more scalable and easier than it is in R. And if you are interested in creating a web app, Shiny may not be flexible enough, so you may want to go with a more traditional web framework. For the most part, though, you should be able to get by with Hadoop and R. Later we will go into more depth about Python, because it is more commonly used for data science than R.
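To give a sense of what the Shiny option mentioned above can do before it runs out of flexibility, here is a minimal sketch of a Shiny app. It assumes the shiny package is installed, for example with install.packages("shiny"), and uses R's built-in faithful dataset:

library(shiny)

# UI: one slider control and one plot
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: re-draws the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)   # launches the app in a browser

For a simple interactive dashboard like this, Shiny is hard to beat; it is usually only when you need custom authentication, complex routing, or a non-R backend that a traditional web framework becomes the better choice.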
You may be wondering why you should stick with learning only one technology stack. Many people believe in using the right tool for each job, and worry that learning only one stack would leave them stuck in one corner of the data science ecosystem. These are fair points, but focusing on a single stack has its advantages, especially when you are just starting out. First, if you switch training paths and resources frequently, you will waste a lot of time. Second, focusing on a single technology is motivating and useful: once you are good at one thing, you can solve problems faster.
In the references chapter you will find links to help pages for Hadoop, R, and several other technologies we discuss in this book.