Data analytics projects with R can become very complex, especially when we work on an analysis over a longer period of time. To keep track of our changes and our progress in the project, it is important to use a version control system. The best known of these is Git. It lets us record and annotate every change we make to our code. This is also very helpful when we collaborate with other people, or when others have to read our code later on and understand how it was developed. Git describes itself as a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. You can read about it at https://git-scm.com/.
Git repositories can be created on servers or on our local machine. Their distributed nature lets you sync commits with other machines. An alternative to hosting your own repositories is to use a hosted version control service. To do this, we can create an account on platforms such as GitHub, Bitbucket, or GitLab. Most of them offer free accounts, and if we take the example of GitHub, everybody can create public repositories with an unlimited number of collaborators. Public means that everybody can see our code on the website and use it. If we want to have private repositories, we have to buy a plan.
First of all, we should create an account on GitHub at https://github.com/join. After that, we can install the Git client from the official website, http://git-scm.com/.
For Windows, there is an .exe file available. So, you just have to download the installer file from git-scm.com/download/win and execute it.
Installation on Linux is also very easy, as there is a binary package available. On Debian-based distributions such as Ubuntu, we can install it with apt-get:
sudo apt-get install git
Installing Git on other Linux distributions
If you do not use a Debian-based Linux distribution, you can find an overview of how to install it for your distribution at http://git-scm.com/download/linux.
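Whichever installation route was taken, a quick check can confirm that the client is reachable from the shell. This is only a minimal sketch; the version number that is printed depends on the machine.

```shell
# Print the installed Git version.
# The command fails with "command not found" if Git is not on the PATH.
git --version
```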
After the successful installation and creation of our GitHub user account, we can go on and configure our Git client.
Open a new shell/console window and type in git. This will show you the usage information and the most commonly used commands. If you use Windows, you can use the Git Bash emulation, which behaves just like the git command in Linux.
We now have to set our username and email address. We do this with the following:
git config --global user.name 'Your Username'
git config --global user.email 'Your Email Address'
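To check what was stored, the values can be read back with git config. The sketch below is an illustration under assumptions: it points HOME at a throwaway folder so that the demo does not touch real settings, and the address you@example.com is a placeholder. On your own machine, you would skip the first two lines and use your real details.

```shell
# Assumption for this demo only: use a temporary HOME so the real
# global configuration stays untouched.
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# Store the identity that Git attaches to every commit.
git config --global user.name 'Your Username'
git config --global user.email 'you@example.com'

# Read single values back.
git config --global --get user.name
git config --global --get user.email

# List everything stored in the global configuration file.
git config --global --list
```

The --global flag writes to the configuration file in the home directory, so the identity applies to every repository of the current user.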
The Git system comes with some terminology. We do not have to know everything, but we should take a look at the fundamental elements of this version control system.
We begin with the repository, the most basic element. You can think of it as a folder where your project is saved. This folder also stores the history of the project: what exactly was changed, when, and by whom.
A commit records a change in the repository. Every commit has a unique identifier and a message, in which we explain why we changed something.
A diff is the difference between two commits. It shows us what was added or deleted in the repository and in every single affected file.
A branch is a parallel version of a repository. It is located in our repository but does not affect the master branch. It is often used to experiment with new functionalities.
Merging is the process of taking changes from one branch and applying them to another, normally in the same repository.
Fetching refers to the process of getting the latest changes from a remote repository (such as https://github.com/) without merging them with your local repository.
Pull is the combination of fetching changes and merging them. This is related to pull requests, which are requests to merge a certain change into a repository. They are often used when several users are working on the same repository.
Pushing is the process of sending your local changes of a repository to a remote repository, such as https://github.com/.
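The local terms above (repository, commit, diff, branch, and merge) can be tried out in a throwaway folder. The following is a minimal sketch under assumptions: the temporary folder, the demo identity, the branch name experiment, and the contents of lm.R are made up for illustration. The remote operations are left out here because they need a second repository.

```shell
set -e

# Repository: a folder whose history Git tracks.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name 'Demo User'          # identity for this repository only
git config user.email 'demo@example.com'

# Commit: record a change together with a message.
echo 'fit <- lm(mpg ~ wt, data = mtcars)' > lm.R
git add lm.R
git commit -q -m 'Add linear model'

# Branch: a parallel line of work that leaves master untouched.
git checkout -q -b experiment
echo 'summary(fit)' >> lm.R
git commit -q -am 'Print model summary'

# Diff: what differs between the two branches.
git --no-pager diff master experiment

# Merge: bring the changes from experiment into master.
git checkout -q master
git merge -q experiment

# The history of master now contains both commits.
git --no-pager log --oneline
```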
The traditional way to use Git is via the shell. Normally, we begin by creating a new local repository with the following:
git init
We can then create files in this repo and add them to the version control structure with the add command:
git add lm.R
Here, we replace lm.R with the name of the file we want to add. To add all files in the folder, you can use the following:
git add *
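To see what the add command has done, git status lists which files are staged and which are still untracked. The sketch below runs in a temporary folder; the second file, plot.R, and the file contents are assumptions made for illustration. Note that the * is expanded by the shell and therefore skips hidden files; git add . is a common alternative.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q

# Two example scripts (hypothetical contents).
echo 'fit <- lm(mpg ~ wt, data = mtcars)' > lm.R
echo 'plot(mtcars$wt, mtcars$mpg)' > plot.R

# Stage a single file, then inspect the state.
git add lm.R
git status --short
# A  lm.R      <- staged for the next commit
# ?? plot.R    <- not tracked yet

# Stage everything the shell glob matches.
git add *
git status --short
```

Staging only selects what goes into the next commit. Nothing is recorded in the history until git commit is run.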
To connect our project to a remote repository on GitHub, we can use the remote command. The empty repository has to be created on the GitHub website first:
git remote add origin git@github.com:USERNAME/PROJECTNAME.git
This registers the GitHub repository under the name origin. Once our changes are committed (for example, with git commit -m 'Initial commit'), we can push the project to this remote repository with the following line:
git push origin master
We will then be able to see the project on the GitHub website under our user account. Using Git from the shell can be confusing in the beginning, but RStudio offers a great UI to work with version control systems such as Git.
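Before moving on to a graphical client, the shell workflow above can be rehearsed offline. As a minimal sketch, a bare repository in a temporary folder stands in for GitHub here. The folder names remote.git, project, and colleague, the demo identity, and the file contents are assumptions made for this illustration. The second clone is only there to show the difference between fetch and pull.

```shell
set -e
work="$(mktemp -d)"

# Stand-in for GitHub: a bare repository in a local folder.
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master

# Local project: init, add, commit.
mkdir "$work/project"
cd "$work/project"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Demo User'
git config user.email 'demo@example.com'
echo 'fit <- lm(mpg ~ wt, data = mtcars)' > lm.R
git add lm.R
git commit -q -m 'Add linear model'

# Register the remote under the name origin and push master to it.
git remote add origin "$work/remote.git"
git push -q origin master
git ls-remote --heads origin      # the remote now has a master branch

# A second working copy, as a collaborator would have.
git clone -q "$work/remote.git" "$work/colleague"

# Publish one more change from the first working copy.
echo 'summary(fit)' >> lm.R
git commit -q -am 'Print model summary'
git push -q origin master

# Fetch downloads the new commit but leaves the local branch as it is.
cd "$work/colleague"
git fetch -q origin
git --no-pager log --oneline master..origin/master   # commits not merged yet

# Pull fetches and merges in one step.
git pull -q --ff-only origin master
```

With a real GitHub repository, only the address passed to git remote add changes; the remaining commands stay the same.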