We have analyzed our dataset, performed the necessary feature engineering and statistical tests, built visualizations, and gained substantial domain knowledge about credit risk analysis and the kinds of features banks consider when they evaluate customers. We analyzed each feature in detail to give you an idea of what banks look at when assigning a credit rating, to build solid domain knowledge, and to familiarize you with the techniques for performing an exploratory and descriptive analysis of any dataset in the future. So, what next? Now comes the really interesting part of using this dataset: building feature sets from this data and feeding them into predictive models to predict which customers are potential credit risks and which are not. As mentioned previously, there are two ingredients here: data and algorithms. In fact, we will go a step further and say that it is feature sets and algorithms that will help us achieve our main objective.

A dataset is essentially a file consisting of several records of observations, where each tuple or record denotes one complete observation and the columns are the specific attributes, or features, that describe its characteristics. In predictive analytics, there is usually one attribute in the dataset whose class or category has to be predicted. In our dataset this variable is credit.rating, also known as the dependent variable. All the other features on which it depends are the independent variables. A combination of these features forms a feature vector, popularly known as a feature set. There are various ways of identifying which feature sets we should consider for predictive models, and you will see going ahead that for any dataset there is never a single fixed feature set; it keeps changing based on the feature engineering performed, the type of predictive model we are building, and the significance of the features as determined by statistical tests.
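The split between the dependent variable and the feature set can be sketched in a few lines. The following is a minimal illustration using pandas; the small inline frame and the column names other than credit.rating (account.balance, credit.duration.months) are hypothetical stand-ins for the real dataset:

```python
import pandas as pd

# Tiny stand-in for the credit dataset (values are illustrative only).
data = pd.DataFrame({
    "credit.rating": [1, 0, 1],             # dependent variable
    "account.balance": [2, 1, 3],           # independent variable
    "credit.duration.months": [12, 24, 6],  # independent variable
})

# Separate the class to be predicted from the feature set.
y = data["credit.rating"]
X = data.drop(columns=["credit.rating"])

print(list(X.columns))  # the features making up the feature vector
```

Each row of `X` is one feature vector; `y` holds the corresponding class labels the model will learn to predict.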

Each property in the feature set is termed a feature or attribute; these are also known as independent or explanatory variables in statistics. Features can be of various types, as we saw in our dataset: categorical features with several classes, binary features with two classes, ordinal features, which are categorical features with an inherent order (for example, low, medium, high), and numerical features, which can take integer or real values. Features are critically important in building predictive models, and more often than not data scientists spend a great deal of time constructing the best possible feature sets to boost the accuracy of their models. This is why domain knowledge is essential, in addition to familiarity with machine learning algorithms.
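The distinction between these feature types matters in practice because ordinal features should be encoded in a way that preserves their order. A small sketch, with hypothetical column names standing in for real dataset features:

```python
import pandas as pd

# Illustrative features of each type (names and values are hypothetical).
df = pd.DataFrame({
    "savings":        ["low", "high", "medium", "low"],  # ordinal
    "foreign.worker": [1, 0, 1, 1],                      # binary
    "age":            [25, 40, 31, 52],                  # numerical
})

# Encode the ordinal feature with integers that preserve its inherent order,
# so that low < medium < high survives the encoding.
order = {"low": 0, "medium": 1, "high": 2}
df["savings.encoded"] = df["savings"].map(order)

print(df["savings.encoded"].tolist())  # → [0, 2, 1, 0]
```

A plain one-hot encoding would also work for the ordinal column, but it discards the ordering information, which some models can exploit.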

Once the feature sets are ready, we can feed them into predictive models and start predicting the credit rating of customers based on their features. An important thing to remember is that this is an iterative process: we have to keep modifying our feature sets based on the outputs and feedback obtained from our predictive models in order to improve them further. Several methods relevant to our scenario, all belonging to the class of supervised machine learning algorithms, are explained briefly in this section.
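The iterative loop described above can be sketched as follows: fit a supervised model on a candidate feature set, score it, and compare against an alternative feature set. This is a sketch on synthetic data using scikit-learn, not the book's actual modeling code; the labels stand in for credit.rating:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: five candidate features, labels driven by the first two.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for credit.rating

# Compare a full feature set against a reduced one: the core of the
# iterate-and-refine loop driven by model feedback.
for features in ([0, 1, 2, 3, 4], [0, 1]):
    score = cross_val_score(LogisticRegression(), X[:, features], y, cv=5).mean()
    print(f"features={features} mean CV accuracy={score:.3f}")
```

In a real project the candidate feature sets would come from feature engineering and statistical significance tests rather than fixed column indices, and the evaluation metric would be chosen to match the business cost of misclassifying risky customers.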

We will cover several of these algorithms when building predictive models in the next chapter.