Chapter 6. Credit Risk Detection and Prediction – Predictive Analytics

In the previous chapter, we covered a lot of ground in the financial domain, where we took up the challenge of detecting and predicting bank customers who could be potential credit risks. We now have a good idea of our main objective regarding credit risk analysis. In addition, the knowledge gained from descriptive analytics of the dataset and its features will be useful for predictive analytics, as we mentioned earlier.

In this chapter, we will journey through the world of predictive analytics, which sits at the core of machine learning and data science. Predictive analytics combines classification algorithms, regression algorithms, domain knowledge, and business logic to build predictive models and derive useful insights from data. At the end of the previous chapter, we discussed various machine learning algorithms that are applicable to our objective, and we will explore several of them in this chapter when we build predictive models using the given dataset.

Predictive analytics holds a lot of promise for organizations that want to strengthen their business and profits in the future. With the advent of big data, most organizations now have more data than they can analyze! While this is a big challenge, a tougher challenge is to select the right data points from this data and build predictive models that are capable of predicting future outcomes correctly. However, there are several caveats in this approach because each model is essentially a mathematical function based on formulae, assumptions, and probability. Also, in the real world, conditions and scenarios keep changing and evolving, so one must remember that a predictive model built today may be completely obsolete tomorrow.

Many skeptics say that it is extremely difficult for computers to predict outcomes that even humans cannot predict, due to the ever-changing nature of the environment over time, and hence that all statistical methods are only valuable under ideal assumptions and conditions. While this is true to some extent, with the right data, a proper mindset, and the right algorithms and techniques, we can build robust predictive models that tackle problems which would otherwise be impossible to tackle by conventional or brute-force methods.

Predictive modeling is a difficult task, and good results may not always be easy to obtain. One must take these challenges in stride and remember the quotation from the famous statistician George E.P. Box, who said, "Essentially, all models are wrong, but some are useful", which is quite true based on what we discussed earlier. Always remember that a predictive model will never be 100% perfect but, if it is built with the right principles, it will be very useful!

In this chapter, we will focus on the following topics:

Predictive analytics
How to predict credit risk
Important concepts in predictive modeling
Getting the data
Data preprocessing
Feature selection
Modeling using logistic regression
Modeling using support vector machines
Modeling using decision trees
Modeling using random forests
Modeling using neural networks
Model comparison and selection

Predictive analytics

We discussed predictive analytics in the previous chapter to give you a general overview of what it means, and we will discuss it in more detail in this section. Predictive analytics can be defined as a subset of the machine learning universe that encompasses a wide variety of supervised learning algorithms based on data science, statistics, and mathematical formulae. These algorithms enable us to build predictive models from data that has already been collected, and the models in turn enable us to make predictions about what might happen in the future based on past observations. Combining these predictions with domain knowledge, expertise, and business logic enables analysts to make data-driven decisions, which is the ultimate outcome of predictive analytics.

The data we are talking about here has already been observed in the past and has been collected over a period of time for analysis. This data, which is fed to the model, is often known as historical data or training data. However, most of the time in the predictive modeling methodology, we do not feed the raw data directly but use features extracted from the data after suitable transformations. The data features along with a supervised learning algorithm form a predictive model. Data obtained in the present can then be fed to this model to predict the outcomes of interest and also to test the performance of the model with regard to various accuracy metrics. This data is known as testing data in the machine learning world.
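To make the distinction between training and testing data concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of our dataset or the methods used later in this chapter: the records are made up, the single feature (a credit amount) is hypothetical, and the "algorithm" is a toy rule that places a cut-off halfway between the class means of that feature.

```python
# Hypothetical historical (training) records: (credit_amount, is_bad_risk).
training_data = [(1200, 0), (2500, 0), (3100, 0), (7800, 1), (9400, 1), (6600, 1)]


def fit_threshold(records):
    """Toy learner: cut-off halfway between the mean amounts of the two classes."""
    good = [amount for amount, label in records if label == 0]
    bad = [amount for amount, label in records if label == 1]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2


def predict(threshold, amount):
    """Predict 1 (bad risk) when the amount exceeds the learned cut-off."""
    return 1 if amount > threshold else 0


# The feature plus the learning rule give us the "model" (here, one number).
threshold = fit_threshold(training_data)

# Hypothetical testing records, kept apart from the data used for fitting.
testing_data = [(1800, 0), (8800, 1), (5000, 0)]
predictions = [predict(threshold, amount) for amount, _ in testing_data]
```

The fitted value is learned only from `training_data`; `testing_data` is used purely to obtain predictions that can later be compared with the known labels.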

The analytics pipeline that we will be following for carrying out predictive analytics in this chapter is a standard process, which is explained briefly in the following steps:

Getting the data: Here we get the dataset on which we will be building the predictive model. We will perform some basic descriptive analysis of the dataset, which we have already covered in the previous chapter. Once we have the data we will move on to the next step.
Data preprocessing: In this step, we carry out data transformations, such as changing data types, feature scaling, and normalization, if necessary, to prepare the data for training models. Usually this step is carried out after the dataset preparation step. However, in this case, the end results are the same, so we can perform these steps in either order.
Dataset preparation: In this step, we use a ratio such as 70:30 or 60:40 to separate the instances in the data into training and testing datasets. We usually use the training dataset to train a model and then check its performance and predictive capability with the testing dataset. Often, data is divided in proportions of 60:20:20, where we also have a validation dataset besides the other two datasets. However, we will just keep it to two datasets in this chapter.
Feature selection: This is an iterative process that may be repeated at a later stage if needed. The main objective in this step is to choose a set of attributes or features from the training dataset that enables the predictive model to give the best predictions possible, minimizing error rates and maximizing accuracy.
Predictive modeling: This is the main step where we select a machine learning algorithm best suited for solving the problem and build the predictive model using the algorithm by feeding it the features extracted from the data in the training dataset. The output of this stage is a predictive model which can be used for predictions on future data instances.
Model evaluation: In this phase, we use the testing dataset to get predictions from the predictive model and use a variety of techniques and metrics to measure the performance of the model.
Model tuning: We fine-tune the various parameters of the model and perform feature selection again if necessary. We then rebuild the model and re-evaluate it until we are satisfied with the results.
Model deployment: Once the predictive model gives a satisfactory performance, we can deploy it behind a web service so that any application can obtain predictions in real time or near real time. This step focuses more on software and application development around deploying the model, so we won't be covering it since it is out of scope. However, there are a lot of tutorials out there on building web services around predictive models to enable "prediction as a service".

The modeling, evaluation, and tuning steps are iterative and may be performed several times if needed.
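A few of the steps above can be sketched in plain Python. This is only a rough illustration under stated assumptions, not the code we will use later in the chapter: the helper names `train_test_split`, `min_max_scale`, and `accuracy` are hypothetical, the 70:30 ratio is one of the example ratios mentioned above, and min-max scaling stands in for whichever feature scaling a given dataset might need.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Dataset preparation: shuffle a copy of rows, then split it by the given ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def min_max_scale(values, lo, hi):
    """Data preprocessing: rescale values linearly so that lo maps to 0 and hi to 1."""
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(v - lo) / span for v in values]


def accuracy(actual, predicted):
    """Model evaluation: fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Ten hypothetical instances, split 70:30 into training and testing sets.
rows = list(range(10))
train, test = train_test_split(rows)

scaled = min_max_scale([10, 20, 30], lo=10, hi=30)
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

With ten instances and a 70:30 ratio, `train` holds seven instances and `test` holds the remaining three. Accuracy is only one of several metrics that can be used in the evaluation step.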

Even though the preceding process might look pretty intensive at first glance, it is really a simple and straightforward process which, once understood, is useful in building any type of predictive model. An important thing to remember is that predictive modeling is an iterative process, where we might need to analyze the data and build the model several times by getting feedback from the model predictions and evaluating them. It is therefore extremely important that you do not get discouraged if your model doesn't perform well on the first go because a model can never be perfect, as we mentioned before, and building a good predictive model is an art as well as a science!
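The iterative nature of the process can be illustrated with a small tuning loop. The sketch below is a toy under assumptions of our own: the records are made up, the "model" is a single cut-off on one feature, and the candidate values are arbitrary. It simply rebuilds and re-evaluates the model for each candidate parameter and keeps the best one.

```python
# Hypothetical records held out for evaluation: (feature_value, actual_class).
validation = [(0.10, 0), (0.35, 0), (0.40, 1), (0.65, 1), (0.80, 1), (0.20, 0)]


def evaluate(cutoff, records):
    """Accuracy of the rule 'predict class 1 when feature_value >= cutoff'."""
    hits = sum((value >= cutoff) == bool(label) for value, label in records)
    return hits / len(records)


# Try each candidate parameter, re-evaluate, and keep the best seen so far.
best_cutoff, best_score = None, -1.0
for cutoff in [0.2, 0.3, 0.4, 0.5, 0.6]:
    score = evaluate(cutoff, validation)
    if score > best_score:
        best_cutoff, best_score = cutoff, score
```

A real tuning run would follow the same build, evaluate, and adjust cycle, only with an actual learning algorithm and its parameters in place of the single cut-off.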

In the next section, we will focus on how to apply predictive analytics to solve our prediction problem and on the kinds of machine learning algorithms we will explore in this chapter.