In the previous chapter, we covered a lot of ground in the financial domain where we took up the challenge of detecting and predicting bank customers who could be potential credit risks. We now have a good idea about our main objective regarding credit risk analysis. Besides this, the substantial knowledge gained from descriptive analytics of the dataset and its features will be useful for predictive analytics, as we had mentioned earlier.
In this chapter, we will be journeying through the world of predictive analytics, which sits at the core of machine learning and data science. Predictive analytics encompasses several things which include classification algorithms, regression algorithms, domain knowledge, and business logic which are combined to build predictive models and derive useful insights from data. We had discussed various machine learning algorithms at the end of the previous chapter which would be applicable for solving our objective, and we will be exploring several of them in this chapter when we build predictive models using the given dataset and these algorithms.
An interesting take on predictive analytics is that it holds a lot of promise for organizations who want to strengthen their business and profits in the future. With the advent of big data, most organizations now have more data than they can analyze! While this is a big challenge, a tougher challenge is to select the right data points from this data and build predictive models which would be capable of predicting outcomes correctly in the future. However, there are several caveats in this approach because each model is basically mathematical functions based on formulae, assumptions, and probability. Also, in the real world, conditions and scenarios keep changing and evolving and thus one must remember that a predictive model built today may be completely redundant tomorrow.
A lot of skeptics say that it is extremely difficult for computers to mimic humans to predict outcomes which even humans can't predict due of the ever changing nature of the environment with time, and hence all statistical methods are only valuable under ideal assumptions and conditions. While this is true to some extent, with the right data, a proper mindset, and by applying the right algorithms and techniques, we can build robust predictive models which can definitely try and tackle problems which would be otherwise impossible to tackle by conventional or brute-force methods.
Predictive modeling is a difficult task and while there might be a lot of challenges and results might be difficult to obtain always, one must take these challenges with a pinch of salt and remember the quotation from the famous statistician George E.P. Box, who claimed that Essentially all models are wrong but some are useful!, which is quite true based on what we discussed earlier. Always remember that a predictive model will never be 100% perfect but, if it is built with the right principles, it will be very useful!
In this chapter, we will focus on the following topics:
We had already discussed a fair bit about predictive analytics in the previous chapter to give you a general overview of what it means. We will be discussing it in more detail in this section. Predictive analytics can be defined as a subset of the machine learning universe, which encompasses a wide variety of supervised learning algorithms based on data science, statistics, and mathematical formulae which enable us to build predictive models using these algorithms and data which has already been collected. These models enable us to make predictions of what might happen in the future based on past observations. Combining this with domain knowledge, expertise, and business logic enables analysts to make data driven decisions using these predictions, which is the ultimate outcome of predictive analytics.
The data we are talking about here is data which has already been observed in the past and has been collected over a period of time for analysis. This data is often known as historical data or training data which is fed to the model. However, most of the time in the predictive modeling methodology, we do not feed the raw data directly but use features extracted from the data after suitable transformations. The data features along with a supervised learning algorithm form a predictive model. The data which is obtained in the present can then be fed to this model to predict outcomes which are under observation and also to test the performance of the model with regards to various accuracy metrics. This data is known as testing data in the machine learning world.
The analytics pipeline that we will be following for carrying out predictive analytics in this chapter is a standard process, which is explained briefly in the following steps:
The last three steps are iterative and may be performed several times if needed.
Even though the preceding process might look pretty intensive at first glance, it is really a very simple and straight-forward process, which once understood would be useful in building any type of predictive modeling. An important thing to remember is that predictive modeling is an iterative process where we might need to analyze the data and build the model several times by getting feedback from the model predictions and evaluating them. It is therefore extremely important that you do not get discouraged even if your model doesn't perform well on the first go because a model can never be perfect, as we mentioned before, and building a good predictive model is an art as well as science!
In the next section, we will be focusing on how we would apply predictive analytics to solve our prediction problem and the kind of machine learning algorithms we will be exploring in this chapter.