We have analyzed our dataset, performed the necessary feature engineering and statistical tests, built visualizations, and gained substantial domain knowledge about credit risk analysis and the kinds of features banks consider when they evaluate customers. We analyzed each feature in detail to give you an idea of what banks look at when assigning a credit rating, to build solid domain knowledge, and to familiarize you with the techniques for performing an exploratory and descriptive analysis of any dataset in the future. So, what next? Now comes the really interesting part of using this dataset: building feature sets from this data and feeding them into predictive models to predict which customers are potential credit risks and which are not. As mentioned previously, there are two ingredients here: data and algorithms. In fact, we will go a step further and say that it is feature sets and algorithms that will help us achieve our main objective.

A dataset is essentially a file consisting of several records of observations, where each tuple or record denotes one complete observation and the columns are the specific attributes, or features, that describe its characteristics. In predictive analytics, there is usually one attribute in the dataset whose class or category has to be predicted. In our dataset this variable is credit.rating, also known as the dependent variable. All the other features on which it depends are the independent variables. A combination of these features forms a feature vector, popularly known as a feature set. There are various ways of identifying which feature sets we should consider for predictive models, and you will see going ahead that for any dataset there is never a single fixed feature set; it keeps changing based on the feature engineering performed, the type of predictive model we are building, and the significance of the features as determined by statistical tests.
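The split between the dependent variable and the feature set can be sketched in a few lines. The following is a minimal illustration using pandas; the small inline frame and the column names other than credit.rating (account.balance, credit.duration.months) are hypothetical stand-ins for the real dataset:

```python
import pandas as pd

# Tiny stand-in for the credit dataset (values are illustrative only).
data = pd.DataFrame({
    "credit.rating": [1, 0, 1],             # dependent variable
    "account.balance": [2, 1, 3],           # independent variable
    "credit.duration.months": [12, 24, 6],  # independent variable
})

# Separate the class to be predicted from the feature set.
y = data["credit.rating"]
X = data.drop(columns=["credit.rating"])

print(list(X.columns))  # the features making up the feature vector
```

Each row of `X` is one feature vector; `y` holds the corresponding class labels the model will learn to predict.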

Each property in the feature set is termed a feature or attribute; these are also known as independent or explanatory variables in statistics. Features can be of various types, as we saw in our dataset: categorical features with several classes, binary features with two classes, ordinal features, which are categorical features with an inherent order (for example, low, medium, high), and numerical features, which can take integer or real values. Features are critically important in building predictive models, and more often than not data scientists spend a great deal of time constructing the best possible feature sets to boost the accuracy of their models. This is why domain knowledge is essential, in addition to familiarity with machine learning algorithms.
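The distinction between these feature types matters in practice because ordinal features should be encoded in a way that preserves their order. A small sketch, with hypothetical column names standing in for real dataset features:

```python
import pandas as pd

# Illustrative features of each type (names and values are hypothetical).
df = pd.DataFrame({
    "savings":        ["low", "high", "medium", "low"],  # ordinal
    "foreign.worker": [1, 0, 1, 1],                      # binary
    "age":            [25, 40, 31, 52],                  # numerical
})

# Encode the ordinal feature with integers that preserve its inherent order,
# so that low < medium < high survives the encoding.
order = {"low": 0, "medium": 1, "high": 2}
df["savings.encoded"] = df["savings"].map(order)

print(df["savings.encoded"].tolist())  # → [0, 2, 1, 0]
```

A plain one-hot encoding would also work for the ordinal column, but it discards the ordering information, which some models can exploit.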

Once the feature sets are ready, we can feed them into predictive models and start predicting the credit rating of customers based on their features. An important thing to remember is that this is an iterative process: we have to keep modifying our feature sets based on the outputs and feedback obtained from our predictive models in order to improve them further. Several methods relevant to our scenario, all belonging to the class of supervised machine learning algorithms, are explained briefly in this section.
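The iterative loop described above can be sketched as follows: fit a supervised model on a candidate feature set, score it, and compare against an alternative feature set. This is a sketch on synthetic data using scikit-learn, not the book's actual modeling code; the labels stand in for credit.rating:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: five candidate features, labels driven by the first two.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for credit.rating

# Compare a full feature set against a reduced one: the core of the
# iterate-and-refine loop driven by model feedback.
for features in ([0, 1, 2, 3, 4], [0, 1]):
    score = cross_val_score(LogisticRegression(), X[:, features], y, cv=5).mean()
    print(f"features={features} mean CV accuracy={score:.3f}")
```

In a real project the candidate feature sets would come from feature engineering and statistical significance tests rather than fixed column indices, and the evaluation metric would be chosen to match the business cost of misclassifying risky customers.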

We will cover several of these algorithms when building predictive models in the next chapter.