Use case – collaborative filtering

In this use case, we are going to build a recommendation system based on embeddings created from a deep learning model. To do this, we are going to use the same dataset we used in Chapter 4, Training Deep Prediction Models, which is the retail transactional database. If you have not already downloaded the database, go to https://www.dunnhumby.com/sourcefiles and select Let's Get Sort-of-Real. Choose the option for the smallest dataset, titled All transactions for a randomly selected sample of 5,000 customers. Once you have read the terms and conditions and downloaded the dataset to your computer, unzip it into a directory called dunnhumby/in under the code folder. Ensure that the files sit directly under this folder and not in a subdirectory; you may have to copy them there after unzipping the data.

The data contains details of retail transactions linked by basket IDs. Each transaction has a date and a store code, and some are also linked to customers. Here are the fields that we will use in this analysis:

| Field name | Description | Format |
| --- | --- | --- |
| CUST_CODE | Customer Code. This links the transactions/visits to a customer. | Char |
| SPEND | Spend associated with the items bought. | Numeric |
| PROD_CODE | Product Code. | Char |
| PROD_CODE_10 | Product Hierarchy Level 10 Code. | Char |
| PROD_CODE_20 | Product Hierarchy Level 20 Code. | Char |
| PROD_CODE_30 | Product Hierarchy Level 30 Code. | Char |
| PROD_CODE_40 | Product Hierarchy Level 40 Code. | Char |

If you want more details on the structure of the files, you can go back and re-read the use case in Chapter 4, Training Deep Prediction Models. We are going to use this dataset to create a recommendation engine. There is a family of machine learning algorithms called market basket analysis that can be used with transactional data, but this use case is based on collaborative filtering. Collaborative filtering makes recommendations based on the ratings people give to products. It is commonly used for music and film recommendations, where people rate items, usually on a scale of 1-5. Perhaps the best-known recommendation system is Netflix's, because of the Netflix Prize (https://en.wikipedia.org/wiki/Netflix_Prize).

We are going to use our dataset to create implicit ratings of how much a customer likes an item. If you are not familiar with implicit ratings, they are ratings derived from behavioral data rather than explicitly assigned by the user. We will use one of the product codes, PROD_CODE_40, and calculate the quantiles of each customer's spend on that product code. The quantiles divide the customers into five roughly equal-sized groups per product code. We will use these groups to assign each customer a rating for a product code based on how much they spent on it: the top 20% of customers get a rating of 5, the next 20% get a rating of 4, and so on. Each customer/product code combination that exists will have a rating from 1 to 5.
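To make the procedure concrete, here is a minimal sketch of the quantile-based rating in Python with pandas. It assumes the transaction files have been loaded into a single DataFrame with the columns from the table above; the file name used here is a placeholder for the actual dunnhumby CSV files, which you would load and concatenate as appropriate:

```python
import pandas as pd

# Placeholder path: the dunnhumby extract ships as several CSV files
# under dunnhumby/in; adjust the loading step to your unzipped files.
df = pd.read_csv("dunnhumby/in/transactions.csv")

# Keep only transactions that are linked to a customer.
df = df[df["CUST_CODE"].notna()]

# Total spend for each customer / PROD_CODE_40 combination.
spend = (df.groupby(["CUST_CODE", "PROD_CODE_40"], as_index=False)["SPEND"]
           .sum())

def quintile_rating(s: pd.Series) -> pd.Series:
    # Rank first so tied spends do not break the equal-sized bins,
    # then cut into five groups: bottom 20% -> 1, top 20% -> 5.
    return pd.qcut(s.rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Rate each customer within every product code by spend quintile.
spend["RATING"] = spend.groupby("PROD_CODE_40")["SPEND"].transform(quintile_rating)
```

The resulting frame has one row per customer and product code with a RATING column from 1 to 5, which is exactly the (user, item, rating) triple format that collaborative filtering models expect.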

There is a rich history of using quantiles in retail loyalty systems. One of the earliest segmentation approaches for retail loyalty data was called RFM analysis. RFM is an acronym for Recency, Frequency, and Monetary spend. It gives each customer a ranking from 1 (lowest) to 5 (highest) in each of these categories, with an equal number of customers in each ranking. For Recency, the 20% of customers who visited most recently would be given a 5, the next 20% a 4, and so on. For Frequency, the top 20% of customers with the most transactions would be given a 5, the next 20% a 4, and so on. Similarly, for Monetary spend, the top 20% of customers by revenue would be given a 5, the next 20% a 4, and so on. The three numbers are then concatenated, so a customer with an RFM of 453 scores 4 for Recency, 5 for Frequency, and 3 for Monetary spend. Once the score has been calculated, it can be used for many purposes, for example, cross-sell and churn analysis. RFM analysis was very popular with marketing managers in the late 1990s and early 2000s because it is easily implemented and well understood. However, it is not flexible, and it is being replaced by machine learning techniques.
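Since RFM is built from the same quintile idea we just used for the implicit ratings, a short sketch may help. This is illustrative only, reusing the df and quintile_rating from the previous sketch and assuming the transaction files carry a BASKET_ID column and a SHOP_DATE column with sortable dates (both column names are assumptions):

```python
# One row per customer with the three raw RFM inputs.
rfm = df.groupby("CUST_CODE").agg(
    last_visit=("SHOP_DATE", "max"),   # Recency: date of most recent visit
    visits=("BASKET_ID", "nunique"),   # Frequency: number of distinct baskets
    revenue=("SPEND", "sum"),          # Monetary: total spend
)

r = quintile_rating(rfm["last_visit"])  # most recent 20% of visitors -> 5
f = quintile_rating(rfm["visits"])      # most frequent 20% -> 5
m = quintile_rating(rfm["revenue"])     # top 20% by revenue -> 5

# Concatenate the three digits, so R=4, F=5, M=3 gives "453".
rfm["RFM"] = r.astype(str) + f.astype(str) + m.astype(str)
```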