This use case is about collaborative filtering. We are going to build a recommendation system based on embeddings created from a deep learning model. To do this, we will use the same dataset we used in Chapter 4, Training Deep Prediction Models, which is the retail transactional database. If you have not already downloaded the database, go to the following link, https://www.dunnhumby.com/sourcefiles, and select Let’s Get Sort-of-Real. Choose the option for the smallest dataset, titled All transactions for a randomly selected sample of 5,000 customers. Once you have read the terms and conditions and downloaded the dataset to your computer, unzip it into a directory called dunnhumby/in under the code folder. Ensure that the files sit directly under this folder and not in a subdirectory; you may have to move them after unzipping the data.
The data contains details of retail transactions linked by basket IDs. Each transaction has a date and a store code, and some are also linked to customers. Here are the fields that we will use in this analysis:
| Field name | Description | Format |
| --- | --- | --- |
| CUST_CODE | Customer Code. This links the transactions/visits to a customer. | Char |
| SPEND | Spend associated with the items bought. | Numeric |
| PROD_CODE | Product Code. | Char |
| PROD_CODE_10 | Product Hierarchy Level 10 Code. | Char |
| PROD_CODE_20 | Product Hierarchy Level 20 Code. | Char |
| PROD_CODE_30 | Product Hierarchy Level 30 Code. | Char |
| PROD_CODE_40 | Product Hierarchy Level 40 Code. | Char |
If you want more details on the structure of the files, you can go back and re-read the use case in Chapter 4, Training Deep Prediction Models. We are going to use this dataset to create a recommendation engine. There is a family of machine learning algorithms called Market Basket Analysis that can be used with transactional data, but this use case is based on collaborative filtering. Collaborative filtering makes recommendations based on the ratings people give to products. It is commonly used for music and film recommendations, where people rate items, usually on a scale of 1-5. Perhaps the best-known recommendation system is Netflix's, because of the Netflix Prize (https://en.wikipedia.org/wiki/Netflix_Prize).
We are going to use our dataset to create implicit ratings of how much a customer rates an item. If you are not familiar with implicit ratings, they are ratings derived from behavioral data rather than explicitly assigned by the user. We will use one of the product codes, PROD_CODE_40, and calculate the quantiles of the spend for that product code. The quantiles divide the spend values into 5 roughly equal-sized groups. We will use these to assign each customer a rating for that product code based on how much they spent on it. The top 20% of spenders will get a rating of 5, the next 20% will get a rating of 4, and so on. Each customer/product code combination that exists will have a rating from 1-5:
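To make the quantile-to-rating step concrete, here is a minimal sketch using pandas. The spend figures and customer codes below are made up for illustration; they are not from the dunnhumby files, and the column names simply mirror the fields described above.

```python
import pandas as pd

# Toy stand-in for aggregated data: total SPEND per
# CUST_CODE / PROD_CODE_40 combination (values are invented).
spend = pd.DataFrame({
    "CUST_CODE": [f"CUST{i:03d}" for i in range(10)],
    "PROD_CODE_40": ["D00001"] * 10,
    "SPEND": [1.0, 2.5, 3.0, 4.5, 5.0, 6.5, 7.0, 8.5, 9.0, 10.5],
})

# Within each PROD_CODE_40 group, cut the spend distribution into
# 5 quantile bins; the bin labels 1-5 become the implicit rating,
# so the top 20% of spenders receive a 5, the next 20% a 4, etc.
spend["RATING"] = (
    spend.groupby("PROD_CODE_40")["SPEND"]
    .transform(lambda s: pd.qcut(s, 5, labels=[1, 2, 3, 4, 5]))
    .astype(int)
)

print(spend[["CUST_CODE", "PROD_CODE_40", "RATING"]])
```

Grouping by PROD_CODE_40 before calling `pd.qcut` matters: the quantile boundaries are computed per product code, so a customer's rating reflects their spend relative to other buyers of that same product group, not relative to all spending overall.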