Association rules

Association rule mining is a technique for discovering frequently occurring patterns and associations in datasets held in databases, such as relational and transactional databases. These rules say nothing about the preferences of an individual; rather, they rely chiefly on the items within transactions to deduce an association. Every transaction is identified by a primary key (a distinct ID) called the transaction ID. All the transactions are studied as a group, and patterns are mined from them.

Association rules can be thought of as if-then relationships. To elaborate: given that a customer buys item A, the rule estimates the chance that item B is picked in the same transaction (that is, under the same transaction ID). Note that this is not causality; rather, it is a co-occurrence pattern that comes to the fore.

There are two elements of these rules:

Antecedent (if): This is the item or group of items whose presence in a transaction is taken as given
Consequent (then): This is the item or group of items that tends to appear along with the antecedent

Have a look at the following rule:

{Bread, milk} ⇒ {Butter}

The first part of this rule is called the antecedent and the second part (after the arrow) is the consequent. It conveys that there is a chance of Butter being picked in a transaction if Bread and Milk are already in it. However, the rule alone does not say how likely the consequent is to be present in an itemset, given the antecedent.
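As a minimal sketch, a rule such as the one above can be represented as a pair of item sets and checked against a single transaction. The `Rule` type, the helper names, and the sample transactions here are illustrative assumptions, not part of any particular library.

```python
from typing import FrozenSet, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical container for an association rule (antecedent => consequent)."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]


def antecedent_present(rule: Rule, transaction: Set[str]) -> bool:
    # The "if" part applies when every antecedent item is in the transaction.
    return rule.antecedent <= transaction


def rule_holds(rule: Rule, transaction: Set[str]) -> bool:
    # The rule holds when antecedent and consequent items are all present.
    return (rule.antecedent | rule.consequent) <= transaction


# {Bread, milk} => {Butter}, with made-up transactions for illustration.
rule = Rule(frozenset({"bread", "milk"}), frozenset({"butter"}))
t1 = {"bread", "milk", "butter", "eggs"}
t2 = {"bread", "milk"}

print(antecedent_present(rule, t1), rule_holds(rule, t1))
print(antecedent_present(rule, t2), rule_holds(rule, t2))
```

Checking one transaction at a time like this says nothing yet about how often the rule holds across all transactions.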

Let's look at a few metrics that will help us in getting there:

Support: This is a measure of how frequently an itemset appears across all the transactions. For example, consider two itemsets in the transactions of a retail outlet such as Walmart: itemset A = {milk} and itemset B = {laptop}. If we are asked which itemset has the higher support, we can expect it to be itemset A, because milk features in everyday grocery lists (and, in turn, in transactions) far more often than a laptop does. Let's add another level of association and consider two new itemsets: itemset A = {milk, cornflakes} and itemset B = {milk, USB drive}. Milk and cornflakes are likely to be purchased together more often than milk and a USB drive, which makes the support metric higher for A.

Let's translate this into mathematics:

Support(A, B) = Transactions comprising A and B/Total number of transactions

Here's an example:

- The total number of transactions is 10,000
- Transactions comprising A and B = 500
- Then Support(A, B) = 500/10,000 = 0.05
- 5% of transactions contain A and B together
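The support calculation can be sketched in a few lines of Python. The `support` function and the tiny transaction list are assumptions made for illustration; a real dataset would have far more transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Made-up transactions, one set of items per transaction ID.
transactions = [
    {"milk", "cornflakes"},
    {"milk", "cornflakes", "bread"},
    {"milk", "usb drive"},
    {"bread"},
]

support_a = support({"milk", "cornflakes"}, transactions)  # 2 of 4 transactions
support_b = support({"milk", "usb drive"}, transactions)   # 1 of 4 transactions
print(support_a, support_b)

# The worked example from the text, using counts directly.
print(500 / 10_000)  # 0.05
```

In this toy data, {milk, cornflakes} has the higher support, which mirrors the intuition described earlier.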

Confidence: This indicates how likely the consequent is to be picked when the antecedent is already in the transaction. For example, it is the probability of Butter occurring in a transaction given that Bread is already part of that transaction. In other words, it is the conditional probability of the consequent given the antecedent:

- Confidence(A ⇒ B) = Transactions comprising A and B/Transactions comprising A
- Confidence can be expressed in terms of support by dividing the numerator and denominator by the total number of transactions
- Confidence(A ⇒ B) = Support(A, B)/Support(A)

Here's an example:

- Transactions containing milk but not cereal = 50
- Transactions containing cereal but not milk = 30
- Transactions containing both milk and cereal = 10
- Total number of transactions = 100
- Confidence(milk ⇒ cereal) = 10/(50 + 10) = 0.167

This means that a transaction containing milk has a 16.7% probability of also containing cereal.
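The example can be reproduced with a short sketch. It reads the counts as 50 transactions with milk but no cereal, 30 with cereal but no milk, and 10 with both, which is the reading that matches the 10/(50 + 10) arithmetic. The remaining 10 transactions are filled with a placeholder item ("bread"), which is an assumption, as are the function names.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence(antecedent => consequent) = count(A and B) / count(A)."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return count(antecedent | consequent, transactions) / n_antecedent


# Rebuild the 100 transactions from the example.
transactions = (
    [{"milk"}] * 50              # milk, no cereal
    + [{"cereal"}] * 30          # cereal, no milk
    + [{"milk", "cereal"}] * 10  # both
    + [{"bread"}] * 10           # neither (placeholder item, assumed)
)

conf_milk_cereal = confidence({"milk"}, {"cereal"}, transactions)
print(round(conf_milk_cereal, 3))  # 0.167
```

Note that confidence is not symmetric: with the same data, Confidence(cereal ⇒ milk) is 10/(30 + 10) = 0.25.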

A drawback of confidence is that it accounts only for how popular the antecedent (item A) is, not the consequent (item B). If item B is itself very frequent, a transaction containing item A is likely to contain item B as well, whether or not the two are related. Hence, confidence can overstate the association. To account for the frequency of both items, we use a third measure called lift.

Lift: This is an indicator of how likely it is that item B will be picked in the cart/transaction, given that item A is already picked, while controlling for how frequent item B is overall. A lift value greater than 1 indicates a positive association between item A and item B: item B is more likely to be picked when item A is already in the cart than it is in general. A lift value of less than 1 means that item B is less likely to be picked when item A is already present. A lift value of exactly 1 means that the two items are independent, so no association can be established; a lift of zero means that the two items never appear together.

Lift(A ⇒ B) = (Transactions comprising A and B/Transactions comprising A)/Fraction of transactions comprising B

That is, Lift(A ⇒ B) = Confidence(A ⇒ B)/Support(B), which implies:

Lift(A ⇒ B) = Support(A, B)/(Support(A) * Support(B))

Lift(milk ⇒ cereal) = (10/(50 + 10))/0.4

= 0.417

Let's break this down. The probability of having cereal in the cart with the knowledge that milk is already in the cart (which is called confidence) = 10/(50 + 10) = 0.167.

The probability of having cereal in the cart without the knowledge that milk is in the cart = (30 + 10)/100 = 0.4.

It means that knowing that milk is already in the cart reduces the chance of picking cereal from 0.4 to 0.167. This is a lift of 0.167/0.4 = 0.417, which is less than 1. Hence, a customer who already has milk in the cart is less likely than the average customer to pick cereal.
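As a rough sketch, the lift calculation can be checked in Python with the same counts (50 transactions with milk only, 30 with cereal only, 10 with both, and 10 with neither). The `lift` function and the placeholder item "bread" are assumptions for illustration.

```python
def lift(antecedent, consequent, transactions):
    """Lift(A => B) = Support(A, B) / (Support(A) * Support(B))."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_b = sum(1 for t in transactions if consequent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_a == 0 or n_b == 0:
        raise ValueError("lift is undefined when either itemset never occurs")
    # Same formula written with counts: (n_ab / n) / ((n_a / n) * (n_b / n)).
    return (n * n_ab) / (n_a * n_b)


transactions = (
    [{"milk"}] * 50              # milk, no cereal
    + [{"cereal"}] * 30          # cereal, no milk
    + [{"milk", "cereal"}] * 10  # both
    + [{"bread"}] * 10           # neither (placeholder item, assumed)
)

lift_milk_cereal = lift({"milk"}, {"cereal"}, transactions)
print(round(lift_milk_cereal, 3))  # 0.417
```

Unlike confidence, lift is symmetric: swapping the antecedent and the consequent gives the same value, because the formula treats Support(A) and Support(B) alike.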