Market Basket Analysis

Association rules are a popular technique for data mining. The association rule algorithm was developed initially by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami at the IBM Almaden Research Center.^[61] It was originally designed as an efficient algorithm for finding interesting relationships in large databases of customer transactions. The algorithm finds sets of associations, items that are frequently associated with each other. For example, when analyzing supermarket data, you might find that consumers often purchase eggs and milk together. The algorithm was designed to run efficiently on large databases, especially databases that don’t fit into a computer’s memory.

R includes several algorithms implementing association rules. One of the most popular is the a priori algorithm. To try it in R, use the apriori function in the arules package:

library(arules)
apriori(data, parameter = NULL, appearance = NULL, control = NULL)

Here is a description of the arguments to apriori.

Argument	Description	Default
data	An object of class `transactions` (or a matrix or data frame that can be coerced into that form) in which associations are to be found.
parameter	An object of class `ASParameter` (or a list with named components) that is used to specify mining parameters. Parameters include support level, minimum rule length, maximum rule length, and types of rules (see the help file for `ASParameter` for more information).	`NULL`
appearance	An object of class `APappearance` (or a list with named components) that is used to specify restrictions on the associations found by the algorithm.	`NULL`
control	An object of class `APcontrol` (or a list with named components) that is used to control the performance of the algorithm.	`NULL`

The apriori implementation is well engineered and thought out: it makes ample use of S4 classes to define data types (including a transactions class for data and classes to control parameters), and prints useful information when it is run. However, it currently requires data sets to be loaded completely into memory.

As an example, we will look at a set of transactions from Audioscrobbler. Audioscrobbler was an online service that tracked the listening habits of users. The company is now part of Last.fm and still provides application programming interfaces (APIs) for looking at music preferences. However, in 2005, the company released a database of information on music preferences under a Creative Commons license. The database consists of a set of records showing how many times each user listened to different artists. For our purposes, we’ll ignore the count and just look at users and artists. For this example, I used a random sample of 20,000 user IDs from the database. Specifically, we will try to look for patterns in the artists that users listen to.

I loaded the data into R using the read.transactions function (in the arules package):

> library(arules)
> audioscrobbler <- read.transactions(
+   file="~/Documents/book/data/profiledata_06-May-2005/transactions.csv",
+   format="single",
+   sep=",",
+   cols=c(1,2))

You can find the data in the nutshell package:

> library(nutshell)
> data(audioscrobbler)

To find some results, I needed to change the default settings. I looked for associations at a 6.45% support level, which I specified through the parameter argument. (Why 6.45%? Because that returned exactly 10 rules on the test data, and 10 rules seemed like the right length for an example.)

> audioscrobbler.apriori <- apriori(
+   data=audioscrobbler,
+   parameter=new("APparameter",support=0.0645))

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen
        0.8    0.1    1 none FALSE            TRUE  0.0645      1
 maxlen target   ext
      5  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[429033 item(s), 20001 transaction(s)] done [2.36s].
sorting and recoding items ... [287 item(s)] done [0.16s].
creating transaction tree ... done [0.03s].
checking subsets of size 1 2 3 4 done [0.25s].
writing ... [10 rule(s)] done [0.00s].
creating S4 object  ... done [0.17s].

As you can see, the apriori function includes some information on what it is doing while running. After it finishes, you can inspect the returned object to learn more. The returned object consists of association rules (and is an object of class arules). Like most modeling algorithms, the object has an informative summary function that tells you about the rules:

> summary(audioscrobbler.apriori)
set of 10 rules

rule length distribution (lhs + rhs):sizes
 3
10

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      3       3       3       3       3       3

summary of quality measures:
    support          confidence          lift
 Min.   :0.06475   Min.   :0.8008   Min.   :2.613
 1st Qu.:0.06536   1st Qu.:0.8027   1st Qu.:2.619
 Median :0.06642   Median :0.8076   Median :2.651
 Mean   :0.06640   Mean   :0.8128   Mean   :2.696
 3rd Qu.:0.06707   3rd Qu.:0.8178   3rd Qu.:2.761
 Max.   :0.06870   Max.   :0.8399   Max.   :2.888

mining info:
           data ntransactions support confidence
 audioscrobbler         20001  0.0645        0.8

You can view the returned rules with the inspect function:

> inspect(audioscrobbler.apriori)
   lhs                        rhs            support confidence     lift
1  {Jimmy Eat World,
    blink-182}             => {Green Day} 0.06524674  0.8085502 2.780095
2  {The Strokes,
    Coldplay}              => {Radiohead} 0.06619669  0.8019382 2.616996
3  {Interpol,
    Beck}                  => {Radiohead} 0.06474676  0.8180670 2.669629
4  {Interpol,
    Coldplay}              => {Radiohead} 0.06774661  0.8008274 2.613371
5  {The Beatles,
    Interpol}              => {Radiohead} 0.06719664  0.8047904 2.626303
6  {The Offspring,
    blink-182}             => {Green Day} 0.06664667  0.8399496 2.888058
7  {Foo Fighters,
    blink-182}             => {Green Day} 0.06669667  0.8169014 2.808810
8  {Pixies,
    Beck}                  => {Radiohead} 0.06569672  0.8066298 2.632306
9  {The Smashing Pumpkins,
    Beck}                  => {Radiohead} 0.06869657  0.8287093 2.704359
10 {The Smashing Pumpkins,
    Pink Floyd}            => {Radiohead} 0.06514674  0.8018462 2.616695

The left-hand side of the rules (lhs) forms the predicate of the rule; the right-hand side (rhs) forms the conclusion. For example, consider rule 1. This rule means, “If the user has listened to Jimmy Eat World and Blink-182, then for 6.524675% of the time, he or she also listened to Green Day.” You can draw your own conclusions about whether these results mean anything, other than that Audioscrobbler’s users were fans of alternative and classic rock.

The arules package also includes an implementation of the Eclat algorithm, which finds frequent item sets. To find item sets using the Eclat algorithm, try the function eclat:

eclat(data, parameter = NULL, control = NULL

The eclat function accepts similar arguments as apriori (some of the parameters within the arguments are slightly different). I tightened up the support level for the eclat function in order to keep the number of results low. If you keep the default parameters, then the algorithm will return item sets with only one item, which is not very interesting. So I set the minimum length to 2 and the support level to 12.9%. Here is an example of running eclat on the Audioscrobbler data:

> audioscrobbler.eclat <- eclat(
+    data=audioscrobbler,
+    parameter=new("ECparameter", support=0.129, minlen=2))

parameter specification:
 tidLists support minlen maxlen            target   ext
    FALSE   0.129      2      5 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

eclat - find frequent item sets with the eclat algorithm
version 2.6 (2004.08.16)         (c) 2002-2004   Christian Borgelt
create itemset ...
set transactions ...[429033 item(s), 20001 transaction(s)] done [2.44s].
sorting and recoding items ... [74 item(s)] done [0.14s].
creating bit matrix ... [74 row(s), 20001 column(s)] done [0.01s].
writing  ... [10 set(s)] done [0.01s].
Creating S4 object  ... done [0.02s].

You can view information about the results with the summary function:

> summary(audioscrobbler.eclat)
set of 10 itemsets

most frequent items:
            Green Day             Radiohead Red Hot Chili Peppers
                    5                     5                     3
              Nirvana           The Beatles               (Other)
                    3                     2                     2

element (itemset/transaction) length distribution:sizes
 2
10

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2       2       2       2       2       2

summary of quality measures:
    support
 Min.   :0.1291
 1st Qu.:0.1303
 Median :0.1360
 Mean   :0.1382
 3rd Qu.:0.1394
 Max.   :0.1567

includes transaction ID lists: FALSE

mining info:
           data ntransactions support
 audioscrobbler         20001   0.129

You can also view the item sets with the inspect function:

> inspect(audioscrobbler.eclat)
   items                     support
1  {Red Hot Chili Peppers,
    Radiohead}             0.1290935
2  {Red Hot Chili Peppers,
    Green Day}             0.1397430
3  {Red Hot Chili Peppers,
    Nirvana}               0.1336433
4  {Nirvana,
    Radiohead}             0.1384931
5  {Green Day,
    Nirvana}               0.1382931
6  {Coldplay,
    Radiohead}             0.1538423
7  {Coldplay,
    Green Day}             0.1292435
8  {Green Day,
    Radiohead}             0.1335433
9  {The Beatles,
    Green Day}             0.1290935
10 {The Beatles,
    Radiohead}             0.1566922

As above, you can draw your own conclusions about whether the results are interesting.

^[61] You can read their paper here: http://rakesh.agrawal-family.com/papers/sigmod93assoc.pdf.