Association rules are a popular technique for data mining. The association rule algorithm was developed initially by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami at the IBM Almaden Research Center.[61] It was originally designed as an efficient algorithm for finding interesting relationships in large databases of customer transactions. The algorithm finds sets of associations, items that are frequently associated with each other. For example, when analyzing supermarket data, you might find that consumers often purchase eggs and milk together. The algorithm was designed to run efficiently on large databases, especially databases that don’t fit into a computer’s memory.
R includes several algorithms implementing association rules. One of
the most popular is the a priori algorithm. To try it in R, use the
apriori
function in the arules
package:
library(arules) apriori(data, parameter = NULL, appearance = NULL, control = NULL)
Here is a description of the arguments to apriori
.
Argument | Description | Default |
---|---|---|
data | An object of class transactions (or a matrix or data frame
that can be coerced into that form) in which associations are to
be found. | |
parameter | An object of class ASParameter (or a list with named
components) that is used to specify mining parameters. Parameters
include support level, minimum rule length, maximum rule length,
and types of rules (see the help file for ASParameter for more
information). | NULL |
appearance | An object of class APappearance (or a list with named
components) that is used to specify restrictions on the
associations found by the algorithm. | NULL |
control | An object of class APcontrol (or a list with named
components) that is used to control the performance of the
algorithm. | NULL |
The apriori
implementation is
well engineered and thought out: it makes ample use of S4 classes to
define data types (including a transactions
class for data and classes to
control parameters), and prints useful information when it is run.
However, it currently requires data sets to be loaded completely into
memory.
As an example, we will look at a set of transactions from Audioscrobbler. Audioscrobbler was an online service that tracked the listening habits of users. The company is now part of Last.fm and still provides application programming interfaces (APIs) for looking at music preferences. However, in 2005, the company released a database of information on music preferences under a Creative Commons license. The database consists of a set of records showing how many times each user listened to different artists. For our purposes, we’ll ignore the count and just look at users and artists. For this example, I used a random sample of 20,000 user IDs from the database. Specifically, we will try to look for patterns in the artists that users listen to.
I loaded the data into R using the read.transactions
function (in the arules
package):
> library(arules) > audioscrobbler <- read.transactions( + file="~/Documents/book/data/profiledata_06-May-2005/transactions.csv", + format="single", + sep=",", + cols=c(1,2))
You can find the data in the nutshell
package:
> library(nutshell) > data(audioscrobbler)
To find some results, I needed to change the default settings. I
looked for associations at a 6.45% support level, which I specified
through the parameter
argument. (Why
6.45%? Because that returned exactly 10 rules on the test data, and 10
rules seemed like the right length for an example.)
> audioscrobbler.apriori <- apriori( + data=audioscrobbler, + parameter=new("APparameter",support=0.0645)) parameter specification: confidence minval smax arem aval originalSupport support minlen 0.8 0.1 1 none FALSE TRUE 0.0645 1 maxlen target ext 5 rules FALSE algorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUE apriori - find association rules with the apriori algorithm version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt set item appearances ...[0 item(s)] done [0.00s]. set transactions ...[429033 item(s), 20001 transaction(s)] done [2.36s]. sorting and recoding items ... [287 item(s)] done [0.16s]. creating transaction tree ... done [0.03s]. checking subsets of size 1 2 3 4 done [0.25s]. writing ... [10 rule(s)] done [0.00s]. creating S4 object ... done [0.17s].
As you can see, the apriori
function includes some information on what it is doing while running.
After it finishes, you can inspect the returned object to learn more. The
returned object consists of association rules (and is an object of class
arules
). Like most modeling algorithms,
the object has an informative summary
function that tells you about the rules:
> summary(audioscrobbler.apriori)
set of 10 rules
rule length distribution (lhs + rhs):sizes
3
10
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 3 3 3 3 3
summary of quality measures:
support confidence lift
Min. :0.06475 Min. :0.8008 Min. :2.613
1st Qu.:0.06536 1st Qu.:0.8027 1st Qu.:2.619
Median :0.06642 Median :0.8076 Median :2.651
Mean :0.06640 Mean :0.8128 Mean :2.696
3rd Qu.:0.06707 3rd Qu.:0.8178 3rd Qu.:2.761
Max. :0.06870 Max. :0.8399 Max. :2.888
mining info:
data ntransactions support confidence
audioscrobbler 20001 0.0645 0.8
You can view the returned rules with the inspect
function:
> inspect(audioscrobbler.apriori)
lhs rhs support confidence lift
1 {Jimmy Eat World,
blink-182} => {Green Day} 0.06524674 0.8085502 2.780095
2 {The Strokes,
Coldplay} => {Radiohead} 0.06619669 0.8019382 2.616996
3 {Interpol,
Beck} => {Radiohead} 0.06474676 0.8180670 2.669629
4 {Interpol,
Coldplay} => {Radiohead} 0.06774661 0.8008274 2.613371
5 {The Beatles,
Interpol} => {Radiohead} 0.06719664 0.8047904 2.626303
6 {The Offspring,
blink-182} => {Green Day} 0.06664667 0.8399496 2.888058
7 {Foo Fighters,
blink-182} => {Green Day} 0.06669667 0.8169014 2.808810
8 {Pixies,
Beck} => {Radiohead} 0.06569672 0.8066298 2.632306
9 {The Smashing Pumpkins,
Beck} => {Radiohead} 0.06869657 0.8287093 2.704359
10 {The Smashing Pumpkins,
Pink Floyd} => {Radiohead} 0.06514674 0.8018462 2.616695
The left-hand side of the rules (lhs) forms the predicate of the rule; the right-hand side (rhs) forms the conclusion. For example, consider rule 1. This rule means, “If the user has listened to Jimmy Eat World and Blink-182, then for 6.524675% of the time, he or she also listened to Green Day.” You can draw your own conclusions about whether these results mean anything, other than that Audioscrobbler’s users were fans of alternative and classic rock.
The arules
package also includes
an implementation of the Eclat algorithm, which finds frequent item sets.
To find item sets using the Eclat algorithm, try the function eclat
:
eclat(data, parameter = NULL, control = NULL
The eclat
function accepts
similar arguments as apriori
(some of
the parameters within the arguments are slightly different). I tightened
up the support level for the eclat
function in order to keep the number of results low. If you keep the
default parameters, then the algorithm will return item sets with only one
item, which is not very interesting. So I set the minimum length to 2 and
the support level to 12.9%. Here is an example of running eclat
on the Audioscrobbler data:
> audioscrobbler.eclat <- eclat( + data=audioscrobbler, + parameter=new("ECparameter", support=0.129, minlen=2)) parameter specification: tidLists support minlen maxlen target ext FALSE 0.129 2 5 frequent itemsets FALSE algorithmic control: sparse sort verbose 7 -2 TRUE eclat - find frequent item sets with the eclat algorithm version 2.6 (2004.08.16) (c) 2002-2004 Christian Borgelt create itemset ... set transactions ...[429033 item(s), 20001 transaction(s)] done [2.44s]. sorting and recoding items ... [74 item(s)] done [0.14s]. creating bit matrix ... [74 row(s), 20001 column(s)] done [0.01s]. writing ... [10 set(s)] done [0.01s]. Creating S4 object ... done [0.02s].
You can view information about the results with the summary
function:
> summary(audioscrobbler.eclat)
set of 10 itemsets
most frequent items:
Green Day Radiohead Red Hot Chili Peppers
5 5 3
Nirvana The Beatles (Other)
3 2 2
element (itemset/transaction) length distribution:sizes
2
10
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 2 2 2 2 2
summary of quality measures:
support
Min. :0.1291
1st Qu.:0.1303
Median :0.1360
Mean :0.1382
3rd Qu.:0.1394
Max. :0.1567
includes transaction ID lists: FALSE
mining info:
data ntransactions support
audioscrobbler 20001 0.129
You can also view the item sets with the inspect
function:
> inspect(audioscrobbler.eclat)
items support
1 {Red Hot Chili Peppers,
Radiohead} 0.1290935
2 {Red Hot Chili Peppers,
Green Day} 0.1397430
3 {Red Hot Chili Peppers,
Nirvana} 0.1336433
4 {Nirvana,
Radiohead} 0.1384931
5 {Green Day,
Nirvana} 0.1382931
6 {Coldplay,
Radiohead} 0.1538423
7 {Coldplay,
Green Day} 0.1292435
8 {Green Day,
Radiohead} 0.1335433
9 {The Beatles,
Green Day} 0.1290935
10 {The Beatles,
Radiohead} 0.1566922
As above, you can draw your own conclusions about whether the results are interesting.