1. Introduction.

arules, an open-source package available from the Comprehensive R Archive Network (CRAN), is a powerful toolset for mining association rules in transactional databases. The most common use of the arules package
is market basket analysis in marketing and retail, though it has also been applied successfully to medical problems, crime prevention, and book recommendations.

In the broadest sense:

  • association rules are “IF … THEN …” rules (e.g. “IF” I buy an airplane ticket, “THEN” I might book a room at a hotel).
  • transactional databases are unevenly sized sequences of items or events occurring together (e.g. a list of purchases from a grocery or a checklist of patient conditions prior to hospital admission).

An example of a transactional database could be:

## [1] "citrus fruit,semi-finished bread,margarine,ready soups"             
## [2] "tropical fruit,yogurt,coffee"                                       
## [3] "whole milk"                                                         
## [4] "pip fruit,yogurt,cream cheese ,meat spreads"                        
## [5] "other vegetables,whole milk,condensed milk,long life bakery product"

Arules can help in identifying:

  • recurring purchasing patterns
  • complement and substitute products
  • trigger products, whose purchase leads to the purchase of another product with a high degree of certainty.

As such, arules might be helpful in:

  • planning inventory
  • designing “combo offers”
  • planning discount programs (discount one product, markup complements)
  • planning shelf and catalog layout
  • recommending products and cross-selling (especially higher-margin products).

This paper is not intended as a tutorial for arules (for that, please check the links to useful resources at the end). Its purpose is twofold:

  • show how arules can help in answering practical questions faced by retail analysts
  • show some helpful tips and tricks that improve quality and productivity of analysis.

2. Summary of transactional db and basic visualizations.

In this exercise we use one month's worth of transactions, distributed with the arules package under the name Groceries:

library(arules)
library(arulesViz)
data("Groceries")

Had the data not been readily available, we could have read the transactions from a file like this:

transaction.data <- read.transactions('/path/to/file', sep=',')

(Note that the usual read.csv() won't do, as it expects an equal number of data points per row.)
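
To make the point about rows of unequal length concrete, here is a small sketch that writes five example baskets (based on the ones shown in the introduction) to a temporary file and reads them back. The temporary file and the variable names are illustrative assumptions standing in for a real data file; count.fields() is base R, and read.transactions() is called with its basket format.

```r
library(arules)

# Five example baskets, one transaction per line.
baskets <- c(
  "citrus fruit,semi-finished bread,margarine,ready soups",
  "tropical fruit,yogurt,coffee",
  "whole milk",
  "pip fruit,yogurt,cream cheese,meat spreads",
  "other vegetables,whole milk,condensed milk,long life bakery product"
)

# Hypothetical temporary file standing in for '/path/to/file'.
tmp <- tempfile(fileext = ".csv")
writeLines(baskets, tmp)

# Rows have different numbers of fields, which is why a rectangular
# reader such as read.csv() is a poor fit here.
fields_per_row <- count.fields(tmp, sep = ",")
print(fields_per_row)

# read.transactions() accepts one basket per line, whatever its length.
toy <- read.transactions(tmp, format = "basket", sep = ",")
print(size(toy))  # number of items in each transaction
```

Each line becomes one transaction, and size() reports how many items each transaction holds.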

As soon as we have loaded Groceries into R's working environment,
we can have a look at the salient statistics:

summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda           yogurt          (Other) 
##             2513             1903             1809             1715             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14   14    9   11    4    6    1 
##   26   27   28   29   32 
##    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    4.41    6.00   32.00 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

In plain language, the above means:

  • there were 9'835 transactions altogether
  • 169 different items were bought during the month
  • the most frequently bought item was “whole milk”: 2'513 purchases
  • there were 2'159 single item baskets, the biggest basket included 32 items
  • median basket included 3 items; mean had 4.4 items.
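
The figures in the summary are tied together by simple arithmetic. The following sketch re-derives them from the printed length distribution alone; the variable names are illustrative, and the counts are copied from the output above.

```r
# Basket sizes and how many transactions have each size (from summary(Groceries)).
sizes  <- c(1:24, 26, 27, 28, 29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1,
            1, 1, 1, 3, 1)

n_transactions <- sum(counts)          # total number of baskets
n_items_sold   <- sum(sizes * counts)  # total number of item occurrences
n_distinct     <- 169                  # columns of the item matrix

mean_size   <- n_items_sold / n_transactions
median_size <- median(rep(sizes, counts))
# Density: share of non-empty cells in the transactions-by-items matrix.
density     <- n_items_sold / (n_transactions * n_distinct)

print(c(transactions = n_transactions, mean = round(mean_size, 2),
        median = median_size, density = round(density, 5)))
```

The density of 0.02609 is thus just the average basket size divided by the number of distinct items.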

Items traded at the shop may be shown like this:

itemLabels(Groceries)[1:10] # [1:10] can be dropped to show all items
##  [1] "frankfurter"       "sausage"           "liver loaf"        "ham"               "meat"             
##  [6] "finished products" "organic sausage"   "chicken"           "turkey"            "pork"

Top 10 most frequent items, both by frequency and absolute counts, may be visualized as follows:

par(mfrow=c(1,2))

itemFrequencyPlot(Groceries,
                  type="relative",
                  topN=10, # can be changed to the number of interest
                  horiz=TRUE,
                  col='steelblue3',
                  xlab='',
                  main='Item frequency, relative')

itemFrequencyPlot(Groceries,
                  type="absolute",
                  topN=10,
                  horiz=TRUE,
                  col='steelblue3',
                  xlab='',
                  main='Item frequency, absolute')

[Figure: the 10 most frequent items, by relative and absolute frequency]

Alternatively, we might show the least frequently bought items:

par(mar=c(2,10,2,2), mfrow=c(1,2))

barplot(sort(table(unlist(LIST(Groceries))))[1:10]/9835,
        horiz=TRUE,
        las=1,
        col='steelblue3',
        xlab='',
        main='Frequency, relative')

barplot(sort(table(unlist(LIST(Groceries))))[1:10],
        horiz=TRUE,
        las=1,
        col='steelblue3',
        xlab='',
        main='Frequency, absolute')

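[Figure: the 10 least frequently bought items, by relative and absolute frequency]

Baby products appear among the rarest purchases, which might tell us there are not many babies in the neighborhood.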

3. Simple contingency table

Sometimes it might be interesting to explore data as a simple contingency table.

tbl <- crossTable(Groceries)
tbl[1:5,1:5]
##             frankfurter sausage liver loaf ham meat
## frankfurter         580      99          7  25   32
## sausage              99     924         10  49   52
## liver loaf            7      10         50   3    0
## ham                  25      49          3 256    9
## meat                 32      52          0   9  254

By default, the table shows absolute counts, e.g.:

tbl['whole milk','whole milk']
## [1] 2513

shows the already familiar number of purchases of “whole milk”, 2513 (see the summary above). And:

tbl['whole milk','flour']
## [1] 83

shows the number of occasions on which these two items were purchased together.

If we add the argument sort=TRUE, the items are sorted by frequency of purchase (note the decreasing counts along the diagonal):

tbl <- crossTable(Groceries, sort=TRUE)
tbl[1:5,1:5]
##                  whole milk other vegetables rolls/buns soda yogurt
## whole milk             2513              736        557  394    551
## other vegetables        736             1903        419  322    427
## rolls/buns              557              419       1809  377    338
## soda                    394              322        377 1715    269
## yogurt                  551              427        338  269   1372

Now is a good time to formally introduce the arules vocabulary of measures that help us identify interesting, actionable rules:

  • count. The number of times a particular item, or itemset, is encountered in the transaction database. The count for “whole milk”, for example, is 2'513.

  • support. The support of an item, or of an itemset consisting of several items,
    is its frequency of occurrence: the count divided by the number of transactions. The support for “whole milk”, for example, is 2513/9835 = 0.2555. As a rule, but not necessarily, we want items/itemsets with high support, as high frequency makes it more likely that a potentially valuable finding (i) is not due to chance and (ii) can generate enough profit by being applied many times.

  • confidence {A} => {B}. Confidence is the probability of purchasing B given that A was purchased
    (that is, the conditional probability P(B|A)). When recommending a rule we prefer higher confidence.

    • Caveat 1: Confidence is not reported in the contingency table: the table is symmetric, whereas confidence depends on the direction of the rule.
    • Caveat 2: For substitute products, like “bottled beer”/“canned beer” or “tea”/“coffee”, we will observe low confidence together with lift less than 1.

  • lift. Lift is defined as $\frac{P(AB)}{P(A) \cdot P(B)}$ and shows how many times more often A and B occur together than would be expected if they were independent. Lift is defined for both itemsets and rules. In general, we prefer higher lift over lower lift.

  • chiSquared. Finally, in situations where lift is close to 1 but counts are large, or lift is meaningfully different from 1 but counts are low, we may need to turn to the chi-squared test to check that events A and B are statistically dependent (i.e. that we did not run into a spurious correlation).
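
As a worked illustration, the definitions can be applied to counts from the sorted contingency table above: 9835 transactions, 2513 containing “whole milk” (A), 1903 containing “other vegetables” (B), and 736 containing both. The numbers are taken from this post; the rounding is mine.

```latex
\begin{aligned}
P(A) &= \frac{2513}{9835} \approx 0.2555, \qquad
P(B) = \frac{1903}{9835} \approx 0.1935, \qquad
P(AB) = \frac{736}{9835} \approx 0.0748 \\[4pt]
\text{confidence}(A \Rightarrow B) &= P(B \mid A) = \frac{P(AB)}{P(A)} = \frac{736}{2513} \approx 0.2929 \\[4pt]
\text{lift}(A, B) &= \frac{P(AB)}{P(A) \cdot P(B)} = \frac{736 \cdot 9835}{2513 \cdot 1903} \approx 1.514
\end{aligned}
```

A lift of about 1.5 means the pair is bought together roughly one and a half times as often as independence would predict.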

Equipped with this knowledge, let's see which products tend to complement each other
(lift above 1, i.e. they are bought together more often than chance alone would suggest)
and which products tend to be substitutes:

crossTable(Groceries, measure='lift',sort=T)[1:5,1:5]
##                  whole milk other vegetables rolls/buns   soda yogurt
## whole milk               NA           1.5136      1.205 0.8991  1.572
## other vegetables     1.5136               NA      1.197 0.9703  1.608
## rolls/buns           1.2050           1.1970         NA 1.1951  1.339
## soda                 0.8991           0.9703      1.195     NA  1.124
## yogurt               1.5717           1.6085      1.339 1.1244     NA

It's interesting to see that “whole milk” goes well with all of these products except
“soda”. So, judged by lift, we are on the way to claiming that “soda” is a substitute for “whole milk” for some people: they tend to buy either one or the other, but buying them together is a relatively rare event.
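
The lift table can be re-derived from the count table alone. The sketch below rebuilds the 5×5 count matrix printed earlier and divides each joint support by the product of the two item supports. It uses only base R and is meant as an illustration of the formula, not as a description of how crossTable() works internally.

```r
items <- c("whole milk", "other vegetables", "rolls/buns", "soda", "yogurt")
# Pairwise co-occurrence counts; the diagonal holds single-item counts.
counts <- matrix(c(2513,  736,  557,  394,  551,
                    736, 1903,  419,  322,  427,
                    557,  419, 1809,  377,  338,
                    394,  322,  377, 1715,  269,
                    551,  427,  338,  269, 1372),
                 nrow = 5, byrow = TRUE, dimnames = list(items, items))
n <- 9835                          # number of transactions

support_pair <- counts / n         # joint support of every pair
support_item <- diag(counts) / n   # support of every single item

# lift(A, B) = P(AB) / (P(A) * P(B))
lift <- support_pair / outer(support_item, support_item)
diag(lift) <- NA                   # lift of an item with itself is not meaningful
print(round(lift, 4))
```

The values agree with the crossTable() output above, including the lift below 1 for “whole milk” and “soda”.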

To convince ourselves that a lift lower than 1 is not due to chance, let's apply the chi-squared test:

crossTable(Groceries, measure='chi')['whole milk', 'soda']
## [1] 0.0004535

Indeed, the low p-value makes it unlikely that the lift below 1 is due to chance.

To summarize this section, crossTable() function allows showing and sorting items and pairs of items by:

  • count
  • support
  • lift
  • chiSquared

measures, thus identifying the most promising two-member itemsets as complement or substitute candidates.

4. Apriori: Search for frequent itemsets

The apriori function is the workhorse of the arules package, with enough flexibility
to accommodate almost any practical need of a retail analyst. apriori allows for:

  • mining both frequent itemsets and rules
  • that satisfy prespecified item length, support, confidence and lift
  • that might only construct itemsets out of prespecified items
  • that might only include prespecified items in lhs (left-hand-side of a purchase rule) or rhs (right-hand-side of a purchase rule).

Let's start by mining the most frequent itemsets with a minimum length of 2
and a frequency of occurrence of at least 1 in 1000, i.e. support=.001:

itemsets <- apriori(Groceries,
                    parameter = list(support=.001,
                                     minlen=2,
                                     target='frequent' # to mine for itemsets
                                     ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
##          NA    0.1    1 none FALSE            TRUE   0.001      2     10 frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [13335 set(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
summary(itemsets)
## set of 13335 itemsets
## 
## most frequent items:
##       whole milk other vegetables           yogurt  root vegetables   tropical fruit          (Other) 
##             3764             3341             2401             1958             1796            27683 
## 
## element (itemset/transaction) length distribution:sizes
##    2    3    4    5    6 
## 2981 6831 3137  376   10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    3.00    3.07    4.00    6.00 
## 
## summary of quality measures:
##     support       
##  Min.   :0.00102  
##  1st Qu.:0.00112  
##  Median :0.00142  
##  Mean   :0.00226  
##  3rd Qu.:0.00224  
##  Max.   :0.07483  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001          1

We see that, given the prespecified support, 13'335 itemsets of length up to 6 were built out of 157 items (12 items were dropped because they were too rare).
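
It helps to translate the relative threshold into counts. The arithmetic below uses only numbers from the output above and shows how many baskets an itemset must appear in to reach support=.001, and how many items were too rare to take part. The variable names are illustrative.

```r
n_transactions <- 9835
min_support    <- 0.001

# Smallest whole number of baskets whose share reaches the threshold.
min_count <- ceiling(min_support * n_transactions)
# Support of an itemset that appears in exactly that many baskets;
# compare with the minimum support reported by summary(itemsets).
lowest_possible_support <- min_count / n_transactions

# Items that did not reach the threshold on their own were dropped.
items_dropped <- 169 - 157

print(c(min_count = min_count,
        lowest_possible_support = round(lowest_possible_support, 5),
        items_dropped = items_dropped))
```

Since 0.001 × 9835 = 9.835, the log line reports an absolute minimum support count of 9, while the smallest support actually observed in the summary (0.00102) corresponds to 10 baskets.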

Support for itemsets is calculated by default, so we can sort by it and print out top itemsets:

inspect(sort(itemsets, by='support', decreasing = T)[1:5])
##      items                              support
## 2981 {other vegetables,whole milk}      0.07483
## 2980 {whole milk,rolls/buns}            0.05663
## 2978 {whole milk,yogurt}                0.05602
## 2971 {root vegetables,whole milk}       0.04891
## 2970 {root vegetables,other vegetables} 0.04738

A couple of words on the syntax of sort and inspect:

  • sort(), as implied by the name, sorts itemsets and rules by the measure specified in
    the by=... argument. Usually one sorts by support, lift, confidence, chiSquared, or any other
    measure that can be calculated with the interestMeasure() function.
  • inspect() prints out the rules or itemsets of interest.

Should we want to add lift and show top 5 results by lift, we may proceed
as follows:

quality(itemsets)$lift <- interestMeasure(itemsets, measure='lift', Groceries)
inspect(sort(itemsets, by ='lift', decreasing = T)[1:5])
##       items                                                                      support  lift 
## 13326 {tropical fruit,root vegetables,other vegetables,whole milk,yogurt,oil}    0.001017 459.3
## 13328 {tropical fruit,other vegetables,whole milk,butter,yogurt,domestic eggs}   0.001017 399.6
## 13329 {tropical fruit,root vegetables,other vegetables,whole milk,butter,yogurt} 0.001118 255.9
## 12984 {other vegetables,curd,yogurt,whipped/sour cream,cream cheese }            0.001017 248.7
## 12950 {root vegetables,other vegetables,whole milk,yogurt,rice}                  0.001322 230.6

The figures above explain very well why fruit and vegetable departments are often
found next to dairy departments.

Out of curiosity, we can repeat the exercise of building itemsets of length 2 that
we did with crossTable, but this time with apriori function:

itemsets <- apriori(Groceries,
                    parameter = list(support=.001,
                                     minlen=2,
                                     maxlen=2,
                                     target='frequent' # to mine for itemsets
                                     ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
##          NA    0.1    1 none FALSE            TRUE   0.001      2      2 frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [2981 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
quality(itemsets)$lift <- interestMeasure(itemsets, measure='lift', Groceries)
inspect(sort(itemsets, by ='lift', decreasing = T)[1:10])
##      items                                  support  lift  
## 592  {mayonnaise,mustard}                   0.001423 12.965
## 288  {hamburger meat,Instant food products} 0.003050 11.421
## 93   {detergent,softener}                   0.001118 10.600
## 139  {liquor,red/blush wine}                0.002135 10.025
## 1408 {flour,sugar}                          0.004982  8.463
## 210  {salty snack,popcorn}                  0.002237  8.192
## 1113 {ham,processed cheese}                 0.003050  7.071
## 101  {hamburger meat,sauces}                0.001220  6.684
## 32   {cream cheese ,meat spreads}           0.001118  6.605
## 404  {detergent,house keeping products}     0.001017  6.346

After seeing a lift of 12.97 for {mayonnaise,mustard}, do not be surprised if you find mustard next to mayonnaise on the shop shelf!

5. Apriori: Search for rules

When we switch to searching for rules, we need to change target='frequent itemsets' to target='rules'. We should also specify confidence (unless we are satisfied with the default confidence=0.8):

rules <- apriori(Groceries,
                 parameter = list(support=.001,
                                  confidence=.5,
                                  minlen=2,
                                  target='rules' # to mine for rules
                                  ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##         0.5    0.1    1 none FALSE            TRUE   0.001      2     10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
summary(rules)
## set of 5668 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##   11 1461 3211  939   46 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    4.00    3.92    4.00    6.00 
## 
## summary of quality measures:
##     support          confidence         lift      
##  Min.   :0.00102   Min.   :0.500   Min.   : 1.96  
##  1st Qu.:0.00112   1st Qu.:0.545   1st Qu.: 2.46  
##  Median :0.00132   Median :0.600   Median : 2.90  
##  Mean   :0.00167   Mean   :0.625   Mean   : 3.26  
##  3rd Qu.:0.00173   3rd Qu.:0.684   3rd Qu.: 3.69  
##  Max.   :0.02227   Max.   :1.000   Max.   :19.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.5

This generated 5'668 rules of length 2 to 6. The summary statistics
for support, confidence, and lift are self-explanatory.

We can sort rules by any of those measures:

inspect(sort(rules, by='lift', decreasing = T)[1:5])
##     lhs                                   rhs              support  confidence lift 
## 53  {Instant food products,soda}       => {hamburger meat} 0.001220 0.6316     19.00
## 37  {soda,popcorn}                     => {salty snack}    0.001220 0.6316     16.70
## 444 {flour,baking powder}              => {sugar}          0.001017 0.5556     16.41
## 327 {ham,processed cheese}             => {white bread}    0.001932 0.6333     15.05
## 55  {whole milk,Instant food products} => {hamburger meat} 0.001525 0.5000     15.04

A stand with soda, popcorn, and salty snacks, with a lift of 16.7, looks like a very promising idea
for a department selling DVDs.

If we suspect a spurious correlation, the chi-squared test comes to the rescue:

quality(rules)$chi <- interestMeasure(rules, measure='chi', significance=T, Groceries)
inspect(sort(rules, by='lift', decreasing = T)[1:5])
##     lhs                                   rhs              support  confidence lift  chi      
## 53  {Instant food products,soda}       => {hamburger meat} 0.001220 0.6316     19.00 4.967e-48
## 37  {soda,popcorn}                     => {salty snack}    0.001220 0.6316     16.70 5.279e-42
## 444 {flour,baking powder}              => {sugar}          0.001017 0.5556     16.41 1.703e-34
## 327 {ham,processed cheese}             => {white bread}    0.001932 0.6333     15.05 1.109e-58
## 55  {whole milk,Instant food products} => {hamburger meat} 0.001525 0.5000     15.04 2.865e-46

Indeed, the tiny p-values let us reject the null hypothesis H_0 that lhs and rhs are independent.

6. Subsetting rules and itemsets

Let's consider a set of “arbitrary” rules generated with apriori:

rules <- apriori(Groceries,
                 parameter = list(support=.001,
                                  confidence=.7,
                                  maxlen=5,
                                  target='rules' # to mine for rules
                                  ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##         0.7    0.1    1 none FALSE            TRUE   0.001      1      5  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [1255 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
inspect(sort(rules, by="confidence", decreasing = T)[1:5])
##     lhs                                           rhs          support  confidence lift 
## 25  {rice,sugar}                               => {whole milk} 0.001220 1          3.914
## 52  {canned fish,hygiene articles}             => {whole milk} 0.001118 1          3.914
## 147 {root vegetables,butter,rice}              => {whole milk} 0.001017 1          3.914
## 205 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001729 1          3.914
## 213 {butter,soft cheese,domestic eggs}         => {whole milk} 0.001017 1          3.914

The power of the subset function can be shown by choosing the following subset:

  • rhs should be 'bottled beer'
  • confidence should be above .7
  • results should be sorted by lift

inspect(sort(subset(rules,
                    subset=rhs %in% 'bottled beer' & confidence > .7),
                    by = 'lift',
                    decreasing = T))
##   lhs                        rhs            support  confidence lift 
## 2 {liquor,red/blush wine} => {bottled beer} 0.001932 0.9048     11.24
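
subset() filters rules after they have been mined. As an alternative sketch, the restriction on rhs can be imposed at mining time through the appearance argument of apriori(), which corresponds to the lhs/rhs restrictions listed in section 4. The object name beer_rules is illustrative, and the verbose log is switched off for brevity.

```r
library(arules)
data("Groceries")

# Mine only rules whose right-hand side is 'bottled beer';
# every other item is allowed on the left-hand side only.
beer_rules <- apriori(Groceries,
                      parameter  = list(support = .001,
                                        confidence = .7,
                                        maxlen = 5,
                                        target = 'rules'),
                      appearance = list(rhs = 'bottled beer',
                                        default = 'lhs'),
                      control    = list(verbose = FALSE))

inspect(sort(beer_rules, by = 'lift', decreasing = TRUE))
```

Restricting the search up front avoids generating rules that would be discarded anyway, which can matter when the confidence threshold is low and the full rule set is large.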

By now we know that people buying “liquor” and “red/blush wine” are
almost certain to buy “bottled beer” (9 times out of 10), but not “canned beer”:

canned_rules <- apriori(Groceries,
                        parameter = list(support=.001,
                                         confidence=.01,
                                         maxlen=5,
                                         target='rules' # to mine for rules
                                  ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##        0.01    0.1    1 none FALSE            TRUE   0.001      1      5  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [40827 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
inspect(subset(canned_rules,
                subset=lhs %ain% c("liquor", "red/blush wine") & rhs %in% 'canned beer' ))

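Zilch! No such rule is found: either fewer than 1 in 100 of these shoppers also buy canned beer, or the combination is too rare to reach the support threshold.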

Important arguments of subset:

  • lhs: left-hand side, or antecedent
  • rhs: right-hand side, or consequent
  • items: items that make up itemsets
  • %in%: matches any
  • %ain%: matches all
  • %pin%: matches partially
  • default: no restrictions applied
  • &: additional restrictions on lift, confidence, etc.
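
The difference between the three matching operators is easiest to see on a toy example. The helper functions below are hypothetical base R stand-ins that mimic the semantics described above on plain character vectors; they are not the arules implementation.

```r
# Three toy itemsets represented as character vectors.
itemsets <- list(
  c("whole milk", "yogurt", "white bread"),
  c("whole milk", "brown bread"),
  c("yogurt", "soda")
)

# Hypothetical helpers mimicking %in%, %ain% and %pin% on such a list.
match_any <- function(sets, wanted) {
  sapply(sets, function(s) any(wanted %in% s))
}
match_all <- function(sets, wanted) {
  sapply(sets, function(s) all(wanted %in% s))
}
match_partial <- function(sets, pattern) {
  sapply(sets, function(s) any(grepl(pattern, s, fixed = TRUE)))
}

wanted <- c("whole milk", "yogurt")
print(match_any(itemsets, wanted))       # at least one of the wanted items
print(match_all(itemsets, wanted))       # every wanted item
print(match_partial(itemsets, "bread"))  # any item whose label contains "bread"
```

Only the first itemset contains both wanted items, while partial matching picks up “white bread” and “brown bread” alike.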

Example 1

Both “whole milk” and “yogurt” must be present and rule's confidence must be higher than .95

inspect(subset(rules, subset=items %ain% c("whole milk","yogurt") & confidence >.95))
##     lhs                                              rhs                support  confidence lift 
## 915 {tropical fruit,grapes,whole milk,yogurt}     => {other vegetables} 0.001017 1          5.168
## 942 {tropical fruit,root vegetables,yogurt,oil}   => {whole milk}       0.001118 1          3.914
## 952 {root vegetables,other vegetables,yogurt,oil} => {whole milk}       0.001423 1          3.914

Example 2

Both “whole milk” and “yogurt” must be present in lhs and rule's confidence must be higher than .9

inspect(subset(rules, subset=lhs %ain% c("whole milk","yogurt") & confidence >.9))
##     lhs                                          rhs                support  confidence lift 
## 901 {root vegetables,whole milk,yogurt,rice}  => {other vegetables} 0.001322 0.9286     4.799
## 915 {tropical fruit,grapes,whole milk,yogurt} => {other vegetables} 0.001017 1.0000     5.168
## 953 {root vegetables,whole milk,yogurt,oil}   => {other vegetables} 0.001423 0.9333     4.824

Example 3

“Bread” must be present in lhs: any type of “bread” – “white bread”, “brown bread” – both qualify. “Whole milk” must be present in rhs “as is”. confidence of the rule must be higher than .9

inspect(subset(rules, subset= lhs %pin% "bread" & rhs %in% "whole milk" & confidence > .9))
##      lhs                                                          rhs          support  confidence lift 
## 611  {root vegetables,butter,white bread}                      => {whole milk} 0.001118 0.9167     3.588
## 997  {root vegetables,other vegetables,butter,white bread}     => {whole milk} 0.001017 1.0000     3.914
## 1088 {pip fruit,root vegetables,other vegetables,brown bread}  => {whole milk} 0.001220 0.9231     3.613
## 1095 {root vegetables,other vegetables,rolls/buns,brown bread} => {whole milk} 0.001017 0.9091     3.558

It appears from this limited sample that people buying “whole milk” have no particular preference
for either “white bread” or “brown bread” (perhaps the contingency table from crossTable would show a different result).

Example 4

Let's see what we can expect at rhs with confidence higher than .7 if we have
both “flour” and “whole milk” on the lhs.

inspect(subset(rules, subset= lhs %ain% c("flour","whole milk") & confidence>.7))
##     lhs                                rhs                support confidence lift 
## 208 {citrus fruit,whole milk,flour} => {other vegetables} 0.00122 0.75       3.876

Example 5

Finally, let's go fishing for substitute products. So far we have been concerned with complements, i.e. items and itemsets that showed high lift. In other words, they were bought together much more often than would be warranted by sheer chance.

Let's consider the case of “bottled beer” vs. “canned beer” and check whether people tend to buy either one or the other, and rarely both, which would qualify the two as substitute products.

rules <- apriori(Groceries,
                 parameter = list(support=.001,
                                  confidence=.01,
                                  minlen=2,
                                  maxlen=2,
                                  target='rules' # to mine for rules
                                  ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##        0.01    0.1    1 none FALSE            TRUE   0.001      2      2  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [5818 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Let's only look at the rules where “beer” is present at both left- and right-hand-side of the
rule and add chiSquared p-value to prove statistical significance of our findings:

quality(rules)$chi  <- interestMeasure(rules, measure='chi', significance=T, Groceries)
inspect(subset(rules, lhs %pin% 'beer' & rhs %pin% 'beer'))
##      lhs               rhs            support  confidence lift   chi      
## 4785 {canned beer}  => {bottled beer} 0.002644 0.03403    0.4226 8.743e-07
## 4786 {bottled beer} => {canned beer}  0.002644 0.03283    0.4226 8.743e-07

The results so far are quite telling:

  • there are indeed people who buy both bottled and canned beer at once:

crossTable(Groceries)['canned beer','bottled beer']
## [1] 26

  • the probability of buying one given that the other was bought (confidence) is pretty small: ~3%
  • this is despite both bottled beer and canned beer being pretty popular purchases:

crossTable(Groceries)['canned beer','canned beer']
## [1] 764
crossTable(Groceries)['bottled beer','bottled beer']
## [1] 792

All these figures, combined with a statistically significant lift below 1 (chi-squared p-value ~ 1e-6), tell us that “bottled beer” and “canned beer” do behave as substitutes.
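
The quality measures reported for the two beer rules can be reproduced from the three counts alone, which is a useful sanity check. The sketch uses only numbers printed above; the variable names are illustrative.

```r
n_transactions <- 9835
n_canned  <- 764   # baskets containing canned beer
n_bottled <- 792   # baskets containing bottled beer
n_both    <- 26    # baskets containing both

support_both      <- n_both / n_transactions
conf_canned_to_b  <- n_both / n_canned    # {canned beer} => {bottled beer}
conf_bottled_to_c <- n_both / n_bottled   # {bottled beer} => {canned beer}
lift <- support_both /
  ((n_canned / n_transactions) * (n_bottled / n_transactions))

# Number of joint purchases expected if the two were independent.
expected_both <- n_canned * n_bottled / n_transactions

print(round(c(support = support_both, conf_cb = conf_canned_to_b,
              conf_bc = conf_bottled_to_c, lift = lift,
              expected_both = expected_both), 4))
```

Under independence we would expect roughly 61.5 joint purchases, against the 26 actually observed, which is what a lift of about 0.42 expresses.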

7. Visually mining rules.

Both rules and itemsets can be visualized with the help of the arulesViz library.
The power of the plot() function from arulesViz comes from its
interactive argument. Remember that, as a general rule, we want rules with both
high support and high confidence.

plot(rules, interactive = T)

[Figure: interactive plot for selecting rules]

With the help of this function we can visually mine rules by:

1. Selecting a rectangular area by clicking twice on the plot.

2. Then clicking inspect.

This prints the rules found in that region, e.g.:

    lhs                                      rhs          support  confidence lift  
717 {other vegetables,curd,domestic eggs} => {whole milk} 0.002847 0.8235     3.223 

Summary

The arules package by Michael Hahsler is a very powerful toolset for mining
transactional databases. In the context of market basket analysis it allows identifying items and itemsets that tend to be bought together, and thus facilitates mining association rules of the “IF … THEN …” kind. Apart from specifying the confidence and support of the rules (basically conditional probability and frequency), it provides facilities for calculating other interest measures, such as chiSquared, that help filter out spurious effects. In addition, the complementary package arulesViz provides facilities for mining and representing association rules visually.

© 2014 In R we trust.