1. Introduction.
arules, an open-source package available from the Comprehensive R Archive Network (CRAN), is a powerful toolset for mining association rules in transactional databases. The most common use of the arules package is market basket analysis in marketing and retail, though it has also been applied successfully to medical problems, crime prevention, and book recommendations.
In the broadest sense:
- association rules are “IF … THEN …” rules (e.g. “IF” I bought an airplane ticket, “THEN” I might book a room at a hotel).
- transactional databases are collections of variable-length sets of items or events occurring together (e.g. a list of purchases from a grocery or a checklist of patient conditions prior to hospital admission).
An example of a transactional database could be:
## [1] "citrus fruit,semi-finished bread,margarine,ready soups"
## [2] "tropical fruit,yogurt,coffee"
## [3] "whole milk"
## [4] "pip fruit,yogurt,cream cheese ,meat spreads"
## [5] "other vegetables,whole milk,condensed milk,long life bakery product"
arules can help in identifying:
- recurring purchasing patterns
- complement and substitute products
- trigger products, the purchase of which may lead to the purchase of another product with a high degree of certainty.
As such, arules might be helpful in:
- planning inventory
- designing “combo offers”
- planning discount programs (discount one product, markup complements)
- planning shelf and catalog layout
- recommending products and cross-selling (especially higher-margin products).
This paper is not intended as a tutorial for arules (for this please check the links to useful resources at the end). The purpose of this paper is twofold:
- show how arules can help in answering practical questions faced by retail analysts
- show some helpful tips and tricks that improve quality and productivity of analysis.
2. Summary of transactional db and basic visualizations.
In this exercise we use one month's worth of transactions, distributed with the arules package under the name Groceries:
library(arules)
library(arulesViz)
data("Groceries")
Had the data not been readily available, we could have read transactions from a file like this:
transaction.data <- read.transactions('/path/to/file', sep=',')
(Note that the usual read.csv() won't do, as it expects an equal number of data points per row.)
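To see why basket data does not fit a rectangular reader, here is a minimal base-R sketch (no arules involved) that parses three of the example baskets from the introduction. The variable names `raw_lines`, `baskets` and `basket_sizes` are illustrative assumptions, not part of the package.

```r
# Three of the example baskets from the introduction, one transaction per line.
raw_lines <- c(
  "citrus fruit,semi-finished bread,margarine,ready soups",
  "tropical fruit,yogurt,coffee",
  "whole milk"
)

# Split each line on commas; the result is a list because basket sizes differ.
baskets <- strsplit(raw_lines, ",", fixed = TRUE)

# Number of items per basket: the rows do not share a common length,
# which is what a rectangular reader such as read.csv() assumes.
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

A list of character vectors like `baskets` is the natural shape of transactional data, which is why a dedicated reader is used instead of a data-frame reader.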
As soon as we have loaded Groceries into R's working environment, we can have a look at the salient statistics:
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
##             2513             1903             1809             1715             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14   14    9   11    4    6    1
##   26   27   28   29   32
##    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    2.00    3.00    4.41    6.00   32.00
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
In plain language, the above means:
- there were 9'835 transactions altogether
- 169 different items were bought during the month
- the most frequently bought item was “whole milk”: 2'513 purchases
- there were 2'159 single-item baskets; the biggest basket included 32 items
- the median basket contained 3 items; the mean basket size was 4.4 items.
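The density and mean basket size reported by summary() can be re-derived from the other figures in the same output. The following base-R sketch does the arithmetic; the variable names are illustrative, and it assumes the "(Other)" column is the total count of all remaining items.

```r
# Figures copied from the summary(Groceries) output above.
n_transactions  <- 9835
n_items         <- 169
top_item_counts <- c(2513, 1903, 1809, 1715, 1372)  # five most frequent items
other_count     <- 34055                             # the "(Other)" column

# Total number of item occurrences across all baskets.
total_occurrences <- sum(top_item_counts) + other_count

# Density: share of non-empty cells in the transactions-by-items matrix.
density <- total_occurrences / (n_transactions * n_items)

# Mean basket size: item occurrences per transaction.
mean_basket <- total_occurrences / n_transactions

print(round(density, 5))
print(round(mean_basket, 2))
```

Both rounded values agree with the density and the mean reported in the summary, which is a quick sanity check that the sparse matrix is being read as intended.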
Items traded at the shop may be shown like this:
itemLabels(Groceries)[1:10] # [1:10] can be dropped to show all items
##  [1] "frankfurter"       "sausage"           "liver loaf"        "ham"               "meat"
##  [6] "finished products" "organic sausage"   "chicken"           "turkey"            "pork"
The top 10 most frequent items, both by relative frequency and by absolute count, may be visualized as follows:
par(mfrow=c(1,2))
itemFrequencyPlot(Groceries,
                  type="relative",
                  topN=10, # can be changed to the number of interest
                  horiz=TRUE,
                  col='steelblue3',
                  xlab='',
                  main='Item frequency, relative')
itemFrequencyPlot(Groceries,
                  type="absolute",
                  topN=10,
                  horiz=TRUE,
                  col='steelblue3',
                  xlab='',
                  main='Item frequency, absolute')
Alternatively, we might show least frequently bought items:
par(mar=c(2,10,2,2), mfrow=c(1,2))
# item counts in ascending order, so the rarest items come first
item.counts <- sort(table(unlist(LIST(Groceries))))
barplot(item.counts[1:10]/length(Groceries), # divide by the number of transactions
        horiz=TRUE,
        las=1,
        col='steelblue3',
        xlab='',
        main='Frequency, relative')
barplot(item.counts[1:10],
        horiz=TRUE,
        las=1,
        col='steelblue3',
        xlab='',
        main='Frequency, absolute')
The rarest items include baby products, which might tell us there are not many babies in the neighborhood.
3. Simple contingency table
Sometimes it might be interesting to explore data as a simple contingency table.
tbl <- crossTable(Groceries)
tbl[1:5,1:5]
##             frankfurter sausage liver loaf ham meat
## frankfurter         580      99          7  25   32
## sausage              99     924         10  49   52
## liver loaf            7      10         50   3    0
## ham                  25      49          3 256    9
## meat                 32      52          0   9  254
By default, the table shows absolute counts, e.g.:
tbl['whole milk','whole milk']
## [1] 2513
shows the already familiar count of 2513 purchases of “whole milk” (see the summary above). And:
tbl['whole milk','flour']
## [1] 83
shows the number of occasions when these two items were purchased together.
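The structure of such a table, with item counts on the diagonal and co-occurrence counts off the diagonal, can be illustrated in base R on a toy example. The baskets below are hypothetical and not taken from Groceries; this is a sketch of the idea, not of how crossTable() is implemented.

```r
# Toy baskets (hypothetical data, not taken from Groceries).
baskets <- list(
  c("milk", "bread"),
  c("milk", "soda"),
  c("milk", "bread", "butter"),
  c("soda")
)

items <- sort(unique(unlist(baskets)))

# Binary incidence matrix: one row per basket, one column per item.
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items

# The cross-product counts co-occurrences: diagonal = item counts,
# off-diagonal = number of baskets containing both items.
co_occurrence <- crossprod(incidence)
print(co_occurrence)
```

Reading the toy result the same way as the Groceries table: the diagonal entry for “milk” is the number of baskets containing milk, and the “milk”/“bread” entry is the number of baskets containing both.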
If we add the argument sort=TRUE, we get items sorted by frequency of purchase (note the decreasing counts diagonal-, row-, and column-wise):
tbl <- crossTable(Groceries, sort=TRUE)
tbl[1:5,1:5]
##                  whole milk other vegetables rolls/buns soda yogurt
## whole milk             2513              736        557  394    551
## other vegetables        736             1903        419  322    427
## rolls/buns              557              419       1809  377    338
## soda                    394              322        377 1715    269
## yogurt                  551              427        338  269   1372
Now is a good time to formally introduce the arules vocabulary of useful measures, which will help us identify interesting, actionable rules:
- count. The number of times a particular item, or itemset, is encountered in the transactions database. Count for “whole milk”, e.g., is 2'513.
- support. Support of an item, or of an itemset consisting of several items, is its frequency of occurrence. Support (or frequency) is obtained as count divided by the number of transactions. Support for “whole milk”, e.g., is 2513/9835 = 0.2555. As a rule, but not necessarily, we want items/itemsets with high support, as high frequency makes it more likely that a potentially valuable finding (i) is not due to chance and (ii) can generate enough profit by being exploited many times.
- confidence {A} => {B}. Confidence is the probability of purchasing B given that purchase A happened (in plain language, this is a conditional probability). For recommending a good rule we prefer higher confidence.
  - Caveat 1: Confidence is not defined for the contingency table, as we are not considering transactions per se here.
  - Caveat 2: For substitute products, like “bottled beer”/“canned beer” or “tea”/“coffee”, we will observe low confidence together with lift less than 1.
- lift. Lift is defined as $\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)} = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}$ and shows how much more often the rule in question occurs than it would if A and B were bought together purely by chance. Lift is defined both for itemsets and rules. In general, we prefer higher lift over lower lift.
- chiSquared. Finally, in situations where lift is close to 1 but counts are large, or lift is meaningfully different from 1 but counts are low, we may need to turn to the statistical chiSquared test to check that events A and B are statistically dependent (i.e. that we did not run into a spurious correlation).
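The definitions above can be made concrete with a few lines of base R on a toy example. The baskets are hypothetical, and the helper names `count_of`, `support_of`, `confidence_of` and `lift_of` are made up for illustration; they are not arules functions.

```r
# Toy baskets (hypothetical data, not taken from Groceries).
baskets <- list(
  c("milk", "bread"),
  c("milk", "soda"),
  c("milk", "bread", "butter"),
  c("soda"),
  c("bread", "butter")
)
n <- length(baskets)

# count: number of baskets containing every item of the itemset.
count_of <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# support: count divided by the number of transactions.
support_of <- function(itemset) count_of(itemset) / n

# confidence {A} => {B}: P(B | A) = support(A and B) / support(A).
confidence_of <- function(a, b) support_of(c(a, b)) / support_of(a)

# lift {A} => {B}: observed joint support over the joint support
# expected if A and B were independent.
lift_of <- function(a, b) support_of(c(a, b)) / (support_of(a) * support_of(b))

print(support_of("milk"))
print(confidence_of("bread", "butter"))
print(lift_of("bread", "butter"))
```

In this toy data “bread” and “butter” co-occur more often than independence would predict, so their lift is above 1, which is the signature of complements.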
Equipped with this knowledge, let's see which products tend to complement each other with high lift (i.e. the purchase of one product makes the purchase of another more likely than chance would suggest) and which products tend to be substitutes:
crossTable(Groceries, measure='lift',sort=T)[1:5,1:5]
##                  whole milk other vegetables rolls/buns   soda yogurt
## whole milk               NA           1.5136      1.205 0.8991  1.572
## other vegetables     1.5136               NA      1.197 0.9703  1.608
## rolls/buns           1.2050           1.1970         NA 1.1951  1.339
## soda                 0.8991           0.9703      1.195     NA  1.124
## yogurt               1.5717           1.6085      1.339 1.1244     NA
It's interesting to see that “whole milk” goes well with all of these products except “soda”. So, judged by lift, we are on the way to claiming that “soda” is a substitute for “whole milk” for some people: they tend to buy either one or the other, and buying them together is a relatively rare event.
To convince ourselves that a lift lower than 1 is not due to chance, let's apply the chiSquared test:
crossTable(Groceries, measure='chi')['whole milk', 'soda']
## [1] 0.0004535
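One caveat on reading this number: it does not appear to be a p-value. It matches (p − e)²/e, where p = 394/9835 is the observed joint support of the two items and e = 0.2555 × 0.1744 is the joint support expected under independence, so it is a scaled chi-squared statistic rather than a significance level. A Pearson chi-squared test on the underlying 2×2 table of counts gives a p-value below 0.01, so the conclusion still holds: a lift less than 1 is unlikely to be due to chance.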
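As a cross-check in base R, the lift for “whole milk” and “soda” can be recomputed from the counts shown in the sorted contingency table, and a Pearson chi-squared test can be run on the corresponding 2×2 table. This is a sketch that assumes a standard test of independence without continuity correction is an acceptable stand-in for the package's own computation; the variable names are illustrative.

```r
# Counts taken from the sorted crossTable() output above.
n    <- 9835   # transactions
milk <- 2513   # baskets with "whole milk"
soda <- 1715   # baskets with "soda"
both <- 394    # baskets with both items

# Lift: observed joint support over the support expected under independence.
lift <- (both / n) / ((milk / n) * (soda / n))

# 2x2 contingency table: milk yes/no in rows, soda yes/no in columns.
tab <- matrix(c(both,        milk - both,
                soda - both, n - milk - soda + both),
              nrow = 2, byrow = TRUE,
              dimnames = list(milk = c("yes", "no"), soda = c("yes", "no")))

# Pearson chi-squared test of independence (no continuity correction).
test <- chisq.test(tab, correct = FALSE)

print(round(lift, 4))
print(test$p.value)
```

The recomputed lift agrees with the value in the lift table, and the p-value of the test is the quantity to compare against a significance level.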
To summarize this section, the crossTable() function allows showing and sorting items and pairs of items by the following measures:
- count
- support
- lift
- chiSquared
This helps identify the most promising two-member itemsets as complement or substitute candidates.
4. Apriori: Search for frequent itemsets
The apriori function is the workhorse of the arules package and is flexible enough to accommodate almost any practical need of a retail analyst. apriori allows for:
- mining both frequent itemsets and rules
- that satisfy prespecified item length, support, confidence and lift
- that might only construct itemsets out of prespecified items
- that might only include prespecified items in lhs (left-hand side of a purchase rule) or rhs (right-hand side of a purchase rule).
Let's start by mining the most frequent itemsets with a minimum length of 2 and a frequency of occurrence of at least 1 in 1000, i.e. support=.001:
itemsets <- apriori(Groceries,
                    parameter = list(support=.001,
                                     minlen=2,
                                     target='frequent' # to mine for itemsets
                                     ))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
##          NA    0.1    1 none FALSE            TRUE   0.001      2     10 frequent itemsets FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [13335 set(s)] done [0.00s].
## creating S4 object ... done [0.01s].
summary(itemsets)
## set of 13335 itemsets
##
## most frequent items:
##       whole milk other vegetables           yogurt  root vegetables   tropical fruit          (Other)
##             3764             3341             2401             1958             1796            27683
##
## element (itemset/transaction) length distribution:sizes
##    2    3    4    5    6
## 2981 6831 3137  376   10
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00    3.00    3.00    3.07    4.00    6.00
##
## summary of quality measures:
##     support
##  Min.   :0.00102
##  1st Qu.:0.00112
##  Median :0.00142
##  Mean   :0.00226
##  3rd Qu.:0.00224
##  Max.   :0.07483
##
## includes transaction ID lists: FALSE
##
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001          1
We see that, at the prespecified support, 13'335 itemsets of length up to 6 were built out of 157 items (the other 12 of the 169 items were dropped as too rare).
Support for itemsets is calculated by default, so we can sort by it and print out the top itemsets:
inspect(sort(itemsets, by='support', decreasing = T)[1:5])
##      items                              support
## 2981 {other vegetables,whole milk}      0.07483
## 2980 {whole milk,rolls/buns}            0.05663
## 2978 {whole milk,yogurt}                0.05602
## 2971 {root vegetables,whole milk}       0.04891
## 2970 {root vegetables,other vegetables} 0.04738
A couple of words on the syntax of sort and inspect:
- sort(), as implied by its name, sorts itemsets and rules by the measure specified in the by=... argument. Usually one sorts by support, lift, confidence, chiSquared, or any other measure that can be calculated with the interestMeasure() function.
- inspect() is a command that prints out the rules or itemsets of interest.
Should we want to add lift and show the top 5 results by lift, we may proceed as follows:
quality(itemsets)$lift <- interestMeasure(itemsets, measure='lift', Groceries)
inspect(sort(itemsets, by ='lift', decreasing = T)[1:5])
##       items                                                                      support  lift
## 13326 {tropical fruit,root vegetables,other vegetables,whole milk,yogurt,oil}    0.001017 459.3
## 13328 {tropical fruit,other vegetables,whole milk,butter,yogurt,domestic eggs}   0.001017 399.6
## 13329 {tropical fruit,root vegetables,other vegetables,whole milk,butter,yogurt} 0.001118 255.9
## 12984 {other vegetables,curd,yogurt,whipped/sour cream,cream cheese }            0.001017 248.7
## 12950 {root vegetables,other vegetables,whole milk,yogurt,rice}                  0.001322 230.6
The figures above explain very well why fruit and vegetable departments are often found next to dairy departments.
Out of curiosity, we can repeat the exercise of building itemsets of length 2 that we did with crossTable, but this time with the apriori function:
itemsets <- apriori(Groceries,
                    parameter = list(support=.001,
                                     minlen=2,
                                     maxlen=2,
                                     target='frequent' # to mine for itemsets
                                     ))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen            target   ext
##          NA    0.1    1 none FALSE            TRUE   0.001      2      2 frequent itemsets FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [2981 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
quality(itemsets)$lift <- interestMeasure(itemsets, measure='lift', Groceries)
inspect(sort(itemsets, by ='lift', decreasing = T)[1:10])
##      items                                 support  lift
## 592  {mayonnaise,mustard}                  0.001423 12.965
## 288  {hamburger meat,Instant food products} 0.003050 11.421
## 93   {detergent,softener}                  0.001118 10.600
## 139  {liquor,red/blush wine}               0.002135 10.025
## 1408 {flour,sugar}                         0.004982  8.463
## 210  {salty snack,popcorn}                 0.002237  8.192
## 1113 {ham,processed cheese}                0.003050  7.071
## 101  {hamburger meat,sauces}               0.001220  6.684
## 32   {cream cheese ,meat spreads}          0.001118  6.605
## 404  {detergent,house keeping products}    0.001017  6.346
With a lift of 12.97 for {mayonnaise,mustard}, do not be surprised if you find mustard next to mayonnaise on the shop shelf!
5. Apriori: Search for rules
When we switch to searching for rules, we need to change target='frequent' (short for 'frequent itemsets') to target='rules'. We should also specify confidence (unless we are satisfied with the default confidence=0.8):
rules <- apriori(Groceries,
                 parameter = list(support=.001,
                                  confidence=.5,
                                  minlen=2,
                                  target='rules' # to mine for rules
                                  ))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##         0.5    0.1    1 none FALSE            TRUE   0.001      2     10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
summary(rules)
## set of 5668 rules
##
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6
##   11 1461 3211  939   46
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    2.00    3.00    4.00    3.92    4.00    6.00
##
## summary of quality measures:
##     support          confidence         lift
##  Min.   :0.00102   Min.   :0.500   Min.   : 1.96
##  1st Qu.:0.00112   1st Qu.:0.545   1st Qu.: 2.46
##  Median :0.00132   Median :0.600   Median : 2.90
##  Mean   :0.00167   Mean   :0.625   Mean   : 3.26
##  3rd Qu.:0.00173   3rd Qu.:0.684   3rd Qu.: 3.69
##  Max.   :0.02227   Max.   :1.000   Max.   :19.00
##
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.5
5'668 rules of length 2 to 6 were generated. The summary statistics for support, confidence, and lift are self-explanatory.
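What "mining rules under support and confidence thresholds" means can be sketched by brute force on a toy example. The following base-R code enumerates every single-item rule {lhs} => {rhs} and keeps those clearing both thresholds. It is an illustration only: the baskets and thresholds are hypothetical, and exhaustive enumeration is a stand-in for the much more efficient search that apriori() performs.

```r
# Toy baskets (hypothetical data, not taken from Groceries).
baskets <- list(
  c("milk", "bread"),
  c("milk", "soda"),
  c("milk", "bread", "butter"),
  c("soda"),
  c("bread", "butter")
)
items <- sort(unique(unlist(baskets)))

# Support of an itemset: share of baskets containing all of its items.
support_of <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Enumerate every ordered pair {lhs} => {rhs} with lhs != rhs.
candidates <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
candidates <- candidates[candidates$lhs != candidates$rhs, ]

candidates$support <- mapply(function(a, b) support_of(c(a, b)),
                             candidates$lhs, candidates$rhs, USE.NAMES = FALSE)
candidates$confidence <- candidates$support /
  vapply(candidates$lhs, support_of, numeric(1), USE.NAMES = FALSE)
candidates$lift <- candidates$confidence /
  vapply(candidates$rhs, support_of, numeric(1), USE.NAMES = FALSE)

# Keep only the rules that clear both (hypothetical) thresholds.
min_support    <- 0.3
min_confidence <- 0.6
toy_rules <- candidates[candidates$support >= min_support &
                        candidates$confidence >= min_confidence, ]
print(toy_rules[order(-toy_rules$lift), ], row.names = FALSE)
```

Lowering either threshold admits more rules, which mirrors how the number of rules returned by apriori() grows as support and confidence are relaxed.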
We can sort rules by any of those measures:
inspect(sort(rules, by='lift', decreasing = T)[1:5])
##     lhs                                   rhs              support  confidence lift
## 53  {Instant food products,soda}       => {hamburger meat} 0.001220 0.6316     19.00
## 37  {soda,popcorn}                     => {salty snack}    0.001220 0.6316     16.70
## 444 {flour,baking powder}              => {sugar}          0.001017 0.5556     16.41
## 327 {ham,processed cheese}             => {white bread}    0.001932 0.6333     15.05
## 55  {whole milk,Instant food products} => {hamburger meat} 0.001525 0.5000     15.04
A stand with soda, popcorn, and salty snacks, with a lift of 16.7, looks like a very promising idea for a department selling DVDs.
If a spurious correlation is suspected, the chiSquared test comes to the rescue:
quality(rules)$chi <- interestMeasure(rules, measure='chi', significance=T, Groceries)
inspect(sort(rules, by='lift', decreasing = T)[1:5])
##     lhs                                   rhs              support  confidence lift  chi
## 53  {Instant food products,soda}       => {hamburger meat} 0.001220 0.6316     19.00 4.967e-48
## 37  {soda,popcorn}                     => {salty snack}    0.001220 0.6316     16.70 5.279e-42
## 444 {flour,baking powder}              => {sugar}          0.001017 0.5556     16.41 1.703e-34
## 327 {ham,processed cheese}             => {white bread}    0.001932 0.6333     15.05 1.109e-58
## 55  {whole milk,Instant food products} => {hamburger meat} 0.001525 0.5000     15.04 2.865e-46
Indeed, we can reject the hypothesis of independence between lhs and rhs.
6. Subsetting rules and itemsets
Let's consider a set of “arbitrary” rules generated with apriori:
rules <- apriori(Groceries,
                 parameter = list(support=.001,
                                  confidence=.7,
                                  maxlen=5,
                                  target='rules' # to mine for rules
                                  ))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##         0.7    0.1    1 none FALSE            TRUE   0.001      1      5  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [1255 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
inspect(sort(rules, by="confidence", decreasing = T)[1:5])
##     lhs                                           rhs          support  confidence lift
## 25  {rice,sugar}                               => {whole milk} 0.001220 1          3.914
## 52  {canned fish,hygiene articles}             => {whole milk} 0.001118 1          3.914
## 147 {root vegetables,butter,rice}              => {whole milk} 0.001017 1          3.914
## 205 {root vegetables,whipped/sour cream,flour} => {whole milk} 0.001729 1          3.914
## 213 {butter,soft cheese,domestic eggs}         => {whole milk} 0.001017 1          3.914
The power of the subset function can be shown by choosing the following subset:
- rhs should be 'bottled beer'
- confidence should be above .7
- results should be sorted by lift
inspect(sort(subset(rules,
                    subset=rhs %in% 'bottled beer' & confidence > .7),
             by = 'lift',
             decreasing = T))
##   lhs                        rhs            support  confidence lift
## 2 {liquor,red/blush wine} => {bottled beer} 0.001932 0.9048     11.24
By now we should be well aware that people buying “liquor” and “red/blush wine” are almost certain to buy “bottled beer” (9 times out of 10), but not “canned beer”:
canned_rules <- apriori(Groceries,
                        parameter = list(support=.001,
                                         confidence=.01,
                                         maxlen=5,
                                         target='rules' # to mine for rules
                                         ))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##        0.01    0.1    1 none FALSE            TRUE   0.001      1      5  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 done [0.02s].
## writing ... [40827 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
inspect(subset(canned_rules,
               subset=lhs %ain% c("liquor", "red/blush wine") & rhs %in% 'canned beer'))
Zilch! No such rule clears even these loose thresholds; perhaps fewer than 1 in 100 people would make that purchase.
Important arguments of subset:
- lhs: means left-hand side, or antecedent
- rhs: means right-hand side, or consequent
- items: items that make up itemsets
- %in%: matches any
- %ain%: matches all
- %pin%: matches partially
- default: no restrictions applied
- &: additional restrictions on lift, confidence, etc.
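The difference between the three kinds of matching can be mimicked in base R on plain character vectors. This is only a sketch of the semantics described in the list above, not of how arules implements its operators; `lhs_sets`, `wanted` and the `match_*` names are hypothetical.

```r
# Hypothetical item sets standing in for the lhs of three rules.
lhs_sets <- list(
  c("liquor", "red/blush wine"),
  c("liquor", "soda"),
  c("white bread", "butter")
)
wanted <- c("liquor", "red/blush wine")

# "any" matching (the idea behind %in%): at least one wanted item is present.
match_any <- vapply(lhs_sets, function(s) any(wanted %in% s), logical(1))

# "all" matching (the idea behind %ain%): every wanted item is present.
match_all <- vapply(lhs_sets, function(s) all(wanted %in% s), logical(1))

# "partial" matching (the idea behind %pin%): some item label contains the pattern.
match_partial <- vapply(lhs_sets,
                        function(s) any(grepl("bread", s, fixed = TRUE)),
                        logical(1))

print(rbind(any = match_any, all = match_all, partial = match_partial))
```

The second set shows why the choice matters: it matches under "any" because it contains “liquor”, but not under "all" because “red/blush wine” is missing.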
Example 1
Both “whole milk” and “yogurt” must be present and the rule's confidence must be higher than .95:
inspect(subset(rules, subset=items %ain% c("whole milk","yogurt") & confidence >.95))
##     lhs                                             rhs                support  confidence lift
## 915 {tropical fruit,grapes,whole milk,yogurt}    => {other vegetables} 0.001017 1          5.168
## 942 {tropical fruit,root vegetables,yogurt,oil}  => {whole milk}       0.001118 1          3.914
## 952 {root vegetables,other vegetables,yogurt,oil} => {whole milk}       0.001423 1          3.914
Example 2
Both “whole milk” and “yogurt” must be present in lhs and the rule's confidence must be higher than .9:
inspect(subset(rules, subset=lhs %ain% c("whole milk","yogurt") & confidence >.9))
##     lhs                                          rhs                support  confidence lift
## 901 {root vegetables,whole milk,yogurt,rice}  => {other vegetables} 0.001322 0.9286     4.799
## 915 {tropical fruit,grapes,whole milk,yogurt} => {other vegetables} 0.001017 1.0000     5.168
## 953 {root vegetables,whole milk,yogurt,oil}   => {other vegetables} 0.001423 0.9333     4.824
Example 3
“Bread” must be present in lhs: any type of “bread” qualifies, be it “white bread” or “brown bread”. “Whole milk” must be present in rhs “as is”. The confidence of the rule must be higher than .9:
inspect(subset(rules, subset= lhs %pin% "bread" & rhs %in% "whole milk" & confidence > .9))
##      lhs                                                        rhs          support  confidence lift
## 611  {root vegetables,butter,white bread}                    => {whole milk} 0.001118 0.9167     3.588
## 997  {root vegetables,other vegetables,butter,white bread}   => {whole milk} 0.001017 1.0000     3.914
## 1088 {pip fruit,root vegetables,other vegetables,brown bread} => {whole milk} 0.001220 0.9231     3.613
## 1095 {root vegetables,other vegetables,rolls/buns,brown bread} => {whole milk} 0.001017 0.9091     3.558
It appears from this limited sample that people buying “whole milk” do not have a clear preference for either “white bread” or “brown bread” (perhaps the crossTable contingency table would show a different result).
Example 4
Let's see what we can expect at rhs with confidence higher than .7 if we have both “flour” and “whole milk” on the lhs.
inspect(subset(rules, subset= lhs %ain% c("flour","whole milk") & confidence>.7))
##     lhs                                rhs                support confidence lift
## 208 {citrus fruit,whole milk,flour} => {other vegetables} 0.00122 0.75       3.876
Example 5
Finally, let's go fishing for substitute products. So far we were concerned with complements, i.e. items and itemsets that showed high lift. In other words, they were bought together much more often than would be warranted by sheer chance.
Let's consider the case of “bottled beer” vs. “canned beer” and check whether people tend to buy either one or the other and rarely both, which would qualify these two as substitute products.
rules <- apriori(Groceries,
                 parameter = list(support=.001,
                                  conf = .01,
                                  minlen=2,
                                  maxlen=2,
                                  target='rules'
                                  ))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
##        0.01    0.1    1 none FALSE            TRUE   0.001      2      2  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [5818 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Let's only look at the rules where “beer” is present on both the left- and right-hand side of the rule, and add the chiSquared p-value to check the statistical significance of our findings:
quality(rules)$chi <- interestMeasure(rules, measure='chi', significance=T, Groceries)
inspect(subset(rules, lhs %pin% 'beer' & rhs %pin% 'beer'))
##      lhs               rhs            support  confidence lift   chi
## 4785 {canned beer}  => {bottled beer} 0.002644 0.03403    0.4226 8.743e-07
## 4786 {bottled beer} => {canned beer}  0.002644 0.03283    0.4226 8.743e-07
The results so far are quite telling:
- there are indeed people who buy both bottled and canned beer at once:
crossTable(Groceries)['canned beer','bottled beer']
## [1] 26
- the probability of buying one given the other (confidence) is pretty small: ~3%
- this is despite both bottled beer and canned beer being pretty popular purchases:
crossTable(Groceries)['canned beer','canned beer']
## [1] 764
crossTable(Groceries)['bottled beer','bottled beer']
## [1] 792
All these figures, combined with a statistically significant lift below 1 (chi ~ 1e-6), tell us that “bottled beer” and “canned beer” do behave as substitutes.
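The beer figures can be re-derived by hand from the three counts returned by crossTable(), which makes the relationship between count, confidence, lift and the chi-squared p-value explicit. The base-R sketch below assumes a plain Pearson chi-squared test without continuity correction; the variable names are illustrative.

```r
# Counts taken from the crossTable() output above.
n       <- 9835  # transactions
canned  <- 764   # baskets with "canned beer"
bottled <- 792   # baskets with "bottled beer"
both    <- 26    # baskets with both

# Confidence in each direction: P(rhs | lhs).
conf_canned_to_bottled <- both / canned
conf_bottled_to_canned <- both / bottled

# Lift: observed joint support relative to independence.
lift <- (both / n) / ((canned / n) * (bottled / n))

# Pearson chi-squared test on the 2x2 table (no continuity correction).
tab <- matrix(c(both,           canned - both,
                bottled - both, n - canned - bottled + both),
              nrow = 2, byrow = TRUE)
p_value <- chisq.test(tab, correct = FALSE)$p.value

print(round(c(conf_canned_to_bottled, conf_bottled_to_canned, lift), 4))
print(p_value)
```

The confidences and the lift agree with the rule table above, and the p-value is of the same order of magnitude as the chi column.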
7. Visually mining rules.
Both rules and itemsets can be visualized with the help of the arulesViz library.
The power of the plot() function from arulesViz comes from its interactive argument. Remember, as a general rule we want rules with both high support and high confidence.
plot(rules, interactive = T)
With the help of this function we can visually mine rules by:
1. Selecting a rectangular area by clicking twice on the plot.
2. Then clicking inspect.
This prints the rules found in that region, e.g.:
    lhs                                      rhs          support  confidence lift
717 {other vegetables,curd,domestic eggs} => {whole milk} 0.002847 0.8235     3.223
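What the rectangular selection does can be mimicked without the interactive plot by filtering on support and confidence ranges. The sketch below uses a small hand-built data frame holding the measures of the five rules printed in section 5; the `region` bounds and all variable names are hypothetical, and it assumes the plot's axes are support and confidence.

```r
# Measures of five rules copied from the sorted inspect() output in section 5.
rule_measures <- data.frame(
  id         = c(53, 37, 444, 327, 55),
  support    = c(0.001220, 0.001220, 0.001017, 0.001932, 0.001525),
  confidence = c(0.6316, 0.6316, 0.5556, 0.6333, 0.5000),
  lift       = c(19.00, 16.70, 16.41, 15.05, 15.04)
)

# Hypothetical rectangle: the ranges spanned by the two clicked corners.
region <- list(support = c(0.0015, 0.0020), confidence = c(0.45, 0.70))

# Keep the rules whose (support, confidence) point falls inside the rectangle.
inside <- with(rule_measures,
               support    >= region$support[1]    & support    <= region$support[2] &
               confidence >= region$confidence[1] & confidence <= region$confidence[2])
selected <- rule_measures[inside, ]
print(selected)
```

The same filter could be written against a real rule set with subset() on support and confidence, which is handy when the interactive device is not available.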
Summary
The arules package by Michael Hahsler is a very powerful toolset for mining transactional databases. In the context of market basket analysis it allows identifying items and itemsets that tend to be bought together. Thus, it facilitates mining association rules in an “IF … THEN” context. Apart from specifying the confidence and support of the rules (basically conditional probability and frequency), it provides facilities for calculating other interesting measures, like chiSquared, that allow filtering spurious effects out. In addition, the complementary package arulesViz provides facilities for mining and representing association rules visually.