Association Rules

Overview

Association Rule Mining is an unsupervised machine learning technique used to discover interesting relationships or associations among a set of items in a dataset. It aims to identify strong rules of the form \(X \Rightarrow Y\), where X and Y are disjoint sets of items. A common application is Market Basket Analysis, where the goal is to find associations between items frequently purchased together by customers (e.g., \(\{\text{Bread, Butter}\} \Rightarrow \{\text{Milk}\}\)).

These rules can provide valuable insights for businesses, such as optimizing store layouts, creating targeted promotions, and improving recommendation systems.

Core Concepts

  • Key Terminology

    • Itemset: A collection of one or more items. For example, \(\{\text{Milk, Bread, Diaper}\}\) is an itemset.
    • k-itemset: An itemset containing k items.
    • Support (supp(X)): The proportion of transactions in the dataset that contain itemset X. It indicates the popularity of an itemset.
      $$ \text{Support}(X) = \frac{\text{Number of transactions containing X}}{\text{Total number of transactions}} $$
    • Confidence (conf(X \(\Rightarrow\) Y)): The conditional probability that a transaction containing itemset X also contains itemset Y. It measures the reliability of the rule.
      $$ \text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} $$
    • Lift (lift(X \(\Rightarrow\) Y)): Measures how much more likely itemset Y is to be purchased when itemset X is purchased, compared with how likely Y would be if it were purchased independently of X. It indicates the strength or importance of a rule beyond simple co-occurrence.
      $$ \text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)} $$
      • lift = 1: X and Y are independent.
      • lift > 1: X and Y are positively correlated (presence of X increases likelihood of Y).
      • lift < 1: X and Y are negatively correlated (presence of X decreases likelihood of Y).
    • Leverage (leverage(X \(\Rightarrow\) Y)): Measures the difference between the observed frequency of X and Y appearing together and the frequency that would be expected if X and Y were independent.
      $$ \text{leverage}(X \Rightarrow Y) = \text{Support}(X \cup Y) - (\text{Support}(X) \cdot \text{Support}(Y)) $$
      Leverage values close to 0 indicate independence.
    • Conviction (conv(X \(\Rightarrow\) Y)): The ratio of the expected frequency with which X occurs without Y (i.e., the frequency with which the rule would make an incorrect prediction) if X and Y were independent, to the observed frequency of incorrect predictions.
      $$ \text{conv}(X \Rightarrow Y) = \frac{1 - \text{Support}(Y)}{1 - \text{Confidence}(X \rightarrow Y)} $$
      A high conviction value means that the consequent is highly dependent on the antecedent. For instance, a conviction of 1.5 indicates that the rule X \(\Rightarrow\) Y would be incorrect 50% more often if the association between X and Y were purely random.
    • Frequent Itemset: An itemset whose support is greater than or equal to a user-specified minimum support threshold (minsup).
    • Association Rule: An implication of the form \(X \Rightarrow Y\), where X and Y are disjoint itemsets, and the rule satisfies user-specified minimum support and minimum confidence (minconf) thresholds.
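
    To make these definitions concrete, here is a minimal sketch in plain Python that computes each metric above for the single rule \(\{\text{Bread, Butter}\} \Rightarrow \{\text{Milk}\}\) on a small, made-up list of five baskets; the data and the resulting numbers are purely illustrative.

      # Toy transactions (hypothetical data, for illustration only)
      transactions = [
          {'Bread', 'Butter', 'Milk'},
          {'Bread', 'Butter'},
          {'Bread', 'Milk'},
          {'Butter', 'Milk'},
          {'Bread', 'Butter', 'Milk'},
      ]
      n = len(transactions)

      def support(itemset):
          # Fraction of transactions that contain every item in `itemset`
          return sum(itemset <= t for t in transactions) / n

      X, Y = {'Bread', 'Butter'}, {'Milk'}
      supp_x, supp_y, supp_xy = support(X), support(Y), support(X | Y)

      confidence = supp_xy / supp_x                       # 0.40 / 0.60 ≈ 0.67
      lift = supp_xy / (supp_x * supp_y)                  # 0.40 / 0.48 ≈ 0.83
      leverage = supp_xy - supp_x * supp_y                # 0.40 - 0.48 = -0.08
      conviction = ((1 - supp_y) / (1 - confidence)
                    if confidence < 1 else float('inf'))  # 0.20 / 0.33 ≈ 0.60

      print(supp_xy, confidence, lift, leverage, conviction)

    In this toy data lift < 1 and leverage is negative, so a basket containing Bread and Butter is actually slightly less likely than average to contain Milk.
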
  • Apriori Algorithm

    The Apriori algorithm is a classic method for mining frequent itemsets and generating association rules. It is based on the "Apriori principle":

    If an itemset is frequent, then all of its subsets must also be frequent. Conversely, if an itemset is infrequent, then all of its supersets must also be infrequent.

    Steps:

    1. Generate Frequent 1-itemsets (L1): Scan the dataset to find all items that satisfy the minimum support threshold.
    2. Iteratively Generate Candidate k-itemsets (Ck) and Frequent k-itemsets (Lk):
      • For k = 2, 3, ... until no more frequent itemsets can be found:
      • Candidate Generation (Join Step): Generate candidate k-itemsets (Ck) by joining frequent (k-1)-itemsets (Lk-1) with themselves. Only join pairs of Lk-1 itemsets if their first k-2 items are identical.
      • Candidate Pruning (Prune Step): Prune candidates from Ck that have any (k-1)-subset that is not in Lk-1 (applying the Apriori principle).
      • Support Counting: Scan the dataset to count the support of each candidate in Ck.
      • Frequent Itemset Generation: Add candidates from Ck that meet the minimum support threshold to Lk.
    3. Generate Association Rules: For each frequent itemset L, generate all non-empty subsets S of L. For each such subset S, output the rule \( S \Rightarrow (L-S) \) if \( \text{Confidence}(S \rightarrow (L-S)) \ge \text{minconf} \). Confidence is calculated as \( \text{Support}(L) / \text{Support}(S) \).
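
    The rule-generation step can be sketched as follows: for one frequent itemset, enumerate every non-empty proper subset as an antecedent and keep the rules that clear minconf. The support values below are made up for illustration.

      from itertools import combinations

      # Hypothetical support values for a frequent itemset and its subsets
      support = {
          frozenset({'Bread'}): 0.6,
          frozenset({'Butter'}): 0.7,
          frozenset({'Bread', 'Butter'}): 0.4,
      }
      L = frozenset({'Bread', 'Butter'})
      minconf = 0.5

      for r in range(1, len(L)):                          # all non-empty proper subsets
          for antecedent in map(frozenset, combinations(L, r)):
              consequent = L - antecedent
              conf = support[L] / support[antecedent]     # Confidence(S -> L-S)
              if conf >= minconf:
                  print(set(antecedent), '=>', set(consequent), round(conf, 2))
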

    Limitations: Apriori can be computationally expensive, especially with large datasets, as it requires multiple scans of the database and can generate a large number of candidate itemsets.
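
    The candidate generation and pruning from step 2, which drive much of this cost, can be sketched in a few lines of plain Python; this is only an illustrative fragment (itemsets represented as sorted tuples), not an optimized implementation.

      from itertools import combinations

      def apriori_gen(prev_frequent, k):
          # Generate candidate k-itemsets from frequent (k-1)-itemsets (sorted tuples)
          prev = set(prev_frequent)
          candidates = set()
          for a in prev_frequent:
              for b in prev_frequent:
                  # Join step: merge itemsets that share their first k-2 items
                  if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                      cand = a + (b[k - 2],)
                      # Prune step: every (k-1)-subset must itself be frequent
                      if all(sub in prev for sub in combinations(cand, k - 1)):
                          candidates.add(cand)
          return candidates

      # Frequent 2-itemsets -> candidate 3-itemsets
      L2 = [('Bread', 'Butter'), ('Bread', 'Milk'), ('Butter', 'Milk')]
      print(apriori_gen(L2, 3))   # {('Bread', 'Butter', 'Milk')}
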

  • FP-Growth (Frequent Pattern Growth) Algorithm

    The FP-Growth algorithm is an alternative to Apriori that aims to be more efficient by avoiding the costly candidate generation step. It uses a compact data structure called an FP-tree (Frequent Pattern tree) to store frequent itemset information.

    General Steps:

    1. Scan dataset and find frequent 1-itemsets: Similar to Apriori, identify all items that meet the minimum support and sort them by decreasing support count.
    2. Construct the FP-tree:
      • Create the root of the tree (null).
      • For each transaction in the dataset:
      • Select and sort the frequent items in the transaction according to the order from step 1.
      • Insert the sorted frequent items into the FP-tree. Each node in the tree represents an item, and its count represents the number of transactions passing through that node. Shared prefixes will result in shared paths in the tree.
      • A header table is maintained, where each frequent item points to the first occurrence of that item in the FP-tree, with links to subsequent occurrences.
    3. Mine Frequent Patterns from the FP-tree:
      • This is done recursively. For each item in the header table (starting from the least frequent), find its conditional pattern base (the set of prefix paths in the FP-tree co-occurring with the item).
      • Construct a conditional FP-tree from this conditional pattern base.
      • Recursively mine this conditional FP-tree.
      • The frequent patterns are formed by appending the item to the patterns found in its conditional FP-tree.

    Advantages: Generally faster than Apriori, especially for dense datasets, as it only requires two scans of the database and avoids explicit candidate generation.
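
    In practice you rarely build the FP-tree by hand. If you use mlxtend (as in the Implementation section further below), fpgrowth is available as a drop-in alternative to apriori; a minimal sketch with made-up transactions:

      import pandas as pd
      from mlxtend.preprocessing import TransactionEncoder
      from mlxtend.frequent_patterns import fpgrowth

      # Toy transactions (illustrative only)
      dataset = [
          ['Milk', 'Bread', 'Butter'],
          ['Bread', 'Butter'],
          ['Milk', 'Bread'],
          ['Milk', 'Butter'],
      ]

      # One-hot encode, then mine frequent itemsets with FP-Growth
      te = TransactionEncoder()
      df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)
      frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)
      print(frequent_itemsets)

    The resulting DataFrame has the same columns (support, itemsets) as the output of apriori, so association_rules can be applied to it unchanged.
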

  • Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal)

    Eclat is another popular algorithm for frequent itemset mining. It uses a depth-first search (DFS) approach and a vertical data format (tid-lists, i.e., lists of transaction IDs where each item appears).

    General Steps:

    1. Initial Tid-lists: For each item, create a list of the transaction IDs (tids) in which it appears. An item is a frequent 1-itemset if its tid-list is large enough to meet the minimum support threshold.
    2. Recursive Mining:
      • The algorithm recursively processes itemsets. To find frequent (k+1)-itemsets from frequent k-itemsets, it intersects the tid-lists of k-itemsets.
      • For example, to find the support of itemset \(\{\text{A, B}\}\), intersect the tid-list of A with the tid-list of B. The size of the resulting tid-list is the support count of \(\{\text{A, B}\}\).
      • The search proceeds in a depth-first manner through the itemset lattice.
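
    A minimal sketch of the vertical format and a single tid-list intersection, using a made-up four-transaction dataset:

      # Toy data: transaction ID -> items (illustrative only)
      transactions = {
          1: {'A', 'B', 'C'},
          2: {'A', 'B'},
          3: {'A', 'C'},
          4: {'B', 'C'},
      }

      # Vertical format: item -> set of transaction IDs containing it
      tidlists = {}
      for tid, items in transactions.items():
          for item in items:
              tidlists.setdefault(item, set()).add(tid)

      # Support of {A, B} = size of the intersection of the tid-lists of A and B
      tids_ab = tidlists['A'] & tidlists['B']
      print(tids_ab, len(tids_ab) / len(transactions))   # {1, 2} 0.5
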

    Advantages: Can be faster than Apriori for certain types of datasets, especially when itemsets are long. It avoids repeated scans of the database by using the tid-list intersections.

    Disadvantages: Tid-lists can become very long for very frequent items, consuming significant memory. Intermediate tid-lists can also be large.

  • Examples of Real-World Use Cases

    • Market Basket Analysis: Identifying products frequently bought together to optimize store layout, product bundling, and cross-selling promotions (e.g., \(\{\text{diapers}\} \Rightarrow \{\text{beer}\}\)).
    • Recommendation Systems: Suggesting items to users based on items they or similar users have shown interest in (e.g., "Customers who bought X also bought Y", used by Amazon, Netflix).
    • Medical Diagnosis: Finding associations between symptoms and diseases, or patient characteristics and treatment outcomes.
    • Web Usage Mining: Analyzing weblog data to understand user navigation patterns, predict future user behavior, and personalize website content.
    • Bioinformatics: Discovering relationships between genes and diseases, or identifying patterns in DNA sequences.
    • Fraud Detection: Identifying patterns of behavior or transactions associated with fraudulent activities.
    • Customer Relationship Management (CRM): Understanding customer behavior to improve customer segmentation, retention, and targeted marketing.
  • Advantages

    • Easy to Understand: The rules generated (e.g., X \(\Rightarrow\) Y) are often intuitive and straightforward to interpret, making them accessible to non-technical users.
    • Actionable Insights: Can uncover hidden patterns that are directly actionable for business decisions.
    • Unsupervised: Does not require labeled data, making it applicable to a wide range of datasets.
    • Foundation for Other Techniques: The concepts are foundational for other data mining tasks like sequential pattern mining or classification based on associations.
  • Disadvantages

    • Large Number of Rules: Can generate a very large number of rules, many of which may be trivial or uninteresting. Requires careful filtering and evaluation using metrics like lift or conviction.
    • Computational Complexity: Finding frequent itemsets can be computationally intensive, especially with high-dimensional data or low support thresholds.
    • Sensitivity to Thresholds: The number and quality of rules can be highly sensitive to the chosen minimum support and confidence thresholds. Setting these appropriately often requires domain knowledge and experimentation.
    • Correlation vs. Causation: Association rules indicate correlation, not necessarily causation. Just because two items are frequently bought together doesn't mean one causes the purchase of the other.
    • Handling Sparse Data: Can struggle with very sparse datasets where individual item occurrences are rare.

Implementation

  • Association Rule Mining with `mlxtend` (Apriori)

    
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules
    
    # Sample transaction data (list of lists)
    # Each inner list represents a transaction and contains items purchased
    dataset = [
        ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
        ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
        ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
        ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
        ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']
    ]
    
    # Transform the dataset into the required one-hot encoded format
    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset)
    df = pd.DataFrame(te_ary, columns=te.columns_)
    # print("One-hot encoded DataFrame:")
    # print(df)
    
    # --- Step 1: Find frequent itemsets using Apriori ---
    # We set a minimum support threshold (e.g., 0.5 means itemsets appearing in at least 50% of transactions)
    min_support_threshold = 0.5 # Adjust as needed
    frequent_itemsets = apriori(df, min_support=min_support_threshold, use_colnames=True)
    
    print("\nFrequent Itemsets (min_support=", min_support_threshold, "):")
    print(frequent_itemsets)
    
    # --- Step 2: Generate association rules from frequent itemsets ---
    # We can filter rules based on metrics like confidence or lift
    # For example, generate rules with a minimum confidence of 0.7
    min_confidence_threshold = 0.7 # Adjust as needed
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence_threshold)
    
    print("\nAssociation Rules (min_confidence=", min_confidence_threshold, "):")
    # Displaying selected columns for clarity
    print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift', 'leverage', 'conviction']])
    
    # Example: Filter rules by lift > 1
    # high_lift_rules = rules[rules['lift'] > 1]
    # print("\nHigh Lift Association Rules (lift > 1):")
    # print(high_lift_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
    
    # Example: Find rules leading to 'Kidney Beans'
    # rules_for_beans = rules[rules['consequents'] == {'Kidney Beans'}]
    # print("\nRules leading to Kidney Beans:")
    # print(rules_for_beans)
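
    On this toy dataset, Kidney Beans appears in every transaction (support 1.0) and Eggs in four of the five (support 0.8), so the frequent itemsets at min_support=0.5 are combinations of Kidney Beans, Eggs, Milk, Onion, and Yogurt. Note that any rule whose consequent is Kidney Beans has confidence 1.0 but lift exactly 1.0, a reminder that high confidence alone does not imply an interesting association.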
    

Interview Examples

Explain Support, Confidence, and Lift

Describe the metrics: support, confidence, and lift in the context of association rule mining. Why is lift particularly important?

The Apriori Principle

Explain the Apriori principle and how it helps in efficiently mining frequent itemsets.

FP-Growth vs. Apriori

What are the main differences between the Apriori and FP-Growth algorithms for association rule mining? When might FP-Growth be preferred?

Practice Questions

1. What are the practical applications of Association Rules? Medium

Hint: Consider both academic and industry use cases

2. Explain the core concepts of Association Rules Easy

Hint: Think about the fundamental principles

3. How would you implement this in a production environment? Hard

Hint: Consider scalability and efficiency