Data Warehousing and Mining Principles

May 19, 2025

Data Warehousing and Data Mining Lecture Notes

Course Overview

  • Objective: Learn the principles of data warehousing and the core concepts of data mining.
  • Focus Areas:
    • Data warehouse design
    • Data mining concepts
    • Association rules
    • Classification algorithms
    • Clustering techniques

Unit I: Data Warehousing

Introduction to Data Warehouse

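  • Definition: A subject-oriented, integrated, time-variant, non-volatile collection of data that supports management decision making.
  • Key Characteristics:
    • Subject-Oriented: Organized around major subject areas (e.g., customers, products, sales) rather than day-to-day transactions.
    • Integrated: Combines data from multiple heterogeneous sources into a consistent format.
    • Time-Variant: Stores historical data, with records tied to a time period.
    • Non-Volatile: Data is loaded and read, but not updated or deleted in place once entered.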

Data Warehouse Design Process

  • Approaches:
    • Top-Down: Starts with overall design and planning.
    • Bottom-Up: Starts with prototypes and experiments.
    • Combined: Integrates both strategies.
  • Steps:
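    • Choose the business process to model.
    • Choose the grain (the level of detail represented by each fact-table row).
    • Choose the dimensions that apply to each fact-table row.
    • Choose the measures that populate each fact-table row.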

Differences with Operational Database Systems

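  • Operational databases (OLTP) handle current, day-to-day transactions with many short reads and writes, while data warehouses (OLAP) hold historical, consolidated data for complex, mostly read-only analytical queries.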

Data Warehouse Architecture

  • Three Tiers:
    • Tier 1 (bottom): Warehouse database server.
    • Tier 2 (middle): OLAP server (ROLAP or MOLAP).
    • Tier 3 (top): Front-end client layer with query, reporting, and analysis tools.

Schema Design

  • Types:
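    • Star Schema: A central fact table linked directly to a set of dimension tables.
    • Snowflake Schema: A variant of the star schema in which dimension tables are normalized into additional tables.
    • Fact Constellation: Multiple fact tables that share dimension tables.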
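
A star schema can be sketched with Python's built-in sqlite3 module. The retail tables, column names, and sample rows below are hypothetical and chosen only for illustration; the grain assumed here is one fact row per product per date.

```python
import sqlite3

# Hypothetical retail example: all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    -- Fact table: foreign keys to each dimension plus numeric measures.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units_sold INTEGER,
        revenue    REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Laptop", "Electronics"),
                  (2, "Phone", "Electronics"),
                  (3, "Desk", "Furniture")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(1, 2024, "Q1"), (2, 2024, "Q2")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 2000.0), (2, 1, 5, 2500.0),
                  (3, 2, 1, 300.0), (1, 2, 1, 1000.0)])

# Analytical query: join the fact table to its dimensions and
# aggregate the revenue measure by category and quarter.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake variant, the category column would move into its own table referenced from dim_product.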

Unit II: Fundamentals of Data Mining

Introduction to Data Mining

  • Goal: Extract useful information from large data sets and transform it into an understandable structure.
  • Key Properties: Pattern discovery, prediction, actionable information.

Data Mining Functionalities

  • Tasks:
    • Descriptive: Characterize the general properties of the data.
    • Predictive: Perform inference on existing data to make predictions.

Data Preprocessing

  • Importance: Ensures data quality and reliability.
  • Techniques:
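    • Data Cleaning: Fill in missing values, smooth noise, and resolve inconsistencies.
    • Data Integration: Merge data from multiple sources.
    • Data Transformation: Normalize or aggregate data into forms suitable for mining.
    • Data Reduction: Obtain a smaller representation that yields similar analytical results.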
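
Two of these techniques can be sketched in a few lines of Python. The function names and the sample `ages` list are hypothetical, and filling missing values with the mean is only one of several possible cleaning strategies.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 60]             # hypothetical attribute with a missing value
cleaned = fill_missing(ages)          # [20, 40, 40, 60]
scaled = min_max_normalize(cleaned)   # [0.0, 0.5, 0.5, 1.0]
print(cleaned, scaled)
```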

Unit III: Association Rules

Association Rule Mining

  • Purpose: Discover interesting relations between variables in large databases.
  • Applications: Market basket analysis (e.g., finding products that are frequently bought together).
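
Association rules are usually evaluated with support and confidence. The sketch below computes both for a rule X → Y over a small, made-up set of market baskets; the helper names and the data are assumptions for illustration.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.75
```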

Apriori Algorithm

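  • Process: Level-wise iterative search in which frequent k-itemsets are used to generate candidate (k+1)-itemsets, which are then counted against the database.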
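
A minimal sketch of the level-wise search, assuming transactions are given as sets and the threshold is an absolute count. The function name `apriori` and the sample data are hypothetical, and the code favors readability over efficiency.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset that occurs in at least
    min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    current = count_frequent({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = count_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(transactions, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With this data every single item and every pair is frequent, while {A, B, C} occurs only twice and is dropped.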

FP-Growth Algorithm

  • Approach: Compresses the database into a frequent-pattern tree (FP-tree) and mines frequent patterns from it without candidate generation.
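
The following is a compact sketch of pattern growth, not an optimized implementation. It builds an FP-tree, then recursively mines the conditional pattern base of each item. The names `Node`, `build_tree`, and `fp_growth`, the (items, weight) input format, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, weight) pairs.
    Returns a header table (item -> list of nodes) and the item counts."""
    freq = defaultdict(int)
    for items, w in weighted_transactions:
        for i in set(items):
            freq[i] += w
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)
    for items, w in weighted_transactions:
        # Keep frequent items, ordered by descending count (ties broken by name)
        # so that transactions with common items share tree prefixes.
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for i in ordered:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += w
    return header, freq

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Yield (itemset, count) for every frequent itemset."""
    header, freq = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        pattern = suffix | {item}
        yield pattern, freq[item]
        # Conditional pattern base: the prefix path above each node of `item`,
        # weighted by that node's count.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        yield from fp_growth(conditional, min_count, pattern)

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
patterns = dict(fp_growth([(t, 1) for t in transactions], min_count=3))
for itemset, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```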

Unit IV: Classification

General Approaches

  • Data Preparation: Cleaning, relevance analysis, transformation.

Decision Trees

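  • Algorithm: Recursively partitions the training data using a splitting criterion (e.g., information gain, gain ratio, or Gini index) until the resulting subsets can be labeled with a class.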
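
As one example of a splitting criterion, the sketch below computes information gain for categorical attributes and picks the best attribute for a single split. The notes do not say which criterion the course uses, so information gain is an assumption here, as are the function names and the toy weather data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute` (rows are dicts)."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def best_split(rows, labels):
    """Attribute with the highest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))

# Hypothetical training data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

print(best_split(rows, labels))  # outlook
```

A full decision tree learner would apply `best_split` recursively to each partition until a stopping condition is met.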

Bayesian Classification

  • Types:
    • Naive Bayesian Classifier
    • Bayesian Belief Networks
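
A minimal sketch of a naive Bayesian classifier for categorical attributes. It assumes the attributes are conditionally independent given the class and applies Laplace (add-one) smoothing, which is a choice made for this example and is not stated in the notes. The `train` and `predict` names and the data are hypothetical.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Collect the counts needed to estimate P(class) and P(value | class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values = defaultdict(set)            # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, label)][v] += 1
            values[j].add(v)
    return class_counts, value_counts, values

def predict(model, row):
    """Return the class that maximizes P(class) * product of P(value | class)."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(class)
        for j, v in enumerate(row):
            # Laplace-smoothed estimate of P(value | class).
            score *= (value_counts[(j, c)][v] + 1) / (n_c + len(values[j]))
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]

model = train(rows, labels)
print(predict(model, ("rainy", "cool")))  # yes
print(predict(model, ("sunny", "hot")))   # no
```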

K-Nearest Neighbor (kNN)

  • Characteristics: A lazy learner that uses a distance metric (e.g., Euclidean distance) to find the k nearest training examples and assigns the majority class among them.
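
A minimal kNN sketch, assuming numeric attributes, Euclidean distance, and simple majority voting. The `knn_predict` function and the sample points are hypothetical; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training set with two well-separated classes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # a
print(knn_predict(points, labels, (7, 9)))  # b
```

Because no model is built in advance, all of the work happens at prediction time, which is why kNN is described as a lazy learner.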

Unit V: Clustering

Overview

  • Definition: Grouping data objects into clusters so that objects within a cluster are similar to one another and dissimilar to objects in other clusters.
  • Applications: Market research, pattern recognition, outlier detection.

Clustering Methods

  • Types:
    • Partitioning Methods (e.g., k-means, k-medoids)
    • Hierarchical Methods (agglomerative or divisive)
    • Density-Based Methods (e.g., DBSCAN)
    • Grid-Based Methods (e.g., STING)
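
As an illustration of a partitioning method, here is a minimal k-means sketch. It assumes numeric points given as tuples and uses the first k points as initial centroids, a deliberately naive and deterministic choice made for this example. The function name and the data are hypothetical.

```python
from math import dist

def k_means(points, k, max_iter=100):
    """Partition points into k clusters; returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive initialization: first k points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: centroids no longer change
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

In practice the result depends on the initial centroids, so implementations commonly use random restarts or a smarter seeding strategy.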

Outlier Analysis

  • Purpose: Detect data objects that do not comply with general data behavior.
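
One simple statistical approach is the interquartile range (IQR) rule, sketched below for a single numeric attribute. The notes do not name a specific method, so the choice of the IQR rule, the conventional 1.5 multiplier, and the sample data are all assumptions. `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sensor readings
print(iqr_outliers(readings))  # [95]
```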

Textbooks and References

  • Textbooks:
    • "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber.
    • "Introduction to Data Mining" by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
  • Reference Books:
    • "Data Mining Techniques" by Arun K Pujari.
    • "Data Warehousing Fundamentals" by Paulraj Ponniah.
    • "The Data Warehouse Lifecycle Toolkit" by Ralph Kimball.