Understanding Data Processing Architectures
Sep 2, 2024
Lecture Notes: Why Not Connect Reporting Solutions Directly to Source Databases?
Introduction
Discusses the evolution of data processing approaches and why directly connecting reporting solutions to source databases is often not ideal.
Direct Connection to Source Database
Assumptions: Single relational database (e.g., SQL Server).
Example: Sales reports.
Challenges:
OLTP (Online Transactional Processing) vs. OLAP (Online Analytical Processing).
Different workloads and performance requirements.
OLTP vs. OLAP
OLTP Characteristics:
Transactional, many concurrent users (e.g., customer orders).
Fast response time required.
Data is normalized to handle transactions efficiently.
Focus on small data units (individual row operations).
OLAP Characteristics:
Analytical, large data set analysis (e.g., yearly sales comparison).
Tolerant to slightly longer query times.
Data is denormalized for simpler and faster querying.
Supports self-service BI by providing a simple, intuitive data model.
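The normalized-vs-denormalized contrast above can be sketched concretely. This is a minimal illustration using Python's sqlite3 standing in for SQL Server; all table and column names (customers, orders, sales_flat, etc.) are hypothetical, not from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP shape: normalized -- each fact stored once, analysis needs joins.
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        product_id INTEGER, quantity INTEGER);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO products  VALUES (10, 'Widget', 2.50), (11, 'Gadget', 4.00);
INSERT INTO orders    VALUES (100, 1, 10, 3), (101, 2, 11, 1);
""")

# Analytical question on normalized data: revenue per customer needs two joins.
normalized_q = """
SELECT c.name, SUM(o.quantity * p.price) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products  p ON p.product_id  = o.product_id
GROUP BY c.name ORDER BY c.name;
"""
print(cur.execute(normalized_q).fetchall())  # [('Ada', 7.5), ('Grace', 4.0)]

# OLAP shape: denormalized -- one wide table answers the same question, no joins.
cur.executescript("""
CREATE TABLE sales_flat (customer_name TEXT, product_name TEXT, revenue REAL);
INSERT INTO sales_flat
SELECT c.name, p.name, o.quantity * p.price
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products  p ON p.product_id  = o.product_id;
""")
denormalized_q = """
SELECT customer_name, SUM(revenue) FROM sales_flat
GROUP BY customer_name ORDER BY customer_name;
"""
print(cur.execute(denormalized_q).fetchall())  # same result, simpler query
```

Both queries return the same numbers; the point is that the denormalized query is simple enough for self-service BI users to write.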
Challenges with Direct Connection
Complex queries and slow performance due to normalized data structure.
Risk of blocking OLTP database operations (the business's main source of income).
Security concerns: Critical applications cannot be compromised.
Alternatives and Solutions
Proof of Concept: Quick reports might use a direct connection.
Read-Only Replica: Mitigates stress on the primary database.
SQL Views: Simplify data access, but might not be allowed by DBAs.
Evolution of Data Processing Approaches
Relational Data Warehouse
Origin: 1980s, structured data handling.
Process:
Source to staging area in a relational data warehouse.
Transform data using SQL, present reports.
Modeling:
Dimensional modeling (Ralph Kimball) vs. the Inmon approach.
Challenges: Slow to adapt due to the upfront schema requirement; not suited for big data.
Big Data Challenges
Characteristics:
Volume, Variety, Velocity.
Need for new approach due to unstructured data and real-time processing.
Data Lake
Concept: Ingest and store unstructured data files.
Schema on Read: Define the schema when reading the data.
Hadoop: Distributed processing framework facilitating big data management.
Limitations: Transaction support and concurrency issues.
Modern Data Warehouse
Hybrid Approach: Combines data lake and relational warehouse benefits.
Process:
Ingest data into a data lake, process with Databricks, then store in a warehouse.
Challenges: Data duplication, security management.
Data Lakehouse
New Approach: Combines database and data lake features using the Delta format.
Benefits: Single data copy; SQL endpoint for user-friendly access.
Data Mesh
Concept: Decentralized data ownership across business domains.
Principles:
Domain Ownership: Each domain handles its data processing.
Data as a Product: High-quality, tested, and reliable output.
Self-Serve Infrastructure: Domains provision their own resources.
Federated Governance: Standards and policies to maintain consistency.
Conclusion
Overview of big data history and processing approaches.
Highlighted the transitions and reasons behind evolving data architectures.
Emphasis on adapting to modern methods for handling complex data needs.