Understanding Data Processing Architectures

Sep 2, 2024

Lecture Notes: Why Not Connect Reporting Solutions Directly to Source Databases?

Introduction

  • Discusses the evolution of data processing approaches and why directly connecting reporting solutions to source databases is often not ideal.

Direct Connection to Source Database

  • Assumptions: Single relational database (e.g., SQL Server).
  • Example: Sales reports.
  • Challenges:
    • OLTP (Online Transaction Processing) vs. OLAP (Online Analytical Processing).
    • Different workloads and performance requirements.

OLTP vs. OLAP

  • OLTP Characteristics:
    • Transactional, many concurrent users (e.g., customer orders).
    • Fast response times required.
    • Data is normalized to handle transactions efficiently.
    • Focus on small units of data (individual row operations).
  • OLAP Characteristics:
    • Analytical, large data set analysis (e.g., yearly sales comparisons).
    • Tolerant of somewhat longer query times.
    • Data is denormalized for simpler and faster querying (see the query sketch after this list).
    • Supports self-service BI by providing a simple, intuitive data model.
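
To make the contrast concrete, here is a minimal sketch of the same question ("revenue by category for 2023") asked both ways. All table and column names are hypothetical.

```python
# OLTP: normalized schema -- the answer is scattered across several
# tables, so the report query needs a chain of joins.
oltp_query = """
SELECT c.category_name,
       SUM(ol.quantity * ol.unit_price) AS revenue
FROM order_lines ol
JOIN orders     o ON o.order_id    = ol.order_id
JOIN products   p ON p.product_id  = ol.product_id
JOIN categories c ON c.category_id = p.category_id
WHERE o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY c.category_name;
"""

# OLAP: denormalized table -- one wide, pre-joined structure answers
# the same question with a single simple scan.
olap_query = """
SELECT category_name, SUM(revenue) AS revenue
FROM sales_flat
WHERE order_year = 2023
GROUP BY category_name;
"""
```

The denormalized form is also what makes self-service BI feasible: report writers can find everything in one place without knowing the join paths.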

Challenges with Direct Connection

  • The normalized OLTP schema forces complex, join-heavy report queries with slow performance.
  • Heavy report queries risk blocking the OLTP workload that generates the business's income.
  • Security concerns: a business-critical application cannot risk being compromised by reporting access.

Alternatives and Solutions

  • Proofs of Concept: Quick, throwaway reports might justify a direct connection.
  • Read-Only Replica: Offloads report queries and mitigates stress on the primary database.
  • SQL Views: Simplify data access by hiding the joins (sketched after this list), but DBAs might not allow them on the source database.
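
A minimal, runnable sketch of the view approach, using an in-memory SQLite database as a stand-in for the source system (all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two normalized "source" tables.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        order_date TEXT, amount REAL);

-- The view hides the join; the reporting tool queries v_sales
-- as if it were a single flat table.
CREATE VIEW v_sales AS
SELECT o.order_id, o.order_date, o.amount,
       c.name AS customer, c.region
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'West')")
conn.execute("INSERT INTO orders VALUES (10, 1, '2024-06-01', 250.0)")
for row in conn.execute("SELECT * FROM v_sales"):
    print(row)  # (10, '2024-06-01', 250.0, 'Acme', 'West')
```

Note that a view only simplifies the queries; it still runs against the source database, so the performance and blocking concerns above remain.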

Evolution of Data Processing Approaches

Relational Data Warehouse

  • Origin: 1980s, structured data handling.
  • Process:
    • Source to staging area in a relational data warehouse.
    • Transform the data using SQL, then present reports (see the sketch after this list).
  • Modeling:
    • Dimensional modeling (Ralph Kimball) vs. the normalized approach (Bill Inmon).
  • Challenges: Slow to adapt because the schema must be designed up front (schema on write); not suited for big data.
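
As a sketch of the classic staging-to-warehouse transform (Kimball-style; all table names are hypothetical), the SQL typically loads dimensions first, then facts:

```python
# Load a dimension from the staging area, inserting only new products.
load_dim_product = """
INSERT INTO dim_product (product_key, product_name, category)
SELECT s.product_id, s.product_name, s.category
FROM staging_products s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_product d WHERE d.product_key = s.product_id
);
"""

# Load the fact table, deriving the measure from staged columns.
load_fact_sales = """
INSERT INTO fact_sales (date_key, product_key, quantity, revenue)
SELECT s.order_date, s.product_id, s.quantity,
       s.quantity * s.unit_price
FROM staging_order_lines s;
"""
```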

Big Data Challenges

  • Characteristics:
    • Volume, Variety, Velocity.
    • Need for new approach due to unstructured data and real-time processing.

Data Lake

  • Concept: Ingest and store raw files (structured, semi-structured, or unstructured) without a predefined schema.
  • Schema on Read: Define the schema only when the data is read (sketched after this list).
  • Hadoop: Distributed processing framework that made managing big data practical.
  • Limitations: No transaction support, so concurrent readers and writers can see inconsistent data.
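
A minimal PySpark sketch of schema on read, assuming a running Spark environment; the path and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The raw JSON files in the lake carry no enforced schema; we impose
# one only at read time. A different consumer could read the same
# files tomorrow with a different schema.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("product",  StringType()),
    StructField("amount",   DoubleType()),
])

sales = spark.read.schema(schema).json("/datalake/raw/sales/")
sales.groupBy("product").sum("amount").show()
```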

Modern Data Warehouse

  • Hybrid Approach: Combines data lake and relational warehouse benefits.
  • Process:
    • Ingest data into a data lake, process it with Databricks (Spark), then load the results into a relational warehouse (see the pipeline sketch below).
  • Challenges: Data is duplicated between the lake and the warehouse, and security must be managed in both places.
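
A sketch of that pipeline in PySpark (as on Databricks). Paths, names, and connection details are placeholders, and the JDBC write assumes the SQL Server driver is available:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mdw-pipeline").getOrCreate()

# 1. Ingest: raw files have already landed in the data lake.
raw = spark.read.json("/datalake/raw/sales/")

# 2. Process: clean and aggregate with Spark (the Databricks step).
daily = (raw
         .withColumn("order_date", F.to_date("order_ts"))
         .groupBy("order_date", "product")
         .agg(F.sum("amount").alias("revenue")))

# 3. Serve: copy the result into the relational warehouse over JDBC.
(daily.write
      .format("jdbc")
      .option("url", "jdbc:sqlserver://warehouse-host;databaseName=dw")
      .option("dbtable", "dbo.daily_sales")
      .option("user", "etl_user")
      .option("password", "<secret>")
      .mode("append")
      .save())
```

Step 3 is where the duplication challenge comes from: the same rows now live both as files in the lake and as a table in the warehouse, each with its own security model.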

Data Lakehouse

  • New Approach: Combines database features (ACID transactions) with data lake storage using the Delta format.
  • Benefits: A single copy of the data, plus a SQL endpoint for user-friendly access (sketched below).
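
A minimal lakehouse sketch, assuming a Spark environment with Delta Lake available (e.g., Databricks); paths and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse").getOrCreate()

# Writing in the Delta format adds an ACID transaction log on top of
# the files in the lake -- database behavior without a database.
raw = spark.read.json("/datalake/raw/sales/")
raw.write.format("delta").mode("overwrite").save("/datalake/delta/sales")

# Register the single copy of the data as a table; BI tools can then
# reach it through a SQL endpoint without copying it into a warehouse.
spark.sql(
    "CREATE TABLE IF NOT EXISTS sales "
    "USING DELTA LOCATION '/datalake/delta/sales'"
)
spark.sql("SELECT product, SUM(amount) AS revenue "
          "FROM sales GROUP BY product").show()
```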

Data Mesh

  • Concept: Decentralized data ownership across business domains.
  • Principles:
    • Domain Ownership: Each domain handles its data processing.
    • Data as a Product: High-quality, tested, and reliable output.
    • Self-Serve Infrastructure: Domains provision their own resources.
    • Federated Governance: Standards and policies to maintain consistency.

Conclusion

  • Overview of big data history and processing approaches.
  • Highlighted the transitions and reasons behind evolving data architectures.
  • Emphasis on adapting to modern methods for handling complex data needs.