Understanding Data Processing Architectures
Sep 2, 2024
Lecture Notes: Why Not Connect Reporting Solutions Directly to Source Databases?
Introduction
Discusses the evolution of data processing approaches and why directly connecting reporting solutions to source databases is often not ideal.
Direct Connection to Source Database
Assumptions: Single relational database (e.g., SQL Server).
Example: Sales reports.
Challenges:
OLTP (Online Transactional Processing) vs. OLAP (Online Analytical Processing).
Different workloads and performance requirements.
OLTP vs. OLAP
OLTP Characteristics:
Transactional, many concurrent users (e.g., customer orders).
Fast response time required.
Data is normalized to handle transactions efficiently.
Focus on small data units (individual row operations).
OLAP Characteristics:
Analytical, large data set analysis (e.g., yearly sales comparison).
Tolerant to slightly longer query times.
Data is denormalized for simpler and faster querying.
Supports self-service BI by providing a simple, intuitive data model.
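The normalized-vs-denormalized contrast above can be sketched concretely. This is a minimal illustration using Python's sqlite3 standing in for SQL Server; all table and column names (customers, orders, sales_flat, etc.) are hypothetical, not from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP shape: normalized -- each fact stored once, analysis needs joins.
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        product_id INTEGER, quantity INTEGER);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO products  VALUES (10, 'Widget', 2.50), (11, 'Gadget', 4.00);
INSERT INTO orders    VALUES (100, 1, 10, 3), (101, 2, 11, 1);
""")

# Analytical question on normalized data: revenue per customer needs two joins.
normalized_q = """
SELECT c.name, SUM(o.quantity * p.price) AS revenue
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products  p ON p.product_id  = o.product_id
GROUP BY c.name ORDER BY c.name;
"""
print(cur.execute(normalized_q).fetchall())  # [('Ada', 7.5), ('Grace', 4.0)]

# OLAP shape: denormalized -- one wide table answers the same question, no joins.
cur.executescript("""
CREATE TABLE sales_flat (customer_name TEXT, product_name TEXT, revenue REAL);
INSERT INTO sales_flat
SELECT c.name, p.name, o.quantity * p.price
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products  p ON p.product_id  = o.product_id;
""")
denormalized_q = """
SELECT customer_name, SUM(revenue) FROM sales_flat
GROUP BY customer_name ORDER BY customer_name;
"""
print(cur.execute(denormalized_q).fetchall())  # same result, simpler query
```

Both queries return the same numbers; the point is that the denormalized query is simple enough for self-service BI users to write.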
Challenges with Direct Connection
Complex queries and slow performance due to normalized data structure.
Risk of blocking OLTP database operations (the business's main source of income).
Security concerns: Critical applications cannot be compromised.
Alternatives and Solutions
Proof of Concept: Quick reports might use a direct connection.
Read-Only Replica: Mitigates stress on the primary database.
SQL Views: Simplify data access, but might not be allowed by DBAs.
Evolution of Data Processing Approaches
Relational Data Warehouse
Origin: 1980s, structured data handling.
Process:
Source to staging area in a relational data warehouse.
Transform data using SQL, present reports.
Modeling:
Dimensional modeling (Ralph Kimball) vs. the Inmon approach.
Challenges: Slow to adapt due to the upfront schema requirement; not suited for big data.
Big Data Challenges
Characteristics:
Volume, Variety, Velocity.
Need for new approach due to unstructured data and real-time processing.
Data Lake
Concept: Ingest and store unstructured data files.
Schema on Read: Define the schema when reading the data.
Hadoop: Distributed processing framework facilitating big data management.
Limitations: Transaction support and concurrency issues.
Modern Data Warehouse
Hybrid Approach: Combines data lake and relational warehouse benefits.
Process:
Ingest data into a data lake, process with Databricks, then store in a warehouse.
Challenges: Data duplication, security management.
Data Lakehouse
New Approach: Combines database and data lake features using the Delta format.
Benefits: Single data copy; SQL endpoint for user-friendly access.
Data Mesh
Concept: Decentralized data ownership across business domains.
Principles:
Domain Ownership: Each domain handles its data processing.
Data as a Product: High-quality, tested, and reliable output.
Self-Serve Infrastructure: Domains provision their own resources.
Federated Governance: Standards and policies to maintain consistency.
Conclusion
Overview of big data history and processing approaches.
Highlighted the transitions and reasons behind evolving data architectures.
Emphasis on adapting to modern methods for handling complex data needs.