Understanding Distributed Systems and MapReduce

Oct 19, 2024

Distributed Systems Lecture Notes

Introduction to Distributed Systems

Definition: A distributed system consists of a set of cooperating computers that communicate over a network to accomplish tasks.
Examples:
- Storage solutions for large websites
- Big data computations (e.g., MapReduce)
- Peer-to-peer file sharing

Importance of Distributed Systems

Critical infrastructure relies on distributed systems because they can spread work across many computers when one machine is not enough.
Designing systems: Always consider first whether a problem can be solved on a single computer; distributing it adds complexity.

Reasons to Use Distributed Systems

High Performance: Achieved through parallelism, with many CPUs, memories, and disks working at the same time.
Fault Tolerance: Redundancy allows systems to continue functioning even when one part fails.
Natural Distribution: Some tasks require geographical distribution (e.g., interbank transfers).
Security: Isolating computation can mitigate risks from untrusted code.

Challenges in Distributed Systems

Concurrent Programming: Complexity arises from multiple parts executing simultaneously.
Unexpected Failure Patterns: Failure can be partial; some components may fail while others continue functioning.
Achieving Performance Goals: Getting many computers to deliver proportionally more performance than one takes careful design.
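
As a small single-machine illustration of the concurrency challenge, the sketch below has several threads update one shared counter. The names `Counter` and `run_workers` are hypothetical, and a lock is only one way to coordinate the updates. Across machines the same problem appears without the convenience of a local lock.

```python
import threading


class Counter:
    """Shared counter; the lock makes each increment atomic."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write below could interleave
        # between threads and lose updates.
        with self._lock:
            self.value += 1


def run_workers(n_threads, n_increments):
    """Start n_threads threads that each increment the counter n_increments times."""
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


total = run_workers(4, 1000)  # 4 threads x 1000 increments each
```

With the lock in place, `total` is always 4000. Removing it makes the result depend on thread scheduling.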

Course Structure

Components:
- Lectures
- Paper readings (one per week)
- Two exams
- Labs focused on building distributed systems
- Optional final project in place of Lab 4
Assessment:
- Labs are the most significant component of the grade.

Topics Covered in Course

Storage Systems: Focus on well-defined abstractions and building replicated, fault-tolerant implementations.
Computation Systems: Discuss systems like MapReduce.
Communication: Treated as a tool for building distributed systems rather than as a main topic.

Core Concepts

Scalability: The ability of a system to handle increased load by adding resources (typically more machines).
- Example: Ideally, doubling the resources doubles the performance (throughput).
Fault Tolerance: Systems must be designed to handle failures gracefully.
- Availability: Continuity of service despite failures.
- Recoverability: Systems can return to operational status after failures.
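
The doubling example for scalability can be checked with simple arithmetic. The sketch below compares the measured speedup against the increase in resources. The function name `scaling_efficiency` and the throughput figures are made up for illustration; a value of 1.0 means perfectly linear scaling.

```python
def scaling_efficiency(base_throughput, base_machines, new_throughput, new_machines):
    """Return achieved speedup divided by the factor by which resources grew."""
    speedup = new_throughput / base_throughput
    resource_ratio = new_machines / base_machines
    return speedup / resource_ratio


# Hypothetical measurements (requests per second), not real data.
ideal = scaling_efficiency(1000, 10, 2000, 20)     # doubled machines, doubled throughput
sublinear = scaling_efficiency(1000, 10, 1500, 20)  # doubled machines, 1.5x throughput
```

Here `ideal` is 1.0 and `sublinear` is 0.75, meaning a quarter of the added capacity was lost to some bottleneck.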

Consistency in Distributed Systems

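Key Operations: Put (store a value under a key) and Get (retrieve the value for a key) must have precisely defined semantics, because replicated data can exist in more than one version.
Types:
- Strong Consistency: A Get is guaranteed to see the value written by the most recent Put.
- Weak Consistency: A Get may return an older value, for example because a replica has not yet received the latest Put.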
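
The difference between the two guarantees can be seen in a toy in-memory model. This is a minimal sketch, assuming one primary copy and one replica that is updated only when `replicate` is called. The class and method names (`ReplicatedStore`, `get_strong`, `get_weak`) are hypothetical and do not come from any real system.

```python
class ReplicatedStore:
    """Toy key-value store: one primary and one lagging replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes applied to the primary but not yet to the replica

    def put(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def get_strong(self, key):
        # Always reads the primary, so it sees the most recent put.
        return self.primary.get(key)

    def get_weak(self, key):
        # Reads the replica, which may be behind the primary.
        return self.replica.get(key)

    def replicate(self):
        # Apply pending writes to the replica in the order they were issued.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()


store = ReplicatedStore()
store.put("x", 1)
store.replicate()
store.put("x", 2)                    # not yet replicated
strong_read = store.get_strong("x")  # 2
stale_read = store.get_weak("x")     # 1, the old value
store.replicate()
fresh_read = store.get_weak("x")     # 2, once replication catches up
```

Reading from the replica is cheaper in a real deployment because it avoids coordinating with the primary, which is why weak consistency is often accepted despite stale reads.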

MapReduce Overview

Purpose: A framework to simplify running computations on large datasets across many machines.
Operation Steps:
1. Map Phase: Process input data in parallel, producing key-value pairs.
2. Shuffle Phase: Group the intermediate pairs by key and transfer each group to the reduce task responsible for it.
3. Reduce Phase: Aggregate the data to produce final output.
Example Use Case: Counting occurrences of words in large documents.
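
The three phases can be sketched for the word-count use case in a single process. This is an illustration only, assuming whitespace tokenization and lowercase folding. The names `map_fn`, `shuffle`, `reduce_fn`, and `run_word_count` are hypothetical, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: sum the counts collected for one word."""
    return sum(values)


def run_word_count(documents):
    """Run map, shuffle, and reduce sequentially over a dict of name -> text."""
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    grouped = shuffle(intermediate)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


counts = run_word_count({"a.txt": "the cat the dog", "b.txt": "The end"})
```

For these two tiny documents, `counts["the"]` is 3 and every other word has a count of 1.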

Implementation Details of MapReduce

Map Function: Iterates over input to produce intermediate key-value pairs.
Reduce Function: Aggregates values for each unique key produced by the map phase.
Data Management:
- Input data is stored in a distributed file system (e.g., the Google File System, GFS).
- After the reduce phase, the output is written back to the file system.
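
How intermediate data reaches reduce tasks and ends up as output files can be sketched as follows. Several assumptions apply: a local temporary directory stands in for the distributed file system, the file names `mr-out-<i>` are hypothetical, and `zlib.crc32` is a stand-in for whatever hash a real framework uses. The original MapReduce paper describes a default partitioning of hash(key) mod R, where R is the number of reduce tasks.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def partition(key, n_reduce):
    """Pick the reduce task for a key; crc32 is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % n_reduce


def run_reduce_tasks(intermediate, n_reduce, out_dir):
    """Route (key, count) pairs to reduce tasks and write one file per task."""
    buckets = [defaultdict(list) for _ in range(n_reduce)]
    for key, value in intermediate:
        buckets[partition(key, n_reduce)][key].append(value)

    paths = []
    for i, groups in enumerate(buckets):
        path = os.path.join(out_dir, f"mr-out-{i}")
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(groups):
                # Each reduce task sums the values for the keys it owns.
                f.write(f"{key} {sum(groups[key])}\n")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    pairs = [("the", 1), ("cat", 1), ("the", 1), ("dog", 1)]
    output_files = [os.path.basename(p) for p in run_reduce_tasks(pairs, 3, out_dir)]
```

Because every occurrence of a key hashes to the same task, each word appears in exactly one output file, and the full result is the union of all files.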

Conclusion

Next Steps: Labs will implement a simplified version of MapReduce.
Discussion of Future Topics: Consider evolution and new frameworks beyond MapReduce.
