
Understanding Distributed Systems and MapReduce

Oct 19, 2024

Distributed Systems Lecture Notes

Introduction to Distributed Systems

  • Definition: A distributed system consists of a set of cooperating computers that communicate over a network to accomplish tasks.
  • Examples:
    • Storage solutions for large websites
    • Big data computations (e.g., MapReduce)
    • Peer-to-peer file sharing

Importance of Distributed Systems

  • Much critical infrastructure is built on distributed systems, because a single computer often cannot provide the capacity or reliability required.
  • Design advice: First ask whether the problem can be solved on a single computer; a distributed design adds significant complexity.

Reasons to Use Distributed Systems

  1. High Performance: Achieved through parallelism, with many CPUs, memories, and disks working at the same time.
  2. Fault Tolerance: Redundancy allows systems to continue functioning even when one part fails.
  3. Natural Distribution: Some problems are physically distributed by nature (e.g., a transfer between two banks, each with its own computers).
  4. Security: Isolating computation can mitigate risks from untrusted code.

Challenges in Distributed Systems

  • Concurrent Programming: Complexity arises from multiple parts executing simultaneously.
  • Unexpected Failure Patterns: Failure can be partial; some components may fail while others continue functioning.
  • Achieving Performance Goals: Designing systems to effectively utilize multiple computers can be complex.
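
A minimal sketch of handling partial failure: a coordinator hands a task to workers, some of which are down, and moves on to the next worker when one does not respond. The names `Worker`, `WorkerDown`, and `run_with_failover` are hypothetical, and failure is simulated with a flag rather than a real network timeout.

```python
class WorkerDown(Exception):
    """Raised when a simulated worker does not respond."""


class Worker:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def run(self, task):
        # A real system would detect failure through a timeout on a network
        # call; here a flag stands in for that (assumption for illustration).
        if not self.alive:
            raise WorkerDown(self.name)
        return task()


def run_with_failover(task, workers):
    """Try each worker in turn until one completes the task."""
    failed = []
    for worker in workers:
        try:
            return worker.run(task), worker.name, failed
        except WorkerDown:
            failed.append(worker.name)
    raise RuntimeError(f"all workers failed: {failed}")


workers = [Worker("w1", alive=False), Worker("w2", alive=False), Worker("w3")]
result, served_by, failed = run_with_failover(lambda: sum(range(10)), workers)
print(result, served_by, failed)  # 45 w3 ['w1', 'w2']
```

Two of the three workers fail, yet the task still completes. That is the partial-failure case described above: the system as a whole keeps working while some components do not.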

Course Structure

  • Components:
    • Lectures
    • Paper readings (one per week)
    • Two exams
    • Labs focused on building distributed systems
    • Optional final project instead of lab 4
  • Assessment:
    • Labs are the most significant component of the grade.

Topics Covered in Course

  • Storage Systems: Focus on well-defined abstractions and building replicated, fault-tolerant implementations.
  • Computation Systems: Discuss systems like MapReduce.
  • Communication: Treated as a tool for building distributed systems rather than as a topic in its own right.

Core Concepts

  • Scalability: The ability of a system to handle increased load by adding resources.
    • Example: Doubling the number of machines should ideally double the throughput.
  • Fault Tolerance: Systems must be designed to handle failures gracefully.
    • Availability: Continuity of service despite failures.
    • Recoverability: Systems can return to operational status after failures.
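
A rough way to put numbers on scalability and availability. `scaling_efficiency` compares the achieved speedup with the ideal linear one, and `availability` estimates the chance that at least one replica is up. The function names and all figures are illustrative, and the availability formula assumes replicas fail independently, which real deployments often violate.

```python
def scaling_efficiency(base_throughput, base_machines, new_throughput, new_machines):
    """Achieved speedup divided by ideal (linear) speedup; 1.0 means perfect scaling."""
    achieved = new_throughput / base_throughput
    ideal = new_machines / base_machines
    return achieved / ideal


def availability(per_replica_uptime, replicas):
    """Probability that at least one replica is up, assuming independent failures."""
    return 1.0 - (1.0 - per_replica_uptime) ** replicas


# Hypothetical measurement: going from 10 to 20 machines raised throughput
# from 1000 to 1800 requests per second.
print(scaling_efficiency(1000, 10, 1800, 20))  # 0.9

# Three replicas, each up 99% of the time (made-up figure).
print(availability(0.99, 3))
```

An efficiency below 1.0 means some resource other than machine count is limiting the system. The availability estimate is an upper bound in practice, since correlated failures (shared power, shared software bugs) break the independence assumption.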

Consistency in Distributed Systems

  • Key Operations: Put (store) and Get (retrieve) operations must have defined semantics.
  • Types:
    • Strong Consistency: A Get is guaranteed to return the value written by the most recent Put.
    • Weak Consistency: A Get may return a stale value, for example because a replica has not yet received the latest Put.
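
The difference between the two can be seen in a toy, single-process sketch. `ReplicatedKV`, `get_strong`, `get_weak`, and `replicate` are hypothetical names, and the replication delay is modelled by an explicit call rather than by a network.

```python
class ReplicatedKV:
    """Toy key-value store with one primary and one lagging replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes not yet applied to the replica

    def put(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))  # replication happens later

    def replicate(self):
        """Apply queued writes to the replica, in order."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def get_strong(self, key):
        # Always reads the primary, so it sees the most recent Put.
        return self.primary.get(key)

    def get_weak(self, key):
        # Reads the replica, which may lag behind the primary.
        return self.replica.get(key)


kv = ReplicatedKV()
kv.put("x", 1)
kv.replicate()
kv.put("x", 2)
print(kv.get_strong("x"))  # 2
print(kv.get_weak("x"))    # 1 (stale until replicate() runs)
kv.replicate()
print(kv.get_weak("x"))    # 2
```

Reading only from the primary is one simple way to get strong reads in this toy model. It gives up the benefit of serving reads from nearby replicas, which is the usual reason systems accept weaker consistency.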

MapReduce Overview

  • Purpose: A framework to simplify running computations on large datasets across many machines.
  • Operation Steps:
    1. Map Phase: Process input data in parallel, producing key-value pairs.
    2. Shuffle Phase: Group the intermediate pairs by key and transfer each group to the reduce task responsible for it.
    3. Reduce Phase: Aggregate the data to produce final output.
  • Example Use Case: Counting occurrences of words in large documents.
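
The three phases can be sketched for the word-count use case in a single process. This is only an illustration of the data flow, not a distributed implementation; `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are names chosen for this sketch.

```python
from collections import defaultdict


def map_fn(document_name, text):
    """Map: emit (word, 1) for every word in one input document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    return sum(counts)


def word_count(documents):
    intermediate = []
    # In a real deployment each map_fn call could run on a different machine.
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    grouped = shuffle(intermediate)
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}


docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog and the fox"}
print(word_count(docs))
# {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}
```

The programmer writes only `map_fn` and `reduce_fn`; everything else is the framework's job, which is the simplification MapReduce aims for.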

Implementation Details of MapReduce

  • Map Function: Iterates over input to produce intermediate key-value pairs.
  • Reduce Function: Aggregates values for each unique key produced by the map phase.
  • Data Management:
    • Input data is stored in a distributed file system (e.g., GFS).
    • Output is also stored in the file system post-reduction.
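
A sketch of how intermediate data might be routed from map tasks to reduce tasks. Each map task splits its output into one bucket per reduce task, and each reduce task reads its bucket from every map task. The original MapReduce paper describes a default partitioning of hash(key) mod R; the use of `zlib.crc32` as that hash, the value of `N_REDUCE`, and all function names here are assumptions for illustration. In-memory lists stand in for files in the distributed file system.

```python
import zlib
from collections import defaultdict

N_REDUCE = 3  # number of reduce tasks; illustrative value


def partition(key, n_reduce=N_REDUCE):
    """Pick the reduce task for a key using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % n_reduce


def run_map_task(text, n_reduce=N_REDUCE):
    """One map task: emit (word, 1) pairs, split into one bucket per reduce task."""
    buckets = [[] for _ in range(n_reduce)]
    for word in text.split():
        buckets[partition(word, n_reduce)].append((word, 1))
    return buckets


def run_reduce_task(r, all_map_outputs):
    """One reduce task: read bucket r from every map task and sum per key."""
    totals = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets[r]:
            totals[key] += value
    return dict(totals)


splits = ["a b a", "b c a"]  # stand-ins for input files
map_outputs = [run_map_task(s) for s in splits]
reduce_outputs = [run_reduce_task(r, map_outputs) for r in range(N_REDUCE)]

# Every key lands in exactly one reduce output, so merging cannot collide.
merged = {k: v for out in reduce_outputs for k, v in out.items()}
print(merged)
```

Python's built-in `hash` is randomized per process for strings, so a stable hash is used instead: all map tasks must agree on which reduce task owns a given key.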

Conclusion

  • Next Steps: Labs will implement a simplified version of MapReduce.
  • Future Topics: How the approach has evolved, and newer frameworks beyond MapReduce.