
Introduction to Hadoop and Its Components

Jul 8, 2024


Pre-Digital Era Data

  • Small amounts of data generated slowly
  • Mostly structured (documents, rows, columns)
  • Single storage unit and processor were sufficient

Rise of Big Data

  • The internet led to vast amounts of data in multiple formats
    • Emails, images, audio, and video (semi-structured and unstructured)
  • Storing and handling this data became impractical with traditional methods

Need for a New Solution

  • Required multiple storage units and processors working together
  • The Hadoop framework was developed for efficient distributed storage and processing

Components of Hadoop

  1. Hadoop Distributed File System (HDFS)

    • Storage unit of Hadoop
    • Data stored in blocks across multiple machines (default block size: 128 MB)
    • Replication ensures fault tolerance (default replication factor: 3); see the HDFS sketch after this list
  2. MapReduce

    • Data processing unit of Hadoop
    • Splits data into parts, processes separately, then aggregates results
    • Example: counting the occurrences of each word in a text
      • Split the input into parts
      • Mapper phase: emit a count for each word
      • Shuffle and sort to group identical words
      • Reducer phase: sum the counts for each grouped word
      • Aggregated results form the final output
    • Improves load balancing and time efficiency (see the WordCount sketch after this list)
  3. YARN (Yet Another Resource Negotiator)

    • Manages cluster resources: RAM, network bandwidth, CPU
    • Components:
      • Resource Manager: assigns cluster resources to applications
      • Node Manager: manages individual nodes and monitors their resource usage
      • Containers: hold a slice of a node's physical resources (CPU, RAM)
      • Application Master: requests containers from the Resource Manager to run a job's tasks (see the YARN client sketch after this list)
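
The block and replication settings mentioned under HDFS can be inspected programmatically. Below is a minimal, illustrative Java sketch using Hadoop's FileSystem API; the class name, the file path argument, and the replication override to 5 copies are hypothetical examples, and the printed values simply reflect whatever the cluster's configuration defines (128 MB blocks and 3 replicas by default).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Connects to the HDFS cluster described by core-site.xml / hdfs-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]); // hypothetical HDFS path passed on the command line

    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size (bytes): " + status.getBlockSize());   // 134217728 = 128 MB by default
    System.out.println("Replication factor: " + status.getReplication()); // 3 by default

    // Per-file override: ask HDFS to keep 5 copies of this file instead of the default 3
    fs.setReplication(file, (short) 5);

    fs.close();
  }
}
```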
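
To make the word-count walkthrough concrete, here is the canonical Hadoop WordCount job in Java: the Mapper emits (word, 1) pairs, the framework shuffles and groups them by word, and the Reducer sums the counts. This is a sketch of the standard pattern rather than anything specific to the lesson; the input and output paths are hypothetical command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer phase: sum the counts for each grouped word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregates on each mapper node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with something along the lines of `hadoop jar wordcount.jar WordCount /input /output`, where the jar name and paths are placeholders.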
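
For a feel of how the Resource Manager and Node Managers fit together, the sketch below uses the YARN client API to ask the Resource Manager for a report of the running nodes and the resources they hold. It is only illustrative and assumes a reachable cluster whose settings are picked up from the default YarnConfiguration.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the Resource Manager for a report of all running Node Managers
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId()
          + " capability: " + node.getCapability()  // total memory/vcores on the node
          + " used: " + node.getUsed());            // resources currently held by containers
    }

    yarnClient.stop();
  }
}
```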

Hadoop Ecosystem

  • Additional tools and frameworks for managing, processing, and analyzing data
  • Components include:
    • Hive
    • Pig
    • Apache Spark
    • Flume
    • Sqoop

Applications of Hadoop

  • Data warehousing
  • Recommendation systems
  • Fraud detection

Interactive Question

  • Question: What is the advantage of the 3x replication scheme in HDFS?
    • a) Supports parallel processing
    • b) Faster data analysis
    • c) Ensures fault tolerance
    • d) Manages cluster resources

Conclusion

  • Hadoop is a game changer for businesses (used by Facebook, IBM, eBay, Amazon)