
Scraping Amazon Data to CSV

Feb 5, 2025

Web Scraping Amazon Data to CSV

Overview

  • Goal: Scrape product data from the Amazon website and convert it into CSV format.
  • Context: As a Data Engineer, writing ETL (Extract, Transform, Load) pipelines is common.
  • Focus: One method of extraction: Web Scraping.

Introduction

  • Host: Tarshil, freelance data engineer.
  • Channel focus: Data engineering, freelancing, productivity.
  • Reminder to subscribe and like if content is helpful.

Understanding HTML Structure

  • HTML: Hypertext Markup Language.
  • Document type:
    • Declared at the top of the document (the document type declaration, e.g., <!DOCTYPE html>).
  • Tags: Building blocks of HTML; e.g., <h1> for headings, <p> for paragraphs.
  • Importance: Understanding basic HTML structure is necessary before web scraping.

Amazon HTML Structure

  • Example: Displaying price on Amazon.
    • Key components:
      • Tag: <span>
      • Attribute: Contains the CSS class used for formatting.
      • Content: The actual text displayed (e.g., price).
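
A minimal sketch of those three parts (tag, attribute, content), using Beautiful Soup to parse a simplified price span; the class name here is illustrative, not necessarily what Amazon renders today:

    from bs4 import BeautifulSoup

    # A simplified stand-in for how a price span can appear in Amazon's HTML.
    html = '<span class="a-price-whole">1,299</span>'

    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span")
    print(span.name)              # 'span'             -> the tag
    print(span.get("class"))      # ['a-price-whole']  -> the attribute
    print(span.text.strip())      # '1,299'            -> the content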

Prerequisites

  1. Stable Internet Connection
  2. Laptop
  3. Python: Version 3.5 or higher.
    • Installation tutorial available.
  4. Jupyter Notebook:
    • Installation tutorial available.

Required Packages

  • Install packages before execution:
    1. Beautiful Soup: For parsing HTML.
      • Install via: pip install bs4 (or pip install beautifulsoup4)
    2. Requests: To send HTTP requests to Amazon.
      • Install via: pip install requests
    3. Pandas: To convert the scraped data into CSV format.
      • Install via: pip install pandas

Steps for Web Scraping

  1. Import Packages:
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
  2. Send HTTP Request:
    • Access the Amazon website and search for a product (e.g., gaming items).
    • Use Inspect Element to understand the HTML structure and locate the desired elements.
    • An end-to-end sketch of steps 2 to 6 follows this list.
  3. Extract Product Links:
    • Use the find_all method on <a> tags to gather the product links.
  4. Send Request to Links:
    • Build each full link by prepending the base URL (e.g., https://www.amazon.com) to the partial href.
    • Use Requests to get the HTML of each product page.
  5. Parse HTML Using Beautiful Soup:
    • Convert HTML into a Beautiful Soup object.
  6. Extract Product Data:
    • Use the find or find_all methods to get specific information like titles, prices, ratings, etc.
    • Use .text.strip() to get clean text without extra whitespace.
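
An end-to-end sketch of steps 2 to 6, assuming a browser-like User-Agent header (Amazon often blocks the default Requests one) and illustrative class/id names (a-link-normal, productTitle) that you should confirm by inspecting the live page:

    from bs4 import BeautifulSoup
    import requests

    # Browser-like headers; without them Amazon often returns an error page.
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    }

    BASE_URL = "https://www.amazon.com"
    search_url = BASE_URL + "/s?k=gaming+mouse"   # example search query

    # Step 2: send the HTTP request and parse the search results page.
    response = requests.get(search_url, headers=HEADERS)
    soup = BeautifulSoup(response.content, "html.parser")

    # Step 3: gather partial product links from <a> tags
    # (the class name is illustrative; inspect the page to confirm it).
    links = soup.find_all("a", attrs={"class": "a-link-normal"})
    product_urls = [BASE_URL + link.get("href") for link in links]

    # Steps 4 to 6: request each product page and extract data from it.
    for url in product_urls[:3]:                  # limit for the example
        page = requests.get(url, headers=HEADERS)
        product_soup = BeautifulSoup(page.content, "html.parser")

        title_tag = product_soup.find("span", attrs={"id": "productTitle"})
        if title_tag is not None:
            print(title_tag.text.strip())         # clean text, no extra whitespace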

Code Structure

  • Create functions for extracting different pieces of data (title, price, ratings, etc.).
  • Use a try-except block to handle errors gracefully.
  • Store results in a dictionary and convert it into a DataFrame using Pandas.
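
A sketch of that structure, with illustrative extraction functions (the selectors are assumptions, not confirmed from the video) and the dictionary-to-DataFrame step; the stand-in HTML strings just keep the snippet runnable on its own:

    from bs4 import BeautifulSoup
    import pandas as pd

    def get_title(soup):
        # Return the product title, or an empty string if the tag is missing.
        try:
            return soup.find("span", attrs={"id": "productTitle"}).text.strip()
        except AttributeError:
            return ""

    def get_price(soup):
        # Return the displayed price, or an empty string if the tag is missing.
        try:
            return soup.find("span", attrs={"class": "a-offscreen"}).text.strip()
        except AttributeError:
            return ""

    # Stand-in pages; in the real pipeline these are the product-page soups
    # built in the request loop above.
    pages = [
        '<span id="productTitle"> Gaming Mouse </span><span class="a-offscreen">$29.99</span>',
        '<span id="productTitle"> Gaming Keyboard </span>',   # price intentionally missing
    ]
    soups = [BeautifulSoup(html, "html.parser") for html in pages]

    # Store results in a dictionary, then convert it into a DataFrame.
    data = {"title": [], "price": []}
    for product_soup in soups:
        data["title"].append(get_title(product_soup))
        data["price"].append(get_price(product_soup))

    df = pd.DataFrame.from_dict(data)
    print(df)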

Final Steps

  • Save the DataFrame to a CSV file.
  • Make sure all the necessary libraries are imported (e.g., NumPy, if needed).
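
A minimal sketch of the save step, assuming the DataFrame from the previous sketch (a tiny stand-in is constructed here so the snippet runs on its own) and an example file name of amazon_data.csv:

    import numpy as np
    import pandas as pd

    # Stand-in for the DataFrame produced by the scraping code above.
    df = pd.DataFrame({"title": ["Gaming Mouse", ""], "price": ["$29.99", ""]})

    # NumPy marks empty titles as missing values so those rows can be dropped.
    df["title"] = df["title"].replace("", np.nan)
    df = df.dropna(subset=["title"])

    # Write the cleaned data to a CSV file (the file name is just an example).
    df.to_csv("amazon_data.csv", index=False, header=True)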

Conclusion

  • Summary of the entire process from understanding HTML to saving the data.
  • Encouragement to revisit parts of the tutorial if needed.