
Scraping Amazon Data to CSV

Feb 5, 2025

Web Scraping Amazon Data to CSV

Overview

  • Goal: Scrape product data from the Amazon website and convert it into CSV format.
  • Context: As a Data Engineer, writing ETL (Extract, Transform, Load) pipelines is common.
  • Focus: One method of extraction: Web Scraping.

Introduction

  • Host: Tarshil, freelance data engineer.
  • Channel focus: Data engineering, freelancing, productivity.
  • Reminder to subscribe and like if content is helpful.

Understanding HTML Structure

  • HTML: Hypertext Markup Language.
  • Document type:
    • Declared at the top of the document (the document type declaration, e.g., <!DOCTYPE html>).
  • Tags: Building blocks of HTML; e.g., <h1> for headings, <p> for paragraphs.
  • Importance: Understanding basic HTML structure is necessary before web scraping.

Amazon HTML Structure

  • Example: Displaying price on Amazon.
    • Key components:
      • Tag: <span>
      • Attribute: Contains the CSS class used for formatting.
      • Content: The actual text displayed (e.g., price).
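
A minimal sketch of those three parts (tag, attribute, content), using Beautiful Soup to parse a simplified price span; the class name here is illustrative, not necessarily what Amazon renders today:

    from bs4 import BeautifulSoup

    # A simplified stand-in for how a price span can appear in Amazon's HTML.
    html = '<span class="a-price-whole">1,299</span>'

    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span")
    print(span.name)              # 'span'             -> the tag
    print(span.get("class"))      # ['a-price-whole']  -> the attribute
    print(span.text.strip())      # '1,299'            -> the content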

Prerequisites

  1. Stable Internet Connection
  2. Laptop
  3. Python: Version 3.5 or higher.
    • Installation tutorial available.
  4. Jupyter Notebook:
    • Installation tutorial available.

Required Packages

  • Install packages before execution:
    1. Beautiful Soup: For parsing HTML.
      • Install via: pip install bs4 (or pip install beautifulsoup4)
    2. Requests: To send HTTP requests to Amazon.
      • Install via: pip install requests
    3. Pandas: To convert the scraped data into CSV format.
      • Install via: pip install pandas

Steps for Web Scraping

  1. Import Packages:
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
  2. Send HTTP Request:
    • Access the Amazon website and search for a product (e.g., gaming items).
    • Use Inspect Element to understand the HTML structure and locate the desired elements.
    • An end-to-end sketch of steps 2 to 6 follows this list.
  3. Extract Product Links:
    • Use the find_all method on <a> tags to gather the product links.
  4. Send Request to Links:
    • Build each full link by prepending the base URL (e.g., https://www.amazon.com) to the partial href.
    • Use Requests to get the HTML of each product page.
  5. Parse HTML Using Beautiful Soup:
    • Convert HTML into a Beautiful Soup object.
  6. Extract Product Data:
    • Use the find or find_all methods to get specific information like titles, prices, ratings, etc.
    • Use .text.strip() to get clean text without extra whitespace.
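
An end-to-end sketch of steps 2 to 6, assuming a browser-like User-Agent header (Amazon often blocks the default Requests one) and illustrative class/id names (a-link-normal, productTitle) that you should confirm by inspecting the live page:

    from bs4 import BeautifulSoup
    import requests

    # Browser-like headers; without them Amazon often returns an error page.
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    }

    BASE_URL = "https://www.amazon.com"
    search_url = BASE_URL + "/s?k=gaming+mouse"   # example search query

    # Step 2: send the HTTP request and parse the search results page.
    response = requests.get(search_url, headers=HEADERS)
    soup = BeautifulSoup(response.content, "html.parser")

    # Step 3: gather partial product links from <a> tags
    # (the class name is illustrative; inspect the page to confirm it).
    links = soup.find_all("a", attrs={"class": "a-link-normal"})
    product_urls = [BASE_URL + link.get("href") for link in links]

    # Steps 4 to 6: request each product page and extract data from it.
    for url in product_urls[:3]:                  # limit for the example
        page = requests.get(url, headers=HEADERS)
        product_soup = BeautifulSoup(page.content, "html.parser")

        title_tag = product_soup.find("span", attrs={"id": "productTitle"})
        if title_tag is not None:
            print(title_tag.text.strip())         # clean text, no extra whitespace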

Code Structure

  • Create functions for extracting different pieces of data (title, price, ratings, etc.).
  • Use a try-except block to handle errors gracefully.
  • Store results in a dictionary and convert it into a DataFrame using Pandas.
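
A sketch of that structure, with illustrative extraction functions (the selectors are assumptions, not confirmed from the video) and the dictionary-to-DataFrame step; the stand-in HTML strings just keep the snippet runnable on its own:

    from bs4 import BeautifulSoup
    import pandas as pd

    def get_title(soup):
        # Return the product title, or an empty string if the tag is missing.
        try:
            return soup.find("span", attrs={"id": "productTitle"}).text.strip()
        except AttributeError:
            return ""

    def get_price(soup):
        # Return the displayed price, or an empty string if the tag is missing.
        try:
            return soup.find("span", attrs={"class": "a-offscreen"}).text.strip()
        except AttributeError:
            return ""

    # Stand-in pages; in the real pipeline these are the product-page soups
    # built in the request loop above.
    pages = [
        '<span id="productTitle"> Gaming Mouse </span><span class="a-offscreen">$29.99</span>',
        '<span id="productTitle"> Gaming Keyboard </span>',   # price intentionally missing
    ]
    soups = [BeautifulSoup(html, "html.parser") for html in pages]

    # Store results in a dictionary, then convert it into a DataFrame.
    data = {"title": [], "price": []}
    for product_soup in soups:
        data["title"].append(get_title(product_soup))
        data["price"].append(get_price(product_soup))

    df = pd.DataFrame.from_dict(data)
    print(df)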

Final Steps

  • Save the DataFrame to a CSV file.
  • Make sure all the necessary libraries are imported (e.g., NumPy, if needed).
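
A minimal sketch of the save step, assuming the DataFrame from the previous sketch (a tiny stand-in is constructed here so the snippet runs on its own) and an example file name of amazon_data.csv:

    import numpy as np
    import pandas as pd

    # Stand-in for the DataFrame produced by the scraping code above.
    df = pd.DataFrame({"title": ["Gaming Mouse", ""], "price": ["$29.99", ""]})

    # NumPy marks empty titles as missing values so those rows can be dropped.
    df["title"] = df["title"].replace("", np.nan)
    df = df.dropna(subset=["title"])

    # Write the cleaned data to a CSV file (the file name is just an example).
    df.to_csv("amazon_data.csv", index=False, header=True)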

Conclusion

  • Summary of the entire process from understanding HTML to saving the data.
  • Encouragement to revisit parts of the tutorial if needed.