Python Web Scraping Tutorial: Finding the Cheapest In-Stock Products on Newegg

Jul 10, 2024

Introduction

  • Goal: Find the cheapest in-stock graphics card products on Newegg.
  • Tools: Beautiful Soup, Requests, regular expressions.
  • Scope: The examples target graphics cards, but the same method works for any product.

Steps

Input Product Name

  • Use input() to ask the user for the product name (e.g., graphics card model).
  • Example: gpu = input("What GPU do you want to search for?")

Build Newegg URL

  • Construct the URL to search on Newegg with filters for in-stock items.
  • Example query string: d=3080&N=4131 (search term 3080, with the N=4131 filter for in-stock items).
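
A minimal sketch of how the search URL could be assembled. The base path and the N=4131 filter value come from this tutorial's code summary; the helper name build_search_url is hypothetical, and urlencode is used here so that product names containing spaces are escaped.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.newegg.com/p/pl"  # search path used in the code summary
IN_STOCK_FILTER = "4131"  # value of N the tutorial uses for in-stock items


def build_search_url(product, page=None):
    """Return a search URL for `product` (hypothetical helper)."""
    params = {"d": product.strip(), "N": IN_STOCK_FILTER}
    if page is not None:
        params["page"] = page  # optional page number for paginated results
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("rtx 3080"))
```

Passing the user's input through a helper like this keeps the escaping in one place instead of repeating it wherever a URL is built.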

Send Request and Parse HTML

  • Use Requests to download the page content and Beautiful Soup to parse it.
  • Example: page = requests.get(url).text
  • Parse the HTML: doc = BeautifulSoup(page, 'html.parser')
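
The two calls above can be wrapped in small helpers, sketched here under a few assumptions: the function names, the 10-second timeout, and the User-Agent string are illustrative choices, not part of the original tutorial. The demonstration at the end parses a static snippet so it runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    # The User-Agent value is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (price-check tutorial script)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def parse_html(html):
    """Parse HTML text with the parser named in the tutorial."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration with a static snippet instead of a live request.
doc = parse_html("<html><body><p class='greeting'>hello</p></body></html>")
print(doc.find(class_="greeting").string)
```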

Determining Number of Pages

  • Newegg search results are paginated; determine total number of pages.
  • Find the pagination element and extract the number of pages.
  • Example: pages = int(doc.find(class_='list-tool-pagination-text').strong.string.split()[-1])
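
The one-liner above raises an exception if the pagination element is missing or its text has a different shape. As a sketch under a stated assumption, namely that the pagination text reads like "1/5" (current page, slash, total pages), the count can be pulled out with a regular expression and a fallback. The function name and sample strings are illustrative and not taken from Newegg.

```python
import re


def parse_page_count(pagination_text, default=1):
    """Return the total page count from text such as '1/5' (assumed format)."""
    match = re.search(r"(\d+)\s*/\s*(\d+)", pagination_text or "")
    if match is None:
        return default  # treat unexpected or missing text as a single page
    return int(match.group(2))


print(parse_page_count("1/5"))
print(parse_page_count("Page 2 / 12"))
print(parse_page_count(None))
```

The input text could come from the pagination element's get_text() result, which also works when the element contains several child nodes.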

Loop Through Pages

  • Loop through every page from 1 to pages (inclusive) to collect all results.
  • Append the page number to the URL as a query parameter (e.g., &page=2).
  • Example: for page in range(1, pages+1)
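
A short sketch of the page loop as a generator. The helper name page_urls is hypothetical; the &page= parameter and the search URL follow the tutorial's code summary.

```python
def page_urls(search_url, pages):
    """Yield (page_number, url) pairs for pages 1..pages inclusive."""
    for page_num in range(1, pages + 1):
        yield page_num, f"{search_url}&page={page_num}"


search_url = "https://www.newegg.com/p/pl?d=3080&N=4131"
for page_num, url in page_urls(search_url, 3):
    print(page_num, url)
```

Note that range(1, pages + 1) is needed because range excludes its upper bound.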

Extract Product Information

  • Find and filter items containing the product name.
  • Locate parent divs that contain all needed details (name, price, link).
  • Use Beautiful Soup to navigate the DOM tree.
  • Example: item_container = item.find_parent(class_='item-container')
  • Use find to get specific child elements like price and link.
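
The navigation steps above can be tried offline against a small hand-written snippet. In this sketch the class names (item-container, item-title, price-current) follow the tutorial, while the products, prices, links, and the function name extract_items are made up for illustration. The second sample item has no price on purpose, to show how a try-except block skips unexpected markup.

```python
import re

from bs4 import BeautifulSoup

# Hand-written sample markup; not real Newegg output.
SAMPLE_HTML = """
<div class="item-container">
  <a class="item-title" href="https://example.com/a">Brand A RTX 3080 10GB</a>
  <ul><li class="price-current">$<strong>1,299</strong><sup>.99</sup></li></ul>
</div>
<div class="item-container">
  <a class="item-title" href="https://example.com/b">Brand B RTX 3080 Ti</a>
</div>
"""


def extract_items(doc, product):
    """Return (name, price, link) tuples for titles that mention `product`."""
    results = []
    pattern = re.compile(re.escape(product))  # match the product text literally
    for text in doc.find_all(string=pattern):
        title = text.find_parent("a", class_="item-title")
        if title is None:
            continue  # the match was not inside a product title link
        container = title.find_parent(class_="item-container")
        try:
            price_text = container.find(class_="price-current").strong.string
            price = int(price_text.replace(",", ""))
        except (AttributeError, ValueError):
            continue  # price element missing or not a plain number
        results.append((str(text), price, title["href"]))
    return results


doc = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(extract_items(doc, "3080"))
```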

Store Information in Data Structure

  • Store extracted information (name, price, link) in a dictionary.
  • Example: items_found[name] = {'price': price, 'link': link}
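
Because the dictionary is keyed by product name, a second listing with the same name overwrites the first. One possible way to handle that, sketched with a hypothetical helper and made-up sample data, is to keep whichever entry is cheaper.

```python
def add_item(items_found, name, price, link):
    """Store an item, keeping the cheaper entry if the name already exists."""
    existing = items_found.get(name)
    if existing is None or price < existing["price"]:
        items_found[name] = {"price": price, "link": link}


items_found = {}
add_item(items_found, "Card A", 900, "https://example.com/a1")
add_item(items_found, "Card A", 850, "https://example.com/a2")
add_item(items_found, "Card B", 700, "https://example.com/b")
print(items_found)
```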

Sorting and Displaying Results

  • Remove the thousands separators from each price string, then convert it to an integer so items sort numerically.
  • Example: price = int(price.replace(',', ''))
  • Sort items by price and print in a readable format.
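
A small sketch of the cleaning and sorting steps on made-up data. The helper name parse_price and the sample items are illustrative; the dictionary layout matches the one used in this tutorial.

```python
def parse_price(price_text):
    """Convert a whole-dollar string such as '1,299' to the integer 1299."""
    return int(price_text.replace(",", "").strip())


items_found = {
    "Card A": {"price": parse_price("1,299"), "link": "https://example.com/a"},
    "Card B": {"price": parse_price("849"), "link": "https://example.com/b"},
    "Card C": {"price": parse_price("2,100"), "link": "https://example.com/c"},
}

# Sort the (name, details) pairs by price, cheapest first.
sorted_items = sorted(items_found.items(), key=lambda entry: entry[1]["price"])
for name, details in sorted_items:
    print(f"{name}: ${details['price']}")
    print(details["link"])
```

Sorting the strings directly would compare them character by character, so "849" would sort after "1,299"; converting to integers avoids that.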

Code Summary

import re
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

# Get user input
product = input("What product do you want to search for? ").strip()

# URL construction (N=4131 filters for in-stock items)
url = f'https://www.newegg.com/p/pl?d={quote_plus(product)}&N=4131'
page = requests.get(url).text

# Parsing HTML
doc = BeautifulSoup(page, 'html.parser')

# Determine number of pages
pages = int(doc.find(class_='list-tool-pagination-text').strong.string.split()[-1])

# Initialize results dictionary
items_found = {}

# Loop through pages
for page_num in range(1, pages + 1):
    page_url = f'{url}&page={page_num}'
    page_content = requests.get(page_url).text
    doc = BeautifulSoup(page_content, 'html.parser')

    # Find text nodes that mention the product (escaped so it is matched literally)
    items = doc.find_all(string=re.compile(re.escape(product)))
    for item in items:
        parent = item.find_parent('a', class_='item-title')
        if parent is None:
            continue  # the match was not inside a product title link
        link = parent['href']
        try:
            price = parent.find_next('li', class_='price-current').strong.string
            price = int(price.replace(',', ''))
        except (AttributeError, ValueError):
            continue  # price element missing or not a plain number

        # Store in dictionary, keyed by the product name
        items_found[str(item)] = {'price': price, 'link': link}

# Sort and display results, cheapest first
sorted_items = sorted(items_found.items(), key=lambda x: x[1]['price'])
for name, details in sorted_items:
    print(f"{name}: ${details['price']}, Link: {details['link']}")

Advanced Techniques Used

  • Using regular expressions with BeautifulSoup to find text.
  • Traversing HTML elements to find nested child elements.
  • Try-except blocks to handle unexpected HTML structures.