Python Web Scraping Tutorial: Finding the Cheapest In-Stock Products on Newegg

Introduction

Goal: Find the cheapest in-stock graphics cards on Newegg.
Tools: Beautiful Soup, Requests, regular expressions.
Product: Graphics cards are the target here, but the method applies to any product.

Steps

Input Product Name

Use input() to ask the user for the product name (e.g., graphics card model).
Example: gpu = input("What GPU do you want to search for? ")

Build Newegg URL

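Construct the search URL with a filter for in-stock items.
Example URL: https://www.newegg.com/p/pl?d=3080&N=4131 for in-stock RTX 3080 results (d is the search term; N=4131 is the in-stock filter value used in the code summary).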
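A minimal sketch of the URL-building step, assuming the endpoint and in-stock filter value shown in the code summary of this tutorial. The helper name build_search_url and the constants are hypothetical; urlencode from the standard library escapes spaces and other special characters in the search term.

```python
from urllib.parse import urlencode

# Assumed values, copied from the code summary in this tutorial.
SEARCH_URL = "https://www.newegg.com/p/pl"
IN_STOCK_FILTER = "4131"


def build_search_url(product, page=None):
    """Return a search URL for `product`, optionally for one results page."""
    params = {"d": product, "N": IN_STOCK_FILTER}
    if page is not None:
        params["page"] = page
    # urlencode escapes the values, e.g. a space becomes "+".
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url("3080"))
print(build_search_url("rtx 3080", page=2))
```

Building the query string from a dictionary keeps the parameter names in one place and avoids hand-escaping user input.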

Send Request and Parse HTML

Use requests to fetch the page content and parse it with Beautiful Soup.
Example: page = requests.get(url).text
Parse the HTML: doc = BeautifulSoup(page, 'html.parser')

Determine Number of Pages

Newegg search results are paginated, so determine the total number of pages first.
Find the pagination element and extract the page count from its text.
Example: pages = int(doc.find(class_='list-tool-pagination-text').strong.string.split()[-1])
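The exact text inside the pagination element is not shown in this tutorial, so the one-liner above can fail if the format differs from what it expects. As a hedged alternative, the sketch below pulls the last integer out of whatever text the element contains. The helper name parse_page_count and the sample strings are assumptions for illustration only.

```python
import re


def parse_page_count(pagination_text):
    """Return the last integer found in the pagination text, or 1 if none."""
    numbers = re.findall(r"\d+", pagination_text)
    if not numbers:
        # No numbers at all: treat the results as a single page.
        return 1
    return int(numbers[-1])


# Hypothetical formats the pagination text might take.
print(parse_page_count("1/5"))
print(parse_page_count("Page 1 of 12"))
```

In the scraper, the argument would be the text of the pagination element, for example the result of calling get_text() on it.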

Loop Through Pages

Loop through all pages from 1 to pages (inclusive) to collect every result.
Append the page number to the URL as a query parameter (e.g., &page=x).
Example: for page in range(1, pages + 1):
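The page loop can be sketched as a small generator that yields one URL per results page. The function name page_urls is hypothetical, and the base URL is the assumed search URL from the code summary; no request is sent here.

```python
def page_urls(base_url, pages):
    """Yield the URL of every results page, numbered from 1 to `pages`."""
    for page_num in range(1, pages + 1):
        # The base URL already has a query string, so append with "&".
        yield f"{base_url}&page={page_num}"


urls = list(page_urls("https://www.newegg.com/p/pl?d=3080&N=4131", 3))
for url in urls:
    print(url)
```

Because range() excludes its upper bound, pages + 1 is needed to include the last page.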

Extract Product Information

Find the text nodes that contain the product name and keep only those that belong to a product listing.
Locate the parent div that holds all the needed details (name, price, link).
Use Beautiful Soup to navigate the DOM tree.
Example: item_container = item.find_parent(class_='item-container')
Use find() on the container to get specific child elements such as the price and link.
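The extraction steps above can be tried offline on a small HTML string. The markup below is hypothetical: it only borrows the class names mentioned in this tutorial (item-container, item-title, price-current), and the real page may be structured differently. The function name extract_items is also an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from this tutorial.
SAMPLE_HTML = """
<div class="item-container">
  <a class="item-title" href="https://example.com/gpu-a">ACME RTX 3080 Gaming</a>
  <ul><li class="price-current">$<strong>1,299</strong><sup>.99</sup></li></ul>
</div>
<div class="item-container">
  <a class="item-title" href="https://example.com/gpu-b">ACME RTX 3070 Mini</a>
  <ul><li class="price-current">$<strong>599</strong><sup>.99</sup></li></ul>
</div>
"""


def extract_items(html, product):
    """Return {name: {'price': int, 'link': str}} for listings matching `product`."""
    doc = BeautifulSoup(html, "html.parser")
    found = {}
    # re.escape keeps characters such as "+" or "(" from being read as regex syntax.
    for text_node in doc.find_all(string=re.compile(re.escape(product))):
        container = text_node.find_parent(class_="item-container")
        if container is None:
            continue  # the match was not inside a product listing
        title = container.find("a", class_="item-title")
        price_tag = container.find(class_="price-current")
        if title is None or price_tag is None or price_tag.strong is None:
            continue  # listing is missing a title link or a price
        found[str(text_node).strip()] = {
            "price": int(price_tag.strong.get_text().replace(",", "")),
            "link": title.get("href"),
        }
    return found


results = extract_items(SAMPLE_HTML, "3080")
print(results)
```

Checking each lookup for None before using it is one way to cope with listings that lack a price or link; a try-except block around the lookups would serve the same purpose.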

Store Information in Data Structure

Store extracted information (name, price, link) in a dictionary.
Example: items_found[name] = {'price': price, 'link': link}

Sort and Display Results

Remove the thousands separators (commas) from each price, then convert it to an integer so that prices sort numerically rather than as text.
Example: price = int(price.replace(',', ''))
Sort items by price and print them in a readable format.
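A short sketch of why the conversion matters, using made-up product names, prices, and links. Sorting the raw strings compares them character by character, so "1,299" sorts before "599"; sorting the converted integers gives the expected cheapest-first order.

```python
# Made-up sample data in the same shape as the items_found dictionary.
raw_items = {
    "Card A": {"price": "1,299", "link": "https://example.com/a"},
    "Card B": {"price": "599", "link": "https://example.com/b"},
    "Card C": {"price": "849", "link": "https://example.com/c"},
}

# String comparison puts "1,299" first because "1" sorts before "5".
print(sorted(info["price"] for info in raw_items.values()))

# Strip the commas and convert to int before storing.
items_found = {
    name: {"price": int(info["price"].replace(",", "")), "link": info["link"]}
    for name, info in raw_items.items()
}

# Sort the (name, info) pairs by the integer price, cheapest first.
sorted_items = sorted(items_found.items(), key=lambda entry: entry[1]["price"])
for name, info in sorted_items:
    print(f"{name}: ${info['price']}, Link: {info['link']}")
```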

Code Summary

import requests
from bs4 import BeautifulSoup
import re

# Get user input (trailing space keeps the typed answer off the prompt text)
product = input("What product do you want to search for? ")

# URL construction
url = f'https://www.newegg.com/p/pl?d={product}&N=4131'
page = requests.get(url).text

# Parsing HTML
doc = BeautifulSoup(page, 'html.parser')

# Determine number of pages
pages = int(doc.find(class_='list-tool-pagination-text').strong.string.split()[-1])

# Initialize results dictionary
items_found = {}

# Loop through pages
for page_num in range(1, pages + 1):
    page_url = f'{url}&page={page_num}'
    page_content = requests.get(page_url).text
    doc = BeautifulSoup(page_content, 'html.parser')

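    # Find text nodes that mention the product; re.escape guards against
    # regex metacharacters in the user's input
    items = doc.find_all(string=re.compile(re.escape(product)))
    for item in items:
        parent = item.find_parent('a', class_='item-title')
        if parent is None:
            continue  # the text matched somewhere other than a product title
        link = parent.get('href')
        try:
            price = parent.find_next('li', class_='price-current').strong.string
            price = int(price.replace(',', ''))
        except (AttributeError, ValueError):
            continue  # price missing or not in the expected format

        # Store in dictionary, keyed by the product title text
        items_found[str(item)] = {'price': price, 'link': link}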

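# Sort by price (cheapest first) and display results
sorted_items = sorted(items_found.items(), key=lambda x: x[1]['price'])
for name, info in sorted_items:
    # Double quotes outside, single quotes inside: valid on every Python 3 version
    print(f"{name}: ${info['price']}, Link: {info['link']}")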

Advanced Techniques Used

Using regular expressions with Beautiful Soup to find text.
Traversing HTML elements to find nested child elements.
Try-except blocks to handle unexpected HTML structures.