Web Scraping Amazon - Handling Pagination

Jul 12, 2024

Introduction

  • Instructor: John
  • Goal: To show a method for handling pagination when web scraping Amazon search pages

Method Overview

  • Check Next Button: Verify if the next button exists on the page
  • Scrape URL: If the next button exists, scrape the URL for the next page
  • Repeat Process: Move to the next page and repeat the process

Steps Involved

1. Identifying Pagination Class

  • Class Identification: Inspect the pagination element on the search results page
    • The "Next" button is an element with class a-last
    • On the last page, the class changes to a-disabled a-last, so there is no next page to follow
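This class change is the whole signal the method relies on, so it can be illustrated in isolation. A minimal sketch with BeautifulSoup, using simplified stand-in HTML rather than Amazon's real markup:

```python
from bs4 import BeautifulSoup

# Simplified stand-ins for Amazon's pagination markup (not real page HTML)
middle_page = '<ul><li class="a-last"><a href="/s?page=2">Next</a></li></ul>'
last_page = '<ul><li class="a-disabled a-last">Next</li></ul>'

def has_next(html):
    soup = BeautifulSoup(html, 'html.parser')
    # On the last page the "Next" item carries both classes: a-disabled a-last
    return soup.find('li', {'class': 'a-disabled a-last'}) is None

print(has_next(middle_page))  # True
print(has_next(last_page))    # False
```

Searching for the exact class string works here because BeautifulSoup matches the full class attribute value when given a space-separated string.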

2. Setting Up the Environment

  • Packages Needed: requests-html, beautifulsoup4 (install with pip install requests-html beautifulsoup4)
  • Session Initiation:
    from requests_html import HTMLSession
    s = HTMLSession()
    

3. Preparing the URL

  • Initial URL: Load the URL and trim unnecessary parts
  • Example:
    url = "https://www.amazon.com/s?keywords=example"
    
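The "trimming" step can be sketched with the standard library: parse the URL, keep only the search parameter, and rebuild it. The choice of which parameters to keep is an assumption; check the actual URL in your browser:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def trim_url(raw_url, keep=('keywords',)):
    """Drop query clutter, keeping only the parameters named in `keep`."""
    parts = urlparse(raw_url)
    params = parse_qs(parts.query)
    kept = {k: v for k, v in params.items() if k in keep}
    return urlunparse(parts._replace(query=urlencode(kept, doseq=True)))

raw = "https://www.amazon.com/s?keywords=example&ref=nb_sb_noss&qid=12345"
print(trim_url(raw))  # https://www.amazon.com/s?keywords=example
```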

4. Creating Functions

Function 1: get_data

  • Purpose: Retrieve and parse HTML from a given URL
  • Code:
    from bs4 import BeautifulSoup
    def get_data(url):
        r = s.get(url)  # Use session to get the URL
        soup = BeautifulSoup(r.text, 'html.parser')  # Parse HTML
        return soup
    
  • Testing:
    print(get_data(url))  # Print entire HTML soup for verification
    

Function 2: get_next_page

  • Purpose: Identify and return the URL for the next page
  • Code:
    def get_next_page(soup):
        # Pagination list; Amazon commonly uses class a-pagination (verify in the page source)
        page = soup.find('ul', {'class': 'a-pagination'})
        if page and not page.find('li', {'class': 'a-disabled a-last'}):
            next_link = page.find('li', {'class': 'a-last'}).find('a')['href']
            next_url = 'https://www.amazon.com' + next_link
            return next_url
        return None
    
  • Testing:
    soup = get_data(url)
    print(get_next_page(soup))  # Print next page URL
    

Loop for Pagination

  • Combining Functions:
    while True:
        soup = get_data(url)
        url = get_next_page(soup)
        if not url:
            break
        print(url)  # Print each next page URL
    
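In practice the loop usually does more than print URLs. A hedged sketch of collecting product titles across pages: the h2 a span selector and the page cap are assumptions, and the two helper functions are taken as parameters so the loop can also be exercised with stubs (pass the real get_data and get_next_page for live use):

```python
from bs4 import BeautifulSoup

def scrape_all_pages(start_url, get_data, get_next_page, max_pages=20):
    """Follow next-page links, collecting product titles page by page."""
    results = []
    url, pages = start_url, 0
    while url and pages < max_pages:
        soup = get_data(url)
        # Assumed title selector; verify against the live page markup
        results.extend(s.get_text(strip=True) for s in soup.select('h2 a span'))
        url = get_next_page(soup)
        pages += 1
    return results

# Stub demo with two fake pages (a real run would hit amazon.com)
fake_pages = {
    'p1': '<h2><a><span>Item A</span></a></h2>',
    'p2': '<h2><a><span>Item B</span></a></h2>',
}

def fake_get_data(url):
    return BeautifulSoup(fake_pages[url], 'html.parser')

def fake_get_next_page(soup):
    # Crude stub: map each page to its successor by its title text
    return {'Item A': 'p2', 'Item B': None}[soup.get_text(strip=True)]

print(scrape_all_pages('p1', fake_get_data, fake_get_next_page))  # ['Item A', 'Item B']
```

Note that live requests to Amazon may require extra headers (e.g. a browser-like User-Agent) or be blocked entirely; the sketch only shows the loop's shape.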

Final Notes

  • Utility: Functions can be saved and reused for future Amazon scraping tasks
  • Loop Mechanism: The loop follows pages consecutively until no next-page button is found
  • Announcements:
    • Black Friday special video with more advanced scraping techniques
    • Encouragement to subscribe, like, and comment for more content

Conclusion

  • This method handles pagination reliably when web scraping Amazon search results.
  • Its reusability and simplicity make it suitable for larger data extraction projects.