Web Scraping Amazon - Handling Pagination

Jul 12, 2024

Introduction

  • Instructor: John
  • Goal: To show a method for handling pagination when web scraping Amazon search pages

Method Overview

  • Check Next Button: Verify if the next button exists on the page
  • Scrape URL: If the next button exists, scrape the URL for the next page
  • Repeat Process: Move to the next page and repeat the process

Steps Involved

1. Identifying Pagination Class

  • Class Identification: Inspect the pagination element on the search results page
    • The "Next" button is an element with class a-last
    • On the last page, the class changes to a-disabled a-last, so there is no next page to follow
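This class change is the whole signal the method relies on, so it can be illustrated in isolation. A minimal sketch with BeautifulSoup, using simplified stand-in HTML rather than Amazon's real markup:

```python
from bs4 import BeautifulSoup

# Simplified stand-ins for Amazon's pagination markup (not real page HTML)
middle_page = '<ul><li class="a-last"><a href="/s?page=2">Next</a></li></ul>'
last_page = '<ul><li class="a-disabled a-last">Next</li></ul>'

def has_next(html):
    soup = BeautifulSoup(html, 'html.parser')
    # On the last page the "Next" item carries both classes: a-disabled a-last
    return soup.find('li', {'class': 'a-disabled a-last'}) is None

print(has_next(middle_page))  # True
print(has_next(last_page))    # False
```

Searching for the exact class string works here because BeautifulSoup matches the full class attribute value when given a space-separated string.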

2. Setting Up the Environment

  • Packages Needed: requests-html, beautifulsoup4 (install with pip install requests-html beautifulsoup4)
  • Session Initiation:
    from requests_html import HTMLSession
    s = HTMLSession()
    

3. Preparing the URL

  • Initial URL: Load the URL and trim unnecessary parts
  • Example:
    url = "https://www.amazon.com/s?keywords=example"
    
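The "trimming" step can be sketched with the standard library: parse the URL, keep only the search parameter, and rebuild it. The choice of which parameters to keep is an assumption; check the actual URL in your browser:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def trim_url(raw_url, keep=('keywords',)):
    """Drop query clutter, keeping only the parameters named in `keep`."""
    parts = urlparse(raw_url)
    params = parse_qs(parts.query)
    kept = {k: v for k, v in params.items() if k in keep}
    return urlunparse(parts._replace(query=urlencode(kept, doseq=True)))

raw = "https://www.amazon.com/s?keywords=example&ref=nb_sb_noss&qid=12345"
print(trim_url(raw))  # https://www.amazon.com/s?keywords=example
```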

4. Creating Functions

Function 1: get_data

  • Purpose: Retrieve and parse HTML from a given URL
  • Code:
    from bs4 import BeautifulSoup
    def get_data(url):
        r = s.get(url)  # Use session to get the URL
        soup = BeautifulSoup(r.text, 'html.parser')  # Parse HTML
        return soup
    
  • Testing:
    print(get_data(url))  # Print entire HTML soup for verification
    

Function 2: get_next_page

  • Purpose: Identify and return the URL for the next page
  • Code:
    def get_next_page(soup):
        # Pagination list; Amazon commonly uses class a-pagination (verify in the page source)
        page = soup.find('ul', {'class': 'a-pagination'})
        if page and not page.find('li', {'class': 'a-disabled a-last'}):
            next_link = page.find('li', {'class': 'a-last'}).find('a')['href']
            next_url = 'https://www.amazon.com' + next_link
            return next_url
        return None
    
  • Testing:
    soup = get_data(url)
    print(get_next_page(soup))  # Print next page URL
    

Loop for Pagination

  • Combining Functions:
    while True:
        soup = get_data(url)
        url = get_next_page(soup)
        if not url:
            break
        print(url)  # Print each next page URL
    
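In practice the loop usually does more than print URLs. A hedged sketch of collecting product titles across pages: the h2 a span selector and the page cap are assumptions, and the two helper functions are taken as parameters so the loop can also be exercised with stubs (pass the real get_data and get_next_page for live use):

```python
from bs4 import BeautifulSoup

def scrape_all_pages(start_url, get_data, get_next_page, max_pages=20):
    """Follow next-page links, collecting product titles page by page."""
    results = []
    url, pages = start_url, 0
    while url and pages < max_pages:
        soup = get_data(url)
        # Assumed title selector; verify against the live page markup
        results.extend(s.get_text(strip=True) for s in soup.select('h2 a span'))
        url = get_next_page(soup)
        pages += 1
    return results

# Stub demo with two fake pages (a real run would hit amazon.com)
fake_pages = {
    'p1': '<h2><a><span>Item A</span></a></h2>',
    'p2': '<h2><a><span>Item B</span></a></h2>',
}

def fake_get_data(url):
    return BeautifulSoup(fake_pages[url], 'html.parser')

def fake_get_next_page(soup):
    # Crude stub: map each page to its successor by its title text
    return {'Item A': 'p2', 'Item B': None}[soup.get_text(strip=True)]

print(scrape_all_pages('p1', fake_get_data, fake_get_next_page))  # ['Item A', 'Item B']
```

Note that live requests to Amazon may require extra headers (e.g. a browser-like User-Agent) or be blocked entirely; the sketch only shows the loop's shape.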

Final Notes

  • Utility: Functions can be saved and reused for future Amazon scraping tasks
  • Loop Mechanism: The loop follows pages consecutively until no next-page button is found
  • Announcements:
    • Black Friday special video with more advanced scraping techniques
    • Encouragement to subscribe, like, and comment for more content

Conclusion

  • This method handles pagination reliably when web scraping Amazon search results.
  • Its reusability and simplicity make it suitable for larger data extraction projects.