Efficient Web Scraping with AgentQL

Nov 6, 2024

Web Scraping with Playwright and AgentQL

Introduction

  • Demonstration of how to scrape products from multiple online stores with a single Python script.
  • Tools used: Playwright and AgentQL (an AI-powered web scraping tool).
  • AgentQL simplifies the process by eliminating the need to find CSS selectors or XPath expressions.

Advantages of AgentQL

  • The same scraping code works across different websites despite layout differences.
  • Resilience to website changes, since queries do not depend on specific selectors.
  • Data can be formatted using natural language (e.g., date formats or currency symbols).
  • Quick setup using the AgentQL guide or video demonstration.

Getting Started

  • AgentQL can be accessed via agentql.com.
  • Steps to begin:
    • Obtain an API key from the website.
    • Install the package with pip install agentql.
    • Run agentql init and provide the API key.

Coding Demonstration

Initial Setup

  1. VS Code Setup:
    • Create a new file main.py.
    • Start with basic Playwright boilerplate code.
  2. Basic Playwright Code:
    • Import necessary modules and initiate Playwright.
  3. Integrating AgentQL:
    • Import agentql and wrap the page returned by browser.new_page() with agentql.wrap().
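
The three setup steps above might look like the following in main.py. This is a minimal sketch, assuming Playwright's synchronous API and the AgentQL Python SDK's agentql.wrap() helper; the function names open_wrapped_page and main and the placeholder URL are illustrative and not taken from the video.

```python
def open_wrapped_page(playwright, headless=False):
    """Launch Chromium and return (browser, page), with the page wrapped by AgentQL."""
    # Imported here so this file can be loaded even before the SDK is installed.
    import agentql

    browser = playwright.chromium.launch(headless=headless)
    # Wrapping adds AgentQL's query methods on top of the normal Playwright page.
    page = agentql.wrap(browser.new_page())
    return browser, page


def main():
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser, page = open_wrapped_page(playwright)
        page.goto("https://example.com")  # placeholder URL
        browser.close()
```

Calling main() opens a visible browser window, loads the placeholder page, and closes the browser again.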

Basic Scraping Example

  • Demonstration with a t-shirt website.
  • Navigate to the store with page.goto and its URL.
  • Call page.query_data with an AgentQL query to extract product names and prices without specific selectors.
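
A query of this kind could be sketched as follows. The field names in PRODUCTS_QUERY and the helper fetch_products are assumptions made for illustration; the page argument is expected to be a Playwright page already wrapped with AgentQL.

```python
# AgentQL query: a list of products, each with a name and a price.
# The field names are plain descriptions, not CSS selectors.
PRODUCTS_QUERY = """
{
    products[] {
        name
        price
    }
}
"""


def fetch_products(page, url):
    """Open the URL and return the list of product dictionaries found on it."""
    page.goto(url)
    data = page.query_data(PRODUCTS_QUERY)  # returns a plain dictionary
    return data["products"]
```

Adding another field such as creator would mean adding one more line inside products[], with no other change to the code.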

Extending Functionality

  • Add more fields like creator to extract additional data.
  • Use AI for queries instead of selectors.

Handling Data

  • Extracted data is returned as a dictionary, and its product entries are iterated for display.
  • Example of transforming price data into numerical format.
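
One way to do the numeric conversion is to post-process the price strings in Python, as sketched below. This is an assumption about the approach, since the video may instead request a numeric format inside the query itself. The helpers parse_price and show_products and the sample data are hypothetical, and the parsing assumes US-style separators (comma for thousands, dot for decimals).

```python
import re


def parse_price(text):
    """Return the first number in a price string as a float, or None if there is none."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting, e.g. "1,299.00" -> 1299.0
    return float(match.group().replace(",", ""))


def show_products(data):
    """Print one line per product from a query_data-style dictionary."""
    for product in data.get("products", []):
        price = parse_price(product["price"])
        print(f"{product['name']}: {price}")


# Illustrative data only; a real run would pass the dictionary from page.query_data.
sample = {"products": [{"name": "Sample Tee", "price": "$19.99"}]}
show_products(sample)
```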

Advanced Scraping Techniques

Handling Pagination

  • Structure code to scrape multiple pages.
  • Use page.query_elements to locate the pagination links and navigate through them.
  • Example of loop setup with page limit.
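
The pagination loop can be sketched roughly as below. The query field next_page_btn and the function scrape_pages are illustrative names, and the sketch assumes that query_elements returns None for an element it cannot find and a clickable Playwright locator otherwise.

```python
# Hypothetical query asking AgentQL for the "next page" control.
NEXT_PAGE_QUERY = """
{
    next_page_btn
}
"""


def scrape_pages(page, products_query, page_limit=3):
    """Collect products from up to page_limit pages, following the next-page button."""
    all_products = []
    for page_number in range(1, page_limit + 1):
        data = page.query_data(products_query)
        all_products.extend(data.get("products") or [])

        if page_number == page_limit:
            break  # limit reached, no need to look for another page

        next_btn = page.query_elements(NEXT_PAGE_QUERY).next_page_btn
        if next_btn is None:
            break  # assumed signal that there is no further page

        next_btn.click()
        page.wait_for_load_state()  # Playwright: wait for the next page to load
    return all_products
```

The page limit keeps the loop bounded even on sites where the next button stays present but disabled on the last page.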

Handling Website Pop-ups

  • Use AgentQL to detect and close pop-ups.
  • Implement wait times for pop-up animations.
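
A pop-up handler along these lines is sketched below. The query field popup_close_btn, the function close_popup_if_present, and the one-second default wait are assumptions chosen for illustration; page.wait_for_timeout is Playwright's fixed-delay helper.

```python
# Hypothetical query asking AgentQL for the close button of any pop-up.
POPUP_QUERY = """
{
    popup_close_btn
}
"""


def close_popup_if_present(page, wait_ms=1000):
    """Close a pop-up if one is found; return True when something was closed."""
    # Give the pop-up time to finish its opening animation before looking for it.
    page.wait_for_timeout(wait_ms)

    close_btn = page.query_elements(POPUP_QUERY).popup_close_btn
    if close_btn is None:
        return False  # assumed: None means no pop-up was detected

    close_btn.click()
    # Wait again so the closing animation does not cover the page content.
    page.wait_for_timeout(wait_ms)
    return True
```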

Cross-website Scraping

  • The same code is demonstrated scraping different sites.
  • Queries are made more adaptive by adding contextual page elements to them.
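
To illustrate how context can be attached to a query, the sketch below builds an AgentQL query string from field names and optional natural-language hints written in parentheses. The helper build_products_query and the example hints are hypothetical and are not shown in the video.

```python
def build_products_query(fields):
    """Build a products query from a mapping of field name -> hint (or None)."""
    lines = []
    for name, hint in fields.items():
        # A hint in parentheses gives AgentQL extra context about the field.
        lines.append(f"        {name}({hint})" if hint else f"        {name}")
    body = "\n".join(lines)
    return "{\n    products[] {\n" + body + "\n    }\n}"


query = build_products_query({
    "name": None,
    "price": "number only, without currency symbol",
    "creator": "seller or brand shown with the product",
})
print(query)
```

Keeping the hints in one mapping lets the same scraping code run on another store by adjusting only the descriptions.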

Creating a Reusable Function

  • Define a scrape_products function.
  • Parameters: URL, page limit, optional headless mode.
  • Returns a list of product dictionaries.
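
A function with this signature could be sketched as follows. The query field names, the headless=True default, and the assumption that a missing next-page element is returned as None are all illustrative choices and not confirmed by the video.

```python
PRODUCTS_QUERY = """
{
    products[] {
        name
        price
    }
}
"""

NEXT_PAGE_QUERY = """
{
    next_page_btn
}
"""


def scrape_products(url, page_limit=1, headless=True):
    """Scrape up to page_limit pages starting at url; return a list of product dicts."""
    # Imported inside the function so the module loads without the packages installed.
    import agentql
    from playwright.sync_api import sync_playwright

    products = []
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=headless)
        try:
            page = agentql.wrap(browser.new_page())
            page.goto(url)
            for page_number in range(1, page_limit + 1):
                data = page.query_data(PRODUCTS_QUERY)
                products.extend(data.get("products") or [])
                if page_number == page_limit:
                    break
                next_btn = page.query_elements(NEXT_PAGE_QUERY).next_page_btn
                if next_btn is None:
                    break  # assumed: no further page available
                next_btn.click()
                page.wait_for_load_state()
        finally:
            browser.close()  # always release the browser, even on errors
    return products
```

A call such as scrape_products("https://example.com/shop", page_limit=3) would then return one combined list for the first three pages; the URL here is a placeholder.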

Conclusion

  • AgentQL provides a flexible, AI-powered tool for web scraping.
  • Allows streamlined data extraction across various websites with minimal setup.
  • Offers a free tier for initial usage.

Try It Yourself

  • Visit agentql.com to get started.
  • Leave suggestions for AgentQL scraping tasks in the video's comments.

Thank you for watching, and see you in the next video!