Web Scraping with Playwright and AgentQL
Introduction
- Demonstrates how to scrape products from multiple online stores with a single Python script.
- Tools Used: Playwright and AgentQL (AI-powered web scraping tool).
- AgentQL simplifies the process, eliminating the need to find CSS selectors or XPath identifiers.
Advantages of AgentQL
- Uniform scraping code for different websites despite layout differences.
- Resilience to website changes.
- Capability to format data using natural language (e.g., date formats or currency symbols).
- Quick setup by following the AgentQL guide or video demonstration.
Getting Started
- AgentQL can be accessed via agentql.com.
- Steps to begin:
  - Obtain an API key from the website.
  - Install with pip install agentql.
  - Initialize with agentql init, using the API key.
Coding Demonstration
Initial Setup
- VS Code Setup:
  - Create a new file, main.py.
  - Start with basic Playwright boilerplate code.
- Basic Playwright Code:
  - Import the necessary modules and start Playwright.
- Integrating AgentQL:
  - Import agentql and wrap the new_page call with agentql.wrap.
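The setup steps above can be sketched as a small context manager. This is a minimal sketch, not the video's exact code: the helper name wrapped_page is hypothetical, and the imports are deferred into the function only so the file still loads on a machine where the packages are not installed yet.

```python
from contextlib import contextmanager


@contextmanager
def wrapped_page(headless=False):
    """Yield a Playwright page wrapped by AgentQL and close the browser afterwards.

    wrapped_page is a hypothetical helper name chosen for this sketch.
    """
    # Deferred imports: both packages must be installed before this is called.
    import agentql
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=headless)
        try:
            # agentql.wrap adds the AgentQL query methods to a normal Playwright page.
            yield agentql.wrap(browser.new_page())
        finally:
            browser.close()
```

It would be used as `with wrapped_page() as page:` followed by ordinary Playwright calls such as page.goto.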
Basic Scraping Example
- Demonstration with a t-shirt website.
- Use page.goto with the URL.
- Use page.query_data with an AgentQL query to extract product names and prices without writing any selectors.
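A minimal sketch of the basic example follows. The field names in the query (products, name, price) and the helper fetch_products are assumptions for illustration; page is expected to be a page already wrapped with agentql.wrap, and the URL passed in is whatever store is being scraped.

```python
# Assumed query: AgentQL matches these names against the page content,
# so no CSS selector or XPath is written by hand.
PRODUCTS_QUERY = """
{
    products[] {
        name
        price
    }
}
"""


def fetch_products(page, url, query=PRODUCTS_QUERY):
    """Open url and return the list of product dictionaries found by the query."""
    page.goto(url)
    data = page.query_data(query)
    # query_data returns a dictionary keyed by the top-level names in the query.
    return data.get("products") or []
```

Passing the page in as a parameter keeps the function independent of how the browser was started.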
Extending Functionality
- Add more fields, such as creator, to extract additional data.
- Queries are resolved by AI rather than by hand-written selectors.
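Because a query is plain text, adding a field only means adding a line. The sketch below builds such a query from a list of field names; build_products_query is a hypothetical helper, and the products container name is an assumption.

```python
def build_products_query(fields):
    """Return an AgentQL-style query asking for the given fields on every product."""
    if not fields:
        raise ValueError("at least one field is required")
    lines = ["{", "    products[] {"]
    lines += [f"        {field}" for field in fields]
    lines += ["    }", "}"]
    return "\n".join(lines)


# Adding creator is a one-word change to the field list.
query = build_products_query(["name", "price", "creator"])
```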
Handling Data
- Extracted data is returned as a dictionary and iterated over for display.
- Example: converting the price text into a number.
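One way to do the price conversion in plain Python is sketched below. It assumes prices use a comma as the thousands separator and a dot as the decimal point; the helper names are hypothetical, and the video may instead ask AgentQL itself to format the value.

```python
import re


def parse_price(text):
    """Convert a price string such as "$1,299.99" to a float, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(text))
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def with_numeric_prices(products):
    """Return copies of the product dictionaries with price converted to a number."""
    return [{**product, "price": parse_price(product.get("price"))} for product in products]


def print_products(products):
    """Iterate over the extracted products and print one line per product."""
    for product in with_numeric_prices(products):
        print(f'{product.get("name")}: {product["price"]}')
```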
Advanced Scraping Techniques
Handling Pagination
- Structure code to scrape multiple pages.
- Use page.query_elements to locate the pagination links and move through them.
- Example of a loop with a page limit.
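A minimal sketch of such a loop follows. The element name next_page_btn and the helper scrape_pages are assumptions, and the sketch assumes that an element AgentQL cannot find comes back as None. scrape_current_page stands for any function that extracts the products on the page currently shown.

```python
# Assumed query: asks AgentQL for the control that leads to the next page.
NEXT_PAGE_QUERY = """
{
    next_page_btn
}
"""


def scrape_pages(page, page_limit, scrape_current_page):
    """Scrape up to page_limit pages, following the next-page control between them."""
    results = []
    for page_number in range(page_limit):
        results.extend(scrape_current_page(page))
        if page_number == page_limit - 1:
            break  # limit reached, no need to look for another page
        response = page.query_elements(NEXT_PAGE_QUERY)
        next_button = getattr(response, "next_page_btn", None)
        if next_button is None:
            break  # no further pages
        next_button.click()
        page.wait_for_load_state()  # standard Playwright wait for the new page
    return results
```

The page limit guards against stores with very long catalogues, and the None check stops the loop early when the last page is reached first.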
Handling Website Pop-ups
- Use AgentQL to detect and close pop-ups.
- Implement wait times for pop-up animations.
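The pop-up handling can be sketched as below. The element name popup_close_btn, the helper close_popup, and the default wait are assumptions; the sketch also assumes a missing element is returned as None rather than raising.

```python
# Assumed query: asks AgentQL for the button that dismisses a pop-up.
POPUP_QUERY = """
{
    popup_close_btn
}
"""


def close_popup(page, animation_ms=1000):
    """Close a pop-up if one is present; return True when something was closed."""
    # Give the pop-up time to animate in before looking for its close button.
    page.wait_for_timeout(animation_ms)
    response = page.query_elements(POPUP_QUERY)
    close_button = getattr(response, "popup_close_btn", None)
    if close_button is None:
        return False
    close_button.click()
    # Wait again so the closing animation finishes before scraping continues.
    page.wait_for_timeout(animation_ms)
    return True
```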
Cross-website Scraping
- The same code is demonstrated on several different sites.
- Adding contextual page elements to the query helps it adapt to each site.
Creating a Reusable Function
- Define a scrape_products function.
- Parameters: URL, page limit, optional headless mode.
- Returns a list of product dictionaries.
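A sketch of what such a function could look like, under stated assumptions: the query field names, the next_page_btn element, the default argument values, and the helper collect_products are all illustrative and not taken from the video. The browser-facing part is kept separate from the extraction loop so the loop can be exercised without a browser.

```python
PRODUCTS_QUERY = """
{
    products[] {
        name
        price
    }
}
"""

NEXT_PAGE_QUERY = """
{
    next_page_btn
}
"""


def collect_products(page, url, page_limit):
    """Extraction loop: works on any AgentQL-wrapped page passed in."""
    page.goto(url)
    products = []
    for page_number in range(page_limit):
        data = page.query_data(PRODUCTS_QUERY)
        products.extend(data.get("products") or [])
        if page_number == page_limit - 1:
            break
        response = page.query_elements(NEXT_PAGE_QUERY)
        next_button = getattr(response, "next_page_btn", None)
        if next_button is None:
            break  # assumed: a missing element is returned as None
        next_button.click()
        page.wait_for_load_state()
    return products


def scrape_products(url, page_limit=1, headless=False):
    """Open a browser, scrape up to page_limit pages of url, return product dictionaries."""
    # Deferred imports: agentql and playwright must be installed before calling this.
    import agentql
    from playwright.sync_api import sync_playwright

    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=headless)
        try:
            page = agentql.wrap(browser.new_page())
            return collect_products(page, url, page_limit)
        finally:
            browser.close()
```

Calling scrape_products with a different store URL reuses the same queries unchanged, which is the point of the cross-website approach described above.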
Conclusion
- AgentQL provides a flexible, AI-powered tool for web scraping.
- Allows streamlined data extraction across various websites with minimal setup.
- Offers a free tier for initial usage.
Try It Yourself
- Visit agentql.com to get started.
- Leave suggestions for AgentQL scraping tasks in the video comments.
Thank you for watching, and see you in the next video!