Python Web Scraping Tutorial with Beautiful Soup

May 30, 2024

Introduction

  • Host: A guest instructor on freeCodeCamp, also creator of the Gymshape Coding YouTube channel.
  • Goal: Teach web scraping using the Beautiful Soup library.
  • Applications: Banking websites, job postings (e.g., LinkedIn), Wikipedia, sports sites, etc.

Overview

  • Basics: Scraping a basic HTML page first to understand the concepts.
  • Advanced: Moving on to scraping real websites.
  • Final: Storing scraped information.

Understanding HTML Structure

  • Tags: HTML documents are created with tags.
  • Main Tags: <html>, <head>, <body>, etc.
  • Attributes: Tags can have attributes like class or id.
  • Anatomy of Tags: <h1>, <div>, <a>, <p>, etc.
  • Classes: Reused across elements so multiple elements can be styled (and selected) the same way.
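The ideas above can be seen in one small sketch. The markup below is hypothetical, and it is parsed with Beautiful Soup's built-in 'html.parser' so no extra parser install is needed:

```python
from bs4 import BeautifulSoup

# A minimal page illustrating nested tags, an id attribute,
# and a class reused by two elements (all content is made up).
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1 id="main-title">Courses</h1>
    <div class="card"><a href="/python">Python</a></div>
    <div class="card"><a href="/sql">SQL</a></div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1["id"])                             # -> main-title
print(len(soup.find_all("div", class_="card")))  # the class is reused -> 2
```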

Basic Web Scraping with Beautiful Soup

Setup and Installation

  • Install Beautiful Soup: pip install beautifulsoup4
  • Install Parser: pip install lxml

Working with HTML Files

  • Read HTML File: Open and read the HTML file's content in Python using with open.
  • Parse HTML with Beautiful Soup:
    from bs4 import BeautifulSoup

    # Read the local HTML file (placeholder name) and hand it to Beautiful Soup.
    with open('home.html') as html_file:
        content = html_file.read()

    soup = BeautifulSoup(content, 'lxml')
    

Extracting Data

  • Finding Elements: Use soup.find() for the first matching tag, or soup.find_all() for a list of every match.
  • Extracting Text: Pull a tag's visible text out with its .text attribute.
  • Pretty Print HTML: Use soup.prettify() to print the HTML with readable indentation.
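All three operations can be sketched on a small hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML to demonstrate find(), find_all(), .text, and prettify().
html = "<div><h5>Python for Beginners</h5><h5>Web Scraping 101</h5></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h5")                             # first matching tag only
titles = [tag.text for tag in soup.find_all("h5")]  # every matching tag

print(first.text)       # -> Python for Beginners
print(titles)           # -> ['Python for Beginners', 'Web Scraping 101']
print(soup.prettify())  # the same HTML, indented for readability
```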

Example: Extracting Course Names

  • Find All <h5> Tags: Retrieve all headers on the page.
  • Iterate and Print Text:
    courses_html_tags = soup.find_all('h5')
    for course in courses_html_tags:
        print(course.text)
    

Advanced Web Scraping - Real Website

Inspecting Elements

  • Using Browser Inspector: Inspect HTML structure to understand element hierarchy.
  • Locate Classes or IDs: Use class or ID attributes to filter elements.
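Once the inspector reveals the relevant class, it can be passed to find_all() via the class_ keyword (note the trailing underscore, since class is a reserved word in Python). A sketch with hypothetical markup and class names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, as a browser inspector might reveal it:
# each job post is an <li> carrying a distinctive class.
html = """
<ul>
  <li class="job-post">Python Developer</li>
  <li class="job-post">Data Engineer</li>
  <li class="sponsored-ad">Buy our course!</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because `class` is a Python keyword.
jobs = soup.find_all("li", class_="job-post")
print([job.text for job in jobs])  # the sponsored ad is filtered out
```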

Real World Example: Job Listings from a Website

  • Target Elements: Use the browser inspector to identify tags and classes related to job posts.
  • Python Web Scraper:
    import requests
    from bs4 import BeautifulSoup

    response = requests.get('target_url')  # placeholder URL
    soup = BeautifulSoup(response.text, 'lxml')
    jobs = soup.find_all('li', class_='job_class')  # placeholder class name
    for job in jobs:
        print(job.text)  # placeholder: pull out the details you need here
    

Filtering Results

  • Skills Filter: Exclude job posts that require skills you don't have.
  • User Input: Capture the skill to exclude from the user, e.g. with input().
  • Conditional Filtering:
    if unfamiliar_skill not in job_skills:
        print(job)  # process only jobs that don't require the excluded skill
    

Storing Scraped Data

  • Write to File: Save data to text files.
  • File Operations: Using with open(file_path, 'w') as file: file.write(data).
  • Organize Data: Write job details into separate text files for each post.
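The per-post file writing can be sketched like this; the job data and the posts directory name are hypothetical stand-ins for real scraped results:

```python
from pathlib import Path

# Hypothetical scraped results: one dict per job post.
jobs = [
    {"title": "Python Developer", "skills": "django, flask"},
    {"title": "Data Engineer", "skills": "sql, spark"},
]

out_dir = Path("posts")
out_dir.mkdir(exist_ok=True)

for index, job in enumerate(jobs):
    # One text file per post, named after its position in the results.
    with open(out_dir / f"{index}.txt", "w") as file:
        file.write(f"Title: {job['title']}\n")
        file.write(f"Skills: {job['skills']}\n")
```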

Automating Scraping

  • Schedule Execution:
    import time
    while True:
        find_jobs()
        time.sleep(600)  # wait 10 minutes before the next run
    

Finalizing and Enhancements

  • Clean Code: Use f-strings for output, str.replace() to strip unwanted characters, and tidy up extracted HTML text.
  • Future Enhancements: Allow multiple unfamiliar skills and extend functionality.
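The multiple-skills enhancement could be sketched as follows; the job data and input string are hypothetical, and a real run would read the string from input():

```python
# Hypothetical extension: exclude several skills at once. The comma-separated
# string would normally come from input(); it is hard-coded here for the demo.
user_input = "swift, ruby"
unfamiliar_skills = [skill.strip() for skill in user_input.split(",")]

jobs = [
    {"title": "Python Developer", "skills": ["python", "django"]},
    {"title": "iOS Developer", "skills": ["swift"]},
    {"title": "Rails Developer", "skills": ["ruby"]},
]

# Keep only jobs that require none of the excluded skills.
matches = [
    job for job in jobs
    if not any(skill in job["skills"] for skill in unfamiliar_skills)
]
print([job["title"] for job in matches])  # -> ['Python Developer']
```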

Conclusion

  • Recap: Web scraping with Beautiful Soup, practical examples, scheduling, and file handling.
  • Future Work: Exploring further enhancements and handling dynamic websites.
  • Call to Action: Check out further resources and subscribe for more tutorials.