
Python Web Scraping with Beautiful Soup

Jul 14, 2024

Introduction

  • Special Python tutorial on web scraping using Beautiful Soup.
  • Thanks to FreeCodeCamp for the guest opportunity.
  • Mentioned personal YouTube channel "Gymshape Coding" for additional tech content.

Overview

  • Objective: Teach web scraping concepts using the Beautiful Soup library in Python.
  • Example use cases: monitoring a bank account, job sites like LinkedIn, Wikipedia, sports sites, etc.
  • Plan: 3 Parts:
    1. Scrape a basic HTML page to understand concepts.
    2. Scrape a real website.
    3. Store the scraped information.

HTML Page Structure Breakdown

  • Basic components: title, paragraphs, button, price.
  • HTML Tags: <html>, <head>, <body>, <div>, <h1>.
  • Special Notes:
    • The <head> tag holds meta tags and a <link> tag for the stylesheet.
    • The <body> tag holds the content displayed on the page.
    • Important HTML classes: card, card-header, card-body, etc.

Python with Beautiful Soup

  1. **Installing Beautiful Soup and the lxml parser:**
    • pip install beautifulsoup4
    • pip install lxml
  2. **Importing Libraries:**
    • from bs4 import BeautifulSoup
  3. **Reading HTML Files in Python:**
    • Use with open to read the file content.
    • Example: html_file.read()
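The steps above can be sketched as follows. The file name home.html and its contents are assumptions for illustration; the file is written inline here so the example is self-contained, whereas in the tutorial an existing page on disk is read:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

# Create a stand-in for the tutorial's local page (assumed content),
# so the reading step below has something to open.
with open("home.html", "w") as f:
    f.write("<html><body><h5>Python for Beginners</h5></body></html>")

# Read the HTML file's content, as in the steps above.
with open("home.html") as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, "lxml")
print(soup.prettify())  # formatted view of the parsed HTML
```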

Basic Web Scraping

  • Creating Beautiful Soup instance:
    • soup = BeautifulSoup(content, 'lxml')
    • Use soup.prettify() to print formatted HTML.
  • Extracting HTML tags:
    • soup.find vs. soup.find_all for one or multiple elements.
    • Examples: extracting <h5> tags for course names, prices, etc.
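A minimal sketch of find vs. find_all; the card markup and class names below are assumptions echoing the page described earlier:

```python
from bs4 import BeautifulSoup

# Hypothetical markup similar to the course-card page above.
html = """
<div class="card">
  <div class="card-body"><h5>Python for Beginners</h5><a class="btn">Start for 20$</a></div>
</div>
<div class="card">
  <div class="card-body"><h5>Web Development</h5><a class="btn">Start for 50$</a></div>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# find returns the first matching element; find_all returns all of them.
first = soup.find("h5")
courses = soup.find_all("h5")
print(first.text)
for course in courses:
    print(course.text)
```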

Scraping Real Websites

  • Requests Library:
    • pip install requests
    • Using requests.get(url).text to fetch webpage content.
  • Identifying HTML Structure:
    • Use browser's inspect tool to find relevant tags and classes.
    • Example: extracting job ads based on class names.
  • Combining Beautiful Soup with Requests:
    • Creating Beautiful Soup instance with fetched content.
    • Extracting desired tags and attributes using filters.
    • Example: soup.find_all('li', class_='job-entry')
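A sketch of combining the two libraries. The HTML string below stands in for the response text so the parsing logic is visible without a network call; the tag and class names ('job-entry', 'company') are assumptions, not the real site's markup:

```python
from bs4 import BeautifulSoup

# In a real run the content would be fetched first:
#   import requests  # pip install requests
#   html_text = requests.get("https://example.com/jobs").text
# A hypothetical snippet stands in for the response here.
html_text = """
<ul>
  <li class="job-entry"><h2>Python Developer</h2><span class="company">Acme</span></li>
  <li class="job-entry"><h2>Data Engineer</h2><span class="company">Globex</span></li>
</ul>
"""

soup = BeautifulSoup(html_text, "lxml")
# Note the trailing underscore in class_ (class is a Python keyword).
jobs = soup.find_all("li", class_="job-entry")
for job in jobs:
    print(job.h2.text, "-", job.find("span", class_="company").text)
```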

Advanced Filtering

  • Extracting Conditional Content:
    • Extract based on post dates, job titles, skills, etc.
    • Example: filtering jobs posted within a few days.
  • String Manipulation:
    • Removing extra whitespace using replace and strip methods.
    • Extracting and printing job details such as company name, required skills, and posting date.
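The cleanup and date filtering might look like this; the ad snippet and its class names are assumptions, with messy whitespace typical of scraped text:

```python
from bs4 import BeautifulSoup

# Hypothetical job ad with the kind of stray whitespace scraping produces.
html = """
<li class="job-entry">
  <h2> Python Developer </h2>
  <span class="skills">Python, Django,   SQL</span>
  <span class="date">Posted few days ago</span>
</li>
"""
soup = BeautifulSoup(html, "lxml")
job = soup.find("li", class_="job-entry")

# strip() trims surrounding whitespace; replace() collapses inner runs.
title = job.h2.text.strip()
skills = job.find("span", class_="skills").text.replace("  ", "").strip()
date = job.find("span", class_="date").text

# Keep only recently posted jobs.
if "few" in date:
    print(f"{title} | skills: {skills}")
```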

Automating and Storing Scraped Data

  • **Writing to Files:**
    • Store each job ad information in a separate text file in a directory.
    • Use file I/O operations: with open('<filename>', 'w') as f and f.write(<content>)
  • **Automating Scraping:**
    • Using a loop and time.sleep() to run scripts at regular intervals.
  • **Dynamic User Input:**
    • Allow user to enter unfamiliar skills to filter out irrelevant job posts.
    • Example: input() to take user input and filter results accordingly.
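The storage and filtering steps can be sketched as below. The job records, file naming, and the "Spark" skill are assumptions; in the full script the records come from find_all and the skill from input():

```python
import time

# Hypothetical scraped results; in the real script these come from find_all.
jobs = [
    {"title": "Python Developer", "company": "Acme", "skills": "Python, SQL"},
    {"title": "Data Engineer", "company": "Globex", "skills": "Spark, SQL"},
]

unfamiliar_skill = "Spark"  # in the tutorial this comes from input()

def save_jobs(jobs, unfamiliar_skill):
    for index, job in enumerate(jobs):
        if unfamiliar_skill in job["skills"]:
            continue  # skip posts requiring a skill we lack
        # One text file per ad; the naming scheme is an assumption.
        with open(f"job_{index}.txt", "w") as f:
            f.write(f"{job['title']} at {job['company']}\nSkills: {job['skills']}\n")

# Re-run at regular intervals, e.g. every 10 minutes:
# while True:
#     save_jobs(jobs, unfamiliar_skill)
#     time.sleep(600)
save_jobs(jobs, unfamiliar_skill)
```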

Conclusion

  • Final Program:
    • Runs in intervals, scrapes job postings, filters based on user input, and stores results.
    • Dynamic and useful for tracking changes or updates on websites like job boards.
  • **Potential Challenges:**
    • Accepting multiple unfamiliar skills as input.
    • Adjusting code for websites with frequently changing HTML structure.

Extra Resources

  • Mention of further readings or videos on the YouTube channel "Gymshape Coding".
  • Encouraged to explore more advanced web scraping projects and customizations.