Python Web Scraping Tutorial with Beautiful Soup

May 30, 2024

Introduction

  • Host: A guest instructor on freeCodeCamp, also creator of the Gymshape Coding YouTube channel.
  • Goal: Teach web scraping using the Beautiful Soup library.
  • Applications: Banking websites, job postings (e.g., LinkedIn), Wikipedia, sports sites, etc.

Overview

  • Basics: Scraping a basic HTML page first to understand the concepts.
  • Advanced: Moving on to scraping real websites.
  • Final: Storing scraped information.

Understanding HTML Structure

  • Tags: HTML documents are created with tags.
  • Main Tags: <html>, <head>, <body>, etc.
  • Attributes: Tags can have attributes like class or id.
  • Anatomy of Tags: <h1>, <div>, <a>, <p>, etc.
  • Classes: Reused across elements so multiple elements can be styled (and selected) the same way.
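The ideas above can be seen in one small sketch. The markup below is hypothetical, and it is parsed with Beautiful Soup's built-in 'html.parser' so no extra parser install is needed:

```python
from bs4 import BeautifulSoup

# A minimal page illustrating nested tags, an id attribute,
# and a class reused by two elements (all content is made up).
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1 id="main-title">Courses</h1>
    <div class="card"><a href="/python">Python</a></div>
    <div class="card"><a href="/sql">SQL</a></div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1["id"])                             # -> main-title
print(len(soup.find_all("div", class_="card")))  # the class is reused -> 2
```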

Basic Web Scraping with Beautiful Soup

Setup and Installation

  • Install Beautiful Soup: pip install beautifulsoup4
  • Install Parser: pip install lxml

Working with HTML Files

  • Read HTML File: Open and read the HTML file's content in Python using with open.
  • Parse HTML with Beautiful Soup:
    from bs4 import BeautifulSoup

    # Read the local HTML file (placeholder name) and hand it to Beautiful Soup.
    with open('home.html') as html_file:
        content = html_file.read()

    soup = BeautifulSoup(content, 'lxml')
    

Extracting Data

  • Finding Elements: Use soup.find() for the first matching tag, or soup.find_all() for a list of every match.
  • Extracting Text: Pull a tag's visible text out with its .text attribute.
  • Pretty Print HTML: Use soup.prettify() to print the HTML with readable indentation.
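All three operations can be sketched on a small hypothetical snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML to demonstrate find(), find_all(), .text, and prettify().
html = "<div><h5>Python for Beginners</h5><h5>Web Scraping 101</h5></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("h5")                             # first matching tag only
titles = [tag.text for tag in soup.find_all("h5")]  # every matching tag

print(first.text)       # -> Python for Beginners
print(titles)           # -> ['Python for Beginners', 'Web Scraping 101']
print(soup.prettify())  # the same HTML, indented for readability
```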

Example: Extracting Course Names

  • Find All <h5> Tags: Retrieve all headers on the page.
  • Iterate and Print Text:
    courses_html_tags = soup.find_all('h5')
    for course in courses_html_tags:
        print(course.text)
    

Advanced Web Scraping - Real Website

Inspecting Elements

  • Using Browser Inspector: Inspect HTML structure to understand element hierarchy.
  • Locate Classes or IDs: Use class or ID attributes to filter elements.
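Once the inspector reveals the relevant class, it can be passed to find_all() via the class_ keyword (note the trailing underscore, since class is a reserved word in Python). A sketch with hypothetical markup and class names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, as a browser inspector might reveal it:
# each job post is an <li> carrying a distinctive class.
html = """
<ul>
  <li class="job-post">Python Developer</li>
  <li class="job-post">Data Engineer</li>
  <li class="sponsored-ad">Buy our course!</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because `class` is a Python keyword.
jobs = soup.find_all("li", class_="job-post")
print([job.text for job in jobs])  # the sponsored ad is filtered out
```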

Real World Example: Job Listings from a Website

  • Target Elements: Use the browser inspector to identify tags and classes related to job posts.
  • Python Web Scraper:
    import requests
    from bs4 import BeautifulSoup

    response = requests.get('target_url')  # placeholder URL
    soup = BeautifulSoup(response.text, 'lxml')
    jobs = soup.find_all('li', class_='job_class')  # placeholder class name
    for job in jobs:
        print(job.text)  # placeholder: pull out the details you need here
    

Filtering Results

  • Skills Filter: Exclude job posts that require skills you don't have.
  • User Input: Capture the skill to exclude from the user, e.g. with input().
  • Conditional Filtering:
    if unfamiliar_skill not in job_skills:
        print(job)  # process only jobs that don't require the excluded skill
    

Storing Scraped Data

  • Write to File: Save data to text files.
  • File Operations: Using with open(file_path, 'w') as file: file.write(data).
  • Organize Data: Write job details into separate text files for each post.
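The per-post file writing can be sketched like this; the job data and the posts directory name are hypothetical stand-ins for real scraped results:

```python
from pathlib import Path

# Hypothetical scraped results: one dict per job post.
jobs = [
    {"title": "Python Developer", "skills": "django, flask"},
    {"title": "Data Engineer", "skills": "sql, spark"},
]

out_dir = Path("posts")
out_dir.mkdir(exist_ok=True)

for index, job in enumerate(jobs):
    # One text file per post, named after its position in the results.
    with open(out_dir / f"{index}.txt", "w") as file:
        file.write(f"Title: {job['title']}\n")
        file.write(f"Skills: {job['skills']}\n")
```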

Automating Scraping

  • Schedule Execution:
    import time
    while True:
        find_jobs()
        time.sleep(600)  # wait 10 minutes before the next run
    

Finalizing and Enhancements

  • Clean Code: Use f-strings for output, str.replace() to strip unwanted characters, and tidy up extracted HTML text.
  • Future Enhancements: Allow multiple unfamiliar skills and extend functionality.
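The multiple-skills enhancement could be sketched as follows; the job data and input string are hypothetical, and a real run would read the string from input():

```python
# Hypothetical extension: exclude several skills at once. The comma-separated
# string would normally come from input(); it is hard-coded here for the demo.
user_input = "swift, ruby"
unfamiliar_skills = [skill.strip() for skill in user_input.split(",")]

jobs = [
    {"title": "Python Developer", "skills": ["python", "django"]},
    {"title": "iOS Developer", "skills": ["swift"]},
    {"title": "Rails Developer", "skills": ["ruby"]},
]

# Keep only jobs that require none of the excluded skills.
matches = [
    job for job in jobs
    if not any(skill in job["skills"] for skill in unfamiliar_skills)
]
print([job["title"] for job in matches])  # -> ['Python Developer']
```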

Conclusion

  • Recap: Web scraping with Beautiful Soup, practical examples, scheduling, and file handling.
  • Future Work: Exploring further enhancements and handling dynamic websites.
  • Call to Action: Check out further resources and subscribe for more tutorials.