Lecture on Sentiment Analysis with BERT and Transformers
Introduction
- Topic: Understanding Sentiment Analysis using BERT and the Transformers library.
- Goal: Build a sentiment analysis model and apply it to Yelp reviews.
Steps Overview
- Install and import dependencies.
- Instantiate and download the pre-trained BERT model.
- Perform sentiment scoring on sample text.
- Scrape and analyze Yelp reviews.
- Store the results in a pandas DataFrame.
Installing Dependencies
- Transformers: For NLP models and sentiment analysis.
- PyTorch: Deep learning backend used by the Transformers library.
- Requests: For making HTTP requests to scrape data.
- BeautifulSoup: For parsing HTML and extracting data.
- Pandas: For data manipulation and storage.
- NumPy: Array utilities used when building the DataFrame.
Installation Commands
!pip install transformers
!pip install torch
!pip install requests
!pip install beautifulsoup4
!pip install pandas
!pip install numpy
Loading the BERT Model
- Tokenizer: Converts text into a sequence of token IDs the model can process.
- Model: AutoModelForSequenceClassification for sequence classification tasks.
- Tokenizer & Model Initialization:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch # also for argmax
import requests
from bs4 import BeautifulSoup
import re
# Initialize tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
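Optional sanity check (a sketch; num_labels and id2label are standard attributes on a Transformers model config, but the exact label names depend on the checkpoint):
print(model.config.num_labels)  # expected: 5 classes, one per star rating
print(model.config.id2label)    # e.g. {0: '1 star', ..., 4: '5 stars'}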
Sentiment Scoring Process
Tokenize Text
text = "I hated this, absolutely the worst!"
tokens = tokenizer.encode(text, return_tensors='pt')
Predict Sentiment
result = model(tokens)
sentiment = torch.argmax(result.logits)
sentiment_score = sentiment.item() + 1  # classes 0-4 map to a 1-5 star score
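The model returns raw logits over the five star classes; if a confidence estimate is useful, a softmax converts them to probabilities. A minimal sketch using standard PyTorch:
import torch.nn.functional as F
probs = F.softmax(result.logits, dim=1)   # probabilities over the 5 star classes
print(sentiment_score)                    # predicted star rating, likely 1 for this negative example
print(probs[0][sentiment].item())         # model's confidence in that rating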
Web Scraping Yelp Reviews
Scraper Function
- Request Yelp Page
url = 'https://www.yelp.com/biz/mexico-sydney-2'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
- Find Reviews in HTML
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class': regex})
- Extract Text
reviews = [result.text for result in results]
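Yelp's markup changes over time, so the class regex may stop matching; a quick check that the scrape actually returned text (a sketch):
print(len(reviews))   # number of review paragraphs found
print(reviews[0])     # first review as plain text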
Aggregating Reviews in a DataFrame
Create DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array(reviews), columns=['review'])
Apply Sentiment Analysis to each review
def sentiment_score(text):
    tokens = tokenizer.encode(text, return_tensors='pt')
    result = model(tokens)
    sentiment = torch.argmax(result.logits)
    return sentiment.item() + 1
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))  # truncate long reviews so they stay within BERT's 512-token limit
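To eyeball the results, standard pandas calls work (a sketch):
print(df.head())               # review text alongside its 1-5 star score
print(df['sentiment'].mean())  # rough average rating across the scraped reviews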
Testing with another Yelp page
Steps
- Change the URL in the scraper code (a reusable helper is sketched below).
- Re-run the code block to scrape and analyze the new reviews.
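To avoid copy-pasting the scraping cell for each new page, the steps above can be wrapped into one helper. A minimal sketch, assuming the same class-regex selector still matches Yelp's HTML; the function name and URL below are hypothetical:
def scrape_and_score(url):
    # Download the page and pull out the review paragraphs (same selector as above)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    reviews = [p.text for p in soup.find_all('p', {'class': re.compile('.*comment.*')})]
    # Score each review, truncating to the first 512 characters to stay within BERT's input limit
    df = pd.DataFrame(reviews, columns=['review'])
    df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))
    return df

df2 = scrape_and_score('https://www.yelp.com/biz/some-other-restaurant')  # hypothetical URL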
Summary
- Installed necessary libraries.
- Used BERT model and tokenizer from Transformers for sentiment analysis.
- Scraped reviews from Yelp and analyzed sentiment.
- Aggregated results in pandas DataFrame.
- The approach extends readily to other data sources and, since the model is multilingual, to reviews in other languages.
End Note: This approach is useful for businesses looking to gauge customer sentiment from reviews.
Fun Fact: Some models can analyze text in multiple languages, making them versatile for international applications.
Tools Mentioned
- PyTorch
- Transformers Library
- Beautiful Soup
- Pandas
- Numpy
- Mito (for Excel to Python transformations)