Natural Language Processing Project - Sentiment Analysis on Amazon Reviews

Introduction

Instructor: Rob
Content Overview: Sentiment analysis using VADER and RoBERTa models on Amazon reviews.
Additional resources: Everything shared in a Kaggle notebook (link in description).

Python Libraries:
- pandas: For data manipulation.
- numpy: For numerical operations.
- matplotlib, seaborn: For plotting and visualization.
- nltk: Natural language toolkit for text processing.
- transformers (Hugging Face): For using pre-trained deep learning models.
- tqdm: For progress bar visualization.

Dataset: Amazon fine food reviews.
Content: Reviews and ratings (1-5 stars) in CSV format.
Data Size: Approximately half a million reviews, downsampled to 500 for this project.

Data Reading:
- Example: pd.read_csv('reviews.csv')
- Preview data using df.head().
Exploratory Data Analysis (EDA):
- Score counts visualization using a bar plot.
- Findings: Most reviews are 5 stars, followed by 1 star.
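
The loading and score-count steps above can be sketched as follows. This is a minimal sketch, not the notebook's code: the column name `Score`, the helper names, and the choice to keep the first 500 rows are assumptions for illustration.

```python
import pandas as pd


def load_reviews(path, n_rows=500):
    """Read the reviews CSV and keep the first n_rows rows (assumed downsampling)."""
    return pd.read_csv(path).head(n_rows)


def score_counts(df, score_col="Score"):
    """Count reviews per star rating, ordered from 1 to 5 stars."""
    return df[score_col].value_counts().sort_index()


def plot_score_counts(df, score_col="Score"):
    """Draw the counts as a bar plot (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 5))
    score_counts(df, score_col).plot(kind="bar", ax=ax, title="Count of reviews by stars")
    ax.set_xlabel("Review stars")
    return ax
```

Typical use would be `df = load_reviews('reviews.csv')`, then `df.head()` to preview and `plot_score_counts(df)` to draw the bar plot.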

Tokenization:
- Example: nltk.word_tokenize(text), splits text into tokens.
Part of Speech Tagging:
- Example: nltk.pos_tag(tokens), assigns POS tags to tokens.
Chunking:
- Example: nltk.ne_chunk(tagged), groups POS-tagged tokens into named-entity chunks and returns them as a tree.
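
The three NLTK calls above feed into each other, which a small sketch makes explicit. The function names here are hypothetical, and the code assumes the required NLTK data (tokenizer, tagger, chunker, and word list) has already been fetched with `nltk.download`; the exact resource names differ between NLTK releases, so the error message is the best guide.

```python
def tokenize_tag_chunk(text):
    """Run the three NLTK steps in order and return each intermediate result."""
    import nltk  # imported lazily; the calls below need NLTK data to be installed

    tokens = nltk.word_tokenize(text)  # list of word and punctuation strings
    tagged = nltk.pos_tag(tokens)      # list of (token, POS tag) pairs
    tree = nltk.ne_chunk(tagged)       # nltk.Tree with named-entity subtrees
    return tokens, tagged, tree


def tags_only(tagged):
    """Keep just the POS tags from a list of (token, tag) pairs."""
    return [tag for _, tag in tagged]
```

Calling `tokenize_tag_chunk(df['Text'][50])` on one review (assuming a `Text` column) returns all three stages, and `tree.pprint()` prints the chunk tree.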

VADER: Valence Aware Dictionary and Sentiment Reasoner.
- Features: Analyzes text based on a lexicon of positive, negative, and neutral words.
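- Process: Looks up each word in the lexicon and combines the word scores into negative, neutral, positive, and compound scores. Words missing from the lexicon, which includes most stop words, add nothing to the score.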
- Implementation:
  - Create analyzer: sia = SentimentIntensityAnalyzer().
  - Example: sia.polarity_scores(text).
Analysis:
- Running sentiment analysis on each review.
- Store results in a pandas DataFrame.
- Findings: VADER sentiment scores track the star ratings: higher-rated reviews receive more positive scores.
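
One way to run VADER over every review and collect the results in a DataFrame is sketched below. The column names `Id`, `Text`, and `Score` and the helper names are assumptions; `polarity_scores` returns a dict with the keys `neg`, `neu`, `pos`, and `compound`.

```python
import pandas as pd


def make_vader():
    """Create the VADER analyzer (needs NLTK's vader_lexicon resource)."""
    from nltk.sentiment import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer()


def score_reviews(df, analyzer, text_col="Text", id_col="Id"):
    """Score each review and merge the scores back onto the original rows."""
    records = []
    for review_id, text in zip(df[id_col], df[text_col]):
        scores = analyzer.polarity_scores(text)  # dict: neg, neu, pos, compound
        records.append({id_col: review_id, **scores})
    return pd.DataFrame(records).merge(df, on=id_col, how="left")


def mean_compound_by_score(scored, score_col="Score"):
    """Average compound score per star rating, for comparison with the ratings."""
    return scored.groupby(score_col)["compound"].mean()
```

With real data this would be `scored = score_reviews(df, make_vader())`. Passing the analyzer in as an argument keeps the loop easy to try with any object that has a `polarity_scores` method.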

RoBERTa Model: A transformer-based language model, used here through a checkpoint fine-tuned for sentiment analysis.
- From Hugging Face’s library.
- Pretrained on large datasets (e.g., Twitter data).
Steps:
- Import tokenizer and model: AutoTokenizer, AutoModelForSequenceClassification.
- Tokenize text and run the model to get sentiment scores.
- Convert the model's output logits to probabilities using softmax.
- Example function for sentiment scoring: polarity_scores_roberta(text).
Analysis:
- Comparison with VADER results.
- Findings: RoBERTa shows more distinct and confident sentiment separation.
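
The RoBERTa steps above might look roughly like this. The notes do not name the checkpoint, so `cardiffnlp/twitter-roberta-base-sentiment` is an assumption, as is its label order (negative, neutral, positive). Unlike the one-argument `polarity_scores_roberta(text)` mentioned above, this sketch passes the tokenizer and model in explicitly, and it implements softmax with NumPy rather than importing it.

```python
import numpy as np

# Assumption: an illustrative Twitter-trained checkpoint; the notes do not name one.
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"
LABELS = ("roberta_neg", "roberta_neu", "roberta_pos")  # assumed label order


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    exps = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exps / exps.sum()


def load_roberta(model_name=MODEL_NAME):
    """Download (on first use) and return the tokenizer and model."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return tokenizer, model


def polarity_scores_roberta(text, tokenizer, model):
    """Tokenize the text, run the model, and return softmax-normalized scores."""
    encoded = tokenizer(text, return_tensors="pt")  # PyTorch tensors
    output = model(**encoded)
    logits = output.logits[0].detach().numpy()
    return {label: float(p) for label, p in zip(LABELS, softmax(logits))}
```

Reviews longer than the model's maximum input length raise an error unless they are truncated or skipped, so a loop over all reviews may need a `try`/`except` or `truncation=True` in the tokenizer call.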

Pair Plot: Visualize the comparison between VADER and RoBERTa sentiment scores.
Key Observations:
- Positive correlation with star ratings for both models.
- RoBERTa shows clearer distinction in sentiment compared to VADER, especially for 1-star and 5-star reviews.
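
A pair plot of this kind can be drawn with seaborn's `pairplot`, colored by star rating. The column names below (`vader_neg`, `roberta_pos`, and so on) and the `Score` column are assumptions about how the combined results table is laid out.

```python
VADER_COLS = ["vader_neg", "vader_neu", "vader_pos"]
ROBERTA_COLS = ["roberta_neg", "roberta_neu", "roberta_pos"]


def pairplot_columns(df):
    """Return the sentiment columns that are present in df, VADER first."""
    return [col for col in VADER_COLS + ROBERTA_COLS if col in df.columns]


def plot_model_comparison(df, score_col="Score"):
    """Plot every sentiment column against every other, colored by star rating."""
    import seaborn as sns  # imported only when plotting

    return sns.pairplot(data=df, vars=pairplot_columns(df), hue=score_col, palette="tab10")
```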

Misclassifications: Examples where sentiment analysis results diverge from expected sentiment based on star ratings.
Insights: Context and nuance in language can cause discrepancies, highlighting the limitations of simpler models like VADER.
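
A simple way to surface such examples is to filter by star rating and sort by the opposing sentiment score. This is a sketch under assumed column names (`Score`, `Text`, and per-model score columns such as `roberta_pos`); the helper name is hypothetical.

```python
import pandas as pd


def most_contradictory(df, stars, sentiment_col, text_col="Text", score_col="Score"):
    """Return the review text with the given star rating that scores highest on sentiment_col."""
    subset = df[df[score_col] == stars]
    if subset.empty:
        return None  # no reviews with that rating
    return subset.sort_values(sentiment_col, ascending=False)[text_col].iloc[0]
```

For example, `most_contradictory(df, 1, 'roberta_pos')` returns the 1-star review the model considers most positive, and `most_contradictory(df, 5, 'vader_neg')` returns the 5-star review VADER considers most negative.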

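Transformers Pipeline: Hugging Face's high-level pipeline API.
Usage: Very little setup is needed for sentiment analysis.
- Example: pipeline('sentiment-analysis').
- Downloads and configures a default model automatically when none is specified.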
Performance: Quick and easy for out-of-the-box sentiment analysis with reasonable accuracy.
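
A minimal sketch of the pipeline call, wrapped in a hypothetical helper so the classifier can be swapped out. With no argument it builds the default sentiment pipeline, which downloads a model on first use; each result is a dict with `label` and `score` keys.

```python
def classify(texts, classifier=None):
    """Return (label, score) pairs for each text using a sentiment pipeline."""
    if classifier is None:
        from transformers import pipeline  # imported lazily; triggers a model download

        classifier = pipeline("sentiment-analysis")
    results = classifier(list(texts))  # one dict per input text
    return [(result["label"], result["score"]) for result in results]
```

A call such as `classify(['I love sentiment analysis!'])` returns one pair per input text.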

Summary: Walkthrough of traditional and modern sentiment analysis approaches, comparison of different models, and practical implementation using Python libraries.
Call to Action: Subscribe for more content and follow live coding sessions on Twitch.