Sentiment Analysis on Amazon Reviews

Jul 18, 2024

Natural Language Processing Project - Sentiment Analysis on Amazon Reviews

Introduction

  • Instructor: Rob
  • Content Overview: Sentiment analysis using VADER and RoBERTa models on Amazon reviews.
  • Additional resources: The full code is shared in a Kaggle notebook (link in the video description).

Libraries and Tools Used

  • Python Libraries:
    • pandas: For data manipulation.
    • numpy: For numerical operations.
    • matplotlib, seaborn: For plotting and visualization.
    • nltk: Natural language toolkit for text processing.
    • transformers (Hugging Face): For using pre-trained deep learning models.
    • tqdm: For progress bar visualization.

Data Overview

  • Dataset: Amazon fine food reviews.
  • Content: Reviews and ratings (1-5 stars) in CSV format.
  • Data Size: Approximately half a million reviews, downsampled to 500 for this project.

Data Import and Basic Analysis

  • Data Reading:
    • Example: pd.read_csv('reviews.csv')
    • Preview data using df.head().
  • Exploratory Data Analysis (EDA):
    • Score counts visualization using a bar plot.
    • Findings: Most reviews are 5 stars, followed by 1 star.
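The import and score-count steps above can be sketched roughly as follows. The column names (Id, Score, Text), the helper names, and the tiny in-memory CSV are assumptions made for illustration. Keeping the first 500 rows is also only one possible reading of "downsampled".

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV so the sketch runs anywhere.
# The column names ("Id", "Score", "Text") are assumptions.
SAMPLE_CSV = io.StringIO(
    "Id,Score,Text\n"
    "1,5,Great coffee\n"
    "2,5,Tasty snack\n"
    "3,1,Arrived stale\n"
    "4,4,Pretty good\n"
    "5,5,Would buy again\n"
)


def load_reviews(source, n_rows=500):
    """Read the reviews and keep only the first n_rows (assumed downsampling)."""
    return pd.read_csv(source).head(n_rows)


def score_counts(df, score_col="Score"):
    """Count reviews per star rating, ordered by star value."""
    return df[score_col].value_counts().sort_index()


def plot_score_counts(counts):
    """Bar plot of the counts; pandas plotting needs matplotlib installed."""
    ax = counts.plot(kind="bar", title="Count of reviews by stars", figsize=(10, 5))
    ax.set_xlabel("Review stars")
    return ax


df = load_reviews(SAMPLE_CSV)  # with real data: load_reviews("reviews.csv")
print(df.head())
counts = score_counts(df)
print(counts)
```

Calling plot_score_counts(counts) would then draw the bar plot described in the EDA step.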

Basic NLTK Operations

  • Tokenization:
    • Example: nltk.word_tokenize(text), splits text into tokens.
  • Part of Speech Tagging:
    • Example: nltk.pos_tag(tokens), assigns POS tags to tokens.
  • Chunking:
    • Example: nltk.ne_chunk(tagged), groups POS-tagged tokens into named-entity chunks.
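The three calls above chain together, each one consuming the previous result. A minimal sketch, assuming the tokenizer, tagger, and chunker data have already been fetched with nltk.download (the resource names vary between NLTK versions, so they are not hard-coded here). The function name basic_nltk_steps is made up for illustration.

```python
import nltk


def basic_nltk_steps(text):
    """Tokenize, POS-tag, and named-entity-chunk a single piece of text."""
    tokens = nltk.word_tokenize(text)  # list of word and punctuation strings
    tagged = nltk.pos_tag(tokens)      # list of (token, POS tag) pairs
    entities = nltk.ne_chunk(tagged)   # nltk.Tree grouping named entities
    return tokens, tagged, entities


# Example call (requires the NLTK data packages to be installed):
# tokens, tagged, entities = basic_nltk_steps("This oatmeal is not good.")
# print(tokens[:10])
# print(tagged[:10])
# entities.pprint()
```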

Sentiment Analysis with VADER

  • VADER: Valence Aware Dictionary and Sentiment Reasoner.
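    • Features: Rule-based approach that scores text against a lexicon of words rated by sentiment.
    • Process: Scores each word against the lexicon and combines the results into negative, neutral, positive, and compound scores. Words that are not in the lexicon, such as most stop words, contribute no sentiment.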
    • Implementation:
      • Create analyzer: sia = SentimentIntensityAnalyzer().
      • Example: sia.polarity_scores(text), returns a dict with neg, neu, pos, and compound scores.
  • Analysis:
    • Running sentiment analysis on each review.
    • Store results in a pandas DataFrame.
    • Findings: Comparing sentiment scores with star ratings shows that higher star ratings tend to go with more positive sentiment.
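One way to run the analyzer over every review and collect the results in a DataFrame is sketched below. The column names (Id, Score, Text), the vader_ prefix, and both function names are assumptions for illustration. The analyzer is passed in as an argument, so any object with a polarity_scores method works.

```python
import pandas as pd


def make_vader():
    """Build the NLTK VADER analyzer (downloads the lexicon if needed)."""
    # Imported here so the DataFrame logic below can be used without NLTK.
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def score_reviews_vader(df, analyzer, id_col="Id", text_col="Text"):
    """Score every review and join the scores back onto the review data."""
    records = []
    for review_id, text in zip(df[id_col], df[text_col]):
        scores = analyzer.polarity_scores(text)
        # Prefix the keys so they stay distinguishable from other models' scores.
        records.append({id_col: review_id,
                        **{f"vader_{k}": v for k, v in scores.items()}})
    return pd.DataFrame(records).merge(df, on=id_col, how="left")


# Example usage (assumes df has Id, Score, and Text columns):
# vaders = score_reviews_vader(df, make_vader())
# print(vaders.groupby("Score")["vader_compound"].mean())
```

The grouped mean in the usage lines is a quick numeric check of the relationship between star rating and compound score before plotting it.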

Sentiment Analysis with RoBERTa

  • RoBERTa Model: A transformer-based language model, used here through a checkpoint trained for sentiment classification.
    • Loaded through Hugging Face’s transformers library.
    • Pretrained on large datasets (e.g., Twitter data).
  • Steps:
    • Import tokenizer and model: AutoTokenizer, AutoModelForSequenceClassification.
    • Tokenize text and run the model to get sentiment scores.
    • Apply softmax to the model's output scores so they become probabilities that sum to 1.
    • Example function for sentiment scoring: polarity_scores_roberta(text).
  • Analysis:
    • Comparison with VADER results.
    • Findings: RoBERTa separates positive and negative reviews more sharply than VADER and assigns more confident scores.
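The steps above might look roughly like the sketch below. Several things here are assumptions rather than details from these notes: the checkpoint name, the negative/neutral/positive label order (which matches that checkpoint), the truncation of long reviews, and a PyTorch install for return_tensors="pt". Softmax is written out with numpy, and the function takes the tokenizer and model as arguments, unlike the single-argument polarity_scores_roberta(text) named above.

```python
import numpy as np

# Assumed checkpoint; the notes only say the model was trained on Twitter data.
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment"


def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


def load_roberta(model_name=MODEL_NAME):
    """Download (on first use) and return the tokenizer and model."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return tokenizer, model


def polarity_scores_roberta(text, tokenizer, model):
    """Return negative, neutral, and positive probabilities for one text."""
    # Truncation is an assumption: it keeps long reviews within the model limit.
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    output = model(**encoded)
    probs = softmax(output.logits[0].detach().numpy())
    return {
        "roberta_neg": float(probs[0]),
        "roberta_neu": float(probs[1]),
        "roberta_pos": float(probs[2]),
    }


# Example usage:
# tokenizer, model = load_roberta()
# print(polarity_scores_roberta("This oatmeal is not good.", tokenizer, model))
```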

Comparative Analysis

  • Pair Plot: Visualize the comparison between VADER and RoBERTa sentiment scores.
  • Key Observations:
    • Positive correlation with star ratings for both models.
    • RoBERTa shows clearer distinction in sentiment compared to VADER, especially for 1-star and 5-star reviews.
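A sketch of the comparison, assuming the scores from both models sit in one DataFrame next to the star rating. The column names and the sample values below are made up for illustration and are not results from the dataset.

```python
import pandas as pd

# Assumed column names for the two models' scores.
VADER_COLS = ["vader_neg", "vader_neu", "vader_pos"]
ROBERTA_COLS = ["roberta_neg", "roberta_neu", "roberta_pos"]


def mean_scores_by_rating(results, rating_col="Score"):
    """Average each model's scores within every star rating."""
    return results.groupby(rating_col)[VADER_COLS + ROBERTA_COLS].mean()


def plot_comparison(results, rating_col="Score"):
    """Pair plot of all score columns, colored by star rating."""
    # Imported here so the numeric summary works without plotting libraries.
    import seaborn as sns

    return sns.pairplot(data=results, vars=VADER_COLS + ROBERTA_COLS,
                        hue=rating_col, palette="tab10")


# Made-up scores for four reviews, for illustration only.
sample = pd.DataFrame({
    "Score": [1, 1, 5, 5],
    "vader_neg": [0.4, 0.2, 0.0, 0.0],
    "vader_neu": [0.5, 0.7, 0.6, 0.5],
    "vader_pos": [0.1, 0.1, 0.4, 0.5],
    "roberta_neg": [0.9, 0.8, 0.0, 0.1],
    "roberta_neu": [0.1, 0.1, 0.1, 0.1],
    "roberta_pos": [0.0, 0.1, 0.9, 0.8],
})
summary = mean_scores_by_rating(sample)
print(summary)
```

The grouped means give a numeric counterpart to the pair plot: if a model separates sentiment clearly, its positive column should rise and its negative column should fall as the star rating increases.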

Error Analysis

  • Misclassifications: Examples where sentiment analysis results diverge from expected sentiment based on star ratings.
  • Insights: Context and nuance in language can cause discrepancies, highlighting the limitations of simpler models like VADER.
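One simple way to surface such cases is to sort within a star rating by the score that should be low for that rating. The sketch below assumes a results DataFrame with Score, Text, and per-model score columns. The helper name, the column names, and the sample rows are invented for illustration.

```python
import pandas as pd


def most_surprising(results, stars, score_col, n=3, rating_col="Score"):
    """Return the n reviews with the given rating that score highest on score_col."""
    subset = results[results[rating_col] == stars]
    return subset.sort_values(score_col, ascending=False).head(n)


# Invented rows, not taken from the dataset.
sample = pd.DataFrame({
    "Score": [1, 1, 5, 5],
    "Text": [
        "Stale and tasteless.",
        "I felt energized for hours, then came the crash.",
        "Best tea I have ever had.",
        "This was so delicious it was dangerous, I could not stop eating it.",
    ],
    "roberta_pos": [0.05, 0.92, 0.97, 0.40],
    "vader_neg": [0.30, 0.00, 0.00, 0.25],
})

# 1-star review that the model rated most positive.
print(most_surprising(sample, 1, "roberta_pos", n=1)["Text"].values[0])
# 5-star review that scored most negative.
print(most_surprising(sample, 5, "vader_neg", n=1)["Text"].values[0])
```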

Hugging Face Pipelines

  • Usage: Extremely simple setup for sentiment analysis.
    • Example: pipeline('sentiment-analysis').
    • Automatically downloads and configures models.
  • Performance: Quick and easy for out-of-the-box sentiment analysis with reasonable accuracy.
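A minimal sketch of the pipeline usage. The helper names are made up, and the label values depend on whichever default model the library selects. Building the pipeline downloads a model on first use, so it is kept in its own function.

```python
def build_sentiment_pipeline():
    """Create the default sentiment pipeline (downloads a model on first use)."""
    from transformers import pipeline

    return pipeline("sentiment-analysis")


def classify(texts, classifier):
    """Run the classifier and pair each text with its label and score."""
    texts = list(texts)
    results = classifier(texts)  # one {"label": ..., "score": ...} dict per text
    return [(text, r["label"], r["score"]) for text, r in zip(texts, results)]


# Example usage:
# sent_pipeline = build_sentiment_pipeline()
# for text, label, score in classify(["I love sentiment analysis!", "Boo."], sent_pipeline):
#     print(f"{label:>8}  {score:.3f}  {text}")
```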

Conclusion

  • Summary: Walkthrough of traditional and modern sentiment analysis approaches, comparison of different models, and practical implementation using Python libraries.
  • Call to Action: Subscribe for more content and follow live coding sessions on Twitch.