Text Mining Techniques with Orange Software

Aug 11, 2024

Lecture: Text Mining using Orange Software

Introduction

  • Topic: Text Mining
  • Objective: Learn the basics of text mining, practice them in Orange, and then move on to NVivo.

Types of Text Data

  • Formats: PDF, Word, Text files, Excel files
  • Examples: Research papers, interview transcripts, news articles

Text Analytics

  • Importance: Critical for qualitative research
  • Key Concept: Natural Language Processing (NLP)
  • Frequency Analysis: Finding the words that are repeated most often
  • Context Analysis: Words appearing before/after a keyword
  • Sentiment Analysis: Categorizing words as positive, negative, or neutral

Cleaning Text Data

  • Why: To ensure meaningful analysis
  • Methods:
    • Remove numbers
    • Remove special characters (e.g., $, @)
    • Remove punctuation
    • Convert to a single case (lowercase or uppercase)
    • Remove extra whitespace
    • Handle stop words (e.g., 'the', 'and')
    • Stemming: reduce word variants to a common root (e.g., consulting to consult)
    • Synonym handling: treat synonyms as a single term (e.g., talk, speak)
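
The cleaning steps listed above can be sketched in a few lines of plain Python. This is only an illustration, not what Orange does internally: the stop word list, the synonym map, and the `naive_stem` helper are all hypothetical stand-ins, and a real project would use a proper stemmer and a much longer stop word list.

```python
import re
import string

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "in"}

# Hypothetical synonym map: each key is replaced by its preferred form.
SYNONYMS = {"speak": "talk"}

# Maps every ASCII punctuation/special character (including $ and @) to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    text = text.lower()                                  # single case
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(PUNCT_TABLE)                   # remove punctuation and special characters
    tokens = text.split()                                # split() also drops extra whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    tokens = [naive_stem(t) for t in tokens]             # stemming
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # merge synonyms
    return tokens


print(clean_text("The client is Consulting, and we SPEAK @ 9!"))
# → ['client', 'consult', 'we', 'talk']
```

The order of the steps matters: lowercasing comes first so that the stop word and synonym lookups match regardless of the original capitalization.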

Bag of Words

  • Definition: The collection of words that remain after cleaning, usually stored with their counts and without regard to word order
  • Usage: Forms basis for applying text mining algorithms
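
As a rough sketch of the idea, a bag of words can be built with `collections.Counter` from the standard library. The two documents and their tokens below are made up for illustration and are assumed to be already cleaned.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (token lists).
docs = {
    "doc1": ["client", "consult", "client", "report"],
    "doc2": ["report", "talk", "client"],
}

# One bag (word -> count) per document; word order is discarded.
bags = {name: Counter(tokens) for name, tokens in docs.items()}

# Shared vocabulary, sorted so every document uses the same column order.
vocabulary = sorted(set().union(*docs.values()))

# Document-term matrix: one row of counts per document.
matrix = {name: [bag[word] for word in vocabulary] for name, bag in bags.items()}

print(vocabulary)      # → ['client', 'consult', 'report', 'talk']
print(matrix["doc1"])  # → [2, 1, 1, 0]
print(matrix["doc2"])  # → [1, 0, 1, 1]
```

Rows of counts like these are the numeric input that clustering and similar algorithms work on.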

Software Demonstration: Orange

Steps to Import Data

  1. Import Documents: Select the folder containing the text data
  2. Corpus Viewer: View imported documents
  3. Create Word Cloud: Visual representation of word frequency

Pre-Processing

  • Options: Lowercase, remove accents, parse HTML, tokenization (e.g., regex)
  • Filtering: Remove stop words, lexicon-specific words, numbers
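
To make two of these options concrete, here is a minimal sketch of accent removal and regex tokenization using only the Python standard library. It is meant to show what the options do conceptually and is not Orange's own implementation; the function names and the default pattern are assumptions.

```python
import re
import unicodedata


def remove_accents(text):
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, pattern=r"\w+"):
    """Regex tokenization: every match of the pattern becomes one token."""
    return re.findall(pattern, text)


tokens = tokenize(remove_accents("Café résumé, naïve!").lower())
print(tokens)  # → ['cafe', 'resume', 'naive']
```

Changing the pattern changes what counts as a token; for example, `r"[a-z]+"` on lowercased text would also discard digits.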

Word Cloud (Post-Cleaning)

  • Comparison: Clean vs. raw word cloud
  • Removing Specific Words: Add them to the stop word list and refresh the word cloud

Concordance

  • Definition: A listing of each occurrence of a chosen word together with its surrounding context
  • Usage: Helpful for literature and thematic reviews
  • Customization: Number of surrounding words (3 to 10)
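
A concordance can be approximated with a short function that returns the words on either side of each match. The sketch below is an illustration under simple assumptions (lowercased text, `\w+` tokenization); the `window` parameter plays the role of the number of surrounding words.

```python
import re


def concordance(text, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"\w+", text.lower())
    keyword = keyword.lower()
    results = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window): i]    # up to `window` words before
            right = tokens[i + 1: i + 1 + window]   # up to `window` words after
            results.append((left, token, right))
    return results


sample = "Text mining helps researchers. Good mining tools save time."
for left, word, right in concordance(sample, "mining", window=2):
    print(" ".join(left), "[" + word + "]", " ".join(right))
# → text [mining] helps researchers
# → researchers good [mining] tools save
```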

Clustering

  • Distance Calculation: Distance metrics such as cosine distance
  • Hierarchical Clustering: Grouping similar words/documents
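
For intuition, cosine distance between two documents can be computed directly from their word counts. The sketch below uses made-up documents and only the standard library; it also finds the closest pair, which is the pair that agglomerative hierarchical clustering would merge first. Returning 1.0 for an empty document is a convention chosen here, not a standard.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_distance(bag_a, bag_b):
    """1 - cosine similarity of two word-count bags."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[w] * bag_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical cleaned documents.
docs = {
    "a": Counter(["text", "mining", "orange"]),
    "b": Counter(["text", "mining", "software"]),
    "c": Counter(["interview", "transcript"]),
}

# The pair with the smallest distance is merged first in agglomerative clustering.
closest = min(
    combinations(docs, 2),
    key=lambda pair: cosine_distance(docs[pair[0]], docs[pair[1]]),
)
print(closest)  # → ('a', 'b')
```

Documents "a" and "b" share two of their three words, so their distance is about 0.33, while "c" shares no words with either and sits at distance 1.0.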

Sentiment Analysis

  • Purpose: Determine sentiment (positive, negative, neutral) in documents
  • Output: Percentage of positive, negative, and neutral words
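
The percentage output can be illustrated with a simple lexicon-based count. The word lists below are tiny, hypothetical examples rather than a real sentiment lexicon, and the sketch does not reproduce the specific methods Orange offers.

```python
# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of words.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "poor", "slow"}


def sentiment_percentages(tokens):
    """Percentage of tokens that are positive, negative, or neutral."""
    total = len(tokens)
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    positive = sum(1 for t in tokens if t in POSITIVE)
    negative = sum(1 for t in tokens if t in NEGATIVE)
    neutral = total - positive - negative  # everything not in either lexicon
    return {
        "positive": 100.0 * positive / total,
        "negative": 100.0 * negative / total,
        "neutral": 100.0 * neutral / total,
    }


tokens = ["the", "tool", "is", "good", "but", "slow", "and", "the", "docs", "are"]
print(sentiment_percentages(tokens))
# → {'positive': 10.0, 'negative': 10.0, 'neutral': 80.0}
```

Word-level counting like this ignores negation ("not good") and context, which is one reason lexicon-based scores should be read as rough indicators.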

Saving and Using Results

  • Data Saving: Save analysis results for further use
  • Visualization: Graphs and sentiment index

Additional Features

  • Extract Keywords: Identify important keywords
  • Bag of Words Analysis: Frequency of words

Conclusion

  • Next Steps: More advanced text mining techniques in future sessions (e.g., similarity hashing, topic modeling)

Practice

  • Homework: Practice using Orange and prepare queries for the next session.

End of Lecture