Text Mining Techniques with Orange Software

Aug 11, 2024

Lecture: Text Mining using Orange Software

Introduction

Topic: Text Mining
Objective: Learn the basics of text mining and practice them with the Orange software, followed by NVivo.

Types of Text Data

Formats: PDF, Word, Text files, Excel files
Examples: Research papers, interview transcripts, news articles

Text Analytics

Importance: Critical for qualitative research
Key Concept: Natural Language Processing (NLP)
Frequency Analysis: Counting how often each word occurs; frequently repeated words often point to important themes
Context Analysis: Words appearing before/after a keyword
Sentiment Analysis: Categorizing words as positive, negative, or neutral

Cleaning Text Data

Why: To ensure meaningful analysis
Methods:
- Remove numbers
- Remove special characters (e.g., $, @)
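- Remove punctuation
- Convert to a single case (lowercase or uppercase)
- Remove extra whitespace
- Handle stop words (e.g., 'the', 'and')
- Stemming: reduce word variants to a common stem (e.g., consulting and consulted reduce to consult)
- Synonym handling: map synonyms to a single term (e.g., talk, speak)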
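
The cleaning steps above can be sketched in plain Python. This is only an illustration, not what Orange runs internally: the stop-word list is a tiny hypothetical one, `naive_stem` is a crude stand-in for a real stemmer, and the regular expression assumes English text (it would also strip accented letters).

```python
import re

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
STOP_WORDS = {"the", "and", "a", "of", "to", "is"}

def clean_text(text):
    """Lowercase, remove numbers/punctuation/special characters, drop stop words."""
    text = text.lower()
    # Replace everything that is not a-z or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Very rough suffix stripping; an assumption for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

raw = "The team consulted 3 experts & is consulting more!"
tokens = [naive_stem(t) for t in clean_text(raw)]
print(tokens)
```

Here "consulted" and "consulting" both end up as "consult", which is the point of stemming: variants of one word are counted together.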

Bag of Words

Definition: Representation of each document as the set of words remaining after cleaning, with their counts; word order is ignored
Usage: Forms basis for applying text mining algorithms
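
As a minimal sketch of the idea, assuming the documents have already been cleaned and tokenized, each document can be turned into a vector of word counts over a shared vocabulary. The two sample documents and the function name `bag_of_words` are made up for illustration.

```python
from collections import Counter

# Two already-cleaned, tokenized documents (hypothetical sample data).
docs = [
    ["text", "mining", "text"],
    ["mining", "software"],
]

# Shared vocabulary: every distinct word, in a fixed (sorted) order.
vocabulary = sorted({tok for doc in docs for tok in doc})

def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Each row of `vectors` is one document; algorithms such as clustering then work on these numeric rows rather than on the raw text.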

Software Demonstration: Orange

Steps to Import Data

Import Documents: Select the folder containing the text files
Corpus Viewer: View imported documents
Create Word Cloud: Visual representation of word frequency

Pre-Processing

Options: Lowercase, remove accents, parse HTML, tokenization (e.g., regex)
Filtering: Remove stop words and numbers; optionally filter tokens against a lexicon (word list)

Word Cloud (Post-Cleaning)

Comparison: Clean vs. raw word cloud
Removing Specific Words: Update stop word list and refresh

Concordance

Definition: Contextual analysis of a specific word
Usage: Helpful for literature and thematic reviews
Customization: Number of context words shown on each side of the keyword (e.g., 3 to 10)
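
A concordance can be sketched as a keyword-in-context search over a token list. This is a simplified illustration, not Orange's implementation; the function name `concordance`, the `window` parameter, and the sample sentence are assumptions.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every match."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window): i]      # up to `window` words before
            right = tokens[i + 1: i + 1 + window]     # up to `window` words after
            rows.append((" ".join(left), tok, " ".join(right)))
    return rows

sentence = "orange makes text mining easy because text widgets connect visually"
rows = concordance(sentence.split(), "text", window=2)
for left, kw, right in rows:
    print(f"{left:>20} [{kw}] {right}")
```

Raising `window` shows more surrounding words per match, which mirrors the customization setting described above.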

Clustering

Distance Calculation: Distance metrics such as cosine distance, computed between document vectors
Hierarchical Clustering: Grouping similar words/documents
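
To make the distance step concrete, here is a small sketch of cosine distance between word-count vectors, followed by the first step a hierarchical clustering would take: finding the closest pair. The document names and vectors are invented sample data, and returning 1.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared words."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical bag-of-words vectors for three documents.
vectors = {
    "doc_a": [2, 1, 0],
    "doc_b": [4, 2, 0],
    "doc_c": [0, 0, 3],
}

# The closest pair is what agglomerative clustering would merge first.
closest = min(
    combinations(sorted(vectors), 2),
    key=lambda pair: cosine_distance(vectors[pair[0]], vectors[pair[1]]),
)
print(closest)
```

Cosine distance ignores document length: `doc_b` is just `doc_a` doubled, so the two are treated as nearly identical.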

Sentiment Analysis

Purpose: Determine sentiment (positive, negative, neutral) in documents
Output: Percentage of positive, negative, neutral words
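
The percentage output can be illustrated with a simple lexicon-based count. This is a sketch under stated assumptions: the two word lists are tiny hypothetical lexicons, and Orange's own sentiment methods may score text differently.

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "poor", "confusing"}

def sentiment_percentages(tokens):
    """Return the share of positive, negative, and neutral words in percent."""
    total = len(tokens)
    if total == 0:
        return {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    neu = total - pos - neg  # everything not in either lexicon is neutral
    return {
        "positive": 100.0 * pos / total,
        "negative": 100.0 * neg / total,
        "neutral": 100.0 * neu / total,
    }

result = sentiment_percentages(["great", "tool", "but", "confusing"])
print(result)
```

Word-level counting like this ignores negation ("not good") and context, so the percentages are best read as a rough indicator.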

Saving and Using Results

Data Saving: Save analysis results for further use
Visualization: Graphs and sentiment index

Additional Features

Extract Keywords: Identify important keywords
Bag of Words Analysis: Frequency of words

Conclusion

Next Steps: More advanced text mining techniques in future sessions (e.g., similarity hashing, topic modeling)

Practice

Homework: Practice using Orange and prepare queries for the next session.

End of Lecture
