Get Started with Text Analytics Using Python
Why Python for Text Analytics?
Text analytics is the process of examining unstructured text data to extract meaningful patterns. Of course, this can be done manually, by reading through pieces of text to identify patterns and insights.
But this is an arduous process, so it is largely automated these days. Python, a general-purpose programming language widely used for software development and web applications, is equally at home in a data science context analyzing text.
So why Python for text analytics? Python was designed as a general-purpose language, unlike R, which focuses mainly on statistical analysis.
As a result, Python is considerably more widely used than R, so the skills you build analyzing text in Python will likely transfer to other use cases.
Python is also often described as the programming language closest to plain English, making it more intuitive and easier to learn than many alternatives.
Setting Up Python for Text Analytics
Another reason Python has emerged as a popular option for text analytics is its rich ecosystem of libraries and tools. A robust development environment requires installing and configuring a few essential components.
Essential Python libraries for text analytics include:
- NLTK (Natural Language Toolkit)
- spaCy
- scikit-learn
- pandas
- numpy
- gensim
To set up your environment, execute this command in your terminal:
pip install nltk spacy scikit-learn pandas numpy gensim
After installation, download the necessary language models and resources. The NLTK data is fetched from within Python:
import nltk
nltk.download('popular')
The spaCy English model is downloaded from the terminal:
python -m spacy download en_core_web_sm
Basic Python concepts crucial for text analytics work include the following (a short sketch combining them appears after this list):
- String Operations: Working with text strings and regular expressions
- List Comprehension: Efficient data transformation
- Dictionary Manipulation: Managing key-value pairs
- File Operations: Reading and writing text files
- Functions: Creating reusable code blocks
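A minimal sketch tying these concepts together (the file name is hypothetical):
# Read a text file, count word frequencies, and return the counts
def word_counts(path):
    with open(path, encoding='utf-8') as f:          # file operations
        text = f.read().lower()                      # string operations
    words = [w.strip('.,!?') for w in text.split()]  # list comprehension
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1             # dictionary manipulation
    return counts

counts = word_counts('reviews.txt')  # 'reviews.txt' is a hypothetical file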
Want to analyze text without all the code?
Start a free trial of Displayr.
Text Preprocessing Techniques
Raw text data requires significant preprocessing before it can be effectively analyzed. This critical step ensures consistency and removes noise that might interfere with analysis.
The text preprocessing pipeline typically includes several key steps:
- Text Cleansing:
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters (note the \s, which keeps whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    return text
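A quick sanity check of the function:
print(clean_text("Hello, World!!  "))  # hello world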
- Tokenization breaks text into meaningful units:
from nltk.tokenize import word_tokenize
text = "Natural language processing is fascinating!"
tokens = word_tokenize(text)
print(tokens)
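# ['Natural', 'language', 'processing', 'is', 'fascinating', '!']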
- Stop words removal eliminates common words that add little analytical value:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
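Note that the membership test is case-sensitive (the NLTK stop word list is all lowercase), which is one reason text is usually lowercased first:
print(filtered_tokens)
# ['Natural', 'language', 'processing', 'fascinating', '!']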
- Stemming reduces words to their root form:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]
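Stemmed output is compact but not always a real word:
print(stemmed_words)
# e.g. ['natur', 'languag', 'process', 'fascin', '!']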
- Lemmatization provides a more sophisticated approach to word normalization:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]
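Unlike stemming, lemmatization returns actual dictionary words, and it accepts a part-of-speech hint for better results:
lemmatizer.lemmatize('studies')           # 'study'
lemmatizer.lemmatize('running', pos='v')  # 'run'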
Feature Engineering and Text Representation
Converting preprocessed text into numerical features enables machine learning algorithms to analyze the data effectively. Several approaches exist for this transformation, each with its own advantages and use cases.
The Bag of Words (BoW) model represents text as a numerical vector:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
text_corpus = ["This is the first document.",
"This document is the second document.",
"And this is the third document."]
X = vectorizer.fit_transform(text_corpus)
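To see what the model learned, inspect the vocabulary and the resulting document-term matrix (get_feature_names_out is the method name in recent scikit-learn versions):
print(vectorizer.get_feature_names_out())
print(X.toarray())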
TF-IDF (Term Frequency-Inverse Document Frequency) weighs terms based on their importance:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(text_corpus)
Word embeddings provide dense vector representations that capture semantic relationships:
import gensim.downloader as api
# Load pre-trained Word2Vec embeddings
word2vec_model = api.load('word2vec-google-news-300')
# Get vector for a specific word
vector = word2vec_model['python']
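Note that the first call to api.load downloads the pre-trained model (roughly 1.6 GB), so it can take a while. Once loaded, the model supports similarity queries directly:
# Find the words closest to 'python' in the embedding space
print(word2vec_model.most_similar('python', topn=3))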
These numerical representations enable various analytical tasks:
- Text Classification: Categorizing documents into predefined classes
- Sentiment Analysis: Determining emotional tone
- Topic Modeling: Discovering underlying themes
- Document Clustering: Grouping similar documents
- Information Retrieval: Finding relevant documents
Text Preprocessing and Feature Extraction
Before diving into advanced text analytics techniques, it's important to clean and preprocess the text data. To recap the pipeline above, this involves several steps:
- Tokenization - Splitting text into individual words or tokens. This helps isolate the key components for analysis.
- Removing stop words - Eliminating common words like "a", "and", "the" that don't add predictive value.
- Stemming and lemmatization - Reducing words to their root form. For example, "running" becomes "run". This helps consolidate different grammatical forms of a word.
- Part-of-speech tagging - Labeling each token with its part of speech (noun, verb, adjective, etc.) to retain context. (A short example follows this list.)
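As a quick illustration with NLTK (the tagger is included in the 'popular' download; exact tags can vary by version):
from nltk import pos_tag
from nltk.tokenize import word_tokenize

print(pos_tag(word_tokenize("Python makes text analytics easy")))
# e.g. [('Python', 'NNP'), ('makes', 'VBZ'), ('text', 'NN'), ('analytics', 'NNS'), ('easy', 'JJ')]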
Once preprocessed, we can extract key features from the text that will be used during modeling:
- Term frequency - Counting how often each term appears in a document. Frequently occurring words may be more predictive.
- TF-IDF (Term Frequency-Inverse Document Frequency) - Weighing words by uniqueness to identify important keywords. Words that are common across documents get lower weights.
- N-grams - Combinations of multiple words together, like bigrams (2 words) and trigrams (3 words). Captures useful phrases. (See the sketch after this list.)
- Word embeddings - Representing words as dense vectors based on their semantic meaning. Models like Word2Vec and GloVe are commonly used to generate embeddings.
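N-grams can be generated with the same CountVectorizer used earlier by setting its ngram_range parameter:
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X = bigram_vectorizer.fit_transform(["natural language processing is fun"])
print(bigram_vectorizer.get_feature_names_out())
# includes bigrams such as 'natural language' and 'language processing'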
Thoughtful preprocessing and feature extraction are critical first steps in any text analytics project. This sets up the data for the more advanced modeling techniques.
Text Classification and Clustering
Once text data has been prepared, a variety of techniques can be applied to extract insights:
Text Classification
Text classification involves categorizing documents into pre-defined classes or topics. Common algorithms include:
- Naive Bayes - A probabilistic model that uses Bayes' theorem to predict class membership. Performs well despite its simplicity.
- Support Vector Machines (SVM) - Identifies optimal boundaries between classes in high-dimensional space. Effective for complex datasets.
- Random Forests - Ensemble method that aggregates predictions from many decision tree models. Handles non-linear relationships well.
Performance is evaluated using metrics like accuracy, precision, recall and F1 score. Multi-class problems can be broken into multiple binary classifiers.
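As a minimal sketch of this workflow (toy corpus and hypothetical labels), a Naive Bayes classifier over TF-IDF features looks like:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus with hypothetical labels
docs = ["great product, works perfectly",
        "terrible service and very slow",
        "friendly support, fast shipping",
        "awful quality, broke in a week"]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["slow shipping but great quality"]))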
Text Clustering
Clustering refers to grouping documents by similarity, without predefined classes. This allows exploring the natural structure in text data. Popular techniques include:
- K-means - Iteratively assigns data points to one of k clusters based on distance from centroids. Scales well to large datasets.
- Hierarchical clustering - Builds a hierarchy of clusters in a top-down (divisive) or bottom-up (agglomerative) manner.
- DBSCAN - Density-based clustering that groups closely packed points, marking outliers as noise. Handles arbitrary cluster shapes.
Cluster analysis is often combined with visualization methods like t-SNE to project documents into 2D space for exploration.
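A minimal K-means sketch over TF-IDF features (the cluster count is chosen arbitrarily here):
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr and meow", "dogs bark loudly",
        "kittens love to play", "puppies chew on toys"]
X = TfidfVectorizer().fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each document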
Advanced NLP Techniques
In addition to classification and clustering, more advanced NLP techniques enable deeper text understanding:
- Sentiment Analysis: Uses classifiers to categorize text by emotional valence - positive, negative, or neutral. Key for understanding customer satisfaction, survey responses, and more. (A quick example follows this list.)
- Named Entity Recognition (NER): Identifies and extracts entities like people, organizations, and locations within unstructured text. Critical for gathering structured facts.
- Text Summarization: Automatically generates condensed versions of documents while retaining key information. Helps cope with information overload.
- Topic Modeling: Discovers abstract topics in a corpus using algorithms like Latent Dirichlet Allocation (LDA). Provides insight into hidden semantic structures.
- Deep Learning: Techniques like RNNs, CNNs, and Transformers enable more nuanced text understanding. Pre-trained models like BERT and GPT are commonly fine-tuned for text analysis tasks.
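For sentiment analysis, NLTK ships a rule-based scorer, VADER, that needs no training data (a one-time lexicon download is required):
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This tutorial is genuinely helpful!"))
# returns 'neg', 'neu', 'pos', and an overall 'compound' score
Named entity recognition takes only a few lines with the spaCy model installed earlier:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. 'Apple' ORG, '$1 billion' MONEY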
These advanced methods enable text analytics to move beyond just classification and clustering to deeper semantic understanding.
Exploratory Data Analysis (EDA) for Text Data
Effective exploratory analysis is key for understanding text data before modeling:
- Word clouds provide a visual overview of the most frequent terms. Larger fonts indicate greater importance. Color can encode categories.
- Word frequency distributions show the counts of top terms. Spikes reveal keywords and trends, and comparisons uncover differences between document groups. (A short example follows this list.)
- Collocation networks highlight which words commonly co-occur together. This reveals semantic relationships in the corpus.
- Temporal analysis tracks how word usage changes over time. This can detect emerging trends and topics.
- Sentiment analysis gives aggregate emotion scores to documents. Positive vs negative sentiment over time is insightful.
- Topic modeling reveals latent themes even in unlabeled corpora. Visualizations like pyLDAvis effectively communicate topic patterns.
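As a quick example of a frequency distribution, Python's standard library is enough (toy tokens shown; in practice you would use the preprocessed tokens from earlier):
from collections import Counter

tokens = "the cat sat on the mat near the door".split()
print(Counter(tokens).most_common(3))
# [('the', 3), ('cat', 1), ('sat', 1)]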
Thoughtful EDA guides feature engineering, model selection, and result interpretation. Visual methods help communicate insights from unstructured text.
Displayr
One of the major downsides to performing text analytics with Python - or any other programming language - is that it can be a slow process. With Displayr, your text categorization, sentiment analysis, or entity extraction is only ever a few clicks away.
And if you do prefer to work with code, you can write R code from within the app and run additional languages, like Python, from within the R code.