A Beginner’s Guide to Unstructured Text Data and Sentiment Analysis

Have you ever wondered why big data companies are making so much money? It’s often because they are very good at text analysis. Imagine being able to automatically identify the most frequently used words in a massive unstructured dataset, or detect the sentiment of millions of social media posts in real time. That’s the magic of text analysis, and it’s why data analysts doing this kind of work are in high demand.
Text analysis can seem overwhelming, especially if you’re new to working with unstructured data. In this article, though, I’ll provide a basic introduction to text analysis that will help you get started, even if you are a complete beginner. I’ll be using Python for these examples.

Step 1: Pre-Processing
The first step in text analysis is pre-processing. This is where we take our raw text data and clean it up so that we can start analyzing it. This can involve removing stop words (such as “the”, “and”, and “of”), converting all of the text to lowercase, and removing punctuation. You might also stem or lemmatize words (reduce words to their roots).
You can do the pre-processing step in Python using a package called NLTK (Natural Language Toolkit). This is a very powerful text analysis package, and its official documentation is a good place to read more about it.
Here’s an example of how to do some basic pre-processing in Python (we won’t get into stemming or lemmatizing in this basic sample):
import string
import nltk
from nltk.corpus import stopwords
# download the stop word list (only needed the first time)
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
text = "This is an example of how to pre-process text data."
# convert to lowercase
text = text.lower()
# remove punctuation first, so that "data." becomes "data" instead of being dropped
text = text.translate(str.maketrans("", "", string.punctuation))
# split into words and remove stop words
text = [word for word in text.split() if word not in stop_words]
print(text)
Step 2: Vectorizing
Once your text data is pre-processed, the next step is to vectorize it. This is where you convert the text data into a numerical format so it’s ready for analysis. One of the most common ways to do this involves using the Term Frequency-Inverse Document Frequency (TF-IDF) method that’s available in scikit-learn. The TF-IDF matrix is a common input for machine learning algorithms, and it provides a way to quantify the importance of each word in the text data with respect to the rest of the data. Words that are common across all documents will have a low TF-IDF score and are less likely to be useful for analysis; words that are rare across the documents are more likely to be useful and will have a higher TF-IDF score.
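To make the idea behind the score more concrete, here is a minimal sketch that computes TF-IDF by hand in plain Python. It uses the simple textbook formula (term frequency times the log of total documents over documents containing the word), which is an assumption on my part for illustration; scikit-learn uses a smoothed and normalized variant, so its numbers will differ. The three sample sentences and the helper names are made up for this example.

```python
import math

# made-up sample documents, already lowercase and free of punctuation
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]

def term_frequency(term, tokens):
    # share of the words in one document that are this term
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, all_tokens):
    # log of (number of documents / number of documents containing the term)
    containing = sum(1 for tokens in all_tokens if term in tokens)
    return math.log(len(all_tokens) / containing)

def tf_idf(term, tokens, all_tokens):
    return term_frequency(term, tokens) * inverse_document_frequency(term, all_tokens)

# "the" appears in every document, "mat" in only one
common_score = tf_idf("the", tokenized[0], tokenized)
rare_score = tf_idf("mat", tokenized[0], tokenized)
print(common_score, rare_score)
```

In this sketch, the word that shows up in every document scores zero, while the word that shows up in only one document gets a higher score, which is the behavior described above.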
Here’s an example of how to do vectorization in Python. (Note that this made-up text has not been pre-processed; I’m just showing you how to vectorize a couple of sample sentences here).
from sklearn.feature_extraction.text import TfidfVectorizer
# define a list of text data
text = ["this is an example of vectorizing text data", "this is another example of text data that needs to be vectorized"]
# initialize an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()
# fit and transform the text data using the TfidfVectorizer
tfidf = vectorizer.fit_transform(text)
# print the shape of the resulting Tf-idf matrix
print(tfidf.shape)
# print the names of the features extracted by the TfidfVectorizer
print(vectorizer.get_feature_names_out())
This code outputs a tuple where the first value is the number of samples or documents in the text data and the second value is the number of unique words in the text data. It also outputs an array of the unique words (AKA features). Note that get_feature_names_out() replaced the older get_feature_names() method, which was removed in scikit-learn 1.2.
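If you want to see where those two numbers come from without running scikit-learn, here is a rough sketch that rebuilds the vocabulary by hand. It assumes the vectorizer’s default behavior of lowercasing the text and keeping tokens of two or more word characters; the regular expression below is my approximation of that rule, not a copy of the library’s internals.

```python
import re

text = [
    "this is an example of vectorizing text data",
    "this is another example of text data that needs to be vectorized",
]

# assumed tokenization rule: lowercase, then keep runs of 2+ word characters
token_pattern = re.compile(r"\b\w\w+\b")

# collect every unique token across all documents, sorted alphabetically
vocabulary = sorted(
    {token for doc in text for token in token_pattern.findall(doc.lower())}
)

# rows = documents, columns = unique words
shape = (len(text), len(vocabulary))
print(shape)
print(vocabulary)
```

Two documents and fourteen unique words give a shape of (2, 14): each row of the matrix is a document and each column is one word from the vocabulary.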
Step 3: Text Analysis
Finally, it’s time to perform the actual text analysis. This is where you’ll use algorithms and techniques to extract meaningful insights from your text data. Some common techniques include sentiment analysis, topic modeling, and word frequency analysis. Here, we’ll only focus on sentiment analysis. Sentiment analysis helps you determine whether the text you’re analyzing is overall positive or negative. You can see how, for example, this would be very valuable if you wanted to assess the overall sentiment of tweets about your company or reviews on your product.
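Before moving on to sentiment, here is a quick sketch of the word frequency analysis mentioned above, since it is the simplest of these techniques to try. It uses Python’s built-in Counter on a made-up string of words; in practice you would run it on the pre-processed word list from Step 1 so that stop words do not crowd out the interesting terms.

```python
from collections import Counter

# made-up sample text, already lowercase and free of punctuation
text = "great food great service great price good food"

# count how many times each word appears
word_counts = Counter(text.split())

# the two most frequent words, with their counts
top_words = word_counts.most_common(2)
print(top_words)
```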
NLTK has an already-trained sentiment analyzer built right in called VADER (Valence Aware Dictionary and Sentiment Reasoner). The VADER analyzer works best with social media style writing like tweets, but you can also find other pre-trained models for other types of writing — or you can build your own custom model. For our example, we’ll use VADER, and again note that the sample text has not been properly pre-processed for analysis here (we’re just using sample sentences to keep it simple):
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# download the VADER lexicon (only needed the first time)
nltk.download('vader_lexicon')
text = "This is an example of sentiment analysis. It's a great way to extract meaningful insights from text data."
sentiment_analyzer = SentimentIntensityAnalyzer()
sentiment = sentiment_analyzer.polarity_scores(text)
print(sentiment)
The output is: {'neg': 0.0, 'neu': 0.701, 'pos': 0.299, 'compound': 0.7506}. This is a dictionary of scores where the negative, neutral, and positive scores all add up to 1. The compound score is a more complex metric that takes into account the intensity of the sentiment, where -1 is strongly negative and 1 is strongly positive.
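If you would rather have a simple label than a dictionary of numbers, one option is to turn the compound score into “positive,” “negative,” or “neutral.” The sketch below is a hypothetical helper of my own, not part of NLTK, and the 0.05 cutoff is an assumption based on a commonly used convention for VADER; you may want a different threshold for your own data. It reuses the scores from the output above so it runs without downloading anything.

```python
def label_sentiment(scores, threshold=0.05):
    """Turn a VADER-style score dictionary into a simple label.

    The threshold is an assumed convention, not a fixed rule.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# the scores reported for the sample sentence above
sentiment = {"neg": 0.0, "neu": 0.701, "pos": 0.299, "compound": 0.7506}

label = label_sentiment(sentiment)
print(label)

# the negative, neutral, and positive shares should add up to (about) 1
total = sentiment["neg"] + sentiment["neu"] + sentiment["pos"]
print(round(total, 3))
```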
If you want to play around with sentiment analysis on “real texts,” NLTK has some texts and novels from Project Gutenberg included with it. If you want to see everything that’s available, just run this code:
nltk.download('gutenberg')
print(nltk.corpus.gutenberg.fileids())
So there you have it: to do the most basic sentiment analysis, all you really need to do is pre-process, vectorize, and analyze using a pre-existing package in Python. This simple guide definitely overlooks a lot of nuance that can make your sentiment analysis better, but it’s a great place to start if you have not worked with unstructured data before! Dive into some text analysis today and see what you might discover.
To learn more about other cool text processing things you can do using NLTK, check out the NLTK documentation.
Agree with these thoughts? Let me know! Suggestions or corrections to improve this article? Please share them with me!
I’m a former English professor and current higher ed administrator, and I write about what I’ve learned on my self-taught journey to develop my data analysis skills. Through my articles, I hope to help you build your data literacy skills, learn some tips and tricks for Python, SQL, Tableau, Excel, and other common technologies analysts encounter, and think about the ways that data can help both you and your organization grow. Along the way, I’ll share strategies for developing the right mindset and approach for teaching yourself new skills — some drawn from my past teaching experience, and others gathered the hard way from stumbling here and there myself.
