Summary

The provided content is a detailed tutorial on using the CountVectorizer method from the scikit-learn library to convert text data into numerical format for Natural Language Processing (NLP) tasks.

Abstract

The article "CountVectorizer to Extract Features from Texts in Python, in Detail" offers an in-depth guide to the CountVectorizer method, a fundamental tool in NLP for transforming text into a numerical representation. It explains how this method counts the frequency of each word in the text, which is essential for computational processing. The tutorial demonstrates the use of CountVectorizer with Python code examples, showing how to vectorize text data and convert the resulting array into a DataFrame for clarity. It also discusses various parameters such as lowercase, stop_words, max_df, min_df, and max_features that can be adjusted to refine the feature set, improve computational efficiency, and enhance model performance. The article emphasizes the importance of these parameters in real-world analytics, where dealing with a large number of words can be computationally expensive. By carefully selecting which words to include or exclude, analysts can create more efficient and effective NLP models. The author concludes by encouraging experimentation with these parameters and acknowledges the existence of more sophisticated vectorization methods, while affirming the continued relevance of CountVectorizer for many NLP applications.

Opinions

The author believes that converting text data to numeric form is a critical first step in NLP projects.
The use of CountVectorizer's default settings, such as converting all words to lowercase, is seen as beneficial for simplicity but can be adjusted if needed.
The exclusion of stop words is considered an important step in data processing to improve the efficiency and accuracy of analytics or machine learning models.
The author suggests that the max_df and min_df parameters are useful for eliminating words that are either too common or too rare across documents, which can help in focusing on the most relevant features.
The max_features parameter is recommended for limiting the number of features to the top most frequent words, thereby reducing computational complexity without significantly compromising model performance.
The article promotes the idea that despite the availability of more advanced text vectorization methods, CountVectorizer remains a valuable tool for many NLP tasks.

CountVectorizer to Extract Features from Texts in Python, in Detail

Everything you need to know to use CountVectorizer efficiently in Sklearn

The most basic processing step in any Natural Language Processing (NLP) project is converting the text data to numeric data. As long as the data is in text form, we cannot perform any computation on it.

There are multiple methods available for this text-to-numeric data conversion. This tutorial will explain one of the most basic vectorizers, the CountVectorizer method in the scikit-learn library.

This method is very simple. It uses the number of times each word occurs in the text as that word's numeric value. An example will make it clear.

In the following code block:

We will import CountVectorizer.
Create a CountVectorizer instance.
Fit it to the text data and convert the result to an array.

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer 

#This is the text to be vectorized
text = ["Hello Everyone! This is Lilly. My aunt's name is also Lilly. I love my aunt.\
        I am trying to learn how to use count vectorizer."]

cv= CountVectorizer() 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_arr

Output:

array([[1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1]],
      dtype=int64)

Here I have the numeric values representing the text data above.

How do we know which values represent which words in the text?

To make that clear, it will be helpful to convert the array into a DataFrame where column names will be the words themselves.

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Now it is clear. The value for the word ‘also’ is 1, which means ‘also’ appeared only once in the text. The word ‘aunt’ appeared twice in the text, so its value is 2.
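
To see where these numbers come from, the counting can be reproduced by hand. The sketch below is only an approximation of the default behavior, not the library's implementation: it lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default token pattern. The variable names are made up for illustration.

```python
import re
from collections import Counter

text = ("Hello Everyone! This is Lilly. My aunt's name is also Lilly. "
        "I love my aunt. I am trying to learn how to use count vectorizer.")

# Approximate the defaults by hand: lowercase everything,
# then keep only tokens with at least two word characters.
tokens = re.findall(r"\b\w\w+\b", text.lower())
manual_counts = Counter(tokens)

# Print the words alphabetically, the same order as the DataFrame columns
for word in sorted(manual_counts):
    print(word, manual_counts[word])
```

Single-character tokens such as ‘I’ and the ‘s’ left over from "aunt's" are dropped by this pattern, which is why they have no column in the DataFrame.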

In the last example, all the sentences were in one string. So, we got only one row of data for four sentences. Let’s rearrange the text and see what happens:

text = ["Hello Everyone! This is Lilly", 
        "My aunt's name is also Lilly",
        "I love my aunt",
        "I am trying to learn how to use count vectorizer"]
cv= CountVectorizer() 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_arr

Output:

array([[0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 1, 1, 1]],
      dtype=int64)

This time we have a two-dimensional array with one row for each string in the text. Putting this array in a DataFrame:

cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Look carefully at this DataFrame. All the words are there as column names. Each row represents a string in the text and the value in the column shows how many times the word appeared in the string. If the word doesn’t appear, the value is zero.
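
If you only need the count of one word in one string, you can look it up without building a DataFrame. This is a minimal sketch, assuming scikit-learn is installed; it uses the fitted vectorizer's vocabulary_ attribute, which maps each word to its column index. The names col and to_count are just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Hello Everyone! This is Lilly",
        "My aunt's name is also Lilly",
        "I love my aunt",
        "I am trying to learn how to use count vectorizer"]

cv = CountVectorizer()
count_matrix = cv.fit_transform(text)

# vocabulary_ maps each learned word to its column index in the count matrix
col = cv.vocabulary_["to"]

# Row index 3 is the fourth string, where 'to' appears twice
to_count = count_matrix.toarray()[3, col]
print(col, to_count)
```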

CountVectorizer has several parameters that are worth checking.

lowercase

As you may have noticed, CountVectorizer converts all words to lowercase by default. If you do not want that, set lowercase = False.

cv= CountVectorizer(lowercase=False) 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Now the words are kept exactly as they appear in the text. The word ‘my’ shows up twice in the DataFrame, once as ‘My’ and once as ‘my’.

stop_words

Stop words are words that we consider unnecessary for the analysis. In our text, I may decide that ‘also’, ‘is’, and ‘to’ are not necessary. Excluding such words is an important part of data processing for most analytics or machine learning models. Here we have only 4 strings, but in real-world analytics we deal with thousands of strings. Thousands of strings may contain thousands of distinct words, and each word becomes a feature. If we can exclude words that appear very frequently or add little to the model, we save a lot of computational effort.

CountVectorizer comes with a built-in stop word list for English only (stop_words='english'); for other languages you have to supply your own list. Here is an example with the built-in list.

cv= CountVectorizer(stop_words='english') 
count_matrix = cv.fit_transform(text)
cnt_arr = count_matrix.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns = cv.get_feature_names_out())
cnt_df

Look! A lot of the words are gone!

If the built-in list removes too few or too many words for your purpose, provide your own list of stop words. For example, if I only want ‘also’, ‘is’, ‘am’, and ‘to’ to be excluded, I pass them like this:

cv= CountVectorizer(stop_words=['also', 'is', 'am', 'to'])

max_df

This is another way of eliminating words. Setting max_df = 0.5 means that a word appearing in more than 50% of the documents (strings) is eliminated. An integer can be used as well: max_df = 20 means that a word appearing in more than 20 documents is eliminated.

To demonstrate this, I created a new text:

text = ["lilly is a good girl", 
        "lilly is a good student",
        "lilly is very good in math", 
        "lilly loves coffee", 
        "She is from Brazil"]
cnt_vect = CountVectorizer(max_df=0.75)
count_mtrx = cnt_vect.fit_transform(text)
cnt_arr = count_mtrx.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns=cnt_vect.get_feature_names_out())
cnt_df

The word ‘lilly’ appears in 4 of the 5 documents (80%), which is above the 75% threshold, so it is eliminated. The same goes for ‘is’.

min_df

This is the opposite of max_df. Words that appear in fewer documents than the specified proportion or count are eliminated by min_df. In this example, I am using the same text as in the last example and setting min_df = 2, so any word that appears in fewer than 2 documents is eliminated.

cnt_vect = CountVectorizer(min_df=2)
count_mtrx = cnt_vect.fit_transform(text)
cnt_arr = count_mtrx.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns=cnt_vect.get_feature_names_out())
cnt_df

Only three words are left, because with just 5 documents very few words appear in two or more of them. This can be useful in machine learning projects.

When we are trying to extract a trend, words that appear in only a couple of documents out of thousands are not very helpful.

max_features

This is another useful parameter. When we have thousands of words, processing is computationally expensive and time-consuming. A total of 10000 words becomes 10000 features. If you think the top 2000 words by term frequency are good enough, you can simply use max_features = 2000. Here we do not even have that many words, so I will use max_features = 5.

cnt_vect = CountVectorizer(max_features=5, stop_words='english')
count_mtrx = cnt_vect.fit_transform(text)
cnt_arr = count_mtrx.toarray()
cnt_df = pd.DataFrame(data = cnt_arr, columns=cnt_vect.get_feature_names_out())
cnt_df

Here we have the five most frequent words that remain after the English stop words are removed.

Conclusion

This article explained CountVectorizer and how you can make the best use of it for text processing. The parameters covered here can make your analytics or Natural Language Processing models more efficient if used correctly. They can be used alone or combined with one another, depending on your needs, so there is a lot of room for experimentation. There are more sophisticated methods to vectorize text data nowadays, but this simple method still works in many cases.

Feel free to follow me on Twitter and like my Facebook page.

If you want a video version of this tutorial, here is the link:

More Reading

A Complete Exploratory Data Analysis in Python

Data Cleaning, Analysis, Visualization, Feature Selection, Predictive Modeling

pub.towardsai.net

30 Very Useful Pandas Functions for Everyday Data Analysis Tasks

Pandas Cheatsheet

towardsdatascience.com

6 Tips for Dealing With Null Values

Includes Iterative Method, Mean and Median Fill with Groupby, Mean and Median Fill

towardsdatascience.com

A Detailed Tutorial on Polynomial Regression in Python, Overview, Implementation, and Overfitting

Complete code in Python

pub.towardsai.net

TensorFlow Model Training Using GradientTape

Use of GradientTape to Update the Weights

towardsdatascience.com

Anomaly Detection in TensorFlow and Keras Using the Autoencoder Method

A cutting-edge unsupervised method for noise removal, dimensionality reduction, anomaly detection, and more

towardsdatascience.com
