Summary

The webpage outlines methods for converting textual data into numerical values for NLP projects in data science.

Abstract

The article discusses the challenges of working with textual data in NLP projects and the necessity of converting such data into numerical formats to make it compatible with machine learning models. It details several strategies for this conversion, including the use of dictionaries to map categorical values to numbers, the application of manual encoders or scikit-learn encoders for features with many distinct values, and the implementation of one-hot encoding to avoid implying importance based on numerical values. The article emphasizes the importance of preprocessing categorical data to ensure that machine learning models do not misinterpret the significance of encoded values.

Opinions

The author suggests that many machine learning models require numerical input, implying that textual data must be transformed to be useful.
The use of a dictionary for mapping categorical to numerical values is presented as a straightforward method, especially suitable for ordinal data.
The author conveys that manual encoders are suitable for features with a large number of distinct values, providing a scalable solution.
Scikit-learn encoders are recommended for their ease of use and efficiency in transforming categorical data.
One-hot encoding is highlighted as a solution to prevent models from assigning undue importance to certain categories based on their encoded numerical values.
The article advises that one-hot encoding should be performed after initial encoding to maintain the categorical nature of the data without introducing a false sense of order or importance.

NLP: Text Data To Numbers

Explaining How We Can Convert Text To Numbers For Data Science Projects

Working on an NLP project can be a tedious task, particularly when the data is in textual format and the models require numerical values. This article explains how we can convert text to numerical values.

Handling Categorical Values

Let’s assume we want to forecast a variable, e.g. Number Of Tweets, and that it depends on the following two variables: Most Active Current News Type and Number Of Active Users.

In this instance, Most Active Current News Type is a categorical feature. It can contain textual data such as “Fashion”, “Economical” etc. Number Of Active Users, on the other hand, contains numerical values.

Scenario

Before we feed the data set into our model, we need to transform categorical values into numerical values because many models do not work with textual values.

Solution: Dictionary

There are a number of strategies to handle categorical features:

1. Create a dictionary to map categorical values to numerical values.

A dictionary is a data structure that stores key-value pairs, so that each key can be looked up to get its value.

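#named category_map to avoid shadowing Python's built-in map
category_map = {'Fashion': 1, 'Economical': 2}

#this will map categorical to numerical values
#values that are missing from the dictionary become NaN

target_feature = 'Most Active Current News Type'
data_frame[target_feature] = data_frame[target_feature].map(category_map)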

This strategy works well for ordinal values too. Ordinal values are textual values that have a natural order, such as Clothes Size (Small, Medium, Large etc.). For these, we can choose the numbers so that they follow the same order.
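As a minimal sketch of the ordinal case, the snippet below maps clothes sizes to integers that follow their natural order. The sample data and the size_order dictionary are assumptions made up for illustration; they are not part of the original example.

```python
import pandas as pd

# Hypothetical sample data; the column and its values are illustrative only.
data_frame = pd.DataFrame({"Clothes Size": ["Small", "Large", "Medium", "Small"]})

# The integers are chosen so that Small < Medium < Large is preserved.
size_order = {"Small": 0, "Medium": 1, "Large": 2}

data_frame["Clothes Size"] = data_frame["Clothes Size"].map(size_order)
encoded_sizes = data_frame["Clothes Size"].tolist()
print(encoded_sizes)
```

Any size that is not a key in the dictionary would come back as NaN, so it is worth checking for missing values after mapping.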

Solution: Encoders

2. Another strategy is to use encoders to assign a unique numerical value to each textual value. This strategy works better for variables with a large number of distinct values (>30), such as an Organisational Job Hierarchy, where writing the dictionary by hand becomes impractical.

We could use manual or scikit-learn encoders.

Manual Encoders

import numpy as np

target_feature = 'Most Active Current News Type'

#get unique values
unique = np.unique(data_frame[target_feature])
#build the dictionary from the unique values: each value is mapped to its index
value_map = {textual_value: index for index, textual_value in enumerate(unique)}

#apply map
#this will map categorical to numerical values
data_frame[target_feature] = data_frame[target_feature].map(value_map)
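To make the manual encoder easier to try out, here is a self-contained sketch on a small made-up data set. The sample rows (including the "Sports" value) and the names value_to_index and index_to_value are assumptions for illustration. It also builds the reverse dictionary, which is one simple way to decode the numbers back to text.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
data_frame = pd.DataFrame({
    "Most Active Current News Type": ["Fashion", "Economical", "Sports", "Fashion"],
    "Number Of Active Users": [120, 300, 80, 150],
})
target_feature = "Most Active Current News Type"

# np.unique returns the distinct values in sorted order.
unique_values = np.unique(data_frame[target_feature])
value_to_index = {value: index for index, value in enumerate(unique_values)}

# Encode: text -> integer.
data_frame[target_feature] = data_frame[target_feature].map(value_to_index)
encoded = data_frame[target_feature].tolist()

# Decode: integer -> text, using the reversed dictionary.
index_to_value = {index: value for value, index in value_to_index.items()}
decoded = data_frame[target_feature].map(index_to_value).tolist()
```

Because the indices follow the sorted order of the values, the numbers carry no meaning beyond being unique labels.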

Scikit-Learn Encoders

import pandas as pd
from sklearn.preprocessing import LabelEncoder

target_feature = 'Most Active Current News Type'

#use encoder and transform
encoder = LabelEncoder()
encoded_values = encoder.fit_transform(data_frame[target_feature].values)
data_frame[target_feature] = pd.Series(encoded_values, index=data_frame.index)

#to get the original text back, use the inverse_transform method
decoded = encoder.inverse_transform(data_frame[target_feature].values)
data_frame[target_feature] = pd.Series(decoded, index=data_frame.index)
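The round trip can be checked on a few made-up values. This is only a sketch: the news_types series (including "Sports") is an assumption for illustration. It shows that LabelEncoder stores the sorted distinct values in its classes_ attribute and uses their positions as the codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, for illustration only.
news_types = pd.Series(["Fashion", "Economical", "Fashion", "Sports"])

encoder = LabelEncoder()
codes = encoder.fit_transform(news_types)

# classes_ holds the distinct values in sorted order; a value's code is its position.
learned_classes = list(encoder.classes_)

# inverse_transform turns the codes back into the original text.
decoded = list(encoder.inverse_transform(codes))
```

The same fitted encoder should be reused on new data so that every value keeps the same code.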

A Gotcha

After the textual values are encoded to numerical values, some values will be greater than others. A model can interpret a higher value as having higher importance, even though the numbers are only labels.

This can lead to our models treating categories differently. For instance, the Fashion news type might get a value of 1 and the Economical news type might get a value of 10. This can make the machine learning model assume that the Economical news type has more importance than the Fashion news type.

Solution: We can solve this by using One-Hot Encoding.
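One way to see the effect is a small arithmetic sketch. It assumes a linear model that multiplies the encoded feature by a single learned weight; the weight value below is hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Integer codes matching the example values in the text.
codes = {"Fashion": 1, "Economical": 10}

# Hypothetical coefficient that a linear model might learn for this feature.
weight = 2.5

# The feature's contribution to the prediction is weight * code.
contribution = {name: weight * code for name, code in codes.items()}

# Economical contributes ten times as much as Fashion, purely because of its code.
ratio = contribution["Economical"] / contribution["Fashion"]
```

The ratio comes only from the arbitrary codes, not from anything in the data.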

One Hot Encoding

To prevent some categorical values from getting higher importance than others, we could use the one-hot encoding technique before we feed the encoded data into our machine learning model.

One-hot encoding essentially creates a new (dummy) feature for each distinct value in our target categorical feature. Once the dummy features are created, each one is populated with a boolean (0 or 1) to indicate whether the row has that value. As a consequence, we end up with a wide, sparse matrix of 0/1 values.

For example, if your feature has values “A”, “B” and “C” then three new features (columns) will be created: Feature A, Feature B and Feature C. If the first row’s feature value was A, then you will see 1 for Feature A and 0 for Feature B and Feature C, and so on.
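The A/B/C example can be written out by hand in a few lines. This is only a sketch to show the mechanics; the sample rows are made up, and in practice a library function would normally do this step.

```python
import pandas as pd

# Hypothetical sample data with the values A, B and C.
data_frame = pd.DataFrame({"Feature": ["A", "B", "C", "A"]})

# Create one 0/1 column per distinct value.
for value in sorted(data_frame["Feature"].unique()):
    data_frame[f"Feature {value}"] = (data_frame["Feature"] == value).astype(int)

# Drop the original text column so that only the dummy columns remain.
one_hot = data_frame.drop(columns="Feature")
print(one_hot)
```

Each row contains exactly one 1, so no category is numerically larger than another.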

Solution:

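We can use the Pandas get_dummies() method. By default, it one-hot encodes only the categorical columns (object, string or category dtype) and leaves numerical columns unchanged. If a column has already been encoded to integers, we need to name it explicitly through the columns parameter.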

data_frame = pd.get_dummies(data_frame)
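A short sketch on made-up data shows what get_dummies() produces. The sample rows and the "News Type Code" column are assumptions for illustration. The dtype=int argument asks for 0/1 integers, and the second call shows how a column that already holds integer codes can still be one-hot encoded by naming it in columns.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
data_frame = pd.DataFrame({
    "Most Active Current News Type": ["Fashion", "Economical", "Fashion"],
    "Number Of Active Users": [120, 300, 150],
})

# Text columns are expanded into dummy columns; the numerical column is kept as it is.
encoded = pd.get_dummies(data_frame, dtype=int)

# A column of integer codes is treated as numerical unless it is named explicitly.
codes_frame = pd.DataFrame({"News Type Code": [1, 10, 1]})
encoded_codes = pd.get_dummies(codes_frame, columns=["News Type Code"], dtype=int)
```

The new column names are built from the original column name and the category value, joined with an underscore.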

Additionally, we could use sklearn.preprocessing.OneHotEncoder.
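A minimal sketch of the scikit-learn route, on made-up data, might look like this. The sample rows and the unseen "Sports" value are assumptions for illustration. handle_unknown="ignore" is one possible setting: it encodes a category that was not seen during fitting as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data, for illustration only.
column = "Most Active Current News Type"
data_frame = pd.DataFrame({column: ["Fashion", "Economical", "Fashion"]})

# The encoder expects 2D input, hence the double brackets.
encoder = OneHotEncoder(handle_unknown="ignore")
matrix = encoder.fit_transform(data_frame[[column]])

# By default the result is a sparse matrix; convert it to a dense array to inspect it.
dense = matrix.toarray()
categories = list(encoder.categories_[0])

# A value that was not seen during fit is encoded as all zeros.
unseen = encoder.transform(pd.DataFrame({column: ["Sports"]})).toarray()
```

Unlike get_dummies(), the fitted encoder remembers its categories, so it can apply the same column layout to new data later.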

Tip: Always One Hot Encode After Encoding Textual Values To Prevent Ordering

Summary

This article explained how we can convert data in text into numerical values which our models can consume.

Hope it helps.