NLP: Text Data To Numbers
Explaining How We Can Convert Text To Numbers For Data Science Projects
Working on an NLP project can be a tedious task, particularly when the data is in textual format and the models require numerical values. This article explains how we can convert text to numerical values.
Handling Categorical Values
Let’s assume we want to forecast a variable, e.g. Number Of Tweets, that depends on the following two variables: Most Active Current News Type and Number Of Active Users.
In this instance, Most Active Current News Type is a categorical feature. It can contain textual data such as “Fashion”, “Economical” etc. Number Of Active Users, on the other hand, contains numerical values.
Scenario
Before we feed the data set into our model, we need to transform categorical values into numerical values because many models do not work with textual values.
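To make the scenario concrete, here is a minimal sketch of such a data set. The rows are made up for illustration; only the column names come from the example above. It lists the columns that are not numeric and so would need converting.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the article.
data_frame = pd.DataFrame({
    'Most Active Current News Type': ['Fashion', 'Economical', 'Fashion', 'Economical'],
    'Number Of Active Users': [1200, 3400, 900, 2100],
    'Number Of Tweets': [150, 420, 110, 260],
})

# List the columns that are not numeric and therefore need converting
# before they can be passed to most models.
textual_columns = [
    column for column in data_frame.columns
    if not pd.api.types.is_numeric_dtype(data_frame[column])
]
print(textual_columns)
```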
Solution: Dictionary
There are a number of strategies to handle categorical features:
1. Create a dictionary to map categorical values to numerical values
A dictionary is a data storage structure. It holds a collection of key-value pairs, so that each key can be mapped to a value.
# map categorical values to numerical values
# (named mapping so that it does not shadow Python's built-in map)
mapping = {'Fashion': 1, 'Economical': 2}
target_feature = 'Most Active Current News Type'
data_frame[target_feature] = data_frame[target_feature].map(mapping)
This strategy works well for ordinal values too. Ordinal values are textual values that have a natural order, such as Clothes Size (Small, Medium, Large etc.).
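The ordinal case can be sketched as follows. The data frame, the column names and the chosen numbers are assumptions made for illustration. The point is that the dictionary lets us pick numbers that respect the order of the sizes.

```python
import pandas as pd

# Hypothetical data; the column name and the sizes are illustrative only.
sizes = pd.DataFrame({'Clothes Size': ['Small', 'Large', 'Medium', 'Small']})

# The numbers are chosen by hand so that Small < Medium < Large.
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
sizes['Clothes Size Code'] = sizes['Clothes Size'].map(size_order)

# Values that are missing from the dictionary become NaN after map(),
# so it is worth counting them before training a model.
unmapped = sizes['Clothes Size Code'].isna().sum()
print(sizes)
print('unmapped values:', unmapped)
```

If a new size appears in the data later, it has to be added to the dictionary by hand, which is the main cost of this strategy.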
Solution: Encoders
2. Another strategy is to use encoders to assign a unique numerical value to each textual value. This strategy works better for variables with a large number of distinct values (>30), such as an Organisational Job Hierarchy.
We could use manual or scikit-learn encoders.
Manual Encoders
import numpy as np
target_feature = 'Most Active Current News Type'
# get unique values
unique = np.unique(data_frame[target_feature])
# build a map from each unique textual value to its index
mapping = {textual_value: index for index, textual_value in enumerate(unique)}
# apply map: this will map categorical to numerical values
data_frame[target_feature] = data_frame[target_feature].map(mapping)
Scikit-Learn Encoders
import pandas as pd
from sklearn.preprocessing import LabelEncoder
target_feature = 'Most Active Current News Type'
# use encoder and transform
encoder = LabelEncoder()
encoded_values = encoder.fit_transform(data_frame[target_feature].values)
data_frame[target_feature] = pd.Series(encoded_values, index=data_frame.index)
# to inverse, use the inverse_transform method
decoded = encoder.inverse_transform(data_frame[target_feature].values)
data_frame[target_feature] = pd.Series(decoded, index=data_frame.index)
A Gotcha
After the textual values are encoded to numerical values, some values will be greater than others. A model can read a higher value as carrying more importance, or as an ordering between categories, even though the numbers were assigned arbitrarily.
This can lead to our models treating categories differently. For instance, the Fashion news type might get a value of 1 and the Economical news type a value of 10. The machine learning model may then assume that the Economical news type has more importance than the Fashion news type.
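A rough sketch of the effect, under stated assumptions: the codes, the extra 'Sports' category and the weight below are all hypothetical, and the "model" is reduced to a single multiplication, as a linear model would do with one feature.

```python
# Hypothetical codes, as an encoder might assign them in alphabetical order.
# 'Sports' is an extra category added only for this illustration.
codes = {'Economical': 0, 'Fashion': 1, 'Sports': 2}

# A linear model multiplies the encoded value by a learned weight.
# The weight here is arbitrary and chosen only for illustration.
weight = 5.0
contribution = {name: weight * code for name, code in codes.items()}

# 'Sports' contributes twice as much as 'Fashion' and 'Economical'
# contributes nothing, purely because of how the codes were numbered.
print(contribution)
```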
Solution: We can solve this by using One-Hot Encoding
One Hot Encoding
To prevent some categorical values from getting higher importance than others, we could use the one-hot encoding technique before we feed the data into our machine learning model.
One-hot encoding essentially creates a dummy feature for each distinct value in our target categorical feature. Once the dummy features are created, each is populated with 0 or 1 to indicate whether the row has that value. As a consequence, we end up with a wide, sparse matrix of 0/1 values.
For instance, if your feature has values “A”, “B” and “C”, then three new features (columns) will be created: Feature A, Feature B and Feature C. If the first row’s feature value is A, then Feature A will be 1 for that row and Features B and C will be 0, and so on.
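The A, B, C example can be worked through by hand in plain Python. This is only a sketch to show the shape of the result; the input rows are made up.

```python
# Made-up feature values for four rows.
values = ['A', 'C', 'B', 'A']

# One dummy column per distinct value, in sorted order: A, B, C.
categories = sorted(set(values))

# Each row gets a 1 in the column matching its value and 0 elsewhere.
rows = [
    [1 if value == category else 0 for category in categories]
    for value in values
]

for value, row in zip(values, rows):
    print(value, row)
```

Every row contains exactly one 1, so no category ends up with a larger number than another.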
Solution:
We can use the pandas get_dummies() function, which converts only the categorical (textual) columns into dummy columns and leaves the numerical columns unchanged.
data_frame = pd.get_dummies(data_frame)
Additionally, we could use sklearn.preprocessing.OneHotEncoder.
Tip: One-Hot Encode Unordered Textual Values After Encoding Them, So The Model Does Not Infer An Ordering
Summary
This article explained how we can convert textual data into numerical values that our models can consume.
Hope it helps.






