The Hitchhiker’s Guide to Feature Extraction
Some Tricks and Code for Kaggle and Everyday work

Good Features are the backbone of any machine learning model.
And good feature creation often needs domain knowledge, creativity, and lots of time.
In this post, I am going to talk about:
- Various methods of feature creation- Both Automated and manual
- Different Ways to handle categorical features
- Longitude and Latitude features
- Some kaggle tricks
- And some other ideas to think about feature creation.
TLDR; this post is about useful feature engineering methods and tricks that I have learned and end up using often.
1. Automatic Feature Creation using featuretools:
Have you read about featuretools yet? If not, then you are going to be delighted.
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
How? Let us work with a toy example to show you the power of featuretools.
Let us say that we have three tables in our database: Customers, Sessions, and Transactions.




This is a reasonably good toy dataset to work on since it has time-based columns as well as categorical and numerical columns.
If we were to create features on this data, we would need to do a lot of merging and aggregations using Pandas.
Featuretools makes it so easy for us. Though there are a few things, we will need to learn before our life gets easier.
Featuretools works with entitysets.
You can understand an entityset as a bucket for dataframes as well as relationships between them.
So without further ado, let us create an empty entityset. I just gave the name as customers. You can use any name here. It is just an empty bucket right now.
# Create new entityset
es = ft.EntitySet(id = 'customers')Let us add our dataframes to it. The order of adding dataframes is not important. To add a dataframe to an existing entityset, we do the below operation.
# Create an entity from the customers dataframees = es.entity_from_dataframe(entity_id = 'customers', dataframe = customers_df, index = 'customer_id', time_index = 'join_date' ,variable_types = {"zip_code": ft.variable_types.ZIPCode})So here are a few things we did here to add our dataframe to the empty entityset bucket.
- Provided a
entity_id: This is just a name. Put it as customers. dataframename set as customers_dfindex: This argument takes as input the primary key in the tabletime_index: The time index is defined as the first time that any information from a row can be used. For customers, it is the joining date. For transactions, it will be the transaction time.variable_types: This is used to specify if a particular variable must be handled differently. In our Dataframe, we have thezip_codevariable, and we want to treat it differently, so we use this. These are the different variable types we could use:
[featuretools.variable_types.variable.Datetime,
featuretools.variable_types.variable.Numeric,
featuretools.variable_types.variable.Timedelta,
featuretools.variable_types.variable.Categorical,
featuretools.variable_types.variable.Text,
featuretools.variable_types.variable.Ordinal,
featuretools.variable_types.variable.Boolean,
featuretools.variable_types.variable.LatLong,
featuretools.variable_types.variable.ZIPCode,
featuretools.variable_types.variable.IPAddress,
featuretools.variable_types.variable.EmailAddress,
featuretools.variable_types.variable.URL,
featuretools.variable_types.variable.PhoneNumber,
featuretools.variable_types.variable.DateOfBirth,
featuretools.variable_types.variable.CountryCode,
featuretools.variable_types.variable.SubRegionCode,
featuretools.variable_types.variable.FilePath]This is how our entityset bucket looks right now. It has just got one dataframe in it. And no relationships

Let us add all our dataframes:
# adding the transactions_df
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
variable_types={"product_id": ft.variable_types.Categorical})# adding sessions_df
es = es.entity_from_dataframe(entity_id="sessions",
dataframe=sessions_df,
index="session_id", time_index = 'session_start')This is how our entityset buckets look now.

All three dataframes but no relationships. By relationships, I mean that my bucket doesn’t know that customer_id in customers_df and session_df are the same columns.
We can provide this information to our entityset as:
# adding the customer_id relationship
cust_relationship = ft.Relationship(es["customers"]["customer_id"],
es["sessions"]["customer_id"])# Add the relationship to the entity set
es = es.add_relationship(cust_relationship)# adding the session_id relationship
sess_relationship = ft.Relationship(es["sessions"]["session_id"],
es["transactions"]["session_id"])# Add the relationship to the entity set
es = es.add_relationship(sess_relationship)After this our entityset looks like:

We can see the datasets as well as the relationships. Most of our work here is done. We are ready to cook features.
Creating features is as simple as:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="customers",max_depth = 2)feature_matrix.head()
And we end up with 73 new features. You can see the feature names from feature_defs. Some of the features that we end up creating are:
[<Feature: NUM_UNIQUE(sessions.device)>,
<Feature: MODE(sessions.device)>,
<Feature: SUM(transactions.amount)>,
<Feature: STD(transactions.amount)>,
<Feature: MAX(transactions.amount)>,
<Feature: SKEW(transactions.amount)>,
<Feature: DAY(join_date)>,
<Feature: YEAR(join_date)>,
<Feature: MONTH(join_date)>,
<Feature: WEEKDAY(join_date)>,
<Feature: SUM(sessions.STD(transactions.amount))>,
<Feature: SUM(sessions.MAX(transactions.amount))>,
<Feature: SUM(sessions.SKEW(transactions.amount))>,
<Feature: SUM(sessions.MIN(transactions.amount))>,
<Feature: SUM(sessions.MEAN(transactions.amount))>,
<Feature: SUM(sessions.NUM_UNIQUE(transactions.product_id))>,
<Feature: STD(sessions.SUM(transactions.amount))>,
<Feature: STD(sessions.MAX(transactions.amount))>,
<Feature: STD(sessions.SKEW(transactions.amount))>,
<Feature: STD(sessions.MIN(transactions.amount))>,
<Feature: STD(sessions.MEAN(transactions.amount))>,
<Feature: STD(sessions.COUNT(transactions))>,
<Feature: STD(sessions.NUM_UNIQUE(transactions.product_id))>]You can get features like the Sum of std of amount(SUM(sessions.STD(transactions.amount))) or std of the sum of amount(STD(sessions.SUM(transactions.amount))) This is what max_depth parameter means in the function call. Here we specify it as 2 to get two level aggregations.
If we change max_depth to 3 we can get features like: MAX(sessions.NUM_UNIQUE(transactions.YEAR(transaction_time)))
Just think of how much time you would have to spend if you had to write code to get such features. Also, a caveat is that increasing the max_depth might take longer times.
2. Handling Categorical Features: Label/Binary/Hashing and Target/Mean Encoding
Creating automated features has its perks. But why would we data scientists be required if a simple library could do all our work?
This is the section where I will talk about handling categorical features.
One hot encoding
We can use One hot encoding to encode our categorical features. So if we have n levels in a category, we will get n-1 features.
In our sessions_df table, we have a column named device, which contains three levels — desktop, mobile, or tablet. We can get two columns from such a column using:
pd.get_dummies(sessions_df['device'],drop_first=True)
This is the most natural thing that comes to mind when talking about categorical features and works well in many cases.
OrdinalEncoding
Sometimes there is an order associated with categories. In such a case, I usually use a simple map/apply function in pandas to create a new ordinal column.
For example, if I had a dataframe containing temperature as three levels: high medium and low, I would encode that as:
map_dict = {'low':0,'medium':1,'high':2}
def map_values(x):
return map_dict[x]
df['Temperature_oe'] = df['Temperature'].apply(lambda x: map_values(x))
Using this I preserve the information that low
LabelEncoder
We could also have used LabelEncoder to encode our variable to numbers. What a label encoder essentially does is that it sees the first value in the column and converts it to 0, next value to 1 and so on. This approach works reasonably well with tree models, and I end up using it when I have a lot of levels in the categorical variable. We can use this as:
from sklearn.preprocessing import LabelEncoder
# create a labelencoder object
le = LabelEncoder()
# fit and transform on the data
sessions_df['device_le'] = le.fit_transform(sessions_df['device'])
sessions_df.head()
BinaryEncoder
BinaryEncoder is another method that one can use to encode categorical variables. It is an excellent method to use if you have many levels in a column. While we can encode a column with 1024 levels using 1023 columns using One Hot Encoding, using Binary encoding we can do it by just using ten columns.
Let us say we have a column in our FIFA 19 player data that contains all club names. This column has 652 unique values. One Hot encoding means creating 651 columns that would mean a lot of memory usage and a lot of sparse columns.
If we use Binary encoder, we will only need ten columns as 2⁹<652 <2¹⁰.
We can binaryEncode this variable easily by using BinaryEncoder object from category_encoders:
from category_encoders.binary import BinaryEncoder
# create a Binaryencoder object
be = BinaryEncoder(cols = ['Club'])
# fit and transform on the data
players = be.fit_transform(players)
HashingEncoder

One can think of Hashing Encoder as a black box function that converts a string to a number between 0 to some prespecified value.
It differs from binary encoding as in binary encoding two or more of the club parameters could have been 1 while in hashing only one value is 1.
We can use hashing as:
players = pd.read_csv("../input/fifa19/data.csv")from category_encoders.hashing import HashingEncoder
# create a HashingEncoder object
he = HashingEncoder(cols = ['Club'])
# fit and transform on the data
players = he.fit_transform(players)
There are bound to be collisions(two clubs having the same encoding. For example, Juventus and PSG have the same encoding) but sometimes this technique works well.
Target/Mean Encoding
This is a technique that I found works pretty well in Kaggle competitions. If both training/test comes from the same dataset from the same time period(cross-sectional), we can get crafty with features.
For example: In the Titanic knowledge challenge, the test data is randomly sampled from the train data. In this case, we can use the target variable averaged over different categorical variable as a feature.
In Titanic, we can create a target encoded feature over the PassengerClass variable.
We have to be careful when using Target encoding as it might induce overfitting in our models. Thus we use k-fold target encoding when we use it.














