Ibrahim Kovan


An Easy Way to Create Algorithm Chains: Pipelines with Grid Search, ColumnTransformer, and Feature Selection

Designing a pipeline to compile all of these processes, with a Python implementation

Table of Contents 
1. Introduction
2. Pipeline
3. Pipeline with Grid Search
4. Pipeline with ColumnTransformer, GridSearchCV
5. Pipeline with Feature Selection

1. Introduction

Previous articles covered preparing the dataset for the algorithm, designing the model, and tuning the algorithm's hyperparameters, which are at the developer's discretion, in order to generalize the model and reach the optimum accuracy. As we know, the developer has alternative options for data preprocessing, model design, and hyperparameter tuning, and is responsible for applying the most appropriate combinations to keep the project optimal in terms of both accuracy and generalization. This article shows how to implement all of these operations, and more, in one go with the Pipeline offered by sklearn. Each section is supported with a Python implementation.

2. Pipeline

In its most basic form, a pipeline applies the specified data preprocessing operations and the model to the dataset in a single flow:

IN[1]
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

iris = load_iris()
iris_data = iris.data
iris_target = iris.target
IN[2]
x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.2, random_state=2021)
# Scaling and classification combined in a single estimator
pip_iris = Pipeline([("scaler", RobustScaler()), ("lr", LogisticRegression())])
pip_iris.fit(x_train, y_train)
iris_score = pip_iris.score(x_test, y_test)
print(iris_score)
OUT[2]
0.9333333333333333

The iris dataset was split with train_test_split as usual. RobustScaler() was chosen as the scaling method, since the dataset is known to be entirely numeric, and LogisticRegression was chosen as the classifier. The pipeline exposes the same methods as an ordinary estimator, such as .fit and .score: the training set was fitted with .fit on the created pipeline, and the test score was computed with .score.
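As a side note, every fitted step can be reached by the name it was given in the pipeline, which is handy for inspecting intermediate results. A minimal sketch based on the pip_iris pipeline above:

# Access the fitted steps by the names defined in the Pipeline
fitted_scaler = pip_iris.named_steps["scaler"]
fitted_lr = pip_iris.named_steps["lr"]
# Coefficients learned by LogisticRegression (classes x features)
print(fitted_lr.coef_.shape)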

3. Pipeline with Grid Search

Grid Search evaluates hyperparameter combinations for the algorithm, or for any operation with defined hyperparameters, and informs the user about the accuracy rate and the best hyperparameter combination through various attributes (more information click here). Using GridSearchCV with the Pipeline is a very effective way of eliminating workload and confusion. Now let's test the Logistic Regression algorithm implemented above with various combinations of its hyperparameters:

IN[3]
from sklearn.model_selection import GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.2, random_state=2021)
pip_iris_gs = Pipeline([("scaler", RobustScaler()), ("lr", LogisticRegression(solver='saga'))])
# Hyperparameter names are prefixed with the step name and a double underscore
param_grids = {'lr__C': [0.001, 0.1, 2, 10],
               'lr__penalty': ['l1', 'l2']}
gs = GridSearchCV(pip_iris_gs, param_grids)
gs.fit(x_train, y_train)
test_score = gs.score(x_test, y_test)
print("test score:", test_score)
print("best parameters: ", gs.best_params_)
print("best score: ", gs.best_score_)
OUT[3]
test score: 0.9333333333333333
best parameters:  {'lr__C': 2, 'lr__penalty': 'l1'}
best score:  0.9583333333333334

In addition to the above, a dictionary named param_grids was created with the 'C' and 'penalty' values to try. Then the pipeline containing the scaler and the algorithm was passed to GridSearchCV as the estimator. The training set is trained with .fit and evaluated with .score, and further information about the resulting model is obtained through the various GridSearchCV attributes.
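Besides best_params_ and best_score_, the fitted GridSearchCV object exposes further standard attributes; a brief sketch using the gs object above:

# The pipeline refitted on the full training set with the winning combination
best_pipe = gs.best_estimator_
# Steps inside the winning pipeline can be inspected as usual
print(best_pipe.named_steps["lr"].C)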

4. Pipeline with ColumnTransformer, GridSearchCV

So far, only the iris dataset, which contains nothing but numerical data, has been used. To make the situation more complex, let's use the toy dataset, which contains both numerical and categorical data, and apply:

  • Normalize the ‘Income’ column with MinMaxScaler()
  • Encode Categorical Columns with OneHotEncoder()
  • Group the ‘Age’ column with binning.

First, let’s take a quick look at the dataset:

IN[4]
import numpy as np
import pandas as pd

toy = pd.read_csv('toy_dataset.csv')
# Drop the 'Number' column, which is just a row identifier
toy_final = toy.drop(['Number'], axis=1)
IN[5]
toy_final.isna().sum()
OUT[5]
City       0
Gender     0
Age        0
Income     0
Illness    0
dtype: int64
IN[6]
# Note: column types are inspected on the original frame, so 'Number' still appears
numeric_cols = toy.select_dtypes(include=np.number).columns
print("numeric_cols:", numeric_cols)
categorical_cols = toy.select_dtypes(exclude=np.number).columns
print("categorical_cols:", categorical_cols)
print("shape:", toy_final.shape)
OUT[6]
numeric_cols: Index(['Number', 'Age', 'Income'], dtype='object')
categorical_cols: Index(['City', 'Gender', 'Illness'], dtype='object')
shape: (150000, 5)

Now let’s perform the operations mentioned above:

IN[7]
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, OneHotEncoder

# Bin 'Age' into 5 equal-width groups and one-hot encode the result
bins = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
ct = ColumnTransformer([
    ('normalization', MinMaxScaler(), ['Income']),
    ('binning', bins, ['Age']),
    ('categorical-to-numeric', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['City','Gender'])
], remainder='drop')
x_train, x_test, y_train, y_test = train_test_split(toy_final.drop('Illness', axis=1), toy_final.Illness,
                                                    test_size=0.2, random_state=0)
# Two sub-grids: 'saga' supports every penalty listed, 'lbfgs' only supports 'l2'
param_grid_lr = [{'lr__solver':['saga'],'lr__C':[0.1,1,10],'lr__penalty':['elasticnet','l1','l2']},
                 {'lr__solver':['lbfgs'],'lr__C':[0.1,1,10],'lr__penalty':['l2']}]
IN[8]
pipe_lr = Pipeline([
    ('columntransform', ct),
    ('lr', LogisticRegression()),
    ])
gs_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5)
gs_lr.fit(x_train, y_train)
test_score_lr = gs_lr.score(x_test, y_test)
print("test score:", test_score_lr)
print("best parameters: ", gs_lr.best_params_)
print("best score: ", gs_lr.best_score_)
OUT[8]
test score: 0.9198666666666667
best parameters:  {'lr__C': 0.1, 'lr__penalty': 'l1', 'lr__solver': 'saga'}
best score:  0.9188750000000001

KBinsDiscretizer() from the sklearn library is set to 5 bins, and the resulting groups are encoded with one-hot encoding. The preprocessing steps to be applied were then gathered in one place with ColumnTransformer(). These operations are: normalization for the 'Income' column, discretization for the 'Age' column, and OneHotEncoder() encoding for the categorical columns.

The dataset was then split into training and test sets. A list of dictionaries (param_grid_lr) was created with the selected hyperparameters to evaluate the parameter combinations. The preprocessing steps collected with ColumnTransformer (more information click here) and the algorithm, LogisticRegression, were placed in the pipeline. As in the examples above, the model is completed by selecting a cross-validation value of 5 in GridSearchCV.

Each entry of param_grid_lr is written as step name + double underscore + hyperparameter name. LogisticRegression() is registered in the pipeline as lr, and 'C' is a hyperparameter of LogisticRegression, so lr__C is used. To see all available hyperparameter names, get_params().keys() can be applied, as shown below.
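As a quick illustration of this naming scheme, the full list of tunable parameter names can be printed from the pipe_lr object defined above:

# Every key returned here can be used in a param_grid,
# e.g. 'lr__C', 'lr__penalty', 'columntransform__binning__n_bins'
print(pipe_lr.get_params().keys())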

Now let's try the same ColumnTransformer-based model with DecisionTreeClassifier():

IN[9]
from sklearn.tree import DecisionTreeClassifier

pipe_dt = Pipeline([
    ('columntransform', ct),
    ('dt', DecisionTreeClassifier()),
])
param_grid_dt = {'dt__max_depth': [2,3,4,5,6,7,8]}
gs_dt = GridSearchCV(pipe_dt, param_grid_dt, cv=5)
gs_dt.fit(x_train, y_train)
test_score_dt = gs_dt.score(x_test, y_test)
print("test score:", test_score_dt)
print("best parameters: ", gs_dt.best_params_)
print("best score: ", gs_dt.best_score_)
OUT[9]
test score: 0.9198333333333333
best parameters:  {'dt__max_depth': 2}
best score:  0.9188750000000001

Each of the selected max_depth values was fitted in turn, and the most successful one was determined by the grid search.
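To see how each candidate depth scored, rather than only the winner, the cross-validation results can be listed; a small sketch using the standard cv_results_ attribute of the fitted gs_dt object:

# Mean cross-validation score for each tested max_depth
for params, score in zip(gs_dt.cv_results_["params"], gs_dt.cv_results_["mean_test_score"]):
    print(params, "-> mean CV score:", round(score, 4))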

5. Pipeline with Feature Selection

As mentioned in the introduction, using the pipeline with GridSearchCV is a very effective way to evaluate hyperparameter combinations and compile them easily. It is very useful not only for data preprocessing and algorithms, but also for data cleaning (SimpleImputer) and feature processing (SelectKBest, SelectPercentile, more information click here). Now let's apply the following to the breast_cancer dataset, which contains 30 features:

— Standardization of the numerical values with StandardScaler()
— PolynomialFeatures() applied to the numerical values
— ANOVA feature selection with SelectPercentile()
— LogisticRegression hyperparameters ('C' and 'penalty') tuned
— Cross-validation set to 3

IN[10]
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

cancer = load_breast_cancer()
cancer_data = cancer.data
cancer_target = cancer.target
IN[11]
anova = SelectPercentile()
poly = PolynomialFeatures()
lr = LogisticRegression(solver='saga')
param_grid_cancer = dict(poly__degree=[2,3,4],
                         anova__percentile=[20, 30, 40, 50],
                         lr__C=[0.01,0.1,1,10],
                         lr__penalty=['l1','l2']
                         )
pipe_cancer = Pipeline([
    ('standardization', StandardScaler()),
    ('poly', poly),
    ('anova', anova),
    ('lr', lr)
    ])
gs_final = GridSearchCV(pipe_cancer, param_grid_cancer, cv=3, n_jobs=-1)
x_train, x_test, y_train, y_test = train_test_split(cancer_data, cancer_target, test_size=0.2, random_state=2021)
gs_final.fit(x_train, y_train)
test_score_final = gs_final.score(x_test, y_test)
print("test score:", test_score_final)
print("best parameters: ", gs_final.best_params_)
print("best score: ", gs_final.best_score_)
OUT[11]
test score: 0.9736842105263158
best parameters:  {'anova__percentile': 20, 'lr__C': 0.1, 'lr__penalty': 'l1', 'poly__degree': 2}
best score:  0.9626612059951203

The hyperparameter combinations to be tested are defined in param_grid_cancer:

— degree=[2,3,4] for PolynomialFeatures()
— percentile=[20, 30, 40, 50] for SelectPercentile()
— C=[0.01,0.1,1,10] for LogisticRegression()
— penalty=['l1','l2'] for LogisticRegression()

All of these steps were placed in a pipeline together with StandardScaler(). Then the cross-validation value was set to 3 in GridSearchCV, the dataset was split with train_test_split, and the model was fitted with .fit as always. Accuracy is highest when 'percentile' in SelectPercentile is set to 20, 'C' in LogisticRegression is 0.1, 'penalty' in LogisticRegression is 'l1', and 'degree' in PolynomialFeatures is 2.

The pipeline makes it possible to evaluate everything required when creating a model, collectively and from a single source. make_pipeline can be used as well as Pipeline: make_pipeline creates the necessary names for the steps automatically, so just adding the steps is sufficient.
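For instance, the very first pipeline in this article could be rebuilt with make_pipeline; a minimal sketch (the step names, such as 'robustscaler' and 'logisticregression', are generated automatically from the lowercased class names):

from sklearn.pipeline import make_pipeline

# Equivalent to Pipeline([('robustscaler', RobustScaler()), ('logisticregression', LogisticRegression())])
pip_auto = make_pipeline(RobustScaler(), LogisticRegression())
print(pip_auto.steps)  # the generated (name, estimator) pairs

In a grid search, those generated names are used for the double-underscore parameters, e.g. logisticregression__C.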

To go back to the guideline, click here.
