Scikit-learn Pipeline Tutorial with Parameter Tuning and Cross-Validation

Applying the same preprocessing steps to the separate datasets used for training and validation is a recurring problem in machine learning projects. The scikit-learn Pipeline feature helps to address it.

What is a machine learning workflow? It covers all the preprocessing steps, such as one-hot encoding, label encoding, and missing value imputation; then any feature selection steps, such as SelectKBest or Recursive Feature Elimination (RFE); and finally the model development and validation steps. Put together, these make up a machine learning workflow.

What are the challenges faced in this workflow? One major challenge is applying the same transformations to the training, test, and validation datasets. In clean code, the usual way to do this is with user-defined functions, but not every data scientist is comfortable writing effective functions.

The scikit-learn Pipeline feature is an out-of-the-box solution to this problem, and it enables clean code without any user-defined functions.
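As a minimal sketch of the idea (on synthetic data rather than the demo dataset; `StandardScaler` and `LogisticRegression` are stand-ins chosen only for illustration), a pipeline learns its preprocessing statistics from the training rows and then reuses them on unseen rows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the article's dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# The scaler's statistics come from the training rows only
learned_mean = pipe.named_steps["scale"].mean_
print(np.allclose(learned_mean, X_tr.mean(axis=0)))

# The same fitted scaler is applied automatically when predicting on new rows
test_predictions = pipe.predict(X_te)
print(test_predictions.shape)
```

No helper function has to be written to replay the scaling on the test split; the fitted pipeline carries it along.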

Let me demonstrate how Pipeline works with an example dataset. I've taken the UCI Bank Marketing data set (bank-full.csv), which has a mix of categorical and numerical columns.

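import pandas as pd
from sklearn.model_selection import train_test_split

# Load the semicolon-separated file and split off the binary target
data = pd.read_csv('bank-full.csv', sep=';')
target = data.pop('y')
target = target.map({'yes': 1, 'no': 0})
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=1234)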

In my real project, I had 40+ categorical features, and some had more than 50 categories, so I had to use OrdinalEncoder for them; one-hot encoding such features can cause a memory error. In the demo dataset, there are 2 columns that I want to apply OrdinalEncoder to, month and educational qualification, and both make sense because they are ordered values.

categorical_mask = (data.dtypes=='object')
categorical_columns = data.columns[categorical_mask].tolist()
num_cols = data.select_dtypes(include=['int64','float64']).columns.tolist()

oe_cols = [c for c in categorical_columns if data[c].nunique()>5]
ohe_cols = [c for c in categorical_columns if data[c].nunique()<=5]
len(oe_cols), len(ohe_cols), len(num_cols)
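Since the ordinal columns are meant to carry an order, it may be worth checking which order the encoder will actually use. The sketch below is a toy illustration (the three-row frame and the lowercase month abbreviations are assumptions about how the values look): a category list built with `unique()` follows order of appearance, while an explicit list gives calendar order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame (assumption: months stored as lowercase three-letter strings)
toy = pd.DataFrame({"month": ["may", "jan", "dec"]})

# Categories in order of first appearance, as unique() returns them
by_appearance = [toy["month"].unique().tolist()]

# Categories in an explicit, meaningful order
calendar = [["jan", "feb", "mar", "apr", "may", "jun",
             "jul", "aug", "sep", "oct", "nov", "dec"]]

codes_appearance = OrdinalEncoder(categories=by_appearance).fit_transform(toy)
codes_calendar = OrdinalEncoder(categories=calendar).fit_transform(toy)

print(codes_appearance.ravel())
print(codes_calendar.ravel())
```

For tree-based models the exact codes often matter less, but for models that treat the codes as magnitudes an explicit order is usually the safer choice.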

After segregating the columns, we get a list of unique values for each categorical column from the initial dataset, so that the same list can be used across the training, test, and validation datasets. This gives an exhaustive list of all values. It is common for some values to be present in the training data but not in the validation data (or vice versa), and taking the unique values from the initial dataset avoids this problem. Some might object that this leaks data into modeling, but only the category lists are shared, and the transformations are still applied to each dataset separately.

We now define the transformers to be used: OneHotEncoder, OrdinalEncoder, and SimpleImputer. Scaling transformers such as MinMaxScaler or StandardScaler could also be added in this step.

ohe_unique_list = [data[c].unique().tolist() for c in ohe_cols]
oe_unique_list = [data[c].unique().tolist() for c in oe_cols]

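from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ohe = OneHotEncoder(categories=ohe_unique_list)
oe = OrdinalEncoder(categories=oe_unique_list)
imp = SimpleImputer(strategy='constant', fill_value=0)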
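To illustrate why the category lists are taken from the full dataset, here is a small sketch on a made-up `color` column (the frame and its values are hypothetical). An encoder that only sees the training rows fails on a category that first appears in the test rows, while an encoder given the full category list handles it:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "blue" appears only in the test rows
full = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
train, test = full.iloc[:2], full.iloc[2:]

# Encoder that learns its categories from the training rows alone
naive = OneHotEncoder().fit(train)
try:
    naive.transform(test)
    naive_failed = False
except ValueError:
    # Default handle_unknown='error' rejects the unseen category
    naive_failed = True

# Encoder that is told every category up front
categories = [sorted(full["color"].unique().tolist())]
enc = OneHotEncoder(categories=categories).fit(train)
encoded = enc.transform(test).toarray()

print(naive_failed)
print(encoded)
```

An alternative worth knowing is `OneHotEncoder(handle_unknown='ignore')`, which encodes unseen categories as all zeros instead of raising an error.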

We use scikit-learn's make_column_transformer function to build the preprocessing column transformer. We also set remainder='passthrough' so that any column without an assigned transformer passes through unchanged. Setting it to 'drop' instead drops any such columns.
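The effect of the `remainder` setting is easiest to see on a tiny frame. In this sketch (the three columns are made up for illustration), only `housing` has a transformer, so the two numeric columns are either kept or dropped:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: one categorical column, two numeric columns
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "age": [30, 41, 25],
    "balance": [100.0, 250.5, 0.0],
})

keep = make_column_transformer((OneHotEncoder(), ["housing"]), remainder="passthrough")
drop = make_column_transformer((OneHotEncoder(), ["housing"]), remainder="drop")

kept = keep.fit_transform(toy)      # 2 one-hot columns + age + balance
dropped = drop.fit_transform(toy)   # 2 one-hot columns only

print(kept.shape, dropped.shape)
```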

As I said earlier, I have so many predictor variables that I have to do feature selection, and for this I've tried 2 classes: SelectKBest and Recursive Feature Elimination (RFE). SelectKBest is simple and fast; it takes a score function such as f_classif or chi2 to find the best features. RFE is slower, because it refits the estimator repeatedly and removes features step by step to find the best ones.
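A rough side-by-side of the two selectors, on synthetic data with `LogisticRegression` standing in for the estimator inside RFE (both are assumptions for the sake of a quick example), might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# SelectKBest scores each feature once with a univariate test
kbest = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# RFE refits the estimator and drops the weakest feature each round (step=1)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1).fit(X, y)

kbest_mask = kbest.get_support()
rfe_mask = rfe.get_support()
print("SelectKBest keeps:", kbest_mask.nonzero()[0])
print("RFE keeps:        ", rfe_mask.nonzero()[0])
```

The two methods need not agree, since one looks at features individually and the other at their contribution inside a fitted model.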

On top of the feature selection step, I've used XGBoost as the estimator to predict the probabilities.

All of these are now defined as steps of a Pipeline, so fitting the pipeline runs preprocessing, feature selection, and model fitting on the dataset.

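from sklearn.compose import make_column_transformer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from xgboost import XGBClassifier

preprocess = make_column_transformer(
    (oe, oe_cols),
    (ohe, ohe_cols),
    (imp, num_cols),
    remainder='passthrough'
)
estimator = XGBClassifier(learning_rate=0.05, max_depth=3, n_estimators=2500, random_state=1234)
fs = SelectKBest(score_func=f_classif, k=5)
# Alternative selector; swap it in for `fs` in the steps below to use RFE
selector = RFE(estimator, n_features_to_select=5, step=1)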

from sklearn.pipeline import Pipeline

steps = [
    ('preprocess', preprocess),
    ('select', fs),
    ('clf', estimator)
]
pipeline = Pipeline(steps)
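To make concrete what a three-step pipeline like this does at prediction time, here is a toy sketch on synthetic data (`StandardScaler` and `LogisticRegression` are stand-ins, not the transformers and model used in this article). Predicting through the pipeline should match running the fitted steps by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, split by position into train and test rows
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr = X[:150], X[150:], y[:150]

toy_pipeline = Pipeline([
    ("preprocess", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(X_tr, y_tr)

# One call: every step is applied in order
via_pipeline = toy_pipeline.predict(X_te)

# By hand: run the fitted transformers, then the final estimator
X_manual = toy_pipeline[:-1].transform(X_te)
via_steps = toy_pipeline.named_steps["clf"].predict(X_manual)

same = np.array_equal(via_pipeline, via_steps)
print(same)
```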

Now it's as simple as any other machine learning algorithm: we first fit and then predict. The predict function applies all the preprocessing steps and then the trained model.

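from sklearn.metrics import roc_auc_score

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pred_df = pd.DataFrame({'y': y_test, 'y_pred': y_pred})
# y_pred holds hard class labels; pipeline.predict_proba(X_test)[:, 1] would give a score-based AUC
gini = 2 * roc_auc_score(y_test, y_pred) - 1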

For my use case, I need to evaluate the Gini index, and here I get 43.33.
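The Gini formula used above is 2 * AUC - 1. As a side note, the value depends on whether AUC is computed from hard class labels or from predicted scores. The tiny hand-made example below (four made-up observations, with a 0.5 threshold assumed for the labels) shows the two can differ:

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted scores for four observations
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.8]

# Hard labels obtained by thresholding the scores at 0.5 (assumed threshold)
labels = [int(s >= 0.5) for s in scores]

gini_from_scores = 2 * roc_auc_score(y_true, scores) - 1
gini_from_labels = 2 * roc_auc_score(y_true, labels) - 1

print(gini_from_scores, gini_from_labels)
```

Scores preserve the full ranking of observations, so a Gini computed from `predict_proba` output is usually the more informative of the two.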

Now let's run a randomized search with cross-validation to find the best AUC (Gini). For this, I'm passing some search parameters to the XGBClassifier.

import numpy as np

# Keys use the '<step name>__<parameter>' form to reach the 'clf' step
param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 250, 50)
}

from sklearn.model_selection import RandomizedSearchCV

rand_auc = RandomizedSearchCV(estimator=pipeline, param_distributions=param_grid, n_iter=5, scoring='roc_auc', cv=5, verbose=False)
rand_auc.fit(X_train, y_train)
rand_auc.best_score_

y_pred = rand_auc.predict(X_test)
pred_df = pd.DataFrame({'y': y_test,'y_pred': y_pred})
gini = 2*roc_auc_score(y_test, y_pred)-1

Now I have a Gini of 46.48, which is marginally better than the previous result; the model probably needs more fine-tuning, but that is not the focus of this tutorial.
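The `clf__` prefix in the search parameters follows scikit-learn's `<step name>__<parameter>` convention for reaching into a pipeline step. A compact sketch on synthetic data (with `LogisticRegression` and its `C` parameter as stand-ins for the XGBoost settings) shows the naming and how to read the result back:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Every tunable setting is exposed as "<step>__<param>"
print("clf__C" in pipe.get_params())

search = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": np.logspace(-2, 2, 9)},
    n_iter=4,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_C = search.best_params_["clf__C"]
gini_cv = 2 * search.best_score_ - 1  # cross-validated AUC converted to Gini
print(best_C, gini_cv)
```

Setting `random_state` on the search makes the sampled parameter combinations repeatable between runs.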

We can now test the same pipeline with a variety of classifiers in a for loop, compare the scores, and pick the best model.

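from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
]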

for classifier in classifiers:
    steps = [
        ('preprocess', preprocess),
        ('select', fs),
        ('clf', classifier)
    ]
    pipeline = Pipeline(steps)
    pipeline.fit(X_train, y_train)
    print(classifier)
    # For classifiers, score() reports mean accuracy on the test set
    print("model score: %.3f" % pipeline.score(X_test, y_test))
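The loop prints each score; to actually pick the best model, the scores can be collected and compared. The following is a sketch of that idea on synthetic data, with three quick classifiers and a `StandardScaler` standing in for the preprocessing steps (all of these are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(3),
    "tree": DecisionTreeClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in candidates.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    # score() returns mean accuracy on the held-out rows
    scores[name] = pipe.fit(X_tr, y_tr).score(X_te, y_te)

best_name = max(scores, key=scores.get)
print(scores)
print("best:", best_name)
```

Comparing on a single test split is quick but noisy; cross-validated scores would give a steadier ranking.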

As you can see, scikit-learn's Pipeline feature helps a lot in streamlining the machine learning workflow. It makes a data scientist's job easier, so they can focus their time on fine-tuning models rather than repeating data preprocessing steps.