EvalML Library for Machine Learning Automation Pipeline with Python
Build machine learning models by using automation pipelines
The EvalML library is an AutoML tool that builds machine learning models using pipelines. It handles feature engineering automatically, making data scientists' work easier, and performs hyper-parameter tuning as part of the same automated search.
To install the EvalML library, use the command shown below:
pip install evalml
If the above command fails, try installing with the --user flag:
pip install evalml --user
Importing the evalml library
import evalml
Next, we will load the breast cancer demo dataset that ships with the evalml library.
X, y = evalml.demos.load_breast_cancer()
After loading the built-in demo dataset, we split the data into train and test sets with the split_data method that EvalML provides.
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary')
X_train.head()
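Under the hood, split_data works much like a stratified train/test split. As a rough sketch of the same idea using scikit-learn (assuming the common 80/20 default; evalml is not required here, and the toy DataFrame is illustrative only):

```python
# A rough, sklearn-based equivalent of evalml.preprocessing.split_data
# for a binary problem. Assumes an 80/20 split and stratification on y,
# which is the typical behavior for classification problems.
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"feature": range(10)})   # toy features
y = pd.Series([0, 1] * 5)                  # balanced binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_train), len(X_test))  # 8 2
```

Stratifying on y keeps the class ratio the same in both splits, which matters for imbalanced classification problems.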
Let’s check the type of the data after splitting.
type(X_train)
#output:
pandas.core.frame.DataFrame
Now check the different problem types that EvalML supports.
evalml.problem_types.ProblemTypes.all_problem_types
#output:
[<ProblemTypes.BINARY: 'binary'>,
<ProblemTypes.MULTICLASS: 'multiclass'>,
<ProblemTypes.REGRESSION: 'regression'>,
<ProblemTypes.TIME_SERIES_REGRESSION: 'time series regression'>,
<ProblemTypes.TIME_SERIES_BINARY: 'time series binary'>,
<ProblemTypes.TIME_SERIES_MULTICLASS: 'time series multiclass'>,
<ProblemTypes.MULTISERIES_TIME_SERIES_REGRESSION: 'multiseries time series regression'>]
The list above shows the problem types we can choose from when configuring the search.
Because the library fits models automatically, we don't need to do any manual feature engineering; we only need to specify the problem type. AutoMLSearch is the class that runs the automated pipeline search based on the parameters we provide.
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary', verbose=True)
automl.search()
The search trains and evaluates several candidate pipelines so their scores can be compared. We can view the pipelines ranked from best to worst score.
automl.rankings
We can print a description of the best model, including its automatically generated pipeline structure.
automl.describe_pipeline(automl.rankings.iloc[0]["id"])
We can also score the best pipeline on the held-out test data against several objectives, such as AUC, F1, precision, and recall.
Evaluate on holdout data:
best_pipeline = automl.best_pipeline
best_pipeline.score(X_test, y_test, objectives=["auc", "f1", "Precision", "Recall"])
#output:
OrderedDict([('AUC', 0.9828042328042328),
('F1', 0.9069767441860465),
('Precision', 0.8863636363636364),
('Recall', 0.9285714285714286)])
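These objective names map to standard classification metrics. A minimal sketch of how the same numbers can be computed directly with scikit-learn, using illustrative labels and probabilities (not the actual output of the model above):

```python
# Computing the same four objectives with scikit-learn on toy data.
# y_true/y_pred/y_proba are made-up illustrative values, not model output.
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_score, recall_score,
)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                  # ground-truth labels
y_pred = [0, 0, 1, 1, 0, 0, 1, 1]                  # hard predictions
y_proba = [0.1, 0.2, 0.9, 0.8, 0.4, 0.3, 0.7, 0.6]  # predicted P(class=1)

print("AUC:", roc_auc_score(y_true, y_proba))       # 0.9375
print("F1:", f1_score(y_true, y_pred))              # 0.75
print("Precision:", precision_score(y_true, y_pred))  # 0.75
print("Recall:", recall_score(y_true, y_pred))        # 0.75
```

Note that AUC is computed from predicted probabilities, while F1, precision, and recall are computed from the thresholded class predictions.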
Conclusion:
This simple library is useful for quickly comparing the accuracy of different models for a given problem type.
I hope you like the article. Reach me on my LinkedIn and Twitter.
Recommended Articles
- Most Usable NumPy Methods with Python
- NumPy: Linear Algebra on Images
- Exception Handling Concepts in Python
- Pandas: Dealing with Categorical Data
- Hyper-parameters: RandomSearchCV and GridSearchCV in Machine Learning
- Fully Explained Linear Regression with Python
- Fully Explained Logistic Regression with Python
- Data Distribution using Numpy with Python
- 40 Most Insanely Usable Methods in Python
- 20 Most Usable Pandas Shortcut Methods in Python