Abstract
The article explores the application of OOP principles in organizing data science tasks, such as exploratory data analysis (EDA), feature engineering, and machine learning model training. Initially, a single class, MLworkflow, is introduced to encapsulate these tasks, but as complexity grows, the class becomes harder to read and maintain. The author then demonstrates how breaking down the workflow into specialized helper classes for EDA, feature engineering, and model training can improve code readability and maintainability. The article provides detailed code examples using Python and illustrates the process of refactoring a monolithic class into a more modular and extensible design. This approach facilitates running experiments with different models and data categories, as shown with experiments on a medical cost dataset. The article concludes by advocating for the use of helper classes in ML projects to enhance clarity and manageability, and it encourages readers to apply similar strategies in their own projects.
Opinions
The author believes that using a single class for all ML workflow tasks becomes impractical as project complexity increases.
Helper classes are presented as a solution to improve the readability and maintainability of code in complex ML projects.
The article suggests that a hierarchy of classes, each responsible for a specific part of the ML workflow, leads to better-organized code.
The author emphasizes the importance of being able to easily interpret and modify code, especially when collaborating with others.
The use of helper classes is recommended as a best practice for structuring ML projects, making it easier to perform tasks such as EDA, feature engineering, and model validation.
The author encourages readers to adopt the demonstrated techniques in their own ML work to achieve more efficient and understandable codebases.
Mastering Data Science Workflows with Helper Classes
Python Helper Classes for EDA, Feature Engineering and Machine Learning
In computer programming, classes are a useful way to organize data (attributes) and functions (methods). For example, you can define a class that defines attributes and methods related to a machine learning model. An instance of this type of class may have attributes such as training data file name, model type, and more. Methods associated with these attributes can be fit, predict and validate.
In addition to machine learning, classes have a wide range of applications across data science in general. You can use classes to organize a variety of EDA tasks, feature engineering operations, and machine learning model training. This is ideal because, if written well, classes make it easy to understand, modify and debug existing attributes and methods. This is particularly true if each class method completes a single well-defined task. It is generally good practice to define functions that do one thing, and classes make understanding and maintaining these methods more straightforward.
While classes can make code easier to maintain, a class can also become harder to understand as you add complexity. If you would like to organize attributes and methods for basic EDA, feature engineering and model training, a single class probably suffices. But as you add more attributes and methods for each type of task, instantiating these objects and calling their methods can become quite obscure, especially for collaborators reading your code. With this in mind, as complexity increases it is better to have separate helper classes for each type of task (EDA, feature engineering, machine learning) instead of a single class.
Here we will consider each of these types of tasks and see how to write a single class that enables us to perform them. For EDA, our class will allow us to read in data and generate histograms and scatter plots. For feature engineering, our class will have a method that takes the log transform of numerical columns. Finally, for machine learning, our class will have fit, predict and validate methods.
From there we will see how, as we add attributes and methods for each task type, class instantiation and method calls become harder to read and readability gets compromised. We will then see how we can separate parts of our class into helper classes that are easier to understand and manage.
For this work, I will be writing code in Deepnote, which is a collaborative data science notebook that makes running reproducible experiments very easy. We will be working with the Medical Cost dataset. We will use patient attributes such as age, body mass index, and number of children to predict medical costs. The data is publicly free to use, modify and share under the Database Contents License (DbCL: Public Domain).
Bookkeeping Model Type with OOP
To start, let’s navigate to Deepnote and create a new project (you can sign up for free if you don’t already have an account).
Let’s create a project called ‘helper_classes’ and a notebook within this project called ‘helper_classes_ds’. Also, let’s drag and drop the insurance.csv file onto the left-hand panel of the page, where it says ‘FILES’:
Screenshot taken by Author
We will proceed by defining a class that contains, at a high level, some of the basic steps within a machine learning workflow. Let’s start by importing all of the packages we will be working with:
Let’s define a class called ‘MLworkflow’ which contains an init method that initializes dictionaries which we will use to store model predictions and the performance. We will also define a class attribute that stores our medical cost data:
Next we’ll define a method called ‘eda’ that performs some simple visualizations. If you pass a value of ‘True’ for the variable histogram, it will generate a histogram for the numerical feature specified. If you pass a value of ‘True’ for the variable scatter_plot, it will generate a scatter plot of the numerical feature against the target:
    class MLworkflow(object):
        ...
        def eda(self, feature, target, histogram, scatter_plot):
            # Correlation between the chosen feature and the target
            self.corr = self.data[feature].corr(self.data[target])
            if histogram:
                self.data[feature].hist()
                plt.show()
            if scatter_plot:
                plt.scatter(self.data[feature], self.data[target])
                plt.show()
Next, we’ll define another method called ‘data_prep’ that defines our inputs and output. We will also define a parameter called transform which we can use to take the log-transform of numerical columns:
    class MLworkflow(object):
        ...
        def data_prep(self, features, target, transform):
            for feature in features:
                if transform:
                    # Replace the column with its natural log
                    self.data[feature] = np.log(self.data[feature])
            # Inputs and output used later by fit
            self.X = self.data[features]
            self.y = self.data[target]
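One consequence of combining a class-level data attribute with an in-place transform is worth spelling out. The sketch below is a dependency-free analogy, not the article's code: a plain dict of lists stands in for the DataFrame, the values are toy numbers rather than dataset rows, and the class names `SharedData` and `CopiedData` are hypothetical. It suggests that, if the data really is stored as a class attribute, a log transform applied through one instance is visible to every other instance, and that giving each instance its own copy avoids this.

```python
import math

# Toy values for illustration only; these are not rows from the Medical Cost dataset.
RAW = {"bmi": [20.0, 30.0], "charges": [100.0, 200.0]}


class SharedData:
    # Class attribute: a single dict object shared by every instance.
    data = {name: list(values) for name, values in RAW.items()}

    def data_prep(self, features, transform):
        for feature in features:
            if transform:
                # Item assignment mutates whichever dict self.data refers to.
                self.data[feature] = [math.log(v) for v in self.data[feature]]


class CopiedData(SharedData):
    def __init__(self):
        # Instance attribute: each object gets its own copy of the raw data.
        self.data = {name: list(values) for name, values in RAW.items()}


first, second = SharedData(), SharedData()
first.data_prep(["bmi"], transform=True)
print(second.data["bmi"])  # already log-transformed, although only `first` was prepared

third, fourth = CopiedData(), CopiedData()
third.data_prep(["bmi"], transform=True)
print(fourth.data["bmi"])  # still the raw values
```

In a notebook this matters mostly when a cell calling data_prep with transform set to True is re-run, since the same columns would then be logged a second time.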
We will also define a fit method. It will split the data for training and testing, where the test_size can be specified by the ‘split’ parameter. We will also provide the option to fit to a linear regression or random forest model. This can obviously be extended to any number of model types:
We will then define a predict method that generates predictions on our test set. We will store the results in our predictions dictionary, where the dictionary keys will be the model type:
Finally, we calculate performance for each model type. We will use mean absolute error as our performance metric and store the values in our performance dictionary using a method called validate:
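The fit, predict and validate steps described above can be sketched without any third-party packages. This is a minimal illustration of the bookkeeping pattern, not the article's implementation: a hand-written one-feature least-squares fit stands in for scikit-learn's linear regression, a mean baseline stands in for the random forest, the split simply holds out the tail of the data instead of sampling randomly, and the class name `TinyWorkflow`, the `'baseline'` key and the toy numbers are all assumptions.

```python
class TinyWorkflow:
    """Dependency-free stand-in showing dictionaries keyed by model name."""

    def __init__(self, x, y):
        self.x, self.y = list(x), list(y)
        self._models = {}
        self._predictions = {}
        self._performance = {}

    def fit(self, model_name, split):
        # Hold out the last `split` fraction of rows as the test set.
        n_test = max(1, round(len(self.x) * split))
        self.x_train, self.x_test = self.x[:-n_test], self.x[-n_test:]
        self.y_train, self.y_test = self.y[:-n_test], self.y[-n_test:]
        mean_y = sum(self.y_train) / len(self.y_train)
        if model_name == "lr":
            # Ordinary least squares for a single feature.
            mean_x = sum(self.x_train) / len(self.x_train)
            var = sum((v - mean_x) ** 2 for v in self.x_train)
            cov = sum((a - mean_x) * (b - mean_y)
                      for a, b in zip(self.x_train, self.y_train))
            slope = cov / var
            intercept = mean_y - slope * mean_x
            self._models[model_name] = lambda v: intercept + slope * v
        elif model_name == "baseline":
            # Always predicts the training mean of the target.
            self._models[model_name] = lambda v: mean_y
        else:
            raise ValueError(f"unknown model_name: {model_name!r}")

    def predict(self, model_name):
        model = self._models[model_name]
        self._predictions[model_name] = [model(v) for v in self.x_test]

    def validate(self, model_name):
        # Mean absolute error on the held-out rows.
        errors = [abs(truth - pred) for truth, pred
                  in zip(self.y_test, self._predictions[model_name])]
        self._performance[model_name] = sum(errors) / len(errors)


# Toy numbers for illustration only (y = 2 * x), not the medical cost data.
workflow = TinyWorkflow(x=[1, 2, 3, 4, 5], y=[2, 4, 6, 8, 10])
for name in ("lr", "baseline"):
    workflow.fit(name, split=0.2)
    workflow.predict(name)
    workflow.validate(name)
print(workflow._performance)
```

Because every method takes the model name and writes to a dictionary under that key, adding another model type only requires another branch in fit; predict and validate stay unchanged.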
We can define an instance of this class and generate some visualizations:
We can then define an instance and build linear regression and random forest models. We start by defining an instance of our class and calling the data_prep method with the inputs and output we wish to use:
model = MLworkflow()
features = ['bmi', 'age']
model.data_prep(features, 'charges', True)
We can then build a linear regression model by calling the fit method with a model_name parameter value ‘lr’ for linear regression and a test_size of 20%. We then call the predict and validate methods on our model instance:
As a result, our model object will have an attribute called _performance. We can access it through our model object and print the dictionary:
We see that we have a dictionary with keys ‘lr’ and ‘rf’ with mean absolute error values of 9232 and 9161 respectively.
Bookkeeping Model Type and Categorically Segmented Training Data with a Single Class
While the code used to define this class is simple enough, it can become difficult to read and interpret with increasing complexity. For example, what if, in addition to being able to monitor model types, we’d like to be able to build models on distinct categories within the data? What if we wish to train a linear regression model on only female patients, or a random forest model on only male patients? Let’s walk through how to write this modified class. Similar to before, we define an init method where we initialize the necessary dictionaries. We will add a new dictionary called models:
The eda and data prep methods remain mostly unchanged:
    class MLworkflowExtended(object):
        ...
        def eda(self, feature, target, histogram, scatter_plot):
            # Correlation between the chosen feature and the target
            self.corr = self.data[feature].corr(self.data[target])
            if histogram:
                self.data[feature].hist()
                plt.show()
            if scatter_plot:
                plt.scatter(self.data[feature], self.data[target])
                plt.show()

        def data_prep(self, features, target, transform):
            # Only the target name is stored here
            self.target = target
            for feature in features:
                if transform:
                    # Replace the column with its natural log
                    self.data[feature] = np.log(self.data[feature])
The fit method contains quite a few changes. It now takes the variables model_category and category_values, as well as default values for our random forest algorithm. It also checks whether the category values are in the initialized dictionaries. If they aren’t, they are initialized with an empty dictionary. The result is a dictionary of dictionaries where the outermost keys are the categorical values. The values that the categorical keys map to are dictionaries containing the algorithm types and their performance. The structure is as follows:
We can then run experiments that vary by model type and category. For example, let’s build some linear regression and random forest models on separate female and male data sets:
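The nested bookkeeping described above can be sketched in a few lines. This is an illustration under assumptions: the helper name `record` is hypothetical, and the numbers are placeholders rather than results from the article.

```python
performance = {}


def record(store, category_value, model_name, value):
    """Store `value` under store[category_value][model_name]."""
    # A category value seen for the first time gets an empty inner dictionary.
    if category_value not in store:
        store[category_value] = {}
    store[category_value][model_name] = value


# Placeholder numbers for illustration only, not measured errors.
record(performance, "female", "lr", 1.0)
record(performance, "female", "rf", 2.0)
record(performance, "male", "lr", 3.0)

print(performance)
# {'female': {'lr': 1.0, 'rf': 2.0}, 'male': {'lr': 3.0}}
```

The same check-then-insert step would apply to the models and predictions dictionaries, so each category value maps to its own dictionary keyed by model type.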
We can do the same for the region category. Let’s run experiments for southwest and northwest:
While this works just fine, the code for running certain experiments becomes difficult to read. For example, when fitting our random forest, it can be unclear to someone reading our code for the first time what all of the values passed to the fit method mean:
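To make the readability problem concrete, here is a small sketch of what such a call can look like. The signature is an assumption based on the description of the fit method (model name, category, category value, split, and random forest defaults); the parameter names `n_estimators` and `max_depth` are borrowed from scikit-learn's random forest and may not match the author's, and the stub only echoes its arguments instead of training anything.

```python
def fit(model_name, model_category, category_value, split,
        n_estimators=10, max_depth=10):
    """Stub with a hypothetical signature; it returns its arguments as a dict."""
    return {
        "model_name": model_name,
        "model_category": model_category,
        "category_value": category_value,
        "split": split,
        "n_estimators": n_estimators,
        "max_depth": max_depth,
    }


# Positional call: a first-time reader has to look up what 0.2, 100 and 10 mean.
opaque = fit("rf", "sex", "female", 0.2, 100, 10)

# The same call with named arguments is easier to follow.
explicit = fit(model_name="rf", model_category="sex", category_value="female",
               split=0.2, n_estimators=100, max_depth=10)

print(opaque == explicit)  # True
```

Named arguments help with a single call, but they do not stop the argument list from growing as the class takes on more responsibilities.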
This can get even more complicated as we increase the functionality of our class.
Bookkeeping Model Type and Categorically Segmented Training Data with Helper Classes
To avoid this increasing complexity, it is often helpful to resort to helper classes that are defined based on each part of the ML workflow.
We can start by defining an EDA helper class:
We can then use the eda class to access our data in a feature engineering class:
Next we will define our data prep class. In the init method of our data prep class we will initialize our dictionaries to store models, predictions and performance. We will also use the feature engineering class to apply log transforms to bmi and age. Finally, we will store the modified data and the target variable in data prep attributes:
Next we will define a data prep method within our data prep class. We will start by defining attributes for train/test split, model category, and category values. We will then check if the category values are present in our prediction, performance and model dictionaries. If they are not we will store an empty dictionary for the new category:
Finally, we define a model training class that allows us to access our prepared data, train our models, generate predictions and calculate performance:
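The chain of helper classes described above can be sketched as follows. This is a dependency-free illustration of the structure, not the article's code: plain lists of dicts replace the DataFrame, a mean baseline replaces the linear regression and random forest models, the rows are toy values rather than dataset records, and all class and method names are assumptions.

```python
import math

# Toy rows for illustration only; NOT values from the Medical Cost dataset.
TOY_ROWS = [
    {"age": 20, "bmi": 22.0, "sex": "female", "charges": 2000.0},
    {"age": 30, "bmi": 25.0, "sex": "female", "charges": 4000.0},
    {"age": 40, "bmi": 28.0, "sex": "female", "charges": 6000.0},
    {"age": 50, "bmi": 31.0, "sex": "female", "charges": 8000.0},
    {"age": 35, "bmi": 27.0, "sex": "male", "charges": 5000.0},
    {"age": 45, "bmi": 30.0, "sex": "male", "charges": 7000.0},
]


class EDA:
    """Holds the data and simple summaries."""

    def __init__(self, rows):
        self.data = [dict(row) for row in rows]  # copy so the input stays untouched

    def mean(self, column):
        values = [row[column] for row in self.data]
        return sum(values) / len(values)


class FeatureEngineering:
    """Reads the data through an EDA instance and transforms columns."""

    def __init__(self, eda):
        self.data = eda.data

    def log_transform(self, column):
        for row in self.data:
            row[column] = math.log(row[column])


class DataPrep:
    """Owns the bookkeeping dictionaries and the per-category split."""

    def __init__(self, feature_engineering, target):
        self.data = feature_engineering.data
        self.target = target
        self.models, self.predictions, self.performance = {}, {}, {}

    def prepare(self, category, value, split):
        self.category_value = value
        subset = [row for row in self.data if row[category] == value]
        n_test = max(1, round(len(subset) * split))
        self.train, self.test = subset[:-n_test], subset[-n_test:]
        for store in (self.models, self.predictions, self.performance):
            store.setdefault(value, {})  # new category value gets an empty dict


class ModelTraining:
    """Trains on the prepared data and records results by category and model."""

    def __init__(self, data_prep):
        self.prep = data_prep

    def fit_mean_baseline(self):
        prep, key = self.prep, self.prep.category_value
        targets = [row[prep.target] for row in prep.train]
        prediction = sum(targets) / len(targets)
        errors = [abs(row[prep.target] - prediction) for row in prep.test]
        prep.models[key]["mean"] = prediction
        prep.predictions[key]["mean"] = [prediction] * len(prep.test)
        prep.performance[key]["mean"] = sum(errors) / len(errors)


eda = EDA(TOY_ROWS)
features = FeatureEngineering(eda)
features.log_transform("bmi")
features.log_transform("age")
prep = DataPrep(features, target="charges")
for sex in ("female", "male"):
    prep.prepare(category="sex", value=sex, split=0.25)
    ModelTraining(prep).fit_mean_baseline()
print(prep.performance)
```

Each class receives the previous one in its constructor, so a reader can follow the data from loading to transformation to training, and each call names only the arguments relevant to that step.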
We can now run a series of experiments with our hierarchy of classes. For example, we can build a random forest model trained on only data corresponding to female patients:
We can also build a linear regression model trained on only data corresponding to female patients. The performance for this model will be added to the existing performance dictionary:
We can do the same for male patients. These are the results for linear regression:
and for random forest:
We see that we have a dictionary of several experiments and their corresponding model types, category levels and model performance values.
The code used in this post is available on GitHub.
CONCLUSIONS
In this post we discussed how to use object-oriented programming to streamline parts of the data science workflow. First we defined a single ML workflow class that enabled simple EDA, data prep, model training and validation. We then saw how, as we added functionality to our class, method calls on class instances became difficult to read. To avoid issues with reading and interpreting code, we designed a class hierarchy made up of a series of helper classes. Each helper class corresponded to a step within the ML workflow. This makes it easy to understand methods as they relate to high-level tasks, which helps with readability and maintainability. I encourage you to try this with some of your own ML projects.