PYTHON | DATA SCIENCE | ARTIFICIAL INTELLIGENCE
Unleashing the Power of Python: The Top 4 Must-Know Packages of 2023
Transform your data science workflow with these cutting-edge libraries
Python is a versatile and powerful programming language that is widely used in the field of data science. With its vast ecosystem of libraries and frameworks, Python offers an extensive range of tools and functionality for data scientists, engineers, and developers.
In this article, we will discuss four of the most popular and influential Python libraries that were released or updated in 2023 and are worth keeping an eye on.
Hugging Face’s Transformers
Hugging Face’s Transformers is a Python library for state-of-the-art natural language processing (NLP) that allows for easy and fast processing of text-based data.
The library works with popular deep learning frameworks such as PyTorch, TensorFlow, and JAX. It provides a wide range of pre-trained models for various NLP tasks such as text classification, language translation, text generation, and more.
One of the main strengths of the Transformers library is the ability to fine-tune pre-trained models on specific tasks and datasets. This allows users to quickly adapt state-of-the-art models to their use cases without training from scratch. The library also supports many languages and model architectures, making it a versatile tool for NLP tasks.
One of the most popular models within the library is BERT (Bidirectional Encoder Representations from Transformers), which has achieved state-of-the-art results on a wide range of NLP tasks.
BERT is a pre-trained model that can be fine-tuned on specific tasks, such as text classification and named entity recognition.
To use the Transformers library, you first need to install it by running the following:
pip install transformers
Once installed, you can begin importing the library and selecting a pre-trained model.
For example, to fine-tune BERT on a text classification task, you would use the following code:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
# Load pre-trained model and configure it for the task
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Fine-tune the model on your task-specific dataset
# (train_dataset and test_dataset stand for datasets you have already tokenized)
training_args = TrainingArguments(output_dir="bert-finetuned")
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
# Use the model to make predictions on new data
predictions = trainer.predict(test_dataset)
In this example, we first import the BertForSequenceClassification model from the library and initialize it with the pre-trained “bert-base-uncased” model.
PyTorch models in Transformers have no fit or predict methods, so we hand the model to the library’s Trainer class and call its train method to fine-tune it on our task-specific dataset (train_dataset).
Finally, we use the trainer’s predict method to run the fine-tuned model on new data (test_dataset).
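To make the fine-tuning step more concrete, here is a minimal sketch of a single training step written in plain PyTorch. It assumes a deliberately tiny, randomly initialized BERT (the configuration sizes, the fake token ids, and the three labels are all made up for illustration), so nothing is downloaded and it runs in seconds. A real fine-tuning run would start from pre-trained weights and loop over many batches.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Hypothetical tiny configuration: the sizes are chosen only so the sketch runs fast.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
model = BertForSequenceClassification(config)  # random weights, nothing is downloaded

# Fake batch: 4 "sentences" of 16 token ids each, with one class label per sentence.
input_ids = torch.randint(0, config.vocab_size, (4, 16))
labels = torch.tensor([0, 1, 2, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One training step: forward pass, loss, backward pass, parameter update.
model.train()
outputs = model(input_ids=input_ids, labels=labels)  # the loss is returned because labels are passed
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# One row of logits per sentence, one column per label.
print(tuple(outputs.logits.shape))
```

The num_labels argument controls the width of the classification head, which is why the logits have one column per class.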
Overall, Hugging Face’s Transformers library is a powerful tool for NLP tasks, offering a wide range of pre-trained models and the ability to fine-tune them on specific tasks.
Its easy-to-use API and support for multiple languages and models make it a valuable resource for data scientists and NLP practitioners.
Optuna
Optuna is a powerful Python library for automating the hyperparameter tuning process in machine learning. It allows for the efficient optimization of simple and complex models, making it a valuable tool for data scientists and machine learning practitioners.
One of the key features of Optuna is its ability to handle large and complex search spaces. By default, it uses the Tree-structured Parzen Estimator (TPE) algorithm, and it also ships other samplers such as CMA-ES and random search, to explore the search space efficiently and find good parameter values.
Additionally, Optuna supports the parallelization of the optimization process, allowing for faster convergence and improved performance.
Another great feature of Optuna is its flexibility. It is framework-agnostic, so it can be used with machine learning libraries such as scikit-learn, Keras, and XGBoost. Current releases require Python 3.
It also provides a convenient API for storing and loading trials, making it easy to resume interrupted or long-running studies.
To give an example of Optuna’s usage, let’s say we want to optimize the hyperparameters of a random forest classifier using scikit-learn. First, we need to install the library with pip install optuna.
Next, we import the necessary libraries and define the objective function. The objective function is the function that Optuna will optimize. It takes in a set of hyperparameters and returns a performance metric, such as accuracy or F1-score.
In this example, we are using accuracy:
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    return accuracy
We then create an Optuna study and set the objective function:
# Accuracy should be maximized; create_study() minimizes by default
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
Finally, we can access the best hyperparameters and the corresponding accuracy:
best_params = study.best_params
best_accuracy = study.best_value
print("Best params: ", best_params)
print("Best accuracy: ", best_accuracy)
PyTorch Lightning
PyTorch Lightning is a lightweight framework for PyTorch that allows for more efficient and organized research and development. It is designed to remove the boilerplate code commonly found in traditional PyTorch projects, making the development process faster and easier.
One of the key features of PyTorch Lightning is the ability to quickly scale your models to multiple GPUs and TPUs with minimal code changes. This allows for faster training times and more efficient use of resources.
Additionally, PyTorch Lightning includes several callbacks and metrics that can be used to monitor the progress of your training and make adjustments as needed.
Another great feature of PyTorch Lightning is its support for distributed training. With just a few lines of code, you can efficiently train your models on multiple machines, further speeding up the training process. This is especially useful for large models or datasets that may not fit on a single machine.
PyTorch Lightning includes many other valuable features, such as early stopping, checkpointing, and automatic logging. These features help to keep your code organized and make it easier to track the progress of your research.
We can install this package through pip as follows: pip install pytorch-lightning.
Here’s an example of how you can use PyTorch Lightning to train a simple model:
import pytorch_lightning as pl
import torch
from torch.nn import Linear
from torch.nn.functional import mse_loss
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = Linear(10, 10)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = mse_loss(logits, y)
        return {'loss': loss}

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=0.01)

# Random placeholder data so that the example runs end to end
train_loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 10)), batch_size=16)
model = MyModel()
trainer = pl.Trainer(max_epochs=1)
trainer.fit(model, train_loader)
In this example, we define a simple PyTorch model, MyModel, that inherits from pl.LightningModule.
We define the forward pass, training step, and optimizer for the model, wrap some random placeholder data in a DataLoader, and then use pl.Trainer to train the model. With PyTorch Lightning, training is as simple as calling the fit method of the trainer.
Overall, PyTorch Lightning is a powerful and easy-to-use framework that can help you to improve the efficiency and organization of your PyTorch projects.
Its support for distributed training, automatic logging, and other features can help speed up your research and make it easier to track the progress of your experiments.
Dask
Dask is a parallel computing library that makes it easy to work with large datasets. It is designed to work seamlessly with existing Python libraries such as NumPy and Pandas, and it can be used to scale computations to a cluster of machines. It can also perform distributed computing with other libraries, such as PyTorch and TensorFlow.
It allows developers to harness their hardware’s full power, whether a single machine or a cluster, by providing an easy-to-use interface for parallel processing.
One of the key features of Dask is its ability to scale up the computational power of Python libraries such as NumPy, Pandas, and Scikit-learn. This makes it an ideal tool for working with large datasets and complex computations, as it allows you to parallelize your code and take advantage of all available resources.
Dask also provides a robust array computation library, similar to NumPy, called Dask arrays. Dask arrays implement a large subset of the NumPy array interface but can handle larger-than-memory and out-of-core computations. This allows you to work with large arrays that would not fit into memory, making it an excellent tool for big data and scientific computing.
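As a small illustration of the chunked model, the sketch below builds a one-million-element array split into ten chunks and reduces it. The array size and chunk size are arbitrary choices for the example; with real larger-than-memory data the same code applies, only the numbers grow.

```python
import dask.array as da

# One million integers, stored as 10 chunks of 100,000 elements each.
numbers = da.arange(1_000_000, chunks=100_000, dtype="int64")

# Building the expression is lazy: no chunk has been computed yet.
total = numbers.sum()

# compute() evaluates the chunks (in parallel where possible) and combines the partial sums.
total_value = int(total.compute())

print(numbers.numblocks, total_value)
```

Each chunk is summed on its own and the partial results are then combined, which is what lets the same pattern work when the full array does not fit in memory.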
Dask offers a powerful dataframe library, similar to Pandas, called Dask dataframes. Dask dataframes can handle larger-than-memory and out-of-core computations and provide a similar API to the Pandas dataframe. This allows you to work with large datasets that would not fit into memory and perform complex calculations.
Dask also provides a powerful task scheduler that efficiently parallelizes your code and schedules tasks across multiple CPU cores or machines. This allows you to quickly scale your computations to take advantage of all available resources, whether working on a single machine or a cluster.
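The scheduler can also be used directly on ordinary Python functions through dask.delayed. The following sketch, with two made-up helper functions (square and add_all), builds a small task graph and lets Dask decide how to run it.

```python
import dask


@dask.delayed
def square(n):
    return n * n


@dask.delayed
def add_all(values):
    return sum(values)


# Calling delayed functions records tasks instead of running them.
squares = [square(n) for n in range(5)]
total = add_all(squares)

# compute() hands the task graph to the scheduler; the independent
# square() tasks can run in parallel before add_all() combines them.
result = total.compute()

print(result)  # 0 + 1 + 4 + 9 + 16 = 30
```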
We can install Dask using: pip install "dask[complete]" (the quotes keep shells such as zsh from interpreting the square brackets).
Here is a simple example of using Dask to parallelize a computation:
import dask
import dask.array as da
x = da.random.random((1000, 1000), chunks=(100, 100))
y = da.random.random((1000, 1000), chunks=(100, 100))
z = x + y
result = z.compute()
In this example, we create two large arrays, x and y, and add them together using the + operator.
Dask arrays are evaluated lazily, so the addition only builds a task graph. Calling z.compute() triggers the computation, which runs in parallel across multiple cores or machines, depending on your configuration.
In conclusion, Dask is a powerful and flexible library that makes it easy to parallelize and distribute computations in Python. It allows you to scale up the computational power of your code, whether you’re working with large datasets or complex computations, by providing an easy-to-use interface for parallel processing.
Closing Remarks
These four libraries are just a few examples of the powerful tools available to data scientists in 2023. With the constant evolution of the field, it’s always worth keeping an eye out for new and exciting libraries that can help streamline your work and improve your results.
In this article, we have barely scratched the surface of what these libraries can offer. I recommend you visit their respective pages and spend some time playing around with them. They might save you one day! Covering each library in greater detail would merit separate articles.
