10 Python Libraries you should Master for Data Science

This article is part of the “Data Science with Python” series.

I’ve talked about a lot of libraries in this series on data science. Today, I’m going to give you a rundown of the most important ones for data science, the ones you really need to know about.

There will be a few new ones that I haven’t yet mentioned in this series.

NumPy

Obviously, I start with NumPy. This library is so powerful, especially for data science. It lets you work with arrays and matrices of any dimension.

NumPy is therefore widely used to perform operations on arrays, something that happens all the time when you’re doing data science.

Here’s how NumPy is mainly used:

  • Computations: NumPy offers a high-performance interface for numerical computation. It provides a comprehensive collection of mathematical functions and algorithms, so you can carry out operations such as matrix multiplication, dot products, and other linear algebra routines very efficiently. It also supports complex numbers and Fourier transforms.
  • Data Analysis: NumPy also offers functions for calculating statistics, such as the mean, variance, and standard deviation.
  • Machine Learning: NumPy is not a machine learning library itself, but most machine learning libraries are built on top of it.
  • Image processing: NumPy also works well with images, which are just arrays of pixel values. You can enlarge or crop them, change their colors, and much more.
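
As a quick illustration, here is a minimal sketch of the kind of array operations described above (the values are made up purely for the example):

```python
import numpy as np

# A small matrix and vector with made-up values
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([0.5, -1.0])

# Linear algebra: matrix-vector product and dot product
product = matrix @ vector
dot = np.dot(vector, vector)

# Basic statistics on the array
mean = matrix.mean()
variance = matrix.var()

print(product, dot, mean, variance)
```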

Matplotlib

Often, when we have data, we like to visualize it; it makes the data more concrete. That’s what Matplotlib lets you do. You can easily create all kinds of graphs, from simple line plots to more complicated ones such as heatmaps.

Matplotlib is mainly used for:

  • Data visualization: of course, that’s its main role.
  • Exploratory data analysis: This step often requires you to visualize your data to better understand it. With Matplotlib, you have plenty of ways to visualize your data by summarizing it and viewing its statistics.
  • Model validation: Matplotlib lets you visualize certain important metrics to help you evaluate your model. I’m thinking in particular of the confusion matrix.
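
For example, a minimal line plot might look like this (the data is invented purely for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented data: a noisy sine wave
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.1, size=x.size)

plt.plot(x, y, label="noisy sine")
plt.xlabel("x")
plt.ylabel("y")
plt.title("A simple line plot")
plt.legend()
plt.show()
```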

Pandas

You’ve all heard of Pandas. This library comes in handy because it lets you work with data structures that are easy to understand and manipulate. Where a NumPy array can sometimes seem a little abstract, a Pandas DataFrame is very easy to understand.

Pandas is widely used for the following tasks:

  • Data exploration: Pandas offers a host of functions for summarizing your data and finding out its statistics.
  • Data cleaning: Pandas can easily replace missing data, delete duplicates, transform your data and apply the operations of your choice…
  • Data preparation: you can easily merge multiple DataFrames together, extract data from CSV or Excel files and format your data.
  • Machine learning: when you need to feed your model with data, that data can come in the form of a Pandas DataFrame, which is very practical and easy to manipulate.
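
Here is a minimal sketch of these steps; the file names and column names (sales.csv, regions.csv, price, region_id) are hypothetical:

```python
import pandas as pd

# Hypothetical CSV file with hypothetical columns
df = pd.read_csv("sales.csv")

# Data exploration: summary statistics and structure
print(df.describe())
print(df.info())

# Data cleaning: drop duplicates and fill missing values
df = df.drop_duplicates()
df["price"] = df["price"].fillna(df["price"].mean())

# Data preparation: merge with another hypothetical DataFrame
regions = pd.read_csv("regions.csv")
df = df.merge(regions, on="region_id", how="left")
```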

Scikit-learn

This library makes it so easy to create machine learning models. Just import 2–3 classes and you’re ready to develop a model capable of predicting data with great accuracy.

Scikit-learn is used for all model-related tasks:

  • Model development: As I said, all you need to do is import the necessary classes, and in just a few lines of code you can develop a very powerful model.
  • Model selection: Rather than manually looping over and evaluating several models yourself, Scikit-learn provides tools such as cross-validation and grid search to compare them and find the best one for you.
  • Model evaluation: Scikit-learn can calculate the main metrics for evaluating your model (accuracy, precision, F1 score, etc.).

In addition, Scikit-learn lets you perform the following tasks:

  • Data preprocessing: you have functions for normalizing your data, separating it into test and training data…
  • Feature engineering: to optimize your models, you need to go through feature engineering. It’s often a pain, but Scikit-learn makes the process pretty straightforward.
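
As a sketch of this workflow (preprocessing, training, and evaluation), using the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load a built-in dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data preprocessing: scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model development: a couple of lines is enough
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Model evaluation: main metrics
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))
```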

Keras

Keras is an open-source library used to develop neural networks. It runs on top of TensorFlow (which won’t be included in this list, even though it could be). Keras is used for all kinds of deep learning tasks, for example:

  • Image recognition: Keras has been extensively used in image recognition tasks such as object detection, image classification, and image segmentation.
  • Natural language processing (NLP): Keras supports various architectures, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which are commonly used for NLP tasks.
  • Recommender systems: Keras can be used to build recommendation systems that provide personalized recommendations to users.
  • And more, this list is not exhaustive.
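
Here is a minimal sketch of a small Keras classifier; the input size, layer sizes, and number of classes are arbitrary choices for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network for a hypothetical
# 10-class problem with 784 input features
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()
# model.fit(X_train, y_train, epochs=5)  # would require real training data
```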

Seaborn

Seaborn is a library much like Matplotlib (in fact, it’s based on Matplotlib), i.e. it’s used to visualize your data. So, why use Seaborn rather than Matplotlib?

Seaborn integrates better with DataFrames than Matplotlib. On top of that, it offers different color palettes from Matplotlib’s, and the graphs it generates don’t have quite the same look.

So if you need a simple way of visualizing your DataFrames, opt for Seaborn. The applications are the same as those for Matplotlib.
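
For instance, plotting directly from a DataFrame takes one line (using Seaborn’s built-in “tips” example dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Seaborn ships with a few example datasets
tips = sns.load_dataset("tips")

# Plot directly from DataFrame columns
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()
```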

Beautiful Soup & Selenium

It may seem strange that I’m including these libraries in an article on data science, but I’m convinced that web scraping and data science can be linked. For those unfamiliar with web scraping, it’s all about extracting data from the web, and BS & Selenium are the two main libraries used to perform this task.

Beautiful Soup and Selenium are very useful when you want to apply data science to data that lives on a website and can’t be obtained any other way than by retrieving the site’s code. In other words, Beautiful Soup and Selenium are used for the data extraction stage.

As an example of how web scraping and data science are linked, I was able to retrieve 35,000 dreams posted on a forum dedicated to dreams and analyze them to produce statistics. Try to find a dataset containing 35,000 dreams; I assure you it’s no easy task.
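
As a sketch of what that extraction stage can look like with Beautiful Soup (the URL and the CSS class below are hypothetical, not the actual forum I used):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical forum page; a real scraper would loop over many pages
url = "https://example.com/forum/dreams"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every post (the "post-content" class is made up)
posts = [div.get_text(strip=True) for div in soup.find_all("div", class_="post-content")]
print(len(posts), "posts extracted")
```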

Natural Language Toolkit

Natural Language Toolkit (NLTK) is a library used for natural language processing. It provides a wide range of tools, data sets, and resources for tasks such as tokenization, stemming, tagging, parsing, semantic reasoning, and more.

It is well suited to the following NLP tasks:

  • Tokenization: Breaking text into individual words or sentences.
  • Stemming: Reducing words to their base or root form.
  • Part-of-speech tagging: Assigning grammatical tags to words (e.g., noun, verb, adjective).
  • Chunking and parsing: Analyzing sentence structure and extracting meaningful chunks.
  • Named entity recognition: Identifying named entities like names, organizations, locations, etc.
  • Sentiment analysis: Determining the sentiment or opinion expressed in a text.
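
A minimal sketch of tokenization and part-of-speech tagging with NLTK (the resource names passed to nltk.download may vary slightly between NLTK versions):

```python
import nltk

# Download the required resources on first use
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK makes natural language processing much more approachable."

# Tokenization: split the text into words
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging: assign a grammatical tag to each token
tags = nltk.pos_tag(tokens)
print(tags)
```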

Pillow

Pillow is a fork of PIL (Python Imaging Library), used to work with images. With Pillow, you can perform various image manipulation operations such as opening and saving images, resizing and cropping images, applying filters and effects, adjusting colors and contrast, adding text and annotations, and much more. It’s more convenient to use Pillow than NumPy when working with images.

Pillow is used for:

  • Opening and Saving Images: Pillow allows you to open images in different file formats, such as JPEG, PNG, GIF, BMP, and more. You can also save images in various formats, convert between different formats, and compress images if needed.
  • Resizing and Cropping: Pillow enables you to resize images to specific dimensions or scale them proportionally. You can also crop images to extract a specific region of interest.
  • Image Filtering and Enhancement: Pillow provides numerous filters and effects to enhance and modify images. You can apply operations like blurring, sharpening, adjusting brightness, contrast, and saturation, and performing color transformations.
  • Image Manipulation: Pillow allows you to perform various operations like rotating, flipping, and mirroring images. You can also overlay images, blend multiple images together, and extract specific color channels.
  • And more…
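
A quick sketch of common Pillow operations (the file names are hypothetical):

```python
from PIL import Image, ImageFilter

# Open an image (hypothetical file name)
img = Image.open("photo.jpg")

# Resizing and cropping
thumbnail = img.resize((200, 200))
region = img.crop((0, 0, 100, 100))  # left, upper, right, lower

# Filtering and manipulation
blurred = img.filter(ImageFilter.GaussianBlur(radius=2))
rotated = img.rotate(90)

# Save in another format
rotated.save("photo_rotated.png")
```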

Statsmodels

Statsmodels is a library that provides a wide range of statistical models and statistical data analysis tools. It is built on top of NumPy, SciPy, and Pandas, and it is designed to complement these libraries by providing statistical modeling capabilities.

It allows you to perform statistical analyses such as:

  • Regression analysis: You can fit various types of regression models, such as ordinary least squares (OLS), generalized linear models (GLM), and robust linear models.
  • Time series analysis: Statsmodels provides functionality for analyzing time series data, including autoregressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) models, seasonal decomposition of time series, and vector autoregression (VAR) models.
  • ANOVA (Analysis of Variance): You can perform ANOVA tests to analyze the differences between groups and determine if these differences are statistically significant.
  • Multivariate analysis: Statsmodels supports multivariate statistical models, such as principal component analysis (PCA), factor analysis, and multivariate analysis of variance (MANOVA).
  • Nonparametric methods: The library includes nonparametric methods, such as kernel density estimation, kernel regression, and rank-based tests.
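
As a minimal sketch, here is an ordinary least squares regression on randomly generated data:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Add an intercept term and fit an OLS model
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# The summary includes coefficients, p-values, R-squared, etc.
print(model.summary())
```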

Final Note

This list is a bit biased, as the libraries I prefer are simply the ones I use the most. But objectively speaking, these are widely used and recognized libraries, so if you master them, you can be sure they’ll come in handy.

Thanks for reading!
