Python Libraries Every Data Scientist and Data Analyst Should Know
Why Python in Data Science?
Python has a strong reputation among many developers. Since its first appearance in 1991, Python has become one of the most popular interpreted languages, along with Perl, Ruby, and others.
Python and Ruby have become particularly popular since 2005 or so, due to their increased use for building websites using various web frameworks, such as Rails (Ruby) and Django (Python).
Often called scripting languages, these languages can be used to write small programs quickly or to automate tasks. The term “scripting language” suggests that such a language cannot be used to build large programs, but this is far from reality: Python can be used to build programs of any size. For example, large parts of Instagram and Dropbox were built using Python. Among the interpreted languages, and for many reasons, Python has developed a large and active community in scientific computing and data analysis. In the past ten years, Python has moved from a language used mainly for advanced scientific computing to one of the most important languages for data science, machine learning, artificial intelligence, and general software development in academia and industry.
Python compares favorably with other open source and commercial programming languages and tools widely used for data analysis, interactive computing, and data visualization, such as R, MATLAB, SAS, and Stata. This is due to its ease of use and the availability of a huge ecosystem of libraries that support data science (pandas, NumPy, Matplotlib, scikit-learn, etc.). Combined with Python’s overall strength in general-purpose software engineering, this makes it an excellent choice as a primary language for building data science applications.
Another important reason for Python’s success in data science and scientific computing is the ease with which it integrates C, C++, and FORTRAN code. Most modern computing environments share a similar set of legacy FORTRAN and C libraries for linear algebra, calculus, fast Fourier transforms, and other algorithms. Low-level languages such as C and FORTRAN matter for numerical calculations because of their speed: moving a computational bottleneck into compiled code can remove it.
Python Libraries for Data Science
In this article, I’ll give a quick overview of some of the most commonly used Python libraries in data science:
NumPy
NumPy, short for Numerical Python, has long been the cornerstone of numerical computing in Python. It provides the data structures, algorithms, and library interfaces needed for most scientific applications involving numerical data in Python.
You can install it using pip as follows:
pip install numpy
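As a rough illustration of the array data structure and vectorized operations mentioned above, here is a minimal sketch. The temperature values and variable names are made up for the example.

```python
import numpy as np

# Hypothetical sample data: daily temperatures in Celsius
celsius = np.array([21.5, 23.0, 19.8, 25.1])

# Vectorized arithmetic: applied to every element without a Python loop
fahrenheit = celsius * 9 / 5 + 32

# Built-in reductions over the whole array
mean_c = celsius.mean()
max_f = fahrenheit.max()

print(fahrenheit)
print(mean_c, max_f)
```

The conversion is written once and NumPy applies it to the whole array, which is usually both shorter and faster than an explicit loop.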
Pandas for Data Science
Pandas provides high-level data structures and algorithms designed to make working with structured, or tabular, data fast, easy and expressive.
Since its inception in 2010, Pandas has helped make Python a powerful and productive data analysis environment. Pandas blends NumPy’s high-performance array computing with the flexible data manipulation capabilities of spreadsheets and relational databases (such as SQL). You can install pandas using the command:
pip install pandas
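To give a feel for working with tabular data, here is a small sketch that builds a DataFrame, adds a computed column, and aggregates by group, much like a SQL GROUP BY. The sales records and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 7, 3, 12],
    "price": [2.5, 4.0, 2.5, 4.0],
})

# Column-wise arithmetic creates a new column in one step
sales["revenue"] = sales["units"] * sales["price"]

# Group rows by region and sum the revenue within each group
revenue_by_region = sales.groupby("region")["revenue"].sum()

print(revenue_by_region)
```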
Matplotlib
Matplotlib is the most popular Python library for producing 2D visualizations of data. It was originally created by John D. Hunter and is now maintained by a large team of developers. Matplotlib is widely used by data scientists for its ability to produce high-quality figures that are suitable for publication.
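A minimal sketch of a 2D plot might look like the following. The data is made up, and the figure is written to an in-memory buffer with the non-interactive Agg backend so that the example runs without a display; in a notebook you would typically just call plt.show() instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data: squares of the first few integers
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.legend()

# Render the figure as PNG bytes at a print-friendly resolution
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)
png_bytes = buffer.getvalue()
```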


IPython and Jupyter
The IPython project started in 2001 as a side project by Fernando Pérez to provide a better interactive Python interpreter.
In the years since, IPython has become one of the most important tools in the modern Python ecosystem. Although it does not provide any computational or analytical tools for data per se, IPython was designed from the ground up to increase the productivity of Python users in both interactive computing and software development. In 2014, Fernando Pérez and the IPython team announced Project Jupyter, a broader initiative to design language-neutral interactive computing tools.

The IPython web notebook then became the Jupyter Notebook, an interactive web-based “notebook” tool that now supports more than 40 programming languages. Jupyter Notebooks are especially useful for exploring and visualizing data. The Jupyter Notebook system also supports content authoring in Markdown and HTML, providing a means of creating rich documents that contain both code and text.

SciPy
SciPy (Scientific Python) is a collection of packages that address a number of standard problem areas in scientific computing, such as numerical integration, solving differential equations, linear algebra and matrix analysis, and many more. Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional data science and scientific computing applications.
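As a small sketch of two of the problem areas listed above, the following integrates a function numerically and solves a linear system. The matrix and vector are arbitrary values chosen for the example.

```python
import numpy as np
from scipy import integrate, linalg

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Linear algebra: solve the system A @ x = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

print(area)
print(x)
```

Here quad also returns an estimate of the absolute error, which is a useful sanity check when the exact answer is not known.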

Statsmodels
This is a statistical analysis package that grew out of work by Stanford University statistics professor Jonathan Taylor. In 2010, Skipper Seabold and Josef Perktold formally created the new statsmodels project, and since then it has grown a large base of users and contributors. Statsmodels contains algorithms for both classical statistics and econometrics. The library includes submodules for regression models, analysis of variance, and time series analysis, in addition to data visualization.
There are many nice examples that demonstrate the use of the library in its official documentation.
Scikit-learn
Since the project’s inception in 2010, scikit-learn has become a major general-purpose machine learning library for Python programmers. In just seven years, it attracted more than 1,500 contributors from all over the world. The library includes modules for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Alongside Pandas, statsmodels, and IPython, scikit-learn has been important in making Python a preferred and highly productive programming language for data science.
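The sketch below ties together three of those areas (preprocessing, classification, and model selection) on the iris dataset that ships with scikit-learn. The choice of logistic regression and the split settings are arbitrary picks for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Model selection: hold out a quarter of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing + classification chained into a single estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Mean accuracy on the held-out data
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Most scikit-learn estimators share this same fit, predict, and score interface, so swapping in a different model usually means changing a single line.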

Conclusion
In this article, I presented the major Python libraries that I use in my data science projects. The amazing thing about these libraries, and about Python in general, is the size of the development community. Python's popularity on GitHub and Stack Overflow makes it a very good choice when it comes to open source tools and free online support.
So, what is your favorite library?






