Esteban Thilliez

Data Science with Python — Data Visualization

Photo by Алекс Арцибашев on Unsplash

A key component of data science is data visualization, which enables us to extract insightful knowledge from large datasets. Python offers strong capabilities for building useful and aesthetically pleasing visualizations because of its diverse ecosystem of libraries.

Today, we will look at how to use Python for data visualization!

Understanding Data Visualization

Data visualization is the graphical representation of data using visual elements such as charts, graphs, and maps.

Its goal is to present complex information in a way that is both visually appealing and approachable for a variety of audiences. Data visualization makes it easier to spot patterns, trends, and relationships, which supports better decision-making.

There are several advantages to using data visualization techniques in data science and analysis. First, it helps you explore and understand the data. By visualizing the data, analysts can quickly grasp its overall structure and distribution. They can also find outliers, clusters, and other noteworthy data points that could be missed in the raw data.

Additionally, data visualization helps communicate insights and conclusions to stakeholders efficiently. Since humans are primarily visual creatures, displaying information graphically makes it easier to take in and remember. With the right visualizations, you can simplify complicated data and present it in a clear, succinct way that non-technical people can understand.

Data visualization also reveals trends. Plotting data over time or across several variables makes relationships visible that are hard to see in a table. Analysts can then draw insights that guide business strategy and decision-making.

There are several sorts of data visualization approaches that may be used, each with distinct goals and types of data. Typical examples of data visualization are as follows:

  • Bar graphs: These are used to compare categorical data or to show the distribution of a single categorical variable.
  • Line charts: Ideal for showing trends over time or the relationship between two continuous variables.
  • Scatter plots: These show the relationship between two numerical variables, which helps identify correlations or clusters.
  • Pie charts: Typically used to represent the composition of a whole and the relative weights of several categories.
  • Heatmaps: Effective for highlighting patterns or correlations in huge datasets or matrices by presenting values using color gradients.
  • Geographic maps: These maps are useful for spatial data representation and can reveal regional or global patterns.
  • Tree diagrams: excellent for displaying hierarchical relationships and data structure.

Getting Started with Data Visualization with Python

You’ll need to install and configure a few libraries before you can start using Python for data visualization. These libraries offer the resources and features necessary to produce attractive and useful visualizations.

One of the most popular Python libraries for data visualization is Matplotlib. If you’ve read my other articles about data science, you should already know it. It can create static, animated, and interactive visualizations. Run the following line in your terminal or command prompt to install Matplotlib using pip:

pip install matplotlib

There’s also seaborn, a higher-level library that is built on top of Matplotlib and provides an easier-to-use interface for producing statistical graphics. It offers more plot types and nicer default styles than Matplotlib:

pip install seaborn

You can also install Plotly and Bokeh, two libraries used for creating interactive visualizations and dashboards. I will talk a bit about them later.

pip install plotly bokeh

And finally, there’s also Pandas but you should already have it installed.

pip install pandas
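If you are not sure which of these libraries are already installed, a quick check from Python can help. This is only a small sketch using the standard library; the `installed_versions` helper and the `PACKAGES` list are names I made up for the illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used with pip in this article
PACKAGES = ["matplotlib", "seaborn", "plotly", "bokeh", "pandas"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as "not installed" can then be added with the matching pip command above.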

Basic Data Visualization Techniques in Python

Let’s start with line plots and scatter plots. They are fundamental data visualization techniques in Python. They allow us to explore and understand patterns and relationships within our data.

When working with a single variable, a line plot is an effective way to visualize its trends and changes over time or other ordered values.

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 12, 9]

# Create a line plot
plt.plot(x, y)

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')

# Display the plot
plt.show()
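The same kind of plot can also be written with Matplotlib's object-oriented interface, which becomes handy once a chart has several series. Here is a minimal sketch; the two "product" series are hypothetical sample data, not part of the example above.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly values for two products
months = [1, 2, 3, 4, 5]
product_a = [10, 15, 7, 12, 9]
product_b = [8, 9, 11, 14, 13]

# plt.subplots() returns the figure and an Axes object to draw on
fig, ax = plt.subplots()
ax.plot(months, product_a, marker="o", label="Product A")
ax.plot(months, product_b, marker="s", linestyle="--", label="Product B")

# Labels, title, and legend are set on the Axes
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Line Plot with Two Series")
ax.legend()
```

Call plt.show() at the end to display the figure, as in the snippet above.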

When dealing with multiple variables, scatter plots are a valuable tool to visualize the relationship between them. Each point on the plot represents a data point, with the x and y coordinates representing the values of two different variables.

x = [1, 2, 3, 4, 5]
y1 = [10, 15, 7, 12, 9]
y2 = [5, 8, 6, 10, 7]

# Create a scatter plot
plt.scatter(x, y1, label='Variable 1')
plt.scatter(x, y2, label='Variable 2')

plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')

# Add legend
plt.legend()

plt.show()
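To look at the relationship between two variables directly, you can also put one variable on each axis, so that every point is one observation. A small sketch under that assumption, with made-up `hours_studied` and `exam_score` values:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations: one point per student
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [52, 55, 61, 64, 70, 74, 79, 85]

fig, ax = plt.subplots()
# Each point uses one variable for x and the other for y
points = ax.scatter(hours_studied, exam_score)

ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("Exam Score vs. Hours Studied")
```

If the points roughly follow a rising line, as they do with this sample data, the two variables tend to increase together. Call plt.show() to display the figure.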

Then, we have bar charts and histograms. Bar charts are an effective way to represent categorical data. Each category is represented by a bar whose height represents the frequency or count of that category.

categories = ['A', 'B', 'C', 'D']
counts = [10, 7, 12, 5]

# Create a bar chart
plt.bar(categories, counts)

plt.xlabel('Categories')
plt.ylabel('Counts')
plt.title('Bar Chart')

plt.show()
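In practice, you often start from raw observations and not from ready-made counts. Here is one possible way to tally them with collections.Counter before plotting; the `responses` list is hypothetical.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical raw observations, one category per response
responses = ["A", "C", "B", "A", "C", "C", "D", "A", "B", "C"]

# Count how often each category appears
tally = Counter(responses)
categories = sorted(tally)
counts = [tally[c] for c in categories]
print(dict(zip(categories, counts)))  # {'A': 3, 'B': 2, 'C': 4, 'D': 1}

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Categories")
ax.set_ylabel("Counts")
ax.set_title("Bar Chart from Raw Data")
```

Call plt.show() to display the figure.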

Finally, histograms are particularly useful for visualizing the distribution of numerical data. They display the frequency or count of data points within different intervals, or “bins.”

data = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7]
# Create a histogram
plt.hist(data, bins=5)

plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram')

plt.show()
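To see what the bins actually contain, here is a plain-Python sketch of equal-width binning, which is the idea behind plt.hist(data, bins=5). The `equal_width_counts` helper is my own illustration, not Matplotlib's implementation, and it assumes the data has at least two distinct values.

```python
def equal_width_counts(values, bins):
    """Split the range [min, max] into equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right, so the maximum lands in it
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


data = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7]
counts, edges = equal_width_counts(data, bins=5)
print(counts)  # [3, 1, 1, 3, 2]
```

Each count is the height of one bar in the histogram: the range from 1 to 7 is cut into five bins of width 1.2, and the three values 1, 2, 2 fall into the first one.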

Advanced Data Visualization Techniques

When we want to get into more advanced plots, we have heatmaps and correlation matrices. They’re powerful tools for visualizing relationships between variables, identifying patterns, and uncovering trends in your data.

Heatmaps are particularly useful when you have a large dataset and want to understand how variables are related to each other. Heatmaps can effectively represent the strength and direction of the relationship between two variables because they use color scales.

import seaborn as sns
import matplotlib.pyplot as plt

# Load your dataset
data = ...

# Calculate the correlation matrix
correlation_matrix = data.corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, cmap='coolwarm', annot=True)
plt.title('Correlation Matrix Heatmap')
plt.show()

You can also just use correlation matrices without heatmaps. Indeed, a correlation matrix is simply a table of the correlation coefficients between all pairs of numerical variables in your dataset, and the heatmap is only one way to display it. Printing the matrix itself is often enough to spot which variables move together!
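Since the heatmap snippet leaves the dataset open, here is a minimal sketch of what `data` could look like and what the correlation matrix contains. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: one row per student
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 58, 61, 70, 74, 83],
    "hours_of_sleep": [9, 8, 8, 7, 6, 6],
})

# Pearson correlation between every pair of numerical columns
correlation_matrix = data.corr()
print(correlation_matrix.round(2))
```

The diagonal is always 1 because each variable is perfectly correlated with itself. With this sample data, the coefficient between hours_studied and exam_score is positive, and the one between hours_studied and hours_of_sleep is negative.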

Then, a bit more advanced, we have box plots and violin plots. They are used to visualize statistical distributions and compare multiple groups or categories. These plots provide insights into the central tendency, spread, and skewness of the data.

Box plots are commonly used to represent the distribution of a single variable or compare distributions between multiple groups. They display statistics such as the median, quartiles, and outliers, making it easy to assess the spread and skewness of the data.

data = ...

plt.figure(figsize=(10, 8))
# Note: Matplotlib 3.9 renamed the `labels` parameter to `tick_labels`
plt.boxplot(data, labels=['Group 1', 'Group 2', 'Group 3'])
plt.title('Box Plot')
plt.show()
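As a sketch of what `data` might be here: plt.boxplot() accepts a list of sequences, one per group. The three groups below are hypothetical, and I set the tick labels separately so the example does not depend on the name of the labels parameter.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements for three groups
groups = {
    "Group 1": [12, 15, 14, 10, 13, 16, 14, 30],
    "Group 2": [20, 22, 19, 24, 21, 23, 22, 20],
    "Group 3": [8, 9, 12, 11, 10, 9, 13, 10],
}

fig, ax = plt.subplots(figsize=(10, 8))
# One box per group; by default, points beyond 1.5 times the
# interquartile range from the box are drawn as outliers
result = ax.boxplot(list(groups.values()))

# Boxes are placed at x = 1, 2, 3, ...
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups))
ax.set_title("Box Plot")
```

With this sample data, the value 30 in Group 1 sits far above the rest and is drawn as an individual outlier point. Call plt.show() to display the figure.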

Violin plots are similar to box plots but provide additional information about the density of the data at different values. They are useful when you want to compare distributions across multiple groups or categories.

data = ...

plt.figure(figsize=(10, 8))
sns.violinplot(x='group', y='value', data=data)
plt.title('Violin Plot')
plt.show()
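The violin plot call above expects "long-form" data: one row per observation, with a column named group and a column named value. If your data has one column per group, a reshape along these lines could produce that layout. The `wide` DataFrame is hypothetical sample data.

```python
import pandas as pd

# Hypothetical wide-form data: one column per group
wide = pd.DataFrame({
    "Group 1": [12, 15, 14, 10, 13],
    "Group 2": [20, 22, 19, 24, 21],
    "Group 3": [8, 9, 12, 11, 10],
})

# melt() stacks the columns into (group, value) pairs, one row per observation
data = wide.melt(var_name="group", value_name="value")
print(data.head())
```

The resulting `data` can then be passed to sns.violinplot(x='group', y='value', data=data).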

Customizing and Enhancing Data Visualizations

The audience can more readily grasp the information offered when the visualizations are formatted properly, which improves their readability and clarity. So, let’s see how we can do this.

Matplotlib’s plt.xlabel(), plt.ylabel(), and plt.title() functions can be used to format labels and titles. They let you customize the text, font size, font weight, and other properties.

For example, to set the x-axis label to “Year” and the y-axis label to “Sales”:

plt.xlabel("Year", fontsize=12, fontweight="bold")
plt.ylabel("Sales", fontsize=12, fontweight="bold")

Similarly, you can customize the appearance of axes by adjusting their limits, ticks, and tick labels. The plt.xlim() and plt.ylim() functions allow you to set the lower and upper bounds of the x-axis and y-axis, respectively.

To change the ticks and tick labels, you can use the plt.xticks() and plt.yticks() functions. These functions enable you to specify the positions and labels of the ticks on the axes.
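Here is a short sketch that puts these four functions together. The yearly sales figures are made up for the illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical yearly sales
years = [2019, 2020, 2021, 2022, 2023]
sales = [120, 95, 140, 160, 185]

fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")

# Axis limits: leave half a year of margin on each side, start y at zero
plt.xlim(2018.5, 2023.5)
plt.ylim(0, 200)

# Ticks: one per year with explicit labels, and every 50 units on the y-axis
plt.xticks(years, [str(year) for year in years])
plt.yticks([0, 50, 100, 150, 200])
```

Setting the x ticks explicitly avoids fractional tick values such as 2019.5, which Matplotlib may otherwise pick for numeric years. Call plt.show() to display the figure.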

Then, you probably want to add legends. Indeed, legends and annotations play a crucial role in data visualizations as they provide additional context and help in understanding/interpreting the plots. Matplotlib provides various functions to add legends and annotations to your visualizations.

To create a legend, you can use the plt.legend() function. By default, it collects the label you gave each plot element through the label argument; you can also pass a list of labels directly. You can customize the position, size, and appearance of the legend using additional parameters.

plt.scatter(x, y, color='blue', label='Male')
plt.scatter(x, z, color='red', label='Female')
plt.legend(loc='upper right', fontsize=10)

Annotations can be used to highlight specific points or provide additional information in the visualization. The plt.annotate() function allows you to place text annotations at specified coordinates on the plot.
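A minimal sketch of plt.annotate(), reusing the sample values from the line plot earlier. The text position passed as xytext is an arbitrary choice for this example.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 12, 9]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")

# Find the highest point and label it with an arrow
peak_index = y.index(max(y))
note = plt.annotate(
    "Peak",
    xy=(x[peak_index], y[peak_index]),  # the point being annotated
    xytext=(3, 14.5),                   # where the text is placed
    arrowprops=dict(arrowstyle="->"),   # arrow drawn from the text to the point
)
```

Call plt.show() to display the figure.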

Finally, you can customize your plots using color schemes. They greatly impact the visual appeal and interpretation of data visualizations.

You can apply a specific color map through the cmap argument that functions such as plt.scatter() and plt.imshow() accept, or set the default with plt.set_cmap(). Both take the name of the color map, such as "viridis".
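One way to apply a color map that I know works is the cmap argument of plotting functions such as scatter(). A small sketch, where the `temperature` values that drive the colors are hypothetical:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6]
y = [3, 5, 4, 6, 8, 7]
# Hypothetical third variable that decides each point's color
temperature = [12, 15, 18, 21, 25, 30]

fig, ax = plt.subplots()
# c gives the values to map to colors, cmap names the color map
points = ax.scatter(x, y, c=temperature, cmap="viridis")

# The colorbar shows which color corresponds to which value
fig.colorbar(points, ax=ax, label="Temperature")
ax.set_title("Scatter Plot with a Color Map")
```

Sequential color maps like "viridis" suit values that go from low to high, while diverging ones like "coolwarm" (used in the heatmap above) suit values centered on zero, such as correlations. Call plt.show() to display the figure.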

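To change the style of your visualizations, you can use the plt.style.use() function. Matplotlib offers several built-in styles, such as "ggplot" and "seaborn-v0_8" (called "seaborn" before Matplotlib 3.6). Set the style before you create the plot, because it only affects figures created afterwards.

import matplotlib.pyplot as plt

# Apply the style first, then plot
plt.style.use('ggplot')
plt.plot(x, y)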
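Style names vary between Matplotlib versions, so it can help to list the ones your installation provides. The sketch below also uses plt.style.context(), which applies a style only inside the with block and leaves the global settings alone.

```python
import matplotlib.pyplot as plt

# List the style names available in this Matplotlib installation
print(plt.style.available)

# Apply a style temporarily, only for figures created inside the block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3, 4, 5], [10, 15, 7, 12, 9])
    ax.set_title("Styled with ggplot")
```

Call plt.show() to display the figure.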

Interactive Data Visualization

Interactive data visualization means being able to interact with the charts, for example by zooming in or out and panning around. This is very convenient when exploring data.

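There are many libraries for creating interactive charts. First, Matplotlib can do it in Jupyter Notebook. You just have to use the magic command %matplotlib notebook (in JupyterLab and recent Notebook versions, use %matplotlib widget with the ipympl package installed).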

Then, there’s Plotly, another popular Python package for interactive data visualization. It offers a wide variety of interactive plot types and can be used to build interactive dashboards and web apps. Plotly figures can be rendered as interactive HTML or exported as static images.

import plotly.graph_objects as go

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]

fig = go.Figure(data=go.Scatter(x=x, y=y, mode='markers'))
fig.update_layout(title='Interactive Scatter Plot', xaxis_title='X-axis', yaxis_title='Y-axis')
fig.show()

Finally, you can also use Bokeh.

Bokeh is a powerful library for creating interactive visualizations in Python. It offers a simple interface for building interactive plots, dashboards, and apps. Bokeh supports several output targets, including standalone HTML files, notebooks, and server-backed applications.

from bokeh.plotting import figure, show

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]

p = figure(title='Interactive Line Plot', x_axis_label='X-axis', y_axis_label='Y-axis')
p.line(x, y)
show(p)

Best Practices for Building Beautiful Charts

If you want to convey information easily, I think there are some best practices to follow. You don’t want a chart like the one below, do you?

Source: Ugly Charts

First, you should understand your data. Learn about the features, structure, and relationships in your dataset. Choose the appropriate visualization technique for the type of data you’re working with, whether it’s numerical, categorical, or time series.

Then, you should match the visualization to the message. Decide which trends or ideas you wish to emphasize. Different visualization methods suit different tasks: line charts are excellent for displaying trends, bar charts for comparing categories, pie charts for showing parts of a whole, and scatter plots for examining relationships.

Also, keep it simple by not packing too many elements into your visualizations. Concentrate on the most important data. To improve readability, use plain titles and clear labels.

Finally, choose an appropriate color scheme. Don’t build something too crazy! Instead, opt for color combinations that are visually appealing and facilitate comprehension. Avoid overly bright or clashing colors that can distract or confuse viewers.

Final Note

Now you know a bit more about data visualization with Python! For the other articles of this series, I always illustrated with an example, but here I don’t know if it’s necessary. We’ll see.

In the meantime, be sure to follow me if you don’t want to miss the other articles of this series!
