Free AI web copilot to create summaries, insights and extended knowledge, download it at here

9506

Abstract

pan class="hljs-number">10, 6)) corr = wines.corr() hm = sns.heatmap(round(corr,2), annot=True, ax=ax, cmap="coolwarm",fmt='.2f', linewidths=.05) f.subplots_adjust(top=0.93) t= f.suptitle('Wine Attributes Correlation Heatmap', fontsize=14)</pre></div><figure id="200d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*LJgjBiTlLRbHkeGZu_SWZg.png"><figcaption></figcaption></figure>For the specific attributes, we can use pairplot to see their relationship:<div id="a94e"><pre>specific_atts=['density','residual sugar','total sulfur dioxide','fixed acidity']

sns.pairplot(wines[specific_atts],diag_kind='kde')</pre></div><figure id="a30f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*i3nZ0iYDlINykMh6LOmbjw.png"><figcaption></figcaption></figure>Or we can use parallel coordinates to display the relationship between categories:<div id="b8d2"><pre>specific_atts=specific_atts+['wine_type']

pd.plotting.parallel_coordinates(wines[specific_atts],'wine_type')</pre></div><figure id="9770"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*raQAGnCA5aiJfKYf0z8ylQ.png"><figcaption></figcaption></figure>For 2 continuous attributes, scatter plot and joint plot are the best options to see the relationship and distributions.<div id="84ff"><pre># Scatter Plot plt.scatter(wines['sulphates'], wines['alcohol'], alpha=0.4, edgecolors='w')

plt.xlabel('Sulphates') plt.ylabel('Alcohol') plt.title('Wine Sulphates - Alcohol Content',y=1.05)

# Joint Plot jp = sns.jointplot(x='sulphates', y='alcohol', data=wines, kind='reg', space=0, size=5, ratio=4)</pre></div><figure id="e75d"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*XBCh5jTJJQkqdGDreGKplg.png"><figcaption></figcaption></figure><figure id="2189"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*hkVKqMc32WtAlqsKfDXBHw.png"><figcaption></figcaption></figure>But for the 2 discrete attributes, we have to choose another type— bar chart:<div id="af9b"><pre># Multi-bar Plot cp = sns.countplot(x="quality", hue="wine_type", data=wines, palette={"red": "#FF9999", "white": "#FFE888"})</pre></div><figure id="8154"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*gQKJoav-RCi1KjEYOf61lA.png"><figcaption></figcaption></figure>For the mixed attributes, multi-histograms are easy to use:<div id="16e5"><pre># Using multiple Histograms g = sns.FacetGrid(wines, hue='wine_type', palette={"red": "r", "white": "y"},size=4) g.map(sns.distplot, 'sulphates', kde=False, bins=15, ax=ax).add_legend(title='Wine Type') g.fig.suptitle("Sulphates Content in Wine", fontsize=14) g.set_axis_labels('Sulphates','Frequency') g.fig.subplots_adjust(top=0.85, wspace=0.3)</pre></div><figure id="9fca"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*qblDVKQ9F7cdtvcqe4AjYg.png"><figcaption></figcaption></figure>Further more, box plots and violin plots are used for see the outliners and kernels respectively:<div id="5c16"><pre># Box Plots f, (ax) = plt.subplots(1, 1, figsize=(12, 4)) f.suptitle('Wine Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", data=wines, ax=ax) ax.set_xlabel("Wine Quality",size = 12,alpha=0.8) ax.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)</pre></div><figure id="9a99"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*nGVIDAh5JRjVunY0_ZiFrA.png"><figcaption></figcaption></figure><div id="f782"><pre># Violin Plots f, (ax) = plt.subplots(1, 1, figsize=(12, 4)) f.suptitle('Wine Quality - Sulphates Content', fontsize=14)

sns.violinplot(x="quality", y="sulphates", data=wines, ax=ax) ax.set_xlabel("Wine Quality",size = 12,alpha=0.8) ax.set_ylabel("Wine Sulphates",size = 12,alpha=0.8)</pre></div><figure id="f473"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*vvat_mGwGH6A4dp1lvbiug.png"><figcaption></figcaption></figure>3. 3-D visualizationThe 3rd dimension usually is the categorical attribute, displaying by different colors like “hue” parameter in Seaborn:<div id="40d9"><pre># Scatter Plot with Hue for visualizing data in 3-D cols = ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity', 'wine_type'] pp = sns.pairplot(wines[cols], hue='wine_type', size=1.8, aspect=1.8, palette={"red": "#FF9999", "white": "#FFE888"}, plot_kws=dict(edgecolor="black", linewidth=0.5)) fig = pp.fig fig.subplots_adjust(top=0.93, wspace=0.3) t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)</pre></div><figure id="ab4a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*L1ysAzHFrExcSzTTBp2RCA.png"><figcaption></figcaption></figure>For 3 continuous attributes, we can use space’s 3-D scatter plot with x, y, z axis:<div id="0f10"><pre># Visualizing 3-D numeric data with Scatter Plots # length, breadth and depth fig = plt.figure(figsize=(8, 6)) ax = fig.add_subplot(111, projection='3d')

xs = wines['residual sugar'] ys = wines['fixed acidity'] zs = wines['alcohol'] ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w')

ax.set_xlabel('Residual Sugar') ax.set_ylabel('Fixed Acidity') ax.set_zlabel('Alcohol')</pre></div><figure id="24b8"><img src="https://cdn-images-1.

Options

readmedium.com/v2/resize:fit:800/1IpDy1H4ISmA49tCcIXoSPw.png"><figcaption></figcaption></figure>Sometimes the xyz style maybe not the best, we can use size as the 3rd dimension with a bubble chart:<div id="67f0"><pre># Visualizing 3-D numeric data with a bubble chart # length, breadth and size plt.scatter(wines['fixed acidity'], wines['alcohol'], s=wines['residual sugar']25, alpha=0.4, edgecolors='w')

plt.xlabel('Fixed Acidity') plt.ylabel('Alcohol') plt.title('Wine Alcohol Content - Fixed Acidity - Residual Sugar',y=1.05)</pre></div><figure id="62a3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*-iTaR3IR2k9jH21WtAozAQ.png"><figcaption></figcaption></figure>For 3 discrete attributes, subplots are the easiest way:<div id="c97f"><pre># Visualizing 3-D categorical data using bar plots # leveraging the concepts of hue and facets fc = sns.factorplot(x="quality", hue="wine_type", col="quality_label", data=wines, kind="count", palette={"red": "#FF9999", "white": "#FFE888"})</pre></div><figure id="2d19"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*1ME4tK8IYZoqChgjsXdNPQ.png"><figcaption></figcaption></figure>For mixed attributes, categorical colors will be the 3rd dimension:<div id="e7ac"><pre># Visualizing 3-D mix data using scatter plots # leveraging the concepts of hue for categorical dimension jp = sns.pairplot(wines, x_vars=["sulphates"], y_vars=["alcohol"], size=4.5, hue="wine_type", palette={"red": "#FF9999", "white": "#FFE888"}, plot_kws=dict(edgecolor="k", linewidth=0.5))

# we can also view relationships\correlations as needed lp = sns.lmplot(x='sulphates', y='alcohol', hue='wine_type', palette={"red": "#FF9999", "white": "#FFE888"}, data=wines, fit_reg=True, legend=True, scatter_kws=dict(edgecolor="k", linewidth=0.5)) </pre></div><figure id="b7ed"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*_cnB1AgWIigH-muafp-mtg.png"><figcaption></figcaption></figure><figure id="bf70"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*5DIPWqRSBCGgIAlcdV0rxw.png"><figcaption></figcaption></figure>Instead, KDE can be used as well:<div id="8567"><pre># Visualizing 3-D mix data using kernel density plots # leveraging the concepts of hue for categorical dimension ax = sns.kdeplot(white_wine['sulphates'], white_wine['alcohol'], cmap="YlOrBr", shade=True, shade_lowest=False) ax = sns.kdeplot(red_wine['sulphates'], red_wine['alcohol'], cmap="Reds", shade=True, shade_lowest=False)</pre></div><figure id="b1d3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*h92QKia71F7Y-GdcHyIYyA.png"><figcaption></figcaption></figure>From the above, we can see red wines have more sulfate than white wines.Also, box plots and violin plots sometimes are used in 3-D subplots:<div id="c3a3"><pre># Visualizing 3-D mix data using violin plots # leveraging the concepts of hue and axes for > 1 categorical dimensions f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4)) f.suptitle('Wine Type - Quality - Acidity', fontsize=14)

sns.violinplot(x="quality", y="volatile acidity", data=wines, inner="quart", linewidth=1.3,ax=ax1) ax1.set_xlabel("Wine Quality",size = 12,alpha=0.8) ax1.set_ylabel("Wine Volatile Acidity",size = 12,alpha=0.8)

sns.violinplot(x="quality", y="volatile acidity", hue="wine_type", data=wines, split=True, inner="quart", linewidth=1.3, palette={"red": "#FF9999", "white": "white"}, ax=ax2) ax2.set_xlabel("Wine Quality",size = 12,alpha=0.8) ax2.set_ylabel("Wine Volatile Acidity",size = 12,alpha=0.8) l = plt.legend(loc='upper right', title='Wine Type') </pre></div><figure id="564f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*mqEb47KCYi9oAG0LPojLuQ.png"><figcaption></figcaption></figure><div id="0b84"><pre># Visualizing 3-D mix data using box plots # leveraging the concepts of hue and axes for > 1 categorical dimensions f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4)) f.suptitle('Wine Type - Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", hue="wine_type", data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax1) ax1.set_xlabel("Wine Quality",size = 12,alpha=0.8) ax1.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)

sns.boxplot(x="quality_label", y="alcohol", hue="wine_type", data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax2) ax2.set_xlabel("Wine Quality Class",size = 12,alpha=0.8) ax2.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8) l = plt.legend(loc='best', title='Wine Type') </pre></div><figure id="4c07"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*9T9A77F2OXGBHQVFDHLpMA.png"><figcaption></figcaption></figure>To be continued…Thank you for reading.</article></body>

Multi-Dimension Visualization in Python Part I

Data visualization helps us to explore and review data easily. It is much better than a table or a long long paper. Most of time the visualization work is limited to 2-D. In this article, I am going to introduce how to plot and explore multi-dimension data (1-D to 6-D).

Firstly import the libraries we need:

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib as mpl
import numpy as np
import seaborn as sns
%matplotlib inline

Here we use winequanlity dataset as an example: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

Before exploring the data, we do some manipulation:

# store wine type as an attribute
red_wine['wine_type'] = 'red'   
white_wine['wine_type'] = 'white'

# bucket wine quality scores into qualitative quality labels
red_wine['quality_label'] = red_wine['quality'].apply(lambda value: 'low' 
                                                          if value <= 5 else 'medium' 
                                                              if value <= 7 else 'high')
red_wine['quality_label'] = pd.Categorical(red_wine['quality_label'], 
                                           categories=['low', 'medium', 'high'])
white_wine['quality_label'] = white_wine['quality'].apply(lambda value: 'low' 
                                                              if value <= 5 else 'medium' 
                                                                  if value <= 7 else 'high')
white_wine['quality_label'] = pd.Categorical(white_wine['quality_label'], 
                                             categories=['low', 'medium', 'high'])

# union red and white wine datasets
wines = pd.concat([red_wine, white_wine])

# re-shuffle records just to randomize data points
wines = wines.sample(frac=1, random_state=42).reset_index(drop=True)

We concatenate both red and white wine to create a single wine dataset, and create a feature as “quality_label”.

1-D visualization

To display the distribution for all the attributes, Histogram is the best choice:

wines.hist(bins=15, color='steelblue', edgecolor='black', linewidth=1.0,
           xlabelsize=8, ylabelsize=8, grid=False)    
plt.tight_layout(rect=(0, 0, 1.2, 1.2))

It is a good idea for exploring all the attributes. Further more, let’s see another option for a continuous attribute:

# Histogram
fig = plt.figure(figsize = (6,4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax = fig.add_subplot(1,1, 1)
ax.set_xlabel("Sulphates")
ax.set_ylabel("Frequency") 
ax.text(1.2, 800, r'$\mu$='+str(round(wines['sulphates'].mean(),2)), 
         fontsize=12)
freq, bins, patches = ax.hist(wines['sulphates'], color='steelblue', bins=15,
                                    edgecolor='black', linewidth=1)


# Density Plot
fig = plt.figure(figsize = (6, 4))
title = fig.suptitle("Sulphates Content in Wine", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,1, 1)
ax1.set_xlabel("Sulphates")
ax1.set_ylabel("Frequency") 
sns.kdeplot(wines['sulphates'], ax=ax1, shade=True, color='steelblue')

We can see that the distribution of sulphates is right skew.

For a discrete attribute, a Histogram usually is the best one. Sometimes we can use a pie chart:

fig = plt.figure(figsize = (6, 4))
title = fig.suptitle("Wine Quality Frequency", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,1, 1)
ax1.set_xlabel("Quality")
ax1.set_ylabel("Frequency") 
ax1.pie(wines.groupby('quality').count()['quality_label'],labels=wines.groupby('quality').count().index)

2. 2-D visualization

2-D visualization is the most commonly used method.

Usually we use heatmap to display the correlations between attributes:

# Correlation Matrix Heatmap
f, ax = plt.subplots(figsize=(10, 6))
corr = wines.corr()
hm = sns.heatmap(round(corr,2), annot=True, ax=ax, cmap="coolwarm",fmt='.2f',
                 linewidths=.05)
f.subplots_adjust(top=0.93)
t= f.suptitle('Wine Attributes Correlation Heatmap', fontsize=14)

For the specific attributes, we can use pairplot to see their relationship:

specific_atts=['density','residual sugar','total sulfur dioxide','fixed acidity']

sns.pairplot(wines[specific_atts],diag_kind='kde')

Or we can use parallel coordinates to display the relationship between categories:

specific_atts=specific_atts+['wine_type']

pd.plotting.parallel_coordinates(wines[specific_atts],'wine_type')

For 2 continuous attributes, scatter plot and joint plot are the best options to see the relationship and distributions.

# Scatter Plot
plt.scatter(wines['sulphates'], wines['alcohol'],
            alpha=0.4, edgecolors='w')

plt.xlabel('Sulphates')
plt.ylabel('Alcohol')
plt.title('Wine Sulphates - Alcohol Content',y=1.05)


# Joint Plot
jp = sns.jointplot(x='sulphates', y='alcohol', data=wines,
                   kind='reg', space=0, size=5, ratio=4)

But for the 2 discrete attributes, we have to choose another type— bar chart:

# Multi-bar Plot
cp = sns.countplot(x="quality", hue="wine_type", data=wines, 
                   palette={"red": "#FF9999", "white": "#FFE888"})

For the mixed attributes, multi-histograms are easy to use:

# Using multiple Histograms
g = sns.FacetGrid(wines, hue='wine_type', palette={"red": "r", "white": "y"},size=4)
g.map(sns.distplot, 'sulphates', kde=False, bins=15, ax=ax).add_legend(title='Wine Type')
g.fig.suptitle("Sulphates Content in Wine", fontsize=14)
g.set_axis_labels('Sulphates','Frequency')
g.fig.subplots_adjust(top=0.85, wspace=0.3)

Further more, box plots and violin plots are used for see the outliners and kernels respectively:

# Box Plots
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Wine Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", data=wines,  ax=ax)
ax.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)

# Violin Plots
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Wine Quality - Sulphates Content', fontsize=14)

sns.violinplot(x="quality", y="sulphates", data=wines,  ax=ax)
ax.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax.set_ylabel("Wine Sulphates",size = 12,alpha=0.8)

3. 3-D visualization

The 3rd dimension usually is the categorical attribute, displaying by different colors like “hue” parameter in Seaborn:

# Scatter Plot with Hue for visualizing data in 3-D
cols = ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity', 'wine_type']
pp = sns.pairplot(wines[cols], hue='wine_type', size=1.8, aspect=1.8, 
                  palette={"red": "#FF9999", "white": "#FFE888"},
                  plot_kws=dict(edgecolor="black", linewidth=0.5))
fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

For 3 continuous attributes, we can use space’s 3-D scatter plot with x, y, z axis:

# Visualizing 3-D numeric data with Scatter Plots
# length, breadth and depth
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')

xs = wines['residual sugar']
ys = wines['fixed acidity']
zs = wines['alcohol']
ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w')

ax.set_xlabel('Residual Sugar')
ax.set_ylabel('Fixed Acidity')
ax.set_zlabel('Alcohol')

Sometimes the xyz style maybe not the best, we can use size as the 3rd dimension with a bubble chart:

# Visualizing 3-D numeric data with a bubble chart
# length, breadth and size
plt.scatter(wines['fixed acidity'], wines['alcohol'], s=wines['residual sugar']*25, 
            alpha=0.4, edgecolors='w')

plt.xlabel('Fixed Acidity')
plt.ylabel('Alcohol')
plt.title('Wine Alcohol Content - Fixed Acidity - Residual Sugar',y=1.05)

For 3 discrete attributes, subplots are the easiest way:

# Visualizing 3-D categorical data using bar plots
# leveraging the concepts of hue and facets
fc = sns.factorplot(x="quality", hue="wine_type", col="quality_label", 
                    data=wines, kind="count",
                    palette={"red": "#FF9999", "white": "#FFE888"})

For mixed attributes, categorical colors will be the 3rd dimension:

# Visualizing 3-D mix data using scatter plots
# leveraging the concepts of hue for categorical dimension
jp = sns.pairplot(wines, x_vars=["sulphates"], y_vars=["alcohol"], size=4.5,
                  hue="wine_type", palette={"red": "#FF9999", "white": "#FFE888"},
                  plot_kws=dict(edgecolor="k", linewidth=0.5))

# we can also view relationships\correlations as needed                  
lp = sns.lmplot(x='sulphates', y='alcohol', hue='wine_type', 
                palette={"red": "#FF9999", "white": "#FFE888"},
                data=wines, fit_reg=True, legend=True,
                scatter_kws=dict(edgecolor="k", linewidth=0.5))

Instead, KDE can be used as well:

# Visualizing 3-D mix data using kernel density plots
# leveraging the concepts of hue for categorical dimension
ax = sns.kdeplot(white_wine['sulphates'], white_wine['alcohol'],
                  cmap="YlOrBr", shade=True, shade_lowest=False)
ax = sns.kdeplot(red_wine['sulphates'], red_wine['alcohol'],
                  cmap="Reds", shade=True, shade_lowest=False)

From the above, we can see red wines have more sulfate than white wines.

Also, box plots and violin plots sometimes are used in 3-D subplots:

# Visualizing 3-D mix data using violin plots
# leveraging the concepts of hue and axes for > 1 categorical dimensions
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Acidity', fontsize=14)

sns.violinplot(x="quality", y="volatile acidity",
               data=wines, inner="quart", linewidth=1.3,ax=ax1)
ax1.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax1.set_ylabel("Wine Volatile Acidity",size = 12,alpha=0.8)

sns.violinplot(x="quality", y="volatile acidity", hue="wine_type", 
               data=wines, split=True, inner="quart", linewidth=1.3,
               palette={"red": "#FF9999", "white": "white"}, ax=ax2)
ax2.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax2.set_ylabel("Wine Volatile Acidity",size = 12,alpha=0.8)
l = plt.legend(loc='upper right', title='Wine Type')

# Visualizing 3-D mix data using box plots
# leveraging the concepts of hue and axes for > 1 categorical dimensions
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", hue="wine_type",
               data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax1)
ax1.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax1.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)

sns.boxplot(x="quality_label", y="alcohol", hue="wine_type",
               data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax2)
ax2.set_xlabel("Wine Quality Class",size = 12,alpha=0.8)
ax2.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)
l = plt.legend(loc='best', title='Wine Type')

To be continued…

Thank you for reading.