This article was originally written by Bao Kazuo, data analyst at Kesci.
Seaborn is a great visualization library, especially for data with many dimensions. It lets us draw descriptive statistical graphs with minimal code and explore the relationships between variables.
Following on from the previous article, Python Visualization: Seaborn (I), which used Seaborn for distribution visualization, this article covers categorical visualization with Seaborn, including Stripplot and Swarmplot, Boxplot and Violinplot, Barplot and Pointplot, and the more abstract Factorplot.
Here we use the Iris data set, publicly available on Kesci, for demonstration.
The complete source code is available on K-Lab, an online data analysis collaboration tool, where it can be reproduced.
K-Lab covers mainstream languages such as Python and R and has deployed more than 90% of data analysis and mining libraries, including Seaborn, Pandas, and NumPy, to help data professionals focus on data analysis and improve their efficiency.
Iris data set: a commonly used data set for classification experiments, compiled by Fisher in 1936 and a classic example for multivariate analysis. It contains 150 samples in 3 classes of 50 samples each, and every sample has 4 attributes: sepal length (sepal_length), sepal width (sepal_width), petal length (petal_length), and petal width (petal_width). These four attributes can be used to predict which of the three species (Setosa, Versicolour, Virginica) an iris flower belongs to.
Import libraries
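import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Assumption: the original loading step is not shown. Seaborn's example copy
# of the Iris data uses the same column names as the code below.
iris = sns.load_dataset("iris")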
Stripplot
A Stripplot is essentially a scatter plot of a quantitative variable, drawn separately for each category in the data set.
Here we visualize the sepal length of each flower species in the Iris data set with a Stripplot.
plt.figure(1, figsize=(12, 6))
plt.subplot(1, 2, 1)
# stripplot without jitter (recent Seaborn releases jitter by default)
sns.stripplot(x='species', y='sepal_length', data=iris, jitter=False)
plt.title('Stripplot of sepal length of Iris species')
with sns.axes_style("whitegrid"):  # temporary style setting; without it the default style is used
    plt.subplot(1, 2, 2)
    plt.title('Stripplot of sepal length of Iris species')
    sns.stripplot(x='species', y='sepal_length', data=iris, jitter=True)  # jitterplot
plt.show()
The left panel above is a scatter plot drawn with Stripplot and no jitter. In many cases the points in a Stripplot overlap, making it difficult to see how they are distributed. A simple solution is to draw a Jitterplot on top of the Stripplot, which reveals the distribution by randomly shifting the data points slightly along the category axis.
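The jittering idea can be sketched without any plotting library. The following is a minimal illustration, not Seaborn's implementation: the helper name `jitter_positions`, its `width` parameter, and the sample values are assumptions made for this example. Each point keeps its measured value, and only its position along the category axis is shifted by a small uniform random offset.

```python
import random


def jitter_positions(category_index, n_points, width=0.1, seed=0):
    """Return x positions spread randomly around one category's axis position.

    Hypothetical helper for illustration; Seaborn's own jitter logic may differ.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [category_index + rng.uniform(-width, width) for _ in range(n_points)]


# Illustrative values: five observations that would overlap at x = 0
sepal_lengths = [5.0, 5.0, 5.0, 5.1, 5.1]
xs = jitter_positions(category_index=0, n_points=len(sepal_lengths))

for x, y in zip(xs, sepal_lengths):
    print(f"x={x:+.3f}  y={y}")
```

The measured values on the quantitative axis are untouched, so the jitter changes only how readable the plot is, not what it reports.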
Swarmplot
Another way to solve the problem of overlapping points in a Stripplot is to draw a Swarmplot, which uses an algorithm to spread the overlapping points out along the category axis. Here we visualize the petal length and petal width of each flower species in the Iris data set with a Swarmplot.
plt.figure(1, figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.swarmplot(x='species', y='petal_length', data=iris)
with sns.axes_style("ticks"):
    plt.subplot(1, 2, 2)
    sns.swarmplot(x='species', y='petal_width', data=iris)
plt.show()
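To make the idea of spreading overlapping points along the category axis concrete, here is a deliberately simplified sketch. It is not Seaborn's swarm algorithm: the function names, the `diameter` parameter, and the sample values are assumptions for illustration, and both axes are treated as if they shared one unit. Each point tries the offsets 0, +d, -d, +2d, -2d, and so on, and takes the first one where it does not collide with a point that is already placed.

```python
def is_free(x, v, placed, diameter):
    """True if a point at (x, v) overlaps none of the already placed points."""
    return all((x - px) ** 2 + (v - pv) ** 2 >= diameter ** 2 for px, pv in placed)


def simple_swarm(values, diameter=0.25):
    """Return (offset, value) pairs with no two points closer than `diameter`.

    Simplified illustration only; the value coordinate is never changed.
    """
    placed = []
    for v in sorted(values):
        step = 0
        while True:
            # Candidate offsets at this distance from the category axis
            candidates = [0.0] if step == 0 else [step * diameter, -step * diameter]
            free = [x for x in candidates if is_free(x, v, placed, diameter)]
            if free:
                placed.append((free[0], v))
                break
            step += 1
    return placed


# Illustrative values that would overlap in a plain strip plot
points = simple_swarm([1.4, 1.4, 1.4, 1.5, 1.5])
for offset, value in points:
    print(f"offset={offset:+.2f}  value={value}")
```

Unlike jitter, the offsets here are deterministic, and points move away from the axis only as far as they need to.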
Boxplot
A box plot sorts a set of data and summarizes it with six kinds of data nodes: the upper edge, the upper quartile Q3, the median, the lower quartile Q1, the lower edge, and any outliers. Below, the four variables sepal_length, sepal_width, petal_length, and petal_width in the Iris data set are visualized as box plots.
var = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
axes_style = ['ticks', 'white', 'whitegrid', 'dark']
fig = plt.figure(1, figsize=(12, 12))
for i in range(4):
    with sns.axes_style(axes_style[i]):  # use a different axes style for each subplot
        plt.subplot(2, 2, i + 1)
        sns.boxplot(x='species', y=var[i], data=iris)
plt.show()
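The six data nodes of a box plot can be computed by hand. The sketch below uses only the standard library and makes two assumptions that the article does not state: the edges follow the common 1.5 x IQR rule, and quartiles use the "inclusive" interpolation method of `statistics.quantiles`. The function name `box_summary` and the sample values are made up for illustration, and a plotting library may use a slightly different quartile convention.

```python
import statistics


def box_summary(values, whis=1.5):
    """Compute the edges, quartiles, median, and outliers of a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    # The edges are the most extreme observations that are still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "lower_edge": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_edge": inside[-1],
        "outliers": outliers,
    }


# Made-up sample with one obvious outlier
summary = box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the value 30 lies beyond Q3 + 1.5 x IQR, so it is reported as an outlier and the upper edge stops at 9.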
Violinplot
Violinplot combines a box plot with a kernel density plot to better show the shape of the data's distribution.
context = ['notebook', 'paper', 'talk', 'poster']
axes_style = ['ticks', 'white', 'whitegrid', 'dark']
plt.figure(1, figsize=(12, 12))
for i in range(4):
    with sns.axes_style(axes_style[i]):  # set axes_style
        sns.set_context(context[i])  # set the context style (default: notebook)
        plt.subplot(2, 2, i + 1)
        plt.title(str(var[i]) + ' in Iris species')
        sns.violinplot(x='species', y=var[i], data=iris)
plt.show()
Violinplot uses a kernel density estimate to better describe the distribution of a quantitative variable.
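A kernel density estimate can be sketched in a few lines: one small Gaussian bump is centered on each observation, and the bumps are averaged. This is a minimal illustration using the standard library only. The function name `gaussian_kde`, the hand-picked `bandwidth`, and the sample values are assumptions for this example, and plotting libraries typically choose the bandwidth with a rule of thumb instead.

```python
from statistics import NormalDist


def gaussian_kde(data, bandwidth):
    """Return a density function built from one Gaussian kernel per observation."""
    kernels = [NormalDist(mu=x, sigma=bandwidth) for x in data]

    def density(x):
        # Average of the kernel densities evaluated at x
        return sum(k.pdf(x) for k in kernels) / len(kernels)

    return density


# Illustrative petal-length-like values
petal_lengths = [1.3, 1.4, 1.4, 1.5, 1.7]
density = gaussian_kde(petal_lengths, bandwidth=0.1)

for i in range(11):
    x = 1.0 + 0.1 * i
    print(f"x={x:.1f}  density={density(x):.3f}")
```

A violin plot draws this kind of curve mirrored on both sides of the category axis, so wider regions mark values where more observations cluster.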
Swarmplot can also be combined with Boxplot or Violinplot to describe quantitative variables. Using the Iris data set:
context = ['notebook', 'paper', 'talk', 'poster']
axes_style = ['ticks', 'white', 'whitegrid', 'dark']
plt.figure(1, figsize=(12, 12))
for i in range(4):
    with sns.axes_style(axes_style[i]):  # set axes_style
        sns.set_context(context[i])  # set context
        plt.subplot(2, 2, i + 1)
        plt.title(str(var[i]) + ' in Iris species')
        sns.swarmplot(x='species', y=var[i], data=iris, color="w", alpha=.5)
        # swarmplot + violinplot on even panels, swarmplot + boxplot on odd panels
        if i % 2 == 0:
            sns.violinplot(x='species', y=var[i], data=iris, inner=None)
        else:
            sns.boxplot(x='species', y=var[i], data=iris)
plt.show()
Barplot
Barplot shows the mean of a quantitative variable within each category, and uses bootstrapping to compute a confidence interval for that estimate, drawn as an error bar. Using the Iris data set:
plt.figure(1, figsize=(12, 12))
for i in range(4):
    with sns.axes_style(axes_style[i]):  # set axes_style
        plt.subplot(2, 2, i + 1)
        sns.set_context(context[i])  # set context style (default: notebook)
        plt.title(str(var[i]) + ' in Iris species')
        sns.barplot(x='species', y=var[i], data=iris)
plt.show()
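The bootstrapped confidence interval behind each error bar can be sketched as follows. This is a simple percentile bootstrap written with the standard library, not Seaborn's code. The function name `bootstrap_ci`, the parameter names `n_boot` and `ci`, and the sample values are assumptions for illustration. The sample is resampled with replacement many times, the mean of each resample is recorded, and the interval is read off the sorted means.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Mean of each resample drawn with replacement from the original sample
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    tail = (100 - ci) / 200  # fraction cut off at each end
    low = means[int(tail * n_boot)]
    high = means[int((1 - tail) * n_boot) - 1]
    return low, high


# Illustrative sample of sepal-length-like values
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
mean = statistics.fmean(sample)
low, high = bootstrap_ci(sample)
print(f"mean={mean:.3f}  95% CI=({low:.3f}, {high:.3f})")
```

The bar height corresponds to `mean`, and the error bar spans from `low` to `high`. A larger sample generally gives a narrower interval.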
Countplot
If you want to know how many observations fall under each category, you can use Countplot, which simply counts the observations, as shown for the Iris data set below:
plt.figure(figsize=(5, 5))
sns.countplot(y="species", data=iris)  # setting y='species' draws the bars horizontally
plt.title('Iris species count')
plt.show()
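The numbers behind a count plot are plain category counts. As a minimal sketch, the list below rebuilds the species labels from the description of the data set earlier in the article (three species with 50 samples each) and tallies them with the standard library. The exact label spelling is an assumption here and may differ in a given copy of the data.

```python
from collections import Counter

# One label per sample: 3 species with 50 samples each, as described in the article
species = ["Setosa"] * 50 + ["Versicolour"] * 50 + ["Virginica"] * 50

counts = Counter(species)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Each bar in the count plot has a length equal to one of these counts.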
Pointplot
Pointplot is an extension of Barplot. Like Barplot, it presents a point estimate together with its confidence interval. In addition, when each major category is subdivided into sub-categories, Pointplot makes it easy to see how the sub-categories compare across the major categories. For the Iris data set it looks as follows:
plt.figure(1, figsize=(12, 12))
for i in range(4):
    with sns.axes_style(axes_style[i]):  # set axes_style
        plt.subplot(2, 2, i + 1)
        sns.set_context(context[i])  # set context style (default: notebook)
        plt.title(str(var[i]) + ' in Iris species')
        sns.pointplot(x='species', y=var[i], data=iris)
plt.show()
Factorplot
Factorplot (renamed catplot in Seaborn 0.9) can be regarded as the essence of categorical visualization in Seaborn: all the plots described above can be seen as specific cases of Factorplot. We can also use PairGrid to visualize the numerical features of multiple categories with the same kind of plot.
sns.set(style="ticks")
g = sns.PairGrid(iris,
                 x_vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
                 y_vars=['species'],
                 aspect=0.75, height=4)  # set aspect ratio and image size (height is called size in older Seaborn releases)
g.map(sns.violinplot, palette='pastel')
plt.show()
Kesci.com is an online community for data professionals and industry problems. Its K-Lab online data analysis and collaboration platform focuses on creating a new study and work experience for data workers.