This article is from Towards Data Science, written by George Seif. Data visualization is an important part of a data scientist's work. In the early stages of a project, exploratory data analysis is often needed to gain insight into the data, and visualization makes that process much easier, especially when dealing with large, high-dimensional data sets. This article introduces five basic types of data visualization charts, discusses their pros and cons, and provides Matplotlib code to draw each one.
Matplotlib is a popular Python library that makes it quick and easy to build data visualizations. However, setting up the data, parameters, figures, and plot styles from scratch every time you start a new project can be tedious. This article therefore walks through five ways of visualizing data and wraps each one in a quick, reusable Python/Matplotlib function. The figure below is a guide for choosing the right visualization method.
Scatter plot
Because you can see the raw data distribution directly, scatter plots are useful for showing the relationship between two variables. You can also see the relationship between different groups of data by coloring each group differently, as shown below. A third variable can be encoded with another parameter, such as the radius of each point, to visualize the relationship between three variables, as shown in the second figure below.
Now for the code. We first import Matplotlib's pyplot as plt and call plt.subplots() to create a new figure and axes. We pass the x and y data to ax.scatter() to draw the scatter plot. We can also set the point radius, point color, and alpha transparency, and even put the y-axis on a logarithmic scale; finally, we specify a title and axis labels for the chart.
import matplotlib.pyplot as plt
import numpy as np
def scatterplot(x_data, y_data, x_label="", y_label="", title="", color = "r", yscale_log=False):
    # Create the plot object
    _, ax = plt.subplots()
    # Plot the data, set the size (s), color and transparency (alpha)
    # of the points
    ax.scatter(x_data, y_data, s = 10, color = color, alpha = 0.75)
    if yscale_log == True:
        ax.set_yscale('log')
    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
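As a quick sketch of how the function might be called, here is an illustrative example with synthetic data (the arrays and labels are made up, not from the original article):
# Illustrative usage with synthetic data
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)
scatterplot(x, y, x_label="x", y_label="y", title="Scatter plot example", color="b")
plt.show()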
Line plot
Line plots are best used when one variable varies greatly with another, that is, when the two have high covariance. In the chart below, for example, we can clearly see that the relative percentages for the different majors vary substantially over time. Plotting the same data as a scatter plot would be too cluttered to reveal its structure; a line plot is ideal here because it quickly summarizes how the two variables co-vary. Here, too, we can group the data by color.
The implementation code for the line plot follows. Its structure is similar to the scatter plot code, with only a few changes to the variable settings.
def lineplot(x_data, y_data, x_label="", y_label="", title=""):
    # Create the plot object
    _, ax = plt.subplots()
    # Plot the best fit line, set the linewidth (lw), color and
    # transparency (alpha) of the line
    ax.plot(x_data, y_data, lw = 2, color = '#539caf', alpha = 1)
    # Label the axes and provide a title
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
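A minimal usage sketch, again with made-up data:
# Illustrative usage with made-up data
x = np.arange(0, 10, 0.1)
y = np.sin(x)
lineplot(x, y, x_label="time", y_label="signal", title="Line plot example")
plt.show()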
Histogram
Histograms are useful for viewing, or really discovering, the distribution of data points. Below is a histogram of frequency versus IQ. From it we can intuitively grasp the concentration (variance) and the median of the distribution, and see that its shape roughly follows a Gaussian distribution. Using bars (as opposed to, say, a scatter plot) makes it easy to visualize how the frequency changes between bins (equal-width intervals on the x-axis). Binning (discretization) genuinely helps us see the bigger picture: plotting every data point without discretization would not reveal the approximate distribution, and the visualization would contain so much noise that it would only hint at, rather than describe, the true distribution.
The code to draw a histogram in Matplotlib is shown below. There are two parameters to note. First, n_bins controls the number of bins, i.e. the degree of discretization. More bins give us finer-grained information but also introduce noise and make the overall shape look less regular, while fewer bins give a more global view of the distribution's shape without the details. Second, cumulative is a Boolean that lets us choose whether the histogram is cumulative, which effectively amounts to choosing between the probability density function (PDF) and the cumulative density function (CDF).
def histogram(data, n_bins, cumulative=False, x_label = "", y_label = "", title = ""):
    _, ax = plt.subplots()
    ax.hist(data, bins = n_bins, cumulative = cumulative, color = '#539caf')
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
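For example, the function could be called as follows (the normal sample is illustrative only); setting cumulative=True in the second call switches the same data from a PDF-style to a CDF-style view:
# Illustrative usage: 1000 samples from a standard normal distribution
data = np.random.normal(0, 1, 1000)
histogram(data, n_bins=30, x_label="value", y_label="frequency", title="Histogram example")
histogram(data, n_bins=30, cumulative=True, x_label="value", y_label="cumulative frequency", title="CDF example")
plt.show()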
If we want to compare the distribution of two variables in the data, one might think we need two separate histograms placed side by side. But Matplotlib has a better way: we can overlay multiple histograms with different transparencies. As shown in the figure below, the uniform distribution is given an opacity of 0.5 so that it can be overlaid on the Gaussian distribution, letting the user plot and compare the two distributions on a single chart.
There are a few things to keep in mind in the overlaid-histogram code. First, the bin bounds we set should cover the ranges of both variables at the same time; from this overall range and the number of bins we can compute the width of each bin. Second, when drawing the two histograms on one chart, we need to make sure one of them is more transparent.
# Overlay 2 histograms to compare them
def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
    # Set the bounds for the bins so that the two distributions are fairly compared
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
    binwidth = (data_range[1] - data_range[0]) / max_nbins

    if n_bins == 0:
        bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
    else:
        bins = n_bins

    # Create the plot
    _, ax = plt.subplots()
    ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
    ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'best')
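A sketch of how the function might be called to reproduce the Gaussian-versus-uniform comparison described above (the samples are made up); leaving n_bins at its default of 0 lets the function compute shared bins from the combined data range:
# Illustrative usage: compare a Gaussian sample with a uniform sample
gaussian = np.random.normal(0, 1, 1000)
uniform = np.random.uniform(-3, 3, 1000)
overlaid_histogram(gaussian, uniform, data1_name="Gaussian", data2_name="Uniform",
                   x_label="value", y_label="frequency", title="Overlaid histograms")
plt.show()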
Bar chart
Bar charts are most effective when visualizing categorical data with a small number of categories (fewer than 10). With too many categories, the chart becomes cluttered and hard to read. Bar charts make it easy to see differences between categories based on the size of the bars, and categories can easily be separated and grouped by color. We will cover three types of bar chart: regular, grouped, and stacked.
A regular bar chart is shown in the first figure below. In the barplot() function, x_data represents the categories on the x-axis and y_data represents the bar heights on the y-axis. The error bars are extra lines centered on each bar that can be used to show the standard deviation.
Grouped bar charts let us compare multiple categorical variables. As shown in the figure below, the first variable changes across the groups (G1, G2, and so on), and within each group we compare the different genders. As the code shows, the y_data_list variable is actually a list of lists, with each sublist representing a different group. We then loop over the groups and, for each one, plot its bars at the corresponding positions on the x-axis, drawing each category within a group in a different color.
Stacked bar charts are great for visualizing categorical data that breaks down into subcategories. In the stacked bar chart below, we compare server load from day to day. By stacking differently colored blocks within the same bar, we can easily see and compare which server does the most work each day and how the load of a single server varies across days. The code follows the same pattern as the grouped bar chart: we iterate over the groups, but this time we draw each new set of bars on top of the previous one rather than beside it.
def barplot(x_data, y_data, error_data, x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Draw bars, position them in the center of the tick mark on the x-axis
    ax.bar(x_data, y_data, color = '#539caf', align = 'center')
    # Draw error bars to show standard deviation, set ls to 'none'
    # to remove line between points
    ax.errorbar(x_data, y_data, yerr = error_data, color = '#297083', ls = 'none', lw = 2, capthick = 2)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)

def stackedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        if i == 0:
            ax.bar(x_data, y_data_list[i], color = colors[i], align = 'center', label = y_data_names[i])
        else:
            # For each category after the first, the bottom of the
            # bar will be the top of the last category
            ax.bar(x_data, y_data_list[i], color = colors[i], bottom = y_data_list[i - 1], align = 'center', label = y_data_names[i])
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')

def groupedbarplot(x_data, y_data_list, colors, y_data_names="", x_label="", y_label="", title=""):
    _, ax = plt.subplots()
    # Total width for all bars at one x location
    total_width = 0.8
    # Width of each individual bar
    ind_width = total_width / len(y_data_list)
    # This centers each cluster of bars about the x tick mark
    alteration = np.arange(-(total_width/2), total_width/2, ind_width)
    # Draw bars, one category at a time
    for i in range(0, len(y_data_list)):
        # Move the bar to the right on the x-axis so it doesn't
        # overlap with previously drawn ones
        ax.bar(x_data + alteration[i], y_data_list[i], color = colors[i], label = y_data_names[i], width = ind_width)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'upper right')
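A minimal sketch of calling the three functions with made-up data. Note that groupedbarplot() shifts bar positions numerically (x_data + alteration[i]), so x_data should be a numeric array such as np.arange; category names can then be applied as tick labels if desired:
# Illustrative usage with made-up data
x = np.arange(3)
barplot(x, [5, 3, 4], error_data=[0.5, 0.3, 0.4],
        x_label="group", y_label="value", title="Bar plot example")
stackedbarplot(x, [[2, 3, 1], [1, 2, 2]], colors=['#539caf', '#7663b0'],
               y_data_names=["server A", "server B"],
               x_label="day", y_label="load", title="Stacked bar example")
groupedbarplot(x, [[2, 3, 1], [1, 2, 2]], colors=['#539caf', '#7663b0'],
               y_data_names=["men", "women"],
               x_label="group", y_label="value", title="Grouped bar example")
plt.show()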
Box plot
The histograms above are useful for visualizing the distribution of a variable, but what if we need more information? We may want to see the standard deviation clearly, or the median and mean may differ considerably (suggesting many outliers), or the data may be heavily skewed. In these cases, more detailed information is required.
The box plot gives us all of the information we need. The bottom and top of the solid box mark the first and third quartiles, and the line inside the box marks the second quartile (the median). The whiskers (the dashed lines) show the spread of the rest of the data.
Because a box plot visualizes one variable per box, its setup is simple. x_data is a list of the group names (labels). The Matplotlib function boxplot() draws one box for each column of y_data, or for each vector in the y_data sequence, so each value in x_data corresponds to one column/vector in y_data.
def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Draw boxplots, specifying desired style
    ax.boxplot(y_data
               # patch_artist must be True to control box fill
               , patch_artist = True
               # Properties of median line
               , medianprops = {'color': median_color}
               # Properties of box
               , boxprops = {'color': base_color, 'facecolor': base_color}
               # Properties of whiskers
               , whiskerprops = {'color': base_color}
               # Properties of whisker caps
               , capprops = {'color': base_color})

    # By default, the tick label starts at 1 and increments by 1 for
    # each box drawn. This sets the labels to the ones we want
    ax.set_xticklabels(x_data)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
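For instance, the function could be called with a few synthetic groups (illustrative data only):
# Illustrative usage: three synthetic groups
groups = ["A", "B", "C"]
samples = [np.random.normal(0, 1, 100), np.random.normal(1, 2, 100), np.random.normal(-1, 0.5, 100)]
boxplot(groups, samples, x_label="group", y_label="value", title="Box plot example")
plt.show()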
Conclusion
This article introduced five easy-to-use Matplotlib data visualization methods. Abstracting the plotting steps into functions keeps the code easy to read and reuse. Hope you enjoyed it!
Original address: