Author | Soner Yıldırım    Compiled by | VK    Source | Towards Data Science

Exploratory data analysis (EDA) is an important part of the data science or machine learning pipeline. To use data to create a robust and valuable product, you need to study the data, understand the relationships between variables, and understand the underlying structure of the data. Data visualization is one of the most effective tools in EDA.

In this article, we will use visualization to explore a customer churn dataset: www.kaggle.com/sonalidasgu…

We will create several different visualizations and, with each one, introduce a feature of the Matplotlib or Seaborn library.

We first import the libraries and read the dataset into a pandas DataFrame.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
%matplotlib inline

df = pd.read_csv("/content/Churn_Modelling.csv")

df.head()

This dataset contains 10,000 customers (i.e., rows) and 14 features describing the bank’s customers and their products. The goal is to use these features to predict customer churn (i.e., Exited = 1).
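A quick numeric check of the size of the data and the share of churned customers can complement df.head(). The sketch below uses a tiny hand-made frame as a stand-in for the real file (the values are invented for illustration); only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the churn data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [34, 52, 41, 58, 29],
    "Balance": [0.0, 120000.0, 83000.0, 99000.0, 0.0],
    "Exited": [0, 1, 0, 1, 0],
})

n_rows, n_cols = toy.shape          # (number of rows, number of columns)
churn_rate = toy["Exited"].mean()   # mean of a 0/1 column = share of churners

print(n_rows, n_cols, churn_rate)
```

Running the same two lines on df instead of toy reports the figures for the real dataset.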

Let’s start with catplot, Seaborn’s categorical plot.

sns.catplot(x='Gender', y='Age', data=df, hue='Exited', height=8, aspect=1.2)

Finding: customers aged 45 to 60 are more likely to churn (i.e., leave the bank) than customers in other age groups. There is not much difference between women and men.

The hue parameter colors the data points according to a categorical variable, here Exited.
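The age-group observation from the plot can also be checked numerically by binning ages and averaging the churn flag per bin. This is only a sketch: the data below is invented, and the bin edges are an assumption chosen to mirror the 45 to 60 range mentioned above.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "Age":    [25, 31, 38, 47, 52, 58, 63, 70],
    "Exited": [0,  0,  1,  1,  1,  0,  0,  0],
})

# Bin ages into bands; right=False makes each band include its lower edge.
bands = pd.cut(toy["Age"], bins=[18, 30, 45, 60, 100], right=False)

# The mean of the 0/1 Exited column per band is the churn rate in that band.
churn_by_band = toy.groupby(bands, observed=False)["Exited"].mean()
print(churn_by_band)
```

On the real data, replacing toy with df gives one churn rate per age band, which is easier to compare than overlapping points.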

The next visualization is a scatter plot, which shows the relationship between two numerical variables. Let’s see whether a customer’s estimated salary is related to their balance.

plt.figure(figsize=(12, 8))

plt.title("Estimated Salary vs Balance", fontsize=16)

sns.scatterplot(x='Balance', y='EstimatedSalary', data=df)

We first use the matplotlib.pyplot interface to create a Figure object and set the title. We then use Seaborn to draw the actual plot on that figure.
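The same figure-then-plot pattern can be written with an explicit Figure and Axes, which makes it clearer where the drawing ends up. The sketch below uses synthetic data and plain Matplotlib so that it runs without the dataset; Seaborn’s axes-level functions such as sns.scatterplot accept the same Axes through their ax parameter. The Agg backend line is only there for running outside a notebook and can be dropped otherwise.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
balance = rng.normal(100_000, 30_000, size=200)   # synthetic values
salary = rng.uniform(0, 200_000, size=200)        # synthetic values

# Create the Figure and its Axes explicitly instead of relying on the current figure.
fig, ax = plt.subplots(figsize=(12, 8))
ax.set_title("Estimated Salary vs Balance", fontsize=16)
ax.scatter(balance, salary, s=10)
ax.set_xlabel("Balance")
ax.set_ylabel("EstimatedSalary")
```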

Finding: there is no meaningful relationship or correlation between estimated salary and balance. Balance appears to be normally distributed (excluding customers with a zero balance).
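A single correlation coefficient can back up what the scatter plot suggests. The sketch below is an illustration on synthetic data in which the two columns are drawn independently, so the coefficient is near zero by construction; it says nothing about the real dataset. Zero balances are filtered out first, as in the observation above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
toy = pd.DataFrame({
    # Roughly a third of the synthetic customers get a zero balance.
    "Balance": np.where(rng.random(n) < 0.35, 0.0, rng.normal(120_000, 30_000, n)),
    "EstimatedSalary": rng.uniform(0, 200_000, n),   # drawn independently
})

nonzero = toy[toy["Balance"] > 0]                         # drop zero-balance rows
r = nonzero["Balance"].corr(nonzero["EstimatedSalary"])   # Pearson by default
print(f"rows kept: {len(nonzero)}, Pearson r: {r:.3f}")
```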

The next visualization is the box plot, which summarizes the distribution of a variable through its median and quartiles.

plt.figure(figsize=(12, 8))

ax = sns.boxplot(x='Geography', y='Age', data=df)

ax.set_xlabel("Country", fontsize=16)
ax.set_ylabel("Age", fontsize=16)

We also set the x-axis and y-axis labels and their font sizes using set_xlabel and set_ylabel.

A box plot is structured as follows.

The median is the middle value when all points are sorted. Q1 (the first or lower quartile) is the median of the lower half of the data set, and Q3 (the third or upper quartile) is the median of the upper half. The box spans Q1 to Q3.

Thus, the boxplot provides us with an idea of distribution and outliers. In the boxplot we created, there are many outliers (indicated by dots) at the top.
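The quantities behind a box plot are easy to compute by hand. The sketch below uses a few made-up ages and the 1.5 × IQR rule, which is the default whisker length in Matplotlib and Seaborn. Note that np.percentile interpolates between points, so its quartiles can differ slightly from the "median of each half" definition given above.

```python
import numpy as np

ages = np.array([22, 25, 29, 31, 35, 38, 41, 44, 70])  # made-up values

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1   # interquartile range: the height of the box

# Points beyond 1.5 * IQR from the box are drawn as individual outlier dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, outliers)
```

In this toy sample only the value 70 lies outside the fences, so it is the single point that would appear as a dot above the upper whisker.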

Finding: the distribution of the Age variable is right-skewed. The mean is greater than the median because of the outliers on the upper side.
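The link between right skew and a mean above the median can be checked directly. This is a minimal sketch on made-up values with one large observation on the upper side; on the real data the same calls can be applied to df['Age'].

```python
import pandas as pd

ages = pd.Series([22, 25, 29, 31, 35, 38, 41, 44, 70])  # made-up values

mean, median = ages.mean(), ages.median()
skewness = ages.skew()   # sample skewness; positive means a longer right tail

print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
```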

The right skew is also visible in the univariate distribution of the variable. Let’s create a distplot to look at it.

plt.figure(figsize=(12, 8))

plt.title("Distribution of Age", fontsize=16)

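# Note: distplot is deprecated since seaborn 0.11; sns.kdeplot(df['Age']) is the suggested replacement.
sns.distplot(df['Age'], hist=False)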

The right tail is heavier than the left tail. This is caused by the outliers we observed in the box plot.

By default, distplot also draws a histogram, but we turned it off with hist=False.
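With the histogram turned off, what remains is a kernel density estimate (KDE): a smooth curve obtained by averaging one small bump per data point. The sketch below builds such a curve with NumPy on synthetic right-skewed data. It is not Seaborn’s implementation; the function name gaussian_kde_1d and the choice of Scott’s rule for the bandwidth are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(1)
ages = rng.gamma(shape=9.0, scale=4.0, size=500) + 18   # synthetic, right-skewed

# Scott's rule of thumb for the bandwidth (an assumption of this sketch).
bandwidth = ages.std(ddof=1) * len(ages) ** (-1 / 5)

grid = np.linspace(ages.min() - 3 * bandwidth, ages.max() + 3 * bandwidth, 400)
density = gaussian_kde_1d(ages, grid, bandwidth)

area = density.sum() * (grid[1] - grid[0])   # a density should integrate to about 1
print(f"area under the curve: {area:.3f}")
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, which is worth keeping in mind when reading the shape of the tails.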

The Seaborn library also provides pair plots, which give an overview of the pairwise relationships between variables. Let’s start by taking a random sample from the data set so that the plot is less crowded and easier to read. The original data set has 10,000 observations; we will select a sample of 100 observations and four features.

subset = df[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']].sample(n=100)

g = sns.pairplot(subset, height=2.5)

On the diagonal, we can see a histogram of each variable. The rest of the grid shows the relationship between each pair of variables.
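Because the sample is random, the pair plot changes on every run. If a reproducible figure is wanted, pandas’ sample method accepts a random_state seed. The sketch below shows this on a synthetic stand-in frame; the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

cols = ["CreditScore", "Age", "Balance", "EstimatedSalary"]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1_000, 4)), columns=cols)  # synthetic stand-in

# A fixed random_state returns the same 100 rows on every run.
subset_a = toy[cols].sample(n=100, random_state=42)
subset_b = toy[cols].sample(n=100, random_state=42)

print(subset_a.shape, subset_a.equals(subset_b))
```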

Another tool for viewing pairwise relationships is the heat map, which takes a matrix and renders it as a color-coded grid. Heat maps are often used to check the correlations among features and between features and the target variable.

Let’s start by using pandas’ corr function to create a correlation matrix for some of the features.

corr_matrix = df[['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Exited']].corr()

We can now plot this matrix.

plt.figure(figsize=(12, 8))

sns.heatmap(corr_matrix, cmap='Blues_r', annot=True)

Finding: the Age and Balance columns are positively correlated with customer churn.
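When the target is what matters most, it can help to pull its column out of the correlation matrix and sort it, rather than reading values off the heat map. The sketch below is an illustration on synthetic data in which churn is made to depend on age by construction; the numbers it prints are not results for the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 2_000
age = rng.normal(40, 10, n)
tenure = rng.integers(0, 11, n)
# Synthetic target: churn probability rises with age, by construction.
p_exit = 1 / (1 + np.exp(-(age - 50) / 8))
toy = pd.DataFrame({
    "Age": age,
    "Tenure": tenure,
    "Exited": (rng.random(n) < p_exit).astype(int),
})

corr_matrix = toy.corr()
# Keep the target's column, drop its self-correlation, and sort.
target_corr = corr_matrix["Exited"].drop("Exited").sort_values(ascending=False)
print(target_corr)
```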


As the volume of data increases, it becomes more and more difficult to analyze and explore the data. Visualization is an important tool in exploratory data analysis, and it has great power when used effectively and appropriately. Visualization can also help convey information to your audience or tell them what you’ve found.

There is no one-size-fits-all approach to visualization: different tasks call for different types of plots. What all visualizations have in common is that they are great tools for exploratory data analysis and for the storytelling part of data science.

Original link: towardsdatascience.com/a-practical…
