Small and comprehensive Pandas data analysis case

I have written many articles about Pandas. In this article, I provide a brief overview of the use of Pandas.

  • How do I simulate the data myself
  • Multiple data processing methods
  • Data statistics and visualization
  • User RFM model
  • User repurchase cycle

Build a data

The data used in this case is simulated by the small editor himself, mainly including two data: order data and fruit information data, and the two data will be combined

import pandas as pd
import numpy as np
import random
from datetime import *
import time

import plotly.express as px
import plotly.graph_objects as go
import plotly as py

# Draw a subgraph
from plotly.subplots import make_subplots
Copy the code

1. Time field

2. Fruit and users

3. Generate order data

order = pd.DataFrame({
    "time":time_range,  # Order time
    "fruit":fruit_list,  # Fruit name
    "name":name_list,  # the customer name
    # purchases
    "kilogram":np.random.choice(list(range(50.100)), size=len(time_range),replace=True) 
})

order
Copy the code

4. Generate fruit information data

infortmation = pd.DataFrame({
    "fruit":fruits,
    "price": [3.8.8.9.12.8.6.8.15.8.4.9.5.8.7]."region": ["South China"."North China"."Northwest"."Central China"."Northwest"."South China"."North China"."Central China"]
})

infortmation
Copy the code

5. Data consolidation

Merge the order information and fruit information directly into a complete DataFrame. This DF is the data to be processed next

6. Generate a new field: order amount

Here you can learn:

  • How do I generate time-dependent data
  • How do I generate random data from a list (iterable)
  • Pandas’ DataFrame is created by itself and contains new fields
  • Pandas Data Merge

Analysis dimension 1: time

Monthly sales trend 2019-2021

1. Extract the year and month first:

df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
# Extract both year and month
df["year_month"] = df["time"].dt.strftime('%Y%m')

df
Copy the code

2, check the field type:

3. Monthly statistics and display:

# Monthly sales statistics
df1 = df.groupby(["year_month"[])"kilogram"].sum().reset_index()

fig = px.bar(df1,x="year_month",y="kilogram",color="kilogram")
fig.update_layout(xaxis_tickangle=45)   # Tilt Angle

fig.show()
Copy the code

2019-2021 sales trend

df2 = df.groupby(["year_month"[])"amount"].sum().reset_index()

df2["amount"] = df2["amount"].apply(lambda x:round(x,2))

fig = go.Figure()
fig.add_trace(go.Scatter(  #
    x=df2["year_month"],
    y=df2["amount"],
    mode='lines+markers'.# mode Mode selection
    name='lines')) # the name

fig.update_layout(xaxis_tickangle=45)   # Tilt Angle

fig.show()
Copy the code

Annual sales, sales and average sales

Analytical dimension 2: goods

Percentage of annual sales of fruit

df4 = df.groupby(["year"."fruit"]).agg({"kilogram":"sum"."amount":"sum"}).reset_index()
df4["year"] = df4["year"].astype(str)
df4["amount"] = df4["amount"].apply(lambda x: round(x,2))

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1, 
    cols=3,
    subplot_titles=["2019"."2020"."2021"],
    specs=[[{"type": "domain"},   # specify the type by type
           {"type": "domain"},
           {"type": "domain"}]]
)  

years = df4["year"].unique().tolist()

for i, year in enumerate(years):
    name = df4[df4["year"] == year].fruit
    value = df4[df4["year"] == year].kilogram
    
    fig.add_traces(go.Pie(labels=name,
                        values=value
                       ),
                 rows=1,cols=i+1
                )

fig.update_traces(
    textposition='inside'.# 'inside','outside','auto','none'
    textinfo='percent+label',
    insidetextorientation='radial'.Horizontal, radial, tangential
    hole=3.,
    hoverinfo="label+percent+name"
)

# fig.update_layout(title_text=" Sublayout ")

fig.show()
Copy the code

Comparison of annual sales sum of fruits

years = df4["year"].unique().tolist()

for _, year in enumerate(years):
    
    df5 = df4[df4["year"]==year]
    fig = go.Figure(go.Treemap( 
        labels = df5["fruit"].tolist(),
        parents = df5["year"].tolist(),
        values = df5["amount"].tolist(),
        textinfo = "label+value+percent root"
    ))
    
    fig.show()
Copy the code

Monthly change in sales volume of goods

fig = px.bar(df5,x="year_month",y="amount",color="fruit")
fig.update_layout(xaxis_tickangle=45)   # Tilt Angle
fig.show()
Copy the code

The line chart shows the changes:

Analysis dimension 3: region

Sales by region

Average annual sales by region

df7 = df.groupby(["year"."region"[])"amount"].mean().reset_index()
Copy the code

Analysis Dimension 4: Users

Comparison of user order quantity and amount

df8 = df.groupby(["name"]).agg({"time":"count"."amount":"sum"}).reset_index().rename(columns={"time":"order_number"})

df8.style.background_gradient(cmap="Spectral_r")
Copy the code

User’s fruit preferences

According to the order quantity and order amount of each user for each fruit:

df9 = df.groupby(["name"."fruit"]).agg({"time":"count"."amount":"sum"}).reset_index().rename(columns={"time":"number"})

df10 = df9.sort_values(["name"."number"."amount"],ascending=[True.False.False])

df10.style.bar(subset=["number"."amount"],color="#a97fcf")
Copy the code

px.bar(df10,
       x="fruit",
       y="amount".# color="number",
       facet_col="name"
      )
Copy the code

User layering – RFM model

RFM model is an important tool and means to measure customer value and profitability.

This model can reflect a user’s delivery transaction behavior, transaction frequency and total transaction amount of three indicators, through three indicators to describe the value of the customer; At the same time, customers are divided into 8 types of customer value according to the three indicators:

  • Recency (R) is the number of days since the last purchase date of the customer. This index is related to the time point of analysis and therefore is variable. Theoretically, the more recent a customer purchases, the more likely it is to repurchase

  • Frequency (F) is the number of times a customer makes a purchase — those who buy the most often tend to be more loyal. Increasing the number of times a customer buys means getting a larger share of the time.

  • Monetary value (M) is the total amount spent by customers on purchases.

Pandas uses several methods to solve for the three metrics, starting with F and M: the number of orders per customer and the total amount

How do we solve for R indices?

1. First solve the difference between each order and the current time

2. In ascending order according to the difference value R of each user, the first data is his recent purchase record: Take xiaoming user as an example, the latest purchase was on December 15th, and the difference value from the current time is 25 days

3. According to the user, the first data is retained, so as to obtain the R index of each user:

4. Three indicators are obtained by data combination:

When the amount of data is large enough and the number of users is large enough, it is possible to divide users into eight types using only the RFM model

User repurchase analysis

The repurchase period is the interval between two purchases by the user: Taking Xiaoming user as an example, the repurchase period for the first two purchases is 4 days and 22 days respectively

The following is the process of solving the repurchase cycle for each user:

1. The purchase time of each user is in ascending order

2. Move the time by one unit:

3. Difference after merger:

Null value is the first record of each user before there is no data, then directly delete the null value part

Take out the numerical part of the days directly:

5. Comparison of repurchase cycle

px.bar(df16,
       x="day",
       y="name",
       orientation="h",
       color="day",
       color_continuous_scale="spectral"   # purples
      )
Copy the code

In the figure above, the narrower the rectangle, the smaller the interval; The entire repurchase cycle for each user is determined by the length of the rectangle. View the sum of the total repurchase cycle and the average repurchase cycle of each user:

A conclusion can be drawn: Michk and Mike are loyal users in the long run because of their overall repurchase cycle. Moreover, the average repurchase period is relatively low, indicating that the repurchase is active in a short time.

It can also be observed from the violin below that the repurchase cycle distribution of Michk and Mike is the most concentrated.