“This is the 26th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

The data set of this paper is taken from the “Black Friday” data set of Kaggle website. The background of the data set is based on Black Friday. Black Friday is a big shopping activity in western countries, which is similar to The “Double 11” activity in China. To maximize profits at the end of the year. In order to ensure that the privacy rights of the operators and consumers of the data set are not violated, some features of the data set are desensitized. A retail limited company wants to know how its customers buy various products in different categories (and in particular how much they buy). Provides purchasing information for the customer groups that select high-volume products on Black Friday. The dataset contains two pieces of information: customer demographic information (age, sex, marital status, city type, residence duration) and product details (item ID and item type) and total purchase amount. Kaggle data sets: www.kaggle.com/sdolezel/bl…

The specific explanation of each field is shown in Table 11:

Table 1-1 Data set field interpretation table

field meaning
User_ID The user code
Product_ID Commodity code
Gender Gender (M for male, F for female)
Age Age (0-17,18-25,26-35,36-45,46-50,51-55,55+, 7 kinds)
Occupation Occupations (20 occupations in total)
City_Category City classification (divided into three types of cities: ABC)
Stay_in_Current_City_Years Number of years of residence in current city (0,1,2,3,4+, 5 types)
Marital_Status Marital status (0 is unmarried, 1 is married)
Product_Category_1 Product classification is 1 (clothing, 20 kinds, represented by 1-20)
Product_Category_2 Product Category 2 (Electronic products)
Product_Category_3 Product category 3 (Furniture supplies)
Purchase Purchase Amount (in US dollars)
# import data
df = pd.read_csv("product_data.csv")
df.head()
Copy the code

Fundamental analysis

Info Basic information

Use INFO for simple data information

df.info()
Copy the code

What does info() show us?

User_ID = int, Product_ID = object, Product_ID = P, Product_ID = P, Product_ID = P Or is there something else? We need to check and analyze it later.

The characteristics of Gender/Age/City_Category/Stay_In_Current_City_Years are also object because they contain strings. There is no big problem. We will deal with the structure later.

Occupation/Marital_Status is a purely numerical variable and int is appropriate as a categorical variable.

Product_Category_1/Product_Category_2/Product_Category_3 /Product_Category_3 However, it is not possible to convert a class variable to a floating-point number, so look at the data later and decide whether to convert the data type

In conclusion, the data types of all data are chaotic. A large number of unstructured string data need to be converted into corresponding numeric data, and some data need to be converted into data types.

2. Non-null data missing \

Product_Category_2/Product_Category_3 is missing, and it looks like Product_Category_3 is missing.

With such a large amount of data, deletion is definitely not feasible, so you need to think of some effective padding strategy to deal with these missing values.

import missingno
df.isna().sum()/df.shape[0]
# Bar chart to see missing degree
missingno.bar(df)
Copy the code

# Matrix graph looking at missing distribution
missingno.matrix(df)
Copy the code

To summarize the known information: there are 537,577 transaction records for int64, Object, and Float64. 31% of the values of Product_Category_2 and 69% of the values of Product_Category_3 are missing

Mining field information

# check the number of users and items consumed
print('Number of consumers:',df.User_ID.nunique())
print('Type of commodity:',df.Product_ID.nunique())
# Total amount spent
print('Total consumption {}'.format(df.Purchase.sum()),"$")
# Let me see who bought that
user_Purchase_top10 = df.groupby('User_ID').Purchase.count().reset_index(name='Order times')
user_Purchase_top10['Gross consumption'] = df.groupby('User_ID').Purchase.sum().values
user_Purchase_top10.sort_values('Order times',ascending=False).head(10)
Copy the code

Number of consumers: 5,891

Type of merchandise: 3623

The good guys spent $5 billion!

See how many different items category 2 has
print(df.Product_Category_2.unique())
print(df.Product_Category_3.unique())
Copy the code

[nan, 6., 14., 2., 8., 15., 16., 11., 5., 3., 4., 12., 9.,10., 17., 13., 7., 18.]

[nan, 14., 17., 5., 4., 16., 15., 8., 9., 13., 6., 12., 3.,18., 11., 10.]

Product_Category_2 and Product_Category_3 are both positive integers between 2 and 18. For nan, the median can be filled. However, it is also possible to guess that the item with the missing value does not belong to either category, so a new category is generated by replacing the missing value with 0.

Copy a copy of the source data to ensure that the original data does not move
# And want to get a new table record 5819 consumer information
UserInfo = df.copy()
UserInfo = UserInfo.pivot_table(index = ['User_ID'.'Gender'.'Age'.'Occupation'.'City_Category'.'Stay_In_Current_City_Years', \'Marital_Status'],values = ['Purchase'],aggfunc = 'sum').reset_index()
UserInfo.nunique()
Copy the code

Consumer characteristics: gender — 2 ages — 7 occupations — 21 cities — 3 years of residence — 5 marital status — 2 \

Data visualization

Create a function feature_group(data,group,text) to output a multi-layer pie chart sorted by consumer characteristics

colors = ['salmon'.'lightskyblue'.'#FFF68F'.'palegreen'.'lightpink'.'silver'.'burlywood'.'plum'.'rosybrown']
def feature_group(data,group,text) :    #data: data set, group: field to be classified
    data['Number'] = 1
    feature_class = data.groupby(group)[['Purchase'.'Number']].sum().sort_values('Purchase',ascending = False)
    # The inner circle is the proportion of consumers
    plt.pie(x = feature_class.Number,colors = colors,radius = 1.2,
            wedgeprops = {'width':1.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.7)    
     # The outer circle is the proportion of consumption amount
    plt.pie(x = feature_class.Purchase,colors = colors,radius = 1.8,
            wedgeprops = {'width':0.6.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.85)   
    # legend
    plt.legend(feature_class.index,bbox_to_anchor = (1.2.1.2),fontsize = 15)
    # chart of
    plt.title('different'+text+'Proportion chart of corresponding consumption number and consumption amount',y = 1.25,fontsize = 15)
    plt.show()
Copy the code

gender

feature_group(UserInfo,'Gender'.'gender')
Copy the code

malesPurchaserData = df.loc[df['Gender'] = ='M']
malesPurchaseMean = np.mean(malesPurchaserData['Purchase'])
print("Sales per male consumer =",malesPurchaseMean)

femalsPurchaserData = df.loc[df['Gender'] = ='F']
femalsPurchaseMean = np.mean(femalsPurchaserData['Purchase'])
print("Sales per female consumer =",femalsPurchaseMean)
Copy the code

Male consumer per capita sales = 9504.771712960679

Sales per female consumer = 8809.761348593387

The number of male consumers accounted for 72%, and the consumption amount accounted for 77%, both far higher than that of female, and the average consumption amount of male was also slightly higher than that of female

age

feature_group(UserInfo,'Age'.'age')
Copy the code

The proportion of consumers aged 26-35 is 35%, and the proportion of consumption amount is 40%. The average consumption amount is also higher than that of other age groups, the proportion of consumption amount is 20% for those aged 36-45, and the proportion of consumption amount is 18% for those aged 18-25. The consumption amount accounted for 18%. The consumption amount of young people aged 18-45 and middle-aged people accounted for 73%, and the consumption amount accounted for 78%, which was the main consumption force of this activity

professional

# Too many classes redefine a larger graph function
def feature_group_mul(data,group,text) :    #data: data set, group: field to be classified
    data['Number'] = 1
    feature_class = data.groupby(group)[['Purchase'.'Number']].sum().sort_values('Purchase',ascending = False)
    # The inner circle is the proportion of consumers
    plt.figure(figsize=(10.8))
    plt.pie(x = feature_class.Number,colors = colors,radius = 1.2,
            wedgeprops = {'width':1.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.7)    
     # The outer circle is the proportion of consumption amount
    plt.pie(x = feature_class.Purchase,colors = colors,radius = 1.8,
            wedgeprops = {'width':0.6.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.85)   
    # legend
    plt.legend(feature_class.index,bbox_to_anchor = (1.5.1.2),fontsize = 15)
    # chart of
    plt.title('different'+text+'Proportion chart of corresponding consumption number and consumption amount',y = 1.3,fontsize = 15)

feature_group_mul(UserInfo,'Occupation'.'career')
Copy the code

The occupations with the highest proportion of consumers were 4, 0 and 7, accounting for 13%, 12% and 11%, respectively

feature_group(UserInfo,'City_Category'.'city')
Copy the code

The city with the highest consumption proportion is C, which accounts for 53%, but the consumption proportion is only 33%. The city with the highest consumption proportion is B, which accounts for 42%, and the consumption proportion is 29%. The average consumption sum of CITY A and City B is relatively high, while the average consumption sum of city C is far lower than that of city A and City B

feature_group(UserInfo,'Stay_In_Current_City_Years'.'Years of residence')
Copy the code

The number of consumers who have lived for one year is the largest, and the amount of consumption is also the largest. The average consumption amount of consumers with different years of residence is almost the same

feature_group(UserInfo,'Marital_Status'.'Marital Status')
Copy the code

Unmarried people’s consumption number, consumption amount, average consumption amount are higher than married people, sure enough, not married to enjoy buying

# Further explore the two characteristics of city and age
fig = plt.figure(figsize=(10.5))
width_bar = 0.12
width_add = 0
for c,Age in enumerate(sorted(UserInfo.Age.unique())):
    City_Age = UserInfo[UserInfo.Age == Age].groupby('City_Category').Purchase.sum()\
        .reset_index(name = 'Purchase')
    City_Age['Purchase'] = round(City_Age['Purchase'] / 1000000.1)
    plt.bar(np.arange(UserInfo.City_Category.nunique()) + width_add,City_Age['Purchase'],width = width_bar,\
        color = colors[c],label = Age)
    for x,y in enumerate(City_Age.Purchase):
        plt.text(x + width_add,y + 8,y,ha = 'center')
    width_add += width_bar
plt.title('Consumption Sum comparison chart of Different Cities and ages',y = 1.02,fontsize = 13)
plt.xlabel('city',fontsize = 12)
plt.ylabel('Consumption Amount (US $million)',fontsize = 12)
plt.xlim(-0.1.3.2)
plt.ylim(0.900)
plt.xticks(np.arange(UserInfo.City_Category.nunique()) + 0.3.sorted(UserInfo.City_Category.unique()))
plt.tight_layout()
plt.legend()
plt.show()
Copy the code

Young people aged 26-35 are the main consumption force, and their consumption amount is higher than that of young and middle-aged consumers aged 18-45 in other age groups. The consumption amount contributed by young and middle-aged consumers accounts for A high proportion in cities A and B. In city C, the consumption amount contributed by minor consumers aged 1-17 and elderly consumers aged 55+ in city C is higher than that in city A and city B. It is preliminarily estimated that City C is A population outflow city

# Further explore the two characteristics of city and marital status
fig = plt.figure(figsize=(10.5))
width_bar = 0.12
width_add = 0
for c,M in enumerate(sorted(UserInfo.Marital_Status.unique())):
    City_M = UserInfo[UserInfo.Marital_Status == M].groupby('City_Category').Purchase.sum()\
        .reset_index(name = 'Purchase')
    City_M['Purchase'] = round(City_M['Purchase'] / 1000000.1)
    plt.bar(np.arange(UserInfo.City_Category.nunique()) + width_add,City_M['Purchase'],width = width_bar,\
        color = colors[c],label = M)
    for x,y in enumerate(City_M.Purchase):
        plt.text(x + width_add,y + 8,y,ha = 'center')
    width_add += width_bar
plt.title('Consumption Sum comparison chart of Different Cities and Marital Status',y = 1.02,fontsize = 13)
plt.xlabel('city',fontsize = 12)
plt.ylabel('Consumption Amount (US $million)',fontsize = 12)
plt.xlim(-0.1.3.2)
plt.ylim(0.1500)
plt.xticks(np.arange(UserInfo.City_Category.nunique()) + 0.1.sorted(UserInfo.City_Category.unique()))
plt.tight_layout()
plt.legend()
plt.show()
Copy the code

# Take a look at the top 20 items by sales
Product_Top20 = df.groupby('Product_ID').Purchase.sum().reset_index(name = 'Purchase')
Product_Top20 = Product_Top20.sort_values('Purchase',ascending = False)[:20]
Product_Top20['Purchase'] = round(Product_Top20['Purchase'] / 1000000.1)
fig = plt.figure(figsize=(10.4.5))
plt.bar(range(20),Product_Top20.Purchase,color = colors,width = 0.6)
for x,y in enumerate(Product_Top20.Purchase):
    plt.text(x,y + 0.5,y,ha = 'center')
plt.xticks(range(20),Product_Top20.Product_ID,rotation = -90)
plt.ylim(0.30)
plt.xlabel('Product Category',fontsize = 12)
plt.ylabel('Sales (US $million)')
plt.title('Top20 merchandise sales comparison chart ')
plt.tight_layout()
plt.show()
Copy the code

There are 16 categories of goods sales more than $20 million, the highest is P00025442, $27.5 million

# Look at the top 20% of sales as a percentage of total sales
Product_Group = df.groupby('Product_ID').Purchase.sum().sort_values(ascending = False)
Product_Group[:int(df.Product_ID.nunique() * 0.2)].sum() / Product_Group.sum(a)Copy the code

0.7317218081007266

The top 20 percent of sales accounted for 73 percent of sales, basically in line with the 80-20 rule

Because there are too many missing values for Product_Category_2 and Product_Category_3, only Product_Category_1 (category 1) will be analyzed
# Analyze category 1 sales for each category
Category_Group = round(df.groupby('Product_Category_1').Purchase.sum().sort_values(ascending = False) /10000000.1)
fig = plt.figure(figsize=(10.4))
plt.bar(range(df.Product_Category_1.nunique()),Category_Group,color = colors,width = 0.6)
for x,y in enumerate(Category_Group):
    plt.text(x,y + 3,y,ha = 'center')
plt.xticks(range(df.Product_Category_1.nunique()),Category_Group.index)
plt.ylim(0.200)
plt.xlabel('categories',fontsize = 12)
plt.ylabel('Sales (US $million)',fontsize = 12)
plt.title('Sales comparison chart by Category',fontsize = 13)
plt.tight_layout()
plt.show()

# Bar chart is not very convenient, through the pie chart to intuitively feel the proportion
plt.figure(figsize=(12.9))
plt.pie(x = Category_Group,colors = colors,radius = 1.5,wedgeprops = {'width':0.6.'edgecolor':'w'},\
            autopct = '%.f%%',pctdistance = 0.8,labels = Category_Group.index)
plt.title('Sales comparison chart by Category',y = 1.14,fontsize = 13)
plt.show()
Copy the code

The number one category in sales was $188.3 million, or 38 percent

The second and third categories in terms of sales are 5 and 8, accounting for 18% and 17% of sales

All five categories, 18, 9, 17, 12 and 13, had sales of less than $1 million

Sales of different categories vary widely

# Box graph view data distribution and outliers
fig = plt.figure(figsize=(14.6))
sns.boxplot(x = 'Product_Category_1',y = 'Purchase',data = df)
plt.tight_layout()
plt.title('Box chart of Unit Price for different categories of goods',fontsize = 14)
plt.show()
Copy the code

Most sales category 1, 5, 8 sales unit price are all in the position of the middle or lower, prove the three categories of commodity demand is very high class 10 highest selling price, but the sales accounts for only 2% than, other sales price higher category sales also generally is not high Class 12, 13 lowest selling price, sales is the least, That adds up to less than $1 million

Take a look at the consumption preferences of male and female consumers based on the analysis of product category and consumer gender
Category_Gender_pvt = pd.pivot_table(df,index = 'Product_Category_1',columns = 'Gender',\
values = 'Purchase',aggfunc = np.sum)
M_Percent = round(Category_Gender_pvt.M / df.groupby('Product_Category_1').Purchase.sum(*)100.1)
F_Percent = round(100 - M_Percent,1)
fig = plt.figure(figsize=(12.4.5))
width_bar = 0.35
plt.bar(np.arange(18) - width_bar * 0.5,M_Percent,width = width_bar,color = colors[0],label = 'M')
for x1,y1 in enumerate(M_Percent):
    plt.text(x1 - width_bar * 0.5,y1 + 2,y1,ha = 'center')
plt.bar(np.arange(18) + width_bar * 0.5,F_Percent,width = width_bar,color = colors[1],label = 'F')
for x2,y2 in enumerate(F_Percent):
    plt.text(x2,y2 + 2,y2)
plt.xticks(np.arange(18),Category_Gender_pvt.index)
plt.xlim(-0.8.19)
plt.ylim(0.100)
plt.xlabel('categories',fontsize = 12)
plt.ylabel('Percentage (%)',fontsize = 12)
plt.title('Proportion of Male and Female Sales by Category',fontsize = 14)
plt.legend()
plt.tight_layout()
plt.show()
Copy the code

The categories with the highest proportion of male consumption are 17 and 18, accounting for 89.6% and 88.2% respectively. The categories with the highest proportion of female consumption are 14 and 12, accounting for 42.8% and 40.7% respectively, which are still lower than the most popular categories 1, 5 and 8 for men, the consumption amount of men is much higher than that of women

summary

Men occupy a dominant position in consumption, young and middle-aged unmarried people have strong purchasing power, so they can be used as the target market. Unit price has a significant impact on sales, and different products can be launched precisely according to the population situation of different cities to maximize profits.