“This is the 26th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”
The data set of this paper is taken from the “Black Friday” data set of Kaggle website. The background of the data set is based on Black Friday. Black Friday is a big shopping activity in western countries, which is similar to The “Double 11” activity in China. To maximize profits at the end of the year. In order to ensure that the privacy rights of the operators and consumers of the data set are not violated, some features of the data set are desensitized. A retail limited company wants to know how its customers buy various products in different categories (and in particular how much they buy). Provides purchasing information for the customer groups that select high-volume products on Black Friday. The dataset contains two pieces of information: customer demographic information (age, sex, marital status, city type, residence duration) and product details (item ID and item type) and total purchase amount. Kaggle data sets: www.kaggle.com/sdolezel/bl…
The specific explanation of each field is shown in Table 11:
Table 1-1 Data set field interpretation table
field | meaning |
---|---|
User_ID | The user code |
Product_ID | Commodity code |
Gender | Gender (M for male, F for female) |
Age | Age (0-17,18-25,26-35,36-45,46-50,51-55,55+, 7 kinds) |
Occupation | Occupations (20 occupations in total) |
City_Category | City classification (divided into three types of cities: ABC) |
Stay_in_Current_City_Years | Number of years of residence in current city (0,1,2,3,4+, 5 types) |
Marital_Status | Marital status (0 is unmarried, 1 is married) |
Product_Category_1 | Product classification is 1 (clothing, 20 kinds, represented by 1-20) |
Product_Category_2 | Product Category 2 (Electronic products) |
Product_Category_3 | Product category 3 (Furniture supplies) |
Purchase | Purchase Amount (in US dollars) |
# import data
df = pd.read_csv("product_data.csv")
df.head()
Copy the code
Fundamental analysis
Info Basic information
Use INFO for simple data information
df.info()
Copy the code
What does info() show us?
User_ID = int, Product_ID = object, Product_ID = P, Product_ID = P, Product_ID = P Or is there something else? We need to check and analyze it later.
The characteristics of Gender/Age/City_Category/Stay_In_Current_City_Years are also object because they contain strings. There is no big problem. We will deal with the structure later.
Occupation/Marital_Status is a purely numerical variable and int is appropriate as a categorical variable.
Product_Category_1/Product_Category_2/Product_Category_3 /Product_Category_3 However, it is not possible to convert a class variable to a floating-point number, so look at the data later and decide whether to convert the data type
In conclusion, the data types of all data are chaotic. A large number of unstructured string data need to be converted into corresponding numeric data, and some data need to be converted into data types.
2. Non-null data missing \
Product_Category_2/Product_Category_3 is missing, and it looks like Product_Category_3 is missing.
With such a large amount of data, deletion is definitely not feasible, so you need to think of some effective padding strategy to deal with these missing values.
import missingno
df.isna().sum()/df.shape[0]
# Bar chart to see missing degree
missingno.bar(df)
Copy the code
# Matrix graph looking at missing distribution
missingno.matrix(df)
Copy the code
To summarize the known information: there are 537,577 transaction records for int64, Object, and Float64. 31% of the values of Product_Category_2 and 69% of the values of Product_Category_3 are missing
Mining field information
# check the number of users and items consumed
print('Number of consumers:',df.User_ID.nunique())
print('Type of commodity:',df.Product_ID.nunique())
# Total amount spent
print('Total consumption {}'.format(df.Purchase.sum()),"$")
# Let me see who bought that
user_Purchase_top10 = df.groupby('User_ID').Purchase.count().reset_index(name='Order times')
user_Purchase_top10['Gross consumption'] = df.groupby('User_ID').Purchase.sum().values
user_Purchase_top10.sort_values('Order times',ascending=False).head(10)
Copy the code
Number of consumers: 5,891
Type of merchandise: 3623
The good guys spent $5 billion!
See how many different items category 2 has
print(df.Product_Category_2.unique())
print(df.Product_Category_3.unique())
Copy the code
[nan, 6., 14., 2., 8., 15., 16., 11., 5., 3., 4., 12., 9.,10., 17., 13., 7., 18.]
[nan, 14., 17., 5., 4., 16., 15., 8., 9., 13., 6., 12., 3.,18., 11., 10.]
Product_Category_2 and Product_Category_3 are both positive integers between 2 and 18. For nan, the median can be filled. However, it is also possible to guess that the item with the missing value does not belong to either category, so a new category is generated by replacing the missing value with 0.
Copy a copy of the source data to ensure that the original data does not move
# And want to get a new table record 5819 consumer information
UserInfo = df.copy()
UserInfo = UserInfo.pivot_table(index = ['User_ID'.'Gender'.'Age'.'Occupation'.'City_Category'.'Stay_In_Current_City_Years', \'Marital_Status'],values = ['Purchase'],aggfunc = 'sum').reset_index()
UserInfo.nunique()
Copy the code
Consumer characteristics: gender — 2 ages — 7 occupations — 21 cities — 3 years of residence — 5 marital status — 2 \
Data visualization
Create a function feature_group(data,group,text) to output a multi-layer pie chart sorted by consumer characteristics
colors = ['salmon'.'lightskyblue'.'#FFF68F'.'palegreen'.'lightpink'.'silver'.'burlywood'.'plum'.'rosybrown']
def feature_group(data,group,text) : #data: data set, group: field to be classified
data['Number'] = 1
feature_class = data.groupby(group)[['Purchase'.'Number']].sum().sort_values('Purchase',ascending = False)
# The inner circle is the proportion of consumers
plt.pie(x = feature_class.Number,colors = colors,radius = 1.2,
wedgeprops = {'width':1.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.7)
# The outer circle is the proportion of consumption amount
plt.pie(x = feature_class.Purchase,colors = colors,radius = 1.8,
wedgeprops = {'width':0.6.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.85)
# legend
plt.legend(feature_class.index,bbox_to_anchor = (1.2.1.2),fontsize = 15)
# chart of
plt.title('different'+text+'Proportion chart of corresponding consumption number and consumption amount',y = 1.25,fontsize = 15)
plt.show()
Copy the code
gender
feature_group(UserInfo,'Gender'.'gender')
Copy the code
malesPurchaserData = df.loc[df['Gender'] = ='M']
malesPurchaseMean = np.mean(malesPurchaserData['Purchase'])
print("Sales per male consumer =",malesPurchaseMean)
femalsPurchaserData = df.loc[df['Gender'] = ='F']
femalsPurchaseMean = np.mean(femalsPurchaserData['Purchase'])
print("Sales per female consumer =",femalsPurchaseMean)
Copy the code
Male consumer per capita sales = 9504.771712960679
Sales per female consumer = 8809.761348593387
The number of male consumers accounted for 72%, and the consumption amount accounted for 77%, both far higher than that of female, and the average consumption amount of male was also slightly higher than that of female
age
feature_group(UserInfo,'Age'.'age')
Copy the code
The proportion of consumers aged 26-35 is 35%, and the proportion of consumption amount is 40%. The average consumption amount is also higher than that of other age groups, the proportion of consumption amount is 20% for those aged 36-45, and the proportion of consumption amount is 18% for those aged 18-25. The consumption amount accounted for 18%. The consumption amount of young people aged 18-45 and middle-aged people accounted for 73%, and the consumption amount accounted for 78%, which was the main consumption force of this activity
professional
# Too many classes redefine a larger graph function
def feature_group_mul(data,group,text) : #data: data set, group: field to be classified
data['Number'] = 1
feature_class = data.groupby(group)[['Purchase'.'Number']].sum().sort_values('Purchase',ascending = False)
# The inner circle is the proportion of consumers
plt.figure(figsize=(10.8))
plt.pie(x = feature_class.Number,colors = colors,radius = 1.2,
wedgeprops = {'width':1.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.7)
# The outer circle is the proportion of consumption amount
plt.pie(x = feature_class.Purchase,colors = colors,radius = 1.8,
wedgeprops = {'width':0.6.'edgecolor':'w'},autopct = '%.f%%',pctdistance = 0.85)
# legend
plt.legend(feature_class.index,bbox_to_anchor = (1.5.1.2),fontsize = 15)
# chart of
plt.title('different'+text+'Proportion chart of corresponding consumption number and consumption amount',y = 1.3,fontsize = 15)
feature_group_mul(UserInfo,'Occupation'.'career')
Copy the code
The occupations with the highest proportion of consumers were 4, 0 and 7, accounting for 13%, 12% and 11%, respectively
feature_group(UserInfo,'City_Category'.'city')
Copy the code
The city with the highest consumption proportion is C, which accounts for 53%, but the consumption proportion is only 33%. The city with the highest consumption proportion is B, which accounts for 42%, and the consumption proportion is 29%. The average consumption sum of CITY A and City B is relatively high, while the average consumption sum of city C is far lower than that of city A and City B
feature_group(UserInfo,'Stay_In_Current_City_Years'.'Years of residence')
Copy the code
The number of consumers who have lived for one year is the largest, and the amount of consumption is also the largest. The average consumption amount of consumers with different years of residence is almost the same
feature_group(UserInfo,'Marital_Status'.'Marital Status')
Copy the code
Unmarried people’s consumption number, consumption amount, average consumption amount are higher than married people, sure enough, not married to enjoy buying
# Further explore the two characteristics of city and age
fig = plt.figure(figsize=(10.5))
width_bar = 0.12
width_add = 0
for c,Age in enumerate(sorted(UserInfo.Age.unique())):
City_Age = UserInfo[UserInfo.Age == Age].groupby('City_Category').Purchase.sum()\
.reset_index(name = 'Purchase')
City_Age['Purchase'] = round(City_Age['Purchase'] / 1000000.1)
plt.bar(np.arange(UserInfo.City_Category.nunique()) + width_add,City_Age['Purchase'],width = width_bar,\
color = colors[c],label = Age)
for x,y in enumerate(City_Age.Purchase):
plt.text(x + width_add,y + 8,y,ha = 'center')
width_add += width_bar
plt.title('Consumption Sum comparison chart of Different Cities and ages',y = 1.02,fontsize = 13)
plt.xlabel('city',fontsize = 12)
plt.ylabel('Consumption Amount (US $million)',fontsize = 12)
plt.xlim(-0.1.3.2)
plt.ylim(0.900)
plt.xticks(np.arange(UserInfo.City_Category.nunique()) + 0.3.sorted(UserInfo.City_Category.unique()))
plt.tight_layout()
plt.legend()
plt.show()
Copy the code
Young people aged 26-35 are the main consumption force, and their consumption amount is higher than that of young and middle-aged consumers aged 18-45 in other age groups. The consumption amount contributed by young and middle-aged consumers accounts for A high proportion in cities A and B. In city C, the consumption amount contributed by minor consumers aged 1-17 and elderly consumers aged 55+ in city C is higher than that in city A and city B. It is preliminarily estimated that City C is A population outflow city
# Further explore the two characteristics of city and marital status
fig = plt.figure(figsize=(10.5))
width_bar = 0.12
width_add = 0
for c,M in enumerate(sorted(UserInfo.Marital_Status.unique())):
City_M = UserInfo[UserInfo.Marital_Status == M].groupby('City_Category').Purchase.sum()\
.reset_index(name = 'Purchase')
City_M['Purchase'] = round(City_M['Purchase'] / 1000000.1)
plt.bar(np.arange(UserInfo.City_Category.nunique()) + width_add,City_M['Purchase'],width = width_bar,\
color = colors[c],label = M)
for x,y in enumerate(City_M.Purchase):
plt.text(x + width_add,y + 8,y,ha = 'center')
width_add += width_bar
plt.title('Consumption Sum comparison chart of Different Cities and Marital Status',y = 1.02,fontsize = 13)
plt.xlabel('city',fontsize = 12)
plt.ylabel('Consumption Amount (US $million)',fontsize = 12)
plt.xlim(-0.1.3.2)
plt.ylim(0.1500)
plt.xticks(np.arange(UserInfo.City_Category.nunique()) + 0.1.sorted(UserInfo.City_Category.unique()))
plt.tight_layout()
plt.legend()
plt.show()
Copy the code
# Take a look at the top 20 items by sales
Product_Top20 = df.groupby('Product_ID').Purchase.sum().reset_index(name = 'Purchase')
Product_Top20 = Product_Top20.sort_values('Purchase',ascending = False)[:20]
Product_Top20['Purchase'] = round(Product_Top20['Purchase'] / 1000000.1)
fig = plt.figure(figsize=(10.4.5))
plt.bar(range(20),Product_Top20.Purchase,color = colors,width = 0.6)
for x,y in enumerate(Product_Top20.Purchase):
plt.text(x,y + 0.5,y,ha = 'center')
plt.xticks(range(20),Product_Top20.Product_ID,rotation = -90)
plt.ylim(0.30)
plt.xlabel('Product Category',fontsize = 12)
plt.ylabel('Sales (US $million)')
plt.title('Top20 merchandise sales comparison chart ')
plt.tight_layout()
plt.show()
Copy the code
There are 16 categories of goods sales more than $20 million, the highest is P00025442, $27.5 million
# Look at the top 20% of sales as a percentage of total sales
Product_Group = df.groupby('Product_ID').Purchase.sum().sort_values(ascending = False)
Product_Group[:int(df.Product_ID.nunique() * 0.2)].sum() / Product_Group.sum(a)Copy the code
0.7317218081007266
The top 20 percent of sales accounted for 73 percent of sales, basically in line with the 80-20 rule
Because there are too many missing values for Product_Category_2 and Product_Category_3, only Product_Category_1 (category 1) will be analyzed
# Analyze category 1 sales for each category
Category_Group = round(df.groupby('Product_Category_1').Purchase.sum().sort_values(ascending = False) /10000000.1)
fig = plt.figure(figsize=(10.4))
plt.bar(range(df.Product_Category_1.nunique()),Category_Group,color = colors,width = 0.6)
for x,y in enumerate(Category_Group):
plt.text(x,y + 3,y,ha = 'center')
plt.xticks(range(df.Product_Category_1.nunique()),Category_Group.index)
plt.ylim(0.200)
plt.xlabel('categories',fontsize = 12)
plt.ylabel('Sales (US $million)',fontsize = 12)
plt.title('Sales comparison chart by Category',fontsize = 13)
plt.tight_layout()
plt.show()
# Bar chart is not very convenient, through the pie chart to intuitively feel the proportion
plt.figure(figsize=(12.9))
plt.pie(x = Category_Group,colors = colors,radius = 1.5,wedgeprops = {'width':0.6.'edgecolor':'w'},\
autopct = '%.f%%',pctdistance = 0.8,labels = Category_Group.index)
plt.title('Sales comparison chart by Category',y = 1.14,fontsize = 13)
plt.show()
Copy the code
The number one category in sales was $188.3 million, or 38 percent
The second and third categories in terms of sales are 5 and 8, accounting for 18% and 17% of sales
All five categories, 18, 9, 17, 12 and 13, had sales of less than $1 million
Sales of different categories vary widely
# Box graph view data distribution and outliers
fig = plt.figure(figsize=(14.6))
sns.boxplot(x = 'Product_Category_1',y = 'Purchase',data = df)
plt.tight_layout()
plt.title('Box chart of Unit Price for different categories of goods',fontsize = 14)
plt.show()
Copy the code
Most sales category 1, 5, 8 sales unit price are all in the position of the middle or lower, prove the three categories of commodity demand is very high class 10 highest selling price, but the sales accounts for only 2% than, other sales price higher category sales also generally is not high Class 12, 13 lowest selling price, sales is the least, That adds up to less than $1 million
Take a look at the consumption preferences of male and female consumers based on the analysis of product category and consumer gender
Category_Gender_pvt = pd.pivot_table(df,index = 'Product_Category_1',columns = 'Gender',\
values = 'Purchase',aggfunc = np.sum)
M_Percent = round(Category_Gender_pvt.M / df.groupby('Product_Category_1').Purchase.sum(*)100.1)
F_Percent = round(100 - M_Percent,1)
fig = plt.figure(figsize=(12.4.5))
width_bar = 0.35
plt.bar(np.arange(18) - width_bar * 0.5,M_Percent,width = width_bar,color = colors[0],label = 'M')
for x1,y1 in enumerate(M_Percent):
plt.text(x1 - width_bar * 0.5,y1 + 2,y1,ha = 'center')
plt.bar(np.arange(18) + width_bar * 0.5,F_Percent,width = width_bar,color = colors[1],label = 'F')
for x2,y2 in enumerate(F_Percent):
plt.text(x2,y2 + 2,y2)
plt.xticks(np.arange(18),Category_Gender_pvt.index)
plt.xlim(-0.8.19)
plt.ylim(0.100)
plt.xlabel('categories',fontsize = 12)
plt.ylabel('Percentage (%)',fontsize = 12)
plt.title('Proportion of Male and Female Sales by Category',fontsize = 14)
plt.legend()
plt.tight_layout()
plt.show()
Copy the code
The categories with the highest proportion of male consumption are 17 and 18, accounting for 89.6% and 88.2% respectively. The categories with the highest proportion of female consumption are 14 and 12, accounting for 42.8% and 40.7% respectively, which are still lower than the most popular categories 1, 5 and 8 for men, the consumption amount of men is much higher than that of women
summary
Men occupy a dominant position in consumption, young and middle-aged unmarried people have strong purchasing power, so they can be used as the target market. Unit price has a significant impact on sales, and different products can be launched precisely according to the population situation of different cities to maximize profits.