Author: JackTian

WeChat Official Account: Jack’s IT Journey (ID: Jake_Internet)

I. Case description

1. Case background

It's May 20 ("520", China's online Valentine's Day) and you don't know which brand of lipstick to buy your girlfriend? No problem! Let Python data analysis tell you.

We crawled information on nearly 4,000 lipsticks from JD.com (Jingdong Mall) and analyzed the data, to give people a reference when buying lipstick for their girlfriends. The analysis covers the following questions:

1. Which price ranges sell best?
2. How are lipstick sales distributed?
3. Which are the top 10 best-selling lipsticks?
4. Which are the top 10 stores by sales?
5. What is the relationship between price and sales volume?

2. Task description

The data set jd_data.csv, covering the lipsticks listed on JD.com, was collected with a Python crawler.

We want to use this data set to run a statistical analysis of the different lipstick brands and stores and answer the questions above.

3. Description of data fields

Parameter Meaning Chart:

4. Data analysis process

II. Data preprocessing

Data cleaning

1. Import data from the CSV file first

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
dataframe = pd.read_csv('jd_data.csv', encoding='utf-8')
print(dataframe.shape)

(3816, 6) There are 3816 rows and 6 columns.

2. Missing value processing

data = dataframe.dropna(how='any')
data.head()
print(data.shape)

(3610, 6) After dropping, 3610 rows remain, so 206 of the original 3816 rows contained at least one missing value.

There are two main ways to deal with missing values:

Delete: drop the rows that contain missing values.

Fill: replace the missing values with the mean, median, mode, or a neighboring value, or estimate them with a method such as Newton interpolation. Here we take the lazy route and simply delete the rows with missing values; after all, there are not many of them.

# inplace=True makes dropna modify dataframe itself instead of returning a new DataFrame
dataframe.dropna(how='any', inplace=True)
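For comparison, here is a minimal sketch of the two strategies side by side. The small table is made up for illustration, and using the median for price and the mode for shop_type is just one reasonable choice, not what this article did.

```python
import pandas as pd

# Made-up rows for illustration; the real jd_data.csv is not reproduced here.
toy = pd.DataFrame({
    "price": [129.0, None, 320.0, 89.0, 259.0],
    "shop_type": ["self-operated", "third-party", None, "third-party", "third-party"],
})

# Option 1: delete every row that has at least one missing value.
dropped = toy.dropna(how="any")

# Option 2: fill numeric gaps with the median and categorical gaps with the mode.
filled = toy.copy()
filled["price"] = filled["price"].fillna(filled["price"].median())
filled["shop_type"] = filled["shop_type"].fillna(filled["shop_type"].mode()[0])

print(dropped.shape)               # rows and columns left after deletion
print(filled.isna().sum().sum())   # missing values left after filling
```

Deleting is simplest when few rows are affected; filling keeps every row but puts estimated values into the analysis.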

Data conversion

1. Convert the "+" and "万" (ten thousand) notation in the comment column to plain integers

def dealComment(comm_colum):
    # Strip the trailing '+', e.g. '1.2万+' -> '1.2万'
    num = str(comm_colum).split('+')[0]
    if '万' in num:
        if '.' in num:
            # '1.2万' -> '12' + '000' = '12000' (assumes one digit after the decimal point)
            num = num.replace('.', '').replace('万', '000')
        else:
            # '2万' -> '20000'
            num = num.replace('.', '').replace('万', '0000')
    return num

dataframe['comment'] = dataframe['comment'].apply(lambda x: dealComment(x))
dataframe['comment'] = dataframe['comment'].astype('int')
# Keep the converted comment column; it serves as the sales figure later on
data = dataframe
print(data.head(10))
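The replace-based conversion appends three zeros after removing the decimal point, which is only right when there is exactly one digit after the point. As a hedged alternative, the sketch below scales the numeric part by 10,000 instead. The function name parse_comment_count and the sample strings are assumptions for illustration, not part of the original code.

```python
from decimal import Decimal

def parse_comment_count(raw):
    """Turn strings such as '500+' or '1.2万+' into an integer count."""
    text = str(raw).strip().rstrip("+")
    if text.endswith("万"):
        # 万 means ten thousand, so scale the numeric part by 10,000.
        # Decimal avoids floating-point rounding on values like 1.2.
        return int(Decimal(text[:-1]) * 10000)
    return int(text)

for sample in ["500+", "2万+", "1.2万+", "1.25万+"]:
    print(sample, "->", parse_comment_count(sample))
```

It could be applied the same way as dealComment, for example with dataframe['comment'].apply(parse_comment_count).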

Processed data:

Data preprocessing is an essential part of data analysis: accurate results depend on it. Now let's start analyzing the lipstick data.

III. Data analysis

JD.com does not publish sales figures, so we treat the number of reviews as a stand-in for sales.

This project uses the name, price, comment, shop_name, and shop_type fields.

They hold the product title, price, number of comments, store name, and store type, respectively.

1. Distribution range of lipstick price

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
data = pd.read_csv('jd_data.csv', encoding='gb18030')
plt.figure(figsize=(10,8))
# Only look at products priced under 1000 yuan
price = data[data['price'] < 1000]
# SimHei lets matplotlib render Chinese characters in labels
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.hist(price['price'], bins=10, color='brown')
plt.xlabel('Price (yuan)')
plt.ylabel('Number of products')
plt.title('Price distribution')
plt.show()

Here are the results:

From the picture above, it is clear that:

  • Most lipsticks are priced between 0 and 500 yuan, though some cost up to 1,000 yuan.

  • The 200-300 yuan range is the most crowded, with more than 1,200 products, and the count drops off clearly above 300 yuan. Haha, price is king.
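The histogram shows how many products fall in each price range, but not how much each range sells. A rough sketch of how both could be tabulated with pd.cut follows. The prices, comment counts, and bucket edges are made up for illustration and are not taken from jd_data.csv.

```python
import pandas as pd

# Made-up prices and comment counts for illustration only.
toy = pd.DataFrame({
    "price":   [59, 120, 199, 230, 260, 280, 310, 450, 980],
    "comment": [8000, 15000, 30000, 120000, 90000, 60000, 20000, 5000, 300],
})

# Bucket edges in yuan; each bucket includes its left edge and excludes its right edge.
edges = [0, 100, 200, 300, 400, 500, 1000]
toy["price_range"] = pd.cut(toy["price"], bins=edges, right=False)

# Number of products and total comments (the sales stand-in) per price range.
summary = toy.groupby("price_range", observed=False)["comment"].agg(["count", "sum"])
print(summary)

# The price range with the largest total.
best = summary["sum"].idxmax()
print(best)
```

Running the same grouping on the real data would answer the "which price ranges sell best" question directly instead of reading it off the chart.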

2. Sales distribution

Since no sales figures could be crawled, the number of reviews is again used as sales.

plt.figure(figsize=(10,8))
# sale_num holds the rows to plot; here that is every product
sale_num = data
# print(len(sale_num)/len(data))
plt.hist(sale_num['comment'], bins=20, color='blue')
plt.xlabel('Sales (number of comments)')
plt.ylabel('Number of products')
plt.title('Sales distribution')
plt.show()

Here are the results:

According to the histogram, we can see:

  • Almost all sales figures are below 200,000.

  • The vast majority are below 100,000.

  • A few exceed one million.
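Readings like these can also be checked numerically instead of by eye. A minimal sketch, using made-up comment counts in place of the real comment column:

```python
import pandas as pd

# Made-up comment counts standing in for the real 'comment' column.
sales = pd.Series([1200, 5400, 800, 32000, 150000, 76000, 410, 1100000, 9800, 23000])

# The mean of a boolean Series is the share of True values.
share_under_100k = (sales < 100_000).mean()
share_under_200k = (sales < 200_000).mean()
print(share_under_100k, share_under_200k)

# Quantiles summarize a long-tailed distribution without being dragged up by outliers.
print(sales.quantile([0.5, 0.9]))
```

With a tail this long, a single product above one million stretches the histogram's x-axis, which is why most bars pile up at the left edge.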

3. Top 10 selling lipsticks

def get_title(item):
    # Use the first space-separated word of the product title as a short name
    title = item.split(' ')[0]
    return title

data['small_name'] = data['name'].apply(lambda x: get_title(x))
data1 = data.drop('name', axis=1)
# Sort by comment count, highest first
top10Lipstick = data1.sort_values('comment', ascending=False)
print(top10Lipstick.head(10))
title = top10Lipstick['small_name'][:10]
sale_num = top10Lipstick['comment'][:10]
plt.figure(figsize=(10,8), dpi=80)
plt.bar(range(10), sale_num, width=0.6, color='red')
plt.xticks(range(10), title, rotation=45)
# Write each value above its bar
for x, y in enumerate(list(sale_num)):
    plt.text(x, float(y)+0.01, y, ha='center')
plt.show()

Here are the results:

It can be found that the top three are:

  • Jingdong MAC Classic Lipstick Bullet Lipstick 3G Chili Pepper Color


  • [520 gift] Chinese style lipstick Set Gift Box Female Summer Palace the same lipstick lip glaze students non-sample lipstick set (6 pieces)


  • [520 gift] Dior Gorgeous Blue and Gold Lipstick – Matte 999# 3.5G Legend Red (Lipstick red Legend Red gift box)


4. Top 10 stores by sales

After analyzing the top 10 items, let’s take a look at the top 10 stores:

The code is as follows:

# Total comment count per store, top 10
top_shop = data.groupby('shop_name')['comment'].sum().sort_values(ascending=False)[:10]
plt.figure(figsize=(10,8), dpi=80)
top_shop.plot(kind='bar', color='red', width=0.6)
plt.xticks(rotation=45)
# Write each value above its bar
for x, y in enumerate(list(top_shop)):
    plt.text(x, float(y)+0.1, y, ha='center')
plt.show()

Here are the results:

As can be seen from the above picture:

  • MAC's overseas self-operated store takes first place with sales of 1,365,308, and all of the top 10 stores have sales above 50,000.

  • The top three each have more than 1.3 million.

  • Five of the top 10 are operated by JD.com itself.

5. The relationship between commodity price and sales volume

Let’s use a scatter plot to see how price and sales are distributed.

plt.figure(figsize=(10,8))
plt.scatter(data['price'], data['comment'], color='blue')
plt.xlabel('Price')
plt.ylabel('Sales')
plt.title('Price vs. sales scatter plot')
plt.show()

Here are the results:

It can be seen that:

As the price rises, sales fall, while below 400 yuan the price has little effect on sales. This suggests that most people spend between 0 and 400 yuan on a lipstick, even though the most expensive one costs nearly 1,700 yuan. Haha, poverty limits my imagination.
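A scatter plot gives a visual impression; a rank correlation and a simple split at 400 yuan put rough numbers on it. The sketch below uses made-up price and comment pairs, so its output says nothing about the real data.

```python
import pandas as pd

# Made-up price/comment pairs for illustration only.
toy = pd.DataFrame({
    "price":   [50, 90, 150, 220, 300, 380, 520, 800, 1200, 1650],
    "comment": [90000, 120000, 60000, 150000, 70000, 40000, 9000, 3000, 800, 200],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
# It measures a monotonic trend without assuming a straight-line relationship.
rank_corr = toy["price"].rank().corr(toy["comment"].rank())
print(round(rank_corr, 3))

# Median sales on each side of the 400 yuan mark mentioned above.
below = toy.loc[toy["price"] <= 400, "comment"].median()
above = toy.loc[toy["price"] > 400, "comment"].median()
print(below, above)
```

A clearly negative rank correlation together with a large gap between the two medians would back up the reading of the scatter plot.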

IV. Summary

I learned a lot from this small data analysis. As a beginner, I still have plenty to learn:

  • Data cleaning, which is what guarantees that the analysis produces correct results;
  • How to mine the relationships between different dimensions of the data.

Shortcomings: this analysis still leaves a lot of room for improvement:

  • The proportion of different store types was not analyzed;
  • Sales were not compared across store types;
  • Sentiment analysis was not performed, because the review text was not crawled this time.
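The first two items could be approached roughly as follows. This is only a sketch on made-up rows: the shop_type labels are assumptions, since the actual values in jd_data.csv are not shown in this article.

```python
import pandas as pd

# Made-up rows; the shop_type labels are assumptions, not values from jd_data.csv.
toy = pd.DataFrame({
    "shop_name": ["A", "A", "B", "C", "C", "D"],
    "shop_type": ["self-operated", "self-operated", "third-party",
                  "self-operated", "self-operated", "third-party"],
    "comment":   [120000, 30000, 45000, 800000, 20000, 5000],
})

# Proportion of stores of each type: count each store once, not each product.
stores = toy.drop_duplicates("shop_name")
type_share = stores["shop_type"].value_counts(normalize=True)

# Total and average sales (comment counts) per store type.
type_sales = toy.groupby("shop_type")["comment"].agg(["sum", "mean"])

print(type_share)
print(type_sales)
```

De-duplicating by shop_name first matters: counting rows directly would measure the share of products per store type rather than the share of stores.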

There is still a long way to go in data analysis. Come on!


Likes, comments, and shares are welcome, and don’t forget to follow me!