Preface
Python is widely used across many technical fields thanks to its rich and powerful libraries, and data mining and analysis is one of its most common applications. This article implements the RFM user value analysis model in Python, so you can get a feel for how enjoyable data mining and analysis can be.
RFM stands for Recency, Frequency, and Monetary. We score these three core elements of a user's purchasing activity and combine the scores with certain weights to produce a value label for each user.
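As a quick illustration of the weighted combination, here is a minimal sketch for a single user. The three scores are made-up values on a 1 to 5 scale, and the weights (0.2, 0.2, 0.6) are the ones used in the script later in this article; other weightings are equally possible.

```python
# Made-up scores for one user, each on a 1-5 scale (illustrative only)
scores = {"r": 5, "f": 3, "m": 4}
# Weights taken from the script later in this article; they sum to 1
weights = {"r": 0.2, "f": 0.2, "m": 0.6}

# Weighted score: 0.2*5 + 0.2*3 + 0.6*4
weighted_score = sum(scores[k] * weights[k] for k in scores)
print(round(weighted_score, 2))  # 4.0
```

Because monetary value carries the largest weight here, a user who spends a lot ends up with a high label even if they order rarely.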
With these user value labels we can apply different marketing strategies to different users: precision marketing, as well as operations such as acquiring new users and winning back lapsed ones.
RFM modeling and analysis
Technical preparation
Technical stack: Python's time, NumPy, and pandas packages
Pandas:
- Provides a rich set of data functions (import, export, selection, filtering, statistics)
- Provides data visualization capabilities
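To give a rough idea of what those data functions look like in practice, here is a small sketch on made-up order records. The column names `user` and `amount` are invented for this illustration and are not the ones used in the RFM script.

```python
import pandas as pd

# Made-up order records, for illustration only
orders = pd.DataFrame({
    "user": ["a", "a", "b", "c"],
    "amount": [120.0, 80.0, 15.5, 300.0],
})

amounts = orders["amount"]                          # select one column
large = orders[orders["amount"] > 50]               # filter rows with a boolean mask
per_user = orders.groupby("user")["amount"].sum()   # statistics per user
csv_text = orders.to_csv(index=False)               # export; returns a string when no path is given

print(large)
print(per_user)
```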
Business practices
We now have raw data containing the user ID, order time, order ID, and order amount, and we want the RFM model to give us the value of each user.
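To make the expected input shape concrete, here is a sketch with a handful of made-up rows. The column names match those the script reads (USERID, ORDERDATE, ORDERID, AMOUNTINFO), but every value is invented. The sketch also shows the raw recency, frequency, and monetary figures before any scoring is applied.

```python
import pandas as pd

# Made-up sample rows with the same columns the script expects
raw = pd.DataFrame({
    "USERID": [1001, 1001, 1002, 1003, 1003, 1003],
    "ORDERDATE": pd.to_datetime([
        "2020-01-05", "2020-04-20", "2020-02-11",
        "2020-03-01", "2020-03-15", "2020-04-28",
    ]),
    "ORDERID": [1, 2, 3, 4, 5, 6],
    "AMOUNTINFO": [250.0, 99.0, 40.0, 500.0, 120.0, 60.0],
}).set_index("USERID")

# One row per user: last order date, number of orders, total spend
rfm_raw = raw.groupby(level="USERID").agg(
    last_order=("ORDERDATE", "max"),
    frequency=("ORDERDATE", "count"),
    monetary=("AMOUNTINFO", "sum"),
)

# Recency in days, measured against a chosen reference date
cutoff = pd.Timestamp(2020, 5, 1)
rfm_raw["recency_days"] = (cutoff - rfm_raw["last_order"]).dt.days
print(rfm_raw)
```

Here user 1003 ordered most recently (3 days before the reference date) and most often (3 orders), while user 1002 has a single small order.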
The code is as follows; the comments explain each step, so I won't repeat them here:
import time
import numpy as np
import pandas as pd
# Import data
df_raw = pd.read_excel('test.xlsx', index_col='USERID')
# Missing value handling
sales_data = df_raw.dropna()  # Drop rows that contain missing values (NA)
sales_data = sales_data[sales_data['AMOUNTINFO'] > 1]  # Discard orders with an amount <= 1
# Data conversion (aggregate by user ID)
recency_value = sales_data['ORDERDATE'].groupby(sales_data.index).max()  # Time of each user's last order
frequency_value = sales_data['ORDERDATE'].groupby(sales_data.index).count()  # Number of orders per user
monetary_value = sales_data['AMOUNTINFO'].groupby(sales_data.index).sum()  # Total order amount per user
# Calculate the R, F and M scores
deadline_date = pd.Timestamp(2020, 5, 1)  # Reference date against which recency is measured
r_interval = (deadline_date - recency_value).dt.days  # Days between the reference date and each user's last order
r_score = pd.cut(r_interval, 5, labels=[5, 4, 3, 2, 1])  # R score: 5 equal-width bins, reversed so a shorter interval scores higher
f_score = pd.cut(frequency_value, 5, labels=[1, 2, 3, 4, 5])  # F score: 5 equal-width bins
m_score = pd.cut(monetary_value, 5, labels=[1, 2, 3, 4, 5])  # M score: 5 equal-width bins
# Merge the R, F and M data
rfm_list = [r_score, f_score, m_score]  # Put R, F and M into a list
rfm_cols = ['r_score', 'f_score', 'm_score']  # Column names for R, F and M
rfm_pd = pd.DataFrame(np.array(rfm_list).transpose(), dtype=np.int32,
                      columns=rfm_cols, index=frequency_value.index)  # Build the R, F, M DataFrame
# Strategy 1: define user value as a weighted score
rfm_pd['rfm_wscore'] = rfm_pd['r_score'] * 0.2 + rfm_pd['f_score'] * 0.2 + rfm_pd['m_score'] * 0.6
# Strategy 2: concatenate the R, F and M scores into a three-digit code
rfm_pd_tmp = rfm_pd.copy()
rfm_pd_tmp['r_score'] = rfm_pd_tmp['r_score'].astype('str')
rfm_pd_tmp['f_score'] = rfm_pd_tmp['f_score'].astype('str')
rfm_pd_tmp['m_score'] = rfm_pd_tmp['m_score'].astype('str')
rfm_pd['rfm_comb'] = rfm_pd_tmp['r_score'].str.cat(rfm_pd_tmp['f_score']).str.cat(rfm_pd_tmp['m_score'])
# Export the result
rfm_pd.to_csv('simple.csv')
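One detail of the scoring step is worth illustrating. Calling pd.cut with an integer bin count splits the value range into equal-width intervals, which is not the same as splitting users into equally sized groups. pd.qcut is the pandas function for quantile-based bins. The sketch below compares the two on made-up spend figures with one very large spender; which behaviour is preferable depends on the business, so treat this as an illustration and not as a recommendation.

```python
import pandas as pd

# Made-up total spend per user, with one very large spender
monetary = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 1000])

# Equal-width bins: the range 10..1000 is split into 5 intervals of equal length
by_width = pd.cut(monetary, 5, labels=[1, 2, 3, 4, 5])
# Equal-frequency bins: each score gets roughly the same number of users
by_quantile = pd.qcut(monetary, 5, labels=[1, 2, 3, 4, 5])

print(by_width.value_counts().sort_index())
print(by_quantile.value_counts().sort_index())
```

With equal-width bins, nine of the ten users land in score 1 and the single outlier takes score 5. With quantile bins, each score is assigned to two users.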
After you run the RFM script above, you get a new file, simple.csv, that looks like this:
Conclusion
As you can see, the RFM model is quite easy to implement in Python. The real difficulty of the modeling lies in choosing the threshold values and the weights, because they directly affect the final result.
In my personal experience, the 80/20 principle is a useful guide for choosing the boundary values that divide R, F, and M. The weights are then set according to the actual business.
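One possible reading of the 80/20 idea, sketched below, is to use the 80th percentile of a dimension as the boundary between ordinary and high-value users. This interpretation, the two-level scoring, and the made-up spend figures are all assumptions made for illustration; they are not taken from the script above.

```python
import pandas as pd

# Made-up total spend per user (illustrative only)
monetary = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000],
    index=[f"u{i}" for i in range(10)],
)

# Assumption: the 80th percentile is the boundary between
# ordinary users (score 1) and high-value users (score 2)
boundary = monetary.quantile(0.8)
m_score = pd.cut(
    monetary,
    bins=[-float("inf"), boundary, float("inf")],
    labels=[1, 2],
).astype(int)

print(boundary)
print(m_score[m_score == 2].index.tolist())  # users above the boundary
```

Passing explicit bin edges to pd.cut, instead of a bin count, is what lets the boundaries follow a business rule like this one.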
Technology itself can be boring, but we can use it to make life better. I will share more interesting Python practices with you in the future!