Why user profiling is needed and how it helps the business
We have entered the second half of the Internet, where growth is driven by data
The starting point of data analysis is insight into user behavior and needs
It can help us answer three questions about users:
Who they are
Where they come from
Where they will go
What dimensions can be used to design user tags
The "eight-character" principle: analyze users' consumption behavior
User tags: gender, age, region, income, education, occupation, etc
Consumption tags: consumption habits, purchase intent, sensitivity to promotions
Behavior tags: time, frequency, duration, favorites, clicks, likes, and ratings
(User Behavior can be divided into Explicit Behavior and Implicit Behavior)
Content analysis: analyze the content users typically browse, such as sports, games, or gossip
The three phases of the user life cycle
Customer acquisition: attract new users through more precise marketing;
Customer engagement: personalized recommendation, search ranking, scenario-based operations, etc.;
Customer retention: predict the churn rate and analyze key nodes to reduce churn.
Where the label comes from
Typical ways are:
• PGC: Professionally Generated Content, produced by experts
• UGC: User Generated Content, produced by ordinary users
Tags are abstractions of higher-dimensional things (reduced dimensions)
Clustering algorithms: K-Means, EM clustering, Mean-Shift, DBSCAN, hierarchical clustering (PCA, by contrast, is a dimensionality-reduction method often used alongside them)
• Large amounts of data need to be labeled
• Use user tags to recommend products (recommendation algorithm)
How k-means works
KMeans:
• Step 1: select K points as the initial class centers, usually picked at random from the data set;
• Step 2: assign each point to its nearest class center, forming K classes, then recompute the center of each class;
• Step 3: repeat Step 2 until the class assignments no longer change; alternatively, set a maximum number of iterations, so the loop ends after that many iterations even if the centers are still moving.
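The steps above can be sketched with scikit-learn's KMeans on made-up 2-D points (the data and parameter values are illustrative, not from the notes):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two well-separated groups (illustrative data)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# n_clusters is K; n_init restarts the random initialization several
# times and keeps the best run, guarding against a bad initial pick
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final center of each class
```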
Data normalization methods: Min-Max, Z-Score, decimal scaling
Min-Max normalization
Projects the raw data into a specified range [min, max]:
New value = (original value − minimum) / (maximum − minimum)
When min = 0 and max = 1, the data is normalized to [0, 1]
sklearn's MinMaxScaler
Data normalization:
After normalizing to [0, 1], features with different dimensions can be compared on the same scale
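A quick sketch of Min-Max scaling with sklearn's MinMaxScaler, on two made-up features with very different ranges:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (hypothetical values)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# new value = (original - min) / (max - min); default output range is [0, 1]
scaled = MinMaxScaler().fit_transform(X)
print(scaled)  # each column now runs from 0 to 1
```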
Z-Score normalization
Rescales the raw data so that it has mean 0 and standard deviation 1:
New value = (original value − mean) / standard deviation
sklearn's preprocessing.scale()
Data normalization:
New value = (original value – mean)/standard deviation
It is the most commonly used data standardization method today.
It answers the question: how many standard deviations is a given data point from the mean?
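A small sketch with sklearn's preprocessing.scale (the values are made up); after scaling, each column has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import scale

# One feature column (hypothetical values)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# new value = (original - mean) / standard deviation
z = scale(X)
print(z.ravel())       # values in units of standard deviations
print(z.mean(axis=0))  # ~0
print(z.std(axis=0))   # ~1
```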
Decimal scaling normalization
Normalizes by moving the decimal point, i.e. dividing by a power of 10
Implemented with NumPy
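A NumPy sketch of decimal scaling (the sample values are made up): divide by 10^j, where j is the smallest integer that brings every absolute value below 1:

```python
import numpy as np

x = np.array([120.0, -300.0, 45.0])  # hypothetical raw values

# Smallest j such that max(|x|) / 10**j < 1
# (for data whose max absolute value is not an exact power of 10)
j = int(np.ceil(np.log10(np.abs(x).max())))
scaled = x / (10 ** j)
print(scaled)  # here j = 3, so every value is divided by 1000
```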
Evaluation metrics: accuracy, precision, recall, F-measure
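All four metrics are available in scikit-learn; a tiny made-up binary example:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical ground truth and predictions: TP=2, FN=1, FP=1, TN=2
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 4/6
print(precision_score(y_true, y_pred))  # TP/(TP+FP)    = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN)    = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```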
What is TF-IDF
TF: Term Frequency
The importance of a word is proportional to the number of times it appears in the document.
IDF: Inverse Document Frequency
How well a word differentiates documents: the fewer documents the word appears in, the greater its differentiating power and the larger its IDF
Steps of a content-based recommender system
• Item Representation:
Extract features for each item
• Profile Learning:
Learn a user's preference profile from the feature data of items the user has liked (or disliked) in the past;
• Recommendation Generation:
Match the user profile against the features of candidate items and recommend the most relevant items.
• The tag can be used as a Profile of the user or as an Item feature
• Matching between User=>Item:
• SimpleTagBased
• NormTagBased
• TagBased-TFIDF
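SimpleTagBased, the first of the three, scores an item i for a user u as score(u, i) = Σ_t n(u, t) · n(t, i), where n(u, t) counts how often user u used tag t and n(t, i) counts how often tag t was applied to item i. A dictionary-based sketch, with hypothetical user, tag, and item names:

```python
from collections import defaultdict

# Hypothetical counts built from (user, item, tag) records
user_tags = {"u1": {"sports": 3, "games": 1}}   # n(u, t)
tag_items = {"sports": {"i1": 2, "i2": 1},
             "games": {"i2": 4}}                # n(t, i)

def simple_tag_based(user):
    """score(u, i) = sum over tags t of n(u, t) * n(t, i)."""
    scores = defaultdict(float)
    for tag, n_ut in user_tags[user].items():
        for item, n_ti in tag_items.get(tag, {}).items():
            scores[item] += n_ut * n_ti
    # highest-scoring items first
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(simple_tag_based("u1"))  # i2: 3*1 + 1*4 = 7, i1: 3*2 = 6
```

Roughly speaking, the other two variants adjust this score: NormTagBased normalizes the raw counts, and TagBased-TFIDF down-weights very popular tags with an IDF-style factor.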
• Clustering is a way of reducing dimensionality; it rests on a definition of distance
• Defining the dimensions of the user profile (user, consumption, behavior, content) guides the business
• Conduct business around the user life cycle (customer acquisition, engagement, retention)
• Data processing layers: data source => algorithm layer => business layer
• Tagging is a capacity for abstraction: through profile learning on the user portrait and tag extraction on items, tag-based recall can be completed
• Tag-based recall is simple to compute and is one recall strategy among others
10 solutions to MNIST
algorithm
tool
Logistic Regression
from sklearn.linear_model import LogisticRegression
CART, ID3 (decision trees)
from sklearn.tree import DecisionTreeClassifier
LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Naive Bayes
from sklearn.naive_bayes import BernoulliNB
SVM
from sklearn import svm
KNN
from sklearn.neighbors import KNeighborsClassifier
Adaboost
from sklearn.ensemble import AdaBoostClassifier
XGBoost
from xgboost import XGBClassifier
TPOT
from tpot import TPOTClassifier
keras
import keras
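As a runnable illustration of the first row of the table, here is Logistic Regression on scikit-learn's built-in 8×8 digits dataset, a lightweight stand-in for full MNIST (the split and parameter values are arbitrary choices, not from the notes):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 8x8 digit images, flattened to 64 features per sample
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

The other rows plug in the same way: swap the classifier construction line and keep the fit/predict calls.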
TPOT: an AutoML tool based on Python
TPOT: github.com/EpistasisLa… (6.2K stars)
TPOT can handle feature selection and model selection, but not data cleaning
It is very fast on small data but slow on large data; you can sample a small portion of the data first and run TPOT on that
TPOT:
• Currently supports supervised learning only
• Supported classifiers include Bayesian methods, decision trees, tree ensembles, SVM, KNN, linear models, and XGBoost
• Supported regressors include decision trees, tree ensembles, linear models, and XGBoost
• Data preprocessing: binarization, clustering, dimensionality reduction, standardization, regularization, etc.
• Feature selection: based on tree models, on variance, or on an F-value percentile
• The training process can be exported as a .py file in the form of an sklearn Pipeline using the export() method