Why is user profiling needed and how is it helpful to the business

We have entered the second half of the Internet era, where growth is driven by data

The starting point of data analysis is insight into user behavior and needs

User profiling helps us answer three questions about users:

Who they are

Where they come from

Where they are going

What dimensions can be used to design user tags

The eight-character principle: analyze consumer behavior along four tag dimensions (user, consumption, behavior, content)

User tags: gender, age, region, income, education, occupation, etc

Consumption tags: consumption habits, purchase intent, sensitivity to promotions

Behavior tags: time, frequency, duration, favorites, clicks, likes, and ratings

(User Behavior can be divided into Explicit Behavior and Implicit Behavior)

Content analysis: analyze the content users typically browse, such as sports, games, or gossip

The three phases of the user life cycle

Customer acquisition: attract new users through more precise marketing;

Customer engagement (stickiness): personalized recommendation, search ranking, scenario-based operations, etc.;

Customer retention: churn-rate prediction, analyzing key nodes to reduce churn.

Where tags come from

Typical sources are:

• PGC (Professionally Generated Content): produced by experts

• UGC (User Generated Content): produced by ordinary users

Tags are abstractions of higher-dimensional information (a form of dimensionality reduction)

Clustering algorithms: K-Means, EM clustering, Mean-Shift, DBSCAN, hierarchical clustering (plus PCA for dimensionality reduction, which is not itself a clustering algorithm)

• Large amounts of data need to be tagged (labeling)

• Use user tags to recommend products (recommendation algorithm)

How K-Means works

K-Means:

• Step 1: select K points as the initial cluster centers, usually chosen at random from the data set;

• Step 2: assign each point to its nearest cluster center, forming K clusters, then recompute the center of each cluster;

• Repeat Step 2 until the cluster assignments no longer change, or until a set maximum number of iterations is reached (the algorithm then stops even if the centers are still moving).
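The steps above can be sketched with sklearn's `KMeans` on toy data (the points and the choice K=2 are illustrative, not from the notes):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated groups (illustrative values)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Step 1: pick K initial centers; Step 2: assign + recompute centers,
# repeated until assignments stabilize or max_iter is reached.
kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                    # cluster index for each point
print(kmeans.cluster_centers_)   # final center of each cluster
```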

Data normalization methods: Min-Max, Z-Score, decimal scaling

Min-Max normalization

Projects raw data into a specified range [min, max]

New value = (original value – minimum) / (maximum – minimum)

When min = 0 and max = 1, the data is normalized to [0, 1]

sklearn: MinMaxScaler

Data normalization:

After normalization to [0, 1], feature data of different dimensions can be compared on the same scale
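A minimal sketch of Min-Max scaling with sklearn's `MinMaxScaler` (the sample matrix is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0., 10.], [5., 20.], [10., 30.]])  # illustrative values

scaler = MinMaxScaler()              # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)   # (x - min) / (max - min), per column
print(X_scaled)
```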

Z-Score normalization

Rescales the original data to standard-normal form (mean 0, standard deviation 1)

New value = (original value – mean) / standard deviation

sklearn: preprocessing.scale()

The most commonly used data standardization method today.

Answers the question: "How many standard deviations is a given data point from the mean?"
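A minimal sketch of Z-Score standardization with sklearn's `preprocessing.scale` (the data is illustrative):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., 10.], [2., 20.], [3., 30.]])  # illustrative values

X_z = preprocessing.scale(X)   # (x - mean) / std, per column
print(X_z.mean(axis=0))        # ~0 in each column after scaling
print(X_z.std(axis=0))         # ~1 in each column after scaling
```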

Decimal scaling normalization

Normalizes by moving the decimal point: divide each value by 10^j, where j is the smallest integer such that all scaled values have absolute value less than 1

Can be implemented with numpy
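A minimal numpy sketch of decimal-scaling normalization (the helper name `decimal_scaling` and the sample values are hypothetical):

```python
import numpy as np

def decimal_scaling(x):
    # j = smallest integer with max(|x|) / 10**j < 1
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / (10 ** j)

x = np.array([-97., 23., 45., 88.])   # max |x| = 97 -> j = 2
print(decimal_scaling(x))             # ~ [-0.97, 0.23, 0.45, 0.88]
```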

Evaluation metrics: accuracy, precision, recall, F-measure (F1)
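These metrics can be computed with `sklearn.metrics`; the labels below are made up for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # hypothetical predictions

print(accuracy_score(y_true, y_pred))   # fraction of all correct predictions
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision, recall
```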

What is TF-IDF

TF: Term Frequency

The importance of a word is proportional to the number of times it appears in the document.

IDF: Inverse Document Frequency

The degree to which a word distinguishes documents. The fewer documents a word appears in, the greater its distinguishing power, and the larger its IDF

Content-based recommended system steps

• Item Representation:

Extract features for each item

• Profile Learning:

Learn a user's preference profile from the features of items the user has liked (or disliked) in the past;

• Recommendation Generation:

Based on the user profile and the features of candidate items, recommend the items with the highest relevance.

• The tag can be used as a Profile of the user or as an Item feature

• Matching between User=>Item:

• SimpleTagBased

• NormTagBased

• TagBased-TFIDF

• Clustering is a way of reducing dimensions and requires a definition of distance

• Define the dimensions of the user profile (user, consumption, behavior, content) to guide the business

• Conduct business around the user life cycle (customer acquisition, engagement, retention)

• Data processing layers: data source => algorithm layer => business layer

• A tag is an abstraction capability: through Profile Learning on the user portrait and tag extraction from items, tag-based recall can be completed

• Tag-based recall is simple to compute and is one recall strategy among several
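A minimal sketch of the SimpleTagBased score, score(u, i) = Σ_t count(u, t) · count(t, i), where count(u, t) is how often user u used tag t and count(t, i) is how often tag t was applied to item i (the records and helper name are hypothetical):

```python
from collections import defaultdict

# Hypothetical (user, item, tag) records, not from the notes
records = [
    ("u1", "itemA", "sports"), ("u1", "itemB", "sports"),
    ("u1", "itemA", "news"),   ("u2", "itemB", "games"),
]

user_tags = defaultdict(lambda: defaultdict(int))  # count(u, t)
tag_items = defaultdict(lambda: defaultdict(int))  # count(t, i)
for u, i, t in records:
    user_tags[u][t] += 1
    tag_items[t][i] += 1

def simple_tag_based(user):
    # score(u, i) = sum over tags t of count(u, t) * count(t, i)
    scores = defaultdict(float)
    for t, wut in user_tags[user].items():
        for i, wti in tag_items[t].items():
            scores[i] += wut * wti
    return dict(scores)

print(simple_tag_based("u1"))
```

NormTagBased additionally normalizes the two counts, and TagBased-TFIDF penalizes popular tags, but both keep this same summation structure.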

10 solutions to MNIST

| Algorithm | Tool |
| --- | --- |
| Logistic Regression | `from sklearn.linear_model import LogisticRegression` |
| CART, ID3 (decision trees) | `from sklearn.tree import DecisionTreeClassifier` |
| LDA | `from sklearn.discriminant_analysis import LinearDiscriminantAnalysis` |
| Naive Bayes | `from sklearn.naive_bayes import BernoulliNB` |
| SVM | `from sklearn import svm` |
| KNN | `from sklearn.neighbors import KNeighborsClassifier` |
| AdaBoost | `from sklearn.ensemble import AdaBoostClassifier` |
| XGBoost | `from xgboost import XGBClassifier` |
| TPOT | `from tpot import TPOTClassifier` |
| Keras | `import keras` |
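As a quick illustration of one row above, a `LogisticRegression` classifier trained on sklearn's built-in `load_digits` (a small 8x8 stand-in for MNIST, used here so the example stays self-contained):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load_digits ships with sklearn: 1797 8x8 grayscale digit images
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy: {acc:.3f}")
```

The other sklearn classifiers in the table can be swapped in with the same fit/predict interface.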

TPOT: a Python-based AutoML tool

TPOT github.com/EpistasisLa… (6.2 K)

TPOT can solve: feature selection, model selection, but not data cleaning

It is fast on small data sets but very slow on large ones; you can sample a small portion of the data first and run TPOT on that

TPOT:

• Only supervised learning for now

• Supported classifiers include naive Bayes, decision trees, tree ensembles, SVM, KNN, linear models, and XGBoost

• Supported regressors include decision trees, tree ensembles, linear models, and XGBoost

• Data preprocessing: binarization, clustering, dimension reduction, standardization, regularization, etc

• Feature selection: based on tree model, based on variance, based on percentage of F-value

• The best pipeline found during training can be exported as a sklearn Pipeline to a .py file using the export() method

Mind Map: