1. Requirements background
In Internet digital operations, there is a kind of data analysis application that is unique to the Internet industry: path analysis. Path analysis visually displays the pages upstream and downstream of a specific page and analyzes the distribution of the paths users take through the product. For example, when users use an APP, how do they get from the [home page] to the [details page]? How do they reach the [details page], [play page], and [download page] from the [home page]? Path analysis can also help us find the nodes at which users leave.
In the technical design for this scenario, we divide access data by session and mine users' frequent access paths. Users can view the paths around a selected node in real time, set the start or end point of a path, and compare the conversion results of different target groups (for example, new versus active users) on the same behavior path, meeting the requirements of fine-grained analysis.
1.1 Application Scenarios
In scenarios that require path analysis, users mainly care about the following questions:
- What are the main paths users take in the APP, ordered by conversion rate from high to low?
- Where do users actually go after leaving the intended path?
- How do behavior paths differ for users with different characteristics?
A real business scenario shows how the path analysis model solves such problems.
[Business Scenario]
Analyze the main behavior paths by which "active users" reach the target landing page [small video page] (the daily data volume is on the order of one billion records, and calculation results must be returned in about 1 second).
[User operations]
- Select the start/end page and add the "user" filter.
- Select the number of visits (PV) or the number of sessions (SV).
- Click Query; the results are returned instantly.
2. Basic concepts
Before proceeding to the specific data model and engineering architecture design, I will introduce some basic concepts to help you better understand this article.
2.1 Path Analysis
Path analysis is one of the common data mining methods. It is mainly used to analyze the distribution of the paths users take when using a product and to mine users' frequent access paths. Like funnel analysis, path analysis examines the steps users take as they move through your site or application, but it can freely explore multiple paths rather than a single predefined path.
2.2 Session and Session Time
Different from sessions in web applications, a session in data analysis refers to a series of interactions that occur on a website within a specified period of time. Session Time in this model means that when the interval between two behaviors exceeds the Session Time, the two behaviors are considered not to belong to the same path.
2.3 Sankey diagram
A Sankey diagram, also known as a Sankey energy distribution diagram or Sankey energy balance diagram, is a specific type of flowchart in which the width of each branch corresponds to the size of the data flow. As shown in Figure 4.1-1, each edge represents the traffic from the previous node to the current node. A complete Sankey diagram consists of node data and node conversion rates (red box in the figure below) and edge data and edge conversion rates (black box in the figure below). See [3.5 Conversion rate Calculation] for how conversion rates are calculated.
2.4 Adjacency list
Constructing a Sankey diagram can be reduced to the problem of compressing and storing a graph. A graph usually consists of several parts:
- Edges
- Vertices
- Weights
- Degrees
In this model, we use an adjacency list for storage. An adjacency list is a common compressed storage structure for graphs: it keeps the graph's nodes and edges in linked lists and omits the edges that do not exist between nodes, thereby compressing the matrix. The adjacency list is constructed as follows:
In (a), the left side is the vertex node, which contains the vertex data and a pointer to the first edge; the right side is the edge node, which contains the weight of the edge, its endpoint information, and a pointer to the next edge. A complete adjacency list is similar in structure to a HashMap, as shown in (b): the left side is a sequential list that stores the vertex nodes in (a), and each vertex node corresponds to a linked list that stores the edges connected to that node. In the page path model, we modified the structure of the vertex and edge nodes to meet the needs of the model; for details, see Section 4.1.
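As a quick illustration of this structure (the class names and fields below are illustrative only, not the actual structures used in the model; see Section 4.1 for those), an adjacency list can be sketched in Scala as vertex nodes that point to the head of an edge list, with each edge node holding a weight, its endpoint, and a pointer to the next edge:

// Minimal adjacency-list sketch; EdgeNode and VertexNode are illustrative names only.
case class EdgeNode(weight: Long, target: String, next: Option[EdgeNode])
case class VertexNode(name: String, firstEdge: Option[EdgeNode])

// Vertex A with edges A -> B (weight 3) and A -> C (weight 1), chained through `next`
val vertexA = VertexNode("A", Some(EdgeNode(3, "B", Some(EdgeNode(1, "C", None)))))

// Walk a vertex's edge list by following the `next` pointers
def edgesOf(v: VertexNode): List[EdgeNode] =
  Iterator.iterate(v.firstEdge)(_.flatMap(_.next)).takeWhile(_.isDefined).map(_.get).toList

edgesOf(vertexA).foreach(e => println(s"${vertexA.name} -> ${e.target} (weight ${e.weight})"))

Edges that do not exist are simply absent from a vertex's list, which is where the space saving over an adjacency matrix comes from.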
2.5 Tree pruning
Pruning is an important step in constructing a tree. It removes unimportant nodes to reduce the complexity of computation or search. In the page path model, the pruning step trims the tree built from the raw data and removes unqualified branches, ensuring that every path from the root node to a leaf node in the tree is complete.
2.6 PV and SV
PV stands for Page View, the number of visits; in this model, it is the number of visits within a period of time. SV stands for Session View, the number of sessions; in this model, it is the number of sessions that have passed through a given access path. For example, given path 1: A → B → C → D → A → B and path 2: A → B → D, the PV of A → B is 2 + 1 = 3 and its SV is 1 + 1 = 2.
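As a quick sanity check of these definitions, here is a tiny Scala sketch (illustrative only) that counts the PV and SV of the edge A → B over the two example paths above:

// Each path is (sessionId, sequence of pages); the data mirrors the example above.
val paths = Seq(
  ("s1", Seq("A", "B", "C", "D", "A", "B")), // path 1
  ("s2", Seq("A", "B", "D"))                 // path 2
)

// Every occurrence of the edge A -> B, tagged with the session it comes from
val abHits = for {
  (sid, pages) <- paths
  window <- pages.sliding(2)
  if window == Seq("A", "B")
} yield sid

val pv = abHits.size          // 3: twice in path 1, once in path 2
val sv = abHits.distinct.size // 2: two distinct sessions
println(s"PV=$pv, SV=$sv")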
3. Data model design
This section introduces the design of the data model, including the data flow, path division, PV/SV calculation, and the conversion rate calculation for the resulting Sankey diagram.
3.1 Overall data flow
Data is obtained from a unified data warehouse, calculated using Spark, written to Clickhouse, and cold backed up using Hive. The data flow diagram is shown in Figure 3.1-1.
Figure 3.1-1
3.2 Technology selection
Clickhouse is not the focus of this article and will not be described in detail here; I will only briefly explain why it was chosen.
We chose Clickhouse because of its columnar storage and extremely fast queries. Take a look at the data volume and query speed (as of the time of writing):
Figure 3.2-1
The query results look like this:
Figure 3.2-2
3.3 Data Modeling
3.3.1 Obtaining page Information and dividing sessions
The page path model maps the various event IDs to their corresponding page IDs in order to analyze page paths. The concept of a session was introduced in Section 2.2 and will not be repeated here. We currently use a flexible session partition so that users can query page conversion information under sessions of different time granularities (5, 10, 15, 30, and 60 minutes).
Assume there are users A and B. User A's behavior events are E1, E2, E3, ..., the corresponding pages are P1, P2, P3, ..., the events occur at times T1, T2, T3, ..., and the selected session interval is TG. As shown in the figure, because T4 - T3 > TG, pages P1, P2, P3 are divided into the first session, P4, P5 into the second session, and P6 and the following pages into a new session.
A simplified Scala implementation is as follows:
import scala.collection.mutable.ArrayBuffer

def splitPageSessions(timeSeq: Seq[Long], events: Seq[String], interval: Int)
                     (implicit separator: String): Array[Array[Array[String]]] = {
  // events is the collection of events; timeSeq is the collection of corresponding event times
  if (events.contains(separator))
    throw new IllegalArgumentException("Separator shouldn't be in events.")
  if (events.length != timeSeq.length)
    throw new Exception("Events and timeSeq not in equal length.")
  val timeBuf = ArrayBuffer[String](timeSeq.head.toString) // times, with session-separator markers inserted
  val eventBuf = ArrayBuffer[String](events.head)          // events, with session-separator markers inserted
  if (timeSeq.length >= 2) {
    events.indices.tail.foreach { i =>
      // If the gap between two adjacent events exceeds the configured interval (in minutes),
      // insert a separator so the session can be split later
      if (timeSeq(i) - timeSeq(i - 1) > interval * 60000) {
        timeBuf += separator
        eventBuf += separator
      }
      timeBuf += timeSeq(i).toString
      eventBuf += events(i)
    }
  }
  // Split the flat collections back into per-session time/event arrays at the separator markers
  val tb = timeBuf.mkString(",").split(s",\\$separator,").map(_.split(","))
  val eb = eventBuf.mkString(",").split(s",\\$separator,").map(_.split(","))
  // Pair each session's times with its events; arrays instead of tuples simplify downstream processing
  tb.zip(eb).map(t => Array(t._1, t._2))
}
3.3.2 Deduplication of adjacent Pages
Different events may correspond to the same page, so adjacent duplicate pages need to be removed. Therefore, after dividing sessions, the next step is to deduplicate adjacent pages.
Figure 3.3-2
After deduplicating adjacent pages, the data looks like this:
Figure 3.3-3
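The deduplication itself is simple; a minimal Scala sketch (the page values are illustrative) that collapses consecutive identical pages within one session could look like this:

// Collapse runs of identical adjacent pages, e.g. P1,P1,P2,P2,P1 -> P1,P2,P1
def dedupAdjacent(pages: Seq[String]): Seq[String] =
  pages.foldLeft(Vector.empty[String]) { (acc, p) =>
    if (acc.lastOption.contains(p)) acc else acc :+ p
  }

println(dedupAdjacent(Seq("P1", "P1", "P2", "P2", "P1"))) // Vector(P1, P2, P1)

Note that only adjacent duplicates are removed: a page visited again later in the session is kept.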
3.3.3 Getting the four pages before and after each page
A window function is then applied to the above data to obtain the four pages before and after each page within each session, where the SID is assembled from the user ID and the session number. For example, the following 7 records are generated for the first session (session 0) of user A above. The page column in the figure is the current page; empty pages are denoted by -1.
Figure 3.3-4
Counting the remaining sessions, we get a total of 7+7+6+4+5=29 records; the full set of records is shown in the figure below.
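A hedged Spark sketch of this window-function step follows; the toy DataFrame and the column names sid, ts, and page_id are assumptions, not the production schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("page-path-demo").master("local[*]").getOrCreate()

// Toy input: the first session of user A; columns are session id, event time, page id
val pages = spark.createDataFrame(Seq(
  ("A_0", 1L, "P1"), ("A_0", 2L, "P2"), ("A_0", 3L, "P3")
)).toDF("sid", "ts", "page_id")

// Order each session's pages by event time; missing neighbour pages default to "-1"
val w = Window.partitionBy("sid").orderBy("ts")
val withNeighbours = (1 to 4).foldLeft(pages) { (df, i) =>
  df.withColumn(s"page_id_previous$i", lag(col("page_id"), i, "-1").over(w))
    .withColumn(s"page_id_next$i", lead(col("page_id"), i, "-1").over(w))
}
withNeighbours.show(false)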
3.3.4 Calculating PV/SV of forward and backward paths
Take page together with page_id_previous1, page_id_previous2, page_id_previous3, and page_id_previous4 to get the backward five-level path (path_direction = 2). Take page together with page_id_next1, page_id_next2, page_id_next3, and page_id_next4 to get the forward five-level path (path_direction = 1). Then calculate the PV and SV of each path (SV is deduplicated by sid). This produces the data set dfSessions.
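Continuing the hypothetical sketch above (it reuses withNeighbours and the assumed column names), the forward five-level paths and their PV/SV might be computed roughly as follows; the backward paths are analogous, using the page_id_previous columns and path_direction = 2:

import org.apache.spark.sql.functions._

// Forward five-level path: the current page plus its next four pages (path_direction = 1).
// PV counts every occurrence of the path; SV counts distinct sessions (deduplicated by sid).
val forwardPaths = withNeighbours
  .select(
    col("sid"),
    col("page_id").as("page_id_lv1"),
    col("page_id_next1").as("page_id_lv2"),
    col("page_id_next2").as("page_id_lv3"),
    col("page_id_next3").as("page_id_lv4"),
    col("page_id_next4").as("page_id_lv5"))
  .withColumn("path_direction", lit(1))
  .groupBy("page_id_lv1", "page_id_lv2", "page_id_lv3", "page_id_lv4", "page_id_lv5", "path_direction")
  .agg(count(lit(1)).as("pv"), countDistinct("sid").as("sv"))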
Looking at dfSessions directly can be confusing, so here are two example records. The first result record:
Figure 3.3-4
This is a forward path (path_direction = 1), read from left to right in the figure below. The two corresponding paths are as follows:
Figure 3.3-5
The second result record:
Figure 3.3-6
Its PV is 2, and the two corresponding paths are shown below. Its SV is 1 because the sids of the two paths are the same: both paths were generated by user A in session S1.
Figure 3.3-7
3.3.5 Calculating the PV/SV of paths at each level
Then, based on the dfSessions data, PV and SV are grouped by page_id_lv1 to obtain the PV and SV of the level-1 paths, for which path_direction is set to 0.
The PV and SV of the level-2, 3, 4, and 5 paths are calculated in the same way, and all results are combined as follows:
3.4 Data Writing
The data produced by the Spark jobs needs to be written to Clickhouse to serve online queries and to Hive as a cold backup, from which the Clickhouse data can be restored.
The Clickhouse tables use the Distributed table engine. A Distributed table does not store any data itself; it acts as a transparent proxy for the data shards and automatically routes data to the nodes in the cluster, so the Distributed engine has to work together with other table engines. The table data of the user path analysis model is stored in the cluster's shards using random sharding, which raises the question of how to write data to Clickhouse.
At the beginning of the project, we wrote data through the distributed table. The specific write process is as follows:
- The client establishes a JDBC connection with node A in the cluster and writes data through an HTTP POST request.
- After receiving the data, node A does two things: first, it splits the data according to the sharding rules; second, it writes the data belonging to its own shard into its local table.
- Node A writes the data belonging to remote shards into temporary bin files under the corresponding directories, named like /database@host:port/[increase_num].bin.
- Node A tries to establish connections with the remote shards.
- A separate set of listening tasks monitors the temporary bin files generated above and sends the data to the remote shards, each batch in a single thread.
- The remote shards receive the data and write it into their local tables.
- Node A confirms that the write is complete.
As you can see from the above process, the Distributed table is responsible for writing data to all shards, so the node holding the JDBC connection has very high inbound and outbound traffic, which causes the following problems:
- The load on that single node is too high, mainly reflected in memory usage, NIC inbound/outbound traffic, and the number of TCP connections in the wait state; the machine's health is poor.
- As the business grows, more models will use Clickhouse for OLAP, which means more data. Continuing to write this way will inevitably overwhelm a single machine.
- In the future we will make the Clickhouse cluster highly available and use the more reliable ReplicatedMergeTree engine. With that engine, writing through the distributed table can also cause data inconsistency.
To address this, the write path was reworked to write to the local tables directly via DNS round-robin (a rough sketch follows the results below). After the change:
- The number of TCP connections in the wait state on the machines used for JDBC connections decreased from 90 to 25, a drop of more than 72%;
- The peak inbound traffic on the machines used for JDBC connections decreased from 645 M/s to 76 M/s, a drop of more than 88%;
- The outbound traffic on the JDBC-connected machines caused by data distribution was about 92 M/s; after the change, this traffic was eliminated.
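As a rough illustration of the reworked write path (the DNS name, database, table, and column values below are placeholders, and a ClickHouse JDBC driver on the classpath is assumed), a DNS round-robin write to a local table might look like this:

import java.net.InetAddress
import java.sql.DriverManager
import scala.util.Random

// Resolve the cluster's DNS name to its node addresses and pick one per batch,
// then write to the *local* table on that node instead of the Distributed table.
val hosts = InetAddress.getAllByName("ck-cluster.example.com").map(_.getHostAddress)
val host  = hosts(Random.nextInt(hosts.length))

val conn = DriverManager.getConnection(s"jdbc:clickhouse://$host:8123/path_analysis")
val stmt = conn.prepareStatement(
  "INSERT INTO path_table_local (node_depth, page_id_lv1, page_id_lv2) VALUES (?, ?, ?)")
try {
  stmt.setInt(1, 1)
  stmt.setString(2, "home_page")
  stmt.setString(3, "detail_page")
  stmt.executeUpdate()
} finally {
  stmt.close()
  conn.close()
}

Because each batch lands on a different node's local table, the distribution work previously done by the single Distributed-table node is spread across the cluster.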
In addition, there are two types of write operations: synchronous and asynchronous. An asynchronous write returns success once the data has been written to the Distributed node's local storage, while a synchronous write returns success only after the data has actually been written to the remote shards. We can adjust parameters to control the wait timeout of synchronous writes.
3.5 Conversion rate Calculation
Select the corresponding dimensions on the front-end page and choose a start page:
The back end then queries Clickhouse as follows:
With node_depth = 1 and the level-1 page (page_id_lv1) equal to the selected page, get the level-1 page and its PV/SV.
With node_depth = 2 and the level-1 page (page_id_lv1) equal to the selected page, take the top 10 by PV/SV in descending order to get the level-2 pages and their PV/SV.
With node_depth = 3 and the level-1 page (page_id_lv1) equal to the selected page, take the top 20 by PV/SV in descending order to get the level-3 pages and their PV/SV.
With node_depth = 4 and the level-1 page (page_id_lv1) equal to the selected page, take the top 30 by PV/SV in descending order to get the level-4 pages and their PV/SV.
With node_depth = 5 and the level-1 page (page_id_lv1) equal to the selected page, take the top 50 by PV/SV in descending order to get the level-5 pages and their PV/SV.
Conversion rate calculation rules:
Page conversion rate:
Suppose there are paths A-B-C, A-D-C, and A-B-D-C, where A, B, C, and D are four different pages.
To calculate the conversion rate of the level-3 page C:
(the PV/SV sum of all paths with node depth 3 whose level-3 page is C) ÷ (the PV/SV of the level-1 page)
Path conversion rate:
Suppose there are paths A-B-C, A-D-C, and A-B-D-C, where A, B, C, and D are four different pages.
To calculate the B-C conversion rate within the A-B-C path:
(the PV/SV of the path A-B-C) ÷ (the PV/SV sum of all paths with node depth 3 whose level-2 page is B)
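To make the arithmetic concrete, here is a tiny Scala sketch of the path conversion rate rule above; the PV numbers are made up for illustration, and only depth-3 paths are considered:

// Hypothetical PVs of the depth-3 paths in the example (A-B-D is the depth-3 prefix of A-B-D-C)
val depth3Paths = Map(
  Seq("A", "B", "C") -> 10L,
  Seq("A", "D", "C") -> 5L,
  Seq("A", "B", "D") -> 5L
)

// B-C conversion inside A-B-C: PV of A-B-C divided by the PV sum of all depth-3 paths whose level-2 page is B
val numerator   = depth3Paths(Seq("A", "B", "C"))
val denominator = depth3Paths.collect { case (p, pv) if p(1) == "B" => pv }.sum
val bToCRate    = numerator.toDouble / denominator // 10 / (10 + 5) = 0.666...
println(bToCRate)

The page conversion rate of C is computed the same way, except that the numerator sums all depth-3 paths whose level-3 page is C and the denominator is the PV of the level-1 page.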
4. Engineering-side architecture design
This section explains the processing architecture on the engineering side, covering several aspects: Sankey diagram construction, path assembly and conversion rate calculation, and pruning.
4.1 Constructing the Sankey diagram
As can be seen from the prototype diagram above, we need to construct a Sankey diagram, which for the engineering side means constructing a weighted path tree.
Simplifying the figure above, the requirement can be translated into constructing a weighted tree stored as adjacency lists. The diagram on the left below is our adjacency list design: the sequential list on the left stores the Vertex nodes, each containing the node name, node code, and a pointer to its Edge list; each Vertex points to a linked list of Edges, and each Edge holds the weight of the current edge, its endpoint information, and a pointer to the next edge of the same node.
Figure 4.1-2
Figure 4.1-3
Figure 4.1-2 shows the adjacency lists used in our model, which make some changes to the adjacency list described in Section 2.4. In our Sankey diagram, nodes with the same name but different conversion rates appear at different levels. As links in a path, these nodes cannot be treated as the same node just because they share a name, and they do not form a loop. If the entire Sankey diagram were represented by a single adjacency list, such nodes would be treated as identical, producing loops in the rendered graph. Therefore, the Sankey diagram is split by level, and each pair of adjacent levels is represented by its own adjacency list, as shown in Figure 4.1-2: Level 1 holds the level-1 nodes and their edges into level 2, Level 2 holds the level-2 nodes and their edges into level 3, and so on.
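A minimal sketch of this per-level layout (the types and field names are illustrative, not the production classes):

// One edge of the Sankey diagram: its weight plus the code of the node it points to in the next level
case class Edge(weight: Long, targetCode: String)

// One vertex at a given level: display name, node code, and its outgoing edges to the next level
case class Vertex(name: String, code: String, edges: List[Edge])

// The Sankey diagram as one adjacency list per level:
// levels(0) holds level-1 vertices and their edges into level 2, levels(1) holds level 2 into level 3, and so on
val levels: Array[List[Vertex]] = Array(
  List(Vertex("home", "P1", List(Edge(30, "P2"), Edge(10, "P3")))),
  List(Vertex("detail", "P2", List(Edge(25, "P4"))),
       Vertex("play", "P3", List(Edge(8, "P4"))))
)

Because each level keeps its own list, two nodes that share a page name but sit at different levels live in different Vertex instances, so no artificial loop is introduced.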
4.2 Path Definition
First, let’s review the Sankey diagram:
Looking at the figure above, we need four kinds of data: the PV/SV of each node, the conversion rate of each node, the PV/SV between nodes, and the conversion rate between nodes. They are defined as follows:
PV/SV of a node = total PV/SV of that node at the current level
Node conversion rate = (node PV/SV) / (PV/SV of the path's start node)
PV/SV between nodes = PV/SV flowing from the upper-level node into the current node
Conversion rate between nodes = (PV/SV between the nodes) / (PV/SV of the upper-level node)
Now take a look at the path data stored in Clickhouse. The table structure is as follows:
(
  `node_depth` Int8 COMMENT 'Node depth, 5 levels in total, enumeration values 1-2-3-4-5' CODEC(T64, LZ4HC(0)),
  `page_id_lv1` String COMMENT 'Level-1 page, the start page' CODEC(LZ4HC(0)),
  `page_id_lv2` String COMMENT 'Level-2 page' CODEC(LZ4HC(0)),
  `page_id_lv3` String COMMENT 'Level-3 page' CODEC(LZ4HC(0)),
  `page_id_lv4` String COMMENT 'Level-4 page' CODEC(LZ4HC(0)),
  `page_id_lv5` String COMMENT 'Level-5 page' CODEC(LZ4HC(0))
)
The above are the key fields in the path table, representing the node depth and the page at each level. The table contains both complete paths and intermediate paths. A complete path either reaches a depth of 5 (ending at the exit node or any other node) or has a depth of less than 5 and ends at the exit node; a path with more than 5 levels is treated as a 5-level path. Intermediate paths are intermediate data generated while calculating the metrics and cannot be used as complete paths.
Path data:
(1) Complete path
(2) Incomplete path
Next, we need to select the complete paths from the data and organize the path data into a tree structure.
4.3 Design and Implementation
4.3.1 Overall framework
The overall idea of the back-end implementation is straightforward: read the data, construct the adjacency lists, and prune. So how is complete/incomplete path filtering implemented? We filter out incomplete paths by pruning at the service layer. The following pseudocode describes the whole process:
// 1-1: read the original data layer by layer
// 1-1-1: build the layered Clickhouse SQL
for (int depth = 1; depth <= MAX_DEPTH; depth++) {
    sql.append(select records where node_depth = depth);
}
// 1-1-2: read the data
clickPool.getClient();
records = clickPool.getResponse(sql);

// 2-1: get the parent-child relations between nodes (build bidirectional edges)
findFatherAndSonRelation(records);
findSonAndFatherRelation(records);

// 3-1: pruning
// 3-1-1: clear isolated nodes
for (int depth = 2; depth <= MAX_DEPTH; depth++) {
    while (hasNode()) {
        node = getNode();
        if node does not have father in level depth-1:
            cut out node;
    }
}
// 3-1-2: filter incomplete paths
for (int depth = MAX_DEPTH - 1; depth >= 1; depth--) {
    cut out this path;
}

// 3-2: construct the adjacency list
while (node.hasNext()) {
    sumVal = calculate the sum of pv/sv of this node up to this level;
    edgeDetails = get the edges connected to this node and the end points of those edges;
    sortEdgesByEndPoint(edgeDetails);
    path = new Path(sumVal, edgeDetails);
}
4.3.2 Clickhouse connection pool
ClickHouse has already been introduced above, so its features will not be repeated here. We connect to the ClickHouse server through a simple HTTP connection pool, whose structure is as follows:
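A highly simplified Scala sketch of such a pool over ClickHouse's HTTP interface (default port 8123); the server list, pool size, and class layout follow the clickPool.getClient()/getResponse() calls in the pseudocode above, but are otherwise assumptions; production code would also need timeouts, retries, and authentication:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.concurrent.ConcurrentLinkedQueue

class ClickPool(servers: Seq[String], size: Int) {
  // Pre-created HTTP clients held in a thread-safe queue
  private val clients = new ConcurrentLinkedQueue[HttpClient]()
  (1 to size).foreach(_ => clients.offer(HttpClient.newHttpClient()))

  def getClient(): HttpClient = Option(clients.poll()).getOrElse(HttpClient.newHttpClient())

  def release(client: HttpClient): Unit = clients.offer(client)

  // POST the SQL text to one of the ClickHouse servers and return the raw response body
  def getResponse(sql: String): String = {
    val client = getClient()
    try {
      val server  = servers(scala.util.Random.nextInt(servers.length))
      val request = HttpRequest.newBuilder(URI.create(s"http://$server:8123/"))
        .POST(HttpRequest.BodyPublishers.ofString(sql))
        .build()
      client.send(request, HttpResponse.BodyHandlers.ofString()).body()
    } finally release(client)
  }
}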
4.3.3 Data Reading
As described earlier, we need to read the complete paths from the data.
(
  `node_depth` Int8 COMMENT 'Node depth, enumeration value',
  `page_id_lv1` String COMMENT 'Level-1 page, the start page',
  `page_id_lv2` String COMMENT 'Level-2 page',
  `page_id_lv3` String COMMENT 'Level-3 page',
  `page_id_lv4` String COMMENT 'Level-4 page',
  `page_id_lv5` String COMMENT 'Level-5 page',
  `val` Int64 COMMENT 'Full data value'
)
From the table structure above, you can see that the paths written to the database have already gone through a first round of filtering and have a depth of at most 5. On this basis, we need to distinguish complete paths from incomplete paths based on node_depth and page_id_lvn, and calculate the value of each node.
Conditions for a complete path (a small sketch of the check follows the list):
- node_depth = n, page_id_lvn = EXIT_NODE (n < MAX_DEPTH)
- node_depth = n, page_id_lvn = pageId || page_id_lvn = EXIT_NODE (n = MAX_DEPTH)
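A minimal Scala sketch of this check, based on the complete-path definition used in Section 4.3.4 (the EXIT_NODE marker value, the empty-page value "-1", and MAX_DEPTH = 5 are assumptions):

val MAX_DEPTH = 5
val EXIT_NODE = "exit"   // placeholder for the model's exit marker
val EMPTY     = "-1"     // placeholder for "no page at this level"

// pages holds page_id_lv1 .. page_id_lv5 of one row; nodeDepth is its node_depth value
def isCompletePath(nodeDepth: Int, pages: Seq[String]): Boolean = {
  val lastNode = pages(nodeDepth - 1)
  if (nodeDepth == MAX_DEPTH) lastNode != EMPTY  // depth 5: complete whether or not it ends at the exit node
  else lastNode == EXIT_NODE                     // depth < 5: complete only if the user exited
}

// A depth-3 row A -> B -> exit is complete; a depth-3 row A -> B -> C is not
println(isCompletePath(3, Seq("A", "B", EXIT_NODE, EMPTY, EMPTY)))  // true
println(isCompletePath(3, Seq("A", "B", "C", EMPTY, EMPTY)))        // false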
Now that we know the conditions for a complete path, there are two options for reading the paths. Option 1: filter directly with the above conditions to get the complete paths; because of Clickhouse and back-end performance constraints, the query must be limited with a LIMIT clause. Option 2: read layer by layer, which covers the full data but cannot guarantee the exact number of complete paths retrieved.
Observation shows that the data contains overlapping paths. Suppose there are two paths:
A → B → C → D → EXIT_NODE
A → B → E → D → EXIT_NODE
If both of the above paths exist, we need to calculate the value of each node. In the actual data, the value of a given node can only be obtained from the incomplete (intermediate) paths, so Option 1 is not applicable.
Option 2 then reads the data layer by layer, with the following pseudocode:
for (depth = 1; depth <= MAX_DEPTH; depth++) {
    select
        node_depth as nodeDepth,
        ... ,
        sum(sv) as val
    from table_name
    where ...
        AND (toInt16OrNull(pageId1) = 45)
        AND (node_depth = depth)
        ...
    group by
        node_depth,
        pageId1,
        pageId2,
        ...
    ORDER BY
        ...
    LIMIT
        ...
}
The data read is as follows:
Node1_A_val = 10+20, node2_B_val = 9+15, and so on.
4.3.4 Pruning
As described in 4.3.3, in the read stage we fetch all the raw data layer by layer, and the raw data contains both complete and incomplete paths. The figure below shows a tree constructed directly from the raw data (the original tree). According to our definition of a complete path (the path reaches depth 5 and ends at the exit node or any other node, or the path has a depth of less than 5 and ends at the exit node), the part marked in red (node4_lv1 → node3_lv2) is an incomplete path.
In addition, there are isolated nodes in the original tree (the green node node4_lv2). This is because in the read stage each layer's data is sorted and then fetched, so the relevance between layers cannot be guaranteed: node4_lv2 ranks near the top at the lv2 layer, while its predecessor and successor nodes rank near the bottom and are not selected, which leaves it isolated.
Figure 4.3-3
Therefore, after pulling out the raw data, we need to filter it to keep only the paths we really need. In the model, this filtering is implemented through pruning.
// Clear isolated nodes
for (int depth = 2; depth <= MAX_DEPTH; depth++) {
    while (hasNode()) {
        node = getNode();
        if node does not have any father and son:   // [1]
            cut out node;
    }
}
// Filter incomplete paths
for (int depth = MAX_DEPTH - 1; depth >= 1; depth--) {
    cut out this path;   // [2]
}
In the previous steps we obtained bidirectional edge lists (a father→son list and a son→father list). Therefore, at [1] in the pseudocode above, the edge lists can be used to quickly find the predecessors and successors of the current node and determine whether it is an isolated node.
Similarly, we use the edge lists to trim incomplete paths. For incomplete paths, pruning only needs to consider paths shorter than MAX_DEPTH whose last node is not EXIT_NODE. Therefore, at [2] in the pseudocode above, we only need to check whether a node at the current layer has forward edges (a father-son relationship pointing to the next layer); if it does not, the node is removed.
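A minimal sketch of the isolated-node check at [1] using such bidirectional edge maps (the data structures and names are illustrative):

// fathers(level) maps a node at that level to its predecessor nodes at level - 1;
// sons(level) maps a node at that level to its successor nodes at level + 1.
def isIsolated(node: String, level: Int,
               fathers: Map[Int, Map[String, Set[String]]],
               sons: Map[Int, Map[String, Set[String]]]): Boolean = {
  val fatherSet = fathers.getOrElse(level, Map.empty).getOrElse(node, Set.empty)
  val sonSet    = sons.getOrElse(level, Map.empty).getOrElse(node, Set.empty)
  fatherSet.isEmpty && sonSet.isEmpty   // matches [1]: no predecessor and no successor at all
}

// node4_lv2 from the figure: no father at level 1 and no son at level 3, so it gets pruned
val fathers = Map(2 -> Map("node4_lv2" -> Set.empty[String]))
val sons    = Map(2 -> Map("node4_lv2" -> Set.empty[String]))
println(isIsolated("node4_lv2", 2, fathers, sons))  // true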
Final thoughts
Given the platform's requirements for short query times and visualization, and considering the available storage and computing resources and the specific needs, we split the merging of enumerated paths into two stages: the first merges identical paths within the same day, and the second aggregates paths over the queried date range. We hope this article provides a useful reference for path analysis; each business still needs a design tailored to its own characteristics to serve it well.
Clickhouse is not covered in detail here; if you are interested in it, you are welcome to explore it with me.
Author: Vivo Internet Big Data team