Summary: The MaxCompute challenge uses SQL for sequential data processing — not MR and functions
Daily preparation of data processing tasks, the main method is to use SQL. First, BECAUSE I have a good grasp of SQL (more than ten years of experience in data development, I dare not tell others that I am not good at these keywords), so MR and function are not involved. In my years with MaxCompute, I should have written no more than 10 functions, mainly because of my poor JAVA skills. I remember writing an ID card number verification function in the early years, when there was a project feedback a piece of SQL originally 2 minutes, using my function became 12 minutes. At that time, the same project team came to me for MaxCompute, and the head of the development team came to me and asked me to tune my code. I’m scared. I know what kind of shit I am. Finally, I shamelessly asked the r&d to help me optimize, and the performance was finally improved. After this, I will not be able to randomly create functions, after all, MaxCompute official recommend using SQL as much as possible, SQL is optimized method, using MR and custom function performance is difficult to guarantee. That’s why I’m still so bad at it. I don’t think IT’s my fault. I just listened to “mom.”
Recently, it is amazing that two projects in a row have encountered the problem of sequence value calculation, and both of them require that functions and MR cannot be used. When my colleague sent me the questions, I found it took me a long time just to understand the questions (the questions were a little round). I was not engaged in front-line development for a long time, so I was a little rusty. It took us a day to solve the same problem the first time and a half a day to solve it the second time. So, summarize and share with you.
So let’s talk about what sequential value processing is. The records in the table themselves are unordered, but the business data is ordered, and generally time is a natural sequence. For example, using the time of my day, CALCULATE how many times I eat and how long I eat. At first glance, it looks like I’m going to write a function.
The problem is simulated as follows:
Question: How many times and for how long did you eat?
Condition: 1- Two “meal” states spaced within 1 hour count as one meal
2- The start time of the next state after the last “eat” state, is the end time of “eat”
From the above analysis, we can get the result: approximately four meals, since the evening meal time is long, according to the rule of two meals (the fourth time seems to be for masturbation). How did I do that? The first step is to remove the irrelevant information, line 1, line 4, and line 1. In the second step, I found the end time of the state by taking advantage of the fact that data is continuous time. In the third step, I identified the feature of 1-hour interval between states, identified other unrelated states mixed in a “eat”, and analyzed that the third “eat” and the fourth “eat” states were two independent states.
So how do you do that in SQL? Sorting is a given, and to sort and deal with state, you have to use window functions. The only window functions available seem to be LAG and Lead.
Window functions:
LAG
Takes the value of the row preceding the current row by offset.
LEAD
Takes the value of the row after the current row by offset.
Official documents:
[
Help.aliyun.com/document\_d…
] (help.aliyun.com/document_de…).
Even with this function, there is a problem where the function needs to specify an offset, and there is no way to know how many states will occur. Is it also useless? Let’s see.
The breakdown of the problem is as follows:
Use the LAG\LEAD function to fetch the state and time of the previous record and the next record and analyze the records:
1- The current state is not “eat”, the previous state is not “eat”, the record is not retained.
2- The current state is not “eat”, the previous state is “eat”, provides the end time for the previous state, the record is not retained.
3- The current state is “eat”, and the previous state and the next state are both “eat”. The record is not retained.
4- If the current status is “eat”, record a status time as the end time of the current status and keep the record.
The diagram below:
Then we get the following table:
Obviously, this is not what we need at the end. Although we found the row in the “eat” state, we found the end of the state for it through the window function. But the table needs to be processed one more time to get the result we want. Again using
LAG\LEAD function, we need to combine the “eating” states spaced within 1 hour.
The problem is broken down again as follows:
Use the LAG\LEAD function to take the start and end times of the previous record and the next record and analyze the records:
1- The start time of the current record minus the End Time of the last point in time. If the time is less than 1 hour, the record is not retained. The state recorded in this row needs to be merged with the previous row into a “eat” state. The rows are highlighted in green in the figure below.
2- The start time of the next time point minus the end time of the current record. If less than 1 hour, this record is merged with the next record. Example Change the end time of the Eat state at the current time point to the end time of the next time point. The lines in orange are shown below.
Then we get the following table:
It doesn’t matter how complicated we thought it was, what kind of loop or recursive logic we needed to implement it, but now it’s solved. We answered the original question with this table. The man ate four times, starting at 7:10, 12:25, 17:40 and 19:45, each time lasting about an hour. This process is a process of finding the required information and eliminating irrelevant information, but the where is a bit complicated.
In fact, from the point of view of analyzing the problem, the problem itself is a bit complicated, and it generally takes a certain amount of time to understand the problem. From an implementation point of view, it’s easier to implement in a high-level language like JAVA or Python, so there’s no point in iterating through it. It seems a bit complicated to implement in SQL (probably because I’ve been using SQL for a long time, I feel as if the process of analyzing a problem is the same as the process of implementing it). , but with the least amount of code (and possibly the best performance). From the maintainability of the comprehensive comparison, or the use of SQL to achieve better.
So you should be able to handle similar problems in the future. If it’s a little difficult, at least you can go back to this example, because I spent a lot of time designing it.
SQL questions:
with ta as(
select*
from values
(1001,’06:05:00′,’sleep’)
,(1001,’07:10:00′,’eat’)
,(1001,’08:15:00′,’phone’)
,(1001,’11:20:00′,’phone’)
,(1001,’12:25:00′,’eat’)
,(1001,’12:40:00′,’phone’)
,(1001,’13:30:00′,’eat’)
,(1001,’13:35:00′,’sleep’)
,(1001,’17:40:00′,’eat’)
,(1001,’18:05:00′,’eat’)
,(1001,’18:25:00′,’eat’)
,(1001,’18:30:00′,’phone’)
,(1001,’19:45:00′,’eat’)
,(1001,’20:55:00′,’phone’)
,(1001,’22:00:00′,’sleep’)
t(id,stime,stat))
— 5 Calculate whether to merge records based on the time before and after the records
selectid,stime
,case whens2<=60 thenetime2 else etime end asetime,stat
from(
— 4 Calculate the time difference recorded before and after
selectid,stime,etime,stat
,datediff(stime,etime1,’mi’) ass1
,datediff(stime2,etime,’mi’) ass2
,etime2
from(
— 3 Calculate the time recorded before and after
selectid,stime,etime,stat
,lag (stime,1) over(partition byid order by stime asc)as stime1
,lag (etime,1) over(partition byid order by stime asc)as etime1
,lead(stime,1) over(partition byid order by stime asc)as stime2
,lead(etime,1) over(partition byid order by stime asc)as etime2
from(
— 2 Record the status before and after identification and find the end time of the status
selectid,stime,stat
,lead(stime,1) over(partition byid order by stime asc)as etime
,lag (stat,1) over(partition byid order by stime asc)as stat1
,lead(stat,1) over(partition byid order by stime asc)as stat2
from(
— 1 Turn the string to time
selectid,to_date(concat(‘2021-06-29 ‘,stime),’yyyy-mm-dd hh:mi:ss’) asstime,stat
fromta)t1)t2
wherestat=’eat’ and not(stat=’eat’ andstat1=’eat’ andstat2=’eat’))t3)t4
wheres1 >60 ors1 is null
;
The original link
This article is the original content of Aliyun and shall not be reproduced without permission.