During the National Day holiday, I tried Grafana for the first time and played around with it. In the days since, I have kept optimizing and improving my dashboard, and now I am quite satisfied with the results: it is gradually approaching the ideal data backend I have in mind.

Demands never stop. Today I found some new requirements for myself, such as variables and filter boxes. I collected them and will keep playing with them when I am free. In learning programming, whenever I face a skill I have not tried, my instinct says it will be difficult, but hands-on experience tells me: don't panic, just use it in order to understand it, and every time I get started quickly. Being good at keeping yourself feeling positive about learning while actually making continuous progress, that is, self-motivation, is a particularly useful soft skill.

So I decided to tidy up some of the advanced SQL statements I have been using with Grafana these days. By advanced, I mean relative to my own SQL ability: specifically, statements that were not in my previous notes and that I did use over and over again in Grafana. Sorting out knowledge points you have just practiced repeatedly consolidates them. With that done, you can move on to the next stage and sprint into the unknown.

Earlier, I wrote a note about why I wanted to play with Grafana and how I was praised by an engineer within 24 hours. It mentioned that I copied the SQL statements the engineer had written, broke them down into meta-knowledge points, understood them one by one (what they do and how to use them), and then applied them directly to test the effect.

An example of disassembling a meta-knowledge point

In this note, I will also use an example to disassemble knowledge. The requirement is: count the number of users with learning behavior each day. Note: learning behavior actually covers a variety of specific behaviors, distributed across two tables.

with data as (
    select date(created_at) as time, user_id from user_comments
    union all
    select date(created_at) as time, user_id from user_activities
)
select time, count(distinct user_id) as daily_learning_users
from data
group by time
order by time

Note: unlike Python, SQL is insensitive to case, line breaks, and indentation. The line breaks and indentation above are just for legibility.

This may seem like a long SQL statement, but it actually has two parts. The data before as is a name we define ourselves for the dataset, and the from data in statement block B refers back to it. Statement block A, enclosed in with data as(), generates that data: it is equivalent to first running a query to produce a dataset named data, and then running further statistics against data.

with data as (【statement block A】) 【statement block B】

with data as() can be nested

After using with data as() for a short while, I wondered: can this thing be nested? One try and it worked. Nesting just makes things look a little more complicated; it doesn't change anything fundamental. As shown below, the data source for statement block A is the raw tables, statement block B queries data, and statement block C queries datax.

with datax as (with data as (【statement block A】) 【statement block B】) 【statement block C】

In practice, I have used three levels of nesting at most, and only occasionally; double-layer nesting comes up more often, and a single layer is the most common.
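
To make this concrete, here is a minimal sketch of double nesting (my own illustration, reusing the user_comments table from the first example): the inner layer computes daily user counts, and the outer layer filters them.

with datax as (
    with data as (
        select date(created_at) as time, user_id from user_comments
    )
    select time, count(distinct user_id) as daily_users
    from data
    group by time
)
select time, daily_users
from datax
where daily_users > 10
order by time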

Merging data rows with union

In the first example, the data inside with data as() comes from merging two tables with union all. Here is the abstraction:

【statement block X】 union all 【statement block Y】

When merging table data, there are three finer-grained situations:

  • Merging multiple columns or rows within one table into a single column or row
  • Merging the columns of table A with the columns of table B
  • Merging the rows of table A with the rows of table B

union handles row-based merges. For example, if statement block X returns a rows and statement block Y returns b rows, combining them with union all yields (a + b) rows. Using union instead takes the set union of the two results, that is, a row that appears in both X and Y is kept only once.
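
A minimal sketch of the difference, using literal rows instead of real tables:

select 1 as user_id
union all
select 1 as user_id;
-- union all keeps duplicates: two rows

select 1 as user_id
union
select 1 as user_id;
-- union deduplicates: one row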

In pandas, the axis argument to pd.concat() handles the merging of rows and columns in different ways.

Functions and as aliases

In the example above, statement blocks X and Y are basically elementary statements. But they still contain a few pieces I had not used before:

  • In date(created_at) as time and count(distinct user_id) as daily_learning_users, what comes before as is an expression, and what follows as is an alias for the result of that expression.
  • date() reduces a complex timestamp to a year-month-day date. Used at ultra-high frequency.
  • count(distinct user_id) means: deduplicate user_id, then count the number of distinct user_id values. Used at ultra-high frequency.

Functions like count() and sum() are basics that get used frequently. In statistics work, what is even more common is the cumulative (running) total. The statement is a fixed combination, so I will not break it apart:

sum(user_count) over (order by exchange_date asc rows between unbounded preceding and current row) as cumulative_user_count
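
For context, here is a sketch of a complete query built around that combination, assuming a hypothetical daily_stats table with exchange_date and user_count columns:

select exchange_date,
    user_count,
    sum(user_count) over (order by exchange_date asc rows between unbounded preceding and current row) as cumulative_user_count
from daily_stats
order by exchange_date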

count(1), count(*), and count(column_name) differ in execution efficiency. Since I am not yet at the stage of writing the best-performing SQL, I will not delve into that here.
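
One semantic difference is still worth noting, separate from efficiency (my addition, a property of standard SQL): count(column_name) skips NULL values, while count(1) and count(*) count every row.

-- hypothetical table t whose column c holds the values (1, NULL, 2)
select count(*) from t;  -- 3: counts every row
select count(1) from t;  -- 3: also counts every row
select count(c) from t;  -- 2: NULL values in c are skipped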

Deduplication in various situations

As mentioned above, using distinct is not complicated; what is complicated is the requirement. The definition of a data indicator must be understood precisely, because different indicators have different deduplication requirements.

Situation A: no deduplication.

Although what we count is user_id, the resulting value is not the number of users who leave messages each day, but the number of messages left each day.

select date(created_at) as time, count(user_id) as daily_message_count
from user_comments
group by time
order by time

Situation B: deduplicate within the same day.

Deduplicate within a day, but not across days: if a user leaves multiple messages in one day, that user still contributes only 1 to that day's user count.

select date(created_at) as time, count(distinct user_id) as daily_message_users
from user_comments
group by time
order by time

Situation C: historical cumulative deduplication.

The cumulative number of message-leaving users is deduplicated over the whole period: as long as a user has ever left a message, they count as 1 and are never counted again.

select count(distinct user_id) as total_message_users
from user_comments

Situation D: deduplicate daily and historically at the same time.

Suppose we want to know the number of new message-leaving users each day: if a user has left a message before, they do not count; otherwise they count as 1 on the day of their first message. This situation is a little more complicated than the previous three, but it also comes up quite frequently.

with data as (
    select distinct on (user_id) user_id, date(created_at) as time
    from user_comments
)
select time,
    count(user_id) as daily_new_message_users,
    count(user_id) over (order by time asc rows between unbounded preceding and current row) as cumulative_message_users
from data
group by time, user_id
order by time
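
One caveat worth adding (my note, not part of the original statement): distinct on is PostgreSQL-specific, and without an order by it keeps an arbitrary row for each user_id. A sketch that explicitly keeps each user's earliest message date:

select distinct on (user_id) user_id, date(created_at) as time
from user_comments
order by user_id, created_at asc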

A few common tips

limit specifies how many rows of data to display. In other words, without this clause, all rows of the query result are returned. Before I knew this, I would sometimes casually query a table at the command prompt and print so many rows that I could not scroll to the bottom... In a data backend, limit is usually combined with sorting to display "leaderboard" data, such as how many times users studied or total redemption amounts.

select * from table_name limit 50;
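
And a sketch of the leaderboard pattern mentioned above, assuming a hypothetical count of activities per user in user_activities, showing the top 10:

select user_id, count(1) as activity_count
from user_activities
group by user_id
order by activity_count desc
limit 10;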

order by specifies which field the data is sorted by. The default order is ascending (asc); add desc for descending order.

select * from table_name order by column_name;
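
For descending order, add desc explicitly:

select * from table_name order by column_name desc;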

group by specifies which fields the data is grouped by. Many reports are counted by day. It has already been used implicitly several times in the examples above, so I will not give a separate example.

Multi-table joint queries

Finally, a relatively complex case: joint queries across multiple tables, or across tables and custom data sources such as data. A relatively simple example: query user_id from users_extra based on the input variable user_name, and then use that user_id to query the user_activities table.

with data as (
    select user_id, user_name
    from users_extra
    where user_name = '$user_name'
)
select count(1)
from user_activities, data
where user_activities.user_id = data.user_id
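
For reference, the comma-separated from list above is an implicit inner join; the same query can be written equivalently with explicit join syntax:

with data as (
    select user_id, user_name
    from users_extra
    where user_name = '$user_name'
)
select count(1)
from user_activities
join data on user_activities.user_id = data.user_id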

The prerequisite for such a joint query is that the data sources can be related through some field. More complex examples can in turn be disassembled into more unit knowledge points; I will not expand on them here.

By the way, the user_name = '$user_name' clause above is how Grafana references a custom variable, and it is what makes drop-down filtering work. This is exactly one of the new requirements I mentioned at the beginning of this article. By the time the article was revised and rewritten, that requirement had already been implemented. That was fast!

Summary

If one day you, like me, start learning advanced, complex SQL statements or some other skill, don't panic. Find some ready-made examples (like my notes) to digest, break them down into small pieces, then put them back together and use them, and you will find: that's all there is to it.

This process is like playing with blocks! Have fun!

If this note helps you, please leave a comment and let me know; that will encourage me to sort out and share more. Fellow engineers on Juejin are welcome to point out anything I got wrong.