Most developers are familiar with SQL. No matter what language or development platform you work with, as long as a relational database is involved, you will end up using SQL.

In some systems the main program logic lives in stored procedures, for example in data-center products. There, everyone regards SQL as the product's main development language, so we naturally think of SQL as the program itself. In most business systems, however, we usually only do ordinary CRUD, and SQL is just a handful of query statements embedded in Java code or XML files. In that case developers treat SQL merely as a tool for querying and analysis, not as a program.

Let me tell you a real story about SQL optimization at work.

This is a user behavior analysis system with three tables (fields simplified), described below.

The daily_access table records the accesses of the current day; one record corresponds to one access request.

The ip_range table stores IP address segments; the IP addresses from ip_start to ip_end belong to the same region.

The area table records area_id and the corresponding region. The same area may have multiple IP address segments, so the area table may have duplicate data.

The data volumes are roughly 100,000 rows for daily_access and 500,000 rows each for area and ip_range. Prerequisites: the ip_range and area tables correspond one-to-one, and the [ip_start, ip_end] segments do not overlap.
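For reference, here is a minimal sketch of the three table structures. The column names come from the queries below; the data types are my assumptions, with IP addresses kept as plain integers as in the walkthrough later in the article:

-- Assumed structure only: column names are taken from the queries in this
-- article, the types are guesses, and IPs are stored as plain integers.
create table daily_access (
  ip_access number            -- IP of one access request; one row per request
);

create table ip_range (
  id       number,
  area_id  number,            -- region this IP segment belongs to
  ip_start number,            -- first IP of the segment
  ip_end   number             -- last IP of the segment
);

create table area (
  area_id number,
  addr    varchar2(200)       -- region name
);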

The requirement is to count, from these three tables, the number of visits from each region.

The ip_access field of the daily_access table must fall between ip_start and ip_end of the ip_range table. The query statement is as follows:

select count(*), a.addr
  from daily_access d, ip_range r, area a
 where 1 = 1
   and d.ip_access between r.ip_start and r.ip_end
   and r.area_id = a.area_id
 group by a.addr;

This SQL is correct and had been running in the system for some time, but its performance is poor: with the data volume mentioned above, one execution takes about 15 minutes.

You may think that a background analysis system does not need real-time results, and that the output can simply be stored in a result table for later reporting. True, but 100,000 accesses is actually a very small number. With millions of accesses the time consumed would grow dramatically, and the report might not even be ready after running all night.

Therefore the query was optimized: the table with more data is processed first, and only the much smaller filtered result is then joined with the other table:

select count(*), a.addr
  from (select t1.ip_access, t1.area_id, t2.addr
          from (select d.ip_access,
                       (select r.area_id
                          from ip_range r
                         where d.ip_access between r.ip_start and r.ip_end) as area_id
                  from daily_access d) t1,
               area t2
         where t1.area_id = t2.area_id) d,
       area a
 where d.area_id = a.area_id
 group by a.addr;

After this optimization, the larger table is processed first and the much smaller result is then joined to the other table, so the speed improves: one execution takes about 6 minutes. Although the second version is more than twice as fast as the first, the performance of either is clearly hard to accept.

Next, let's look at what the query looks like in the real production system (fields again simplified for clarity):

with vstat_details as (
       select /*+ all_rows materialize */ distinct ip_access n_ip from daily_access ),
     vstat_ip_range as (
       select /*+ all_rows materialize */
              v2.ip_start n_ip, v2.area_id
         from (select v1.dataset, v1.ip_start,
                      last_value(v1.range_start ignore nulls) over (order by v1.ip_start, v1.dataset) range_start,
                      last_value(v1.range_end   ignore nulls) over (order by v1.ip_start, v1.dataset) range_end,
                      last_value(v1.area_id     ignore nulls) over (order by v1.ip_start, v1.dataset) area_id
                 from (select 1 dataset,
                              t1.ip_start,
                              t1.ip_start range_start,
                              t1.ip_end   range_end,
                              t1.area_id
                         from ip_range t1
                        union all
                       select /*+ leading(d) use_hash(r) no_merge(d) full(r) */
                              2 dataset,
                              t2.n_ip ip_start,
                              null range_start,
                              null range_end,
                              null area_id
                         from vstat_details t2) v1) v2
        where v2.ip_start >= v2.range_start
          and v2.ip_start <= v2.range_end
          and v2.dataset = 2)
select /*+ all_rows leading(v,d) use_hash(d,a) no_merge(v) */
       count(*) as n_pageviews,
       a.addr
  from vstat_ip_range v,
       daily_access d,
       area a
 where v.n_ip = d.ip_access
   and v.area_id = a.area_id
 group by a.addr;

Why is a simple query so long?

When writing the first two queries, the developer subconsciously treated SQL as a means of querying and analyzing data rather than as programming. This query, by contrast, does not look at the problem only from the perspective of "querying"; it uses a data structure and an algorithm to solve it. That difference in starting point leads to completely different code.

Now, let’s take the above SQL apart and see how it works.

First, start from the innermost subquery, which preprocesses the ip_range data and adds the mark "1":

select 1 dataset,
       t1.ip_start,
       t1.ip_start range_start,
       t1.ip_end   range_end,
       t1.area_id
  from ip_range t1

Assume ip_range contains the following data (for convenience, IPs are simplified to plain integers):

id   area_id   ip_start   ip_end
1    1         15         20
2    2         22         25
3    3         30         35
4    4         36         40

After tagging, the data structure looks like this:

mark   area_id   ip_start   ip_end   start2
1      1         15         20       15
1      2         22         25       22
1      3         30         35       30
1      4         36         40       36

Then the access IPs (the distinct ip_access values of daily_access, i.e. the vstat_details temporary result) are tagged in the same way with the mark "2":

select 2 dataset,
       t2.n_ip ip_start,
       null range_start,
       null range_end,
       null area_id
  from vstat_details t2

Assuming the following four access records, the tagged data looks like this:

mark   area_id   ip_start   ip_end   start2
2      null      16         null     null
2      null      22         null     null
2      null      24         null     null
2      null      39         null     null

Combining the two result sets (union all) and sorting by the ip_start and mark fields gives the following data structure:

mark   area_id   ip_start   ip_end   start2
1      1         15         20       15
2      null      16         null     null
1      2         22         25       22
2      null      22         null     null
2      null      24         null     null
1      3         30         35       30
1      4         36         40       36
2      null      39         null     null

Next, scanning the sorted rows from top to bottom, each row marked "2" is filled with the area_id, ip_end (range_end) and start2 (range_start) values of the nearest row marked "1" above it; this is exactly what last_value(... ignore nulls) does:

mark   area_id   ip_start   ip_end   start2
1      1         15         20       15
2      1         16         20       15
1      2         22         25       22
2      2         22         25       22
2      2         24         25       22
1      3         30         35       30
1      4         36         40       36
2      4         39         40       36

After removing the data marked “1” from the above temporary table, we get the data we need:

mark   area_id   ip_start   ip_end   start2
2      1         16         20       15
2      2         22         25       22
2      2         24         25       22
2      4         39         40       36

If the ip_start value of a row falls between its start2 (range_start) and ip_end (range_end) values, its area_id is exactly the region we want. The whole process never joins two large tables against each other, so it is very efficient.
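To make the walkthrough concrete, here is a self-contained sketch that reproduces it on the sample integers above. It uses Oracle syntax (the production query already relies on Oracle hints and last_value ... ignore nulls); the inline sample_range and sample_access data sets are illustrative stand-ins for ip_range and the deduplicated access IPs:

-- Illustrative sketch only: inline sample data, then the tag/sort/fill/filter steps.
with sample_range as (
  select 1 area_id, 15 ip_start, 20 ip_end from dual union all
  select 2, 22, 25 from dual union all
  select 3, 30, 35 from dual union all
  select 4, 36, 40 from dual
),
sample_access as (
  select 16 ip_access from dual union all
  select 22 from dual union all
  select 24 from dual union all
  select 39 from dual
),
tagged as (
  -- mark "1": the IP ranges themselves; mark "2": the access IPs
  select 1 dataset, r.ip_start, r.ip_start range_start, r.ip_end range_end, r.area_id
    from sample_range r
  union all
  select 2, a.ip_access, null, null, null
    from sample_access a
),
filled as (
  -- sort by IP (and mark), then carry the latest non-null range/area downwards
  select dataset,
         ip_start,
         last_value(range_start ignore nulls) over (order by ip_start, dataset) range_start,
         last_value(range_end   ignore nulls) over (order by ip_start, dataset) range_end,
         last_value(area_id     ignore nulls) over (order by ip_start, dataset) area_id
    from tagged
)
select ip_start, area_id
  from filled
 where dataset = 2
   and ip_start between range_start and range_end;
-- expected rows: (16,1), (22,2), (24,2), (39,4), matching the last table above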

Materializing this procedure into a temporary result set is exactly what the first half of the query does:

with vstat_details as (
       select /*+ all_rows materialize */ distinct ip_access n_ip from daily_access ),
     vstat_ip_range as (
       select /*+ all_rows materialize */
              v2.ip_start n_ip, v2.area_id
         from (select v1.dataset, v1.ip_start,
                      last_value(v1.range_start ignore nulls) over (order by v1.ip_start, v1.dataset) range_start,
                      last_value(v1.range_end   ignore nulls) over (order by v1.ip_start, v1.dataset) range_end,
                      last_value(v1.area_id     ignore nulls) over (order by v1.ip_start, v1.dataset) area_id
                 from (select 1 dataset,
                              t1.ip_start,
                              t1.ip_start range_start,
                              t1.ip_end   range_end,
                              t1.area_id
                         from ip_range t1
                        union all
                       select /*+ leading(d) use_hash(r) no_merge(d) full(r) */
                              2 dataset,
                              t2.n_ip ip_start,
                              null range_start,
                              null range_end,
                              null area_id
                         from vstat_details t2) v1) v2
        where v2.ip_start >= v2.range_start
          and v2.ip_start <= v2.range_end
          and v2.dataset = 2)

In the end, we only need a simple join against this temporary result set:

select /*+ all_rows leading(v,d) use_hash(d,a) no_merge(v) */
       count(*) as n_pageviews,
       a.addr
  from vstat_ip_range v,
       daily_access d,
       area a
 where v.n_ip = d.ip_access
   and v.area_id = a.area_id
 group by a.addr;

Because there is no between join and the data has been pre-filtered and pre-processed, the whole query runs very fast. With the data volume mentioned above it takes about 3 seconds, roughly 300 times faster than the original query.

In fact, the user behavior analysis system has since moved to a big data platform, and mapping IP addresses to regions no longer relies on a relational database for performance. But this query from the old system is still worth studying, because it makes us realize a simple truth: SQL is a program.

Keeping this in mind will help us write SQL that works like a "program", not just a "query" translated literally from the natural-language requirement.
