Because the database space is still relatively immature, SQL developers on every platform are struggling, making the same mistakes over and over again. Of course, database vendors are making some progress and continue to grapple with the bigger issues.

Whether SQL developers code on SQL Server, Oracle, DB2, Sybase, MySQL, or any other relational database platform, concurrency, resource management, space management, and speed continue to plague them.

Part of the problem is that there is no magic bullet; I can cite at least one exception to almost every best practice.

We say that tuning databases is both an art and a science, which makes sense because there are few hard and fast rules that apply across the board. A problem you solve on one system is not a problem on another, and vice versa.

There is no right answer when it comes to tuning queries, but that doesn’t mean you should give up. Here are some guidelines you can follow to get great results.

Don’t use UPDATE instead of CASE

This problem is common but easy to overlook, and many developers walk right into it because using UPDATE just seems like the natural thing to do.

Take this scenario: You insert data into a temporary table and need it to display a value if another value exists.

Maybe you pull records from the Customer table and want to mark customers with orders over $100,000 as “Preferred.”

So, you insert the data into the table, run the UPDATE statement, and set the CustomerRank column to “Preferred” for any customer whose order exceeds $100,000.

The problem is that the UPDATE statement is logged, which means every write to the table effectively happens twice.

The way around this is to use an inline CASE expression in the query itself: it checks each row against the order-amount condition and sets the “Preferred” flag before the row is ever written to the table, which improves performance dramatically.
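
Here is a minimal sketch of the idea, assuming a hypothetical #Customers temp table and an OrderTotal column; instead of inserting first and running a second, fully logged UPDATE, the CASE expression assigns CustomerRank as part of the insert:

INSERT INTO #Customers (CustomerID, OrderTotal, CustomerRank)
SELECT CustomerID,
       OrderTotal,
       CASE WHEN OrderTotal > 100000 THEN 'Preferred' ELSE 'Standard' END
FROM dbo.Customer;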

Don’t blindly reuse code

This problem is also common. It’s easy to copy code written by someone else because you know it will get the data you need.

The problem is that it often pulls far more data than you need, and developers rarely bother to trim it down, so you end up hauling around a huge superset of data.

This usually manifests as an extra outer join or extra condition in the WHERE clause. If you streamline your reusable code to fit your exact requirements, you can dramatically improve performance.

Pull only the columns you need

This issue is similar to the previous one, but it is specific to columns. It’s all too easy to code every query with SELECT * instead of listing the columns individually.

The problem, again, is that it pulls more data than you need, and I’ve seen this mistake countless times. A developer runs a SELECT * query against a 120-column table with millions of rows, yet ends up using only three or five of those columns.

So you’re processing far more data than you need, and it’s a wonder the query returns at all. Not only are you processing superfluous data, you’re also taking resources away from other processes.
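
A quick illustration with hypothetical table and column names: if the report only ever uses three columns, ask for exactly those three.

-- Instead of dragging back all 120 columns:
SELECT * FROM dbo.CustomerDetail;

-- ask only for what the report actually uses:
SELECT CustomerID, Region, OrderTotal
FROM dbo.CustomerDetail;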

Do not query twice (double-dip)

This is another mistake I see a lot of people making: writing stored procedures to extract data from a table with hundreds of millions of rows.

The developer needs to pull information about customers who live in California and make more than $40,000 a year. So he queries the customers who live in California and puts the results into a temporary table.

Then he queries the customers who make more than $40,000 a year and puts those results into another temporary table. Finally, he joins the two temp tables to get the final result.

Are you kidding me? This should be done in a single query; instead, the huge table gets hit twice. Don’t be silly: query big tables only once whenever you can, and you’ll find your stored procedures perform far better.
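
A minimal sketch of the single-pass version, assuming hypothetical State and AnnualIncome columns on the Customers table:

SELECT CustomerID, FirstName, LastName
FROM dbo.Customers
WHERE State = 'CA'
  AND AnnualIncome > 40000;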

A slightly different scenario is when several steps in a process require a subset of the large table, which results in querying the large table every time.

To avoid this problem, simply query this subset, persist it elsewhere, and then refer the subsequent steps to the smaller data set.

Know when to use temporary tables

This one takes a bit more effort to get right, but the payoff is remarkable. You can use temp tables in plenty of situations, such as keeping a large table from being queried twice. You can also use them to drastically cut the processing power needed to join large tables.

If you must join a table to a large table, and there is a condition on the large table, you can often improve query performance by pulling just the required portion of data out of the large table into a temporary table and joining against that instead.

This is also helpful if there are several queries in the stored procedure that need to perform similar joins to the same table.
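
A hedged sketch of the pattern, with hypothetical table and column names: pull the filtered subset of the big table into a temp table once, then join the smaller temp table wherever it’s needed.

-- Extract only the rows you need from the big table...
SELECT CustomerID, Region, AnnualIncome
INTO #CACustomers
FROM dbo.Customers
WHERE State = 'CA';

-- ...then join against the much smaller temp table.
SELECT o.OrderID, c.CustomerID, o.OrderTotal
FROM dbo.Orders AS o
JOIN #CACustomers AS c ON c.CustomerID = o.CustomerID;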

Pre-staging data

If you have a report or stored procedure (or set of procedures) that performs similar joins against large tables, you can really help yourself by joining those tables ahead of time and persisting the result into a permanent staging table.

The reports can then run against that pre-staged table and avoid the large join. You can’t always use this technique, but when you can, you’ll find it’s an excellent way to save server resources.
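
A minimal sketch, assuming a hypothetical dbo.ReportStage table that the reports read from: the big join runs once, on a schedule, and the result is persisted for everyone else.

-- Refresh the pre-staged data (e.g. shortly before the reports run).
TRUNCATE TABLE dbo.ReportStage;

INSERT INTO dbo.ReportStage (CustomerID, OrderID, OrderTotal, Region)
SELECT c.CustomerID, o.OrderID, o.OrderTotal, c.Region
FROM dbo.Customers AS c
JOIN dbo.Orders AS o ON o.CustomerID = c.CustomerID;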

Note: Many developers get around this join problem by concentrating on the query itself and creating a view around the join, so they don’t have to type the join conditions over and over again.

The problem with that approach, however, is that the query still runs for every report that needs it. By pre-staging the data, you run the join just once (say, 10 minutes before the reports) and everyone else avoids the big join.

I can’t tell you how much I love this technique. In most environments, there are popular tables that get joined all the time, so there’s no reason they can’t be pre-staged.

Batch delete and update

This is another technique that gets overlooked a lot, and deleting or updating large amounts of data in huge tables can be a nightmare if you don’t do it right.

The problem is that both of these statements run as a single transaction. If you need to kill them, or if something goes wrong while they are running, the system has to roll back the entire transaction, which can take a very long time.

These operations also block other transactions for as long as they run, essentially bottlenecking the system. The solution is to do your deletes or updates in small batches.

This solves the problem in several ways:

If the transaction gets killed for any reason, it has only a small number of rows to roll back, so the database comes back online much more quickly.

While the small-batch transactions are committing to disk, other transactions can sneak in and get some work done, so concurrency improves greatly.

Also, many developers have it stuck in their heads that these delete and update operations must be completed the same day. That’s not always true, especially if you’re archiving.

You can stretch the operation out as long as you need to, and smaller batches help you do that; taking longer to finish these intensive operations is far better than slowing your whole system down.
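
Here’s a minimal sketch of the batching pattern, assuming a hypothetical dbo.OrderHistory table and an archive cutoff date:

WHILE 1 = 1
BEGIN
    -- Delete in chunks of 10,000 rows; each batch commits on its own.
    DELETE TOP (10000) FROM dbo.OrderHistory
    WHERE OrderDate < '2015-01-01';

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete
END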

Use temporary tables to improve cursor performance

It’s best to avoid cursors whenever possible. Not only do they suffer from speed problems, which can be a major issue with many operations, they can also cause your operation to block other operations for much longer than necessary, greatly reducing concurrency in your system.

However, you can’t always avoid cursors, and when you can’t, you may be able to escape cursor-induced performance problems by running the cursor operations against a temporary table instead.

Take, for example, a cursor that steps through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that.

Then you hit the live table with a single UPDATE statement that is much smaller and holds locks only for a short time.
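
A hedged sketch of that pattern, with hypothetical table and column names: the row-by-row work happens against a temp copy, and the live table sees only one short, set-based UPDATE.

-- Copy the working set out of the live table.
SELECT CustomerID, OrderTotal, CustomerRank
INTO #Work
FROM dbo.Customers;

-- ...run the cursor / row-by-row comparison logic against #Work here...

-- Apply the results to the live table in a single statement.
UPDATE c
SET c.CustomerRank = w.CustomerRank
FROM dbo.Customers AS c
JOIN #Work AS w ON w.CustomerID = c.CustomerID;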

Sectioning off your data modifications like this can greatly increase concurrency. And I’ll finish by saying you almost never need to use a cursor at all; there’s almost always a set-based solution.

Use table-valued functions

This is one of my favorite techniques of all time, because it’s the kind of secret that only experts know.

When a scalar function is used in the SELECT list of a query, the function is called for each row in the result set, which can significantly degrade the performance of large queries.

However, you can often improve performance dramatically by converting the scalar function into a table-valued function and using CROSS APPLY in the query instead; it’s a wonderful trick.
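
A minimal sketch with hypothetical names, showing an inline table-valued function used through CROSS APPLY instead of a scalar function called once per row:

-- Inline table-valued function (hypothetical).
CREATE FUNCTION dbo.fn_CustomerOrderTotal (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(OrderTotal) AS OrderTotal
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
);
GO

-- The optimizer can expand an inline TVF into the plan, unlike a scalar function in the SELECT list.
SELECT c.CustomerID, t.OrderTotal
FROM dbo.Customers AS c
CROSS APPLY dbo.fn_CustomerOrderTotal(c.CustomerID) AS t;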

Do not perform large operations on many tables in the same batch

This one seems obvious, but apparently it isn’t. I’ll use another real-world example, because it drives the point home much better.

I had a system that suffered from a ton of blocking, with many operations at a standstill. It turned out that a delete routine that ran several times a day was deleting data from 14 tables inside one explicit transaction. Handling all 14 tables in a single transaction meant that locks were held on every table until all the deletes finished.

The solution was to break up each table’s deletes into separate transactions, so that each delete transaction held locks on only one table.

This freed up the other tables and eased the blocking, allowing other operations to keep flowing. You should always split large transactions like this into separate, smaller transactions to prevent blocking.
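
A hedged sketch of the fix, with hypothetical table names: each table gets its own transaction rather than one giant transaction holding locks on all 14 tables at once.

BEGIN TRAN;
    DELETE FROM dbo.OrderDetailArchive WHERE OrderDate < '2015-01-01';
COMMIT TRAN;

BEGIN TRAN;
    DELETE FROM dbo.OrderHeaderArchive WHERE OrderDate < '2015-01-01';
COMMIT TRAN;

-- ...and so on for each remaining table.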

Don’t use triggers

This one is largely the same as the previous point, but it’s worth calling out. The problem with triggers: whatever you want the trigger to do, it does in the same transaction as the original operation.

If you write a trigger to insert data into another table when updating rows in the Orders table, both tables will be locked until the trigger completes.

If you need to insert data into another table after an update, put the update and the insert into a stored procedure and run them as separate transactions.

If you need to roll back, you can do so easily without holding locks on both tables. As always, keep transactions short and don’t hold locks on more than one resource at a time if you can help it.
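
A minimal sketch of that idea, with hypothetical table and procedure names; the audit insert runs in its own transaction instead of riding along inside a trigger:

CREATE PROCEDURE dbo.usp_UpdateOrderStatus
    @OrderID INT,
    @Status  VARCHAR(20)
AS
BEGIN
    BEGIN TRAN;
        UPDATE dbo.Orders SET Status = @Status WHERE OrderID = @OrderID;
    COMMIT TRAN;

    BEGIN TRAN;
        INSERT INTO dbo.OrderAudit (OrderID, Status, ChangedAt)
        VALUES (@OrderID, @Status, GETDATE());
    COMMIT TRAN;
END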

Do not cluster on GUIDs

After all these years, I can’t believe we’re still fighting this one, but I still run into clustered GUIDs at least twice a year.

A GUID (globally unique identifier) is a randomly generated 16-byte value. Ordering your table’s data on this column fragments the table far faster than using a steadily increasing value such as a date or an IDENTITY column.
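
A minimal sketch of the alternative, with hypothetical names: cluster on a steadily increasing IDENTITY key and keep the GUID, if you need one, as an ordinary column.

CREATE TABLE dbo.Orders
(
    OrderID   INT IDENTITY(1,1) NOT NULL,
    OrderGUID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),  -- still available for lookups
    OrderDate DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
);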

A few years ago I did a benchmark test where I inserted a bunch of data into a table with a clustered GUID and the same data into another table with an IDENTITY column.

The GUID table was so fragmented that after just 15 minutes, performance dropped by several thousand percent.

It took five hours for the IDENTITY table to lose even a few percentage points of performance. And this applies not just to GUIDs; it applies to any volatile column.

Don’t count rows if you just want to see if the data exists

This is a very common situation where you need to check that the data exists in a table, and based on the results of that check, you need to do something.

I’ve often seen people run SELECT COUNT(*) FROM dbo.T1 to check whether the data exists:

SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
    <Do something>
END

It’s totally unnecessary. If you want to check if the data exists, just do this:

IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
    <Do something>
END

In other words, don’t count everything in the table; just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns results super fast.

The bigger the table, the more obvious the gap. Do the right thing before your data gets too big. It’s never too early to tune a database.

In fact, I just ran this example against a table with 270 million rows in one of my production databases.

The first query took 15 seconds and used 456,197 logical reads; the second returned results in less than a second and used only five logical reads.

However, if you really do need a row count on a large table, another technique is to pull it from the system tables.

SELECT rows FROM sysindexes will get you the row counts for all of the indexes.

And because the clustered index represents the data itself, you can get the table’s row count by adding WHERE indid = 1 and then filtering on the table name.

So, the final query is:

SELECT rows FROM sysindexes WHERE object_name(id) = 'T1' AND indid = 1

Against my 270 million row table, the results came back in under a second with only six logical reads. Now that’s performance.

Don’t do reverse searches

Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a reverse search, which has to be resolved row by row with a table scan. If you need to do something like this, you may find that performance is much better if you rewrite the query so it can use an index.

The query can be easily rewritten like this:

SELECT * FROM Customers WHERE RegionID < 3
UNION ALL
SELECT * FROM Customers WHERE RegionID > 3

This query will use indexes, so if your data set is large, the performance is much better than the table scan version.

Of course, nothing is ever that easy, and it could also perform worse, so test it before you deploy it. There are too many factors involved for it to work 100 percent of the time.

Finally, I realize this query breaks the earlier rule about not querying a table twice, but it also goes to show that there are no hard and fast rules. Though we’re querying the table twice here, we’re doing it to avoid a costly table scan.

You can’t use all of these techniques all the time, but if you keep them in mind, you could use them to solve some big problems someday.

The most important thing to remember is: don’t take what I say as dogma. Try these techniques in your own environment. The same solutions won’t work in every situation, but they’re what I reach for whenever I’m chasing down bad performance, and they always work.


The original article was published on April 27, 2018

Author: Data analyst
