Django's ORM is a convenience for developers, but abstracting database access comes at a cost, and developers who are willing to explore the database often find that adjusting the ORM's default behavior can lead to significant performance improvements. In this article, I'll share nine tips for working with databases in Django.

1. Aggregation with Filter

Before Django 2.0, if we wanted to get something like the total number of users and the total number of active users, we had to resort to conditional expressions:

from django.contrib.auth.models import User
from django.db.models import (
    Count,
    Sum,
    Case,
    When,
    Value,
    IntegerField,
)

User.objects.aggregate(
    total_users=Count('id'),
    total_active_users=Sum(Case(
        When(is_active=True, then=Value(1)),
        default=Value(0),
        output_field=IntegerField(),
    )),
)

In Django 2.0, a filter argument was added to aggregate functions, which makes this much easier:

from django.contrib.auth.models import User
from django.db.models import Count, F

User.objects.aggregate(
    total_users=Count('id'),
    total_active_users=Count('id', filter=F('is_active')),
)

Much nicer, short and sweet. If you are using PostgreSQL, the two queries will look like this:

SELECT
    COUNT(id) AS total_users,
    SUM(CASE WHEN is_active THEN 1 ELSE 0 END) AS total_active_users
FROM
    auth_user;
SELECT
    COUNT(id) AS total_users,
    COUNT(id) FILTER (WHERE is_active) AS total_active_users
FROM
    auth_user;

The second query uses PostgreSQL's FILTER (WHERE ...) clause.


2. QuerySet results as namedtuples

I'm a big fan of namedtuples, and apparently so is Django 2.0's ORM. In Django 2.0, a named argument was added to the values_list method. Setting named to True returns the QuerySet as a list of namedtuples:

> User.objects.values_list(
    'first_name',
    'last_name',
)[0]
('Haki', 'Benita')

> user_names = User.objects.values_list(
    'first_name',
    'last_name',
    named=True,
)
> user_names[0]
Row(first_name='Haki', last_name='Benita')
> user_names[0].first_name
'Haki'
> user_names[0].last_name
'Benita'
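As a small follow-up, here is a minimal usage sketch (reusing the queryset from the example above) showing why attribute access is handy when iterating:

for row in User.objects.values_list('first_name', 'last_name', named=True):
    # Each row is a namedtuple, so the meaning of each value is explicit.
    print(row.first_name, row.last_name)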

3. Custom functions

Django 2.0's ORM is powerful and feature-rich, but it still can't keep up with every database feature. Fortunately, the ORM lets us extend it with custom functions. Suppose we have a duration field that records how long each report took, and we want to find the average duration of all reports:

from django.db.models import Avg

> Report.objects.aggregate(avg_duration=Avg('duration'))
{'avg_duration': datetime.timedelta(0, 0, 55432)}

That's great, but an average on its own doesn't tell us much. Let's also compute the standard deviation:

from django.db.models import Avg, StdDev
Report.objects.aggregate(
    avg_duration=Avg('duration'),
    std_duration=StdDev('duration'),
)
ProgrammingError: function stddev_pop(interval) does not exist
LINE 1: SELECT STDDEV_POP("report"."duration") AS "std_dura...
               ^
HINT:  No function matches the given name and argument types.
       You might need to add explicit type casts.

PostgreSQL does not support standard deviation for interval fields. We need to convert the interval to a number before we can apply STDDEV_POP to it.

One option is to extract the epoch (the number of seconds) from the interval:

SELECT
    AVG(duration),
    STDDEV_POP(EXTRACT(EPOCH FROM duration))
FROM 
    report;

      avg       |    stddev_pop
----------------+------------------
 00:00:00.55432 | 1.06310113695549
(1 row)

So how do we do that in Django? You guessed it — a custom function:

# common/db.py
from django.db.models import Func

class Epoch(Func):
    function = 'EXTRACT'
    template = "%(function)s('epoch' from %(expressions)s)"

Our new function is used like this:

from django.db.models import Avg, StdDev, F
from common.db import Epoch

Report.objects.aggregate(
    avg_duration=Avg('duration'),
    std_duration=StdDev(Epoch(F('duration'))),
)
{'avg_duration': datetime.timedelta(0, 0, 55432),
 'std_duration': 1.06310113695549}

* Notice the use of F expressions in the Epoch call.

4. Statement Timeout

This is probably the simplest and most important tip I can give. We’re human and we all make mistakes. We can’t account for every edge case, so we have to set boundaries.

Unlike other non-blocking application servers, such as Tornado, Asyncio, and even Node, Django typically uses a synchronous worker process. This means that when a user performs a long-running operation, the worker process is blocked and no one else can use it until it is finished.

Nobody really runs Django in production with just one worker process, but we still want to make sure that a single query doesn't consume too many resources for too long.

In most Django applications, you spend most of your time waiting for database queries. So setting timeouts on SQL queries is a good place to start.

I like to set a global timeout in my wsgi.py file like this:

# wsgi.py
from django.db.backends.signals import connection_created
from django.dispatch import receiver

@receiver(connection_created)
def setup_postgres(connection, **kwargs):
    if connection.vendor != 'postgresql':
        return

    # Timeout statements after 30 seconds.
    with connection.cursor() as cursor:
        cursor.execute("""
            SET statement_timeout TO 30000;
        """)

Why wsgi.py? Because then it only affects the worker processes, not out-of-band analytic queries, cron tasks, and so on.

Hopefully, you are using persistent database connections, so there is no per-request connection overhead. Timeouts can also be configured at user granularity:

postgresql=#> alter user app_user set statement_timeout TO 30000;
ALTER ROLE
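If you are not using persistent connections yet, here is a minimal sketch of enabling them with Django's CONN_MAX_AGE setting (the database name and user below are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'app_db',        # placeholder
        'USER': 'app_user',      # placeholder
        # Keep connections open for up to 60 seconds instead of
        # opening a new one on every request.
        'CONN_MAX_AGE': 60,
    },
}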

An aside: the database isn't the only place we spend time; the network is another common one. So make sure you always set a timeout when calling a remote service:

import requests

response = requests.get(
    'https://api.slow-as-hell.com',
    timeout=3000,
)

5. Limit

This is somewhat related to the last point about setting boundaries. Sometimes the behavior of our clients can be unpredictable. For example, it is not uncommon for a user whose first attempt appears to be "stuck" to open another tab and try again.

That’s why you need to use limits.

Say we want a query to return no more than 100 rows:

# bad example
data = list(Sale.objects.all())[:100]

This is bad, because even though only 100 rows are returned, you've actually fetched all the rows into memory first.

Let’s try again:

data = Sale.objects.all()[:100]

This is much better. Django adds a LIMIT clause to the SQL and fetches only 100 rows.
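If you want to see this for yourself, you can print the SQL Django generates for the sliced queryset (a quick sketch; the exact SQL depends on the backend):

qs = Sale.objects.all()[:100]
# The generated SELECT should end with "LIMIT 100".
print(qs.query)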

We added a limit, but we still have a problem: the user wanted all the data, we only gave them 100 rows, and now they think there are only 100 rows.

Instead of blindly returning the first 100 rows, let’s make sure that if we go beyond 100 rows (usually after filtering) we throw an exception:

LIMIT = 100

if Sale.objects.count() > LIMIT:
    raise ExceededLimit(LIMIT)
return Sale.objects.all()[:LIMIT]

This works, but we've added an extra query.

Can we do better? Yes:

LIMIT = 100

data = Sale.objects.all()[:(LIMIT + 1)]
if len(data) > LIMIT:
    raise ExceededLimit(LIMIT)
return data

Instead of fetching 100 rows, we fetch 100 + 1 = 101 rows. If the 101st row exists, we know there are more than 100 rows.

Remember the LIMIT + 1 trick, sometimes it comes in handy.
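If you use this pattern in more than one place, it may be worth wrapping it in a small helper. A minimal sketch, assuming the same ExceededLimit exception used above (fetch_with_limit is a hypothetical name):

LIMIT = 100

def fetch_with_limit(queryset, limit=LIMIT):
    # Fetch limit + 1 rows; if we got more than limit back,
    # the result set is too big and the caller should narrow the query.
    rows = list(queryset[:limit + 1])
    if len(rows) > limit:
        raise ExceededLimit(limit)
    return rows

# data = fetch_with_limit(Sale.objects.all())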

6. Transaction and lock control

This one is a bit harder. We started seeing transaction timeout errors in the middle of the night, caused by locks in the database. The common pattern for handling a transaction in our code looked like this:

from django.db import transaction as db_transaction

...
with db_transaction.atomic():
  transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update()
        .get(uid=uid)
  )
  ...

A transaction usually involves some user and product attributes, so we often use select_related to force the joins and save queries.

Updating a transaction also involves acquiring a lock on it, to make sure no one else modifies it at the same time.

Now, do you see the problem? No? Neither did I.

We have some ETL processes running at night, mainly doing maintenance on the product and user tables. These ETL jobs update fields and insert rows, so they also acquire locks on those tables.

So what's the problem? When select_for_update is used together with select_related, Django attempts to acquire a lock on all the tables in the query.

The code we use to fetch a transaction tries to acquire locks on the transaction, user, product, and category tables. Once the ETL locked the last three tables in the middle of the night, transactions started to fail.

Once we had a better understanding of the problem, we started looking for a way to lock only the necessary table (the transaction table). Luckily, a new option for select_for_update became available in Django 2.0:

from django.db import transaction as db_transaction

...
with db_transaction.atomic():
  transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update(
            of=('self',)
        )
        .get(uid=uid)
  )
  ...

The of option was added to select_for_update. With of, we specify which tables we want to lock; 'self' is a special keyword indicating that we want to lock the model we are working with, i.e. the transaction table.

Currently, this feature only works with PostgreSQL and Oracle.
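If you ever do need to lock one of the related rows as well, of can also take the names of related objects used in the query. A sketch reusing the models and names from the snippet above:

with db_transaction.atomic():
    transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update(
            # Lock only the transaction row and its related user row.
            of=('self', 'user'),
        )
        .get(uid=uid)
    )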

7. FK Indexes

When creating a model, Django creates a B-tree index on all the foreign keys, which can be quite expensive and sometimes unnecessary.

A typical example is a through model for an M2M (many-to-many) relation:

class Membership(Model):
    group = ForeignKey(Group)
    user = ForeignKey(User)

In the above model, Django implicitly creates two indexes: one for user and one for group. Another common pattern in M2M models is adding a unique constraint over the two fields. In our case, it means a given user can only be a member of a given group once:

class Membership(Model):
    group = ForeignKey(Group)
    user = ForeignKey(User)

    class Meta:
        unique_together = (
            'group',
            'user',
        )

The unique_together also creates an index on both fields, so we end up with a model that has two fields and three indexes. Depending on what we do with this model, we can set db_index=False on the FK fields to skip the FK indexes and keep only the index created by the unique constraint:

class Membership(Model):
    group = ForeignKey(Group, db_index=False)
    user = ForeignKey(User, db_index=False)

    class Meta:
        unique_together = (
            'group',
            'user',
        )

Removing redundant indexes will make inserts and queries faster, and our database lighter.
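To double-check which indexes Django will actually create for a model, one option is to look at the SQL of the relevant migration (the app label and migration name below are placeholders):

python manage.py sqlmigrate members 0001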

8. Order of columns in composite index

An index with more than one column is called a composite index. In a B-tree composite index, the first column is indexed using a tree structure. From the leaves of the first level, a new tree is created for the second column, and so on.

The order of columns in an index is very important.

In the example above, we first get a tree for group, and for each group another tree for all of its users. The rule of thumb for B-tree composite indexes is to keep the secondary indexes as small as possible. In other words, columns with high cardinality (more distinct values) should come first.

In our example, we assume there are fewer groups than users (in general), so putting the user column first makes the secondary index on group smaller.

class Membership(Model):
    group = ForeignKey(Group, db_index=False)
    user = ForeignKey(User, db_index=False)

    class Meta:
        unique_together = (
            'user',
            'group',
        )

Notice that the order of ‘user’ and ‘group’ in the unique_together tuple has been adjusted to make the index smaller. This is just a rule of thumb, and the final index should be optimized for a particular scenario. The point here is to understand the importance of column order in both implicit and composite indexes.

9. BRIN Indexes

A B-tree index is structured, as its name suggests, like a tree. The cost of looking up a single value is the height of the tree plus one random access to the table. This makes B-tree indexes well suited for unique constraints and (some) range queries.

The disadvantage of a B-tree index is its size: B-tree indexes can get big.

Are there other options? Yes, databases provide plenty of other index types for specific use cases. Starting with Django 1.11, there is a new Meta option for creating indexes on models, which gives us an opportunity to explore other types of indexes.

PostgreSQL has a very useful index type, BRIN (block-range index). In some cases, a BRIN index can be more efficient than a B-tree index.

Let's take a look at the official documentation:

BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table. To understand this statement, it is important to understand how a BRIN index works. As the name suggests, a BRIN index creates a small index over ranges of adjacent blocks in a table. The index is very small, and it can only say whether a value is definitely not in a block range or might be in it.

Let’s do a simple example of how a BRIN index can help us.

Suppose we have these values in a column, each in its own block:

1, 2, 3, 4, 5, 6, 7, 8, 9

We create a range for every three adjacent blocks:

[1, 2, 3], [4, 5, 6], [7, 8, 9]

For each range, we will save the minimum and maximum values within the range:

[1-3], [4-6], [7-9]

Let's try to find the value 5 using this index:

[1-3] - definitely not here
[4-6] - probably here
[7-9] - definitely not here

Using indexes, we limit our search to the range [4-6].

Now, for another example, suppose the values in the column are not well sorted, so the ranges look like this:

[2-9], [1-7], [3-8]
Copy the code

Try finding 5 again:

[2-9] - could be here
[1-7] - could be here
[3-8] - could be here

In this case the index is useless: not only does it fail to narrow the search, it actually makes us read more, because we fetch both the index and the entire table.

Back to the documentation:

Certain columns have some natural correlation with their physical location within the table

This is the key to BRIN indexes. To get the most out of one, the values in the column must be roughly sorted or clustered on disk.
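In PostgreSQL, you can get a rough sense of how well a column correlates with its physical order on disk from the planner statistics, once the table has been analyzed. A sketch (the table and column names are placeholders):

SELECT correlation
FROM pg_stats
WHERE tablename = 'app_somemodel'
  AND attname = 'created';
-- Values close to 1 or -1 mean the column is well correlated with
-- its physical order on disk, which is exactly what BRIN likes.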

Now back to Django: which frequently indexed field is most likely to be naturally sorted on disk? That's right, a DateTimeField with auto_now_add.

A very common pattern in Django models is:

class SomeModel(Model):
    created = DateTimeField(
        auto_now_add=True,
    )

When auto_now_add is used, Django automatically populates the field with the current time when the row is created. A created field is also usually a good candidate for filtering in queries, so it often gets an index.

Let's add a BRIN index on the created field:

from django.contrib.postgres.indexes import BrinIndex

class SomeModel(Model):
    created = DateTimeField(
        auto_now_add=True,
    )

    class Meta:
        indexes = (
            BrinIndex(fields=['created']),
        )

To understand the difference in size, I created a table with roughly 2M rows, with a date field that is naturally sorted on disk, and compared the two index types:

B-tree index: 37 MB
BRIN index: 49 KB

Yes, you read that right.
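If you want to compare the sizes of your own indexes, PostgreSQL can report the size of an index directly (the index name below is a placeholder):

SELECT pg_size_pretty(pg_relation_size('somemodel_created_brin_idx'));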

There is much more to creating an index than its size. But now, with Django 1.11’s index support, we can easily integrate new types of indexes into our applications, making them lighter and faster.
