In this column, I will occasionally translate and share some of the best Django articles, focusing on tips and lessons learned, drawn from:
- Medium
- Twitter bloggers
If you are interested, please follow along and give me some encouragement; sorting out these translations takes time. Thank you!
—
Original article: 9 Django Tips for Working with Databases
Author: Haki Benita
Translator: In the book
Proofreader: In the book
Recommendation: ✨✨✨✨✨
An ORM is very useful for developers, but abstracting away database access has its own costs, and developers who are willing to dig into the database often find that modifying the ORM's default behavior can lead to performance improvements.
In this article, I'll share nine tips for working with databases in Django.
1. Aggregation with Filter
Before Django 2.0, if we wanted to get something like the total number of users and the total number of active users, we had to resort to conditional expressions:
from django.contrib.auth.models import User
from django.db.models import (
    Count,
    Sum,
    Case,
    When,
    Value,
    IntegerField,
)

User.objects.aggregate(
    total_users=Count('id'),
    total_active_users=Sum(Case(
        When(is_active=True, then=Value(1)),
        default=Value(0),
        output_field=IntegerField(),
    )),
)
Django 2.0 added a filter argument to aggregate functions, which makes this much easier:
from django.contrib.auth.models import User
from django.db.models import Count, F
User.objects.aggregate(
    total_users=Count('id'),
    total_active_users=Count('id', filter=F('is_active')),
)
It's great: short and sweet.
If you are using PostgreSQL, the two queries will look like this:
SELECT
    COUNT(id) AS total_users,
    SUM(CASE WHEN is_active THEN 1 ELSE 0 END) AS total_active_users
FROM
    auth_users;

SELECT
    COUNT(id) AS total_users,
    COUNT(id) FILTER (WHERE is_active) AS total_active_users
FROM
    auth_users;
The second query uses the FILTER (WHERE …) clause.
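For reference, the filter argument also accepts a Q object, which reads a bit more explicitly; a minimal sketch:

from django.contrib.auth.models import User
from django.db.models import Count, Q

User.objects.aggregate(
    total_users=Count('id'),
    # Count only the rows where is_active is True.
    total_active_users=Count('id', filter=Q(is_active=True)),
)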
2. QuerySet results as namedtuples
I'm a big fan of namedtuples, and apparently so is the Django 2.0 ORM.
In Django 2.0, a named argument was added to the values_list method. Setting named=True returns the QuerySet as a list of namedtuples:
> User.objects.values_list('first_name', 'last_name')[0]
('Haki', 'Benita')

> user_names = User.objects.values_list('first_name', 'last_name', named=True)
> user_names[0]
Row(first_name='Haki', last_name='Benita')
> user_names[0].first_name
'Haki'
> user_names[0].last_name
'Benita'
3. Custom functions
Django 2.0's ORM is powerful and feature-rich, but it still can't keep up with every database feature. Fortunately, the ORM lets us extend it with custom functions.
Suppose we have a field that records the duration of reports and we want to find the average duration of all reports:
from django.db.models import Avg

Report.objects.aggregate(avg_duration=Avg('duration'))
> {'avg_duration': datetime.timedelta(0, 0, 55432)}
That's great, but the average alone doesn't tell us much. Let's also work out the standard deviation:
from django.db.models import Avg, StdDev
Report.objects.aggregate(
    avg_duration=Avg('duration'),
    std_duration=StdDev('duration'),
)
ProgrammingError: function stddev_pop(interval) does not exist
LINE 1: SELECT STDDEV_POP("report"."duration") AS "std_dura...
^
HINT: No function matches the given name and argument types.
You might need to add explicit type casts.
Oops… PostgreSQL does not support standard deviation on an interval field. We need to convert the interval to a number before we can apply STDDEV_POP to it.
One option is to extract the epoch from the interval:
SELECT
    AVG(duration),
    STDDEV_POP(EXTRACT(EPOCH FROM duration))
FROM
    report;

      avg        |    stddev_pop
-----------------+------------------
 00:00:00.55432  | 1.06310113695549
(1 row)
So how do we do that in Django? You guessed it — a custom function:
# common/db.py
from django.db.models import Func
class Epoch(Func):
    function = 'EXTRACT'
    template = "%(function)s('epoch' from %(expressions)s)"
Our new function is used like this:
from django.db.models import Avg, StdDev, F
from common.db import Epoch
Report.objects.aggregate(
    avg_duration=Avg('duration'),
    std_duration=StdDev(Epoch(F('duration'))),
)

{'avg_duration': datetime.timedelta(0, 0, 55432),
 'std_duration': 1.06310113695549}
* Notice the use of the F expression in the call to Epoch.
4. Statement Timeout
This is probably the simplest and most important tip I can give. We’re human and we all make mistakes. We can’t account for every edge case, so we have to set boundaries.
Unlike non-blocking application servers such as Tornado, asyncio-based servers, or even Node, Django typically uses synchronous worker processes. This means that when a user triggers a long-running operation, the worker process is blocked and nobody else can use it until it finishes.
Nobody really runs Django in production with just one worker process, but we still want to make sure a single query doesn't hog resources for too long.
In most Django applications, you spend most of your time waiting for database queries. So setting timeouts on SQL queries is a good place to start.
I like to set a global timeout in my wsgi.py file like this:
# wsgi.py
from django.db.backends.signals import connection_created
from django.dispatch import receiver

@receiver(connection_created)
def setup_postgres(connection, **kwargs):
    if connection.vendor != 'postgresql':
        return

    # Timeout statements after 30 seconds.
    with connection.cursor() as cursor:
        cursor.execute("""
            SET statement_timeout TO 30000;
        """)
Why wsgi.py? Because then it only affects the worker processes, not out-of-process analytic queries, cron tasks, and so on.
Hopefully, you are using persistent database connections, so there is no per-request connection overhead.
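If you're not already, persistent connections are enabled through Django's CONN_MAX_AGE database option; a minimal sketch of the relevant settings (the values here are illustrative):

# settings.py (sketch; values are illustrative)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'app',
        # Reuse connections for up to 10 minutes instead of opening
        # a new one for every request.
        'CONN_MAX_AGE': 600,
    },
}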
Timeouts can also be configured at the user level:
postgresql=#> alter user app_user set statement_timeout TO 30000;
ALTER ROLE
An aside: the other common place where we spend a lot of time is the network. So make sure you always set a timeout when calling a remote service:
import requests
response = requests.get(
    'https://api.slow-as-hell.com',
    timeout=3000,
)
5. Limit
This is somewhat related to the previous point about setting boundaries. Sometimes the behavior of our clients can be unpredictable.
For example, it is not uncommon for a user to open another tab and try again when the first attempt seems "stuck".
That's why limits matter.
Let's limit a query to return no more than 100 rows:
# bad example
data = list(Sale.objects.all())[:100]
This is bad: even though only 100 rows are returned, all the rows have already been fetched and loaded into memory.
Let’s try again:
data = Sale.objects.all()[:100]
This is much better. Django will use the SQL LIMIT clause to fetch only 100 rows.
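If you want to confirm what Django actually sends to the database, you can print the generated SQL; a minimal sketch, reusing the Sale model from above:

qs = Sale.objects.all()[:100]
print(qs.query)  # the generated SQL ends with LIMIT 100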
We've added a limit, but there's still a problem: the user wants all the data, we only returned 100 rows, and now the user thinks there are only 100 rows.
Instead of blindly returning the first 100 rows, let's make sure that if there are more than 100 rows (usually after filtering), we raise an exception:
LIMIT = 100
if Sale.objects.count() > LIMIT:
    raise ExceededLimit(LIMIT)

return Sale.objects.all()[:LIMIT]
That works, but it adds another query.
Can we do better? We can:
LIMIT = 100
data = Sale.objects.all()[:(LIMIT + 1)]
if len(data) > LIMIT:
    raise ExceededLimit(LIMIT)

return data
Instead of fetching 100 rows, we fetch 100 + 1 = 101 rows. If the 101st row exists, we know there are more than 100 rows.
Remember the LIMIT + 1 trick; it comes in handy from time to time.
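Wrapped up as a reusable helper, the trick might look like the sketch below. The fetch_limited helper and the ExceededLimit exception are illustrative names, not part of Django:

class ExceededLimit(Exception):
    # Raised when a query returns more rows than allowed.
    def __init__(self, limit):
        super().__init__('query returned more than %d rows' % limit)
        self.limit = limit

def fetch_limited(queryset, limit=100):
    # Fetch one extra row so we can tell "exactly limit" from "more than limit".
    data = list(queryset[:limit + 1])
    if len(data) > limit:
        raise ExceededLimit(limit)
    return data

# Usage:
# data = fetch_limited(Sale.objects.all())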
6. Transaction and lock control
This one is harder.
Because of the locking mechanism in the database, we started seeing transaction timeout errors in the middle of the night.
(It seems the author was often woken up in the middle of the night 🤣)
The common pattern for manipulating transactions in our code is as follows:
from django.db import transaction as db_transaction

...

with db_transaction.atomic():
    transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update()
        .get(uid=uid)
    )
    ...
A transaction usually involves some user and product attributes, so we often use select_related to force a join and save a few queries.
Updating a transaction also involves acquiring a lock on it, to make sure it isn't grabbed by anyone else.
Now, do you see the problem? No? Neither did I. (The author is so cute)
We have some ETL processes that run at night, doing maintenance mainly on the product and user tables. These ETL jobs update and insert into those tables, so they also acquire locks on them.
So what's the problem? When select_for_update is used together with select_related, Django tries to acquire a lock on all the tables in the query.
The code we use to fetch a transaction tried to lock the transaction table along with the user, product, and category tables. Once the ETL locked those last three tables in the middle of the night, transactions started to fail.
Once we had a better understanding of the problem, we started looking for a way to lock only the necessary table (the transaction table). Luckily, a new option for select_for_update became available in Django 2.0:
from django.db import transaction as db_transaction

...

with db_transaction.atomic():
    transaction = (
        Transaction.objects
        .select_related(
            'user',
            'product',
            'product__category',
        )
        .select_for_update(
            of=('self',)
        )
        .get(uid=uid)
    )
    ...
The of option was added to select_for_update. With of we specify which tables we want to lock; 'self' is a special keyword meaning the model we are working on, in this case the Transaction table.
Currently, this feature only works with PostgreSQL and Oracle.
7. FK Indexes
When creating a model, Django creates a B-tree index on all the foreign keys, which can be quite expensive and sometimes unnecessary.
A typical example is a many-to-many (M2M) through model:
class Membership(Model):
    group = ForeignKey(Group)
    user = ForeignKey(User)
In the model above, Django implicitly creates two indexes: one for user and one for group.
Another common pattern in M2M models is to add a unique constraint on the two fields together. In our case, it means a given user can only be a member of a given group once:
class Membership(Model):
    group = ForeignKey(Group)
    user = ForeignKey(User)

    class Meta:
        unique_together = (
            'group',
            'user',
        )
The unique_together constraint also creates an index, so we end up with a model of two fields and three indexes 😓
Depending on what we do with this model, we can ignore the FK index and keep only the unique constraint index:
class Membership(Model):
    group = ForeignKey(Group, db_index=False)
    user = ForeignKey(User, db_index=False)

    class Meta:
        unique_together = (
            'group',
            'user',
        )
Removing redundant indexes will make inserts and queries faster, and our database lighter.
8. Order of columns in composite index
An index that covers more than one column is called a composite index. In a B-Tree composite index, the first column is indexed in a tree structure. From the leaves of the first level, a new tree is created for the second column, and so on.
The order of columns in an index is very important.
In the index above, we first get a tree for group, and for each group another tree for all of its users.
The rule of thumb for B-Tree composite indexes is to make the secondary indexes as small as possible. In other words, columns with high cardinality (more distinct values) should come first.
In our example, we can assume there are fewer groups than users (generally), so putting the user column first makes the secondary index on group smaller.
class Membership(Model):
    group = ForeignKey(Group, db_index=False)
    user = ForeignKey(User, db_index=False)

    class Meta:
        unique_together = (
            'user',
            'group',
        )
* Notice the order of the field names in the tuple.
This is just a rule of thumb; the final index should be tuned for your specific scenario. The point here is to understand the importance of column order in both implicit and composite indexes.
9. BRIN Indexes
A B-Tree index is structured like a tree. The cost of looking up a single value is the height of the tree plus one random access to the table. This makes B-Tree indexes well suited for unique constraints and (some) range queries.
The disadvantage of a B-Tree index is its size; B-Tree indexes can get big.
Aren't there other options? Yes, databases offer plenty of other index types for specific use cases.
Starting with Django 1.11, there is a new Meta option for creating indexes on models. This gives us an opportunity to explore other types of indexes.
PostgreSQL has a very useful index type, BRIN (block-range index). In some cases, a BRIN index can be more efficient than a B-tree index.
Let's take a look at the official documentation:
BRIN is designed to work with very large tables where some of the columns have some natural correlation to physical locations within the table.
To understand this statement, it's important to understand how a BRIN index works. As the name suggests, a BRIN index creates a small index over ranges of adjacent blocks in the table. The index is very small, and it can only say whether a value is definitely not in a block range or might be in it.
Let’s do a simple example of how a BRIN index can help us.
Suppose we have these values in a column, each of which is a block:
1, 2, 3, 4, 5, 6, 7, 8, 9
We create a range for every three adjacent blocks:
[1, 2, 3], [4, 5, 6], [7, 8, 9]
For each range, we will save the minimum and maximum values within the range:
[1-3], [4-6], [7-9]
Let's try to find the value 5 using this index:
[1-3] - definitely not here
[4-6] - might be here
[7-9] - definitely not here
Using the index, we've limited our search to the block range [4-6].
For another example, suppose the values in the column are not sorted well, and these are the block ranges we get:
[2-9], [1-7], [3-8]
Try finding 5 again:
[2-9] - might be here
[1-7] - might be here
[3-8] - might be here
Here the index is useless: not only did it fail to narrow the search, we actually had to read more, scanning both the index and the entire table.
Back to documentation:
…columns have some natural correlation with their physical location within the table
This is the key to BRIN indexes. To get the most out of one, the values in the column must be roughly sorted or clustered on disk.
Now back to Django: which field that we often index is most likely to be naturally sorted on disk? That's right, a DateTimeField with auto_now_add. (This is very common; if you haven't used it, it's worth a look.)
A very common pattern in Django models is:
class SomeModel(Model):
    created = DateTimeField(
        auto_now_add=True,
    )
When using auto_now_add, Django automatically populates the field with the current time when the row is created. A created field is also usually a good candidate for queries, so it often gets an index.
Let's add a BRIN index on the created field:
from django.contrib.postgres.indexes import BrinIndex

class SomeModel(Model):
    created = DateTimeField(
        auto_now_add=True,
    )

    class Meta:
        indexes = (
            BrinIndex(fields=['created']),
        )
To see the difference in size, I created a table with roughly 2M rows, with the date field naturally sorted on disk:
- B-tree index: 37 MB
- BRIN index: 49 KB
Yes, you read that right.
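If you want to check the numbers on your own tables, PostgreSQL can report relation sizes. A minimal sketch; the helper and the index name are illustrative and depend on how Django named the index in your schema:

from django.db import connection

def relation_size(name):
    # pg_relation_size returns bytes; pg_size_pretty formats them for humans.
    with connection.cursor() as cursor:
        cursor.execute("SELECT pg_size_pretty(pg_relation_size(%s))", [name])
        return cursor.fetchone()[0]

# Usage (index name is hypothetical):
# relation_size('somemodel_created_brin_idx')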
There is much more to creating an index than its size. But now, with Django 1.11’s index support, we can easily integrate new types of indexes into our applications, making them lighter and faster.
—
Django: 9 things you need to know about interacting with databases.
What do you think of the article? I personally think it’s awesome!
If you find this kind of article useful, please give it a like. Thank you!