Nested Queries with SQLAlchemy

One of the most valuable aspects of running popular courses online is that every now and then I get a question that forces me to learn something new. The other day, a reader asked me how they wrote a database query with an unusual sort, and I had to stop for a minute (ok, about half an hour) to think about how to execute this query in the context of Flask and flask-SQLAlchemy applications. Are you ready to see some advanced SQLAlchemy operations?

The problem

The reader has a database with an order table that looks something like this:

id customer_id order_date
1 1 2018-01-01
2 1 2018-01-05
3 3 2018-01-07
4 1 2018-02-06
5 3 2018-01-31
6 2 2018-02-01
7 3 2018-02-01
8 3 2018-01-20
9 2 2018-02-07

The question is how to sort the items in the table by the customer_id field, but instead of using simple ascending or descending order, my readers need to sort the list by the customer’s last order.

So basically, the ideal order is this:

id customer_id order_date
9 2 2018-02-07
6 2 2018-02-01
4 1 2018-02-06
2 1 2018-01-05
1 1 2018-01-01
7 3 2018-02-01
5 3 2018-01-31
8 3 2018-01-20
3 3 2018-01-07

In this table, the record for Customer 2 appears first, because when you look at the original table, you can see that the customer placed the most recent order, with ID 9, on February 7. This is the last order stored in the table, so this customer is the most recent, so it ranks first among customers. Records are arranged in descending order by order date.

The next set of records is for Customer 1, because that customer is the second most recent customer, whose order was on February 6. Again, the customer’s three orders are placed next to each other in descending order by date.

Finally, Customer 3 was the first of the three to place an order, with the latest order being placed on February 1. The customer’s four orders are placed at the bottom, again in descending order by order date.

I can’t immediately think of a SQLAlchemy query that could do this. Would you like to see if you can solve this problem before reading on? To make things easier for you, I created a GIST on GitHub that contains the full working sample application and database you can use. All you need to do is grab the file and write your query where the comment indicates! Open sample GIST

To use this application, you need to create a virtual environment and install Flask-SQLAlchemy. Then simply run the script. In this application, I’m using an in-memory SQLite database, so you don’t have to worry about creating a database, creating a new, clean database every time you run the script.

If you want to know how I solved this problem, read on.

Subquery

The solution to the above problem cannot be solved by a simple query (at least I don’t think so, but would love to be proved wrong!). . To be able to sort rows as needed, we need to create two queries and combine them.

The first part of the solution is to determine the order in which the customer needs to appear in the query. To do this, we basically need to look at each customer’s last order. One simple approach is to compress or group order tables.

In a relational database, the group operation looks at the value of a particular column and collapses all rows with the same value into a single row in a temporary grouped table. For our example, when we group by the Customer_ID field, we end up with a grouped table with three rows, one for each customer.

However, the tricky part is how to combine all rows with the same customer_id into one line. To do this, SQL provides aggregate functions that take a list of values and produce a single result. Common aggregate functions used in all SQL implementations are sum, AVG, min, Max, and so on. In our example, we want to know the date of each customer’s last order, so we can use Max (order_date) to create our grouped table.

I’m not sure if this works for other SQLAlchemy developers, but for unusual queries, I find it easier to find a solution using raw SQL, and then rewrite it to SQLAlchemy once I have it. In SQL, grouping is done with the following statements:

SELECT customer_id, max(order_date) AS last_order_date FROM orders GROUP BY customer_id
Copy the code

This query merges all rows with the same customer_ID into one, and for the ORDER_DATE column, it calculates the maximum value of all rows and merges them into one row. The original table also had the ID column for the ORDER primary key, but I didn’t reference it in the SELECT section of this query, so it won’t be included in the grouping result, which IS exactly what I want, because I really don’t have a proper way to aggregate the ID field.

If you run the above query against the sample data I provided, the result would be:

customer_id last_order_date
1 2018-02-06
2 2018-02-07
3 2018-02-01

Now that we have a working SQL query, we can convert it to flask-SQLalchemy:

last_orders = db.session.query(
    Order.customer_id, db.func.max(Order.order_date).label('last_order_date')
).group_by(Order.customer_id).subquery()
Copy the code

If you use normal SQLAlchemy, you can replace the above db.session with your own session object. Flask-sqlalchemy makes it easier to handle sessions, so I’ve always preferred to use this extension.

If you’re used to running queries that start with Order.query.something or session.query(Order).something with ordinary SQLAlchemy, this must look strange. Notice that in the original SQL example, I had two entities in the SELECT section. The standard SQLAlchemy query you see is a simplified form for querying an entire row from a single table. If the query returns results from two or more tables, or in this case, a combination of real and aggregated columns, then you must use this more detailed form, which requires specifying the columns returned by the query as arguments to the session.query() method.

The two parameters of session.query() are directly converted from the SQL SELECT, and they are the customer_ID column and the Max (order_date) aggregate column. Notice the AS LAST_ORDER_date section of the SQL statement, which provides an alias for the aggregate column. In SQLAlchemy, the label() method applied to the column achieves the same result.

Grouping is done using the group_by() query method, which uses columns as grouping parameters, the same as the GROUP BY item in SQL.

The statement ends with a call to subQuery (), which tells SQLAlchemy that the intent is to use it in a larger query rather than on its own. In practice, this means that we will be able to treat the subquery object as a real database table, when in fact it is a dynamically created temporary table. You’ll see how it works in the next section.

Join

Now that we know the order we want Customers to return, we must merge the order into the original table. The most straightforward approach is to first merge the Orders table with the last_Orders subquery obtained in the previous section.

To join two tables in a relational database, we use the Join operation. Join takes each row in the Orders table, matches it with the corresponding row in the last_Orders subquery, and finally generates a new combined row containing columns from both tables. The result of the JOIN operation will be another dynamic table. Using raw SQL, joins to subqueries are done as follows:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
Copy the code

Here, we JOIN(…) The AS name construct has the subquery from the previous section and maps the last_Orders name to the subquery result. Other parts of the query can then use this name to refer to the individual columns of those results. The ON section specifies the join condition for the two tables, in this case it is a simple condition that matches only rows with the same customer_id value.

In our example, the join will return the following data:

id customer_id order_date last_order_date
1 1 2018-01-01 2018-02-06
2 1 2018-01-05 2018-02-06
3 3 2018-01-07 2018-02-01
4 1 2018-02-06 2018-02-06
5 3 2018-01-31 2018-02-01
6 2 2018-02-01 2018-02-07
7 3 2018-02-01 2018-02-01
8 3 2018-01-20 2018-02-01
9 2 2018-02-07 2018-02-07

Now that we have the last order date that each customer added to the order, we can sort the table in descending order by the virtual last_order_date column, which satisfies the first sorting criteria in our problem statement:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
ORDER BY last_order_date DESC
Copy the code

But we’re not done yet, because we need to implement a secondary sort. In each customer, we need to provide results in descending order by order date. This can be done by additional sorting using the original order_DATE field.

Here is the full SQL statement:

SELECT * FROM orders JOIN (
  SELECT customer_id, max(order_date) AS last_order_date FROM order GROUP BY customer_id
) AS last_orders
ON orders.customer_id = last_orders.customer_id
ORDER BY last_order_date DESC, orders.order_date DESC
Copy the code

The conversion to SQLAlchemy is fairly straightforward, but we’ll separate the subqueries to avoid getting too complicated in a single statement. Here is the SQLAlchemy version of the query above:

last_orders = db.session.query(
    Order.customer_id, db.func.max(Order.order_date).label('last_order_date')
).group_by(Order.customer_id).subquery()

query = Order.query.join(
    last_orders, Order.customer_id == last_orders.c.customer_id
).order_by(last_orders.c.last_order_date.desc(), Order.order_date.desc())
Copy the code

The LAST_Orders subquery is a copy of the subquery I described in the previous section. Note that nothing has been sent to the database at this point, and storing subqueries in local variables ahead of time will not trigger additional database queries.

In the second statement, we take the Order model and join it with the last_Orders subquery. Subquery objects work like SQLAlchemy tables, so we can refer to individual columns using the table.c.column_name syntax. C confused a lot of people, and unfortunately SQLAlchemy used this strange name as a container for columns in a table object.

The join() method takes two parameters, first the right table in the join(the last_Orders subquery), and then the join condition, that is, the customer_ID column in the two tables matches. Once the connection is in place, you can specify the order, which is a direct conversion of the two sequential statements in the SQL example. Notice how the virtual last_order_date column is referenced with that strange C as last_Orders.c. ast_order_date, but order_date in the Order model is referenced directly as an attribute. The difference here is that Order is a model, whereas last_Orders is a generic table with results. Models have higher-level interfaces than tables, so they are easier to use.

As a final exercise, I want to see how my manual SQL stacks up against the SQL generated by SQLAlchemy using the above query. If you don’t know this trick, you can get the SQL generated by SQLAlchemy for any query object by converting the query to a string:

print(str(query))
Copy the code

The SQLAlchemy query above generates the following raw SQL:

SELECT 
    orders.id AS orders_id,
    orders.customer_id AS orders_customer_id,
    orders.order_date AS orders_order_date
FROM orders JOIN (
    SELECT
        orders.customer_id AS customer_id,
        max(orders.order_date) AS last_order_date
    FROM orders GROUP BY orders.customer_id
) AS anon_1
ON orders.customer_id = anon_1.customer_id
ORDER BY anon_1.last_order_date DESC, orders.order_date DESC
Copy the code

If you ignore the slightly increased verbosity of the generated statement, things are almost the same. SQLAlchemy likes to create an alias for each column in the query, so you see a lot of use of AS name constructs. The subquery is the same as the original SQL, but without context SQLAlchemy gave it a generic anon_1 name instead of the more explicit last_Orders.

The translator’s note: In order to solve the situation where different customers have the same last order date, ORDER BY last_order_date DESC, orders.customer_id DESC, orders.order_date DESC