Building a Django driver for Psycopg 3
Posted by Daniele Varrazzo on 2021-08-02
Tagged as
psycopg3,
development,
recipe
One of the goals of the Psycopg 3 project is to make easy to port code developed from Psycopg 2. For this reason the creation of a Django backend (the module you specify in the settings as your database ENGINE) was a project with a double goal:
- A Django driver is a way to make Psycopg 3 useful from the start, with the possibility of dropping it in a project transparently and have available, when needed the new features offered (for instance the superior COPY support).
- The difficulty of introducing Psycopg 3 in the Django codebase and the type of changes required are indicative of the type of problems that could be found porting other projects.
...and it's done! A few days ago, the new Psycopg 3 Django backend could pass the entire Django test suite!
The implementation of the Django backend actually started several months ago, but it can be seen, from the test progression above, that its development had been suspended for several months. The problem, in the first attempts, was that too much of the Django code was in need of being adapted: this was a sign that the changes needed to use the new adapter were too invasive and that the same type of difficulties would have been met by everyone trying to replace Psycopg 2 with Psycopg 3. Back to the design board then, but hopefully the resulting adapter will behave mostly as you might expect and will not force users to change every query in their program (which would have been a deal breaker for most non-trivial projects).
The backend cannot be used with the current Django version: a few modifications to the Django codebase are needed in order to use it. These changes will be proposed to the Django project: if the Django maintainer will accept them, the driver should be usable starting from one of the next Django releases.
The aim of this article is to take a look at some of these modifications, to understand where the behaviour of Psycopg 3 diverges from its well known predecessor and how to work around the differences.
Server-side parameters binding
Many of these changes are the consequence of using server-side binding for the query parameters (using the libpq PQexecParams() function), instead of merging the arguments to the query on the client-side and using the simpler PQexec() function.
In the PQexec() case, the Postgres query parser has access to the literal values in the context where they are used and it looks like it is able to use this information in ways we don't appreciate...until we lose them. Do you think that text is the best PostgreSQL data type to convert Python strings to? I wish it was so simple. Below is an experiment with the psycopg.pq objects, the low level libpq wrapper that Psycopg 3 exposes:
>>> from psycopg.postgres import types >>> conn.execute("create table testjson (id serial primary key, data jsonb)") # <psycopg.Cursor [COMMAND_OK] [INTRANS] (database=piro) at 0x7f92d43d7b80> # Note: $1, $2... are the low level Postgres placeholders. # In a normal Psycopg query you would use classic '%s'. >>> conn.pgconn.exec_params( ... b"insert into testjson (data) values ($1)", ... [b"{}"], [types["text"].oid]) # <psycopg_c.pq.PGresult [FATAL_ERROR] at 0x7f92cec70a90> >>> print(_.error_message.decode("utf8")) # ERROR: column "data" is of type jsonb but expression is of type text # LINE 1: insert into testjson (data) values ($1) # ^ # HINT: You will need to rewrite or cast the expression.
Specifying the text Postgres type is an excessively strict type indication: in most cases, Postgres will not be able to automatically convert the value to the required type.
When we use a literal '{}' in the query, we are specifying an untyped literal. Postgres docs say that we can do the same using 0 as type OID for the parameter (see the paramTypes[] description). But it seems this isn't always the case. For instance:
>>> conn.execute("select concat(%s, %s)", ["foo", "bar"]) # ...becomes... >>> conn.pgconn.exec_params( ... b"select concat($1, $2)", ... [b"foo", b"bar"], [0, 0]) # <psycopg_c.pq.PGresult [FATAL_ERROR] at 0x7f92d43db4d0> >>> print(_.error_message.decode("utf8")) # ERROR: could not determine data type of parameter $1
This problem doesn't happen with every function: it seems to be only a problem with "variadic" functions, such as concat() or json_build_object(). As sporadic as it is, it doesn't seem like there is a universally correct way of mapping Python types to PostgreSQL type OIDs: we can try to get it right most of the time (so, by default, Psycopg 3 dumps Python strings using the OID 0), but places where this isn't right do exist...and they exist in Django, of course.
There are two different ways to work around the problem, both have their merit and one might be easier to use than the other in different contexts.
Add a cast to the placeholder: specifying %s::text (or other types) in your query, it is possible to disambiguate the type where "unknown" doesn't work:
>>> conn.execute("select concat(%s::text, %s::text)", ["foo", "bar"]) # ...becomes... >>> conn.pgconn.exec_params( ... b"select concat($1::text, $2::text)", ... [b"foo", b"bar"], [0, 0]) # <psycopg_c.pq.PGresult [TUPLES_OK] at 0x7f92cebfb630> >>> _.get_value(0, 0) # b'foobar'
One place in Django where this was needed is in array comparisons, because they follow stricter rules than the base type comparisons and may require an explicit cast to work.
The other option is to use a different type than the Python str and handle it using a different dumper.
>>> conn.execute("select concat(%s, %s)", [Text("foo"), Text("bar")]) # ...becomes... >>> conn.pgconn.exec_params( ... b"select concat($1, $2)", ... [b"foo", b"bar"], [types["text"].oid, types["text"].oid]) # <psycopg_c.pq.PGresult [TUPLES_OK] at 0x7f92cebfbbd0> >>> _.get_value(0, 0) # b'foobar'
This is proposed in Django and used, for instance, in the concat() case.
Note that both the solutions make the queries compatible with Psycopg 2 and 3: the %s::text casts are no problem in psycopg2 queries and psycopg2 is smart enough to notice that Text is a str subclass and to apply the same conversion rules used for the vanilla strings.
Server-side binding doesn't work for all the queries
Argument binding on the server only works for queries that select and modify data, but it doesn't work in the Data Definition Language. For example:
>>> conn.execute("create table test (id int default %s)", [42]) # Traceback (most recent call last): # ... # psycopg.errors.UndefinedParameter: there is no parameter $1 # LINE 1: create table test (id int default $1) # ^
The solution for this type of problem is to use the psycopg.sql module to explicitly generate a query client-side and send it to the server without parameters:
>>> conn.execute( ... sql.SQL("create table test (id int default {})") ... .format(sql.Literal(42)) ... )
A similar module is available in psycopg2 too so it is easy to write code that works for both versions: it's just the import statement that needs to be changed.
GROUP BY/ORDER BY with parameters
Another unexpected problem manifested with tests failing with a message like "column name must appear in the GROUP BY clause or be used in an aggregate function". This type of error appeared in tests leveraging Django ORM aggregation features, when the aggregation keys contain parameters.
For example, if you want to count your people by grouping them by the first two letters of their name, you may use a query such as:
SELECT left(name, 2) AS prefix, count(*) AS number FROM people GROUP BY left(name, 2) ORDER BY left(name, 2)
If the "2" is actually a parameter, Django ends up composing and executing a query like:
cursor.execute(""" SELECT left(name, %s) AS prefix, count(*) AS number FROM people GROUP BY left(name, %s) ORDER BY left(name, %s) """, [2, 2, 2])
If composed by the client, this query presents no problem, because the server query parser can clearly see the same expression in the output column and in the group/order predicates. However, moving to use server-side parameters, the query would be transformed to:
SELECT left(name, $1) AS prefix, count(*) AS number FROM people GROUP BY left(name, $2) ORDER BY left(name, $3)
This query could be executed only if the three parameters are the same, but at parsing time the server cannot make sure this will be the case, failing with the error above.
If this query was under someone's control, it could be easily rewritten using named parameters instead of positional ones. I would personally write:
cursor.execute(""" SELECT left(name, %(len)s) AS prefix, count(*) AS number FROM people GROUP BY left(name, %(len)s) ORDER BY left(name, %(len)s) """, {"len": 2})
which is transformed in a query with a single $1 placeholder used three times, and parsed successfully. Unfortunately Django only uses positional placeholders throughout its entire codebase and switching to a parameters mapping would be a very invasive change. A more localised change is to use column aliases: the same query can be rewritten as:
SELECT left(name, %s) AS prefix, count(*) AS number FROM people GROUP BY 1 ORDER BY 1
where "1" refers to the first output column. It is not a particularly loved syntax but it turns out useful here.
Probably not every database supports this syntax, so, in the proposed Django changeset, new feature parameters can be used to signal that the feature is accepted by specific databases, and currently enabled only for PostgreSQL.
A different package organisation
This is less mysterious to understand. The psycopg2 package has a bit of a chaotic organisation, with a couple of kitchen-sink modules (extensions and extras) containing a bit of everything: cursor subclasses, extra data types, utility functions, symbolic constants...starting from Psycopg 3, the package is organised in a more rational way and separated in different modules and sub-packages.
The bulk of the changes required to use Psycopg 3 from Django is just not to assume that talking to PostgreSQL will be done using psycopg2, but to import and use the objects according to the version of the driver in use.
Note that, with these modifications, Django is able to use both Psycopg 2 and 3, possibly both in the same project. While this is possible, it might not be the average use case in your project: more often than not you will be just interested in upgrading from Psycopg 2 to 3. In the simpler case of an update, you will just have to change your import statements, assuming unconditionally that the psycopg package is installed (there is no psycopg3 package: the same module name will do for the following major versions too) and just say goodbye to the glorious workhorse that psycopg2 has been so far. 👋
Takeaway points
In the design of Psycopg 3, a great effort has been made to allow a smooth adoption for users who have experience, and codebases, in psycopg2. The experience of the Django backend porting shows that most of the adjustments required fall in these categories:
- Different package organisation
- Over/under-specified types, which can be tweaked via casts and data wrappers
- Inability to use server-side parameters, which can be worked around with client-side query composition
Hopefully now you know how to address these problems in case you are thinking of using Psycopg 3 in your next or current project!