[Fixed]-Django query filter using large array of ids in Postgres DB

6👍

I found a solution building on @erwin-brandstetter answer using a custom lookup

from django.db.models import Lookup
from django.db.models.fields import Field

@Field.register_lookup
class EfficientInLookup(Lookup):

    lookup_name = "ineff"

    def as_sql(self, compiler, connection):
        lhs, lhs_params = self.process_lhs(compiler, connection)
        rhs, rhs_params = self.process_rhs(compiler, connection)
        params = lhs_params + rhs_params
        return "%s IN (SELECT unnest(%s))" % (lhs, rhs), params

This allows to filter like this:

MyModel.objects.filter(id__ineff=<list-of-values>)

3👍

The trick is to transform the array to a set somehow.

Instead of (this form is only good for a short array):

SELECT *
FROM   tbl t
WHERE  t.tbl_id = ANY($1);
-- WHERE  t.tbl_id IN($1);  -- equivalent

$1 being the array parameter.

You can still pass an array like you had it, but unnest and join. Like:

SELECT *
FROM   tbl t
JOIN   unnest($1) arr(id) ON arr.id = t.tbl_id;

Or you can keep your query, too, but replace the array with a subquery unnesting it:

SELECT * FROM tbl t
WHERE  t.tbl_id = ANY (SELECT unnest($1));

Or:

SELECT * FROM tbl t
WHERE  t.tbl_id IN    (SELECT unnest($1));

Same effect for performance as passing a set with a VALUES expression. But passing the array is typically much simpler.

Detailed explanation:

1👍

Is this an example of the first thing you’re asking?

relation_list = list(ModelA.objects.filter(id__gt=100))
obj_query = ModelB.objects.filter(a_relation__in=relation_list)

That would be an “IN” command because you’re first evaluating relation_list by casting it to a list, and then using it in your second query.

If instead you do the exact same thing, Django will only make one query, and do SQL optimization for you. So it should be more efficient that way.

You can always see the SQL command you’ll be executing with obj_query.query if you’re curious what’s happening under the hood.

Hope that answers the question, sorry if it doesn’t.

0👍

I had lots of trouble to make the custom lookup ‘ineff’ work.
I may have solved it, but would love some validation from Django and Postgres experts.

1) Using it ‘directly’ on a ForeignKey field (ModelB)

ModelA.objects.filter(ModelB__ineff=queryset_ModelB)

Throws the following exception:
“Related Field got invalid lookup: ineff”

ForeignKey fields cannot be used with custom lookups.

A similar issue is reported here:
Custom lookup is not being registered in Django

2) Using it ‘indirectly’ on the pk field of related model (ModelB.id)

ModelA.objects.filter(ModelB__id__ineff=queryset_ModelB.values_list('id', flat=True))

Throws the following exception:
“can only concatenate list (not “tuple”) to list”

Looking at Django Traceback, I noticed that rhs_params is a tuple.
Yet we try to add it to lhs_params (a list) in our custom lookup.

Hence I changed:

params = lhs_params + rhs_params

into:

params = lhs_params + list(rhs_params)

3) I then got a Postgres error (at least I had passed Django ORM)
“function unnest(uuid) does not exist”
“HINT: No function matches the given name and argument types. You might need to add explicit type casts.”

I apparently solved it by changing the sql:

from:

return "%s IN (SELECT unnest(%s))" % (lhs, rhs), params

to:

return "%s IN (SELECT unnest(ARRAY(%s)))" % (lhs, rhs), params    

Hence my final as_sql method looks like this:

def as_sql(self, compiler, connection):
    lhs, lhs_params = self.process_lhs(compiler, connection)
    rhs, rhs_params = self.process_rhs(compiler, connection)
    params = lhs_params + list(rhs_params)
    return "%s IN (SELECT unnest(ARRAY(%s)))" % (lhs, rhs), params

It seems to work, and is indeed faster than in__ (tested with EXPLAIN ANALYZE in Postgres).
But I would love to have some validation from experts, perhaps Erwin Brandstetter?
Thanks for your input.

👤edyas

Leave a comment