QuerySet

Introduction

This guideline will base on these models:

class User(models.Model):
    username = models.CharField(max_length=50)


class Person(models.Model):
    user = models.ForeignKey(
        'User',
        on_delete=models.CASCADE,
        related_name='people',
    )
    first_name = models.CharField(max_length=50)
    last_name = models.CharField(max_length=150)
    employee_id = models.CharField(max_length=15, unique=True)
    active = models.Boolean(default=True, blank=True)


class Department(models.Model):
    name = models.CharField(max_length=255)

    members = models.ManyToManyField(
        'Person',
        related_name='departments',
    )

Filtering

Joint Field Filtering

# Yes
def find_people_yes(query: str):
    return Person.objects.annotate(
        full_name=Concat('first_name', Value(' '), 'last_name'),
    ).filter(
        full_name__icontains=query,
    )

# No
def find_people_no(query: str):
    return Person.objects.filter(
        Q(first_name__icontains=query)
        | Q(last_name__icontains=query)
    )

First, we try to filter people with name John...

people = find_people_yes(query='John')
print(people)
# <QuerySet [<Person: John Doe>, <Person: John Wick>]>

people = find_people_no(query='John')
print(people)
# <QuerySet [<Person: John Doe>, <Person: John Wick>]>

Everything seems fine, right? Both techniques return the same results. But what about filtering person's name by John Doe?

people = find_people_yes(query='John Doe')
print(people)
# <QuerySet [<Person: John Doe>]>

people = find_people_no(query='John Doe')
print(people)
# <QuerySet []>

Now you see the difference? It's because the second method (find_people_no) does not support filtering across two fields.

But in the other side, the second method give us a better performance if both fields are db_index'ed. The first method was using the annotated field which is not indexed.

However, after considered those factors, we would suggest you to go with the first method.

Using `.count()` instead of `len()`

# Yes
Person.objects.count()     # SELECT COUNT(*) FROM person

# No
len(Person.objects.all())  # SELECT * FROM person

Calling Person.objects.count() will count directly inside the database
But, calling len(Person.objects.all()) will fetch all data from the database and count them in Python instead

So, using an aggregate function from database always consume less time that getting every records.

Using `.exists()` to check if any data exists

# Yes
if Person.objects.filter(first_name='John').exists():
    do_something()

# No
if Person.objects.filter(first_name='John'):
    do_something()

Using .exists() consume less resources than not using it.

Using `.first()` instead of calling `.exists()`, followed by `.get()`

# Yes
matched_person = Person.objects.filter(employee_id='123').first()
if matched_person is not None:
    do_something(matched_person)

# No
if Person.objects.filter(employee_id='123').exists():
    matched_person = Person.objects.get(employee_id='123')
    do_something(matched_person)

Calling .first() method will hit database only one time, and you will also get the object.
Calling .exists() method hit database one time, then calling .get() hit database one more time.

In this case, using .first() has better performance since it hit database only one.

You can read more about When QuerySets are evaluated.

Avoid filtering across many relationship

Filtering across one-to-many or many-to-many relationship can cause an unexpected result such as duplicated records.

Suppose we have two departments

hr = Department.objects.create(name='HRD')
accounting = Department.objects.create(name='HRM')

John Doe is in both departments

john_doe = Person.objects.create(...)
hr.add(john_doe)
accounting.add(john_doe)

If we try to get person who is in department that it's name contains word HR

Person.objects.filter(departments__name__icontains='HR')
# <QuerySet [<Person: John Doe>, <Person: John Doe>]>

We will get duplicated records because John Doe is in both departments. The solution for this case is to use .distinct()

Person.objects.filter(departments__name__icontains='HR').distinct()
# <QuerySet [<Person: John Doe>]>

If you are using PostgreSQL as a database backend, you can specify field id in .distinct() to improve the performance. Note that you will need to order the queryset by id as well

Person.objects.filter(departments__name__icontains='HR').order_by('id').distinct('id')
# <QuerySet [<Person: John Doe>]>

No need to use `.all()` before `.filter()`

If you're using .filter() in your queryset, then you don't need to provide .all() anymore

# Yes
Person.objects.filter(first_name='John')

# No
Person.objects.all().filter(first_name='John')

Avoid filtering against `JSONField`

Filtering against JSONField can cause a huge performance degrades.

If it is really needed to filter, then you might need to consider refactoring your model by using database relation (such as one-to-many or many-to-many) to represent your data instead.

`Q() | Q()` is more preferred than union queryset

# Yes
queryset = Person.objects.filter(
    Q(first_name='John') | Q(last_name='Doe')
)

# No
queryset = Person.objects.filter(first_name='John')
queryset.union(
    Person.objects.filter(last_name='Doe')
)

The first method will generate a query such as SELECT ... FROM ... WHERE ... OR ... which is faster than the second one which will generate SELECT ... FROM ... WHERE ... UNION SELECT ... FROM ... WHERE ....

Manipulating Data

`.update()` is more preferred than looping save models

# Yes
Person.objects.update(active=False)  # hit database

# No
for person in Person.objects.all():
    person.active = False
    person.save()                    # hit database (n times)

Calling QuerySet.update() will produce a single query using UPDATE ... WHERE .... It is faster than looping update the model in Python which hit database every iteration.

WARNING: Calling .update() won't trigger post_save signal.

Refresh model from database

# Getting John
john = Person.objects.get(first_name='John', active=True)
# Set John to inactive
Person.objects.filter(pk=person.pk).update(active=False)

print(john.active)
# True

john.refresh_from_db(fields=['active'])
print(john.active)
# False

When you update a model field using .update(), a local instance won't update automatically. You'll need to call .refresh_from_db() to refresh an instance from database.

Sometimes you might need refresh the whole queryset. In this case, you can query it again by calling .all()

queryset = Person.objects.all()
queryset.filter(first_name='John').update(active=False)

# update the whole queryset
queryset = queryset.all()

Using `.select_related()` and `.prefetch_related()`

# Yes
person = Person.objects \
    .select_related('user') \
    .prefetch_related('departments') \
    .first()                                 # hit database
print(person.first_name)
print(person.user.username)
print(person.departments.all())

# No
person = Person.objects.first()              # hit database
print(person.first_name)
print(person.user.username)                  # hit database
print(person.departments.all())              # hit database

The second method seems short and easy to understand. But behind that, it hits database 3 times.

WARNING: Calling person.departments.filter(name='HR') in both methods will hit database since it needs to query database to find the records which matched the given condition. prefetch_related does not help you in this situation.

any_user = User.objects.first()

# No need to use `select_related` since you only filter it.
Person.objects.filter(user=any_user)

# You need to, because you're accessing user instance
person = Person.objects.select_related('user').first()
do_something(person.user.username)

person = Person.objects.first()
user = User.objects.first()

# Yes
if person.user_id == user.pk:    # no database hit
    do_something()

# No
if person.user == user:          # hit database
    do_something()

The first method doesn't hit the database because field user_id is in Person instance locally.

The second one hit the database because accessing user property makes Django to query User instance.

Custom Manager

Repeat filtering logics

If you have any repeat filtering logics, it is recommended to adding an extra manager function instead of repeating yourself in multiple location.

class PersonManager(models.Manager):
    def only_inactive(self):
        return self.filter(active=False)


class Person(models.Model):
    ...
    objects = PersonManager()


# Then you can call it with
Person.objects.only_inactive()

Override default queryset

For example, if you want the default behavior to filter only active person.

class PersonManager(models.Manager):
    def get_queryset(self):
        return super().get_queryset().filter(active=True)


class Person(models.Model):
    ...
    objects = PersonManager()


Person.objects.all()  # will now showing only active person

Other

Using `.only()` to optimize performance

In case you want to process only person's first name, using .only() will result a better performance.

Person.objects.only('first_name')  # faster
Person.objects.all()

WARNING: If you try to get field that is not in .only(), Django will query the database again to get the field.

Difference between `.only()` and `.values()`

Both of them are getting only for selected field(s). But,

.only() produce a model instance which you still can get any other field from the result
.values() produce a dictionary. So, you cannot get any other field from the result

person = Person.objects.only('first_name').first()
print(person)
# <Person: John Doe>
print(person.first_name)
# John
print(person.last_name)     # hit database
# Doe

person = Person.objects.values('first_name').first()
print(person)
# {'first_name': 'John'}
print(person['first_name'])
# John
print(person['last_name'])  # raise KeyError

No Raw SQL

Writing query in Raw SQL format can cause many problems.

The most important reason is the difficulty to maintain. If you write a lot of queries, it will be difficult to understand the logics of those queries. Makes it difficult to modify as well.

The second one is security. Writing Raw SQL can cause a SQL injection if you don't carefully escape every parameters used in the queries.

The last reason is it can goes wrong easily. By using Raw SQL, you'll need to write it as a plain text. If you're using IDE that support validating SQL language, you might be lucky. But if you aren't, you'll easily mess things up. Checking with a real database query should be done frequently.

Django provides a lot of utility functions which can help you writing queries without using Raw SQL such as Subquery, Func, Aggregate, and etc.

Anyway, there might be a case that you cannot prevent using Raw SQL. If that time comes, please be very careful.

Table of Contents

QuerySet

Introduction

Filtering

Joint Field Filtering

Using `.count()` instead of `len()`

Using `.exists()` to check if any data exists

Using `.first()` instead of calling `.exists()`, followed by `.get()`

Avoid filtering across many relationship

No need to use `.all()` before `.filter()`

Avoid filtering against `JSONField`

`Q() | Q()` is more preferred than union queryset

Manipulating Data

`.update()` is more preferred than looping save models

Refresh model from database

Using `.select_related()` and `.prefetch_related()`

Custom Manager

Repeat filtering logics

Override default queryset

Other

Using `.only()` to optimize performance

Difference between `.only()` and `.values()`

No Raw SQL

References

Table of Contents

Introduction

Filtering

Joint Field Filtering

Using .count() instead of len()

Using .exists() to check if any data exists

Using .first() instead of calling .exists(), followed by .get()

Avoid filtering across many relationship

No need to use .all() before .filter()

Avoid filtering against JSONField

Q() | Q() is more preferred than union queryset

Manipulating Data

.update() is more preferred than looping save models

Refresh model from database

Related Models

Using .select_related() and .prefetch_related()

Select/Prefetch related only when you're using those values

Avoid comparing instance if not select/prefetch related

Custom Manager

Repeat filtering logics

Override default queryset

Other

Using .only() to optimize performance

Difference between .only() and .values()

No Raw SQL

References

Using `.count()` instead of `len()`

Using `.exists()` to check if any data exists

Using `.first()` instead of calling `.exists()`, followed by `.get()`

No need to use `.all()` before `.filter()`

Avoid filtering against `JSONField`

`Q() | Q()` is more preferred than union queryset

`.update()` is more preferred than looping save models

Using `.select_related()` and `.prefetch_related()`

Using `.only()` to optimize performance

Difference between `.only()` and `.values()`