Skip to content

Commit

Permalink
Add selectin loading
Browse files Browse the repository at this point in the history
Adding a new kind of relationship loader that is
a cross between the "immediateload" and the "subquery"
eager loader, using an IN criteria to load related items
in bulk immediately after the lead query result is loaded.

Change-Id: If13713fba9b465865aef8fd50b5b6b977fe3ef7d
Fixes: #3944
  • Loading branch information
zzzeek committed Apr 26, 2017
1 parent 029d0f7 commit 19d2424
Show file tree
Hide file tree
Showing 16 changed files with 2,691 additions and 18 deletions.
16 changes: 16 additions & 0 deletions doc/build/changelog/changelog_12.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,22 @@
inner element is negated correctly, when the :func:`.not_` modifier
is applied to the labeled expression.

.. change:: 3944
:tags: feature, orm
:tickets: 3944

Added a new kind of eager loading called "selectin" loading. This
style of loading is very similar to "subquery" eager loading,
except that it uses an IN expression given a list of primary key
values from the loaded parent objects, rather than re-stating the
original query. This produces a more efficient query that is
"baked" (e.g. the SQL string is cached) and also works in the
context of :meth:`.Query.yield_per`.

.. seealso::

:ref:`change_3944`

.. change::
:tags: bug, orm
:tickets: 3967
Expand Down
91 changes: 91 additions & 0 deletions doc/build/changelog/migration_12.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,97 @@ very unusual cases, such as a relationship that uses a custom

:ticket:`3954`

.. _change_3944:

New "selectin" eager loading, loads all collections at once using IN
--------------------------------------------------------------------

A new eager loader called "selectin" loading is added, which in many ways
is similar to "subquery" loading, however produces a simpler SQL statement
that is cacheable as well as more efficient.

Given a query as below::

q = session.query(User).\
filter(User.name.like('%ed%')).\
options(subqueryload(User.addresses))

The SQL produced would be the query against ``User`` followed by the
subqueryload for ``User.addresses`` (note the parameters are also listed)::

SELECT users.id AS users_id, users.name AS users_name
FROM users
WHERE users.name LIKE ?
('%ed%',)

SELECT addresses.id AS addresses_id,
addresses.user_id AS addresses_user_id,
addresses.email_address AS addresses_email_address,
anon_1.users_id AS anon_1_users_id
FROM (SELECT users.id AS users_id
FROM users
WHERE users.name LIKE ?) AS anon_1
JOIN addresses ON anon_1.users_id = addresses.user_id
ORDER BY anon_1.users_id
('%ed%',)

With "selectin" loading, we instead get a SELECT that refers to the
actual primary key values loaded in the parent query::

q = session.query(User).\
filter(User.name.like('%ed%')).\
options(selectinload(User.addresses))

Produces::

SELECT users.id AS users_id, users.name AS users_name
FROM users
WHERE users.name LIKE ?
('%ed%',)

SELECT users_1.id AS users_1_id,
addresses.id AS addresses_id,
addresses.user_id AS addresses_user_id,
addresses.email_address AS addresses_email_address
FROM users AS users_1
JOIN addresses ON users_1.id = addresses.user_id
WHERE users_1.id IN (?, ?)
ORDER BY users_1.id
(1, 3)

The above SELECT statement includes these advantages:

* It doesn't use a subquery, just an INNER JOIN, meaning it will perform
much better on a database like MySQL that doesn't like subqueries

* Its structure is independent of the original query; in conjunction with the
new :ref:`expanding IN parameter system <change_3953>_` we can in most cases
use the "baked" query to cache the string SQL, reducing per-query overhead
significantly

* Because the query only fetches for a given list of primary key identifiers,
"selectin" loading is potentially compatible with :meth:`.Query.yield_per` to
operate on chunks of a SELECT result at a time, provided that the
database driver allows for multiple, simultaneous cursors (SQlite, Postgresql;
**not** MySQL drivers or SQL Server ODBC drivers). Neither joined eager
loading nor subquery eager loading are compatible with :meth:`.Query.yield_per`.

The disadvanages of selectin eager loading are potentially large SQL
queries, with large lists of IN parameters. The list of IN parameters themselves
are chunked in groups of 500, so a result set of more than 500 lead objects
will have more additional "SELECT IN" queries following. Also, support
for composite primary keys depends on the database's ability to use
tuples with IN, e.g.
``(table.column_one, table_column_two) IN ((?, ?), (?, ?) (?, ?))``.
Currently, Postgresql and MySQL are known to be compatible with this syntax,
SQLite is not.

..seealso::

:ref:`selectin_eager_loading`

:ticket:`3944`

.. _change_3229:

Support for bulk updates of hybrids, composites
Expand Down
4 changes: 3 additions & 1 deletion doc/build/faq/ormconfiguration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -328,7 +328,9 @@ The primary key is a good choice for this::

Note that the :func:`.joinedload` eager loader strategy does not suffer from
the same problem because only one query is ever issued, so the load query
cannot be different from the main query.
cannot be different from the main query. Similarly, the :func:`.selectinload`
eager loader strategy also does not have this issue as it links its collection
loads directly to primary key values just loaded.

.. seealso::

Expand Down
173 changes: 169 additions & 4 deletions doc/build/orm/loading_relationships.rst
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,12 @@ The primary forms of relationship loading are:
related table to be loaded to load all members of related collections / scalar
references at once. Subquery eager loading is detailed at :ref:`subquery_eager_loading`.

* **select IN loading** - available via ``lazy='selectin'`` or the :func:`.selectinload`
option, this form of loading emits a second (or more) SELECT statement which
assembles the primary key identifiers of the parent objects into an IN clause,
so that all members of related collections / scalar references are loaded at once
by primary key. Select IN loading is detailed at :ref:`selectin_eager_loading`.

* **raise loading** - available via ``lazy='raise'``, ``lazy='raise_sql'``,
or the :func:`.raiseload` option, this form of loading is triggered at the
same time a lazy load would normally occur, except it raises an ORM exception
Expand All @@ -69,7 +75,7 @@ at mapping time to take place in all cases where an object of the mapped
type is loaded, in the absense of any query-level options that modify it.
This is configured using the :paramref:`.relationship.lazy` parameter to
:func:`.relationship`; common values for this parameter
include ``"select"``, ``"joined"``, and ``"subquery"``.
include ``select``, ``joined``, ``subquery`` and ``selectin``.

For example, to configure a relationship to use joined eager loading when
the parent object is queried::
Expand Down Expand Up @@ -99,7 +105,7 @@ is to set them up on a per-query basis against specific attributes. Very detail
control over relationship loading is available using loader options;
the most common are
:func:`~sqlalchemy.orm.joinedload`,
:func:`~sqlalchemy.orm.subqueryload`,
:func:`~sqlalchemy.orm.subqueryload`, :func:`~sqlalchemy.orm.selectinload`
and :func:`~sqlalchemy.orm.lazyload`. The option accepts either
the string name of an attribute against a parent, or for greater specificity
can accommodate a class-bound attribute directly::
Expand Down Expand Up @@ -348,7 +354,10 @@ in play.
To "batch" queries with arbitrarily large sets of result data while maintaining
compatibility with collection-based joined eager loading, emit multiple
SELECT statements, each referring to a subset of rows using the WHERE
clause, e.g. windowing.
clause, e.g. windowing. Alternatively, consider using "select IN" eager loading
which is **potentially** compatible with :meth:`.Query.yield_per`, provided
that the database driver in use supports multiple, simultaneous cursors
(SQLite, Postgresql drivers, not MySQL drivers or SQL Server ODBC drivers).


.. _zen_of_eager_loading:
Expand Down Expand Up @@ -597,6 +606,13 @@ load the full contents of all collections at once, is therefore incompatible
with "batched" loading supplied by :meth:`.Query.yield_per`, both for collection
and scalar relationships.

The newer style of loading provided by :func:`.selectinload` solves these
limitations of :func:`.subqueryload`.

.. seealso::

:ref:`selectin_eager_loading`


.. _subqueryload_ordering:

Expand Down Expand Up @@ -629,6 +645,124 @@ that the inner query could return the wrong rows::

:ref:`faq_subqueryload_limit_sort` - detailed example

.. _selectin_eager_loading:

Select IN loading
-----------------

Select IN loading is similar in operation to subquery eager loading, however
the SELECT statement which is emitted has a much simpler structure than
that of subquery eager loading. Additionally, select IN loading applies
itself to subsets of the load result at a time, so unlike joined and subquery
eager loading, is compatible with batching of results using
:meth:`.Query.yield_per`, provided the database driver supports simultaneous
cursors.

.. versionadded:: 1.2

"Select IN" eager loading is provided using the ``"selectin"`` argument
to :paramref:`.relationship.lazy` or by using the :func:`.selectinload` loader
option. This style of loading emits a SELECT that refers to
the primary key values of the parent object inside of an IN clause,
in order to load related associations:

.. sourcecode:: python+sql

>>> jack = session.query(User).\
... options(selectinload('addresses')).\
... filter(or_(User.name == 'jack', User.name == 'ed')).all()
{opensql}SELECT
users.id AS users_id,
users.name AS users_name,
users.fullname AS users_fullname,
users.password AS users_password
FROM users
WHERE users.name = ? OR users.name = ?
('jack', 'ed')
SELECT
users_1.id AS users_1_id,
addresses.id AS addresses_id,
addresses.email_address AS addresses_email_address,
addresses.user_id AS addresses_user_id
FROM users AS users_1
JOIN addresses ON users_1.id = addresses.user_id
WHERE users_1.id IN (?, ?)
ORDER BY users_1.id, addresses.id
(5, 7)

Above, the second SELECT refers to ``users_1.id IN (5, 7)``, where the
"5" and "7" are the primary key values for the previous two ``User``
objects loaded; after a batch of objects are completely loaded, their primary
key values are injected into the ``IN`` clause for the second SELECT.

"Select IN" loading is the newest form of eager loading added to SQLAlchemy
as of the 1.2 series. Things to know about this kind of loading include:

* The SELECT statement emitted by the "selectin" loader strategy, unlike
that of "subquery", does not
require a subquery nor does it inherit any of the performance limitations
of the original query; the lookup is a simple primary key lookup and should
have high performance.

* The special ordering requirements of subqueryload described at
:ref:`subqueryload_ordering` also don't apply to selectin loading; selectin
is always linking directly to a parent primary key and can't really
return the wrong result.

* "selectin" loading, unlike joined or subquery eager loading, always emits
its SELECT in terms of the immediate parent objects just loaded, and
not the original type of object at the top of the chain. So if eager loading
many levels deep, "selectin" loading still uses exactly one JOIN in the statement.
joined and subquery eager loading always refer to multiple JOINs up to
the original parent.

* "selectin" loading produces a SELECT statement of a predictable structure,
independent of that of the original query. As such, taking advantage of
a new feature with :meth:`.ColumnOperators.in_` that allows it to work
with cached queries, the selectin loader makes full use of the
:mod:`sqlalchemy.ext.baked` extension to cache generated SQL and greatly
cut down on internal function call overhead.

* The strategy will only query for at most 500 parent primary key values at a
time, as the primary keys are rendered into a large IN expression in the
SQL statement. Some databases like Oracle have a hard limit on how large
an IN expression can be, and overall the size of the SQL string shouldn't
be arbitrarily large. So for large result sets, "selectin" loading
will emit a SELECT per 500 parent rows returned. These SELECT statements
emit with minimal Python overhead due to the "baked" queries and also minimal
SQL overhead as they query against primary key directly.

* "selectin" loading is the only eager loading that can work in conjunction with
the "batching" feature provided by :meth:`.Query.yield_per`, provided
the database driver supports simultaneous cursors. As it only
queries for related items against specific result objects, "selectin" loading
allows for eagerly loaded collections against arbitrarily large result sets
with a top limit on memory use when used with :meth:`.Query.yield_per`.

Current database drivers that support simultaneous cursors include
SQLite, Postgresql. The MySQL drivers mysqlclient and pymysql currently
**do not** support simultaneous cursors, nor do the ODBC drivers for
SQL Server.

* As "selectin" loading relies upon IN, for a mapping with composite primary
keys, it must use the "tuple" form of IN, which looks like
``WHERE (table.column_a, table.column_b) IN ((?, ?), (?, ?), (?, ?))``.
This syntax is not supported on every database; currently it is known
to be only supported by modern Postgresql and MySQL versions. Therefore
**selectin loading is not platform-agnostic for composite primary keys**.
There is no special logic in SQLAlchemy to check ahead of time which platforms
support this syntax or not; if run against a non-supporting platform (such
as SQLite), the database will return an error immediately. An advantage to SQLAlchemy
just running the SQL out for it to fail is that if a database like
SQLite does start supporting this syntax, it will work without any changes
to SQLAlchemy.

In general, "selectin" loading is probably superior to "subquery" eager loading
in most ways, save for the syntax requirement with composite primary keys
and possibly that it may emit many SELECT statements for larger result sets.
As always, developers should spend time looking at the
statements and results generated by their applications in development to
check that things are working efficiently.

.. _what_kind_of_loading:

Expand Down Expand Up @@ -666,7 +800,27 @@ references a scalar many-to-one reference.
* When multiple levels of depth are used with joined or subquery loading, loading collections-within-
collections will multiply the total number of rows fetched in a cartesian fashion. Both
joined and subquery eager loading always join from the original parent class; if loading a collection
four levels deep, there will be four JOINs out to the parent.
four levels deep, there will be four JOINs out to the parent. selectin loading
on the other hand will always have exactly one JOIN to the immediate
parent table.

* Using selectin loading, the load of 100 objects will also emit two SQL
statements, the second of which refers to the 100 primary keys of the
objects loaded. selectin loading will however render at most 500 primary
key values into a single SELECT statement; so for a lead collection larger
than 500, there will be a SELECT statement emitted for each batch of
500 objects selected.

* Using multiple levels of depth with selectin loading does not incur the
"cartesian" issue that joined and subquery eager loading have; the queries
for selectin loading have the best performance characteristics and the
fewest number of rows. The only caveat is that there might be more than
one SELECT emitted depending on the size of the lead result.

* selectin loading, unlike joined (when using collections) and subquery eager
loading (all kinds of relationships), is potentially compatible with result
set batching provided by :meth:`.Query.yield_per` assuming an appropriate
database driver, so may be able to allow batching for large result sets.

* Many to One Reference

Expand All @@ -692,6 +846,12 @@ references a scalar many-to-one reference.
joined loading, however, except perhaps that subquery loading can use an INNER JOIN in all cases
whereas joined loading requires that the foreign key is NOT NULL.

* Selectin loading will also issue a second load for all the child objects (and as
stated before, for larger results it will emit a SELECT per 500 rows), so for a load of 100 objects
there would be two SQL statements emitted. The query itself still has to
JOIN to the parent table, so again there's not too much advantage to
selectin loading for many-to-one vs. joined eager loading save for the
use of INNER JOIN in all cases.

Polymorphic Eager Loading
-------------------------
Expand All @@ -707,6 +867,7 @@ Wildcard Loading Strategies
---------------------------

Each of :func:`.joinedload`, :func:`.subqueryload`, :func:`.lazyload`,
:func:`.selectinload`,
:func:`.noload`, and :func:`.raiseload` can be used to set the default
style of :func:`.relationship` loading
for a particular query, affecting all :func:`.relationship` -mapped
Expand Down Expand Up @@ -1011,6 +1172,10 @@ Relationship Loader API

.. autofunction:: raiseload

.. autofunction:: selectinload

.. autofunction:: selectinload_all

.. autofunction:: subqueryload

.. autofunction:: subqueryload_all
2 changes: 2 additions & 0 deletions lib/sqlalchemy/orm/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -240,6 +240,8 @@ def clear_mappers():
lazyload_all = strategy_options.lazyload_all._unbound_all_fn
subqueryload = strategy_options.subqueryload._unbound_fn
subqueryload_all = strategy_options.subqueryload_all._unbound_all_fn
selectinload = strategy_options.selectinload._unbound_fn
selectinload_all = strategy_options.selectinload_all._unbound_all_fn
immediateload = strategy_options.immediateload._unbound_fn
noload = strategy_options.noload._unbound_fn
raiseload = strategy_options.raiseload._unbound_fn
Expand Down
Loading

0 comments on commit 19d2424

Please sign in to comment.