Ruby on Rails | Screencasts | Download | Documentation | Weblog | Community | Source

Ticket #9497 (closed enhancement: wontfix)

Opened 2 years ago

Last modified 2 years ago

[PATCH] Removing eager joins from limited eager loading pre-query

Reported by: dasil003 Assigned to: core
Priority: normal Milestone: 1.x
Component: ActiveRecord Version: edge
Severity: normal Keywords: patch eager loading
Cc: me@julik.nl

Description

I expect there will be controversy around this patch. It does break 11 existing tests, but it offers a huge performance benefit under certain circumstances and I don't believe much is lost.

The limited eager-loading pre-query currently adds all the LEFT OUTER JOINs when it is fetching the ids for use in the second query. It does this for the purpose of allowing :order or :conditions clauses to affect the selection through the eager-loaded tables. However in practice I don't think WHERE or ORDER BY clauses referring to the eager loaded tables are all that useful, especially since we are just loading the IDs of the base table. The utility of conditions is reduced here because all of the association will be loaded anyway. So if you put a condition on the eager-loaded model, it will have to apply to ALL associated objects in order to actually restrict the IDs that are selected. Similarly, ORDER BY on an association table has a narrow use-case.

In contrast, by using explicit :joins and :group you can achieve a much wider range of WHERE and ORDER BY clauses than what the vanilla LEFT OUTER JOINs can provide (especially considering you can't modify the ON clause for the eager-loaded queries).

Clearly there is a use case for the current behavior. The reason I want to change it is because it can offer an optimization of several orders of magnitude in cases where:

A) the base table result set is large and B) there are several eager loaded tables being joined

Just to show you what I'm talking about, here is the same page loaded back to back with and without this patch respectively:

Property Load IDs For Limited Eager Loading (0.096613) SELECT DISTINCT properties.id FROM properties INNER JOIN idxes as qidx_association_idxes ON qidx_association_idxes.id = properties.idx_id AND qidx_association_idxes.association_id = 2 WHERE (properties.idx_id IN( 7 )) AND (properties.idx_id IN( 7 ) AND properties.type IN( 'Land' )) ORDER BY price DESC LIMIT 0, 10

Property Load IDs For Limited Eager Loading (31.864761) SELECT DISTINCT properties.id FROM properties LEFT OUTER JOIN photos ON photos.property_id = properties.id LEFT OUTER JOIN open_houses ON open_houses.property_id = properties.id AND open_houses.start_time > DATE(NOW()) LEFT OUTER JOIN agents_properties ON agents_properties.property_id = properties.id LEFT OUTER JOIN agents ON agents.id = agents_properties.agent_id LEFT OUTER JOIN agent_photos ON agent_photos.agent_id = agents.id INNER JOIN idxes as qidx_association_idxes ON qidx_association_idxes.id = properties.idx_id AND qidx_association_idxes.association_id = 2 WHERE (properties.idx_id IN( 7 )) AND (properties.idx_id IN( 7 ) AND properties.type IN( 'Land' )) ORDER BY price DESC LIMIT 0, 10

As you can see the first query is 300 times faster. And this does not even use very large tables. The properties table is about 15,000 rows, with the WHERE conditions and INNER JOIN optimizing it to an active set of under 10,000 rows.

Attachments

dont_add_eager_loading_joins_to_pre_query.diff (495 bytes) - added by dasil003 on 09/07/07 00:55:40.
only_include_referenced_tables.diff (83 bytes) - added by dasil003 on 09/14/07 23:36:37.
(patch moved to ticket 9560)

Change History

09/07/07 00:55:40 changed by dasil003

  • attachment dont_add_eager_loading_joins_to_pre_query.diff added.

09/07/07 02:35:19 changed by protocool

Hey,

you said: "It does this for the purpose of allowing :order or :conditions clauses to affect the selection through the eager-loaded tables."

I don't think that's why it does it (tho I could be wrong).

I believe it is to correctly support :limit and :offset when you have :include and your :conditions refer to the included associations.

Project.find(:all, :include => :tasks, :conditions => ['tasks.due_date > ?', Date.today], :limit => 20)

Removing the eager associations from the prequery would break this - according to my understanding anyhow.

-1

09/07/07 03:16:54 changed by nzkoz

@protocool: yeah, I believe you're right there.

Lets continue this discussion on the mailing list.

09/07/07 04:48:57 changed by dasil003

@protocool: You are correct, and what I meant by my original statement is exactly what you said.

My argument is that such a condition is easily supported through manual :joins in a much more flexible way, and is therefore not worth the severe DB performance hit that it inflicts under certain conditions.

09/07/07 07:29:34 changed by julik

  • cc set to me@julik.nl.

Some people put quite some effort into making :include work with :offsets. It's a feature with performance implications, that's all. Maybe that should be noted in the docs.

-1

09/07/07 07:37:09 changed by dasil003

My issue is that it's impossible to work around to solve my problem (which stopped development when the DB lag caused the browser to actually time out), whereas the feature that it provides is extremely easy to work around the other way.

But I can also solve my problem by making it smarter at the cost of some additional complexity of checking which tables are actually being referenced... if that's the consensus of what needs to be done then so be it.

09/07/07 07:55:45 changed by julik

Well that propbably means that you must hit the :joins for your special case (huge data corpus + offsets) and that the standard way of doing that through includes does not cut it for you :-)

09/07/07 08:34:34 changed by dasil003

I'm not sure I understand your comment, but if you're suggesting that I can work around the limitations of :include w/ :limit by doing manual :joins, that is not the case, because manual :joins don't give you back the joined models (just read-only attributes).

My thesis is simple:

The current implementation to support conditions on eagerly-joined table columns only provides a small syntactical benefit over manual :joins while it creates an intractable performance problem by joining tables that aren't even used most of the time.

The reason I propose removing it is because I can't think of a scenario where such a condition could not trivially be supported with a manual join (can you provide an example?). However if this feature is indispensable to people than I propose that it should be finer-grained. Indiscriminately joining all the :includes in the mere presence of another table is a performance graveyard.

09/07/07 09:16:18 changed by dasil003

Okay, so I whipped up a quick patch here that builds a list of referenced tables and checks that before adding each individual include. All tests pass (for mysql).

Let me know your feedback on this one. If it's the preferred approach I will open a new ticket for it and we can close this one. I'm not convinced this is necessarily better than the first solution, but it's food for thought.

09/13/07 01:38:37 changed by dasil003

I've written up a lengthy explanation of my reasoning. If you are interested in this issue please read it and leave your comments on the mailing list:

http://darwinweb.net/article/Optimizing_And_Simplifying_Limited_Eager_Loading_In_Activerecord

09/14/07 23:36:37 changed by dasil003

  • attachment only_include_referenced_tables.diff added.

(patch moved to ticket 9560)

09/14/07 23:38:25 changed by dasil003

Per Pratik's suggestion I have moved the second (non-destructive) patch to a new ticket http://dev.rubyonrails.org/ticket/9560

I continue to advocate for the original patch, but if the community ultimately rejects it then this ticket should be closed and the patch applied from 9560.

09/25/07 23:40:50 changed by dasil003

  • status changed from new to closed.
  • resolution set to wontfix.

Since I'm not getting any vocal support for this I'm closing it and moving onto the other patch. I figure this idea could be revisited in a future refactoring of eager-loading anyway. In the meantime, I'm going to attempt to get the other patch committed.