Ruby on Rails | Screencasts | Download | Documentation | Weblog | Community | Source

Ticket #10061 (closed defect: fixed)

Opened 7 months ago

Last modified 6 months ago

[PATCH] Address shortcomings of changeset [8054]

Reported by: protocool Assigned to: core
Priority: normal Milestone: 2.0
Component: ActiveRecord Version: edge
Severity: normal Keywords:
Cc: hasmanyjosh, tarmo

Description

This references changeset [8054] and tickets #9923 and #10012.

First of all, I want to say that I *love* having :joins => :association as a shorthand for an inner join. It's a great idea.

Unfortunately the patch that was applied in [8054] treats :joins => :associations as a specialization of find_with_associations rather than as a specialization of add_joins!.

As a result, :include is now somewhat broken and, well, so is :joins.

1 - conflicts with :include and :joins

:include has always meant an outer join. However, now if you specify a non-string :joins option, your :includes no longer eager load and are no longer outer joins. As in:

Author.find(:all, :include => :should_be_outer, :joins => :should_be_inner)

is functionally equivalent to:

Author.find(:all, :joins => [:should_be_inner, :should_be_outer])

2 - non-string :joins are extremely inefficient

If you examine the sql generated by the following:

Author.find(:all, :joins => :contributions)

You can see that all columns from authors and all columns from contributions are being selected, even though none of the contributions columns will ever be used by an instantiated record.

On top of that, it's impossible to supply a :select argument (because it's a specialization of find_with_associations) which pretty much guarantees that your resulting dataset will have more rows than you need.

A different approach

Rather than treat :joins => :association as a specialization of find_with_associations, it should be a specializtion of add_joins! - that way the semantics of both :include and :joins is left intact.

Specifying Author.find(:all, :joins => :contributions) would behave exactly as if you had specified Author.find(:all, :joins => 'inner join contributions on contributions.author_id = authors.id').

When examining my patch please consider that:

  • it undoes many of the changes from [8054] so it looks really chatty and complex. If you apply the patch and then do "svn diff -r 8053 activerecord/lib/active_record/base.rb activerecord/lib/active_record/associations.rb activerecord/lib/active_record/calculations.rb " you'll see that it's actually a very straight-forward change.
  • I have not updated the ar_joins_test.rb file because many of the tests in there operate on the assumption that :join => :something should return the same number of rows that :include => :something would have done. Seeing that I fundamentally disagree with that feature I've left those tests unchanged, pending feedback from others.

Attachments

correct_ar_joins.diff (16.1 kB) - added by protocool on 11/02/07 10:37:28.
correct_joins_with_tests.diff (27.4 kB) - added by protocool on 11/07/07 03:55:06.
patch with reworked tests

Change History

11/02/07 10:37:28 changed by protocool

  • attachment correct_ar_joins.diff added.

11/02/07 12:19:45 changed by RubyRedRick

-1

Although this suggestion is well-intentioned, it misses the point of #10012

Let me take this point by point

1 - conflicts with :include and :joins

IMHO this is a failure of documentation rather than a real issue. I was already planning to provide a patch to document that :joins with a non-string value was incompatible with :include in the same call, since Josh Susser pointed out that this should be done.

Further, I'm not sure that :joins and :includes have ever been compatible even without [8054]. I can't seem to find ANY tests in the whole ActiveRecord test suite which invokes find with both :include and :joins

No real-world use-case is provided with this patch, nor is a test case to verify the claim of compatibility of :include with :joins. On the other hand #10012 was developed by test-driven development.

I also don't believe that it should be a goal of AR to be able to generate arbitrary SQL without recourse to programmer coded sql fragments. The motivation for this proposal seems to be a desire to combine inner with outer joins. I'd say the right way to do that lies with other options.

2 - non-string :joins are extremely inefficient

This patch claims that

Author.find(:all, :joins => :contributions)

is inefficient because "all columns from authors and all columns from contributions are being selected, even though none of the contributions columns will ever be used by an instantiated record."

The purpose for selecting them is to support certain use-cases involving selection. Josh Susser and I have been looking at streamlining the select portion of this kind of query, to address this specific concern, and I hope that we will find time to work on this together while we are at Rubyconf this weekend.

"On top of that, it's impossible to supply a :select argument (because it's a specialization of find_with_associations) which pretty much guarantees that your resulting dataset will have more rows than you need."

Don't use :select, use :conditions. When you mix sql fragment options with options like :include and :joins, which generate sql based on what's been GENERATED elsewhere, you wade out into untested territory. #10012 was developed specifically to let :joins and :conditions work well together, like :include and :conditions do.ΒΆ

Tests -

Running the tests and seeing how this breaks ar_joins_test.rb is instructive:

1 - This patch breaks the use of the new :joins syntax with AR::Base.count

2 - With this patch, using the new :joins syntax results in read only results.

3 - The testing that counts of the returned results are critical to the purpose of #10012, a major (probably the main) point of #10012 is to get a unique set of results. The reason the count tests fail is because duplicate ar objects can be included in the result, the path through join_with_associations allows avoiding this.

4 - The failure introduced in test_nested_scoped_find_ar_join(ArJoinsTest) is interesting, it seems to be finding a Developer with the name of "Active Controller" which, based on the fixture data would seem to be a Project rather than a Developer!

So to summarize this patch seems to throw the baby #10012 out with the bathwater.

While there are probably a few rough edges to #10012, I think that it's better to address those rather than backing it out.

11/02/07 14:35:31 changed by hasmanyjosh

  • cc set to hasmanyjosh.

11/02/07 18:30:40 changed by protocool

Rick,

first of all, I just want to say that it's commendable how graciously you've responded to my alteration of your patch. I fully recognize that I'm taking a new feature you've added and criticized it in the context of existing behavior, which I hope isn't too frustrating.

Next, I need to point out that I know it breaks your tests. IMHO the reason for the breakages are that (with respect) you're using ActiveRecord wrong.

The :joins option has very specific meaning. You are telling ActiveRecord "Yes, I know what I'm doing and I understand the implications".

The :include option has a different, but again very specific meaning. You are telling ActiveRecord "I want these top-level objects, can you please join these associations and load the resulting object graph with duplicates removed". ActiveRecord needs you to relinquish some control so that it can correctly construct your object graph.

What you have done is taken a rather excellent feature - namely turning :joins => {:of => :something} into inner joins - and confounded it with the "object graph with duplicates removed" behavior.

Addressing your responses:

1 - conflicts with :include and :joins

It's not a documentation issue. Your change should not alter the meaning of :include depending on whether :joins is a String or not. Given the behavior you have implemented I see :include and :joins as incompatible in the same call - to the point where :include with a non-string :joins should raise an exception.

Note - :include and :joins *are* compatible. Just because "you don't see tests" doesn't mean the behavior of mixing :include and :joins (prior to your patch) isn't predictable, desireable and expected.

In fact, mixing :joins and :include may be exactly what you have been looking for all along, you just didn't know it.

Please remember - specifying :include tells ActiveRecord to "load the object graph, removing duplicates". Check-out a stock version of rails that doesn't have either your patch or my patch (1.2.5 for example) and try the following sequence:

Author.find(:all, 
  :joins => 'inner join contributions on contributions.author_id = authors.id',
  :conditions => ['contributions.date < ?', 1.week.ago.to_s(:db)]
)

That will, as is expected, return multiple, read-only authors with additional 'contributions' attributes. There will be duplicate authors. This is expected and correct behavior.

But if you tell ActiveRecord an explicit :select to use, that's a cue that again you "know what you're doing".

Author.find(:all, :select => '*', 
  :joins => 'inner join contributions on contributions.author_id = authors.id',
  :conditions => ['contributions.date < ?', 1.week.ago.to_s(:db)]
)

The :select option above still results in multiple (duplicate) author objects, with additional 'contributions' attributes. However this time, ActiveRecord figures you know what you're doing, so it doesn't set :readonly to true.

Author.find(:all, :select => 'authors.*', 
  :joins => 'inner join contributions on contributions.author_id = authors.id',
  :conditions => ['contributions.date < ?', 1.week.ago.to_s(:db)]
)

In the above example you only get columns from the authors table (no additional contributions attributes) but there are still duplicates.

You're using :joins and you know what you're doing so this is what you do if you want to remove the duplicates:

Author.find(:all, :select => 'distinct authors.*', 
  :joins => 'inner join contributions on contributions.author_id = authors.id',
  :conditions => ['contributions.date < ?', 1.week.ago.to_s(:db)]
)

Grab the resulting SQL for the above query and run it through cocoamysql or similar and take note of the number of rows returned by mysql. Write that number down before moving on to the next example - it's part of the basis of my 'wasteful' assertion.

So, the example above is how you get back a non-duplicate "list of authors with contributions older than 1 week old'. What if you want that list to :include all 'payments' made to the author?

First of all, you have to remember that by using :include you are relinquishing some control so that ActiveRecord can "load the resulting object graph with duplicates removed". As such, your :select is ignored.

Author.find(:all, :include => :payments,
  :joins => 'inner join contributions on contributions.author_id = authors.id',
  :conditions => ['contributions.date < ?', 1.week.ago.to_s(:db)]
)

You should note that the list of authors returned should be the same as the list of authors returned in the previous example (where you wrote down the # of rows returned by mysql). However, this time, the :payments association is already populated with data.

This is how mixing :include and :joins is intended to work. I can't stress this enough: just because you weren't aware of it and haven't found test that prove it, doesn't change the fact that the behavior is expressed in the code.

Now, consider this example - we don't care about :payments any more, we're back to just wanting authors with a contribution that has a date more than a week old:

Author.find(:all, :include => [],
  :joins => 'inner join contributions on contributions.author_id = authors.id',
  :conditions => ['contributions.date < ?', 1.week.ago.to_s(:db)]
)

There are two very important things to note here.

  • Using :include with anything non-nil triggers the "load the object graph" behavior - even if you pass an empty collection.
  • The number of rows returned by mysql is dependent on how many contributions match your conditions. It is almost guaranteed to be more rows than the number you wrote down earlier - try it out in cocoamysql for yourself.

So, to re-visit my 'wasteful' assertion:

Your patch is a lot like my :include => [] trick. It triggers "load the object graph" behavior.

The problem is my :include => [] trick is wasteful in that it selects more rows than it has to (as compared to :select => 'distinct authors.*') - and those rows must be transmitted over the wire and subsequently processed by ActiveRecord.

But even that isn't as bad as your patch. At least my :include => [] trick only selects columns from the authors table. In your patch, it selects all columns from your :joins, even though it's going to discard all of that column data.

1a - test-driven-development and your broken tests etc

I know this breaks your tests. The patch was clearly marked as soliciting feedback about whether my 'worldview' that :include means 'include' and :joins means 'join' was consistent with the majority worldview. I'm a busy guy, I don't have time to waste on something that nobody else wants.

I know this may be interpreted as rude but there's no other way to say it: just because you wrote tests (maybe even test-first) does not change my belief that your feature is incorrect and inconsistent. In my opinion your tests are wrong in that they expect behaviour that should not be present.

2 - inefficiency of your patch

I've already addressed part of the inefficiency issues with your patch by comparing the number of rows returned by :select => 'distinct authors.*' vs include => [] when combined with :joins.

However, that's not even the worst of it. Ignoring the fact that your patch does not allow :select, it is completely pointless to select the column data from the contributions table in the example below:

Author.find(:all, :joins => :contributions)

Take a look at the resulting sql for that using your patch. Why bother selecting the contributions columns?

With my patch, these two statements return identical results and are consistent with existing behavior wrt :joins.

# implicit inner joins based on association metadata
Author.find(:all, :select => 'distinct authors.*', :joins => :contributions)

# explicit inner join
Author.find(:all, :select => 'distinct authors.*', 
  :joins => 'inner join contributions on contributions.author_id = authors.id'
)

Again, my patch is consistent with existing semantics of :joins while still giving you a really cool shortcut of specifying association names for inner joins - like I said, it's a great idea.

But suggesting that I should not use :select with :joins only serves to highlight the fact that you don't understand the point of :joins.

Using my patch, if specifying :select scares you, you can use the :include => [] trick, but it will be wasteful becuse mysql will return duplicated rows that must be discarded by ActiveRecord:

# returns fewer authors than rows returned by mysql
Author.find(:all, :include => [], :joins => :contributions)

# as a comparison, the number of authors returned
# below (with my patch) will be the numer of rows
# returned by mysql
Author.find(:all, :joins => :contributions)

I hope this clarifies things for you, and I *really* hope that I haven't offended you.

11/04/07 18:52:39 changed by lifofifo

+1

11/04/07 19:35:22 changed by RubyRedRick

Okay, I've recovered enough from sleep deprivation at RubyConf, and I get it.

So I'll take back my -1.

I would like to see some tests added, but...

11/06/07 07:21:17 changed by protocool

Okay, when I get a spare hour tomorrow I'll update the patch with tests.

11/06/07 16:25:41 changed by david

  • milestone changed from 2.x to 2.0.

11/07/07 03:55:06 changed by protocool

  • attachment correct_joins_with_tests.diff added.

patch with reworked tests

11/07/07 14:29:55 changed by tarmo

  • cc changed from hasmanyjosh to hasmanyjosh, tarmo.

+1

(btw. the patch adds trailing whitespace on line 1731 of activerecord/lib/active_record/associations.rb)

11/07/07 15:07:44 changed by david

  • status changed from new to closed.
  • resolution set to fixed.