Ruby on Rails | Screencasts | Download | Documentation | Weblog | Community | Source

Ticket #10011 (closed defect: fixed)

Opened 8 months ago

Last modified 8 months ago

[PPATCH] Optimize preloading record instantiation

Reported by: lifofifo Assigned to: core
Priority: normal Milestone: 2.x
Component: ActiveRecord Version: edge
Severity: normal Keywords:
Cc:

Description

This patch removes very costly collection.target.include?(association) per record check to optimize AR preloading.

Attachments

optimize_preloading_init.patch (2.6 kB) - added by lifofifo on 10/28/07 19:20:59.

Change History

10/28/07 18:39:49 changed by lifofifo

The discussion for this patch is also at core mailing list

10/28/07 19:20:59 changed by lifofifo

  • attachment optimize_preloading_init.patch added.

(follow-up: ↓ 3 ) 10/28/07 19:24:30 changed by RubyRedRick

I've just been crawling through this code myself while working on http://dev.rubyonrails.org/ticket/10012

I'm a little concerned about the lack of tests provided with this patch.

The example quoted on ruby-core:

   Data : 1 Person and 1000 items belonging to that person.
   Performance Script : http://pastie.caboo.se/111774

Seems to be too simple to drive out any problems. How does this work in contexts such as scoped finds, habtm associations, has_many: through: etc.?

I don't have time to write specific tests to see if I can break this, but I'm suspicious.

(in reply to: ↑ 2 ) 10/28/07 20:16:24 changed by lifofifo

Replying to RubyRedRick:

I don't have time to write specific tests to see if I can break this, but I'm suspicious.

Well, this is a simple fix in eager loaded associations initialization code. In any case, it shouldn't slow down anything. If you can provide more specific details, I can check it out.

(follow-up: ↓ 5 ) 10/28/07 20:41:56 changed by RubyRedRick

The concern isn't necessarily slowing things down, but introducing subtle bugs which change the behavior.

(in reply to: ↑ 4 ) 10/28/07 20:47:07 changed by lifofifo

Replying to RubyRedRick:

The concern isn't necessarily slowing things down, but introducing subtle bugs which change the behavior.

All the tests are passing. And this is a change of behavior at very lower level. So I wouldn't tend to be worry about that, unless there is something very specific ( in which case it'd be good to have tests for that anyways )

10/28/07 22:45:59 changed by RSL

  • summary changed from [PPATCH] Optimize preloading record instantiation to [PATCH] Optimize preloading record instantiation.

I'm in yr ppatch, fixin' yr typoes.

10/28/07 23:00:54 changed by lifofifo

  • summary changed from [PATCH] Optimize preloading record instantiation to [PPATCH] Optimize preloading record instantiation.

ppppppppperfromance patch ;-)

10/29/07 02:58:48 changed by bitsweat

Wouldn't hurt to comment this logic ;)

10/29/07 03:02:46 changed by bitsweat

  • status changed from new to closed.
  • resolution set to fixed.

(In [8051]) Associations: speedup duplicate record check. Closes #10011.

10/29/07 03:06:05 changed by RubyRedRick

  • status changed from closed to reopened.
  • resolution deleted.

On the parallel discussion of this patch on rails-core, On 10/28/07, Mark Reginald James wrote:

> 
> Pratik wrote:
> 
> > I've uploaded a patch to optimize eager loading, which changes the
> > code to spend less time in ruby -
> > http://dev.rubyonrails.org/ticket/10011
> >
> > The problem is because of following line in
> > associations.rb:#construct_associations :
> > collection.target.push(association) unless
> > collection.target.include?(association)
> >
> > This basically compares every processed object with the new object, to
> > see if it hasn't been instansiated already.
> 
> How does this compare to Fred Cheung's patch?
> 
> http://groups.google.com/group/rubyonrails-core/msg/b4da8cedcd73eaae

I took a look at the referenced discussion and, although it's a bit hard to read Fred's patch after it's been mangled by mail and google groups, I think I can summarize the difference based on his description. I tried to find an actual patch submission of Fred's idea, but if it's in Rails Trac I missed it.

Both Fred's and the current path deal with a pathological case in which a large number of joined records are associated with a record in an active record find with the :include option.

Under certain data patterns, the code in JoinDependency#construct_association which looks for duplications of rows in the large result produced by the left outer join generated by the :includes option, to prevent creating and adding duplicate active record objects to an association. The existing code looks for duplicates using include? against the array representing the association.

Fred's patch supplements the association array with a hash to make the duplicate detection quicker, or at least that's what his description of the patch implies.

The current patch (Pratik's) goes ahead and creates the duplicate AR objects, adds them to the association(s) and then when all rows in the outer join has been processed goes back and deletes the duplicates.

Apart from performance differences, to my eye, the supplementary hash approach seems closer to the original implementation, simply making the test faster without changing the core logic. From an instinctual level, it would seem that making the test faster would perform better than adding without testing and removing duplicates later, although I've seen stranger things in my checkered career.

My concerns about the current patch is due to the complexity of the code to remove the duplicates. It seems that this introduces opportunities for new bugs, and I don't see the tests included which drove developing this new code. The fact that this doesn't break the current set of AR tests is somewhat comforting, but it's no proof of a hidden dragon in the new code.

A large part of this concern probably comes from just having done another patch which deals with the same section of code. I wrote quite a few tests (and fixed quite a few bugs along the way) to convince myself that I hadn't broken anything.

From a performance point of view, I don't know how often the particular case comes up which these two patches appear to optimize actually comes up, and if either makes other cases, possibly more frequent ones, either better or worse.

Eager loading is certainly a two-edged sword. The :include option can generate fairly large sql query results to be processed, particularly since it generates an outer join meaning that rows from the primary table otherwise meeting any where clause will be contained even if they have no associated rows in joined tables.

My motivation for my recent patch http://dev.rubyonrails.org/ticket/10012 came from another use case, where you don't really want to eager load at all but want to get a unique set of candidate ActiveRecord objects based on conditions requiring joining associated tables. The :include option handles both adding the joins, and eliminating duplicates from the result but since it eager loads, you can easily end up instantiating a very large portion of all of the rows in those tables. My patch doesn't change the behavior of eager loading with :include, but adds the ability to specify the :joins option in a similar manner (as an added alternative to the existing string value) to values given for :include. If you use the new :joins value syntax, it automatically generates the right INNER joins for use in the conditions option, detects and avoids duplicates in the result set, and doesn't eagerly instantiate associated objects.

10/29/07 03:10:56 changed by RubyRedRick

Just to be clear.

I don't see this patch as conflicting http://dev.rubyonrails.org/ticket/10012 they are separate and are probably (or can easily be made) compatible. I am concerned that 10011 needs further testing.

10/29/07 21:59:26 changed by lifofifo

  • status changed from reopened to closed.
  • resolution set to fixed.

Lets jump the bridge when we get there ( failing tests/patch/etc. )