Ruby on Rails | Screencasts | Download | Documentation | Weblog | Community | Source

Ticket #3979 (closed enhancement: duplicate)

Opened 3 years ago

Last modified 3 years ago

[PATCH] lazy connection verification

Reported by: skaes Assigned to: David
Priority: high Milestone: 1.1
Component: ActiveRecord Version: 1.0.0
Severity: normal Keywords:
Cc: schoenm@earthlink.net, tom@popdog.net

Description

Shortly before stable 1.0 was released, a patch was applied to trunk to avoid connection timeout errors. Unfortunately, the implementation had severe performance implications. This patch resurrects performance charteristics comparable to the point before the patch went in.

Instead of clearing the connection cache after each request, we leave it as is. We add a @last_verification variable to the connection and a @@connection_cache_timeout variable to class ActiveRecord::Base. Before a request is processed, verify_connection_cache! gets called on AR::Base. connection.active? is called only when @@connection_cache_timeout seconds have passed since the last verification. This greatly reduces the number of DB calls.

Secondly, our patch reduces the number of entries in the connection cache to the number of defined connections (instead of having one cache entry for each AR class used during a certain number of requests).

Performance impact:

perf data file 1: 02-27.all.trunk
  requests=1000, options=-bm=all -mysql_session -fast_routes -fast_readers -lib=trunk

perf data file 2: 02-27.all.trunk2
  requests=1000, options=-bm=all -mysql_session -fast_routes -fast_readers -lib=trunk2

page   c1 real   c2 real  c1 r/s  c2 r/s  c1 ms/r  c2 ms/r  c1/c2
 1:    1.42228   1.33638   703.1   748.3     1.42     1.34   1.06
 2:    1.67259   1.58068   597.9   632.6     1.67     1.58   1.06
 3:    1.75018   1.65918   571.4   602.7     1.75     1.66   1.05
 4:    1.73553   1.64860   576.2   606.6     1.74     1.65   1.05
 5:    4.15005   3.89853   241.0   256.5     4.15     3.90   1.06
 6:    5.03691   4.83803   198.5   206.7     5.04     4.84   1.04
 7:    5.14496   4.94910   194.4   202.1     5.14     4.95   1.04
 8:    4.99811   4.86160   200.1   205.7     5.00     4.86   1.03

urls:
 1: /empty/index
 2: /welcome/index
 3: /rezept/index
 4: /rezept/myknzlpzl
 5: /rezept/show/713
 6: /rezept/cat/Hauptspeise
 7: /rezept/cat/Hauptspeise?page=5
 8: /rezept/letter/G

Attachments

lazy_connection_verification.latest.patch (8.5 kB) - added by skaes on 02/27/06 12:22:01.
connection_cache_bug.patch (3.5 kB) - added by skaes on 03/01/06 09:56:21.
bug fix for setting allow_concurrency
connection_cache_bug.2.patch (3.3 kB) - added by skaes on 03/01/06 11:44:57.
fix the fix

Change History

02/27/06 12:22:01 changed by skaes

  • attachment lazy_connection_verification.latest.patch added.

02/27/06 19:03:06 changed by bitsweat

Conflicts with #3591.

Timeouts allow just the sort of breakage that connection verification is meant to catch. Custom cache keys are an obscure and unnecessary. I'd rather not introduce these features.

What about connections needs representing:

  • set of connection specs
  • set of active connections (each with a needs_verification? flag)

Since they're one-to-one, we may use a single

  • map of connection spec => active connection

And to bind connections to classes:

  • map of class name => connection spec. Missing classes follow superclass until spec found.

This is a sparse map, so we need to cache missing lookups. This cache needs to be cleared after every request in dev env but never in production env.

  • cache: map of class name => final connection spec

Finally, we cache the active connection for each class. This cache is cleared after every request in both development and production.

  • cache: map class name => active connection

This final cache is where connections are naturally verified. Pseudocode:

# Cached connection for class.
unless conn = @cached_connection_for_class[class_name]

  # Cached spec for class.
  spec = @cached_connection_spec_for_class[class_name] ||= lookup_superclass_spec(self)

  # Activate, verify, and cache connection for spec.
  conn = @active_connections[spec] ||= establish_connection(spec)
  conn.verify! if conn.needs_verification?

  # Cache connection for class.
  @cached_connection_for_class[class_name] = conn
end

02/27/06 20:02:34 changed by skaes

"Conflicts with #3591".

Not true. I wonder why you think it would conflict? If you don't believe me, apply the patch and run the script provided in #3591.

"Timeouts allow just the sort of breakage that connection verification is meant to catch".

I don't understand. The code that went into stable, was meant to catch dropped connections for Mysql. The above patch addresses exactly this problem. As long as connection_cache_timeout is smaller the the value specified for the database, you will not see a dropped connection. If you are really paranoid, set connection_cache_timeout to 0. You will then get the old behaviour, only faster.

"Custom cache keys are obscure and unnecessary".

There are no custom cache keys. connection_cache_key and clear_connection_cache_key can be made private. Noone needs to modfiy these functions.

About obscurity: connection_cache_key returns the name of the class on the inheritance chain where establish_connection was called. This value gets stored on the AR class in question upon first access to the connection function and gets cleared when establish_connection or connection= are called on the class or any of its superclasses. What exactly is obscure about that?

"Finally, we cache the active connection for each class. This cache is cleared after every request in both development and production."

It is exactly this clearing of the cache which causes the serious performance drop over release 0.14.1. If you'd implement your proposal you will end up with the exact same performance as stable or current trunk.

I do agree though, that the current scheme (as in trunk/stable) is a bit obscure. Additionally, I think connection management should be separated from AR::Base. I vote for a ConnectionManager class with a defined interface so that custom connection managers can be implemented easily.

However, I think that 1 week before a planned release is not a good time to add such a new interface. I suggest to postpone this until after 1.1 has been released.

OTH, my patch is only a minor change and if we provide a connection_timout value of 0 per default, then we'd have exactly the old behaviour and I can blog about it too ;-)

02/27/06 20:12:22 changed by skaes

I forgot to mention: if you take down the database, and start it again, the app will recover after one request.

02/27/06 20:33:38 changed by bitsweat

  1. Both patches affect the same set of methods, hence the potential conflict. If not, great.
  2. Using a timeout to know whether to verify a new connection means you'll get dead connections until the timeout expires. Correct?
  3. Ok, I see that it's equivalent to the cache of class_name => connection_spec, but I think it's harder to understand
  4. Incorrect. The class_name => connection cache is cleared now which means N verifies to refill it. Instead we clear the spec => connection cache which means 1 verify per connection.
  5. Refactoring to a connection manager would be great, though I agree it is too late for 1.1

02/27/06 21:10:44 changed by bitsweat

Stefan & I talked about the patch and agreed that it's worth applying now and cleaning up the confusion that is connection management with a later refactoring.

02/27/06 21:37:35 changed by bitsweat

  • status changed from new to closed.
  • resolution set to fixed.

(In [3693]) Speed up class -> connection caching and stale connection verification. Closes #3979.

02/27/06 22:01:13 changed by bitsweat

References #428.

02/28/06 00:14:26 changed by bitsweat

  • status changed from closed to reopened.
  • resolution deleted.

02/28/06 07:47:10 changed by mschoen

  • cc set to schoenm@earthlink.net.

So edge is broken right now, right?

Also, don't quite understand Stefan's comment that "if you take down the database, and start it again, the app will recover after one request." If I've got 40 Rails processes running, it'll take at least 40 requests to fully recover, no?

I'm supportive of the patch, though I do think that a timeout of 0 makes sense as the default, as it provides for more immediate app recovery.

02/28/06 08:38:57 changed by skaes

"So edge is broken right now, right?"

I don't think so, because I can't reproduce the problem with trunk, neither using webrick nor apache.

"If I've got 40 Rails processes running, it'll take at least 40 requests to fully recover, no?"

Yes. But if you take down the DB manually, or the DB crashes, then you have other problems than 40+ requests failing :-) While DB is down, all your requests will fail. 40 more won't even be noticed by the end user.

02/28/06 15:03:05 changed by mschoen

I'm running trunk on Debian/Lighttpd/SCGI/Oracle. Check the mailing list, I think Tom Ward reports the same issue using SQL Server. I spent a few minutes looking, couldn't figure it out, I'll take another look again later today.

On the app recovery, yes, I see your point. The problem is that this makes app recovery less predictable. The 40 Rails processes aren't balanced evenly, so after a db reboot we've got little ticking time bombs -- well after the db recovers users may experience broken requests. Sure, 40 is a small # on the scheme of things, but still...

02/28/06 15:21:13 changed by tomafro

  • cc changed from schoenm@earthlink.net to schoenm@earthlink.net, tom@popdog.net.

Here's an email I just sent to the list, preserving here for posterity or something . . .

I've narrowed it down to the allow_concurrency flag.

When starting webrick in development mode, on the first call to ActiveRecord::Base.connection_cache, allow_concurrency is true, whilst for subsequent calls it is false. The first call (with allow_concurrency = true) creates a hash {<thread_id> => {}} but just returns an empty hash. Subsequent calls (with allow_concurrency = false) returns the entire hash (including the thread_id bit) which verify_connections! can't cope with.

I'm not sure why the flag is true when webrick boots, so that needs to be investigated.

A second line of attack is the allows_concurrency flag itself. I can't think when you'd ever want it to be false. I can't see the harm in having concurrency support, even in single-threaded applications. All turning it off allows you, is to share the same connection across multiple threads, but as provided, this wouldn't be thread-safe in any case. Assuming allow_concurrency is always true fixes this issue, but there may be applications I haven't thought of that will be adversely affected. Thoughts?

02/28/06 16:11:00 changed by skaes

The problem is caused by the line

ActiveRecord::Base.threaded_connections = false

in webrick_server.rb

It shouldn' be there, IMO. For a quick fix, comment out this line. Then your app should work again.

The flag is used for performance optimizations inside AR. It currently can't be changed after establish_connection has been called the first time.

02/28/06 16:20:16 changed by skaes

Could you please verify that the quick fix works for you?

02/28/06 17:12:02 changed by anonymous

The reason allow_concurrency flips from true to false is that the initializer is run before webrick_server.rb is required. This, combined with the fact establish_connection now accesses connection_cache cause the inconsistent cache format. webrick_server.rb is too late to turn off concurrency. The quick solution suggested by Stefan works for me, but the problem seems a little more involved. It seems more correct to remove the option for concurrency in ActiveRecord, and force it to always be on. Maybe this should be done post 1.1 though . .

02/28/06 17:50:56 changed by skaes

Well, I think webrick should not set AR::allow_concurreny. It is neither necessary nor is it correct. The only valid place for setting allow_concurrency is environment.rb.

And no, I don't want to have that flag removed. It does have performance impacts. AR::connection is called very often during a single request.

02/28/06 20:58:02 changed by mschoen

So what still confuses me is that I AM setting this within environment.rb, because Zed noted that w/o it there were some odd issues running under SCGI. So if I'm using the flag properly, why am I still running into this problem?

The line in my environment.rb is:

  # Temporary patch to prevent SCGI db connection leak
  config.active_record.allow_concurrency = false

I've verified that if I take this line out, it works again. But I don't understand the initial need for the line, and if taking it out will leave me with another issue.

03/01/06 09:56:21 changed by skaes

  • attachment connection_cache_bug.patch added.

bug fix for setting allow_concurrency

03/01/06 10:01:50 changed by skaes

Could the people affcted by the bug try the bug fix patch?

BTW, I've measured performance of threaded and unthreaded against the latest patch. There's a performance difference of up to 3% for my app.

perf data file 1: 03-01.all.threaded_connection_cache
  requests=1000, options=-bm=all -mysql_session -fast_routes -fast_readers -OT -lib=trunkc -threads

perf data file 2: 03-01.all.single_threaded_connection_cache
  requests=1000, options=-bm=all -mysql_session -fast_routes -fast_readers -OT -lib=trunkc

page   c1 real   c2 real  c1 r/s  c2 r/s  c1 ms/r  c2 ms/r  c1/c2
 1:    1.33809   1.32052   747.3   757.3     1.34     1.32   1.01
 2:    1.61925   1.59057   617.6   628.7     1.62     1.59   1.02
 3:    1.67651   1.65851   596.5   603.0     1.68     1.66   1.01
 4:    1.66721   1.65083   599.8   605.8     1.67     1.65   1.01
 5:    3.86071   3.75665   259.0   266.2     3.86     3.76   1.03
 6:    4.66126   4.60252   214.5   217.3     4.66     4.60   1.01
 7:    4.76640   4.70728   209.8   212.4     4.77     4.71   1.01
 8:    4.63965   4.59511   215.5   217.6     4.64     4.60   1.01

urls:
 1: /empty/index
 2: /welcome/index
 3: /rezept/index
 4: /rezept/myknzlpzl
 5: /rezept/show/713
 6: /rezept/cat/Hauptspeise
 7: /rezept/cat/Hauptspeise?page=5
 8: /rezept/letter/G

I definitely don't want the flag to go away.

03/01/06 11:44:57 changed by skaes

  • attachment connection_cache_bug.2.patch added.

fix the fix

03/01/06 20:50:33 changed by mschoen

So I've heard from Zed why this flag is needed (or at least the value must be effectively false if the flag were removed) when running under SCGI.

SCGI runs multiple threads to handle everything OTHER than actually running the Rails dispatcher (since Rails isn't thread-safe). W/ this value set to true, each of those threads effectively gets it's own db connection, using many more db connections than make sense, given that each SCGI process will only ever use 1 connection at a time.

W/ Stefan's patch applied, it seems to work fine. So I'm a fan of committing this patch.

03/02/06 10:28:51 changed by skaes

Superseded b #4044. New patch at #4044.

03/03/06 17:48:44 changed by anonymous

  • status changed from reopened to closed.
  • resolution set to duplicate.

07/24/06 03:13:33 changed by anonymous

blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2 blog2