Ticket #7400 (reopened defect)

Opened 1 year ago

Last modified 7 months ago

[PATCH] Make the highlight helper multibyte safe

Reported by: manfred Assigned to: core
Priority: normal Milestone: 2.x
Component: ActionPack Version: edge
Severity: normal Keywords: multibytebugs
Cc: me@julik.nl

Description

Julian Tarkhanov found out that the highlight helper isn't multibyte safe because the regular expression engine doesn't know how to do /i on multibyte characters.

Here is a patch to remedy that.

Attachments

make_highlight_helper_multibyte_safe.diff (2.0 kB) - added by manfred on 01/26/07 17:39:15.
make_highlight_helper_multibyte_safe.2.diff (1.9 kB) - added by manfred on 01/26/07 18:49:51.
make_highlight_helper_multibyte_safe.3.diff (2.2 kB) - added by manfred on 05/29/07 07:07:09.
make_highlight_helper_multibyte_safe.4.diff (2.2 kB) - added by manfred on 06/19/07 20:35:14.
Updated for revision 7064.
make_highlight_helper_multibyte_safe.5.patch (2.3 kB) - added by norbert on 07/15/07 02:24:35.
Tested against 7187, generated from trunk

Change History

01/26/07 17:39:15 changed by manfred

  • attachment make_highlight_helper_multibyte_safe.diff added.

01/26/07 18:49:51 changed by manfred

  • attachment make_highlight_helper_multibyte_safe.2.diff added.

05/28/07 23:06:56 changed by bitsweat

  • status changed from new to closed.
  • resolution set to incomplete.

The highlight helper takes multiple terms now, so this doesn't apply cleanly.

It also looks like it removes case-insensitivity for non-utf8 strings.

05/29/07 07:07:09 changed by manfred

  • attachment make_highlight_helper_multibyte_safe.3.diff added.

05/29/07 07:08:41 changed by manfred

  • status changed from closed to reopened.
  • resolution deleted.

Here's a new patch that should apply cleanly to trunk. I've also added a test to show that case-insensitivity for non-utf8 strings still works.

06/19/07 20:35:14 changed by manfred

  • attachment make_highlight_helper_multibyte_safe.4.diff added.

Updated for revision 7064.

07/15/07 02:24:35 changed by norbert

  • attachment make_highlight_helper_multibyte_safe.5.patch added.

Tested against 7187, generated from trunk

07/15/07 03:08:57 changed by norbert

Oh yeah, almost forgot, +1. Seems like a really good idea.

07/19/07 12:02:45 changed by alloy

Definitely a must!

+1

07/19/07 12:27:19 changed by manfred

Norbert just told me he had a discussion with Koz about the patch. Koz raised some concerns about the performance of calling u.chars.upcase and u.chars.downcase for every character.

Unfortunately I don't know a faster or better way to do it. The default regular expression engine in Ruby doesn't have a complete upcase / downcase table for every Unicode character. Note that the expensive part of the loop is only performed when the character consists of more than one byte.
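To make the per-character approach concrete, here is a minimal sketch of the idea, not the code from the attached patches. It assumes a modern Ruby where String#upcase and String#downcase handle Unicode, in place of the u.chars.upcase and u.chars.downcase calls mentioned above. The method name multibyte_safe_pattern is hypothetical.

```ruby
# Hypothetical helper, not taken from the attached patches.
# Builds a pattern that matches `phrase` regardless of case by expanding
# every multibyte character into an explicit alternation of its variants.
def multibyte_safe_pattern(phrase)
  source = phrase.each_char.map do |char|
    if char.bytesize > 1
      # Expensive branch: only taken for characters longer than one byte.
      variants = [char, char.upcase, char.downcase].uniq
      "(?:" + variants.map { |v| Regexp.escape(v) }.join("|") + ")"
    else
      # Single-byte characters are left to the engine's own /i handling.
      Regexp.escape(char)
    end
  end.join
  Regexp.new(source, Regexp::IGNORECASE)
end

pattern = multibyte_safe_pattern("żółw")
puts "matched" if pattern.match?("ŻÓŁW")
```

The cost Koz is worried about sits in the multibyte branch, which does two case conversions per character each time a pattern is built.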

Here's a short benchmark to show how slow it is. Each benchmark runs its operation 100 times, so a single call takes one hundredth of the time shown.

                                     user     system      total        real
Old - 20 ASCII characters     :  0.010000   0.000000   0.010000 (  0.000656)
Old - 20 Multibyte characters :  0.000000   0.000000   0.000000 (  0.000633)
Old - 40 ASCII characters     :  0.000000   0.000000   0.000000 (  0.000684)
Old - 40 Multibyte characters :  0.000000   0.000000   0.000000 (  0.000764)
Old - 100 ASCII characters    :  0.000000   0.000000   0.000000 (  0.000820)
Old - 100 Multibyte characters:  0.000000   0.000000   0.000000 (  0.000955)
New - 20 ASCII characters     :  0.010000   0.000000   0.010000 (  0.005869)
New - 20 Multibyte characters :  0.680000   0.030000   0.710000 (  0.731969)
New - 40 ASCII characters     :  0.010000   0.000000   0.010000 (  0.011309)
New - 40 Multibyte characters :  1.100000   0.050000   1.150000 (  1.159216)
New - 100 ASCII characters    :  0.020000   0.000000   0.020000 (  0.026393)
New - 100 Multibyte characters:  2.650000   0.110000   2.760000 (  2.792602)
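
The benchmark script itself is not attached to the ticket. As a rough sketch of how totals like these relate to per-call times, the following uses Ruby's standard Benchmark module. The highlight_stub method and the sample strings are hypothetical stand-ins, and the numbers it prints will not match the table above.

```ruby
require "benchmark"

# Hypothetical stand-in for the helper under test; the real helper and
# the original benchmark script are not shown in the ticket.
def highlight_stub(text, phrase)
  text.gsub(/(#{Regexp.escape(phrase)})/i, '<strong class="highlight">\1</strong>')
end

# Returns the wall-clock time in seconds for `runs` repeated calls.
# The default of 100 mirrors the repeat count stated in the ticket.
def time_runs(text, phrase, runs = 100)
  Benchmark.realtime { runs.times { highlight_stub(text, phrase) } }
end

[["20 ASCII characters", "a" * 20, "aaa"],
 ["20 Multibyte characters", "ä" * 20, "äää"]].each do |label, text, phrase|
  total = time_runs(text, phrase)
  # Dividing the total by the repeat count gives the cost of one call.
  puts format("%-24s: %.6f s total, %.8f s per call", label, total, total / 100)
end
```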

09/22/07 15:50:53 changed by julik

  • cc set to me@julik.nl.

Excuse me, but why do you need to casefold per char in this case? Additionally, _yes_, Unicode casefolding is slow. Live with it.

10/15/07 03:56:07 changed by bitsweat

Could we use separate codepaths depending on whether the highlight string is multibyte?
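
One way such a split could look is sketched below. This is only an illustration of the suggestion, not code from any attached patch. It assumes a modern Ruby that provides String#ascii_only? and Unicode-aware String#upcase and String#downcase, and the method name highlight_regexp is hypothetical.

```ruby
# Hypothetical sketch of the suggested split; not from any attached patch.
def highlight_regexp(phrase)
  if phrase.ascii_only?
    # Fast path: the engine's own /i is enough for pure ASCII phrases.
    Regexp.new(Regexp.escape(phrase), Regexp::IGNORECASE)
  else
    # Slow path: spell out the upper and lower case form of each character.
    source = phrase.each_char.map do |char|
      variants = [char.upcase, char.downcase].uniq
      "(?:" + variants.map { |v| Regexp.escape(v) }.join("|") + ")"
    end.join
    Regexp.new(source)
  end
end

puts highlight_regexp("rails").casefold? # ASCII phrase takes the /i path
puts highlight_regexp("café").match?("CAFÉ")
```

With this shape, ASCII-only phrases never pay for the per-character case conversion, so the slowdown shown in the benchmark would apply only to phrases that actually contain multibyte characters.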