Ticket #5105 (closed defect: fixed)

Bug contains 1 commit(s) | SVN Diffs for #5105

 

Opened 3 years ago

Last modified 1 month ago

Case Insensitive Compare is slower than it should be

Reported by: mark.davis(at)icu-project.org Assigned to: claireho
Priority: trivial Milestone: 4.0
Component: strings Version:
Keywords: strings Cc:
Load: Xref: 5072 5062 3395
Java Version: Operating System: all
Project (C/J): ICU4C,ICU4J and ICU4JNI Weeks: 0.1
Review: weiv

Description

In ICU4J, the case insensitive compare is slower than it should be.

In UTF16.StringComparator, it calls:

    if (m_ignoreCase_) {
        return compareCaseInsensitive(str1, str2);
    }

which calls

    return NormalizerImpl.cmpEquivFold(s1, s2,
            m_foldCase_ | m_codePointCompare_ | Normalizer.COMPARE_IGNORE_CASE);

which calls

    public static int cmpEquivFold(String s1, String s2, int options) {
        return cmpEquivFold(s1.toCharArray(), 0, s1.length(),
                            s2.toCharArray(), 0, s2.length(), options);
    }

// note the often unnecessary extraction of a char array

which calls:

    // internal function
    public static int cmpEquivFold(char[] s1, int s1Start, int s1Limit,
                                   char[] s2, int s2Start, int s2Limit, int options) {

        // current-level start/limit - s1/s2 as current
        int start1, start2, limit1, limit2;
        char[] cSource1, cSource2;

        cSource1 = s1;
        cSource2 = s2;

        // decomposition variables
        int length;

        // stacks of previous-level start/current/limit
        CmpEquivLevel[] stack1 = new CmpEquivLevel[]{
            new CmpEquivLevel(), new CmpEquivLevel()
        };
        CmpEquivLevel[] stack2 = new CmpEquivLevel[]{
            new CmpEquivLevel(), new CmpEquivLevel()
        };

        // decomposition buffers for Hangul
        char[] decomp1 = new char[8];
        char[] decomp2 = new char[8];

        // case folding buffers, only use current-level start/limit
        char[] fold1 = new char[32];
        char[] fold2 = new char[32];

...

All of this setup is fairly expensive, and most of the time it is not needed (e.g. comparing "mark" to "Mark" or to "fred").

Then, internally, it calls foldCase, which converts each code point back into temporary strings, several of them per call:

    private static int foldCase(int c, char[] dest, int destStart, int destLimit,
                                int options) {
        String src = UTF16.valueOf(c);
        String foldedStr = UCharacter.foldCase(src, options);
        char[] foldedC = foldedStr.toCharArray();

...
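To illustrate the point about temporary strings, here is a minimal sketch of folding one code point straight into a destination buffer without allocating anything. It is not the ICU4J implementation. The class name FoldIntoBufferSketch, the method foldInto, and its return convention (number of chars written, or -1 if the buffer is too small) are hypothetical. It uses only JDK calls and approximates simple case folding with Character.toLowerCase(Character.toUpperCase(c)), so full foldings that expand to several characters (for example U+00DF to "ss") are not covered.

```java
// Hypothetical sketch, not ICU4J code: fold a single code point into a
// caller-supplied buffer without creating any String or char[] temporaries.
final class FoldIntoBufferSketch {

    // Assumption: approximates *simple* case folding with JDK case mappings.
    static int simpleFold(int c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Writes the folded form of c into dest[destStart..destLimit).
    // Returns the number of chars written (1 or 2), or -1 if there is no room.
    static int foldInto(int c, char[] dest, int destStart, int destLimit) {
        int folded = simpleFold(c);
        int needed = Character.charCount(folded);
        if (destLimit - destStart < needed) {
            return -1;
        }
        // Character.toChars encodes the code point as UTF-16 in place.
        return Character.toChars(folded, dest, destStart);
    }
}
```

The same buffer can be reused for every code point in a comparison loop, which is the allocation pattern the existing fold1/fold2 buffers were presumably meant to allow.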

Change History

12/31/69 18:23:41 changed by auditor

  • Fri Mar 10 13:31:16 2006 grhoten changed notes2: xref: "" to "5072 5062 3395",
  • Thu Mar 16 09:36:49 2006 grhoten changed notes2: assign: "" to "mark", priority: "" to "small", weeks: "" to "0.1",
  • Thu Mar 16 09:36:49 2006 grhoten moved from incoming to strings

08/01/07 17:39:58 changed by mark

  • load changed.
  • status changed from new to assigned.
  • java changed.
  • revw changed.

09/26/07 16:12:33 changed by mark

  • milestone changed from UNSCH to 4.0.

03/21/08 10:24:17 changed by mark

  • owner changed from mark to claireho.
  • status changed from assigned to new.

06/17/08 11:37:13 changed by claireho

  • revw set to weiv.

1. Verified the ICU4C code: UChar* is used in all APIs, and UChar32 is used in case folding.
2. Added a new private Java function that performs a simple binary code point comparison, then case folding, directly on the String arguments. The new function falls back to a cmpEquivFold() call when it first hits a surrogate character whose binary value differs between s1 and s2.
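The approach in item 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed change. The class FastCompareSketch and its methods are hypothetical names. slowCompare stands in for NormalizerImpl.cmpEquivFold(), and case folding is approximated with JDK simple case mappings rather than ICU case folding data.

```java
// Hypothetical sketch, not the committed ICU4J code.
final class FastCompareSketch {

    // Assumption: JDK simple case mapping as a stand-in for ICU case folding.
    static int simpleFold(int c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Stand-in for the general (expensive) routine; walks by code point.
    static int slowCompare(String s1, String s2) {
        int i1 = 0, i2 = 0;
        while (i1 < s1.length() && i2 < s2.length()) {
            int cp1 = s1.codePointAt(i1);
            int cp2 = s2.codePointAt(i2);
            int f1 = simpleFold(cp1);
            int f2 = simpleFold(cp2);
            if (f1 != f2) {
                return Integer.compare(f1, f2);
            }
            i1 += Character.charCount(cp1);
            i2 += Character.charCount(cp2);
        }
        // One string is a prefix of the other: the one with chars left is greater.
        return Integer.compare(s1.length() - i1, s2.length() - i2);
    }

    // Fast path: no char[] extraction, no buffers, no temporary strings.
    static int compareIgnoreCase(String s1, String s2) {
        int n = Math.min(s1.length(), s2.length());
        for (int i = 0; i < n; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 == c2) {
                continue; // binary-equal code units, including identical surrogates
            }
            if (Character.isSurrogate(c1) || Character.isSurrogate(c2)) {
                return slowCompare(s1, s2); // differing surrogate: fall back
            }
            int f1 = simpleFold(c1);
            int f2 = simpleFold(c2);
            if (f1 != f2) {
                return Integer.compare(f1, f2);
            }
        }
        return Integer.compare(s1.length(), s2.length());
    }
}
```

For the inputs mentioned in the description, "mark" against "Mark" and "mark" against "fred", the loop finishes without touching the fallback or allocating anything.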

10/15/08 11:47:51 changed by weiv

  • status changed from new to closed.
  • resolution set to fixed.
