Ticket #3315 (closed defect: fixed)


Opened 5 years ago

Last modified 1 month ago

StringSearch and whole word matches

Reported by: swquek(at)us.ibm.com Assigned to: eric
Priority: major Milestone: 4.0
Component: collation Version: 2.8
Keywords: collation Cc:
Load: Xref: 5420
Java Version: Operating System: all
Project (C/J): ICU4C,ICU4J and ICU4JNI Weeks: .2
Review: andy

Description

To match or not to match characters in composite characters. See email excerpt below.


Summary of the issue that you are concerned with: Users might be confused when a search for a character pattern finds a match inside a composite character in the text string. For instance, StringSearch finds a match for the pattern "\u004e" in the text string "\u00d1", since NFD(\u00d1) == \u004e\u0303.
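The decomposition behind this example can be checked with a minimal sketch. It uses the JDK's java.text.Normalizer rather than ICU, and the class name NfdDemo is made up for illustration.

```java
import java.text.Normalizer;

public class NfdDemo {
    // Canonical decomposition (NFD) of a string, using the JDK normalizer.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00d1";        // N WITH TILDE, one code point
        String decomposed = nfd(composed); // N followed by COMBINING TILDE

        // The decomposed form is exactly U+004E U+0303.
        System.out.println(decomposed.equals("\u004e\u0303")); // true

        // A plain code unit search finds "N" only in the decomposed form,
        // which is why a CE-based search that matches in both forms can surprise users.
        System.out.println(composed.contains("\u004e"));   // false
        System.out.println(decomposed.contains("\u004e")); // true
    }
}
```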

Question: If a user is confused by finding a match for the pattern "\u004e" in "\u00d1", would he expect a match for the same pattern in the text "\u004e\u0303"?

When StringSearch was first implemented, it was decided to treat strings and patterns as sequences of collation elements (CEs) and to perform matches against these sequences. The indexes of the matching CEs were then mapped back to offsets in the text string and returned. Code was written at the time to handle pattern matching in composite characters, hence the behaviour that you see now. One example listed in the user guide shows this feature.

"For example, if the user searches for the pattern "ˋ" (\u02cb) in the string "ÀBC", (\u00c0BC) a match will be found at offsets <0, 1>."

I'll submit a bug/rfe for this and will discuss it with Mark in further detail when he comes back from vacation. However, since ICU is already very late into the 2.8 cycle, any decision we make will only be implemented after 2.8.

One workaround to your problem is to use the same Collator to generate the CEs of the substring that contains the match and compare their count with the count of the pattern's CEs. If the two CE sequences have different lengths, then the match isn't the exact match that you wanted. As an optimization, you might want to keep the CE count of the pattern to avoid generating it again and again.
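A rough sketch of that workaround follows. It uses the JDK's java.text collation classes as a stand-in for the ICU ones, so the collator, its rules and the exact CE counts are assumptions and will differ from ICU's. The class CeCountCheck and its helpers are hypothetical names, and the matched substring is supplied by hand rather than taken from a real StringSearch result.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CeCountCheck {
    // Assumption: the JDK root collator is a RuleBasedCollator (true in OpenJDK).
    static RuleBasedCollator rootCollator() {
        Collator c = Collator.getInstance(Locale.ROOT);
        if (!(c instanceof RuleBasedCollator)) {
            throw new IllegalStateException("collator does not expose collation elements");
        }
        // Decompose composed characters so that both forms produce the same CEs.
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return (RuleBasedCollator) c;
    }

    // Number of collation elements the collator generates for s.
    static int ceCount(RuleBasedCollator collator, String s) {
        CollationElementIterator it = collator.getCollationElementIterator(s);
        int n = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            n++;
        }
        return n;
    }

    // A reported match is "exact" only if it yields as many CEs as the pattern.
    static boolean isExactMatch(RuleBasedCollator collator, String matched, int patternCeCount) {
        return ceCount(collator, matched) == patternCeCount;
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = rootCollator();
        int patternCes = ceCount(collator, "N"); // computed once and reused

        // Pretend the search reported these substrings as matches for "N".
        System.out.println(isExactMatch(collator, "N", patternCes));
        System.out.println(isExactMatch(collator, "\u00d1", patternCes));
    }
}
```

Computing the pattern's CE count once, outside the loop over matches, is the optimization the workaround suggests.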

Attachments

Change History

12/31/69 17:28:34 changed by notes2

Probably not doable during swat, since I'm unfamiliar with the code DB2!

12/31/69 17:28:35 changed by notes

Does break iterator option handle this problem? Assess for 3.8, re-estimate it if problem still exists.

12/31/69 17:28:36 changed by auditor

  • 10/21/03 01:49:03 weiv sent reply 1
  • 11/14/03 20:08:32 schererm changed notes2
  • 02/06/04 14:55:54 weiv changed notes2
  • Tue Sep 27 15:24:08 2005 weiv changed notes2: target: "3.0" to "3.6",
  • Fri Oct 13 17:50:01 2006 andy changed notes2: assign: "weiv" to "andy", target: "3.6" to "3.8 Candidate",
  • Fri Oct 13 17:50:01 2006 andy changed notes
  • Fri Oct 20 21:54:24 2006 andy changed notes2: priority: "high" to "assess", target: "3.8 Candidate" to "3.8", weeks: "2" to ".2",
  • Fri Oct 20 21:54:24 2006 andy changed notes

10/21/03 01:49:03 changed by Vladimir Weinstein <weiv(at)jtcsv.com>

added on Alexis's behalf:

It seems the current ICU search class (by calling the usearch_openFromCollator function) would respect some input collator's attributes such as locale (e.g. UCA400_LEN_RUS would treat "ch" as two independent letters, while UCA400_LES_VTRADITIONAL would treat the same "ch" as one character), but would ignore the collator's other attributes such as strength (e.g. behaviour of UCA400 == UCA400_S3 == UCA400_S1, and all 3 collators will find the pattern "N" in the string "Ñ"!).

As a minimum, we would require the ability to specify case and/or accent sensitive/insensitive searching, hence we need the ICU search class to at least respect the input collator's strength and case_level values.

We also discussed this issue briefly Monday afternoon (10/20/2003). The main bug is that we get different search results for composed and decomposed forms when grapheme boundary detection is on. For the N-tilde case with grapheme boundary detection, there is no match on the decomposed string but there is one on the composed string. The problem is in the disconnect between CEs and characters. Additional checks need to be made after a candidate is found.
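The asymmetry described here can be illustrated with a small sketch that checks whether a candidate match starts and ends on grapheme boundaries. It uses the JDK's java.text.BreakIterator, not ICU's break iterator. The class GraphemeCheck and the candidate offsets <0, 1> are assumptions chosen for illustration, not output from StringSearch.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCheck {
    // True if both ends of the candidate match [start, end) fall on
    // character (grapheme) boundaries of the text.
    static boolean onGraphemeBoundaries(String text, int start, int end) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        return it.isBoundary(start) && it.isBoundary(end);
    }

    public static void main(String[] args) {
        // Composed form: U+00D1 is a single code unit, so <0, 1> is on boundaries
        // and a candidate match for "N" there passes the boundary test.
        System.out.println(onGraphemeBoundaries("\u00d1BC", 0, 1));  // true

        // Decomposed form: offset 1 falls between N and the combining tilde,
        // so the same candidate is rejected by the boundary test.
        System.out.println(onGraphemeBoundaries("N\u0303BC", 0, 1)); // false
    }
}
```

A boundary test alone therefore accepts the match in the composed text and rejects it in the decomposed text, which is why an additional check on each candidate is needed.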

08/30/07 11:32:22 changed by andy

  • load changed.
  • xref changed.
  • java changed.
  • revw changed.
  • milestone changed from 3.8 to 4.0.

09/28/07 13:42:15 changed by andy

  • priority changed from assess to major.

01/08/08 15:41:50 changed by eric

  • owner changed from andy to eric.

01/08/08 15:42:05 changed by eric

  • status changed from new to assigned.

05/28/08 11:28:55 changed by eric

  • xref set to 5420.
  • revw set to andy.

This is a fundamental problem with the Boyer-Moore implementation. The value returned by getMaxExpansion is not sufficient in all cases for computing skip distances. Fixes for this are checked into bug 5420.

10/22/08 15:02:44 changed by andy

  • status changed from assigned to closed.
  • resolution set to fixed.

Changes are under ticket 5420

