Ticket #5959 (closed defect: fixed)


Opened 1 year ago

Last modified 1 month ago

string search does not find roman numeral using search string of compatibility sequence

Reported by: dmso@...
Assigned to: eric
Priority: major
Milestone: 4.0
Component: collation
Version: 3.8
Keywords:
Cc:
Load:
Xref: 5420
Java Version:
Operating System:
Project (C/J): all
Weeks: 3
Review: andy

Description

U+2166 (ROMAN NUMERAL SEVEN) is compatibility-equivalent to the sequence U+0056 V, U+0049 I, U+0049 I, so at strength 2 or lower their sort keys should be equal. I have verified that they are the same for the collator "LDE_AN_CX_EX_FX_HX_NX_S2". When U+2166 is used as the search string, usearch_first() correctly matches the sequence U+0056 V, U+0049 I, U+0049 I in the source. However, when the sequence U+0056 V, U+0049 I, U+0049 I is used as the search string, usearch_first() does not match U+2166 in the source. This appears to affect the other Roman numerals as well. Here is a standalone program that reproduces the problem.

#include <stdio.h>
#include "unicode/ucol.h"
#include "unicode/ubrk.h"
#include "unicode/usearch.h"

int main()
{
    UChar search[] = { 0x0056, 0x0049, 0x0049 };
    UChar source[] = { 0x0020, 0x2166, 0x0020, };
    int32_t searchLen;
    int32_t sourceLen;
    UErrorCode icuStatus = U_ZERO_ERROR;
    UCollator *coll;
    const char *locale;
    UBreakIterator *ubrk;
    UStringSearch *usearch;
    int32_t match = 0;

    searchLen = sizeof(search)/sizeof(UChar);
    sourceLen = sizeof(source)/sizeof(UChar);

    coll = ucol_openFromShortString( "LDE_AN_CX_EX_FX_HX_NX_S2", false, NULL, &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "ucol_openFromShortString error\n" );
        goto exit;
    }

    locale = ucol_getLocaleByType( coll, ULOC_VALID_LOCALE, &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "ucol_getLocaleByType error\n" );
        goto exit;
    }

    ubrk = ubrk_open( UBRK_CHARACTER, locale, source, sourceLen, &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "ubrk_open error\n" );
        goto exit;
    }

    usearch = usearch_openFromCollator( search, searchLen, source, sourceLen, coll, NULL, &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "usearch_openFromCollator error\n" );
        goto exit;
    }

    usearch_setAttribute( usearch, USEARCH_OVERLAP, USEARCH_ON, &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "usearch_setAttribute error\n" );
        goto exit;
    }

    match = usearch_first( usearch, &icuStatus );
    if ( U_FAILURE(icuStatus) )
    {
        printf( "usearch_first error\n" );
        goto exit;
    }

    printf( "match=%d\n", match );

exit:
    return 0;
}

Attachments

Change History

09/25/07 14:23:29 changed by dmso@...

George Rhoten wrote:

Your test case depends on NFKD behavior. I don't think the collator uses NFKD or NFKC.

The other test case that you gave earlier might depend on normalization being turned on, since the string wasn't in NFC or NFD. That might have been the problem with your earlier test case.

You might also want to try out some of the on-line demonstrations at http://demo.icu-project.org/icu-bin/icudemos. You might want to look at the Normalization Browser, String Compare and Locale Explorer's collation subdemo. The Normalization Browser shows that Ⅶ doesn't change under NFC or NFD, and the String Compare demo shows that Ⅶ is not equivalent to VII.

George Rhoten IBM Globalization Center of Competency/ICU San José, CA, USA http://www.icu-project.org/

09/25/07 14:24:02 changed by dmso@...

Hi George,

We opened the collator with locale string "LDE_AN_CX_EX_FX_HX_NX_S2", so normalization was off. However, I changed it to "LDE_AN_CX_EX_FX_HX_NO_S2" to turn normalization on, and I can still reproduce the issue.

Regards,

Dominic So

09/25/07 14:25:02 changed by dmso@...

Hi George,

When U+2167 was used as the search string, it was able to match the source string sequence U+0056 U+0049 U+0049 U+0049, which implies that ICU converted the search string to NFKD or NFKC before scanning through the source string.

No match is produced when the search string is the sequence U+0056 U+0049 U+0049 U+0049 and the source string is U+2167, which may mean that ICU does not normalize the source string at all and normalizes only the search string.

I get the same results whether I turn normalization on or off when the collator is opened ("NX" or "NO").

Can you confirm whether the usearch* APIs do any normalization on either the search (aka pattern) or source (aka text)?

For the other issue, ICU ticket #5950, the search string is U+00C2 U+0303, and the source string contains the exact same sequence. However, usearch_first() does not find that sequence. Using the ICU Normalization Browser, I got these results. This also implies that ICU may be normalizing the search string, but not the source string. Again, I got the same result with normalization turned on or off when opening the collator. Could you please check what normalization, if any, is done.

Normalization Results

Mode    Quick Check   Normalized       Text
Input                 00c2 0303        Ẫ
NFD     NO            0041 0302 0303   Ẫ
NFC     MAYBE         1eaa             Ẫ
NFKD    NO            0041 0302 0303   Ẫ
NFKC    MAYBE         1eaa             Ẫ
FCD     YES           00c2 0303        Ẫ

Thanks,

Dominic So

09/26/07 15:43:19 changed by markus

  • owner changed from somebody to andy.
  • weeks set to 3.
  • xref changed.
  • revw changed.
  • milestone changed from UNSCH to 4.0.

01/08/08 15:41:23 changed by eric

  • owner changed from andy to eric.

01/08/08 15:41:28 changed by eric

  • status changed from new to assigned.

05/28/08 11:30:59 changed by eric

  • xref set to 5420.
  • revw set to andy.

This is a fundamental problem with the Boyer-Moore implementation. The value returned by getMaxExpansion is not sufficient in all cases for computing skip distances. Fixes for this are checked into bug 5420.

10/22/08 15:05:03 changed by andy

  • status changed from assigned to closed.
  • resolution set to fixed.

Changes are under ticket 5420

