Ticket #5950 (closed defect: fixed)

Bug contains 4 commit(s) | SVN Diffs for #5950

 

Opened 1 year ago

Last modified 6 months ago

string search does not return correct position when search pattern is U+00C2 U+0303

Reported by: Dominic So
Assigned to: michaelow
Priority: major
Milestone: 4.0
Component: collation
Version: 3.8
Keywords:
Cc:
Load:
Xref:
Java Version:
Operating System:
Project (C/J): ICU4C
Weeks: 1
Review: yoshito

Description (Last modified by srl)

I found a bug in the string search code. We are using ICU 3.2.1, but I confirmed the bug also exists in ICU 3.8. Here is a self-contained program that reproduces the problem. I expect match to be 1, but I get a value of 4. The bug only occurs when I use a Strength 1 collator. If I use a Strength 2 or greater collator, I get the correct value of 1.

#include <stdio.h>
#include "unicode/ucol.h"
#include "unicode/ubrk.h"
#include "unicode/usearch.h"

int main()
{
   UChar search[] = { 0x00C2, 0x0303 };
   UChar source[] = { 0x0020,
                      0x00C2, 0x0303, 0x0020, 0x0041, 0x0061,
                      0x1EAA, 0x0041, 0x0302, 0x0303, 0x00C2,
                      0x0303, 0x1EAB, 0x0061, 0x0302, 0x0303,
                      0x00E2, 0x0303, 0xD806, 0xDC01, 0x0300,
                      0x0020, };
   int32_t searchLen;
   int32_t sourceLen;
   UErrorCode icuStatus = U_ZERO_ERROR;
   UCollator *coll;
   const char *locale;
   UBreakIterator *ubrk;
   UStringSearch *usearch;
   int32_t match = 0;

   searchLen = sizeof(search)/sizeof(UChar);
   sourceLen = sizeof(source)/sizeof(UChar);

   coll = ucol_openFromShortString( "LDE_AN_CX_EX_FX_HX_NX_S1",
                                    0, /* forceDefaults: FALSE */
                                    NULL,
                                    &icuStatus );
   if ( U_FAILURE(icuStatus) )
   {
      printf( "ucol_openFromShortString error\n" );
      goto exit;
   }

   locale = ucol_getLocaleByType( coll,
                                  ULOC_VALID_LOCALE,
                                  &icuStatus );
   if ( U_FAILURE(icuStatus) )
   {
      printf( "ucol_getLocaleByType error\n" );
      goto exit;
   }

   ubrk = ubrk_open( UBRK_CHARACTER,
                     locale,
                     source,
                     sourceLen,
                     &icuStatus );
   if ( U_FAILURE(icuStatus) )
   {
      printf( "ubrk_open error\n" );
      goto exit;
   }

   usearch = usearch_openFromCollator( search,
                                       searchLen,
                                       source,
                                       sourceLen,
                                       coll,
                                       NULL,
                                       &icuStatus );
   if ( U_FAILURE(icuStatus) )
   {
      printf( "usearch_openFromCollator error\n" );
      goto exit;
   }

   usearch_setAttribute( usearch,
                         USEARCH_OVERLAP,
                         USEARCH_ON,
                         &icuStatus );
   if ( U_FAILURE(icuStatus) )
   {
      printf( "usearch_setAttribute error\n" );
      goto exit;
   }

   match = usearch_first( usearch,
                          &icuStatus );
   if ( U_FAILURE(icuStatus) )
   {
      printf( "usearch_first error\n" );
      goto exit;
   }

   printf( "match=%d\n", match );

exit:
   return 0;
}

Change History

09/19/07 13:40:29 changed by grhoten

  • weeks changed.
  • xref changed.
  • revw changed.
  • reporter changed from anonymous to Dominic So.

09/20/07 13:36:50 changed by srl

  • owner changed from somebody to srl.

09/20/07 13:59:10 changed by srl

  • description changed.

(follow-up: ↓ 5 ) 09/20/07 22:03:21 changed by srl

Some more information:

The break iterator and locale code above can be removed. The collator is equivalent to 'en_US' at L1.

At primary weight, the given search string is equivalent to \u00C2; the \u0303 is ignored.

Oddly enough, with source "\u00C2" and search string "\u00C2", there is a match at every strength except L2. hasAccentsAfterMatch() is activated, and as an uneducated guess I suspect something is wrong in there. I note a getFCD call; perhaps \u00c2\u0303 is a denormalized form.

It seems there should be a test to make sure that "\u00c2" finds "\u00c2", etc., for all weights. Or perhaps there is some interaction with the normalization mode.

(in reply to: ↑ 4 ) 09/20/07 22:25:54 changed by grhoten

This might be working as expected. I think normalization is disabled by default in en_US for performance reasons; only some locales turn it on.

\u00c2\u0303 is denormalized. The NFC form is \u1eaa, and NFD is \u0041\u0302\u0303.

So you might want to try it with normalization on, or try the NFC or NFD forms to see if things change.
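The relationship between the three forms mentioned above can be sketched in plain C. This is a toy illustration only, not ICU's normalizer: the table, the `ToyDecomp` type, and the `toy_nfd`/`toy_nfc` functions are hypothetical names introduced here, and the table covers just the characters that appear in this ticket.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hand-written table covering only the characters in this
   ticket. Real code would use ICU's normalization API instead. */
typedef struct {
    uint16_t composed;  /* precomposed character */
    uint16_t base;      /* first code point of its canonical decomposition */
    uint16_t mark;      /* combining mark that follows the base */
} ToyDecomp;

static const ToyDecomp kToyTable[] = {
    { 0x00C2, 0x0041, 0x0302 },  /* A with circumflex */
    { 0x00E2, 0x0061, 0x0302 },  /* a with circumflex */
    { 0x1EAA, 0x00C2, 0x0303 },  /* A with circumflex and tilde */
    { 0x1EAB, 0x00E2, 0x0303 },  /* a with circumflex and tilde */
};
#define TOY_TABLE_LEN (sizeof(kToyTable) / sizeof(kToyTable[0]))

/* Append the full decomposition of c to out; returns the new length. */
static size_t toy_decompose_one(uint16_t c, uint16_t *out, size_t len)
{
    size_t i;
    for (i = 0; i < TOY_TABLE_LEN; i++) {
        if (kToyTable[i].composed == c) {
            len = toy_decompose_one(kToyTable[i].base, out, len);
            out[len++] = kToyTable[i].mark;
            return len;
        }
    }
    out[len++] = c;
    return len;
}

/* Toy NFD: out must hold at least 3 * n units. No canonical reordering is
   done; U+0302 and U+0303 share a combining class, so none is needed here. */
size_t toy_nfd(const uint16_t *in, size_t n, uint16_t *out)
{
    size_t i, len = 0;
    for (i = 0; i < n; i++) {
        len = toy_decompose_one(in[i], out, len);
    }
    return len;
}

/* Toy NFC: decompose, then greedily recombine adjacent pairs found in the
   table, working in place. out must hold at least 3 * n units. */
size_t toy_nfc(const uint16_t *in, size_t n, uint16_t *out)
{
    size_t nfdLen = toy_nfd(in, n, out);
    size_t i, t, len = 0;
    for (i = 0; i < nfdLen; i++) {
        uint16_t c = out[i];
        int combined = 0;
        for (t = 0; len > 0 && t < TOY_TABLE_LEN; t++) {
            if (kToyTable[t].base == out[len - 1] && kToyTable[t].mark == c) {
                out[len - 1] = kToyTable[t].composed;
                combined = 1;
                break;
            }
        }
        if (!combined) {
            out[len++] = c;
        }
    }
    return len;
}
```

With the input { 0x00C2, 0x0303 }, `toy_nfd` produces 0041 0302 0303 and `toy_nfc` produces 1EAA, which is why the sequence in the report is neither NFC nor NFD.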

09/22/07 16:27:18 changed by srl

Turning normalization on for the collator did not fix the problem.

#5954 has a simplified test case for what is probably a related issue.

09/25/07 11:35:10 changed by dmso@...

More information I found from experimenting.

When the source string contains the unnormalized sequence (U+00C2 U+0303), ICU does not match it, whether the search string is the unnormalized sequence itself (U+00C2 U+0303), the NFC/NFKC form (U+1EAA), or the NFD/NFKD form (U+0041 U+0302 U+0303). This only affects a strength 1 collator (tried LDE and LEN). A strength 1 collator should be case and accent insensitive, so it should match the unnormalized sequence, but ICU does not. The result is the same whether normalization checking is on (NO) or off (NX) when the collator is opened. Note that ICU does find a match when the source string contains NFC/NFKC (U+1EAA) or NFD/NFKD (U+0041 U+0302 U+0303), which suggests that ICU does some normalization on the search string but not on the source string.
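As a reference point for the expected result, here is a rough model of a case- and accent-insensitive search, written in plain C without ICU. It is only a sketch under stated assumptions: `toy_primary_key` and `toy_primary_search` are hypothetical names, the key mapping is hand-written for the characters in this ticket, and it does not reproduce ICU's collation elements or search algorithm. Spaces keep their own key, loosely mirroring the non-ignorable alternate setting (AN) in the collator string.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical primary key for one UTF-16 code unit.
   0 means "ignorable at primary strength". */
static uint16_t toy_primary_key(uint16_t c)
{
    if (c >= 0x0300 && c <= 0x036F) {
        return 0;                       /* combining diacritical marks */
    }
    switch (c) {
    case 0x0041: case 0x0061:           /* A, a */
    case 0x00C2: case 0x00E2:           /* with circumflex */
    case 0x1EAA: case 0x1EAB:           /* with circumflex and tilde */
        return 0x0061;                  /* all share the key of 'a' */
    default:
        return c;                       /* everything else is its own key */
    }
}

/* Return the next non-ignorable key at or after *pos, advancing *pos past
   it, or 0 when the end of the string is reached. */
static uint16_t toy_next_key(const uint16_t *s, size_t len, size_t *pos)
{
    while (*pos < len) {
        uint16_t k = toy_primary_key(s[(*pos)++]);
        if (k != 0) {
            return k;
        }
    }
    return 0;
}

/* 1 if the pattern's non-ignorable keys match the source keys from start. */
static int toy_matches_at(const uint16_t *src, size_t srcLen, size_t start,
                          const uint16_t *pat, size_t patLen)
{
    size_t i = start, j = 0;
    int compared = 0;
    for (;;) {
        uint16_t pk = toy_next_key(pat, patLen, &j);
        if (pk == 0) {
            return compared;            /* pattern exhausted */
        }
        if (toy_next_key(src, srcLen, &i) != pk) {
            return 0;
        }
        compared = 1;
    }
}

/* Index of the first primary-level match, or -1 if there is none.
   A match is not allowed to start on an ignorable code unit. */
int toy_primary_search(const uint16_t *src, size_t srcLen,
                       const uint16_t *pat, size_t patLen)
{
    size_t start;
    for (start = 0; start < srcLen; start++) {
        if (toy_primary_key(src[start]) == 0) {
            continue;
        }
        if (toy_matches_at(src, srcLen, start, pat, patLen)) {
            return (int)start;
        }
    }
    return -1;
}
```

Applied to the source array from the description, this model returns index 1 for all three search strings (U+00C2 U+0303, U+1EAA, and U+0041 U+0302 U+0303), which is the position the reporter expects from a strength 1 collator.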

09/26/07 01:32:55 changed by srl

  • weeks set to 1.

09/26/07 13:06:09 changed by srl

  • owner changed from srl to michaelow.

10/08/07 15:12:30 changed by michaelow

Part of the cause is the prefix and suffix accent detection during the pattern initialization phase of the string search. An accent is detected on 0x00C2 (even though accents are ignored by a strength 1 collator), so the search incorrectly skips the first occurrence of 0x0041 when shiftForward is called the second time during the actual search phase. This explains why the issue only comes up with a strength 1 collator.

10/18/07 15:32:04 changed by michaelow

  • revw set to srl.

11/06/07 11:48:41 changed by grhoten

  • milestone changed from UNSCH to 4.0.

11/06/07 15:44:41 changed by michaelow

  • status changed from new to assigned.

06/25/08 07:57:39 changed by srl

  • revw changed from srl to yoshito.

06/30/08 17:24:36 changed by yoshito

  • status changed from assigned to closed.
  • resolution set to fixed.
