Ticket #6004 (closed defect: fixed)

Bug contains 9 commit(s) | SVN Diffs for #6004

Opened 2 years ago

Last modified 2 years ago

Ignorable code points get an incorrect weight at strength 4 with partial sort keys

Reported by: dmso@... Assigned to: ajmacher
Priority: major Milestone: 4.0
Component: collation Version: 3.8
Keywords: Cc:
Load: Xref:
Java Version: Operating System:
Project (C/J): ICU4C Weeks: 2
Review: grhoten

Description

When ucol_nextSortKeyPart() is used to generate sort keys with a strength 4 collator, each ignorable character is given 0xFF as its strength 4 weight. This does not happen with ucol_getSortKey(). As a result, strings that compare equal by their ucol_getSortKey() keys compare unequal by their ucol_nextSortKeyPart() keys.

Here is an example program that demonstrates this.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unicode/ucol.h>
#include <unicode/uiter.h>

/* Abort with a message naming the failing call if icuRC holds an error. */
#define CHECK(m) \
   if (U_FAILURE(icuRC)) \
   { \
      printf("Failed on '%s'\n\n", #m); \
      exit(-1); \
   }

int main(int argc, char* argv[])
{
   UErrorCode icuRC = U_ZERO_ERROR;
   UCollator* ucol;
   UChar data[] = { 0xFFFD, 0x0006, 0x0006, 0x0006 };
   int i, j;
   static const int bufSize = 50;
   unsigned char buf[bufSize];

   ucol = ucol_openFromShortString("LEN_S4", false, NULL, &icuRC);
   CHECK(ucol_openFromShortString);

   for (i=0; i<4; ++i)
   {
      UCharIterator uiter;
      uint32_t state[2] = { 0, 0 };
      int32_t keySize;
      int32_t dataLen = i+1;

      printf("String:");
      for (j=0; j<dataLen; ++j)
      {
         printf(" %04X", data[j]);
      }
      printf("\n");

      // Full sort key
      keySize = ucol_getSortKey(ucol,
                                data,
                                dataLen,
                                buf,
                                bufSize);
      CHECK(ucol_getSortKey);

      printf("\tFull key:    ");
      for (j=0; j<keySize; ++j)
      {
         printf("%02x", buf[j]);
      }
      printf("\n");

      // Partial sort key
      uiter_setString(&uiter, data, dataLen);
      keySize = ucol_nextSortKeyPart(ucol,
                                     &uiter,
                                     state,
                                     buf,
                                     bufSize,
                                     &icuRC);
      CHECK(ucol_nextSortKeyPart);

      printf("\tPartial key: ");
      for (j=0; j<keySize; ++j)
      {
         printf("%02x", buf[j]);
      }
      printf("\n\n");
   }

   //=============================================
   ucol_close(ucol);
   return(0);
}

Output on ICU 3.2.1:

String: FFFD
        Full key:    1fb301050105012100
        Partial key: 1fb30105010501ff00

String: FFFD 0006
        Full key:    1fb301050105012100
        Partial key: 1fb30105010501ffff00

String: FFFD 0006 0006
        Full key:    1fb301050105012100
        Partial key: 1fb30105010501ffffff00

String: FFFD 0006 0006 0006
        Full key:    1fb301050105012100
        Partial key: 1fb30105010501ffffffff00

Output on ICU 3.8:

String: FFFD
        Full key:    225d01050105012400
        Partial key: 225d0105010501ff00

String: FFFD 0006
        Full key:    225d01050105012400
        Partial key: 225d0105010501ffff00

String: FFFD 0006 0006
        Full key:    225d01050105012400
        Partial key: 225d0105010501ffffff00

String: FFFD 0006 0006 0006
        Full key:    225d01050105012400
        Partial key: 225d0105010501ffffffff00
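
To make the difference easier to see, the keys above can be split into levels. The following is a minimal sketch, assuming that 0x01 separates the levels and 0x00 terminates the key, as the output above suggests. The helper names (sortkey_level, print_level) are hypothetical and not part of ICU; the two byte arrays are copied from the ICU 3.8 output for the string FFFD 0006.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Keys copied from the ICU 3.8 output above for the string FFFD 0006. */
static const unsigned char full_key_38[] =
    { 0x22, 0x5d, 0x01, 0x05, 0x01, 0x05, 0x01, 0x24, 0x00 };
static const unsigned char partial_key_38[] =
    { 0x22, 0x5d, 0x01, 0x05, 0x01, 0x05, 0x01, 0xff, 0xff, 0x00 };

/* Hypothetical helper, not an ICU API. Locates one level of a sort key
 * (1 = primary ... 4 = quaternary), assuming 0x01 separates levels and
 * 0x00 ends the key. Stores the start of the level in *start and returns
 * its length in bytes, or 0 if the key has no such level. */
static size_t sortkey_level(const unsigned char *key, int level,
                            const unsigned char **start)
{
    const unsigned char *p = key;
    int current = 1;
    size_t len = 0;

    assert(key != NULL && start != NULL && level >= 1);

    /* Skip the levels that precede the requested one. */
    while (current < level && *p != 0x00) {
        if (*p == 0x01) {
            ++current;
        }
        ++p;
    }
    *start = p;
    if (current < level) {
        return 0; /* key ended before the requested level */
    }
    /* Count bytes up to the next separator or the terminator. */
    while (p[len] != 0x00 && p[len] != 0x01) {
        ++len;
    }
    return len;
}

/* Prints the bytes of one level in hex, prefixed by a label. */
static void print_level(const char *label, const unsigned char *key, int level)
{
    const unsigned char *start;
    size_t len = sortkey_level(key, level, &start);
    size_t i;

    printf("%s level %d:", label, level);
    for (i = 0; i < len; ++i) {
        printf(" %02x", start[i]);
    }
    printf("\n");
}
```

Calling print_level("full", full_key_38, 4) and print_level("partial", partial_key_38, 4) shows that the first three levels are identical and only the fourth level differs: one byte in the full key, and one 0xff byte per code point in the partial key.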

Change History

11/08/07 15:15:18 changed by grhoten

  • owner changed from somebody to ajmacher.
  • weeks set to 2.
  • xref changed.
  • revw changed.
  • milestone changed from UNSCH to 4.0.

11/09/07 13:18:43 changed by weiv

I looked at ICU's code and I think I have figured out the issue. On the quaternary level, if the CE currently being processed is completely ignorable (one that equals zero), nothing should be added to the quaternary level. Right now, a 0xFF or a Hiragana quaternary value always gets added.

The likely fix is to change line 6074 of ucol.cpp so that ignorable characters do not add a quaternary weight.
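
The idea behind that fix can be sketched as follows. This is an illustration only, not the code in ucol.cpp: the function name, the constants and the flat CE array are assumptions made for the sketch, and it leaves out the Hiragana quaternary value and everything else the real loop does. It models only the rule that a CE equal to zero contributes nothing to the quaternary level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch: a completely ignorable CE is the value 0,
 * and every other CE contributes the single quaternary byte 0xFF. */
#define SKETCH_IGNORABLE_CE 0u
#define SKETCH_QUAT_WEIGHT  0xFFu

/* Hypothetical function, not an ICU API. Writes the quaternary bytes for
 * the given CEs into out and returns the number of bytes the level needs.
 * Bytes that do not fit in outCapacity are counted but not written. */
static size_t sketch_quaternary(const uint32_t *ces, size_t ceCount,
                                uint8_t *out, size_t outCapacity)
{
    size_t needed = 0;
    size_t i;

    assert(ces != NULL && out != NULL);

    for (i = 0; i < ceCount; ++i) {
        if (ces[i] == SKETCH_IGNORABLE_CE) {
            continue; /* completely ignorable: add nothing to this level */
        }
        if (needed < outCapacity) {
            out[needed] = (uint8_t)SKETCH_QUAT_WEIGHT;
        }
        ++needed;
    }
    return needed;
}
```

With one non-ignorable CE followed by any number of zero CEs, the sketch produces a single quaternary byte regardless of how many ignorable code points follow. That is the property the partial keys in the description lack.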

11/09/07 14:35:25 changed by ajmacher

  • status changed from new to assigned.
  • revw set to grhoten.

12/26/07 11:28:12 changed by grhoten

  • status changed from assigned to closed.
  • resolution set to fixed.
