Ticket #6826 (new defect)

Opened 10 months ago

Last modified 3 weeks ago

uregex_split() (possibly?) splitting incorrectly, split results differ from perl and expected results for some regexes

Reported by: john.engelhart@...
Assigned to: andy
Priority: assess
Milestone: UNSCH
Component: regexp
Version: 3.6
Keywords:
Cc:
Load:
Xref:
Java Version:
Operating System: Mac OS X 10.5
Project (C/J): ICU4C
Weeks:
Review:

Description

I've found some differences between how I expected uregex_split() to behave and the results it returns. The expected behavior is the way Perl's split() produces its results for the given regex. Below, the lines labeled 'ICU:' are the results returned by uregex_split() for the example string and the specified regex. The first example makes use of ICU's enhanced \b Thai word-breaking functionality. Also included (where possible) are the results returned by Perl, including the command used to produce them.

string  : "\u0e09\u0e31\u0e19\u0e01\u0e34\u0e19\u0e02\u0e49\u0e32\u0e27 I eat rice"
regex   : "(?w)\\b\\s*"
ICU     : "\u0e09\u0e31\u0e19", "\u0e01\u0e34\u0e19", "\u0e02\u0e49\u0e32\u0e27", "", "I", "", "eat", "", "rice"
Expected: "\u0e09\u0e31\u0e19", "\u0e01\u0e34\u0e19", "\u0e02\u0e49\u0e32\u0e27", "I", "eat", "rice"

perl cmd: shell% perl -e 'print "\"" . join("\", \"", split(/\b\s*/, "I|at|ice I eat rice")) . "\""; print("\n");'
string  : "I|at|ice I eat rice"
regex   : "\\b\\s*"
ICU     : "", "I", "|", "at", "|", "ice", "", "I", "", "eat", "", "rice"
PERL    : "I", "|", "at", "|", "ice", "I", "eat", "rice"
Expected: "I", "|", "at", "|", "ice", "I", "eat", "rice"

perl cmd: shell% perl -e 'print "\"" . join("\", \"", split(/\b\s*/, "dog cat   giraffe mouse |rat| apple| orange rabbit|grape")) . "\""; print("\n");'
string  : "dog cat   giraffe mouse |rat| apple| orange rabbit|grape"
regex   : "\\b\\s*"
ICU     : "", "dog", "", "cat", "", "giraffe", "", "mouse", "|", "rat", "| ", "apple", "| ", "orange", "", "rabbit", "|", "grape"
PERL    : "dog", "cat", "giraffe", "mouse", "|", "rat", "| ", "apple", "| ", "orange", "rabbit", "|", "grape"
Expected: "dog", "cat", "giraffe", "mouse", "|", "rat", "| ", "apple", "| ", "orange", "rabbit", "|", "grape"

perl cmd: shell% perl -e 'print "\"" . join("\", \"", split(/\s*\b\s*/, "dog cat   giraffe mouse |rat| apple| orange rabbit|grape")) . "\""; print("\n");'
string  : "dog cat   giraffe mouse |rat| apple| orange rabbit|grape"
regex   : "\\s*\\b\\s*"
ICU     : "", "dog", "", "cat", "", "giraffe", "", "mouse", "|", "rat", "|", "", "apple", "|", "", "orange", "", "rabbit", "|", "grape"
PERL    : "dog", "cat", "giraffe", "mouse", "|", "rat", "|", "apple", "|", "orange", "rabbit", "|", "grape"
Expected: "dog", "cat", "giraffe", "mouse", "|", "rat", "|", "apple", "|", "orange", "rabbit", "|", "grape"

Suggested (approximate) solution (if it is decided that ICU's results should match those of Perl):

(NOTE: This is completely untested. I wrote it directly in the Trac bug report. It is based on code I used in a re-implementation of split() functionality external to ICU.)

// modification to uregex.cpp / uregex_split()

UBool didFind;

// Skip any zero-length match that sits exactly at the start of the next
// output string; such a match would otherwise produce an empty field.
// didFind is tested first so that start()/end() are not called after a
// failed find(), which would set an error in *status.
do {
  didFind = regexp->fMatcher->find();
} while(didFind &&
        (regexp->fMatcher->end(*status) == regexp->fMatcher->start(*status)) &&
        (regexp->fMatcher->start(*status) == nextOutputStringStart));

if(didFind) {
  // We found another delimiter.  Move everything from where we started looking
  //  up until the start of the delimiter into the next output string.
  int32_t fieldLen = regexp->fMatcher->start(*status) - nextOutputStringStart;
  /* ... */
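
To make the proposed rule easier to check without an ICU build, here is a minimal standalone sketch in plain C++. It is not ICU code and has not been run against ICU. The helpers isWordAt(), findBoundary() and splitOnBoundary() are hypothetical names invented for this illustration, and findBoundary() is a hand-written ASCII-only stand-in for the regex \b\s* (it treats only the space character as whitespace). The search position advances by one after a zero-length match, and splitting stops once a delimiter reaches the end of the input; both are assumptions chosen so that the "ICU:" line of the second example above is reproduced when the skip rule is turned off.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true if position i of s holds an ASCII word character.
static bool isWordAt(const std::string& s, std::size_t i) {
    return i < s.size() &&
           (std::isalnum(static_cast<unsigned char>(s[i])) || s[i] == '_');
}

// Hypothetical stand-in for a matcher of \b\s* (ASCII only, spaces only).
// Finds the first match starting at or after `from`; returns false if none.
static bool findBoundary(const std::string& s, std::size_t from,
                         std::size_t& start, std::size_t& end) {
    for (std::size_t i = from; i <= s.size(); ++i) {
        bool before = (i > 0) && isWordAt(s, i - 1);
        bool after = isWordAt(s, i);
        if (before != after) {                              // \b
            start = end = i;
            while (end < s.size() && s[end] == ' ') ++end;  // \s*
            return true;
        }
    }
    return false;
}

// Split s on \b\s*. When skipEmptyAtFieldStart is true, a zero-length match
// located exactly where the current field begins is ignored.
std::vector<std::string> splitOnBoundary(const std::string& s,
                                         bool skipEmptyAtFieldStart) {
    std::vector<std::string> out;
    std::size_t fieldStart = 0;
    std::size_t searchFrom = 0;
    std::size_t mStart = 0, mEnd = 0;
    while (findBoundary(s, searchFrom, mStart, mEnd)) {
        // After a zero-length match, step forward so the search advances.
        searchFrom = (mEnd == mStart) ? mEnd + 1 : mEnd;
        if (skipEmptyAtFieldStart && mStart == mEnd && mStart == fieldStart) {
            continue;  // the rule proposed in this ticket
        }
        out.push_back(s.substr(fieldStart, mStart - fieldStart));
        fieldStart = mEnd;
        if (fieldStart == s.size()) return out;  // delimiter reached end of input
    }
    out.push_back(s.substr(fieldStart));  // text after the last delimiter
    return out;
}
```

Calling splitOnBoundary("I|at|ice I eat rice", false) yields the list shown on the "ICU:" line for that string, and passing true yields the list shown on the "PERL:" line.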

Attachments

Change History

04/15/09 11:37:03 changed by yoshito

  • owner changed from somebody to andy.
  • weeks changed.
  • xref changed.
  • revw changed.

09/15/09 06:16:33 changed by anonymous

10/01/09 18:49:32 changed by anonymous

10/15/09 21:55:53 changed by anonymous

11/25/09 19:36:19 changed by venn99

11/25/09 19:38:16 changed by venn99

01/15/10 16:59:46 changed by anonymous
