I've found some differences between how I expected uregex_split() to behave and the results it actually returns. The expected behavior is what Perl produces for the same regex. Below, the lines labeled 'ICU:' are the results returned by uregex_split() for the example string using the specified regex. The first example makes use of ICU's enhanced \b Thai word-breaking functionality. Also included (where possible) are the results returned by Perl, along with the command used to produce them.
string : "\u0e09\u0e31\u0e19\u0e01\u0e34\u0e19\u0e02\u0e49\u0e32\u0e27 I eat rice"
regex : "(?w)\\b\\s*"
ICU : "\u0e09\u0e31\u0e19", "\u0e01\u0e34\u0e19", "\u0e02\u0e49\u0e32\u0e27", "", "I", "", "eat", "", "rice"
Expected: "\u0e09\u0e31\u0e19", "\u0e01\u0e34\u0e19", "\u0e02\u0e49\u0e32\u0e27", "I", "eat", "rice"
perl cmd: shell% perl -e 'print "\"" . join("\", \"", split(/\b\s*/, "I|at|ice I eat rice")) . "\""; print("\n");'
string : "I|at|ice I eat rice"
regex : "\\b\\s*"
ICU : "", "I", "|", "at", "|", "ice", "", "I", "", "eat", "", "rice"
PERL : "I", "|", "at", "|", "ice", "I", "eat", "rice"
Expected: "I", "|", "at", "|", "ice", "I", "eat", "rice"
perl cmd: shell% perl -e 'print "\"" . join("\", \"", split(/\b\s*/, "dog cat giraffe mouse |rat| apple| orange rabbit|grape")) . "\""; print("\n");'
string : "dog cat giraffe mouse |rat| apple| orange rabbit|grape"
regex : "\\b\\s*"
ICU : "", "dog", "", "cat", "", "giraffe", "", "mouse", "|", "rat", "| ", "apple", "| ", "orange", "", "rabbit", "|", "grape"
PERL : "dog", "cat", "giraffe", "mouse", "|", "rat", "| ", "apple", "| ", "orange", "rabbit", "|", "grape"
Expected: "dog", "cat", "giraffe", "mouse", "|", "rat", "| ", "apple", "| ", "orange", "rabbit", "|", "grape"
perl cmd: shell% perl -e 'print "\"" . join("\", \"", split(/\s*\b\s*/, "dog cat giraffe mouse |rat| apple| orange rabbit|grape")) . "\""; print("\n");'
string : "dog cat giraffe mouse |rat| apple| orange rabbit|grape"
regex : "\\s*\\b\\s*"
ICU : "", "dog", "", "cat", "", "giraffe", "", "mouse", "|", "rat", "|", "", "apple", "|", "", "orange", "", "rabbit", "|", "grape"
PERL : "dog", "cat", "giraffe", "mouse", "|", "rat", "|", "apple", "|", "orange", "rabbit", "|", "grape"
Expected: "dog", "cat", "giraffe", "mouse", "|", "rat", "|", "apple", "|", "orange", "rabbit", "|", "grape"
Suggested (approximate) solution (if it is decided that ICU's results should match Perl's):
(NOTE: This is completely untested. I wrote it directly in the trac bug report. It is based on code I used in a re-implementation of split() functionality outside of ICU.)
// modification to uregex.cpp / uregex_split()
UBool didFind;
do {
    didFind = regexp->fMatcher->find();
    // Skip a zero-length match that begins exactly where the next output
    // string starts; it would otherwise produce an empty field.
    // didFind is tested first so that start()/end() are only queried
    // after a successful match.
} while (didFind &&
         regexp->fMatcher->start(*status) == regexp->fMatcher->end(*status) &&
         regexp->fMatcher->start(*status) == nextOutputStringStart);
if (didFind) {
    // We found another delimiter. Move everything from where we started looking
    // up until the start of the delimiter into the next output string.
    int32_t fieldLen = regexp->fMatcher->start(*status) - nextOutputStringStart;
    /* ... */
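
To make the proposed rule easier to experiment with outside of ICU, here is a minimal standalone sketch of the same idea. It is not ICU code: std::regex stands in for the ICU matcher (an assumption; its \b has no Thai word-breaking, so it only covers the ASCII examples), and splitFields and its parameters are hypothetical names invented for this illustration. With the flag off, every delimiter match ends a field; with the flag on, a zero-length match at the start of the current field is skipped, as in the loop above. A non-empty remainder after the last delimiter becomes the final field.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of ICU.
// fieldStart plays the role of nextOutputStringStart in uregex_split().
std::vector<std::string> splitFields(const std::string& input,
                                     const std::regex& delimiter,
                                     bool skipEmptyMatchAtFieldStart) {
    std::vector<std::string> fields;
    std::size_t fieldStart = 0;

    for (std::sregex_iterator it(input.begin(), input.end(), delimiter), end;
         it != end; ++it) {
        const std::ssub_match& delim = (*it)[0];
        const auto matchStart = static_cast<std::size_t>(
            std::distance(input.begin(), delim.first));
        const auto matchEnd = static_cast<std::size_t>(
            std::distance(input.begin(), delim.second));

        // The proposed rule: a zero-length delimiter sitting exactly at the
        // start of the current field is ignored instead of ending the field.
        if (skipEmptyMatchAtFieldStart &&
            matchStart == matchEnd && matchStart == fieldStart) {
            continue;
        }

        fields.push_back(input.substr(fieldStart, matchStart - fieldStart));
        fieldStart = matchEnd;
    }

    // Text after the last delimiter, if any, is the final field.
    if (fieldStart < input.size()) {
        fields.push_back(input.substr(fieldStart));
    }
    return fields;
}
```

Calling splitFields("I|at|ice I eat rice", std::regex("\\b\\s*"), false) gives the same list as the 'ICU:' line for that example, and passing true gives the same list as the 'PERL:' line. This only shows that the skip rule accounts for the difference in these examples; it does not cover Perl's other split() behaviors (limits, capture groups, general trailing-field handling).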