To match or not to match characters in composite characters.
See email excerpt below.
Summary of the issue that you are concerned with:
Users might be confused when a search for a character pattern finds a match
inside a composite character in the text string. For instance, StringSearch
finds a match for the character "\u004e" in the text string "\u00d1", since
NFD(\u00d1) == \u004e\u0303.
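The decomposition behind this example can be checked with a few lines of
Python. This is only an illustration using the standard unicodedata module; it
does not involve ICU or StringSearch, and the variable names are chosen here
for the example.

```python
import unicodedata

text = "\u00d1"     # LATIN CAPITAL LETTER N WITH TILDE
pattern = "\u004e"  # LATIN CAPITAL LETTER N

# Canonical decomposition (NFD) of the composed character.
decomposed = unicodedata.normalize("NFD", text)

# The decomposition is N followed by COMBINING TILDE.
assert decomposed == "\u004e\u0303"

# As raw code points, the pattern does not occur in the composed text...
assert pattern not in text
# ...but it does occur once the text has been decomposed.
assert pattern in decomposed
```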
Question:
If a user is confused by finding a match for the pattern "\u004e" in "\u00d1",
would they expect a match for the same pattern in the text "\u004e\u0303"?
When StringSearch was first implemented, it was decided to treat strings and
patterns as sequences of collation elements (CEs) and to perform matches against
these sequences. Indexes of the CE matches were then mapped back to the text
strings and returned. Code was then written to handle pattern matching in
composite characters, hence the behaviour that you see now. One example listed
in the user guide shows this feature.
"For example, if the user searches for the pattern "ˋ" (\u02cb) in the string
"ÀBC", (\u00c0BC) a match will be found at offsets <0, 1>."
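The match-then-map-back approach described above can be sketched roughly as
follows. This is not the ICU implementation: NFD code points stand in for CEs
(an assumption made so the sketch runs with the standard library alone), each
character is decomposed on its own (ignoring canonical reordering across
characters), and the function names decompose_with_offsets and find_matches are
hypothetical. It uses the \u004e / \u00d1 example from the summary.

```python
import unicodedata

def decompose_with_offsets(text):
    """Return (units, owners): the NFD code points of text and, for each
    one, the index of the original character it came from."""
    units, owners = [], []
    for index, char in enumerate(text):
        for unit in unicodedata.normalize("NFD", char):
            units.append(unit)
            owners.append(index)
    return units, owners

def find_matches(pattern, text):
    """Match on the decomposed sequences, then map each hit back to
    <start, end> offsets in the original text (end is exclusive)."""
    pattern_units = list(unicodedata.normalize("NFD", pattern))
    units, owners = decompose_with_offsets(text)
    matches = []
    if not pattern_units:
        return matches
    width = len(pattern_units)
    for start in range(len(units) - width + 1):
        if units[start:start + width] == pattern_units:
            first = owners[start]
            last = owners[start + width - 1]
            matches.append((first, last + 1))
    return matches

# The pattern matches inside the composite character, at offsets <0, 1>.
assert find_matches("\u004e", "\u00d1") == [(0, 1)]
# The same offsets are reported for the already-decomposed text.
assert find_matches("\u004e", "\u004e\u0303") == [(0, 1)]
```

Because the comparison happens on the decomposed sequence, the sketch cannot
tell a match on a whole character from a match on part of a composite one,
which is the behaviour under discussion.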
I'll submit a bug/rfe for this and will discuss it with Mark in further detail
when he comes back from vacation. However, since ICU is already very late into
the 2.8 cycle, any decision we make will only be implemented after 2.8.
One workaround to your problem is to use the same Collator to generate the CEs
of the substring that contains the match and compare their number with the
number of CEs in the pattern. If the two counts differ, the match isn't the
exact match that you wanted. As an optimization, you might want to keep the
pattern's CE count rather than regenerating it for every match.
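The shape of this workaround might look like the sketch below. It is written
under a loud assumption: the helper ce_count is hypothetical and counts NFD
code points as a stand-in for CEs, which is not what a real Collator produces
(contractions, expansions and ignorable characters all change the count). With
ICU, the count would instead come from iterating the collation elements of the
same Collator that the search uses.

```python
import unicodedata

def ce_count(s):
    # Stand-in only: counts NFD code points instead of real collation
    # elements. A real version would ask the search's Collator for CEs.
    return len(unicodedata.normalize("NFD", s))

def is_exact_match(text, start, end, pattern_ce_count):
    """Accept a reported match <start, end> only if the matched substring
    yields as many CEs as the pattern does."""
    return ce_count(text[start:end]) == pattern_ce_count

pattern = "\u004e"
pattern_ce_count = ce_count(pattern)  # computed once, reused for every match

# Match reported at <0, 1> inside the composite character: rejected,
# because the substring yields two units and the pattern only one.
assert not is_exact_match("\u00d1", 0, 1, pattern_ce_count)

# Match on a plain "N": accepted.
assert is_exact_match("N", 0, 1, pattern_ce_count)
```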