See also CLDR defect 1514.
Here's a note from Mark Davis on 2007-09-04 giving more details. I tend to favour the second of his proposed solutions.
ICU could do the right thing by fully normalizing, but at a definite performance cost for any affected language like Vietnamese. We concluded that, by adding some extra information, we could both do the right thing and keep the performance up. That's why Vladimir filed the bug on CLDR, to add that extra data.
Let me recount the issue. The desired ordering is:
a << a-dot < a-hat << a-dot-hat
The weights would be:
WA << WA+WD < WAH << WAH+WD
where WA is the weight of A, WAH is the weight of A-hat (a primary difference from WA), and WD is the weight of the dot (a primary ignorable).
That is, the dot is a secondary difference and the hat is a primary (letter) difference. Let's expand this out by adding all of the following equivalences. A hyphen (A-dot) denotes a precomposed character; a plus sign (A + dot) denotes a base letter followed by a separate combining mark. I also mark the cases that we transform to FCD, and show the desired weights after == for the first item in each equivalency group.
A == WA
<<
A-dot == WA+WD
A + dot
<
A-hat == WAH
A + hat
<<
A + dot + hat == WAH + WD
A + hat + dot => A + dot + hat (FCD)
A-dot + hat
A-hat + dot => A + dot + hat (FCD)
A-dot-hat
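The weight scheme above can be modeled with two-level sort keys. This is only a toy sketch: the numeric weights and the names `ELEMENTS` and `sort_key` are invented for illustration and are not ICU's actual data or API.

```python
# Toy two-level collation model; all numeric weights are hypothetical.
WA, WAH = 10, 11      # primary weights: A-hat is a different letter from A
COMMON, WD = 1, 2     # secondary weights: the dot only differs at level 2

# Each collation element is (primary, secondary); primary 0 = primary ignorable.
ELEMENTS = {
    "a":         [(WA, COMMON)],
    "a-dot":     [(WA, COMMON), (0, WD)],
    "a-hat":     [(WAH, COMMON)],
    "a-dot-hat": [(WAH, COMMON), (0, WD)],
}

def sort_key(name):
    """Build a key that compares all primaries first, then all secondaries."""
    ces = ELEMENTS[name]
    primaries = tuple(p for p, _ in ces if p != 0)
    secondaries = tuple(s for _, s in ces if s != 0)
    return (primaries, secondaries)

ordered = sorted(["a-dot-hat", "a-hat", "a-dot", "a"], key=sort_key)
print(ordered)  # ['a', 'a-dot', 'a-hat', 'a-dot-hat']
```

Because the dot contributes nothing at the primary level, a and a-dot share a primary key and differ only at the secondary level, which gives a << a-dot < a-hat << a-dot-hat.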
Because of FCD, the cases in the last group reduce to three:
1. A + dot + hat
2. A-dot + hat
3. A-dot-hat
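The reduction to three cases can be checked with Python's standard `unicodedata` module. The sketch below assumes the usual FCD test (each character is decomposed on its own, and the combining classes must not need reordering across character boundaries); `is_fcd` and `FORMS` are illustrative names, not ICU functions.

```python
import unicodedata

A, DOT, HAT = "a", "\u0323", "\u0302"                   # a, dot below, circumflex
A_DOT, A_HAT, A_DOT_HAT = "\u1ea1", "\u00e2", "\u1ead"  # precomposed forms

FORMS = {
    "A + dot + hat": A + DOT + HAT,
    "A + hat + dot": A + HAT + DOT,
    "A-dot + hat":   A_DOT + HAT,
    "A-hat + dot":   A_HAT + DOT,
    "A-dot-hat":     A_DOT_HAT,
}

def is_fcd(text):
    """True if decomposing each character separately needs no reordering."""
    prev_trail = 0
    for ch in text:
        nfd = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(nfd[0])    # class of first decomposed char
        trail = unicodedata.combining(nfd[-1])  # class of last decomposed char
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

# All five spellings are canonically equivalent: they share one NFD form.
nfd_forms = {unicodedata.normalize("NFD", s) for s in FORMS.values()}
fcd_ok = [name for name, s in FORMS.items() if is_fcd(s)]
print(len(nfd_forms))  # 1
print(fcd_ok)          # ['A + dot + hat', 'A-dot + hat', 'A-dot-hat']
```

The two spellings that fail the check are exactly the ones marked "(FCD)" in the list, since the hat (combining class 230) precedes the dot (class 220) in both.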
At build time, we build data for the canonical equivalents of all the characters that are tailored. So we build a table for
A + hat
A-hat
but none of the other cases. Now, in processing, we deal with #1 correctly: a discontiguous contraction joins A and hat across the intervening dot to get WAH, and the weight for the dot comes after. But cases #2 and #3 don't get touched, so we get a difference that shouldn't exist.
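A toy model of that lookup makes the failure concrete. Everything here is an assumption for illustration: the symbolic weights, the `TABLE` contents (including the guess that untailored a-dot and a-dot-hat map to their decomposed default weights, with `WH` as an untailored hat weight), and the `weights` function, which is a simplified discontiguous matcher and not ICU's implementation.

```python
import unicodedata

DOT, HAT = "\u0323", "\u0302"
A_DOT, A_HAT, A_DOT_HAT = "\u1ea1", "\u00e2", "\u1ead"

TABLE = {
    # Assumed untailored defaults
    "a": ["WA"],
    DOT: ["WD"],
    HAT: ["WH"],
    A_DOT: ["WA", "WD"],
    A_DOT_HAT: ["WA", "WD", "WH"],
    # Tailored entries built from the rule for a-hat
    A_HAT: ["WAH"],
    "a" + HAT: ["WAH"],
}

def weights(text):
    """Map text to weights, allowing discontiguous contractions with marks."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        key = chars[i]
        j = i + 1
        max_skipped = 0  # highest combining class among marks skipped so far
        while j < len(chars):
            ccc = unicodedata.combining(chars[j])
            if ccc == 0:
                break
            if ccc > max_skipped and key + chars[j] in TABLE:
                key += chars.pop(j)  # unblocked mark joins the contraction
            else:
                max_skipped = max(max_skipped, ccc)
                j += 1
        out.extend(TABLE[key])
        i += 1
    return out

print(weights("a" + DOT + HAT))  # case 1: ['WAH', 'WD']
print(weights(A_DOT + HAT))      # case 2: ['WA', 'WD', 'WH']
print(weights(A_DOT_HAT))        # case 3: ['WA', 'WD', 'WH']
```

Under these assumptions only case 1 reaches WAH + WD; the other two canonically equivalent spellings keep the primary weight of plain A.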
Here are some of the alternatives that would solve the problem for Vietnamese.
- Have a flag to fully decompose (NFD, not just FCD). Use that to fully decompose Vietnamese. Expensive.
- Note at build time when we see a contraction of X + combining mark, and mark all of the characters containing the base letter of X as needing full decomposition. Thus when we hit a-dot, or a-ring, or anything containing 'a' or 'A', AND that character is followed by a combining mark, we'd decompose fully. This is faster, since we don't always decompose, but it requires lookahead plus new code for a different kind of operation.
- Add all the characters that are significant to Vietnamese to the ones that get pre-built table entries. At that point, we'd build and cache all the canonically equivalent cases. Thus A-dot-hat would have a pre-built entry (WAH + WD) -- and by equivalence, so would A-dot + hat, which would be a contraction that also produced WAH + WD. Much cheaper, and it needs little (though not zero) new code. It is not completely general, since it wouldn't handle, say, a + bar_below + hat correctly unless we knew when the rules were built that this combination was important.
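The third alternative can be sketched as a small build-time step. This is a hedged illustration only: `SIGNIFICANT` stands in for the extra data that would list the important characters, the weights are hard-coded symbols where a real builder would derive them from the rules, and `canonical_spellings` and `build_extra_entries` are hypothetical helpers.

```python
import unicodedata

HAT = "\u0302"
A_DOT, A_DOT_HAT = "\u1ea1", "\u1ead"

def canonical_spellings(ch):
    """Return ch plus its one-step canonical decomposition, if it has one."""
    spellings = [ch]
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith("<"):  # skip compatibility mappings
        spellings.append("".join(chr(int(cp, 16)) for cp in decomp.split()))
    return spellings

# Hypothetical extra data: characters significant to Vietnamese, with the
# weights the tailoring implies for them.
SIGNIFICANT = {A_DOT_HAT: ["WAH", "WD"]}

def build_extra_entries(significant):
    """Pre-build table entries for each character and its equivalents."""
    table = {}
    for ch, ws in significant.items():
        for spelling in canonical_spellings(ch):
            table[spelling] = ws
    return table

extra = build_extra_entries(SIGNIFICANT)
print(extra[A_DOT_HAT])    # ['WAH', 'WD']
print(extra[A_DOT + HAT])  # ['WAH', 'WD']
```

Only combinations listed in the extra data get entries, which mirrors the limitation noted above: a spelling that nobody flagged as important at build time would still be missed.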
(The bug doesn't need the CLDR data to be fixed in the short term -- that can be done wholly in ICU as a special case. But for the longer term, we should add the data to CLDR.)