ICU converters that use conversion tables (.ucm/.cnv files) distinguish, both
in the data structure and in behavior, between roundtrip mappings and fallback
or reverse fallback mappings.
Briefly, roundtrip mappings are between Unicode code points and byte codes for
the same characters. Fallback mappings map between codes for more or less
similar (that is, different) characters. They are generally useful for display
but can be harmful for text processing because they lose information. "Reverse"
fallbacks map specifically from a codepage into Unicode, while plain
"fallbacks" usually refers to mappings from Unicode to a codepage.
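As an illustration of how the distinction appears in the data, the following
Python sketch classifies mapping lines from a .ucm file by the precision
indicator at the end of each line (|0 for roundtrip, |1 for fallback, |3 for
reverse fallback). The function name parse_mapping and the sample lines in the
usage note are made up for this sketch; other indicators and the .ucm header
are deliberately not handled.

```python
import re

# Precision indicators at the end of .ucm mapping lines.
# Only the three kinds discussed in this note are modeled here.
KINDS = {
    0: "roundtrip",
    1: "fallback",          # Unicode -> codepage only
    3: "reverse fallback",  # codepage -> Unicode only
}

# One mapping line: one or more <Uxxxx>, one or more \xHH, then |n.
LINE = re.compile(
    r"^((?:<U[0-9A-Fa-f]{4,6}>)+)\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)$"
)

def parse_mapping(line):
    """Return (code_points, byte_string, kind) for one .ucm mapping line."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a mapping line: {line!r}")
    code_points = tuple(
        int(h, 16) for h in re.findall(r"<U([0-9A-Fa-f]+)>", m.group(1))
    )
    data = bytes(
        int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2))
    )
    kind = KINDS.get(int(m.group(3)), "other")
    return code_points, data, kind
```

For example, parse_mapping(r"<U0041> \x41 |0") yields a roundtrip mapping
between U+0041 and byte 0x41, while a line such as r"<UFF21> \x41 |1" would be
classified as a fallback from U+FF21 to the same byte.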
ICU allows fallbacks to be turned on or off, but that setting affects only
fallbacks from non-PUA (Private Use Area) Unicode code points to codepage
bytes. Currently, ICU converters always use PUA and reverse fallbacks.
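The current behavior can be modeled roughly as follows. This is a Python
sketch of the rule as described in this note, not ICU's implementation; the
names is_pua and mapping_is_used are hypothetical, and the PUA test uses the
standard Unicode Private Use ranges, which may differ in detail from the check
ICU performs internally.

```python
def is_pua(cp):
    """True for code points in the three Unicode Private Use Areas."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def mapping_is_used(kind, code_point, fallbacks_on):
    """Model of the current rule.

    kind is "roundtrip", "fallback" (Unicode -> codepage) or
    "reverse fallback" (codepage -> Unicode); code_point is the
    Unicode side of the mapping.
    """
    if kind in ("roundtrip", "reverse fallback"):
        return True  # always used, regardless of the setting
    if kind == "fallback":
        # The on/off setting matters only for non-PUA code points.
        return fallbacks_on or is_pua(code_point)
    raise ValueError(f"unknown mapping kind: {kind!r}")
```

With fallbacks off, mapping_is_used("fallback", 0xFF21, False) is False, but
the same call with the PUA code point 0xE000 is True.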
This is based on the following theory:
- Reverse fallbacks are rare because Unicode was designed to include
all characters from all common charsets.
- Reverse fallbacks should be defined only when Unicode actually considers
the two codepage characters to be the same in its text model.
(Therefore, reverse fallbacks should be ok - same characters.)
- If a character is actually missing from Unicode, it should have a
roundtrip mapping to/from the Unicode PUA.
- If a conversion table for such a charset is modified after that character was
added to Unicode, then the old PUA code point may reasonably get a fallback
mapping, while the newly assigned code point takes over the roundtrip
mapping.
(Therefore PUA fallbacks should be ok - same characters.)
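The PUA scenario in the last two points can be sketched in Python as a table
before and after the character is added to Unicode. All values are
illustrative only: byte 0x80, PUA code point U+E000, and U+20AC standing in
for "the newly assigned code point". The helper build is hypothetical and
handles only roundtrip and fallback entries.

```python
# Table while the character is still missing from Unicode:
# it roundtrips through a PUA code point.
OLD_TABLE = [
    (0xE000, b"\x80", "roundtrip"),
]

# Table after the character was added to Unicode: the new code point
# takes over the roundtrip, the old PUA code point becomes a fallback.
NEW_TABLE = [
    (0x20AC, b"\x80", "roundtrip"),
    (0xE000, b"\x80", "fallback"),
]

def build(table):
    """Return (to_unicode, from_unicode) lookup dicts for a table."""
    to_unicode, from_unicode = {}, {}
    for cp, data, kind in table:
        if kind == "roundtrip":
            to_unicode[data] = cp
        # Roundtrip and fallback entries both map from Unicode.
        from_unicode[cp] = data
    return to_unicode, from_unicode

to_unicode, from_unicode = build(NEW_TABLE)
```

With the new table, byte 0x80 converts to U+20AC, and text that still contains
the old U+E000 converts to the same byte through the PUA fallback, so both
code points denote the same character.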
Conversion table design may not always be this rational. It may be desirable to
guarantee that only roundtrip mappings are used.
Proposal: add an API to set the fallback behavior to one of the following
levels:
- none = use only roundtrip mappings
- reverse only = use roundtrip mappings and reverse fallbacks, but no "forward" fallbacks
- reverse + PUA = current behavior for "fallbacks off" as above
- all = use all fallbacks, current behavior for "fallbacks on"
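The four levels are cumulative, so they could be modeled as an ordered
enumeration. The Python sketch below is only one possible reading of the
proposal; FallbackLevel, is_pua and mapping_is_used are hypothetical names,
not an existing or planned ICU API.

```python
from enum import IntEnum

class FallbackLevel(IntEnum):
    NONE = 0             # only roundtrip mappings
    REVERSE_ONLY = 1     # plus reverse fallbacks
    REVERSE_AND_PUA = 2  # plus fallbacks from PUA code points
    ALL = 3              # plus all other "forward" fallbacks

def is_pua(cp):
    """True for code points in the three Unicode Private Use Areas."""
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

def mapping_is_used(kind, code_point, level):
    """Decide whether a mapping is used at the given level.

    kind is "roundtrip", "fallback" (Unicode -> codepage) or
    "reverse fallback" (codepage -> Unicode).
    """
    if kind == "roundtrip":
        return True
    if kind == "reverse fallback":
        return level >= FallbackLevel.REVERSE_ONLY
    if kind == "fallback":
        if is_pua(code_point):
            return level >= FallbackLevel.REVERSE_AND_PUA
        return level >= FallbackLevel.ALL
    raise ValueError(f"unknown mapping kind: {kind!r}")
```

Under this reading, REVERSE_AND_PUA corresponds to today's "fallbacks off" and
ALL to today's "fallbacks on", while NONE and REVERSE_ONLY are the new,
stricter settings.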