ICU converters are inconsistent about how many bytes they include in an illegal sequence (one that violates the fundamental encoding scheme). This is the sequence that is reported via callbacks and ucnv_getInvalidChars(), and it also determines where the conversion stops with the stop callback.
When converting from UTF-8 to Unicode, the "offending" byte (non-trail byte in trail byte position) is excluded; the illegal sequence stops before that byte, and if the callback resets the error code, then the converter restarts with that byte. For example, in <E3 80 22> the illegal sequence will be <E3 80> and the converter will either stop before/on the 22 or restart there. (ucnv_fromUnicode() behaves like this as well, for the UTF-16 input.)
When converting from table-based (.cnv-based) MBCS charsets to Unicode, the "offending" byte is included in the illegal sequence. For example, in Shift-JIS, <81 22 61> would result in an illegal sequence of <81 22> and the converter stops before/on the 61 or restarts there.
Converters should be consistent.
There seem to be three reasonable behaviors:
- Option a: Include the offending byte in the illegal sequence.
- Option b: Exclude the offending byte, with a multi-byte illegal sequence up to just before it.
- Option c: Make all illegal sequences single bytes; that is, regardless of how many bytes were read, go back to after the first one if there is an error.
Transmission errors are rare with modern protocols. The most common error is assumed to be faulty software that truncates in the middle of a character at the end of some buffer, which might then be concatenated with other buffers. In other words, we assume that the most common cause of encoding errors is too few trail bytes in a multi-byte sequence.
Option b (exclude the offending byte but go up to it) works best for that case because the converter restarts with the following single or lead byte. Option a (include the offending byte) would swallow a single-byte character, usually an ASCII character, which might destroy XML/HTML syntax. Option c (restart after the first byte of the bad sequence) will sometimes restart on a trail byte that can be mistaken for a lead byte; this does not happen in UTF-8, but it does in MBCSes where lead and trail byte value ranges overlap.
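The difference between the three behaviors can be sketched in a few lines of C. This is a simplified illustration and not ICU code: the names SeqOption and seq_restart_index() are hypothetical, the lead-byte table is a reduced UTF-8 model, and only the case of a non-trail byte in trail byte position is handled.

```c
#include <assert.h>
#include <stddef.h>

/* The three candidate behaviors (hypothetical names, not ICU API). */
typedef enum { OPTION_A_INCLUDE, OPTION_B_EXCLUDE, OPTION_C_SINGLE } SeqOption;

/* Trail bytes announced by a UTF-8 lead byte (simplified; 0 for anything else). */
static int seq_trail_count(unsigned char b) {
    if (b >= 0xC2 && b <= 0xDF) return 1;
    if (b >= 0xE0 && b <= 0xEF) return 2;
    if (b >= 0xF0 && b <= 0xF4) return 3;
    return 0;
}

static int seq_is_trail(unsigned char b) { return (b & 0xC0) == 0x80; }

/*
 * s[start] begins a sequence. If a non-trail byte shows up in trail byte
 * position, store the length of the reported illegal sequence in *illegal_len
 * and return the index where conversion would restart.
 * If there is no offending byte (sequence complete, or input ends first),
 * *illegal_len is 0 and the index after the bytes read is returned.
 */
size_t seq_restart_index(const unsigned char *s, size_t len, size_t start,
                         SeqOption opt, size_t *illegal_len) {
    size_t i;
    int need;
    assert(s != NULL && illegal_len != NULL && start < len);
    need = seq_trail_count(s[start]);
    i = start + 1;
    while (i < len && need > 0 && seq_is_trail(s[i])) { ++i; --need; }
    if (need == 0 || i == len) { *illegal_len = 0; return i; }
    /* s[i] is the offending byte. */
    switch (opt) {
    case OPTION_A_INCLUDE: *illegal_len = i - start + 1; return i + 1;
    case OPTION_B_EXCLUDE: *illegal_len = i - start;     return i;
    default:               *illegal_len = 1;             return start + 1;
    }
}
```

For <E3 80 22> with start 0, this model reports 3 bytes and restarts at index 3 for option a, reports 2 bytes and restarts at index 2 (on the 22) for option b, and reports 1 byte and restarts at index 1 (on the 80) for option c.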
In addition, option c cannot be implemented in ICU. With streaming conversion, especially in the extreme case of passing only one byte at a time into the converter, the source pointer would have to be moved to before the current buffer in order to stop after the first byte of a 3-byte or longer sequence. For <E3 80 22>, each byte may be passed in a separate call to ucnv_toUnicode(). When the third byte (22) is read, the pointer can be set either to after the 22 (option a) or to before the 22 (option b), but not to after the E3, because that would be index -1 of the current buffer.
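The streaming problem can be reproduced with a small stateful model. This is a sketch under stated assumptions, not ICU's implementation: StreamState and stream_feed() are hypothetical names, the decoder is a reduced UTF-8 model, and the state keeps only a count of bytes already consumed from an unfinished sequence. The return value is the offset, relative to the current chunk, at which the source pointer would have to be left.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STREAM_OPTION_A, STREAM_OPTION_B, STREAM_OPTION_C } StreamOption;

typedef struct {
    int pending_len; /* bytes of an unfinished sequence consumed so far, possibly in earlier chunks */
    int need;        /* trail bytes still expected */
} StreamState;

static int stream_trail_count(unsigned char b) {
    if (b >= 0xC2 && b <= 0xDF) return 1;
    if (b >= 0xE0 && b <= 0xEF) return 2;
    if (b >= 0xF0 && b <= 0xF4) return 3;
    return 0;
}

static int stream_is_trail(unsigned char b) { return (b & 0xC0) == 0x80; }

/*
 * Feed one chunk. Returns chunk_len if no illegal sequence was found.
 * Otherwise returns the offset in this chunk where the source pointer
 * would have to stop; a negative value lies before the chunk.
 */
long stream_feed(StreamState *st, const unsigned char *chunk, size_t chunk_len,
                 StreamOption opt) {
    size_t i;
    assert(st != NULL && (chunk != NULL || chunk_len == 0));
    for (i = 0; i < chunk_len; ++i) {
        unsigned char b = chunk[i];
        if (st->need == 0) {
            /* Start of a new character: single byte or lead byte. */
            st->need = stream_trail_count(b);
            st->pending_len = st->need > 0 ? 1 : 0;
        } else if (stream_is_trail(b)) {
            ++st->pending_len;
            if (--st->need == 0) st->pending_len = 0; /* character complete */
        } else {
            /* Offending byte at offset i; the sequence began pending_len bytes earlier. */
            long seq_start = (long)i - st->pending_len;
            st->need = 0;
            st->pending_len = 0;
            switch (opt) {
            case STREAM_OPTION_A: return (long)i + 1;   /* after the offending byte */
            case STREAM_OPTION_B: return (long)i;       /* before the offending byte */
            default:              return seq_start + 1; /* after the first byte */
            }
        }
    }
    return (long)chunk_len;
}
```

Feeding E3, 80 and 22 in three separate one-byte calls, the third call returns 1 for option a and 0 for option b, both inside the chunk. For option c it returns -1, which is the out-of-buffer position described above.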
In summary, it would be best to make all converters consistently implement option b, stopping just before an "offending" byte. Consistent behavior would then also allow a callback to choose to move the source pointer forward by one unit to emulate option a, or to do consistent error reporting.
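How a callback could emulate option a on top of option b can be sketched with a toy converter. Everything here is hypothetical and is not the ICU callback API: ToyIllegalCallback, toy_convert() and the two callbacks are illustrative names, the converter copies only ASCII bytes to the output, writes '?' for each well-formed multi-byte character, and silently drops stray non-ASCII bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: sees the illegal bytes and may move *src_index. */
typedef void (*ToyIllegalCallback)(const unsigned char *illegal, size_t illegal_len,
                                   size_t *src_index, size_t src_len);

static int toy_trail_count(unsigned char b) {
    if (b >= 0xC2 && b <= 0xDF) return 1;
    if (b >= 0xE0 && b <= 0xEF) return 2;
    if (b >= 0xF0 && b <= 0xF4) return 3;
    return 0;
}

static int toy_is_trail(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Option b as is: leave the source index just before the offending byte. */
void toy_keep_offending(const unsigned char *illegal, size_t illegal_len,
                        size_t *src_index, size_t src_len) {
    (void)illegal; (void)illegal_len; (void)src_index; (void)src_len;
}

/* Emulate option a: move one unit forward so the offending byte is skipped too. */
void toy_skip_offending(const unsigned char *illegal, size_t illegal_len,
                        size_t *src_index, size_t src_len) {
    (void)illegal; (void)illegal_len;
    if (*src_index < src_len) ++*src_index;
}

/* Always stops just before an offending byte (option b), then calls cb.
   Returns the number of characters written; out is NUL-terminated. */
size_t toy_convert(const unsigned char *src, size_t src_len,
                   char *out, size_t out_cap, ToyIllegalCallback cb) {
    size_t i = 0, n = 0;
    assert(src != NULL && out != NULL && out_cap > 0 && cb != NULL);
    while (i < src_len) {
        size_t start = i, j = i + 1;
        int need = toy_trail_count(src[i]);
        if (need == 0) {                       /* single byte (or stray byte) */
            if (src[i] < 0x80 && n + 1 < out_cap) out[n++] = (char)src[i];
            ++i;
            continue;
        }
        while (j < src_len && need > 0 && toy_is_trail(src[j])) { ++j; --need; }
        if (need == 0) {                       /* well-formed in this reduced model */
            if (n + 1 < out_cap) out[n++] = '?';
            i = j;
        } else if (j == src_len) {             /* truncated at end of input */
            break;
        } else {                               /* src[j] is the offending byte */
            i = j;
            cb(src + start, j - start, &i, src_len);
        }
    }
    out[n] = '\0';
    return n;
}
```

With the input <E3 80 22 61>, toy_keep_offending yields the output `"a` (the quotation mark survives), while toy_skip_offending yields `a`, which shows how including the offending byte can swallow an ASCII character.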