Ticket #5691 (closed defect: fixed)

Bug contains 7 commit(s) | SVN Diffs for #5691

 

Opened 3 years ago

Last modified 3 months ago

conversion: consistent illegal sequences

Reported by: markus Assigned to: markus
Priority: major Milestone: 4.1.2
Component: conversion Version:
Keywords: security Cc: mark, michaelow
Load: google:50 Xref: 6583 6587 6511
Java Version: Operating System:
Project (C/J): ICU4C Weeks: 1.5
Review: pedberg

Description

ICU conversion has mixed behavior for how many bytes to include in illegal sequences (such that the fundamental encoding scheme is violated) that are reported via callbacks and ucnv_getInvalidChars() - which also determines where the conversion stops with the stop callback.

When converting from UTF-8 to Unicode, the "offending" byte (non-trail byte in trail byte position) is excluded; the illegal sequence stops before that byte, and if the callback resets the error code, then the converter restarts with that byte. For example, in <E3 80 22> the illegal sequence will be <E3 80> and the converter will either stop before/on the 22 or restart there. (ucnv_fromUnicode() behaves like this as well, for the UTF-16 input.)

When converting from table-based (.cnv-based) MBCS charsets to Unicode, the "offending" byte is included in the illegal sequence. For example, in Shift-JIS, <81 22 61> would result in an illegal sequence of <81 22> and the converter stops before/on the 61 or restarts there.

Converters should be consistent.

There seem to be three reasonable behaviors:

  1. Include the offending byte in the illegal sequence.
  2. Exclude the offending byte, with a multi-byte illegal sequence up to just before it.
  3. Make all illegal sequences single bytes; that is, regardless of how many bytes were read, go back to after the first one if there is an error.

Transmission errors are rare with modern protocols. The most common error is assumed to be faulty software that truncates in the middle of a character at the end of some buffer, which might then be concatenated with other buffers. In other words, we assume that the most common cause of encoding errors is too few trail bytes in a multi-byte sequence.

Option b (exclude but go up to offending byte) works best for that because it restarts with the following single or lead byte. Option a (include offending byte) would swallow a single-byte character (usually an ASCII character, which might destroy XML/HTML syntax), while option c (restart after first byte of bad sequence) will sometimes (not for UTF-8 but for MBCSes where lead and trail byte value ranges overlap) restart on a trail byte that can be mistaken for a lead byte.

In addition, option c cannot be implemented in ICU: With streaming conversion, especially in the extreme case of only passing one byte at a time into the converter, the source pointer would have to be moved to before the current buffer in order to stop after the first byte of a 3-byte or longer sequence. For <E3 80 22>, each byte may be passed in in a separate call to ucnv_toUnicode(), and when reading the third byte (22), the pointer can be either set to after the 22 (option a) or before the 22 (option b) but not after the E3 because that would be index -1 of the current buffer.

In summary, it would be best to make all converters consistently implement option b, stopping just before an "offending" byte. Consistent behavior would then also allow a callback to choose to move the source pointer forward by one unit to emulate option a, or to do consistent error reporting.

Attachments

Change History

04/17/07 22:39:15 changed by grhoten

"Common error" is a relative term here. Other common scenarios can include the following:

  1. Take two files that are encoded differently and concatenate them together. For example:
    1. Take a UTF-8 encoded HTML file.
    2. Do an ssi include of an ISO-8859-1 encoded file with a copyright symbol, trademark symbol or other similar characters.
    3. Convert the new mutant file.
  2. Using a similar example as above, instead of using UTF-8/ISO-8859-1, use UTF-8/Shift-JIS.
  3. The text is mislabeled with a different codepage.
    1. The HTTP server says one encoding, and the HTML file states the real encoding.
    2. The user opens a file as Big5, and proceeds to enter Big5-HKSCS characters.

I've seen scenario 1 happen on the front page of a popular search engine a few years ago (you had to go to some of the localized versions to see the misinclusion, but it's already been fixed). Some ads on sf.net have this issue. I've also accidentally done it myself.

I'd vote for option b, and I favor consistency. That way when an ISO-8859-1 character is encountered and interpreted as an MBCS encoding, not too much is read. For example, consider the following

ß<br />

If ß (\xDF) is interpreted as UTF-8 or Shift-JIS, then the < character shouldn't be consumed and really mess up the HTML.

I'm not sure how the various combinations of UTF-8 and Shift-JIS converters interpreting each others data would pan out. Shift-JIS might get misaligned with either option a or b.

04/25/07 14:15:06 changed by grhoten

  • owner changed from somebody to markus.

07/15/08 16:46:06 changed by mark

  • load set to google:50.
  • milestone changed from UNSCH to 4.2.

09/15/08 13:59:50 changed by markus

  • keywords set to security.
  • priority changed from minor to major.
  • status changed from new to assigned.

10/02/08 16:56:18 changed by markus

I am planning to implement option b, with modifications due to interesting cases. The interesting cases are:

  1. The offending byte is completely illegal, that is, it cannot be a single/lead/trail byte. For example, FF in most multi-byte encodings.
  2. An intermediate byte (after the lead byte and before the "offending" byte) could start a character (could be a single/lead byte).

The full algorithm of what I plan to implement:

  • Let's say that we have seen "length" bytes and collected them in the array b[length]. The "offending" byte that made us go into exception handling is b[length-1].
  • We include at least the first byte b[0] in the illegal sequence.
  • We never include the offending byte b[length-1] in the illegal sequence.
  • If any of the intermediate bytes could be the start of a character, we stop the illegal bytes before the first one of those.

Pseudo-code:

  • for(i=1; i<(length-1) && !isSingleOrLead(b[i]); ++i) {}
  • Report an illegal sequence of length "i".

Examples:

  • Shift-JIS
    • 81 40 valid
    • 81 3C illegal 81 followed by valid 3C
    • 81 FF illegal 81 followed by illegal FF
  • EUC-JP
    • 8F A1 FE valid
    • 8F A1 3C illegal 8F A1 followed by valid 3C
    • 8F 3C FE A1 illegal 8F followed by valid 3C and FE A1
  • GB 18030
    • 81 30 FE 39 valid
    • 81 30 FE A1 illegal 81 followed by valid 30 and FE A1
    • 81 30 3C 3E illegal 81 followed by valid 30, 3C, and 3E
    • 81 3C FE A1 illegal 81 followed by valid 3C and FE A1
  • UTF-8
    • F1 80 81 82 valid
    • F1 80 81 3C illegal F1 80 81 followed by valid 3C
    • F1 80 81 FF illegal F1 80 81 followed by illegal FF
    • F1 80 FF 3C illegal F1 80 followed by illegal FF and valid 3C
    • F1 FF 3C 3E illegal F1 followed by illegal FF and valid 3C and 3E

This achieves

  1. Consistent behavior.
  2. The converter never swallows a byte that could start a new character. (Security issue.)
  3. No change to the UTF-8 converter behavior.

Unfortunately, this does share with option c the implementation complication of potentially having to back out bytes from previous buffers and function calls. I hope that I can use the "replay" mechanism from the longest-match m:n conversion feature.

10/14/08 13:35:00 changed by yoshito

  • project changed from all to ICU4C.
  • xref set to 6583.

Created a ticket for J porting. After the design is agreed and implemented in C, ticket:6583 is used for J porting.

10/15/08 20:38:05 changed by markus

  • revw set to pedberg.
  • milestone changed from 4.2 to 4.1.2.

The final policy implemented here is:

  • We include at least the first byte in the illegal sequence which we report via the callback.
  • If any of the non-initial bytes could be the start of a character, or the start of an escape sequence etc., we stop the reported illegal sequence before the first one of those, and restart the conversion with the potential starter byte.

The fix covers the most commonly used converters (the ones most commonly found on the web): Table-based MBCS/DBCS (including EBCDIC), ISO-2022-*, and HZ.

I tried to not change the behavior where it wasn't necessary. In particular, the (logically) simplest error handling would always report only the first byte, but that seems very unnatural and would be a big change in converter behavior. Instead, sometimes the "offending byte" is included in the illegal sequence like it used to be (if it can't start a new character) and sometimes not (if it can).

If the whole sequence is 3 or 4 bytes (EUC-JP/GB 18030), the converter may go back to stop before an even earlier byte. Sometimes it has to back out bytes from a previous buffer. I was happy to see that I could reuse the "replay" mechanism from the m:n conversion feature, and that it worked fine when used from converters that don't otherwise support that feature.

I added some extra code in ucnvmbcs.c to detect if what the state table says is a lead byte can truly start a character. This is mostly for our DBCS conversion tables where the state tables always produce two-byte sequences, but many lead bytes only ever start illegal sequences. I found it more natural, and it reduced the need to modify existing tests: Otherwise, just by looking at the state table the converter would see that FF is a valid lead byte when in fact it never starts a legal sequence in DBCS charsets.

Other converters:

  • SBCS converters are trivially unaffected.
  • The UTF-8 converter has always stopped before a non-trail-byte. (Same for CESU-8.)
  • We should not modify UTF-16 and UTF-32 (and their variants) because they are not really byte-based encodings: Users expect to be able to memcpy() them, at most with a byte swap, so we need to continue to always deliver pairs or quads of bytes.
  • I did not test or change or think much about any other converters: UTF-7, IMAP-mailbox-name, SCSU, BOCU-1, ISCII, LMBCS

10/15/08 20:48:52 changed by markus

  • xref changed from 6583 to 6583 6587.

10/20/08 11:23:24 changed by pedberg

  • xref changed from 6583 6587 to 6583 6587 6511.

Note, this also addressed many of the problems in ticket #6511.

12/05/08 08:55:19 changed by michaelow

  • cc changed from mark to mark, michaelow.

02/09/09 10:47:08 changed by pedberg

  • status changed from assigned to closed.
  • resolution set to fixed.

Add/Change #5691 (conversion: consistent illegal sequences)




Anti spam check: