Ticket #5869 (assigned defect)

Bug contains 2 commit(s) | SVN Diffs for #5869

 

Opened 1 year ago

Last modified 1 month ago

UnicodeString::doExtract writes only one single trailing null always

Reported by: mhaubi@... Assigned to: markus (accepted)
Priority: major Milestone: 4.1.1
Component: conversion Version: 3.6
Keywords: Cc:
Load: Xref:
Java Version: Operating System:
Project (C/J): ICU4C Weeks: 0.2
Review: srl

Description

When extracting from UnicodeString using an UConverter created with ucnv_open("UTF32", &uerr), only the first byte of the trailing null is set to zero in the target UChar32 based buffer.

Problem is that in Unicode::doExtract() trailing null is written with u_terminateChars().

ucnv_fromUChars() has the same problem.

See also http://sourceforge.net/mailarchive/forum.php?thread_name=1187779328.6102.32.camel%40salomon-22&forum_name=icu-support

Attachments

Change History

08/22/07 13:42:23 changed by grhoten

  • xref changed.
  • project changed from all to ICU4C.
  • milestone changed from UNSCH to 3.8.
  • owner changed from somebody to grhoten.
  • weeks set to 0.2.
  • revw changed.

08/22/07 16:37:44 changed by grhoten

  • revw set to eric.

The patch for this bug will allow non-byte based encodings, like UTF-16 and UTF-32, to be properly NULL terminated when the result is interpreted as a non-byte based encoding.

08/23/07 11:42:42 changed by eric

  • status changed from new to closed.
  • resolution set to fixed.

08/27/07 22:12:10 changed by markus

  • status changed from closed to reopened.
  • resolution deleted.
  • revw deleted.

The change breaks user code, as seen in our test code (see ticket 5882). Need to either re-fix or revert.

The APIs did work as designed, even if that wasn't the best design.

08/27/07 22:12:31 changed by markus

  • owner changed from grhoten to markus.
  • status changed from reopened to new.

08/27/07 22:12:49 changed by markus

  • status changed from new to assigned.

08/31/07 13:06:38 changed by markus

  • milestone changed from 3.8 to 4.0.

I reverted the change.

svn merge -c -22476 svn+ssh://source.icu-project.org/repos/icu/icu/trunk
(fix copyrights to 2007)
svn commit -m "..."
Committed revision 22607.

This change is too risky this close to a release. It needs more discussion for whether we want the change at all, and if so, how it should work.

For backward compatibility, in every API that does not have a destCapacity parameter and that used to write exactly one 0x00 byte we must continue to write exactly one 0x00 byte. (This is what ticket #5882 is about.)

For backward compatibility, we must write at least one 0x00 if it fits, even if sizeOfNull bytes won't fit. r22476 wrote all or nothing. If we make this sort of change, we need to write either 0 (if capacity<=length) or 1 (if length<capacity<length+sizeOfNULL) or sizeOfNull (if length+sizeOfNULL<=capacity) bytes.

We need to consider when to set U_STRING_NOT_TERMINATED_WARNING. r22476 set it if there was not space for sizeOfNull bytes, even if there was space for one and the old code wrote the 0x00 and didn't set the warning. I think we need to keep the old behavior so that we don't mess with existing user code. We could discuss changing the semantics such that we set the warning if not all of the sizeOfNull bytes fit, but we should consider it carefully.

  • If we change it (set the warning if not all of sizeOfNull fit) then we might break existing code.
  • If we don't change it (set the warning only if not even one 0x00 fits) then the warning becomes meaningless for sizeOfNull>1, but at least we are not changing existing behavior.

For backward compatibility, we cannot change the input side of functions like ucnv_toUChars() (ticket #5870): If we changed them to look for a NUL with sizeOfNull bytes, then we would not find the single NUL that we used to find. Someone might depend on the old behavior. (The old behavior is useless for UTF-16 but works fine for old-style DBCSes that don't use 0x00 in double-byte characters.

If we change the output side (carefully) but cannot change the input side, then we might confuse as many users as now because of the mismatch. We could decide that that's ok though.

It would be preferrable to not change the signature of u_terminateChars() but to introduce a sibling function with the additional parameter, so that we need not change as many call sites.

Just in terms of review and comfort level, I would like to take the time and examine each API (or many) that indirectly calls u_terminateChars(), not just directly.

07/15/08 15:09:21 changed by mark

  • milestone changed from 4.0 to 4.1.1.

10/15/08 10:27:55 changed by markus

  • revw set to srl.

Reviewer: There should be no net change here since the change should have been backed out.


Add/Change #5869 (UnicodeString::doExtract writes only one single trailing null always)




Anti spam check: