Ticket #3880 (new defect)

Opened 4 years ago

Last modified 4 months ago

ArabicShaping.java and ushape.c must not use 065C-065F for lam-alef ligatures (and other comments)

Reported by: kentk(at)cs.chalmers.se Assigned to: mati
Priority: assess Milestone: UNSCH
Component: others Version: 3.0
Keywords: Cc:
Load: Xref:
Java Version: Operating System: all
Project (C/J): ICU4C and ICU4J Weeks: 0.5
Review:

Description (Last modified by grhoten)

I'm looking at ushape.c and ArabicShaping.java and I find that both of them misuse U+065C-U+065F as "internal characters" for lam-alef ligatures, both for "shaping" and for "unshaping".

The code positions U+065C-U+065F were unallocated up to Unicode 4.0. But allocations of characters to all but one of these positions are in the pipeline:

065A..065C  3  ARABIC VOWEL SIGN SMALL V ABOVE
               ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
               ARABIC VOWEL SIGN DOT BELOW
065D..065E  2  ARABIC REVERSED DAMMA
               ARABIC FATHA WITH TWO DOTS

Anyone using these (yet to be) allocated characters (assuming that they stay at those positions) together with ushape.c's or ArabicShaping.java's shaping will be very unpleasantly surprised! Likewise for unshaping. I realise that ushape.c and ArabicShaping.java are essentially only for legacy use, but getting a lam-alef in place of the above-mentioned characters is too harsh a penalty for using "new" characters with these routines.
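To make the collision concrete, here is a minimal sketch of a guard a caller could run before handing text to the shaper or unshaper. It only reports whether the input already contains a code point in the U+065C-U+065F range named above. The class and method names are hypothetical and are not part of ICU.

```java
// Hypothetical helper, not part of ICU: detects input that would collide
// with the U+065C..U+065F range this ticket says is used internally.
final class PlaceholderGuard {
    static final int FIRST_PLACEHOLDER = 0x065C;
    static final int LAST_PLACEHOLDER = 0x065F;

    // Returns the index of the first colliding char, or -1 if there is none.
    static int firstCollision(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= FIRST_PLACEHOLDER && c <= LAST_PLACEHOLDER) {
                return i;
            }
        }
        return -1;
    }
}
```

A caller could refuse to shape, or log a warning, whenever firstCollision returns a non-negative index. This only works around the problem; it does not fix the routines themselves.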

The easy patch is to use four BMP non-characters. But then how do we know that those aren't used internally elsewhere in the system, and are given in the input to the shaper/unshaper?

I also see that 0xFFFF (with the misleading comment that this code would be in the PUA) is used for the shaper's internal purposes; the above question applies.
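As a rough illustration of the two points above, the following sketch classifies code points as noncharacters or as BMP private-use, and checks whether an input string already contains a noncharacter. The ranges are the standard Unicode ones (U+FDD0..U+FDEF plus the last two code points of each plane for noncharacters, U+E000..U+F8FF for the BMP PUA). The class and method names are assumptions made for this example.

```java
// Hypothetical helper, not part of ICU.
final class CodePointClasses {
    // Noncharacters: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF in every plane.
    static boolean isNoncharacter(int cp) {
        return Character.isValidCodePoint(cp)
                && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE);
    }

    // The BMP Private Use Area is U+E000..U+F8FF; U+FFFF lies outside it.
    static boolean isBmpPrivateUse(int cp) {
        return cp >= 0xE000 && cp <= 0xF8FF;
    }

    // True if the input already carries a noncharacter, in which case a
    // noncharacter placeholder would be ambiguous.
    static boolean containsNoncharacter(String text) {
        return text.codePoints().anyMatch(CodePointClasses::isNoncharacter);
    }
}
```

With these definitions, 0xFFFF is reported as a noncharacter and not as private-use, which is why the PUA comment is misleading. The containsNoncharacter check shows that the "used elsewhere in the system" question can only be answered per input, not in general.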

The TASHKEEL option seems strange. Combining characters don't have "isolated" or "medial" forms... They can be applied to a TATWEEL, but that is something else.

Change History

12/31/69 17:42:37 changed by notes2

check with authors about upcoming code updates

12/31/69 17:42:38 changed by notes

We might change the code to use UChar32 internally; then we could use out-of-range code points. Or it might be possible to rework the code so that it doesn't need placeholder code points.
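The first idea in this note could look roughly like the sketch below, written in Java with int standing in for UChar32. It is not the actual ICU code: the class, the method names, and the simplified lam-alef rule (plain LAM followed by plain ALEF always becomes the isolated ligature U+FEFB, ignoring joining context) are all assumptions made for illustration. The placeholder is a value above U+10FFFF, so it can never appear in valid input.

```java
// Hypothetical sketch, not the ICU implementation.
final class SentinelSketch {
    // One past the last valid code point (0x110000): cannot occur in input.
    static final int PLACEHOLDER = Character.MAX_CODE_POINT + 1;

    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Simplified rule: LAM (U+0644) + ALEF (U+0627) -> U+FEFB, with the
    // vacated slot marked by the out-of-range placeholder.
    static int[] markLamAlef(int[] cps) {
        int[] out = cps.clone();
        for (int i = 0; i + 1 < out.length; i++) {
            if (out[i] == 0x0644 && out[i + 1] == 0x0627) {
                out[i] = 0xFEFB;
                out[i + 1] = PLACEHOLDER;
            }
        }
        return out;
    }

    // Drops placeholders and converts back to a String.
    static String stripPlaceholders(int[] cps) {
        StringBuilder sb = new StringBuilder();
        for (int cp : cps) {
            if (cp != PLACEHOLDER) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }
}
```

Because the placeholder is not a valid code point, a real U+065C in the input passes through untouched. The cost is that the working buffer holds 32-bit values instead of 16-bit code units.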

12/31/69 17:42:39 changed by auditor

  • 07/07/04 14:32:51 dougfelt changed notes2
  • 07/07/04 14:32:51 dougfelt moved from incoming to others
  • 07/09/04 13:20:18 schererm changed notes2
  • Fri Dec 3 17:41:19 2004 dougfelt changed notes2: target: "3.2" to "3.4",
  • Tue Sep 27 13:27:27 2005 weiv changed notes2: (via expression '$PgoTl3.5') target: "3.4" to "",
  • Tue Oct 17 00:15:08 2006 emader changed notes2: target: "UNSCH" to "3.8 candidate",
  • Tue Oct 31 22:50:01 2006 emader changed notes2: priority: "small" to "high", target: "3.8 candidate" to "3.8",
  • Tue Oct 31 22:50:01 2006 emader changed notes

02/02/07 14:36:29 changed by andy

  • xref changed.
  • java changed.
  • revw changed.
  • milestone changed from 3.8 to 3.8 M2.

10/04/07 11:41:29 changed by grhoten

  • keywords deleted.
  • load changed.
  • description changed.
  • milestone changed from 3.8 M2 to UNSCH.

07/07/08 12:37:29 changed by srl

  • owner changed from doug to eric.
  • priority changed from major to assess.

07/21/08 08:58:21 changed by hchapman

  • owner changed from eric to mati.
