Ticket #5588 (new defect)

Opened 2 years ago

Last modified 4 months ago

ICU LayoutEngine ignored ISCII syllable detection and splitting rules in ligature formation

Reported by: Jasdeep.Sawhney@...
Assigned to: eric
Priority: minor
Milestone: 4.2
Component: layout
Version: 3.4
Keywords:
Cc: Tim.Band@Symbian.com, Myles.Benett@Symbian.com
Load:
Xref: 4995 5906
Java Version:
Operating System: All
Project (C/J): ICU4C
Weeks: 0.2
Review:

Description

When the ICU LayoutEngine checks for a syllable boundary before doing the reordering, it ignores the ISCII rule that limits the maximum number of VIRAMAs in a syllable. For example:

'Ka + Virama + Ka + Virama + Ka + Virama + Ka + Virama + Ka' should be split in the following way:

'Ka + Virama + Ka + Virama + Ka + Virama + Ka + Virama' - First syllable and 'Ka' - Second syllable

Proposed change: This can be done in the state table in IndicReordering.cpp, which could look like this:

//   xx  vm  sm  iv  i2  ct  cn  nu  dv  s1  s2  s3  vr  zw
    { 1,  1,  1,  5,  8,  3,  2,  1,  1,  9,  5,  1,  1,  1}, //  0 - ground state
    {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  1 - exit state
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5,  4, -1}, //  2 - consonant with nukta
    {-1,  6,  1, -1, -1, -1, -1,  2,  5,  9,  5,  5,  4, -1}, //  3 - consonant
    {-1, -1, -1, -1, -1, 12, 11, -1, -1, -1, -1, -1, -1,  7}, //  4 - consonant virama
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  5 - dependent vowels
    {-1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  6 - vowel mark
    {-1, -1, -1, -1, -1,  3,  2, -1, -1, -1, -1, -1, -1, -1}, //  7 - ZWJ, ZWNJ
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  4, -1}, //  8 - independent vowels that can take a virama
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, 10,  5, -1, -1}, //  9 - first part of split vowel
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1,  5, -1, -1}, // 10 - second part of split vowel
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 13, -1}, // 11 - ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 11,  5,  9,  5,  5, 13, -1}, // 12 - ct vr ct
    {-1, -1, -1, -1, -1, 15, 14, -1, -1, -1, -1, -1, -1,  7}, // 13 - ct vr ct vr
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 16, -1}, // 14 - ct vr ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 14,  5,  9,  5,  5, 16, -1}, // 15 - ct vr ct vr ct
    {-1, -1, -1, -1, -1, 18, 17, -1, -1, -1, -1, -1, -1,  7}, // 16 - ct vr ct vr ct vr
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 19, -1}, // 17 - ct vr ct vr ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 17,  5,  9,  5,  5, 19, -1}, // 18 - ct vr ct vr ct vr ct
    {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  7}  // 19 - ct vr ct vr ct vr ct vr

States 11-18 are new states that do not allow more than 4 VIRAMAs in a syllable; the 4th VIRAMA is explicit.
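To make the effect of the proposed table concrete, here is a minimal sketch of a driver loop that walks it. The names findSyllableEnd and splitSyllables and the CharClass enum are hypothetical, chosen for illustration only; this is not the actual ICU code, and it assumes each character has already been mapped to one of the 14 classes in the column header.

```cpp
#include <cstddef>
#include <vector>

// Character classes, in the column order of the table in this ticket.
enum CharClass { XX, VM, SM, IV, I2, CT, CN, NU, DV, S1, S2, S3, VR, ZW, CC_COUNT };

// Proposed state table, copied from the ticket. A negative entry ends the syllable.
static const signed char stateTable[][CC_COUNT] = {
    { 1,  1,  1,  5,  8,  3,  2,  1,  1,  9,  5,  1,  1,  1}, //  0 - ground state
    {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  1 - exit state
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5,  4, -1}, //  2 - consonant with nukta
    {-1,  6,  1, -1, -1, -1, -1,  2,  5,  9,  5,  5,  4, -1}, //  3 - consonant
    {-1, -1, -1, -1, -1, 12, 11, -1, -1, -1, -1, -1, -1,  7}, //  4 - consonant virama
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  5 - dependent vowels
    {-1, -1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1}, //  6 - vowel mark
    {-1, -1, -1, -1, -1,  3,  2, -1, -1, -1, -1, -1, -1, -1}, //  7 - ZWJ, ZWNJ
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  4, -1}, //  8 - independent vowels
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, 10,  5, -1, -1}, //  9 - first part of split vowel
    {-1,  6,  1, -1, -1, -1, -1, -1, -1, -1, -1,  5, -1, -1}, // 10 - second part of split vowel
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 13, -1}, // 11 - ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 11,  5,  9,  5,  5, 13, -1}, // 12 - ct vr ct
    {-1, -1, -1, -1, -1, 15, 14, -1, -1, -1, -1, -1, -1,  7}, // 13 - ct vr ct vr
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 16, -1}, // 14 - ct vr ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 14,  5,  9,  5,  5, 16, -1}, // 15 - ct vr ct vr ct
    {-1, -1, -1, -1, -1, 18, 17, -1, -1, -1, -1, -1, -1,  7}, // 16 - ct vr ct vr ct vr
    {-1,  6,  1, -1, -1, -1, -1, -1,  5,  9,  5,  5, 19, -1}, // 17 - ct vr ct vr ct vr ct nu
    {-1,  6,  1, -1, -1, -1, -1, 17,  5,  9,  5,  5, 19, -1}, // 18 - ct vr ct vr ct vr ct
    {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  7}  // 19 - ct vr ct vr ct vr ct vr
};

// Hypothetical helper (not the ICU function): returns the index one past the
// last character of the syllable that begins at `start`.
std::size_t findSyllableEnd(const std::vector<CharClass>& classes, std::size_t start)
{
    std::size_t cursor = start;
    int state = 0; // every syllable starts in the ground state

    while (cursor < classes.size()) {
        state = stateTable[state][classes[cursor]];
        if (state < 0) {
            break; // this character does not belong to the current syllable
        }
        ++cursor;
    }
    return cursor;
}

// Hypothetical helper: splits a whole run and returns each syllable's end offset.
// The ground-state row has no negative entries, so every pass consumes at
// least one character and the loop always terminates.
std::vector<std::size_t> splitSyllables(const std::vector<CharClass>& classes)
{
    std::vector<std::size_t> ends;
    std::size_t pos = 0;

    while (pos < classes.size()) {
        pos = findSyllableEnd(classes, pos);
        ends.push_back(pos);
    }
    return ends;
}
```

For the nine-character example in the description (class sequence CT VR CT VR CT VR CT VR CT), splitSyllables returns the end offsets 8 and 9: the first eight characters form one syllable ending in the fourth VIRAMA, and the final Ka starts a new one.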

There is still, however, one problem with this. Ligature formation, for some reason, does not use the syllable boundary information, and forms any ligatures it can as it traverses the input Unicode string. This problem is highlighted in the following example:

Input: Pa + Virama + Ka + Virama + Ssa(0937) + Virama + Ka + Virama + Ssa + Vowel Sign Aa

Expected syllable split: 'Pa + Virama + Ka + Virama + Ssa(0937) + Virama + Ka + Virama' - First syllable and 'Ssa + Vowel Sign Aa' - Second syllable

Expected ligature result: 'Half Pa-Ligature KSsa-Ka-Explicit Virama' - First syllable and 'Ssa Aa' - Second syllable

However, even after adding the new states to the state table and splitting the syllable, the ligature formation code discards the syllable information. Ligatures are still formed as and when the characters are encountered in the string, so the result ends up like this:

'Half Pa-Ligature KSsa-Ligature KSsa' - First syllable and 'Vowel Sign Aa' - Second syllable
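The difference between the expected and actual results can be illustrated with a small sketch. It is not the LayoutEngine's ligature code: glyphs are modelled as plain strings, only one rule (Ka + Virama + Ssa forms KSsa) is applied, half forms are not modelled, and the helper names formLigatures and formLigaturesBySyllable are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Glyphs = std::vector<std::string>;

// Hypothetical helper: applies the single rule "Ka Virama Ssa" -> "KSsa"
// to the characters in [begin, end) and never looks past `end`.
Glyphs formLigatures(const Glyphs& in, std::size_t begin, std::size_t end)
{
    Glyphs out;
    std::size_t i = begin;

    while (i < end) {
        if (i + 2 < end && in[i] == "Ka" && in[i + 1] == "Virama" && in[i + 2] == "Ssa") {
            out.push_back("KSsa"); // all three characters lie inside the range
            i += 3;
        } else {
            out.push_back(in[i]);
            ++i;
        }
    }
    return out;
}

// Hypothetical helper: runs the ligature pass once per syllable, so that no
// ligature can span a syllable boundary. `syllableEnds` holds the end offset
// of each syllable, in increasing order.
Glyphs formLigaturesBySyllable(const Glyphs& in, const std::vector<std::size_t>& syllableEnds)
{
    Glyphs out;
    std::size_t begin = 0;

    for (std::size_t end : syllableEnds) {
        Glyphs part = formLigatures(in, begin, end);
        out.insert(out.end(), part.begin(), part.end());
        begin = end;
    }
    return out;
}
```

With the ten-character input above and syllable ends at offsets 8 and 10, formLigaturesBySyllable keeps the second Ka + Virama separate from the following Ssa, whereas a single call to formLigatures over the whole string forms KSsa twice, which corresponds to the incorrect result reported here.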

The combination of these two issues is clearly a defect, or missing functionality.

I am not sure about the purpose of syllable detection in the ICU LayoutEngine. I can see that it is used in the reordering function, but I have no idea why that information is discarded during ligature formation.

Is it that clients of the ICU LayoutEngine are expected to detect syllables themselves, and only feed the LayoutEngine text syllable by syllable? That is, should the offset be the syllable boundary?

Regards, Jasdeep

Attachments

Change History

04/04/07 11:39:11 changed by grhoten

  • load changed.
  • xref changed.
  • owner changed from somebody to eric.
  • milestone changed from UNSCH to 3.8.
  • keywords deleted.
  • weeks changed.
  • revw changed.

04/04/07 11:39:28 changed by grhoten

  • weeks set to 0.2.

06/27/07 16:27:04 changed by grhoten

  • xref set to 4995.
  • milestone changed from 3.8 to UNSCH.

Duplicate of #4995?

07/02/07 02:39:16 changed by tim.band@...

Re: duplicate of 4995?

Yes, Jasdeep and I agree that these two defect reports do indeed refer to the same problem (and the solution we put forward is the same, too!)

10/04/07 18:37:57 changed by eric

  • xref changed from 4995 to 4995 5906.

The problem of applying ligatures across syllable boundaries is also the cause of the problem reported in ticket:5906.

07/10/08 10:40:18 changed by yoshito

  • priority changed from major to assess.

07/21/08 09:06:19 changed by hchapman

  • priority changed from assess to minor.
  • milestone changed from UNSCH to 4.2.
