Ticket #5906 (closed defect: fixed)

Bug contains 5 commit(s) | SVN Diffs for #5906

 

Opened 1 year ago

Last modified 7 months ago

Some words in Telugu are not processed correctly by ICU

Reported by: sylwekbala@...
Assigned to: eric
Priority: major
Milestone: 4.0
Component: layout
Version: 3.8
Keywords:
Cc:
Load:
Xref: 5588
Java Version:
Operating System: Windows XP and Linux RedHat 8.0
Project (C/J): ICU4C
Weeks: 1
Review: doug

Description

I tested ICU4C 3.6 and 3.8.d02, and neither version returns correct glyph indexes for some Telugu text. To test it I used the "Sample/layout" program that is delivered with the ICU4C source code. I compared the result with OpenOffice. Interestingly, OpenOffice also uses ICU4C 3.6, but the text rendered there is correct, so I suppose it includes some patches for this.

I used fonts such as:

- Gautami
- TLOT-Hemalatha Normal
- TLOT-Hemalatha Italic
- TLOT-Hemalatha Bold
- TLOT-Hemalatha Bold Italic
- and many others

For all of these fonts ICU returns an incorrect last glyph index. The sequence of code points that produces the wrong result is:

U+0C2A;U+0C4D;U+0C30;U+0C15;U+0C3E;U+0C37;U+0C4D;

According to the Unicode standard, they are:

0C2A;TELUGU LETTER PA;Lo;0;L;;;;;N;;;;;
0C4D;TELUGU SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
0C30;TELUGU LETTER RA;Lo;0;L;;;;;N;;;;;
0C15;TELUGU LETTER KA;Lo;0;L;;;;;N;;;;;
0C3E;TELUGU VOWEL SIGN AA;Mn;0;NSM;;;;;N;;;;;
0C37;TELUGU LETTER SSA;Lo;0;L;;;;;N;;;;;
0C4D;TELUGU SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
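The properties quoted for this sequence can be cross-checked with a short script. This is only a sketch using Python's standard `unicodedata` module. The names `SEQUENCE`, `describe`, and `rows` are made up for this illustration, and the values printed depend on the Unicode database bundled with the Python build in use.

```python
import unicodedata

# Code points from the report, in input order.
SEQUENCE = [0x0C2A, 0x0C4D, 0x0C30, 0x0C15, 0x0C3E, 0x0C37, 0x0C4D]


def describe(code_point):
    """Return (hex, name, general category, combining class, bidi class)."""
    ch = chr(code_point)
    return (
        f"{code_point:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.bidirectional(ch),
    )


rows = [describe(cp) for cp in SEQUENCE]

# Print the first five UnicodeData-style fields for each code point.
for row in rows:
    print(";".join(str(field) for field in row))
```

Only the first five fields of each record are reproduced here; the remaining fields are empty or `N` for these characters in the listing quoted in the report.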

I attached:

- FontMap.GDI, Sample.txt - input for the "Sample/layout" program
- LayoutSample.PNG - final text rendering using the "Sample/layout" program (incorrect)
- OpenOffice.PNG - final text rendering using OpenOffice (correct)

Attachments

LayoutSample.PNG (4.4 kB) - added by sylwekbala@poczta.onet.pl on 09/01/07 06:52:09.
incorrect result
OpenOffice.PNG (12.6 kB) - added by sylwekbala@poczta.onet.pl on 09/01/07 07:09:08.
correct result
OpenOffice.2.PNG (12.6 kB) - added by sylwekbala@poczta.onet.pl on 09/01/07 07:11:51.
Sample.txt (26 bytes) - added by sylwekbala@poczta.onet.pl on 09/01/07 07:15:21.
FontMap.GDI (324 bytes) - added by sylwekbala@poczta.onet.pl on 09/01/07 07:15:36.
ImproperTeluguCharacters.ods (13.2 kB) - added by sylwekbala@poczta.onet.pl on 09/17/07 02:36:06.
Additional Telugu sequences interpreted incorrectly by ICU (OpenOffice Calc document)

Change History

09/01/07 06:52:09 changed by sylwekbala@...

  • attachment LayoutSample.PNG added.

incorrect result

09/01/07 07:09:08 changed by sylwekbala@...

  • attachment OpenOffice.PNG added.

correct result

09/01/07 07:11:51 changed by sylwekbala@...

  • attachment OpenOffice.2.PNG added.

09/01/07 07:15:21 changed by sylwekbala@...

  • attachment Sample.txt added.

09/01/07 07:15:36 changed by sylwekbala@...

  • attachment FontMap.GDI added.

09/06/07 01:07:26 changed by grhoten

  • xref changed.
  • component changed from unknown to layout.
  • version changed from Current to 3.8.
  • milestone changed from UNSCH to 4.0.
  • owner changed from somebody to eric.
  • weeks set to 0.5.
  • revw changed.

I don't think OpenOffice.org uses the ICU layout engine on Windows.

09/10/07 01:20:42 changed by sylwekbala@...

Yes, you are right. I noticed recently that OpenOffice on Windows doesn't use the ICU layout engine. I tried the same version of OpenOffice on Linux, and it shows the same problem as ICU, so I suppose the ICU layout engine is used on Linux. I have reported this problem to OpenOffice as well.

09/17/07 02:36:06 changed by sylwekbala@...

  • attachment ImproperTeluguCharacters.ods added.

Additional Telugu sequences interpreted incorrectly by ICU (OpenOffice Calc document)

09/21/07 01:27:56 changed by sylwekbala@...

My latest tests show that the problem is related to OpenType fonts only. When I use TrueType fonts, everything seems to be okay. However, these OpenType fonts themselves seem to be correct, because Windows renders them correctly.

10/04/07 11:24:12 changed by eric

  • status changed from new to assigned.
  • weeks changed from 0.5 to 1.

KA + VIRAMA + SSA is an akhand ligature. This sequence gets reordered to KA + SSA + VIRAMA. An *input* sequence of KA + SSA + VIRAMA is two syllables, but all three glyphs get tagged with the 'AKHN' feature, so the akhand ligature will form, even though the glyphs aren't all in the same syllable. (This is a case where UniScribe's approach of processing one syllable at a time will do the right thing.)

The input sequence in the bug has an AA matra after the KA. The ligature still forms because the Ligature Substitution subtable in the fonts ignores all marks except for VIRAMA. A case could be made that it shouldn't, which would fix this particular case, but the same input sequence without the matra would still fail.

I'm not sure how to fix this. My best guess is to not apply features like 'AKHN' to syllables that are too short to match. (i.e. an akhand ligature will be at least three glyphs long)
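The failure mode described in this comment can be illustrated with a toy model. The sketch below is not ICU code. The syllable splitter and the matcher are deliberately simplified, reordering is left out entirely, and all names (`split_syllables`, `akhand_matches`, `TEXT`) are hypothetical. It only assumes what the comment states: the font's akhand lookup matches KA + SSA + VIRAMA and skips every mark except VIRAMA.

```python
PA, RA, KA, SSA = 0x0C2A, 0x0C30, 0x0C15, 0x0C37
VIRAMA, AA = 0x0C4D, 0x0C3E

CONSONANTS = {PA, RA, KA, SSA}
MARKS = {VIRAMA, AA}

# Input sequence from the ticket.
TEXT = [PA, VIRAMA, RA, KA, AA, SSA, VIRAMA]


def split_syllables(text):
    """Toy splitter: consonant, then (VIRAMA consonant)*, then trailing marks."""
    syllables, i = [], 0
    while i < len(text):
        start = i
        i += 1
        # Extend a conjunct: VIRAMA immediately followed by a consonant.
        while i + 1 < len(text) and text[i] == VIRAMA and text[i + 1] in CONSONANTS:
            i += 2
        # Absorb trailing marks (a matra or a final VIRAMA).
        while i < len(text) and text[i] in MARKS:
            i += 1
        syllables.append(text[start:i])
    return syllables


def akhand_matches(glyphs, start):
    """Match KA SSA VIRAMA from `start`, skipping marks other than VIRAMA.

    Syllable boundaries are ignored, which is the behaviour under discussion.
    """
    i = start
    for wanted in (KA, SSA, VIRAMA):
        while i < len(glyphs) and glyphs[i] in MARKS and glyphs[i] != VIRAMA:
            i += 1
        if i >= len(glyphs) or glyphs[i] != wanted:
            return False
        i += 1
    return True


syllables = split_syllables(TEXT)
crosses_boundary = akhand_matches(TEXT, 3)  # start at KA
```

In this model `syllables` has three entries (PA VIRAMA RA, then KA AA, then SSA VIRAMA), yet `akhand_matches(TEXT, 3)` still succeeds, because the matcher steps over the AA matra and never looks at where one syllable ends and the next begins.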

10/04/07 18:50:04 changed by eric

  • xref set to 5588.

Ticket:5588 describes the same problem. It also seems that this problem can occur across the boundary of two "long" syllables, so syllable length cannot be used to solve it. Perhaps we could encode a syllable number with each glyph and restrict all lookups to a single syllable. (Should contextual lookups also be restricted to a single syllable?) We could use just the low-order bit of the syllable number, perhaps stealing a bit from the feature flags.
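A minimal sketch of the syllable-tagging idea, under stated assumptions: it is not the change that was committed for this ticket, and `tag_syllables` and `matches_in_one_syllable` are hypothetical names. Each glyph carries the low-order bit of its syllable number, and a match is rejected as soon as it touches a glyph whose bit differs from that of the first matched glyph. One bit is enough to tell adjacent syllables apart because consecutive syllable numbers always differ in their lowest bit.

```python
KA, SSA, VIRAMA, AA = 0x0C15, 0x0C37, 0x0C4D, 0x0C3E

AKHAND = (KA, SSA, VIRAMA)   # pattern after reordering, per the comment above
SKIPPABLE = {AA}             # marks the lookup ignores (everything but VIRAMA)


def tag_syllables(syllables):
    """Flatten syllables into (code point, syllable bit) pairs."""
    return [(cp, number & 1) for number, syl in enumerate(syllables) for cp in syl]


def matches_in_one_syllable(tagged, start, pattern=AKHAND):
    """Match `pattern` from `start`, refusing to cross a syllable boundary."""
    bit = tagged[start][1]
    i = start
    for wanted in pattern:
        # Skip ignorable marks, but only while they stay in the same syllable.
        while i < len(tagged) and tagged[i][0] in SKIPPABLE:
            if tagged[i][1] != bit:
                return False
            i += 1
        if i >= len(tagged):
            return False
        code_point, glyph_bit = tagged[i]
        if code_point != wanted or glyph_bit != bit:
            return False
        i += 1
    return True


# Input KA + AA + SSA + VIRAMA: two syllables, so no ligature should form.
two_syllables = tag_syllables([[KA, AA], [SSA, VIRAMA]])
# Input KA + VIRAMA + SSA, shown after reordering: one syllable, ligature allowed.
one_syllable = tag_syllables([[KA, SSA, VIRAMA]])

blocked = matches_in_one_syllable(two_syllables, 0)
allowed = matches_in_one_syllable(one_syllable, 0)
```

With this check the cross-syllable case is rejected while the genuine akhand sequence still matches. Whether the same restriction should apply to contextual lookups is the open question raised in the comment, and the sketch does not address it.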

11/14/07 11:13:58 changed by eric

  • revw set to srl.

03/14/08 10:58:28 changed by eric

  • revw changed from srl to doug.

05/21/08 10:55:21 changed by doug

Please remove the commented-out code in OTLE.cpp at lines 336 and 348. I didn't see tests for this. Are there any?

05/27/08 11:04:23 changed by doug

  • status changed from assigned to closed.
  • resolution set to fixed.

06/09/08 06:28:24 changed by sylwekbala@...

I tested it on ICU 4.0.d02, and some other sequences still don't work properly. Sequences like: ప్రా గ్రా

