Ticket #5456 (assigned defect)

Opened 2 years ago

Last modified 8 months ago

Uppercase formatting option results in accented capital letters - Invalid for Greek

Reported by: yliang(at)actuate.com
Assigned to: mark (accepted)
Priority: assess
Milestone: UNSCH
Component: properties
Version: 3.4
Keywords: properties
Cc: markus
Load:
Xref:
Java Version: sunjdk1.4.x
Operating System: all
Project (C/J): ICU4C, ICU4J and ICU4JNI
Weeks:
Review:

Description

We have used com.ibm.icu.lang.UCharacter.toUpperCase to uppercase a Greek string. And the result is wrong. Capital letters in Greek cannot be accented.

Consider the following Greek words written in lower letters as an example for my explanation: άδικος, κείμενο, ίριδα

In Greek, the acute accent (') is placed on top of the vowel letter (stressed) of the syllable of the word, which is pronounced the loudest i.e. ά-δικος, κεί-μενο, ί-ριδα

1) If the initial vowel of a word is capitalised and stressed, then the acute accent (') should be placed on the upper left corner of the vowel, e.g. Άδικος, Ίριδα. For instance in ISO 8859-7 encoding: άδικος->Άδικος ά (hex value: DC) should be replaced with Ά (hex value: B6) and ίριδα->Ίριδα ί (hex value: DF) should be replaced with Ί (hex value: BA)

2) If the whole word is capitalised, then the acute accent SHOULD NOT be used, e.g. ΑΔΙΚΟΣ, ΙΡΙΔΑ, ΚΕΙΜΕΝΟ. For instance in ISO 8859-7 encoding: άδικος->ΑΔΙΚΟΣ ά (hex value: DC) should be replaced with Α (hex value: C1) δ (hex value: E4) should be replaced with Δ (hex value: C4) ι (hex value: E9) should be replaced with Ι (hex value: C9) κ (hex value: EA) should be replaced with Κ (hex value: CA) ο (hex value: EF) should be replaced with Ο (hex value: CF) ς (hex value: F2) should be replaced with Σ (hex value: D3)

κείμενο->ΚΕΙΜΕΝΟ κ (hex value: EA) should be replaced with Κ (hex value: CA) ε (hex value: E5) should be replaced with Ε (hex value: C5) ί (hex value: DF) should be replaced with Ι (hex value: C9) μ (hex value: EC) should be replaced with Μ (hex value: CC) ε (hex value: E5) should be replaced with Ε (hex value: C5) ν (hex value: ED) should be replaced with Ν (hex value: CD) ο (hex value: EF) should be replaced with Ο (hex value: CF)

ίριδα->ΙΡΙΔΑ ί (hex value: DF) should be replaced with Ι (hex value: C9) ρ (hex value: F1) should be replaced with Ρ (hex value: D1) ι (hex value: E9) should be replaced with Ι (hex value: C9) δ (hex value: E4) should be replaced with Δ (hex value: C4) α (hex value: E1) should be replaced with Α (hex value: C1)

There is only one exception to the second rule. Before getting into this, allow me to mention another rule which relates to our issue. In Greek, monosyllabic words aren't accented because there is only one syllable. There are exceptions to this rule. One of these exceptions is the word 'ή' (the equivalent of 'or' in English) which is one of the monosyllabic words that SHOULD be accented when written in lower letters 'ή' (This occurs in order to distinguish it from the article 'η' which by default is not accented.). In addition, it is the only one word that SHOULD be accented when written in capital letters 'Ή' (again to distinguish it from the article when written in capitals). For instance in ISO 8859-7 encoding: ή->Ή ή (hex value: DE) should be replaced with Ή (hex value: B9)

Attachments

Change History

12/31/69 17:44:47 changed by notes

See reply for actual bug report.

12/31/69 17:44:48 changed by auditor

  • Mon Oct 9 05:12:36 2006 grhoten sent reply 1
  • Mon Oct 9 05:12:52 2006 grhoten changed notes2: target: "UNSCH" to "",
  • Mon Oct 9 05:12:52 2006 grhoten changed notes
  • Mon Oct 9 05:22:38 2006 schererm changed notes2: assign: "" to "mark, markus",
  • Mon Oct 9 05:23:26 2006 grhoten changed notes2: priority: "" to "assess",
  • Mon Oct 9 05:23:26 2006 grhoten moved from incoming to properties
  • Thu Nov 9 18:03:19 2006 emmons changed notes2: target: "UNSCH" to "3.8 Candidate",

10/09/06 05:12:36 changed by George Rhoten <grhoten(at)gmail.com>

(resubmitting bug report with proper Unicode characters)

We used com.ibm.icu.lang.UCharacter.toUpperCase to uppercase a Greek string, and the result is wrong: words written entirely in capital letters are not accented in Greek.

Consider the following Greek words, written in lowercase, as examples: άδικος, κείμενο, ίριδα

In Greek, the acute accent (') is placed on the vowel of the stressed syllable, i.e. the syllable that is pronounced the loudest: ά-δικος, κεί-μενο, ί-ριδα

1) If only the initial vowel of a word is capitalised and it is stressed, then the acute accent (') should be placed at the upper left corner of the vowel, e.g. Άδικος, Ίριδα. For instance, in ISO 8859-7 encoding:

άδικος->Άδικος

  • ά (hex value: DC) should be replaced with Ά (hex value: B6)

ίριδα->Ίριδα

  • ί (hex value: DF) should be replaced with Ί (hex value: BA)

2) If the whole word is capitalised, then the acute accent SHOULD NOT be used, e.g. ΑΔΙΚΟΣ, ΙΡΙΔΑ, ΚΕΙΜΕΝΟ. For instance, in ISO 8859-7 encoding:

άδικος->ΑΔΙΚΟΣ

  • ά (hex value: DC) should be replaced with Α (hex value: C1)
  • δ (hex value: E4) should be replaced with Δ (hex value: C4)
  • ι (hex value: E9) should be replaced with Ι (hex value: C9)
  • κ (hex value: EA) should be replaced with Κ (hex value: CA)
  • ο (hex value: EF) should be replaced with Ο (hex value: CF)
  • ς (hex value: F2) should be replaced with Σ (hex value: D3)

κείμενο->ΚΕΙΜΕΝΟ

  • κ (hex value: EA) should be replaced with Κ (hex value: CA)
  • ε (hex value: E5) should be replaced with Ε (hex value: C5)
  • ί (hex value: DF) should be replaced with Ι (hex value: C9)
  • μ (hex value: EC) should be replaced with Μ (hex value: CC)
  • ε (hex value: E5) should be replaced with Ε (hex value: C5)
  • ν (hex value: ED) should be replaced with Ν (hex value: CD)
  • ο (hex value: EF) should be replaced with Ο (hex value: CF)

ίριδα->ΙΡΙΔΑ

  • ί (hex value: DF) should be replaced with Ι (hex value: C9)
  • ρ (hex value: F1) should be replaced with Ρ (hex value: D1)
  • ι (hex value: E9) should be replaced with Ι (hex value: C9)
  • δ (hex value: E4) should be replaced with Δ (hex value: C4)
  • α (hex value: E1) should be replaced with Α (hex value: C1)

There is only one exception to the second rule, and it relates to another rule: monosyllabic Greek words are normally not accented, because they have only one syllable. That rule has exceptions of its own. One of them is the word 'ή' (the equivalent of 'or' in English), which SHOULD be accented when written in lowercase, in order to distinguish it from the article 'η', which is not accented. It is also the only word that SHOULD keep its accent when written in capital letters, 'Ή', again to distinguish it from the article. For instance, in ISO 8859-7 encoding:

ή->Ή

  • ή (hex value: DE) should be replaced with Ή (hex value: B9)
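
Rule 2 and its single exception can be sketched in plain Java. This is only an illustration of the behaviour the report asks for, not ICU code: `upperWithoutTonos` and `unaccent` are hypothetical names, and the JDK's `String.toUpperCase(Locale.ROOT)` is used as a stand-in for the default Unicode mapping, which keeps the accent. The sketch assumes its input is a single word made of precomposed letters.

```java
import java.util.Locale;

final class GreekUpperSketch {

    // Hypothetical helper (not an ICU4J or JDK API): uppercases one Greek
    // word and removes the acute accent, following rule 2 of the report.
    static String upperWithoutTonos(String word) {
        // Exception from the report: the one-letter word ή ("or") keeps
        // its accent so that it stays distinct from the article η.
        if (word.equals("\u03AE")) {   // ή
            return "\u0389";           // Ή
        }
        // Default mapping keeps the accent, e.g. ά -> Ά.
        String upper = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(upper.length());
        for (int i = 0; i < upper.length(); i++) {
            out.append(unaccent(upper.charAt(i)));
        }
        return out.toString();
    }

    // Maps each accented Greek capital to its unaccented form.
    static char unaccent(char c) {
        switch (c) {
            case '\u0386': return '\u0391'; // Ά -> Α
            case '\u0388': return '\u0395'; // Έ -> Ε
            case '\u0389': return '\u0397'; // Ή -> Η
            case '\u038A': return '\u0399'; // Ί -> Ι
            case '\u038C': return '\u039F'; // Ό -> Ο
            case '\u038E': return '\u03A5'; // Ύ -> Υ
            case '\u038F': return '\u03A9'; // Ώ -> Ω
            default:       return c;
        }
    }

    public static void main(String[] args) {
        String[] words = {"άδικος", "κείμενο", "ίριδα", "ή"};
        for (String w : words) {
            System.out.println(w
                + " -> default: " + w.toUpperCase(Locale.ROOT)
                + ", rule 2: " + upperWithoutTonos(w));
        }
        // άδικος  -> default: ΆΔΙΚΟΣ,  rule 2: ΑΔΙΚΟΣ
        // κείμενο -> default: ΚΕΊΜΕΝΟ, rule 2: ΚΕΙΜΕΝΟ
        // ίριδα   -> default: ΊΡΙΔΑ,   rule 2: ΙΡΙΔΑ
        // ή       -> default: Ή,       rule 2: Ή
    }
}
```

A real fix would also have to decide how to find word boundaries and how to treat letters that carry other marks, which this sketch leaves out.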

08/01/07 17:49:00 changed by mark

  • load changed.
  • weeks changed.
  • xref changed.
  • revw changed.

There are a number of cases where casing is specialized. For example, in French, capitalization sometimes removes accents. We may want to support this through a transliterator, so I filed a bug at CLDR.
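
One way to picture such an opt-in transform is as a separate step chained after ordinary uppercasing. The sketch below uses only JDK classes (`java.text.Normalizer`, `java.util.function.Function`), not the ICU transliterator API; the names `UPPER`, `STRIP_ACUTE` and `UPPER_NO_ACUTE` are hypothetical. It removes only the acute accent and does not handle the 'ή' exception from the report.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.Function;

final class CasingPipelineSketch {

    // Step 1: default, locale-independent uppercasing (accents are kept).
    static final Function<String, String> UPPER =
        s -> s.toUpperCase(Locale.ROOT);

    // Step 2 (optional): decompose, drop U+0301 COMBINING ACUTE ACCENT,
    // then recompose whatever is left.
    static final Function<String, String> STRIP_ACUTE = s ->
        Normalizer.normalize(
            Normalizer.normalize(s, Normalizer.Form.NFD).replace("\u0301", ""),
            Normalizer.Form.NFC);

    // Callers that want the specialized behaviour chain both steps;
    // everyone else keeps using UPPER alone.
    static final Function<String, String> UPPER_NO_ACUTE =
        UPPER.andThen(STRIP_ACUTE);

    public static void main(String[] args) {
        System.out.println(UPPER.apply("κείμενο"));          // ΚΕΊΜΕΝΟ
        System.out.println(UPPER_NO_ACUTE.apply("κείμενο")); // ΚΕΙΜΕΝΟ
        System.out.println(UPPER_NO_ACUTE.apply("école"));   // ECOLE
    }
}
```

Keeping the accent removal as its own step leaves the default mapping untouched for callers that do not ask for the specialized behaviour.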

09/17/07 10:43:30 changed by mark

  • milestone changed from 3.8 candidate to 4.0.

09/17/07 10:45:48 changed by mark

  • status changed from new to assigned.

03/21/08 10:22:34 changed by mark

  • milestone changed from 4.0 to UNSCH.
