Ticket #1971 (new enhancement)


Opened 7 years ago

Last modified 1 year ago

Translit word breaks

Reported by: mark.davis(at)us.ibm.com
Assigned to: andy
Priority: minor
Milestone: UNSCH
Component: transliterate
Version:
Keywords: transliterate
Cc:
Load:
Xref: 3921 1960
Java Version:
Operating System: all
Project (C/J): ICU4C, ICU4J and ICU4JNI
Weeks: 2
Review:

Description

Add transliterator rule syntax that can recognize word boundaries. Example:

[:Thai:] {([:break:])} [:Thai:] > ' ';

This rule inserts a space between Thai words. It will simplify the code for transliterating Thai and is applicable more generally to other cases.

[:break:] consumes no characters: it matches iff there is a word break at that position.
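The intended effect of the example rule can be approximated outside the transliterator today. The following is a minimal sketch, not the proposed implementation: it uses the JDK's java.text.BreakIterator rather than ICU, and the class and method names (BreakInsertSketch, insertAtBreaks) are hypothetical. It inserts a separator at every word boundary whose two neighbouring code points both belong to a given script, which mirrors the [:Thai:] context on either side of [:break:].

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, not part of ICU or of the syntax proposed in this ticket.
final class BreakInsertSketch {

    /**
     * Inserts sep at each word boundary that has a code point of the given
     * script immediately before and immediately after it.
     */
    static String insertAtBreaks(String text, String sep, UnicodeScript script, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringBuilder out = new StringBuilder();
        int prev = 0;
        for (int pos = words.first(); pos != BreakIterator.DONE; pos = words.next()) {
            // Copy the text between the previous boundary and this one unchanged.
            out.append(text, prev, pos);
            // The boundary itself consumes nothing; only the separator is added.
            if (pos > 0 && pos < text.length()
                    && UnicodeScript.of(text.codePointBefore(pos)) == script
                    && UnicodeScript.of(text.codePointAt(pos)) == script) {
                out.append(sep);
            }
            prev = pos;
        }
        out.append(text, prev, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String thai = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";
        // Where the spaces land depends on the Thai break data of the runtime.
        System.out.println(insertAtBreaks(thai, " ", UnicodeScript.THAI, Locale.forLanguageTag("th")));
        // Latin words are already separated by a space, so nothing is inserted.
        System.out.println(insertAtBreaks("ab cd", " ", UnicodeScript.LATIN, Locale.ROOT));
    }
}
```

Whether a boundary is reported inside a run of Thai letters depends on the dictionary data available to the break iterator, so the sketch makes no claim about the exact output for Thai input.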

Additional syntax:

\b matches at a word boundary
\B matches where there is no word boundary
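As an illustration of the zero-width semantics, the two tests can be sketched with the JDK's java.text.BreakIterator. This is an assumption-laden sketch rather than the transliterator implementation; the class and method names are hypothetical. Both tests only inspect an offset and consume no characters.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper illustrating \b and \B as position tests.
final class BoundaryTestSketch {

    /** Analogue of \b: true if a word boundary falls exactly at offset. */
    static boolean atWordBoundary(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        return words.isBoundary(offset);
    }

    /** Analogue of \B: true if no word boundary falls at offset. */
    static boolean notAtWordBoundary(String text, int offset) {
        return !atWordBoundary(text, offset);
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(atWordBoundary(text, 5));    // between "hello" and the space
        System.out.println(notAtWordBoundary(text, 2)); // inside "hello"
    }
}
```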

Note: There are some possible ways to generalize this that we should consider:

1. Allow any break iterator: character, word, line, sentence (word is default).
2. Allow testing the status also (for the case of \b).
3. Allow adding a specific locale.

Example: \b{ja word=3}

Syntax:

"[:" ""? locale? ("character" | "word" | "line" | "sentence")? "break" ("=" status )? ":]"

("\b" | "\B") ("{" locale? ("character" | "word" | "line" | "sentence")? ("=" status )? "}")?
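One way the optional locale and break type could map onto existing break iterators is sketched below, assuming the JDK's java.text.BreakIterator factories. The class and method names (BreakSpecSketch, forSpec, isBreak) are hypothetical, and parsing of the rule syntax itself is left out. The status test is also omitted, because java.text.BreakIterator does not expose a rule status.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical mapping from the (locale, type) options to a break iterator.
final class BreakSpecSketch {

    /** A null or empty localeTag means the root locale; a null or empty kind means "word". */
    static BreakIterator forSpec(String localeTag, String kind) {
        Locale locale = (localeTag == null || localeTag.isEmpty())
                ? Locale.ROOT
                : Locale.forLanguageTag(localeTag);
        String k = (kind == null || kind.isEmpty()) ? "word" : kind;
        switch (k) {
            case "character": return BreakIterator.getCharacterInstance(locale);
            case "word":      return BreakIterator.getWordInstance(locale);
            case "line":      return BreakIterator.getLineInstance(locale);
            case "sentence":  return BreakIterator.getSentenceInstance(locale);
            default:
                throw new IllegalArgumentException("unknown break type: " + kind);
        }
    }

    /** Zero-width test: is there a break of the requested type at offset? */
    static boolean isBreak(String localeTag, String kind, String text, int offset) {
        BreakIterator it = forSpec(localeTag, kind);
        it.setText(text);
        return it.isBoundary(offset);
    }

    public static void main(String[] args) {
        // Offset 1 lies inside the word "ab": a character break, but not a word break.
        System.out.println(isBreak(null, "character", "ab cd", 1));
        System.out.println(isBreak(null, null, "ab cd", 1));
    }
}
```

Defaulting to the word iterator follows the "word is default" remark in the list of generalizations.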

Note: For 2.4 the plan is for Thai word breaking to be 'always on' (whenever Thai characters are encountered). For some locales (e.g. Japanese) these options could make a difference, and for other break iterators they will make a difference.

Attachments

Change History

12/31/69 18:27:05 changed by notes2

Helena: Can 1960 and this be consolidated?
Andy: time estimate is padded, assuming that I'd be working with unfamiliar code.

12/31/69 18:27:06 changed by notes

Some changes from j1917 were committed under this jitterbug

12/31/69 18:27:07 changed by auditor

  • 07/26/02 17:58:39 mark moved from incoming to transliterate
  • 08/13/02 21:42:44 grhoten changed notes
  • 08/13/02 21:42:52 grhoten changed notes
  • 10/29/02 16:00:42 hshih changed notes2
  • 07/12/04 18:24:09 andy changed notes2
  • 07/12/04 19:07:38 andy changed notes2

09/28/07 12:33:26 changed by andy

  • load changed.
  • java changed.
  • revw changed.
  • summary changed from RFE: Translit word breaks to Translit word breaks.
